Implanting Rational Knowledge into Distributed Representation at Morpheme Level

Author(s):  
Zi Lin ◽  
Yang Liu

Previous research has paid little attention to the creation of unambiguous morpheme embeddings independent of the corpus, yet such information plays an important role in expressing the exact meanings of words in parataxis languages like Chinese. In this paper, after constructing a Chinese lexical and semantic ontology based on word-formation, we propose a novel approach to implanting structured rational knowledge into distributed representations at the morpheme level, naturally avoiding heavy disambiguation in the corpus. We design a template to create instances as pseudo-sentences built merely from the pieces of morpheme knowledge in the lexicon. To exploit hierarchical information and tackle the data sparseness problem, an instance proliferation technique based on similarity is applied to expand the collection of pseudo-sentences. The distributed representations for morphemes can then be trained on these pseudo-sentences using word2vec. For evaluation, we validate the paradigmatic and syntagmatic relations of the morpheme embeddings and apply them to word similarity measurement, achieving significant improvements over classical models of more than 5 points in Spearman correlation, or 8 percentage points, which shows very promising prospects for adopting this new source of knowledge.
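
As a rough illustration of the pipeline, the sketch below builds pseudo-sentences from a toy morpheme lexicon, expands them by instance proliferation, and trains morpheme embeddings with gensim's word2vec. The lexicon format, the template, and the class-identity proliferation rule are illustrative assumptions, not the paper's exact design.

```python
from gensim.models import Word2Vec

# Hypothetical lexicon: morpheme -> (semantic class, related morphemes).
lexicon = {
    "木": ("plant", ["树", "林"]),
    "树": ("plant", ["木", "林"]),
    "水": ("liquid", ["河", "海"]),
    "河": ("liquid", ["水", "海"]),
}

def make_pseudo_sentences(lexicon):
    """Instantiate a fixed template over each piece of morpheme knowledge."""
    return [[morpheme, sem_class, rel]
            for morpheme, (sem_class, related) in lexicon.items()
            for rel in related]

def proliferate(sentences, lexicon):
    """Instance proliferation: substitute morphemes of the same semantic
    class to expand the pseudo-sentence collection (similarity is
    simplified here to class identity)."""
    by_class = {}
    for m, (c, _) in lexicon.items():
        by_class.setdefault(c, []).append(m)
    expanded = list(sentences)
    for head, sem_class, rel in sentences:
        expanded += [[sub, sem_class, rel]
                     for sub in by_class[sem_class] if sub != head]
    return expanded

corpus = proliferate(make_pseudo_sentences(lexicon), lexicon)
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=50)
print(model.wv.similarity("木", "树"))  # paradigmatic relatedness of morphemes
```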

Information ◽  
2019 ◽  
Vol 11 (1) ◽  
pp. 24
Author(s):  
Yang Yuan ◽  
Xiao Li ◽  
Ya-Ting Yang

To overcome data sparseness in word embeddings trained on low-resource languages, we propose a word embedding model based on punctuation and a parallel corpus. In particular, we generate the global word-pair co-occurrence matrix with a punctuation-based distance attenuation function, and integrate it with the intermediate word vectors generated from a small-scale bilingual parallel corpus to train the word embeddings. Experimental results show that, compared with several widely used baseline models such as GloVe and word2vec, our model significantly improves word embedding performance for low-resource languages. Trained on a restricted-scale English-Chinese corpus, our model improves by 0.71 percentage points on the word analogy task and achieves the best results on all of the word similarity tasks.
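
A minimal sketch of counting co-occurrences with punctuation-based distance attenuation follows. The specific weighting, a harmonic distance weight halved for each punctuation mark between the two words, is an assumption for illustration rather than the paper's exact function.

```python
from collections import defaultdict

PUNCT = {",", ".", ";", "!", "?", "，", "。"}

def cooccurrence_matrix(tokens, window=10):
    """Global word-pair co-occurrence counts with distance attenuation."""
    counts = defaultdict(float)
    for i, w in enumerate(tokens):
        if w in PUNCT:
            continue
        punct_between = 0
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] in PUNCT:
                punct_between += 1
                continue
            dist = j - i
            # Harmonic distance weight, further attenuated across punctuation.
            weight = (1.0 / dist) * (0.5 ** punct_between)
            counts[(w, tokens[j])] += weight
            counts[(tokens[j], w)] += weight
    return counts

tokens = "the model , trained on parallel text , learns embeddings".split()
matrix = cooccurrence_matrix(tokens)
print(matrix[("model", "trained")])  # weight attenuated by the comma: 0.25
```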


2015 ◽  
Vol 24 (02) ◽  
pp. 1540010 ◽  
Author(s):  
Patrick Arnold ◽  
Erhard Rahm

We introduce a novel approach to extract semantic relations (e.g., is-a and part-of relations) from Wikipedia articles. These relations are used to build up a large and up-to-date thesaurus providing background knowledge for tasks such as determining semantic ontology mappings. Our automatic approach uses a comprehensive set of semantic patterns, finite state machines and NLP techniques to extract millions of relations between concepts. An evaluation for different domains shows the high quality and effectiveness of the proposed approach. We also illustrate the value of the newly found relations for improving existing ontology mappings.
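
The sketch below conveys the general idea with a few Hearst-style patterns compiled to plain regexes; the actual system uses a comprehensive pattern set, finite state machines, and NLP preprocessing (e.g., noun-phrase chunking), which this toy version omits.

```python
import re

# A tiny, illustrative pattern set; real systems use many more patterns.
PATTERNS = [
    (re.compile(r"(\w[\w ]*?) such as (\w[\w ]*)"), "is-a"),
    (re.compile(r"(\w[\w ]*?) is a (\w[\w ]*)"), "is-a"),
    (re.compile(r"(\w[\w ]*?) is part of (\w[\w ]*)"), "part-of"),
]

def extract_relations(sentence):
    relations = []
    for pattern, rel in PATTERNS:
        for match in pattern.finditer(sentence.lower()):
            left, right = match.group(1).strip(), match.group(2).strip()
            if "such as" in match.group(0):
                relations.append((right, rel, left))  # hyponym "such as" hypernym
            else:
                relations.append((left, rel, right))
    return relations

print(extract_relations("Instruments such as violins are common."))
# [('violins are common', 'is-a', 'instruments')] -- a real pipeline would
# chunk noun phrases first; this sketch matches raw token spans.
```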


2020 ◽  
Vol 8 ◽  
pp. 311-329
Author(s):  
Kushal Arora ◽  
Aishik Chakraborty ◽  
Jackie C. K. Cheung

In this paper, we propose LexSub, a novel approach to unifying lexical and distributional semantics. We inject knowledge about lexical-semantic relations into distributional word embeddings by defining subspaces of the distributional vector space in which a lexical relation should hold. Our framework can handle symmetric attract and repel relations (e.g., synonymy and antonymy, respectively), as well as asymmetric relations (e.g., hypernymy and meronymy). In a suite of intrinsic benchmarks, we show that our model outperforms previous approaches on relatedness tasks and on hypernymy classification and detection, while being competitive on word similarity tasks. It also outperforms previous systems on extrinsic classification tasks that benefit from exploiting lexical relational cues. We perform a series of analyses to understand the behaviors of our model. Code is available at https://github.com/aishikchakraborty/LexSub.
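
The sketch below illustrates the subspace idea with hypothetical hinge losses in PyTorch: a learned linear projection defines a synonymy subspace, an attract loss pulls synonym pairs together in it, and a repel loss pushes antonym pairs apart. The dimensions, margins, and loss forms are assumptions, not LexSub's exact objective.

```python
import torch

dim, sub_dim = 300, 50
P_syn = torch.nn.Linear(dim, sub_dim, bias=False)  # learned synonymy subspace

def attract_loss(u, v, margin=0.2):
    """Pull a synonym pair together inside the subspace (hinge on distance)."""
    d = 1 - torch.cosine_similarity(P_syn(u), P_syn(v), dim=-1)
    return torch.clamp(d - margin, min=0).mean()

def repel_loss(u, v, threshold=0.0):
    """Push an antonym pair apart: penalize similarity above a threshold."""
    sim = torch.cosine_similarity(P_syn(u), P_syn(v), dim=-1)
    return torch.clamp(sim - threshold, min=0).mean()

# Toy batches of distributional vectors for synonym and antonym pairs.
syn_u, syn_v = torch.randn(8, dim), torch.randn(8, dim)
ant_u, ant_v = torch.randn(8, dim), torch.randn(8, dim)
loss = attract_loss(syn_u, syn_v) + repel_loss(ant_u, ant_v)
loss.backward()  # trains the projection; the full model also updates embeddings
```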


2005 ◽  
Vol 06 (03) ◽  
pp. 209-228 ◽  
Author(s):  
QUSAY H. MAHMOUD ◽  
WASSAM ZAHREDDINE

The modularity of web services leaves an open problem in composition: the amalgamation of two or more web services to fulfill a request that no single web service can satisfy on its own. This paper presents a framework for adaptive and dynamic composition of web services, enabling web services to be discovered either statically or dynamically by utilizing a semantic ontology to describe web services and their methods. This novel approach gives greater control over how web services are dynamically discovered by allowing the application developer to specify how matches are made, going beyond present techniques that semantically match inputs and outputs along with classification taxonomies. We utilize Composite Capabilities/Preferences Profiles (CC/PP) to adapt the interface and content to virtually any device. A proof-of-concept implementation has been constructed that enables users of any device to dynamically discover context-based services, which are then dynamically composed to satisfy the user's request. In addition, we have designed and implemented a UDDI-like registry to support context-based adaptive composition of web services. Existing web services can be easily adapted, and new web services can be effortlessly deployed.
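
The sketch below illustrates developer-controlled matching over a toy registry: a default input/output policy can be swapped for a custom predicate that also consults ontology concepts. The service descriptions and the policy interface are illustrative assumptions, not the framework's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class ServiceMethod:
    name: str
    inputs: set
    outputs: set
    concepts: set = field(default_factory=set)  # ontology annotations

# Toy UDDI-like registry of semantically described service methods.
registry = [
    ServiceMethod("geocode", {"address"}, {"lat", "lon"}, {"location"}),
    ServiceMethod("weather", {"lat", "lon"}, {"forecast"}, {"weather", "location"}),
]

def default_match(method, available_inputs, wanted_outputs):
    """Default policy: classic input/output matching."""
    return method.inputs <= available_inputs and wanted_outputs <= method.outputs

def discover(available_inputs, wanted_outputs, match=default_match):
    """The developer may pass a custom `match` predicate, e.g. one that
    also checks ontology concepts, beyond plain input/output matching."""
    return [m for m in registry if match(m, available_inputs, wanted_outputs)]

# A developer-supplied policy that additionally requires a concept.
concept_match = lambda m, i, o: default_match(m, i, o) and "location" in m.concepts
print([m.name for m in discover({"address"}, {"lat", "lon"}, concept_match)])
# ['geocode']
```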


2011 ◽  
Vol 18 (4) ◽  
pp. 521-548 ◽  
Author(s):  
SANDRA KÜBLER ◽  
EMAD MOHAMED

This paper presents an investigation of part-of-speech (POS) tagging for Arabic as it occurs naturally, i.e. unvocalized text (without diacritics). We also do not assume any prior tokenization, although tokenization was previously used as a basis for POS tagging. Arabic is a morphologically complex language, i.e. it has a high number of inflections per word, and its tagset is larger than the typical tagset for English. Both factors, the second partly dependent on the first, increase the number of word/tag combinations for which the POS tagger needs to find estimates, and thus contribute to data sparseness. We present a novel approach to Arabic POS tagging that does not require any pre-processing such as segmentation or tokenization: whole word tagging. In this approach, the complete word is assigned a complex POS tag that includes morphological information. A competing approach investigates the effect of segmentation and vocalization on POS tagging, to alleviate data sparseness and ambiguity. In this segmentation-based approach, we first automatically segment words and then POS tag the segments. The complex tagset encompasses 993 POS tags, whereas the segment-based tagset encompasses only 139 tags. However, segments are also more ambiguous, so there are more possible combinations of segment tags. In realistic situations, in which we have no information about segmentation or vocalization, whole word tagging reaches the highest accuracy of 94.74%. If gold-standard segmentation or vocalization is available, including this information improves POS tagging accuracy. However, while our automatic segmentation and vocalization modules reach state-of-the-art performance, their performance is not reliable enough for POS tagging and actually impairs it. Finally, we investigate whether a reduction of the complex tagset to the Extra-Reduced Tagset suggested by Habash and Rambow (Habash, N., and Rambow, O. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), Ann Arbor, MI, USA, pp. 573–80) alleviates the data sparseness problem. While POS tagging accuracy increases due to the smaller tagset, a closer look shows that tagging with the complex tagset and then converting the resulting annotation to the smaller tagset yields a higher accuracy than tagging with the smaller tagset directly.
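
The sketch below illustrates the tag-then-reduce finding: a hypothetical whole-word tagger emits complex tags that are afterwards mapped onto a reduced tagset. The tag names and the prefix-based reduction rule are assumptions for illustration, not the actual 993-tag or Extra-Reduced tagsets.

```python
def reduce_tag(complex_tag):
    """Map a complex whole-word tag (core POS plus morphological features,
    e.g. 'NOUN+MASC+SG+NOM') onto its core POS, as in a reduced tagset."""
    return complex_tag.split("+")[0]

# Output of a (hypothetical) whole-word tagger over unvocalized Arabic:
tagged = [("كتاب", "NOUN+MASC+SG+NOM"), ("جديد", "ADJ+MASC+SG+NOM")]

# Tag with the complex tagset first, then convert to the reduced tagset;
# per the paper, this beats tagging with the reduced tagset directly.
reduced = [(word, reduce_tag(tag)) for word, tag in tagged]
print(reduced)  # [('كتاب', 'NOUN'), ('جديد', 'ADJ')]
```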


1999 ◽  
Vol 5 (2) ◽  
pp. 157-170
Author(s):  
JEONG-MI CHO ◽  
JUNGYUN SEO ◽  
GIL CHANG KIM

This paper presents a system for automatic verb sense disambiguation in Korean using a small corpus and a Machine-Readable Dictionary (MRD). The system learns a set of typical uses, listed in the MRD usage examples for each sense of a polysemous verb in the MRD definitions, from verb-object co-occurrences acquired from the corpus. This paper addresses the problem of data sparseness in two ways. First, by extending word similarity measures from direct co-occurrences to co-occurrences of co-occurring words, we can compute similarities between words that never co-occur directly, using clusters of co-occurring words. Second, we acquire IS-A relations of nouns from the MRD definitions, which makes it possible to roughly cluster the nouns by their IS-A relationships. With these methods, two words may be considered similar even if they share no word elements. Experiments show that the method can learn from a very small training corpus, achieving over 86% correct disambiguation without any restriction on a word's senses.
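
The sketch below illustrates the second-order idea on toy counts: two verbs whose object nouns never overlap directly can still be judged similar once the nouns are lifted onto their IS-A classes. The counts and the class map are illustrative assumptions.

```python
from collections import Counter
import math

# First-order data: verb -> object-noun counts from a (tiny) corpus.
verb_objects = {
    "drink": Counter({"juice": 3, "cola": 2}),
    "sip":   Counter({"tea": 2, "coffee": 4}),
}

# IS-A classes acquired from MRD definitions (rough noun clustering).
isa_class = {"juice": "beverage", "cola": "beverage",
             "tea": "beverage", "coffee": "beverage"}

def class_profile(counts):
    """Lift noun counts onto their IS-A classes."""
    profile = Counter()
    for noun, c in counts.items():
        profile[isa_class.get(noun, noun)] += c
    return profile

def cosine(p, q):
    num = sum(p[k] * q[k] for k in set(p) & set(q))
    den = math.sqrt(sum(v * v for v in p.values())) * \
          math.sqrt(sum(v * v for v in q.values()))
    return num / den if den else 0.0

# Direct object overlap is empty, but the class-level profiles match.
print(cosine(verb_objects["drink"], verb_objects["sip"]))   # 0.0
print(cosine(class_profile(verb_objects["drink"]),
             class_profile(verb_objects["sip"])))            # 1.0
```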


2000 ◽  
Vol 36 (5) ◽  
pp. 717-736 ◽  
Author(s):  
El-Sayed Atlam ◽  
Masao Fuketa ◽  
Kazuhiro Morita ◽  
Jun-ichi Aoe
