Creating Welsh Language Word Embeddings

2021 ◽  
Vol 11 (15) ◽  
pp. 6896
Author(s):  
Padraig Corcoran ◽  
Geraint Palmer ◽  
Laura Arman ◽  
Dawn Knight ◽  
Irena Spasić

Word embeddings are representations of words in a vector space that models semantic relationships between words by means of distance and direction. In this study, we adapted two existing methods, word2vec and fastText, to automatically learn Welsh word embeddings, taking into account the syntactic and morphological idiosyncrasies of the language. These methods exploit the principles of distributional semantics and, therefore, require a large corpus to be trained on. However, Welsh is a minoritised language, so significantly less Welsh-language data is publicly available than for English. Consequently, assembling a sufficiently large text corpus is not a straightforward endeavour. Nonetheless, we compiled a corpus of 92,963,671 words from 11 sources, which represents the largest corpus of the Welsh language. The complexity of Welsh punctuation made tokenisation of this corpus challenging, as punctuation could not be used reliably for boundary detection. We considered several tokenisation methods, including one designed specifically for Welsh. To account for rich inflection, we used a method for learning word embeddings that is based on subwords and can therefore relate different surface forms more effectively during training. We conducted both qualitative and quantitative evaluations of the resulting word embeddings, which outperformed the Welsh word embeddings previously produced as part of a larger study covering 157 languages. Our study was the first to focus specifically on Welsh word embeddings.
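The subword-based approach mentioned in the abstract is what lets different surface forms of a richly inflected language share statistical strength. Below is a minimal sketch of that idea using gensim's FastText implementation; the corpus path, tokenisation, hyperparameters, and the query word are illustrative assumptions, not the configuration used in the study.

```python
# Minimal sketch: subword-aware embeddings with gensim's FastText.
# "welsh_corpus.txt" and all hyperparameters are assumed for illustration.
from gensim.models import FastText

# One pre-tokenised sentence per line, tokens separated by whitespace.
with open("welsh_corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

model = FastText(
    sentences,
    vector_size=300,   # dimensionality of the embedding space
    window=5,          # context window size
    min_count=5,       # ignore very rare tokens
    min_n=3, max_n=6,  # character n-gram range used for subwords
    sg=1,              # skip-gram objective
    epochs=5,
)

# Because vectors are composed from character n-grams, nearest-neighbour
# queries also work for inflected or otherwise unseen surface forms.
print(model.wv.most_similar("llyfr", topn=5))  # "llyfr" ("book") is an illustrative query
```

Since each word vector is assembled from its character n-grams, the model can relate, and even construct, embeddings for surface forms that are rare or absent in the training data, which is the property the study exploits for Welsh inflection.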

2020 ◽  
Vol 8 ◽  
pp. 311-329
Author(s):  
Kushal Arora ◽  
Aishik Chakraborty ◽  
Jackie C. K. Cheung

In this paper, we propose LexSub, a novel approach to unifying lexical and distributional semantics. We inject knowledge about lexical-semantic relations into distributional word embeddings by defining subspaces of the distributional vector space in which a lexical relation should hold. Our framework can handle symmetric attract and repel relations (e.g., synonymy and antonymy, respectively), as well as asymmetric relations (e.g., hypernymy and meronymy). In a suite of intrinsic benchmarks, we show that our model outperforms previous approaches on relatedness tasks and on hypernymy classification and detection, while being competitive on word similarity tasks. It also outperforms previous systems on extrinsic classification tasks that benefit from exploiting lexical relational cues. We perform a series of analyses to understand the behaviors of our model. Code is available at https://github.com/aishikchakraborty/LexSub .
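As a rough illustration of the subspace idea (not the paper's actual training objective, which can be found in the linked repository), the sketch below defines a learned linear projection into a relation-specific subspace and attract/repel margin losses over cosine similarity; the dimensions, margins, and toy word pairs are assumptions.

```python
# Illustrative sketch only: enforce a lexical relation inside a learned linear
# subspace of the distributional space via attract/repel margin losses.
import torch
import torch.nn.functional as F

d, k = 300, 50                          # embedding dim and subspace dim (assumed)
P = torch.nn.Linear(d, k, bias=False)   # learned projection defining the subspace

def attract_loss(u, v, margin=0.8):
    """Pull related words (e.g. synonyms) together inside the subspace."""
    return F.relu(margin - F.cosine_similarity(P(u), P(v))).mean()

def repel_loss(u, v, margin=0.2):
    """Push words in a repel relation (e.g. antonyms) apart inside the subspace."""
    return F.relu(F.cosine_similarity(P(u), P(v)) - margin).mean()

# Toy usage with random tensors standing in for pretrained word vectors.
syn_a, syn_b = torch.randn(8, d), torch.randn(8, d)
ant_a, ant_b = torch.randn(8, d), torch.randn(8, d)
loss = attract_loss(syn_a, syn_b) + repel_loss(ant_a, ant_b)
loss.backward()  # gradients flow into the projection P
```

The design point illustrated here is that the original distributional vectors stay untouched; only the projection defining each relation's subspace is learned, so different (even conflicting) relations can coexist in separate subspaces.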


Author(s):  
Andrey Indukaev

This chapter applies computational methods of textual analysis to a large corpus of media texts to study ideational change. Its empirical focus is on the ideas about the political role of innovation, technology, and economic development that were introduced into Russian politics during Medvedev's presidency. The chapter uses topic modeling, shows the limitations of that method, and then provides a more nuanced analysis with the help of word embeddings. The latter method is used to analyze semantic change and to capture complex semantic relationships between the studied concepts.
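One common way to operationalise the kind of semantic-change analysis mentioned above is to train separate embedding models on time slices of the corpus, align them with an orthogonal Procrustes rotation, and measure how far a term has drifted. The sketch below follows that generic recipe; the chapter's actual procedure may differ, and the slice names, hyperparameters, and query term are placeholders.

```python
# Illustrative sketch: diachronic word embeddings aligned with orthogonal Procrustes.
import numpy as np
from gensim.models import Word2Vec

def train_slice(sentences):
    """Train one embedding model on a single time slice (lists of token lists)."""
    return Word2Vec(sentences, vector_size=200, window=5, min_count=10, sg=1)

def procrustes_rotation(base, other, shared_vocab):
    """Orthogonal rotation R such that other-vectors @ R best match base-vectors."""
    A = np.vstack([base.wv[w] for w in shared_vocab])
    B = np.vstack([other.wv[w] for w in shared_vocab])
    U, _, Vt = np.linalg.svd(B.T @ A)
    return U @ Vt

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# early, late = train_slice(sentences_2007), train_slice(sentences_2011)
# shared = [w for w in early.wv.key_to_index if w in late.wv.key_to_index]
# R = procrustes_rotation(early, late, shared)
# drift = 1 - cosine(early.wv["инновации"], late.wv["инновации"] @ R)  # higher = more change
```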


Author(s):  
Marko Robnik-Šikonja ◽  
Kristjan Reba ◽  
Igor Mozetič

Word embeddings represent words in a numeric space so that semantic relations between words are expressed as distances and directions in the vector space. Cross-lingual word embeddings transform the vector spaces of different languages so that similar words are aligned. This is done either by mapping one language's vector space onto that of another language or by constructing a joint vector space for multiple languages. Cross-lingual embeddings can be used to transfer machine learning models between languages, thereby compensating for insufficient data in less-resourced languages. We use cross-lingual word embeddings to transfer machine learning prediction models for Twitter sentiment between 13 languages. We focus on two transfer mechanisms that have recently shown superior transfer performance. The first mechanism uses models trained on the joint numerical space for many languages, as implemented in the LASER library. The second mechanism uses large pretrained multilingual BERT language models. Our experiments show that transferring models between similar languages is sensible, even with no target-language data. The performance of cross-lingual models obtained with multilingual BERT and the LASER library is comparable, and the differences are language-dependent. Transfer with CroSloEngual BERT, pretrained on only three languages, is superior on these and some closely related languages.
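The zero-shot transfer setting described above can be sketched as follows: a classifier fitted on sentence embeddings from a source language is applied unchanged to a target language. The `embed` function below is a deliberately crude stand-in for a real multilingual encoder such as LASER or multilingual BERT, and the toy texts and labels are invented for illustration.

```python
# Minimal sketch of zero-shot cross-lingual transfer with a shared embedding space.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def embed(texts, lang):
    """Stand-in for a real multilingual encoder (LASER, multilingual BERT, ...):
    here just a fixed random projection of character counts so the sketch runs."""
    rng = np.random.default_rng(0)           # fixed seed -> same projection on every call
    proj = rng.normal(size=(256, 64))
    counts = np.zeros((len(texts), 256))
    for i, text in enumerate(texts):
        for ch in text.lower():
            counts[i, ord(ch) % 256] += 1
    return counts @ proj                      # (n_texts, 64) "language-agnostic" vectors

# Toy labelled data; in practice these would be Twitter sentiment corpora.
en_texts, en_labels = ["great film", "awful service", "love it", "hate this"], [1, 0, 1, 0]
sl_texts, sl_labels = ["odličen film", "grozna storitev"], [1, 0]

# Train only on the source language (English here).
clf = LogisticRegression(max_iter=1000).fit(embed(en_texts, "en"), en_labels)

# Zero-shot evaluation on the target language: no target-language training data is used,
# so quality hinges entirely on how language-agnostic the encoder really is.
print(f1_score(sl_labels, clf.predict(embed(sl_texts, "sl")), average="macro"))
```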


2018 ◽  
Vol 24 (1) ◽  
pp. 553-562 ◽  
Author(s):  
Shusen Liu ◽  
Peer-Timo Bremer ◽  
Jayaraman J. Thiagarajan ◽  
Vivek Srikumar ◽  
Bei Wang ◽  
...  

2011 ◽  
Vol 69 (11) ◽  
pp. 2763-2770 ◽  
Author(s):  
Bettina Hohlweg-Majert ◽  
Christoph Pautke ◽  
H. Deppe ◽  
Marc C. Metzger ◽  
Katharina Wagner ◽  
...  
