corpus size
Recently Published Documents


TOTAL DOCUMENTS

57
(FIVE YEARS 32)

H-INDEX

5
(FIVE YEARS 3)

Author(s):  
Sunita Warjri ◽  
Partha Pakray ◽  
Saralin A. Lyngdoh ◽  
Arnab Kumar Maji

Part-of-speech (POS) tagging is one of the challenging research fields in natural language processing (NLP). It requires good knowledge of a particular language and large amounts of data or corpora for feature engineering, which can lead to good tagger performance. Our main contribution in this research work is the designed Khasi POS corpus. To date, no Khasi corpus of any kind has been formally developed. In the designed Khasi POS corpus, each word is tagged manually using the designed tagset. Deep learning methods have been used to experiment with our designed Khasi POS corpus. POS taggers based on BiLSTM, a combination of BiLSTM with CRF, and character-based embedding with BiLSTM are presented. The main challenges of understanding and handling natural language from a computational linguistics perspective are anticipated. In the designed corpus, we have tried to resolve the ambiguities of words with respect to their usage in context, as well as the orthography problems that arise in the POS corpus. The designed Khasi corpus contains around 96,100 tokens and 6,616 distinct words. Initially, when running the first few sets of data of around 41,000 tokens in our experiment, the taggers were found to yield considerably accurate results. When the Khasi corpus size was increased to 96,100 tokens, accuracy increased and the analyses became more pertinent. The resulting accuracy is 96.81% for the BiLSTM method, 96.98% for BiLSTM with CRF, and 95.86% for character-based embedding with LSTM. To situate this work within NLP research on Khasi, we also present some recent POS taggers and other NLP works on the Khasi language for comparison.


PLoS ONE ◽  
2022 ◽  
Vol 17 (1) ◽  
pp. e0260210
Author(s):  
Shan Wang ◽  
Ruhan Liu ◽  
Chu-Ren Huang

Leech’s corpus-based comparison of English modal verbs from 1961 to 1992 showed the steep decline of all modal verbs together, which he ascribed to continuing changes towards a more equal and less authority-driven society. This study inspired many diachronic and synchronic studies, mostly on English modal verbs and largely assuming the correlation between the use of modal verbs and power relations. Yet, there are continuing debates on sampling design and the choices of corpora. In addition, this hypothesis has not been attested in any other language with comparable corpus size or examined with longitudinal studies. This study tracks the use of Chinese modal verbs from 1901 to 2009, covering the historical events of the New Culture Movement, the establishment of the PRC, the implementation of simplified characters and the completion and finalization of simplification of the Chinese writing system. We found that the usage of modal verbs did rise and fall during the last century, and for more complex reasons. We also demonstrated that our longitudinal end-to-end approach produces convincing analysis on English modal verbs that reconciles conflicting results in the literature adopting Leech’s point-to-point approach.


2021 ◽  
Author(s):  
Riyazahmed K

In this study, I examine the risk-adjusted return of mutual funds in India. A data set of 4220 mutual funds is used for the analysis. The Sharpe ratio, a metric of risk-adjusted return (Sharpe, 1994), and the Information ratio, a metric of a fund's outperformance relative to its benchmark (Goodwin, 1998), were analyzed. Regression analysis is used to estimate the impact of fund characteristics such as fund category, fund type, fund access type, and corpus size on the dependent variables, i.e., the Sharpe ratio and the Information ratio. All the funds underperformed on both the Sharpe ratio and the Information ratio. Liquid funds were found to perform worst. Fund type and corpus size do not impact fund performance. Fund access type was found to have a significant effect on fund performance. The results add to the literature by examining the post-pandemic period.
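As an illustration of the two metrics named in this abstract, the sketch below computes them from per-period returns. It assumes the common textbook forms (mean differential return divided by its sample standard deviation); the function names and the toy return series are hypothetical and are not taken from the study.

```python
from statistics import mean, stdev


def sharpe_ratio(fund_returns, risk_free_returns):
    """Mean excess return over the risk-free rate, divided by the
    sample standard deviation of that excess return."""
    excess = [r - rf for r, rf in zip(fund_returns, risk_free_returns)]
    return mean(excess) / stdev(excess)


def information_ratio(fund_returns, benchmark_returns):
    """Mean active return over the benchmark, divided by the sample
    standard deviation of the active return (the tracking error)."""
    active = [r - b for r, b in zip(fund_returns, benchmark_returns)]
    return mean(active) / stdev(active)


# Toy data (hypothetical): four periods of fund, risk-free and benchmark returns.
fund = [0.02, 0.01, 0.03, 0.00]
risk_free = [0.005, 0.005, 0.005, 0.005]
benchmark = [0.025, 0.02, 0.03, 0.015]

print(sharpe_ratio(fund, risk_free))
print(information_ratio(fund, benchmark))
```

A negative information ratio, as in the toy benchmark series here, corresponds to a fund that trails its benchmark on average.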


Author(s):  
Mieradilijiang Maimaiti ◽  
Yang Liu ◽  
Huanbo Luan ◽  
Zegao Pan ◽  
Maosong Sun

Data augmentation is an approach used in several text generation tasks. In machine translation, particularly in low-resource language scenarios, many data augmentation methods have been proposed. The most common approaches for generating pseudo data rely on word omission, random sampling, or replacing some words in the text. However, previous methods can barely guarantee the quality of the augmented data. In this work, we build the data using paraphrase embedding and POS tagging. Namely, we generate a pseudo monolingual corpus by replacing words that carry the four main POS tags (noun, adjective, adverb, and verb), based on both the paraphrase table and their similarity. We select the word-level paraphrase table with the larger corpus size and obtain the word embedding of each word in the table, then calculate the cosine similarity between these words and the tagged words in the original sequence. In addition, we use a ranking algorithm to choose highly similar words to reduce semantic errors, and leverage POS-tag-based replacement to mitigate syntactic errors to some extent. Experimental results show that our augmentation method consistently outperforms all previous SOTA methods on seven low-resource language pairs from four corpora by 1.16 to 2.39 BLEU points.
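The replacement step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the tag names, the `threshold` parameter, the `augment` function, and the toy embeddings are all assumptions made for the example.

```python
import math

# POS tags eligible for replacement (tag names are an assumption).
REPLACEABLE = {"NOUN", "ADJ", "ADV", "VERB"}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def augment(tagged_sentence, paraphrases, embeddings, threshold=0.8):
    """Replace each eligible word with its most similar paraphrase candidate,
    provided the similarity reaches `threshold` (a hypothetical cut-off)."""
    output = []
    for word, tag in tagged_sentence:
        best = word
        if tag in REPLACEABLE and word in embeddings:
            candidates = [c for c in paraphrases.get(word, []) if c in embeddings]
            # Rank candidates by similarity to the original word, best first.
            ranked = sorted(
                candidates,
                key=lambda c: cosine(embeddings[word], embeddings[c]),
                reverse=True,
            )
            if ranked and cosine(embeddings[word], embeddings[ranked[0]]) >= threshold:
                best = ranked[0]
        output.append(best)
    return output


# Toy paraphrase table and 2-dimensional embeddings (hypothetical values).
embeddings = {"big": [1.0, 0.0], "large": [0.9, 0.1], "tall": [0.1, 1.0]}
paraphrases = {"big": ["tall", "large"]}
sentence = [("the", "DET"), ("big", "ADJ"), ("house", "NOUN")]
print(augment(sentence, paraphrases, embeddings))
```

Words without an embedding or without a sufficiently similar candidate are left unchanged, which keeps the pseudo sentence the same length as the original.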


Author(s):  
Rajesh Kumar Mundotiya ◽  
Manish Kumar Singh ◽  
Rahul Kapur ◽  
Swasti Mishra ◽  
Anil Kumar Singh

Corpus preparation for low-resource languages and for development of human language technology to analyze or computationally process them is a laborious task, primarily due to the unavailability of expert linguists who are native speakers of these languages and also due to the time and resources required. Bhojpuri, Magahi, and Maithili, languages of the Purvanchal region of India (in the north-eastern parts), are low-resource languages belonging to the Indo-Aryan (or Indic) family. They are closely related to Hindi, which is a relatively high-resource language, which is why we compare them with Hindi. We collected corpora for these three languages from various sources and cleaned them to the extent possible, without changing the data in them. The text belongs to different domains and genres. We calculated some basic statistical measures for these corpora at character, word, syllable, and morpheme levels. These corpora were also annotated with parts-of-speech (POS) and chunk tags. The basic statistical measures were both absolute and relative and were expected to indicate linguistic properties, such as morphological, lexical, phonological, and syntactic complexities (or richness). The results were compared with a standard Hindi corpus. For most of the measures, we tried to match the corpus size across the languages to avoid the effect of corpus size, but in some cases it turned out that using the full corpus was better, even if sizes were very different. Although the results are not very clear, we tried to draw some conclusions about the languages and the corpora. For POS tagging and chunking, the BIS tagset was used to manually annotate the data. The POS-tagged data sizes are 16,067, 14,669, and 12,310 sentences, respectively, for Bhojpuri, Magahi, and Maithili. The sizes for chunking are 9,695 and 1,954 sentences for Bhojpuri and Maithili, respectively. 
The inter-annotator agreement for these annotations, using Cohen’s Kappa, was 0.92, 0.64, and 0.74, respectively, for the three languages. These (annotated) corpora have been used for developing preliminary automated tools, which include POS tagger, Chunker, and Language Identifier. We have also developed the Bilingual dictionary (Purvanchal languages to Hindi) and a Synset (that can be integrated later in the Indo-WordNet) as additional resources. The main contribution of the work is the creation of basic resources for facilitating further language processing research for these languages, providing some quantitative measures about them and their similarities among themselves and with Hindi. For similarities, we use a somewhat novel measure of language similarity based on an n-gram-based language identification algorithm. An additional contribution is providing baselines for three basic NLP applications (POS tagging, chunking, and language identification) for these closely related languages.
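For readers unfamiliar with the agreement statistic reported above, Cohen's Kappa for two annotators can be computed as in the sketch below. The function name and the toy label sequences are hypothetical; the formula is the standard one, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label proportions, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy annotations (hypothetical): ten tokens tagged N or V by two annotators.
annotator_a = ["N", "N", "V", "V", "N", "V", "N", "N", "V", "N"]
annotator_b = ["N", "N", "V", "N", "N", "V", "N", "V", "V", "N"]
print(round(cohens_kappa(annotator_a, annotator_b), 3))
```

In this toy case the annotators agree on 8 of 10 tokens, but because chance agreement is already 0.52, the Kappa value is considerably lower than the raw 0.8 agreement.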


2021 ◽  
Author(s):  
Samuel Hindmarsh

Assistive technologies aim to provide assistance to those who are unable to perform various tasks in their day-to-day lives without tremendous difficulty. This includes, amongst other things, communicating with others. Augmentative and adaptive communication (AAC) is a branch of assistive technologies which aims to make communicating easier for people with disabilities which would otherwise prevent them from communicating efficiently (or, in some cases, at all). The input rate of these communication aids, however, is often constrained by the limited number of inputs found on the devices and the speed at which the user can toggle these inputs. A similar restriction is also often found on smaller devices such as mobile phones: these devices also often require the user to input text with a smaller input set, which often results in slower typing speeds. Several technologies exist with the purpose of improving the text input rates of these devices. These technologies include ambiguous keyboards, which allow users to input text using a single keypress for each character and try to predict the desired word; word prediction systems, which attempt to predict the word the user is attempting to input before he or she has completed it; and word auto-completion systems, which complete the entry of predicted words before all the corresponding inputs have been pressed. This thesis discusses the design and implementation of a system incorporating the three aforementioned assistive input methods, and presents several questions regarding the nature of these technologies. The designed system is found to outperform a standard computer keyboard in many situations, which is a vast improvement over many other AAC technologies.
A set of experiments was designed and performed to answer the proposed questions, and the results of the experiments show that the corpus used to train the system, along with other tuning parameters, has a great impact on the performance of the system. Finally, the thesis also discusses the impact that corpus size has on the memory usage and response time of the system.
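To make the idea of an ambiguous keyboard concrete, the sketch below indexes the words of a training corpus by their key sequence and returns the candidates for a sequence in order of corpus frequency. It assumes a phone-style keypad layout and uses hypothetical names (`KEYPAD`, `build_index`, `predict`); the thesis's actual layout and prediction model are not described in the abstract.

```python
from collections import Counter, defaultdict

# Phone-style keypad, used here only as an example of an ambiguous layout.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}


def build_index(corpus_words):
    """Map each key sequence to the corpus words it could stand for,
    most frequent word first."""
    counts = Counter(word.lower() for word in corpus_words)
    index = defaultdict(list)
    for word, _ in counts.most_common():
        # Skip words containing characters that are not on the keypad.
        if all(ch in LETTER_TO_KEY for ch in word):
            keys = "".join(LETTER_TO_KEY[ch] for ch in word)
            index[keys].append(word)
    return index


def predict(index, key_sequence):
    """Return candidate words for a key sequence, best guess first."""
    return index.get(key_sequence, [])


# Toy training corpus (hypothetical).
index = build_index("home good home gone good home".split())
print(predict(index, "4663"))
```

The index grows with the number of distinct words in the training corpus, which is one simple way to see why corpus size affects both prediction quality and memory usage.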


Author(s):  
Reuben Ng ◽  
Yi Wen Tan

The current media studies of COVID-19 devote asymmetrical attention to social media; in contrast, newspapers have received comparatively less attention. Newspapers are an integral source of current information that are syndicated and amplified by social media to a wide global audience. This is one of the first known studies to operationalize news media diversity and examine its association with cultural values during the pandemic. We tracked the global diversity of COVID-19 coverage in a news media database of 12 billion words, collated from 28 million articles over 7000 news websites, across 8 months. Media diversity was measured weekly by the number of unique descriptors of 10 target terms of the pandemic (e.g., COVID-19, coronavirus) and normalized by the corpus size for the respective countries per week. Government Stringency was taken from the Oxford COVID-19 Government Response Tracker and cultural scores were taken from Hofstede’s Cultural Values global database. Results showed that Media Diversity Rate increased 6.7 times over 8 months, from the baseline period (October–December 2019) to during the pandemic (January–May 2020). Mixed effects modelling revealed that higher COVID-19 prevalence rates and governmental stringency predicted this increase. Interestingly, collectivist cultures are linked to more diverse media coverage during COVID-19. It is possible that news outlets in collectivist societies are motivated to present a diverse array of topics given the impact of COVID-19 on every segment of society. Of broader significance, we provided a framework to design targeted public health communications that are culturally nuanced.
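The normalization described in this abstract (unique descriptors of the target terms divided by corpus size) might be sketched as below. The abstract does not say how descriptors were extracted, so this example assumes, purely for illustration, that a descriptor is the word immediately preceding a target term; the function name and the `per` scaling parameter are also hypothetical.

```python
def media_diversity_rate(tokens, target_terms, per=1_000_000):
    """Number of unique descriptors of the target terms, normalized by
    corpus size and scaled to `per` tokens.

    Assumption: a descriptor is the token directly before a target term.
    """
    descriptors = set()
    for i, token in enumerate(tokens):
        if token in target_terms and i > 0:
            descriptors.add(tokens[i - 1])
    return len(descriptors) / len(tokens) * per


# Toy one-week corpus for one country (hypothetical).
week = "the deadly coronavirus and the novel coronavirus spread as deadly coronavirus".split()
print(media_diversity_rate(week, {"coronavirus"}, per=1000))
```

Dividing by the number of tokens is what allows weeks and countries with very different volumes of coverage to be compared on the same scale.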


2021 ◽  
Vol 12 (4) ◽  
pp. 612-648
Author(s):  
Johannes Scherling

For a few decades now, most prominently promoted by the US, neoliberal economics has been on the rise, epitomized in recent austerity policies toward countries that have run into financial trouble. In particular, the drive to privatize core public services relating to basic human needs, such as water, social services or pensions, has been increasingly criticized because of a perceived incompatibility between the profit motive and social solidarity. This article presents a corpus-based analysis of the US discourse on privatization among its proponents and opponents, with an overall corpus size of about 230,000 tokens. It examines how the two groups conceptualize privatization differently and which strategies they apply to foreground or background particular aspects of it.


2021 ◽  
Author(s):  
Wilson Wongso ◽  
Henry Lucky ◽  
Derwin Suhartono

The Sundanese language has over 32 million speakers worldwide, but the language has reaped little to no benefit from the recent advances in natural language understanding. As with other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, we found that most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In the subsequent analyses, our models benefited strongly from the Sundanese pre-training corpus size and did not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.

