document level
Recently Published Documents


TOTAL DOCUMENTS

335
(FIVE YEARS 213)

H-INDEX

19
(FIVE YEARS 6)

Author(s):  
Xiaomian Kang ◽  
Yang Zhao ◽  
Jiajun Zhang ◽  
Chengqing Zong

Document-level neural machine translation (DocNMT) has yielded attractive improvements. In this article, we systematically analyze the discourse phenomena in Chinese-to-English translation and focus on the most prominent one, namely lexical translation consistency. To alleviate lexical inconsistency, we propose an effective approach that is aware of the words that need to be translated consistently and constrains the model to produce more consistent translations. Specifically, we first introduce a global context extractor to extract the document context and the consistency context, respectively. Then, the two types of global context are integrated into an encoder enhancer and a decoder enhancer to improve lexical translation consistency. We create a test set to evaluate lexical consistency automatically. Experiments demonstrate that our approach can significantly alleviate lexical translation inconsistency. In addition, our approach can also substantially improve translation quality compared to a sentence-level Transformer.
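A lexical-consistency test of this kind can be approximated with a simple alignment-based metric. The sketch below assumes word-aligned (source, target) pairs have already been extracted for one document; it is a minimal illustration of the idea, not the authors' actual test set or metric.

```python
from collections import Counter, defaultdict

def lexical_consistency(aligned_pairs):
    """Score how consistently repeated source words are translated.

    aligned_pairs: list of (source_word, target_word) tuples collected
    from word alignments across one document (hypothetical input format).
    Returns the average, over source words occurring at least twice,
    of the share taken by each word's most frequent translation.
    """
    translations = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        translations[src][tgt] += 1
    scores = []
    for src, counts in translations.items():
        total = sum(counts.values())
        if total >= 2:  # only repeated source words can be inconsistent
            scores.append(counts.most_common(1)[0][1] / total)
    return sum(scores) / len(scores) if scores else 1.0

pairs = [("银行", "bank"), ("银行", "bank"), ("银行", "riverbank"),
         ("苹果", "apple"), ("苹果", "apple")]
print(round(lexical_consistency(pairs), 3))  # 0.833
```

A score of 1.0 means every repeated source word received a single translation throughout the document; lower values indicate inconsistency.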


2021 ◽  
pp. 1-42
Author(s):  
Tirthankar Ghosal ◽  
Tanik Saikh ◽  
Tameesh Biswas ◽  
Asif Ekbal ◽  
Pushpak Bhattacharyya

Abstract The quest for new information is an inborn human trait and has always been quintessential for human survival and progress. Novelty drives curiosity, which in turn drives innovation. In Natural Language Processing (NLP), Novelty Detection refers to finding text that offers new information with respect to what has already been seen or known. With the exponential growth of information across the web, there is an accompanying menace of redundancy. A considerable portion of web content is duplicated, and we need efficient mechanisms to retain new information and filter out redundant information. However, detecting redundancy at the semantic level and identifying novel text is not straightforward, because two texts may have little lexical overlap yet convey the same information. On top of that, non-novel/redundant information in a document may have been assimilated from multiple source documents, not just one. The problem is compounded when the subject of the discourse is documents, and numerous prior documents need to be processed to ascertain the novelty/non-novelty of the current one in question. In this work, we build upon our earlier investigations of document-level novelty detection and present a comprehensive account of our efforts towards the problem. We explore the role of pre-trained Textual Entailment (TE) models in dealing with multiple source contexts and present the outcome of our current investigations. We argue that a multi-premise entailment task is a close approximation to identifying semantic-level non-novelty. Our recent approach either performs comparably to or achieves significant improvement over the latest reported results on several datasets and across several related tasks (paraphrasing, plagiarism, rewrite). We critically analyze our performance with respect to the existing state of the art and show the superiority and promise of our approach for future investigations. We also present our enhanced dataset, TAP-DLND 2.0, and several baselines to the community for further research on document-level novelty detection.
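One way to picture the multi-premise entailment framing: score the target text against each prior source document with a pre-trained TE model, then aggregate the scores into a novelty decision. The max aggregation and the 0.5 threshold below are illustrative assumptions, not the authors' configuration.

```python
def is_novel(entailment_probs, threshold=0.5):
    """Decide novelty of a target text given entailment probabilities
    against multiple source premises.

    The scores are assumed to come from a pre-trained textual-entailment
    model run once per (premise, target) pair. Taking the max is one
    plausible aggregation for the multi-premise case: a target is
    non-novel if ANY source premise entails it strongly.
    """
    return max(entailment_probs, default=0.0) <= threshold

print(is_novel([0.12, 0.88, 0.30]))  # one premise entails the target -> False
print(is_novel([0.10, 0.22]))        # no strong entailment -> True (novel)
```

With no premises at all (an empty score list), the target is trivially novel.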


2021 ◽  
Author(s):  
Joshua Eykens ◽  
Raf Guns ◽  
Raf Vanderstraeten

In this study we explore the disciplinary diversity present within subject specialties in the social sciences and humanities. Subject specialties are operationalized as textually coherent clusters of documents. We apply topic modelling to textual information at the individual document level (titles and abstracts) to cluster a multilingual set of roughly 45,000 documents into subject specialties. The dataset includes the metadata of journal articles, conference proceedings, book chapters, and monographs. We make use of two indicators, namely the organizational affiliation, based on the departmental addresses of the authors, and the cognitive orientation, based on the disciplinary classifications at the publication level. First, we study the disciplinary diversity of the clusters by calculating a Hill-type diversity index. We draw an overall picture of the distribution of subject specialties over diversity scores and contrast the two indicators with each other. The goal is to discover whether some subject specialties are inherently multi- or interdisciplinary in nature, and whether the different indicators tell a well-aligned, similar story. Second, for each cluster of documents we calculate the dominance, i.e., the relative size of the largest discipline. This proxy of disciplinary concentration gives an idea of the extent to which a specialty is disciplined. The results show that all subject specialties analyzed serve as interdisciplinary trading grounds, with outliers in both directions of the disciplinary-interdisciplinary continuum. For a large share of specialties, the dominant cognitive and organizational disciplinary classifications were found to be well aligned. We present a typology of subject specialties by contrasting the organizational and cognitive diversity scores.
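The two measures can be sketched as follows. A Hill number of order q converts a discipline distribution into an effective number of disciplines; the abstract does not state which order the study uses, so q = 2 (the inverse Simpson index) is chosen here purely for illustration.

```python
import math

def hill_diversity(counts, q=2.0):
    """Hill number of order q: the effective number of disciplines
    represented by a count distribution over disciplines."""
    total = sum(counts)
    p = [c / total for c in counts if c > 0]
    if q == 1.0:  # limit case: exponential of Shannon entropy
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1 / (1 - q))

def dominance(counts):
    """Relative size of the largest discipline in a cluster."""
    return max(counts) / sum(counts)

counts = [50, 30, 20]  # publications per discipline in one specialty
print(round(hill_diversity(counts), 3))  # 2.632 effective disciplines
print(dominance(counts))                 # 0.5
```

A perfectly even split over k disciplines yields a Hill diversity of exactly k at any order, while a single-discipline cluster yields 1 and a dominance of 1.0.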


2021 ◽  
Vol 11 (23) ◽  
pp. 11091
Author(s):  
Akhmedov Farkhod ◽  
Akmalbek Abdusalomov ◽  
Fazliddin Makhmudov ◽  
Young Im Cho

Customer reviews on the Internet reflect users’ sentiments about the product, service, and social events. As sentiments can be divided into positive, negative, and neutral forms, sentiment analysis processes identify the polarity of information in the source materials toward an entity. Most studies have focused on document-level sentiment classification. In this study, we apply an unsupervised machine learning approach to discover sentiment polarity not only at the document level but also at the word level. The proposed topic document sentence (TDS) model is based on joint sentiment topic (JST) and latent Dirichlet allocation (LDA) topic modeling techniques. The IMDB dataset, comprising user reviews, was used for data analysis. First, we applied the LDA model to discover topics from the reviews; then, the TDS model was implemented to identify the polarity of the sentiment from topic to document, and from document to word levels. The LDAvis tool was used for data visualization. The experimental results show that the analysis not only obtained good topic partitioning results, but also achieved high sentiment analysis accuracy in document- and word-level sentiment classifications.
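A toy illustration of how sentiment can flow from the topic level down to the document and word levels, in the spirit of JST/LDA-style models: the topic names, word-to-topic assignments, and probabilities below are invented for illustration, whereas the actual TDS model estimates such distributions jointly from the review corpus.

```python
# Hypothetical per-topic sentiment distributions and word-topic
# assignments (in a real JST/LDA pipeline these are learned, not fixed).
topic_sentiment = {"acting": {"pos": 0.8, "neg": 0.2},
                   "plot":   {"pos": 0.3, "neg": 0.7}}
word_topic = {"performance": "acting", "cast": "acting",
              "story": "plot", "ending": "plot"}

def word_polarity(word):
    """Word-level polarity inherited from the word's assigned topic."""
    s = topic_sentiment[word_topic[word]]
    return "positive" if s["pos"] >= s["neg"] else "negative"

def document_polarity(words):
    """Document-level polarity by majority vote over word polarities."""
    pos = sum(word_polarity(w) == "positive" for w in words)
    return "positive" if pos >= len(words) / 2 else "negative"

review = ["performance", "cast", "ending"]
print([word_polarity(w) for w in review])  # ['positive', 'positive', 'negative']
print(document_polarity(review))           # positive
```

The point of the sketch is the hierarchy itself: polarity is resolved topic -> word -> document, mirroring the topic-to-document-to-word direction described in the abstract.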


2021 ◽  
Vol 21 (S7) ◽  
Author(s):  
Tao Li ◽  
Ying Xiong ◽  
Xiaolong Wang ◽  
Qingcai Chen ◽  
Buzhou Tang

Abstract Objective Relation extraction (RE) is a fundamental task of natural language processing that consistently draws attention from researchers, especially RE at the document level. We aim to explore an effective novel method for document-level medical relation extraction. Methods We propose a novel edge-oriented graph neural network based on document structure and external knowledge for document-level medical RE, called SKEoG. This network is able to take full advantage of document structure and external knowledge. Results We evaluate SKEoG on two public datasets, the Chemical-Disease Relation (CDR) dataset and the Chemical Reactions (CHR) dataset, by comparing it with other state-of-the-art methods. SKEoG achieves the highest F1-score of 70.7 on the CDR dataset and an F1-score of 91.4 on the CHR dataset. Conclusion The proposed SKEoG method achieves new state-of-the-art performance. Both document structure and external knowledge bring performance improvements within the EoG framework. Selecting proper methods for knowledge-node representation is also very important.
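The kind of document graph such a model consumes can be sketched as follows. The node types and the sentence co-occurrence edge rule follow the general edge-oriented-graph recipe; they are not necessarily SKEoG's exact construction, which also adds external-knowledge nodes.

```python
def build_doc_graph(mentions):
    """Build a minimal document graph from entity mentions.

    mentions: list of (entity, sentence_index) pairs extracted from a
    document. Returns entity-sentence edges (linking each entity to the
    sentences it appears in) and entity-entity edges for entities that
    co-occur within the same sentence.
    """
    es_edges = sorted({(e, s) for e, s in mentions})
    ee_edges = sorted({(e1, e2) for e1, s1 in mentions
                       for e2, s2 in mentions if e1 < e2 and s1 == s2})
    return es_edges, ee_edges

# Two sentences: "Aspirin relieves headache." / "Aspirin may cause nausea."
mentions = [("aspirin", 0), ("headache", 0), ("aspirin", 1), ("nausea", 1)]
es, ee = build_doc_graph(mentions)
print(ee)  # [('aspirin', 'headache'), ('aspirin', 'nausea')]
```

Cross-sentence relations (e.g. headache-nausea) then have to be inferred over multi-edge paths through shared entities or sentences, which is what the edge-oriented message passing is for.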


Entropy ◽  
2021 ◽  
Vol 23 (11) ◽  
pp. 1449
Author(s):  
Tianbo Ji ◽  
Chenyang Lyu ◽  
Zhichao Cao ◽  
Peng Cheng

Neural auto-regressive sequence-to-sequence models have been dominant in text generation tasks, especially the question generation task. However, neural generation models suffer from global and local semantic drift problems. Hence, we propose a hierarchical encoding–decoding mechanism that aims at encoding rich structural information of the input passages and reducing variance in the decoding phase. In the encoder, we hierarchically encode the input passages according to their structure at four granularity levels: word, chunk, sentence, and document. In the decoder, at each time step we progressively select the context vector, moving from the document-level representations down to the word-level representations. We also propose a context switch mechanism that enables the decoder to reuse the context vector from the previous step when generating the current word; this improves the stability of the text generation process when generating a set of consecutive words. Additionally, we inject syntactic parsing knowledge to enrich the word representations. Experimental results show that our proposed model substantially improves performance and outperforms previous baselines according to both automatic and human evaluation. In addition, we conduct a deep and comprehensive analysis of the generated questions based on their types.
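The progressive context selection and the context switch can be sketched with plain dot-product attention. The residual fusion of query and context and the 0.5 gate threshold are illustrative choices here, not the paper's exact formulation.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    return [e / sum(exps) for e in exps]

def attend(query, vectors):
    """Dot-product attention: weighted sum of `vectors` under the query."""
    weights = softmax([sum(q * v for q, v in zip(query, vec))
                       for vec in vectors])
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(dim)]

def progressive_context(query, levels):
    """Select context coarse-to-fine (document -> ... -> word level),
    refining the query with the context chosen at each level."""
    for vectors in levels:
        ctx = attend(query, vectors)
        query = [q + c for q, c in zip(query, ctx)]  # simple residual fusion
    return query

def context_switch(prev_ctx, new_ctx, gate):
    """Reuse the previous step's context when the gate favours it,
    stabilising generation over consecutive words."""
    return new_ctx if gate >= 0.5 else prev_ctx

doc_vecs  = [[0.2, 0.8], [0.9, 0.1]]   # document-level representations
word_vecs = [[1.0, 0.0], [0.0, 1.0]]   # word-level representations
print(progressive_context([0.5, 0.5], [doc_vecs, word_vecs]))
```

In the real model the per-level representations come from the hierarchical encoder and the gate is learned; the sketch only shows the coarse-to-fine control flow.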


Author(s):  
Zhenhao Wu ◽  
Jianbo Gao ◽  
Qingshan Li ◽  
Zhi Guan ◽  
Zhong Chen
