Detecting Anomalies in Sequences of Short Text Using Iterative Language Models

Author(s):  
Cynthia Freeman ◽  
Ian Beaver ◽  
Abdullah Mueen

Business managers who use Intelligent Virtual Assistants (IVAs) to enhance their company's customer service need ways to accurately and efficiently detect anomalies in conversations between the IVA and customers, which is vital for customer retention and satisfaction. Unfortunately, anomaly detection is a challenging problem because what counts as anomalous is subjective. Detecting anomalies in sequences of short texts, common in chat settings, is even more difficult because independently generated texts are similar only at a semantic level, resulting in an abundance of false positives. In addition, the literature on detecting anomalies in time-ordered sequences of short text is sparse considering the abundance of such data sets in online settings. We introduce a technique for detecting anomalies in sequences of short textual data by adaptively and iteratively learning low-perplexity language models. Our algorithm defines a short textual item as anomalous when its cross-entropy exceeds the upper bound of the confidence interval of a trained additive regression model. We demonstrate successful case studies and bridge the gap between theory and practice by finding anomalies in sequences of real conversations with virtual chat agents. Empirical evaluation shows that our method achieves, on average, 31% higher max F1 scores than the baseline method of non-negative matrix factorization across three large human-annotated sequences of short texts.
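The thresholding idea in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the unigram language model with add-one smoothing, the threshold of mean plus z standard deviations (a simple stand-in for the upper confidence bound of the additive regression model), and the names `flag_anomalies`, `warmup`, and `z` are all assumptions made for illustration.

```python
import math
from collections import Counter


def cross_entropy(tokens, counts, total):
    """Per-token cross-entropy (bits) under an add-one-smoothed unigram model."""
    vocab = len(counts)
    log_prob = 0.0
    for tok in tokens:
        # The extra +1 in the denominator reserves mass for one unseen-token class.
        p = (counts[tok] + 1) / (total + vocab + 1)
        log_prob += math.log2(p)
    return -log_prob / len(tokens)


def flag_anomalies(texts, warmup=5, z=2.0):
    """Score each text against a model of all earlier texts, then update the model.

    A text is flagged when its cross-entropy exceeds mean + z * std of the
    previously observed cross-entropies (assumed stand-in threshold).
    """
    counts, total = Counter(), 0
    history, flags = [], []
    for text in texts:
        tokens = text.lower().split()
        flag = False
        if total > 0 and tokens:  # nothing to score against before the first update
            h = cross_entropy(tokens, counts, total)
            if len(history) >= warmup:
                mean = sum(history) / len(history)
                std = math.sqrt(sum((x - mean) ** 2 for x in history) / len(history))
                flag = h > mean + z * std
            history.append(h)
        flags.append(flag)
        counts.update(tokens)  # iterative update: the model adapts to the sequence
        total += len(tokens)
    return flags


chat = ["reset my password"] * 8 + ["zebra quantum lasagna volcano"]
flags = flag_anomalies(chat)
```

In this toy sequence only the final, off-topic message is flagged, because every token in it is unseen and its cross-entropy jumps well above the earlier scores.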

Author(s):  
О.Ю. Бушуева

Common and often comorbid cardio- and cerebrovascular diseases (CCVD), including arterial hypertension (AH), ischemic heart disease (IHD), and cerebral stroke (CS), are the leading cause of death worldwide. Oxidative stress has many pathological effects on vascular homeostasis and is currently regarded as one of the common mechanisms in the development of CCVD. The aim of the study was to investigate the association of single nucleotide polymorphisms (SNPs) of the redox-homeostasis genes rs2070424 SOD1, rs4880 SOD2, rs769214 CAT, rs713041 GPX4, rs41303970 GCLM, rs17883901 GCLC, rs854560 PON1, rs7493 PON2, rs1695 GSTP1, rs2266782 FMO3 with the development of isolated and comorbid forms of CCVD. The study sample comprised 2702 unrelated individuals of Slavic origin. The patient group included 1815 subjects with various CCVD and their combinations: isolated AH (IAH); isolated IHD (IIHD); a combination of AH and IHD (AH+IHD); CS against a background of AH (AH+CS); and comorbid cardio- and cerebrovascular pathology (AH+IHD+CS). From the total sample of healthy individuals (N=887), five control groups were formed, each matched by sex and age to one of the disease groups. SNPs were genotyped by real-time PCR with allele discrimination using TaqMan probes. Associations of genotypes with disease development were analyzed with a log-additive regression model. All calculations were performed relative to the minor allele and adjusted for sex and age. SNP rs1695 GSTP1 was associated exclusively with IAH (OR=1.19, 95%CI=1.01-1.39, P=0.034). SNP rs7493 PON2 was associated with the development of all studied comorbid CCVD: AH+IHD (adjOR=1.32, adj95%CI=1.07-1.63, adjP=0.01); AH+CS (adjOR=1.79, adj95%CI=1.45-2.21, adjP<0.0001); AH+IHD+CS (adjOR=1.51, adj95%CI=1.09-2.09, adjP=0.01), as well as with shortening of prothrombin time (adjDifference=-0.35; adjP=0.01). SNP rs2266782 FMO3 was associated with the AH+CS phenotype (adjOR=1.24, adj95%CI=1.02-1.51, adjP=0.03) and with a lower age of CS manifestation (adjDifference=-2.31; adjP=0.03). Thus, single nucleotide polymorphisms of redox-homeostasis genes may represent an important genetic component in the differentiation of cardio- and cerebrovascular phenotypes.
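For readers unfamiliar with the log-additive model, the sketch below shows how a genotype is usually coded (0, 1, or 2 copies of the minor allele) and how a fitted log-odds coefficient translates into an odds ratio with a Wald confidence interval. It is a generic illustration, not the study's analysis code: the function names and the standard error value are hypothetical, and only the odds ratio 1.19 is taken from the abstract.

```python
import math


def encode_additive(genotype, minor_allele):
    """Additive coding: number of minor-allele copies in a two-letter genotype."""
    return genotype.count(minor_allele)


def odds_ratio_with_ci(beta, se, z=1.96):
    """Per-allele odds ratio and Wald 95% CI from a log-odds coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# beta is the per-allele log-odds; 0.08 is a hypothetical standard error.
beta = math.log(1.19)
odds_ratio, lower, upper = odds_ratio_with_ci(beta, se=0.08)

# Under the log-additive assumption, two copies multiply the odds by OR squared.
odds_two_copies = math.exp(2 * beta)
```

The design point is that a single coefficient describes the effect of each additional minor allele, so heterozygotes and minor-allele homozygotes are not modeled separately.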


2018 ◽  
Vol 6 ◽  
pp. 451-465 ◽  
Author(s):  
Daniela Gerz ◽  
Ivan Vulić ◽  
Edoardo Ponti ◽  
Jason Naradowsky ◽  
Roi Reichart ◽  
...  

Neural architectures are prominent in the construction of language models (LMs). However, word-level prediction is typically agnostic of subword-level information (characters and character sequences) and operates over a closed vocabulary, consisting of a limited word set. Indeed, while subword-aware models boost performance across a variety of NLP tasks, previous work did not evaluate the ability of these models to assist next-word prediction in language modeling tasks. Such subword-level informed models should be particularly effective for morphologically-rich languages (MRLs) that exhibit high type-to-token ratios. In this work, we present a large-scale LM study on 50 typologically diverse languages covering a wide variety of morphological systems, and offer new LM benchmarks to the community, while considering subword-level information. The main technical contribution of our work is a novel method for injecting subword-level information into semantic word vectors, integrated into the neural language modeling training, to facilitate word-level prediction. We conduct experiments in the LM setting where the number of infrequent words is large, and demonstrate strong perplexity gains across our 50 languages, especially for morphologically-rich languages. Our code and data sets are publicly available.


2021 ◽  
Author(s):  
Céline Marquet ◽  
Michael Heinzinger ◽  
Tobias Olenyi ◽  
Christian Dallago ◽  
Kyra Erckert ◽  
...  

Abstract: The emergence of SARS-CoV-2 variants stressed the demand for tools that allow interpreting the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient, MCC, for ProtT5 embeddings of 0.596 ± 0.006 vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, either consistently or statistically significantly, regardless of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. Our method predicted SAV effects for the entire human proteome (~20k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.


2021 ◽  
pp. 1-13
Author(s):  
Xia Li ◽  
Qinghua Wen ◽  
Zengtao Jiao ◽  
Jiangtao Zhang

Abstract: The China Conference on Knowledge Graph and Semantic Computing (CCKS) 2020 Evaluation Task 3 covered clinical named entity recognition and event extraction for Chinese electronic medical records. Two annotated data sets and some additional resources for these two subtasks were provided to participants. The evaluation attracted 354 teams, 46 of which submitted valid results. Pre-trained language models were widely applied in this evaluation task. Data augmentation and external resources also proved helpful.


Kavkaz-forum ◽  
2020 ◽  
Author(s):  
Л.Б. ДЗАПАРОВА

The theory and practice of literary translation as a phenomenon of intercultural communication remain topical in modern philological science. The research field in this area is expanding for translation scholars and for everyone interested in the dialogue of cultures. The choice of topic was also prompted by the 95th anniversary, celebrated this year, of the well-known Ossetian poet, playwright, and literary scholar Nafi Grigorievich Jusoyty. The article examines the contribution of the people's writer of Ossetia to the theoretical understanding of the problems of literary translation and, for the first time, analyzes N. Jusoyty's interpretation of A.S. Pushkin's «Prophet», one of the most difficult poems to translate. In particular, the author analyzes the polemical papers that Jusoyty published in central literary journals on the most pressing problems of translation studies. In them, Jusoyty focuses on fidelity of the translation to the original, improving the quality of interlinear translations, and innovation and modernization of classical works; argues for the importance of translational reading in comprehending the original; defines the specifics of poetic translation; and opposes embellishment in translation and the showcasing of weak originals and imperfect translations at the all-Union level. Overall, Jusoyty, arguing with well-known theorists, offers his own concept of translation, from the choice of a work to the final result, a text in another language. This is a range of problems that still concern specialists in literary translation. The author also presents a comparative analysis of Pushkin's poem in the original and in N.G. Jusoyty's translation. Comparison of the texts at the semantic level shows Jusoyty's effort to find artistic means that help reveal the main image. However, the two-level meaning embodied in the lexical units of the source language is not conveyed everywhere.


2017 ◽  
Vol 27 (1) ◽  
pp. 164-186 ◽  
Author(s):  
Hyunju Shin ◽  
Alexander E. Ellinger ◽  
David L. Mothersbaugh ◽  
Kristy E. Reynolds

Purpose: Services marketing research continues to be largely focused on firms’ reactive interactions for recovering from service failure rather than on proactive customer interactions that may prevent service failure from occurring in the first place. Building on previous studies that assess the efficacy of implementing proactive interaction in service provision contexts, the purpose of this paper is to compare the influences of proactive interaction to prevent service failure and reactive interaction to correct service failure on customer emotion and patronage behavior. Since proactive interaction for service failure prevention is a relatively underexplored and resource-intensive approach, the authors also assess the moderating influences of customer and firm-related characteristics. Design/methodology/approach: The study hypotheses are tested with survey data from two scenario-based experiments conducted in a retail setting. Findings: The findings reveal that customers prefer service providers that take the initiative to get to them before they have to initiate contact for themselves. The findings also identify the moderating influences of relationship quality, situational involvement, and contact person status and motive. Originality/value: The research contributes to the development of service provision theory and practice by expanding on previous studies which report that proactive efforts to prepare customers for the adverse effects of service failure are favorably received. The results also shed light on moderating factors that may further inform the exploitation of resource-intensive proactive interaction for service failure prevention. An agenda is proposed to stimulate future research on proactive customer interaction to prevent service failure in service provision contexts.


Author(s):  
Ming Hao ◽  
Weijing Wang ◽  
Fang Zhou

Short text classification is an important foundation for natural language processing (NLP) tasks. Although text classification based on deep language models (DLMs) has made significant headway, in practical applications some texts are ambiguous and hard to classify, especially in multi-class classification of short texts whose context length is limited. Mainstream methods improve the distinction of ambiguous text by adding context information. However, these methods rely only on the text representation and ignore that the categories overlap and are not completely independent of each other. In this paper, we establish a new general method to solve the problem of ambiguous text classification by introducing label embeddings to represent each category, which makes the difference between categories measurable. Further, a new compositional loss function is proposed to train the model, which brings the text representation closer to the ground-truth label and farther away from the others. Finally, a constraint is obtained by calculating the similarity between the text representation and the label embeddings. Errors caused by ambiguous text can be corrected by adding this constraint to the output layer of the model. We apply the method to three classical models and conduct experiments on six public datasets. Experiments show that our method can effectively improve the classification accuracy of ambiguous texts. In addition, combining our method with BERT, we obtain state-of-the-art results on the CNT dataset.
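The similarity constraint described in this abstract can be pictured with a small sketch. The abstract does not give the exact formulation, so everything here is assumed for illustration: cosine similarity as the similarity measure, an additive combination with the classifier logits, and the names `constrained_scores` and `weight`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def constrained_scores(logits, text_vec, label_embs, weight=1.0):
    """Add a label-similarity term to each class logit (assumed combination rule)."""
    return [
        logit + weight * cosine(text_vec, emb)
        for logit, emb in zip(logits, label_embs)
    ]


# Toy case: the raw logits narrowly prefer class 1, but the text representation
# lies much closer to the embedding of class 0.
logits = [1.0, 1.05]
text_vec = [1.0, 0.0]
label_embs = [[1.0, 0.0], [0.0, 1.0]]

scores = constrained_scores(logits, text_vec, label_embs)
raw_pred = max(range(len(logits)), key=logits.__getitem__)
new_pred = max(range(len(scores)), key=scores.__getitem__)
```

In this toy case the prediction flips from class 1 to class 0, which is the kind of correction for ambiguous inputs that the abstract describes.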


2021 ◽  
Vol 15 ◽  
Author(s):  
Jianwei Zhang ◽  
Xubin Zhang ◽  
Lei Lv ◽  
Yining Di ◽  
Wei Chen

Background: Learning discriminative representations from large-scale data sets has led to breakthroughs in recent decades. However, it is still a thorny problem to generate representative embeddings from limited examples, for example, a class containing only one image. Recently, deep learning-based Few-Shot Learning (FSL) has been proposed. It tackles this problem by leveraging prior knowledge in various ways. Objective: In this work, we review recent advances of FSL from the perspective of high-dimensional representation learning. The results of the analysis can provide insights and directions for future work. Methods: We first present the definition of general FSL. Then we propose a general framework for the FSL problem and give the taxonomy under the framework. We survey two FSL directions: learning policy and meta-learning. Results: We review the advanced applications of FSL, including image classification, object detection, image segmentation, and other tasks, as well as the corresponding benchmarks to provide an overview of recent progress. Conclusion: FSL needs to be further studied in medical images, language models, and reinforcement learning in future work. In addition, cross-domain FSL, successive FSL, and associated FSL are more challenging and valuable research directions.


2021 ◽  
Author(s):  
Thomas Gläßle ◽  
Kerstin Rau ◽  
Thomas Scholten ◽  
Philipp Hennig

Gaussian Processes provide a theoretically well-understood regression framework that is widely used in the context of Digital Soil Mapping. Among the reasons to use Gaussian Process Regression (GPR) are its interpretability, its built-in support for uncertainty quantification, and its ability to handle unevenly spaced and correlated training samples through a user-specified covariance kernel. The base case of GPR is performed with covariance models that are specified functions of Euclidean distance. In order to incorporate information other than the relative positions, regression-kriging extends GPR by an additive regression model of choice, and co-kriging considers a covariance model between covariates and the target variable. In this work, we use the alternative approach of incorporating topographic information directly into the kernel function by use of a non-Euclidean, non-stationary distance function. In particular, we devise kernels based on a path of least effort, where effort is locally specified as a function constructed from prior knowledge. It can, e.g., be derived from local topographic variables. We demonstrate that our candidate models improve prediction accuracy over the base model. This shows that domain knowledge can be integrated into the model by means of handcrafted kernel functions. The approach is not per se restricted to topographic variables, but could be used for any covariate quantity that is available at output resolution.
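A kernel built on a path of least effort can be sketched on a small raster. The code below is an illustration under stated assumptions, not the authors' construction: the effort surface is a plain grid of positive costs, the least-effort distance is found with Dijkstra's algorithm over 4-connected cells, a step costs the mean effort of the two cells it joins, and the distance is plugged into a squared-exponential form. The names `least_effort_distance`, `path_kernel`, and `length_scale` are hypothetical. Substituting a non-Euclidean distance into this form does not by itself guarantee a positive semidefinite kernel, so a real model would need to check or enforce that.

```python
import heapq
import math


def least_effort_distance(effort, start, goal):
    """Dijkstra over a 4-connected grid; a step costs the mean effort of its two cells."""
    rows, cols = len(effort), len(effort[0])
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        dist, (r, c) = heapq.heappop(queue)
        if (r, c) == goal:
            return dist
        if dist > best.get((r, c), math.inf):
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                new_dist = dist + 0.5 * (effort[r][c] + effort[nr][nc])
                if new_dist < best.get((nr, nc), math.inf):
                    best[(nr, nc)] = new_dist
                    heapq.heappush(queue, (new_dist, (nr, nc)))
    return math.inf


def path_kernel(effort, a, b, length_scale=1.0):
    """Squared-exponential form applied to the least-effort distance."""
    d = least_effort_distance(effort, a, b)
    return math.exp(-0.5 * (d / length_scale) ** 2)


flat = [[1.0] * 3 for _ in range(3)]
ridge = [[1.0, 10.0, 1.0],
         [1.0, 10.0, 1.0],
         [1.0, 1.0, 1.0]]  # a costly ridge separates the two upper corners

d_flat = least_effort_distance(flat, (0, 0), (0, 2))
d_ridge = least_effort_distance(ridge, (0, 0), (0, 2))
k_flat = path_kernel(flat, (0, 0), (0, 2), length_scale=3.0)
k_ridge = path_kernel(ridge, (0, 0), (0, 2), length_scale=3.0)
```

The two corner cells are equally far apart in Euclidean terms on both grids, but the ridge forces a detour, so their covariance under this kernel drops. That is the sense in which terrain knowledge enters through the distance function.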

