Dirichlet Multinomial Mixture with Variational Manifold Regularization: Topic Modeling over Short Texts

Conventional topic models suffer from a severe sparsity problem when facing extremely short texts such as social media posts. The family of Dirichlet multinomial mixture (DMM) can handle the sparsity problem, however, they are still very sensitive to ordinary and noisy words, resulting in inaccurate topic representations at the document level. In this paper, we alleviate this problem by preserving local neighborhood structure of short texts, enabling to spread topical signals among neighboring documents, so as to correct the inaccurate topic representations. This is achieved by using variational manifold regularization, constraining the close short texts should have similar variational topic representations. Upon this idea, we propose a novel Laplacian DMM (LapDMM) topic model. During the document graph construction, we further use the word mover’s distance with word embeddings to measure document similarities at the semantic level. To evaluate LapDMM, we compare it against the state-of-theart short text topic models on several traditional tasks. Experimental results demonstrate that our LapDMM achieves very significant performance gains over baseline models, e.g., achieving even about 0.2 higher scores on clustering and classification tasks in many cases.

Download Full-text

A Gamma-Poisson Mixture Topic Model for Short Text

Mathematical Problems in Engineering ◽

10.1155/2020/4728095 ◽

2020 ◽

Vol 2020 ◽

pp. 1-17

Author(s):

Jocelyn Mazarura ◽

Alta de Waal ◽

Pieter de Villiers

Keyword(s):

Poisson Distribution ◽

Mixture Model ◽

Gibbs Sampler ◽

Topic Model ◽

Topic Models ◽

Multinomial Distribution ◽

Topic Modelling ◽

Poisson Mixture ◽

Short Text ◽

Poisson Mixture Model

Most topic models are constructed under the assumption that documents follow a multinomial distribution. The Poisson distribution is an alternative distribution to describe the probability of count data. For topic modelling, the Poisson distribution describes the number of occurrences of a word in documents of fixed length. The Poisson distribution has been successfully applied in text classification, but its application to topic modelling is not well documented, specifically in the context of a generative probabilistic model. Furthermore, the few Poisson topic models in the literature are admixture models, making the assumption that a document is generated from a mixture of topics. In this study, we focus on short text. Many studies have shown that the simpler assumption of a mixture model fits short text better. With mixture models, as opposed to admixture models, the generative assumption is that a document is generated from a single topic. One topic model, which makes this one-topic-per-document assumption, is the Dirichlet-multinomial mixture model. The main contributions of this work are a new Gamma-Poisson mixture model, as well as a collapsed Gibbs sampler for the model. The benefit of the collapsed Gibbs sampler derivation is that the model is able to automatically select the number of topics contained in the corpus. The results show that the Gamma-Poisson mixture model performs better than the Dirichlet-multinomial mixture model at selecting the number of topics in labelled corpora. Furthermore, the Gamma-Poisson mixture produces better topic coherence scores than the Dirichlet-multinomial mixture model, thus making it a viable option for the challenging task of topic modelling of short text.

Download Full-text

Document Informed Neural Autoregressive Topic Models with Distributional Prior

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33016505 ◽

2019 ◽

Vol 33 ◽

pp. 6505-6512 ◽

Cited By ~ 2

Author(s):

Pankaj Gupta ◽

Yatin Chaudhary ◽

Florian Buettner ◽

Hinrich Schütze

Keyword(s):

Topic Model ◽

State Of The Art ◽

Topic Models ◽

Language Modeling ◽

Context Information ◽

Short Text ◽

Neuron Networks ◽

Proposed Model ◽

Actual Meaning ◽

Biological Neuron

We address two challenges in topic models: (1) Context information around words helps in determining their actual meaning, e.g., “networks” used in the contexts artificial neural networks vs. biological neuron networks. Generative topic models infer topic-word distributions, taking no or only little context into account. Here, we extend a neural autoregressive topic model to exploit the full context information around words in a document in a language modeling fashion. The proposed model is named as iDocNADE. (2) Due to the small number of word occurrences (i.e., lack of context) in short text and data sparsity in a corpus of few documents, the application of topic models is challenging on such texts. Therefore, we propose a simple and efficient way of incorporating external knowledge into neural autoregressive topic models: we use embeddings as a distributional prior. The proposed variants are named as DocNADEe and iDocNADEe. We present novel neural autoregressive topic model variants that consistently outperform state-of-the-art generative topic models in terms of generalization, interpretability (topic coherence) and applicability (retrieval and classification) over 7 long-text and 8 short-text datasets from diverse domains.

Download Full-text

Topic Models: A Tutorial with R

International Journal of Semantic Computing ◽

10.1142/s1793351x14500044 ◽

2014 ◽

Vol 08 (01) ◽

pp. 85-98 ◽

Cited By ~ 3

Author(s):

G. Manning Richardson ◽

Janet Bowers ◽

A. John Woodill ◽

Joseph R. Barr ◽

Jean Mark Gawron ◽

...

Keyword(s):

Topic Model ◽

Topic Models ◽

Text Analytics ◽

Text Documents ◽

Short Text ◽

Document Databases

This tutorial presents topic models for organizing and comparing documents. The technique and corresponding discussion focuses on analysis of short text documents, particularly micro-blogs. However, the base topic model and R implementation are generally applicable to text analytics of document databases.

Download Full-text

A Study on Bestseller Short Text Semantics Analysis Using Topic Model

The Journal of Image and Cultural Contents ◽

10.24174/jicc.2018.10.15.101 ◽

2018 ◽

Vol 15 ◽

pp. 101-112

Author(s):

So-Hyun Park ◽

Ae-Rin Song ◽

Young-Ho Park ◽

Sun-Young Ihm

Keyword(s):

Topic Model ◽

Short Text

Download Full-text

Author–Subject–Topic model for reviewer recommendation

Journal of Information Science ◽

10.1177/0165551518806116 ◽

2018 ◽

Vol 45 (4) ◽

pp. 554-570 ◽

Cited By ~ 1

Author(s):

Jian Jin ◽

Qian Geng ◽

Haikun Mou ◽

Chong Chen

Keyword(s):

Information System ◽

Topic Model ◽

Academic Library ◽

Topic Models ◽

Interdisciplinary Studies ◽

Distribution Analysis ◽

Topic Distribution ◽

Research Domains

Interdisciplinary studies are becoming increasingly popular, and research domains of many experts are becoming diverse. This phenomenon brings difficulty in recommending experts to review interdisciplinary submissions. In this study, an Author–Subject–Topic (AST) model is proposed with two versions. In the model, reviewers’ subject information is embedded to analyse topic distributions of submissions and reviewers’ publications. The major difference between the AST and Author–Topic models lies in the introduction of a ‘Subject’ layer, which supervises the generation of hierarchical topics and allows sharing of subjects among authors. To evaluate the performance of the AST model, papers in Information System and Management (a typical interdisciplinary domain) in a famous Chinese academic library are investigated. Comparative experiments are conducted, which show the effectiveness of the AST model in topic distribution analysis and reviewer recommendation for interdisciplinary studies.

Download Full-text

IDENTIFICATION OF ARGUMENTATIVE RELATIONS IN POPULAR SCIENCE TEXTS

System Informatics ◽

10.31144/si.2307-6410.2020.n16.p149-164 ◽

2020 ◽

Author(s):

Natalia Vasilievna Salomatina ◽

◽

Irina Semenovna Kononenko ◽

Elena Anatolvna Sidorova ◽

Ivan Sergeevich Pimenov ◽

...

Keyword(s):

Topic Model ◽

Topic Models ◽

Popular Science ◽

Automatic Recognition ◽

Scan Statistics ◽

Science Texts

The presented work describes the analysis of argumentative statements included into the same text topic fragment as a recognition feature in terms of its efficiency. This study is performed with the purpose of using this feature in automatic recognition of argumentative structures presented in the popular science texts written in Russian. The topic model of a text is constructed based on superphrasal units (text fragments united by one topic) that are identified by detecting clusters of words and word-combinations with the use of scan statistics. Potential relations, extracted from topic models, are verified through the use of texts with manually annotated argumentation structures. The comparison between potential (based on topic models) and manually constructed relations is performed automatically. Macro-average scores of precision and recall are equal to 48.6% and 76.2% correspondingly.

Download Full-text

Learning to Classify Short Text with Topic Model and External Knowledge

Knowledge Science, Engineering and Management - Lecture Notes in Computer Science ◽

10.1007/978-3-642-39787-5_41 ◽

2013 ◽

pp. 493-503 ◽

Cited By ~ 12

Author(s):

Ying Zhu ◽

Li Li ◽

Le Luo

Keyword(s):

Topic Model ◽

Short Text ◽

External Knowledge

Download Full-text

Topic-BERT: Detecting harmful information from social media

Intelligent Decision Technologies ◽

10.3233/idt-200094 ◽

2021 ◽

pp. 1-10

Author(s):

Wang Gao ◽

Hongtao Deng ◽

Xun Zhu ◽

Yuan Fang

Keyword(s):

Social Media ◽

Language Processing ◽

Topic Model ◽

Classification Performance ◽

Critical Research ◽

Short Text ◽

Additional Information ◽

Proposed Model ◽

Weight Calculation ◽

Two Stages

Harmful information identification is a critical research topic in natural language processing. Existing approaches have been focused either on rule-based methods or harmful text identification of normal documents. In this paper, we propose a BERT-based model to identify harmful information from social media, called Topic-BERT. Firstly, Topic-BERT utilizes BERT to take additional information as input to alleviate the sparseness of short texts. The GPU-DMM topic model is used to capture hidden topics of short texts for attention weight calculation. Secondly, the proposed model divides harmful short text identification into two stages, and different granularity labels are identified by two similar sub-models. Finally, we conduct extensive experiments on a real-world social media dataset to evaluate our model. Experimental results demonstrate that our model can significantly improve the classification performance compared with baseline methods.

Download Full-text

Neural Topic Models for Short Text Using Pretrained Word Embeddings and Its Application To Real Data

10.1109/ickii51822.2021.9574752 ◽

2021 ◽

Author(s):

Riki Murakami ◽

Basabi Chakraborty

Keyword(s):

Topic Models ◽

Real Data ◽

Word Embeddings ◽

Short Text

Download Full-text

Improving Text Analysis Using Sentence Conjunctions and Punctuation

Marketing Science ◽

10.1287/mksc.2019.1214 ◽

2020 ◽

Vol 39 (4) ◽

pp. 727-742 ◽

Cited By ~ 1

Author(s):

Joachim Büschken ◽

Greg M. Allenby

Keyword(s):

Text Analysis ◽

Latent Variable ◽

Topic Model ◽

Topic Models ◽

Future Research ◽

Multiple Data ◽

Variable Approach ◽

Multiple Data Sets ◽

Carry Over ◽

High Level

User-generated content in the form of customer reviews, blogs, and tweets is an emerging and rich source of data for marketers. Topic models have been successfully applied to such data, demonstrating that empirical text analysis benefits greatly from a latent variable approach that summarizes high-level interactions among words. We propose a new topic model that allows for serial dependency of topics in text. That is, topics may carry over from word to word in a document, violating the bag-of-words assumption in traditional topic models. In the proposed model, topic carryover is informed by sentence conjunctions and punctuation. Typically, such observed information is eliminated prior to analyzing text data (i.e., preprocessing) because words such as “and” and “but” do not differentiate topics. We find that these elements of grammar contain information relevant to topic changes. We examine the performance of our models using multiple data sets and establish boundary conditions for when our model leads to improved inference about customer evaluations. Implications and opportunities for future research are discussed.

Download Full-text