A Novel Tagging Augmented LDA Model for Clustering

2019 ◽  
Vol 16 (3) ◽  
pp. 59-77
Author(s):  
Yi Zhao ◽  
Yu Qiao ◽  
Keqing He

Clustering has become an increasingly important task in the analysis of large documents. Clustering aims to organize these documents, and facilitate better search and knowledge extraction. Most existing clustering methods that use user-generated tags only consider their positive influence for improving automatic clustering performance. The authors argue that not all user-generated tags can provide useful information for clustering. In this article, the authors propose a new solution for clustering, named HRT-LDA (High Representation Tags Latent Dirichlet Allocation), which considers the effects of different tags on clustering performance. For this, the authors perform a tag filtering strategy and a tag appending strategy based on transfer learning, Word2vec, TF-IDF and semantic computing. Extensive experiments on real-world datasets demonstrate that HRT-LDA outperforms the state-of-the-art tagging augmented LDA methods for clustering.

Entropy ◽  
2021 ◽  
Vol 23 (6) ◽  
pp. 771
Author(s):  
Qiang Wei ◽  
Guangmin Hu

Collected network data are often incomplete, with both missing nodes and missing edges. Thus, network completion that infers the unobserved part of the network is essential for downstream tasks. Despite the emerging literature related to network recovery, the potential information has not been effectively exploited. In this paper, we propose a novel unified deep graph convolutional network that infers missing edges by leveraging node labels, features, and distances. Specifically, we first construct an estimated network topology for the unobserved part using node labels, then jointly refine the network topology and learn the edge likelihood with node labels, node features and distances. Extensive experiments using several real-world datasets show the superiority of our method compared with the state-of-the-art approaches.


Author(s):  
Masoud Hamedani ◽  
Sang-Wook Kim

In this paper, we propose SimAndro-Plus as an improved variant of the state-of-the-art method, SimAndro, to compute the similarity of Android applications (apps) regarding their functionalities. SimAndro-Plus has two major differences from SimAndro: 1) it exploits two features beneficial to similarity computation, which are totally disregarded by SimAndro; 2) to compute the similarity score of an app-pair based on strings and package name features, SimAndro-Plus considers not only those terms co-appearing in both apps but also those terms appearing in one app while missing in the other. The results of our extensive experiments with three real-world datasets and a dataset constructed by human experts demonstrate that 1) each of the two aforementioned differences is effective in achieving better accuracy and 2) SimAndro-Plus outperforms SimAndro in similarity computation by 14% on average.


Author(s):  
Shoujin Wang ◽  
Liang Hu ◽  
Yan Wang ◽  
Quan Z. Sheng ◽  
Mehmet Orgun ◽  
...  

User purchase behaviours are complex and dynamic, and are usually observed as multiple choice actions across a sequence of shopping baskets. Most of the existing next-basket prediction approaches model user actions as homogeneous sequence data without considering complex and heterogeneous user intentions, impeding deep understanding of user behaviours from the perspective of human inside drivers and thus reducing the prediction performance. Psychological theories have indicated that user actions are essentially driven by certain underlying intentions (e.g., diet and entertainment). Moreover, different intentions may influence each other while different choices usually have different utilities to accomplish an intention. Inspired by such psychological insights, we formalize the next-basket prediction as an Intention Recognition, Modelling and Accomplishing problem and further design the Intention2Basket (Int2Ba in short) model. In Int2Ba, an Intention Recognizer, a Coupled Intention Chain Net, and a Dynamic Basket Planner are specifically designed to respectively recognize, model and accomplish the heterogeneous intentions behind a sequence of baskets to better plan the next basket. Extensive experiments on real-world datasets show the superiority of Int2Ba over the state-of-the-art approaches.


Author(s):  
Sen Su ◽  
Li Sun ◽  
Zhongbao Zhang ◽  
Gen Li ◽  
Jielun Qu

Recently, reconciling social networks has received significant attention. Most of the existing studies have limitations in the following three aspects: multiplicity, comprehensiveness and robustness. To address these three limitations, we rethink this problem and propose the MASTER framework, i.e., across Multiple social networks, integrate Attribute and STructure Embedding for Reconciliation. In this framework, we first design a novel Constrained Dual Embedding model by simultaneously embedding and reconciling multiple social networks to formulate our problem into a unified optimization. To address this optimization, we then design an effective algorithm called NS-Alternating. We also prove that this algorithm converges to KKT points. Through extensive experiments on real-world datasets, we demonstrate that MASTER outperforms the state-of-the-art approaches.


2021 ◽  
Vol 15 (5) ◽  
pp. 1-32
Author(s):  
Quang-huy Duong ◽  
Heri Ramampiaro ◽  
Kjetil Nørvåg ◽  
Thu-lan Dam

Dense subregion (subgraph & subtensor) detection is a well-studied area, with a wide range of applications, and numerous efficient approaches and algorithms have been proposed. Approximation approaches are commonly used for detecting dense subregions due to the complexity of the exact methods. Existing algorithms are generally efficient for dense subtensor and subgraph detection, and can perform well in many applications. However, most of the existing works utilize the state-of-the-art greedy 2-approximation algorithm, which provides solutions with only a loose theoretical density guarantee. The main drawback of most of these algorithms is that they can estimate only one subtensor, or subgraph, at a time, with a low guarantee on its density. While some methods can, on the other hand, estimate multiple subtensors, they can give a guarantee on the density with respect to the input tensor for the first estimated subtensor only. We address these drawbacks by providing both theoretical and practical solutions for estimating multiple dense subtensors in tensor data and giving a higher lower bound on the density. In particular, we prove a higher lower bound on the density of the estimated subgraphs and subtensors. We also propose a novel approach to show that there are multiple dense subtensors, each with a guaranteed density greater than the lower bound used in the state-of-the-art algorithms. We evaluate our approach with extensive experiments on several real-world datasets, which demonstrate its efficiency and feasibility.
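
For readers unfamiliar with the greedy 2-approximation mentioned in the abstract above, the following is a minimal sketch of the classic greedy peeling idea for the plain graph case only. It is not the authors' method and does not cover tensors or multiple subregions. It assumes an undirected simple graph and defines density as the number of edges divided by the number of nodes; the function name `greedy_densest_subgraph` and the example graph are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def greedy_densest_subgraph(edges):
    """Greedy peeling sketch: repeatedly drop a minimum-degree node and
    remember the densest intermediate subgraph (density = edges / nodes)."""
    # Build an adjacency map, ignoring self-loops (assumption: simple graph).
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    nodes = set(adj)
    num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    best_nodes, best_density = set(nodes), 0.0

    while nodes:
        density = num_edges / len(nodes)
        if density > best_density:
            best_density, best_nodes = density, set(nodes)
        # Linear scan for a minimum-degree node; a bucket queue would be
        # faster, but the scan keeps the sketch short.
        u = min(nodes, key=lambda n: len(adj[n]))
        for v in adj[u]:
            adj[v].discard(u)
        num_edges -= len(adj[u])
        nodes.remove(u)
        del adj[u]

    return best_nodes, best_density


# Hypothetical example: a 4-clique {0, 1, 2, 3} with a short tail 3-4-5.
example_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
found_nodes, found_density = greedy_densest_subgraph(example_edges)
print(sorted(found_nodes), found_density)  # prints [0, 1, 2, 3] 1.5
```

In this example the peeling removes the tail nodes first, so the 4-clique (6 edges over 4 nodes) is the densest subgraph seen. The procedure returns a single subgraph per run, which is the limitation the abstract points to.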


2020 ◽  
Vol 34 (01) ◽  
pp. 19-26 ◽  
Author(s):  
Chong Chen ◽  
Min Zhang ◽  
Yongfeng Zhang ◽  
Weizhi Ma ◽  
Yiqun Liu ◽  
...  

Recent studies on recommendation have largely focused on exploring state-of-the-art neural networks to improve the expressiveness of models, while typically applying the Negative Sampling (NS) strategy for efficient learning. Despite its effectiveness, two important issues have not been well-considered in existing methods: 1) NS suffers from dramatic fluctuation, making it difficult for sampling-based methods to achieve the optimal ranking performance in practical applications; 2) although heterogeneous feedback (e.g., view, click, and purchase) is widespread in many online systems, most existing methods leverage only one primary type of user feedback such as purchase. In this work, we propose a novel non-sampling transfer learning solution, named Efficient Heterogeneous Collaborative Filtering (EHCF) for Top-N recommendation. It can not only model fine-grained user-item relations, but also efficiently learn model parameters from the whole heterogeneous data (including all unlabeled data) with a rather low time complexity. Extensive experiments on three real-world datasets show that EHCF significantly outperforms state-of-the-art recommendation methods in both traditional (single-behavior) and heterogeneous scenarios. Moreover, EHCF shows significant improvements in training efficiency, making it more applicable to real-world large-scale systems. Our implementation has been released to facilitate further developments on efficient whole-data based neural methods.


Entropy ◽  
2020 ◽  
Vol 22 (4) ◽  
pp. 407 ◽  
Author(s):  
Dominik Weikert ◽  
Sebastian Mai ◽  
Sanaz Mostaghim

In this article, we present a new algorithm called Particle Swarm Contour Search (PSCS), a Particle Swarm Optimisation inspired algorithm to find object contours in 2D environments. Currently, most contour-finding algorithms are based on image processing and require a complete overview of the search space in which the contour is to be found. However, for real-world applications this would require complete knowledge of the search space, which may not always be feasible or possible. The proposed algorithm removes this requirement and is based only on the local information of the particles to accurately identify a contour. Particles search for the contour of an object and then traverse alongside it using their known information about positions inside and outside of the object. Our experiments show that the proposed PSCS algorithm can deliver results comparable to the state-of-the-art.


2021 ◽  
Vol 8 (2) ◽  
pp. 273-287
Author(s):  
Xuewei Bian ◽  
Chaoqun Wang ◽  
Weize Quan ◽  
Juntao Ye ◽  
Xiaopeng Zhang ◽  
...  

Recent learning-based approaches show promising performance improvement for the scene text removal task but usually leave several remnants of text and provide visually unpleasant results. In this work, a novel end-to-end framework is proposed based on accurate text stroke detection. Specifically, the text removal problem is decoupled into text stroke detection and stroke removal; we design separate networks to solve these two subproblems, the latter being a generative network. These two networks are combined as a processing unit, which is cascaded to obtain our final model for text removal. Experimental results demonstrate that the proposed method substantially outperforms the state-of-the-art for locating and erasing scene text. A new large-scale real-world dataset with 12,120 images has been constructed and is being made available to facilitate research, as current publicly available datasets are mainly synthetic and so cannot properly measure the performance of different methods.


Author(s):  
Andrés Camero ◽  
Jamal Toutouh ◽  
Javier Ferrer ◽  
Enrique Alba

The unsustainable development of countries has created a problem due to the unstoppable waste generation. Moreover, waste collection is carried out following a pre-defined route that does not take into account the actual level of the containers collected. Therefore, optimizing the way the waste is collected presents an interesting opportunity. In this study, we tackle the problem of predicting the waste generation ratio in real-world conditions, i.e., under uncertainty. Particularly, we use a deep neuroevolutionary technique to automatically design a recurrent network that captures the filling level of all waste containers in a city at once, and we study the suitability of our proposal when faced with noisy and faulty data. We validate our proposal using a real-world case study, consisting of more than two hundred waste containers located in a city in Spain, and we compare our results to the state-of-the-art. The results show that our approach exceeds all its competitors and that its accuracy in a real-world scenario, i.e., under uncertain data, is good enough for optimizing the waste collection planning.


2019 ◽  
Vol 3 (3) ◽  
pp. 165-186 ◽  
Author(s):  
Chenliang Li ◽  
Shiqian Chen ◽  
Yan Qi

Filtering out irrelevant documents and classifying the relevant ones into topical categories is a de facto task in many applications. However, supervised learning solutions require extravagant human efforts on document labeling. In this paper, we propose a novel seed-guided topic model for dataless short text classification and filtering, named SSCF. Without using any labeled documents, SSCF takes a few “seed words” for each category of interest, and conducts short text filtering and classification in a weakly supervised manner. To overcome the issues of data sparsity and imbalance, the short text collection is mapped to a collection of pseudo-documents, one for each word. SSCF infers two kinds of topics on pseudo-documents: category-topics and general-topics. Each category-topic is associated with one category of interest, covering the meaning of the latter. In SSCF, we devise a novel word relevance estimation process based on the seed words, for hidden topic inference. The dominating topic of a short text is identified through post inference and then used for filtering and classification. On two real-world datasets in two languages, experimental results show that our proposed SSCF consistently achieves better classification accuracy than state-of-the-art baselines. We also observe that SSCF can even achieve better performance than the supervised classifiers supervised Latent Dirichlet Allocation (sLDA) and support vector machine (SVM) on some testing tasks.

