A Unified Formulation of k-Means, Fuzzy c-Means and Gaussian Mixture Model by the Kolmogorov–Nagumo Average

Entropy ◽  
2021 ◽  
Vol 23 (5) ◽  
pp. 518
Author(s):  
Osamu Komori ◽  
Shinto Eguchi

Clustering is a major unsupervised learning task and is widely applied in data mining and statistical data analysis. Typical examples include k-means, fuzzy c-means, and Gaussian mixture models, which are categorized as hard, soft, and model-based clusterings, respectively. We propose a new clustering method, called Pareto clustering, based on the Kolmogorov–Nagumo average, which is defined via the survival function of the Pareto distribution. The proposed algorithm subsumes all the aforementioned clusterings as well as maximum-entropy clustering. We introduce a probabilistic framework for the proposed method and discuss the underlying distribution that guarantees consistency. We derive a minorize-maximization algorithm to estimate the parameters in Pareto clustering. We compare its performance with existing methods in simulation studies and benchmark dataset analyses to demonstrate its practical utility.
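The abstract does not spell out the exact generator, so the sketch below illustrates the general mechanism with a hypothetical exponential generator rather than the paper's Pareto survival function: a Kolmogorov–Nagumo (quasi-arithmetic) average phi^{-1}(mean(phi(x_i))) interpolates between the arithmetic mean (soft, GMM-like assignment) and the minimum (hard, k-means-like assignment) as the generator's sharpness parameter grows.

```python
import math

def kn_average(xs, beta):
    # Kolmogorov–Nagumo (quasi-arithmetic) average with the illustrative
    # exponential generator phi(x) = exp(-beta * x):
    #   M(xs) = phi^{-1}( mean(phi(x_i)) )
    phi = lambda x: math.exp(-beta * x)
    phi_inv = lambda y: -math.log(y) / beta
    return phi_inv(sum(phi(x) for x in xs) / len(xs))

dists = [1.0, 2.0, 4.0]             # e.g. distances to three cluster centers
soft = kn_average(dists, beta=0.01)   # ~ arithmetic mean: soft assignment
hard = kn_average(dists, beta=100.0)  # ~ minimum: hard assignment
```

Varying the generator in this way is what lets a single formulation cover hard, soft, and model-based clustering; the paper's actual generator (the Pareto survival function) is not reproduced here.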

Author(s):  
Zachary R. McCaw ◽  
Hanna Julienne ◽  
Hugues Aschard

Abstract Although missing data are prevalent in applications, existing implementations of Gaussian mixture models (GMMs) require complete data. Standard practice is to perform complete-case analysis or imputation prior to model fitting. Both approaches have serious drawbacks, potentially resulting in biased and unstable parameter estimates. Here we present MGMM, an R package for fitting GMMs in the presence of missing data. Using three case studies on real and simulated data sets, we demonstrate that, when the underlying distribution is close to a GMM, MGMM is more effective at recovering the true cluster assignments than state-of-the-art imputation followed by standard GMM fitting. Moreover, MGMM provides an accurate assessment of cluster assignment uncertainty even when the generative distribution is not a GMM. This assessment may be used to identify unassignable observations. MGMM is available as an R package on CRAN: https://CRAN.R-project.org/package=MGMM.
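The abstract does not describe MGMM's internals, but EM-style fitting of Gaussians with missing entries typically rests on the conditional-Gaussian E-step: a missing coordinate is replaced by its conditional expectation given the observed ones. A minimal bivariate sketch (the helper `conditional_mean` is illustrative, not MGMM's API):

```python
def conditional_mean(mu, sigma, rho, x1):
    # For a bivariate normal (X1, X2), the conditional expectation of the
    # missing coordinate X2 given the observed X1 = x1 is
    #   E[X2 | X1 = x1] = mu2 + rho * (sigma2 / sigma1) * (x1 - mu1)
    mu1, mu2 = mu
    s1, s2 = sigma
    return mu2 + rho * (s2 / s1) * (x1 - mu1)

# With zero means, unit variances, and correlation 0.5, observing x1 = 2.0
# gives a conditional mean of 1.0 for the missing coordinate.
imputed = conditional_mean((0.0, 0.0), (1.0, 1.0), 0.5, 2.0)
```

Unlike single imputation before model fitting, doing this inside EM per component lets the imputed values adapt to the evolving mixture parameters.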


Author(s):  
Rasmus Froberg Brøndum ◽  
Thomas Yssing Michaelsen ◽  
Martin Bøgsted

Abstract Regressing an outcome on class labels identified by unsupervised clustering is common in many applications. However, the misclassification of class labels introduced by the learning algorithm is usually ignored, which can lead to serious bias in the estimated effect parameters. Owing to their generality, we suggest addressing the problem with regression calibration or the misclassification simulation and extrapolation method. Performance is illustrated on data simulated from Gaussian mixture models, documenting reduced bias and improved coverage of confidence intervals when adjusting for misclassification with either method. Finally, we apply our method to data from a previous study, which regressed overall survival on class labels derived from unsupervised clustering of gene expression data from bone marrow samples of multiple myeloma patients.
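The bias mechanism can be demonstrated with a toy simulation. This is a simplified sketch, not the paper's procedure: for a binary label with known (assumed) sensitivity and specificity, the naive group contrast is attenuated by the factor P(X=1|W=1) − P(X=1|W=0), and dividing by that factor gives a calibration-style correction; the paper's regression calibration and simulation-extrapolation methods are more general.

```python
import random

random.seed(0)

beta_true, p, se, sp = 2.0, 0.5, 0.8, 0.8  # true effect, prevalence, sensitivity, specificity
n = 20000

ys_by_label = {0: [], 1: []}
for _ in range(n):
    x = 1 if random.random() < p else 0             # true cluster label
    y = beta_true * x + random.gauss(0.0, 1.0)      # outcome
    # observed label with nondifferential misclassification
    w = (1 if random.random() < se else 0) if x == 1 else (0 if random.random() < sp else 1)
    ys_by_label[w].append(y)

mean = lambda v: sum(v) / len(v)
naive = mean(ys_by_label[1]) - mean(ys_by_label[0])  # attenuated estimate

# Calibration-style correction with known error rates: the naive contrast
# equals beta * (P(X=1|W=1) - P(X=1|W=0)).
p_x1_w1 = se * p / (se * p + (1 - sp) * (1 - p))
p_x1_w0 = (1 - se) * p / ((1 - se) * p + sp * (1 - p))
corrected = naive / (p_x1_w1 - p_x1_w0)
```

With these settings the attenuation factor is 0.6, so the naive estimate lands near 1.2 despite a true effect of 2.0, while the corrected estimate recovers it.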


Author(s):  
Thomas Dierckx ◽  
Jesse Davis ◽  
Wim Schoutens

Abstract The theory of Narrative Economics suggests that narratives present in media influence market participants and drive economic events. In this chapter, we investigate how financial news narratives relate to movements in the CBOE Volatility Index. To this end, we first introduce an uncharted dataset in which news articles are described by a set of financial keywords. We then perform topic modeling to extract news themes, comparing canonical latent Dirichlet allocation to a technique combining doc2vec and Gaussian mixture models. Finally, using the state-of-the-art XGBoost (Extreme Gradient Boosted Trees) machine learning algorithm, we show that the obtained news features outperform a simple baseline when predicting CBOE Volatility Index movements over different time horizons.


2020 ◽  
Vol 61 ◽  
pp. 53-62
Author(s):  
Tamás Szeniczey

Ancestry can be estimated in a probabilistic framework using facial morphological signatures of different population histories. While the assessment of qualitative traits requires more experience, measuring the most distinctive quantitative characters usually demands anthropological instruments that are seldom available. However, some of the facial measurements related to ancestry can easily be recorded with a caliper. This study aims to demonstrate the efficiency of the simotic chord (SC) and the nasomalar angle (M77) in ancestry estimation by applying supervised and unsupervised algorithms to classify skulls of European and Asian ancestry. Linear discriminant analysis and Gaussian mixture models were applied to a subset of the Howells craniometric dataset comprising individuals of European and Asian ancestry. Prediction of ancestry was carried out on a set of craniometric traits describing height, length, and width parameters of the crania, and then repeated on this set supplemented with the M77 and SC measurements. An increased rate of correct ancestry prediction was achieved by including the M77 and SC measurements.
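The design — classify on a baseline trait set, then repeat with an informative measurement added and compare accuracy — can be sketched on hypothetical simulated data (not the Howells dataset) with a simple per-feature Gaussian classifier standing in for the study's linear discriminant analysis:

```python
import random, math

random.seed(1)

# Hypothetical data: feature 0 is a generic cranial measurement with heavy
# class overlap; feature 1 stands in for an ancestry-informative trait
# (such as M77) with better separation.  Units and values are invented.
def sample(cls, n):
    mus = {0: (10.0, 140.0), 1: (10.5, 146.0)}   # per-class feature means
    sds = (2.0, 2.0)                              # shared standard deviations
    return [tuple(random.gauss(mus[cls][j], sds[j]) for j in range(2)) for _ in range(n)]

train = {c: sample(c, 500) for c in (0, 1)}
test = [(x, c) for c in (0, 1) for x in sample(c, 500)]

def fit(points, dims):
    # Per-feature Gaussian parameters (naive-Bayes style, equal priors)
    out = []
    for j in dims:
        vals = [p[j] for p in points]
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / len(vals)
        out.append((mu, var))
    return out

def log_lik(x, params, dims):
    return sum(-0.5 * math.log(2 * math.pi * var) - (x[j] - mu) ** 2 / (2 * var)
               for j, (mu, var) in zip(dims, params))

def accuracy(dims):
    models = {c: fit(train[c], dims) for c in (0, 1)}
    hits = sum(1 for x, c in test
               if max((0, 1), key=lambda k: log_lik(x, models[k], dims)) == c)
    return hits / len(test)

acc_base = accuracy([0])      # baseline measurement only
acc_plus = accuracy([0, 1])   # baseline + ancestry-informative trait
```

On this synthetic setup the added feature raises accuracy substantially, mirroring the study's finding that supplementing the trait set with M77 and SC improves correct prediction.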


Mathematics ◽  
2021 ◽  
Vol 9 (9) ◽  
pp. 957
Author(s):  
Branislav Popović ◽  
Lenka Cepova ◽  
Robert Cep ◽  
Marko Janev ◽  
Lidija Krstanović

In this work, we propose a novel measure of similarity between Gaussian mixture models (GMMs) based on neighborhood preserving embedding (NPE) of the parameter space, which projects the components of the GMMs under the assumption that they lie close to a lower-dimensional manifold. In this way, we obtain a transformation from the original high-dimensional parameter space into a much lower-dimensional one. Computing the distance between two GMMs is thereby reduced to calculating the distance between sets of lower-dimensional Euclidean vectors, taking the corresponding component weights into account. A much better trade-off between recognition accuracy and computational complexity is achieved compared with measures that evaluate distances between Gaussian components in the original parameter space. The proposed measure is considerably more efficient in machine learning tasks that operate on large data sets, where the required overall number of Gaussian components is invariably large. Both artificial and real-world experiments confirm this trade-off relative to all baseline measures of GMM similarity tested in this paper.
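The abstract does not give the exact distance formula, so the sketch below is a simplified stand-in: a fixed linear map replaces the learned NPE projection, covariances are ignored, and the two embedded component sets are compared by a symmetrized, weight-averaged nearest-component distance — one simple way to realize "distance between weighted sets of low-dimensional vectors", not the paper's measure.

```python
import math

def embed(mean_vec, proj):
    # Linear projection of a component mean into the low-dimensional space
    # (stand-in for the learned NPE transformation).
    return tuple(sum(r * m for r, m in zip(row, mean_vec)) for row in proj)

def gmm_distance(gmm_a, gmm_b, proj):
    # Each GMM is a list of (weight, mean) pairs.  Score each embedded
    # component by the distance to its nearest counterpart in the other
    # mixture, weight-average, and symmetrize.
    ea = [(w, embed(m, proj)) for w, m in gmm_a]
    eb = [(w, embed(m, proj)) for w, m in gmm_b]
    def directed(src, dst):
        return sum(w * min(math.dist(e, f) for _, f in dst) for w, e in src)
    return 0.5 * (directed(ea, eb) + directed(eb, ea))

proj = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]          # toy 3D -> 2D projection
g1 = [(0.5, (0.0, 0.0, 0.0)), (0.5, (4.0, 4.0, 0.0))]
g2 = [(0.5, (0.1, 0.0, 0.0)), (0.5, (4.0, 4.1, 0.0))]
```

After embedding, each pairwise comparison costs only low-dimensional Euclidean distances, which is the source of the claimed efficiency gain over component-wise distances evaluated in the original parameter space.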

