Reduced Large Datasets by Fuzzy C-Mean Clustering Using Minimal Enclosing Ball

Author(s): Lachachi Nour-Eddine, Adla Abdelkader

2017, Vol. 25 (2), pp. 927-960
Author(s): Jarod Jacobs

In this article, I discuss three statistical tools that have proven pivotal in linguistic research, particularly in studies that evaluate large datasets: the Gaussian curve, significance tests, and hierarchical clustering. I present a brief description of these tools and their general uses. Then, I apply them to an analysis of the variations between the “biblical” DSS and our other witnesses, focusing on variations involving particles. Finally, I engage the recent debate surrounding the diachronic study of Biblical Hebrew. This article serves a dual function. First, it presents statistical tools that are useful for many linguistic studies. Second, it develops an analysis of the he-locale as it is used in the “biblical” Dead Sea Scrolls, the Masoretic Text, and the Samaritan Pentateuch. Through that analysis, this article highlights the value of inferential statistical tools as we attempt to better understand the Hebrew of our ancient witnesses.
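The abstract names hierarchical clustering among its tools but shows no computation. As a hedged sketch only, the snippet below clusters textual witnesses by made-up particle frequencies with SciPy's agglomerative clustering; the witness labels and all numbers are hypothetical placeholders, not the author's data.

```python
# Illustrative only: cluster witnesses by hypothetical per-1000-word
# frequencies of selected particles (e.g., the he-locale).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

witnesses = ["MT", "SP", "1QIsa-a", "4QSam-a"]  # hypothetical selection
freqs = np.array([
    [4.1, 2.3, 0.9],   # made-up frequencies, for illustration only
    [3.8, 2.6, 1.1],
    [5.0, 1.9, 0.7],
    [4.6, 2.1, 0.8],
])

Z = linkage(freqs, method="ward")                 # Ward's criterion
clusters = fcluster(Z, t=2, criterion="maxclust") # cut into 2 groups
print(dict(zip(witnesses, clusters)))
```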


2018
Author(s): Andrew Dalke, Jerome Hert, Christian Kramer

We present mmpdb, an open source Matched Molecular Pair (MMP) platform to create, compile, store, retrieve, and use MMP rules. mmpdb is suitable for the large datasets typically found in pharmaceutical and agrochemical companies and provides new algorithms for fragment canonicalization and stereochemistry handling. The platform is written in Python and based on the RDKit toolkit. mmpdb is freely available.
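The abstract includes no code; as a hedged illustration of the fragmentation step on which MMP rule generation rests, the sketch below uses the rdMMPA module of RDKit, the toolkit mmpdb is built on. The example molecule and option values are our own, and the exact output format varies across RDKit versions.

```python
# Enumerate MMP-style single-bond cuts with RDKit's rdMMPA module.
from rdkit import Chem
from rdkit.Chem import rdMMPA

mol = Chem.MolFromSmiles("c1ccccc1CC(=O)NC")  # arbitrary example molecule

# resultsAsMols=False returns (core, chains) SMILES pairs, with the
# attachment points marked by dummy atoms such as [*:1].
fragments = rdMMPA.FragmentMol(mol, maxCuts=1, resultsAsMols=False)
for core, chains in fragments:
    print(core, "|", chains)
```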


2012, Vol. 38 (11), pp. 1831
Author(s): Wen-Jun HU, Shi-Tong WANG, Juan WANG, Wen-Hao YING

2020, Vol. 196, pp. 105777
Author(s): Jadson Jose Monteiro Oliveira, Robson Leonardo Ferreira Cordeiro

2019, Vol. 7 (2), pp. 24
Author(s): Aju J. Fenn, Lucas Gerdes, Samuel Rothstein

Using data from 2005 to 2016, this paper examines whether players in the National Hockey League (NHL) are paid a positive salary differential for their services due to competition from the Kontinental Hockey League (KHL) and the Swedish Hockey League (SHL). To control for performance, we use two large datasets (N = 4046 and N = 1717). In keeping with the existing literature, we use lagged performance statistics and dummy variables to control for the type of NHL contract. The first dataset contains lagged career performance statistics, while the second contains statistics generated during the years of the player’s previous contract. Fixed effects least squares (FELS) and quantile regression results suggest that player production statistics, contract status, and country of origin are significant determinants of NHL player salaries.
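As a hedged sketch of the estimation strategy described above, the snippet below runs fixed-effects least squares and median regression of log salary on lagged performance using statsmodels. All column names (salary, goals_lag, assists_lag, ufa, team, season) and the synthetic data are hypothetical stand-ins, not the authors' variables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; the paper's datasets have N = 4046 and N = 1717.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "salary": np.exp(rng.normal(14, 0.5, n)),
    "goals_lag": rng.poisson(12, n),
    "assists_lag": rng.poisson(18, n),
    "ufa": rng.integers(0, 2, n),        # unrestricted-free-agent dummy
    "team": rng.choice(["NYR", "TOR", "DET", "CHI"], n),
    "season": rng.choice(range(2005, 2017), n).astype(str),
})

# FELS: team and season fixed effects enter as dummy variables via C().
fels = smf.ols(
    "np.log(salary) ~ goals_lag + assists_lag + ufa + C(team) + C(season)",
    data=df,
).fit()

# Median regression; varying q traces how the performance premium
# changes across the salary distribution.
qreg = smf.quantreg(
    "np.log(salary) ~ goals_lag + assists_lag + ufa", data=df
).fit(q=0.5)

print(fels.params.head())
print(qreg.params)
```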


2021, Vol. 4 (1)
Author(s): Danqing Xu, Chen Wang, Atlas Khan, Ning Shang, Zihuai He, ...

Labeling clinical data from electronic health records (EHR) in health systems requires extensive knowledge from human experts and painstaking review by clinicians. Furthermore, existing phenotyping algorithms are not uniformly applied across large datasets and can suffer from inconsistencies in case definitions across different algorithms. We describe here quantitative disease risk scores based on almost unsupervised methods that require minimal input from clinicians, can be applied to large datasets, and alleviate some of the main weaknesses of existing phenotyping algorithms. We show applications to phenotypic data on approximately 100,000 individuals in eMERGE, focusing on several complex diseases, including Chronic Kidney Disease, Coronary Artery Disease, Type 2 Diabetes, Heart Failure, and a few others. We demonstrate that, relative to existing approaches, the proposed methods have higher prediction accuracy, better identify phenotypic features relevant to the disease under consideration, perform better at clinical risk stratification, and can identify undiagnosed cases based on phenotypic features available in the EHR. Using genetic data from the eMERGE-seq panel, which includes sequencing data for 109 genes on 21,363 individuals from multiple ethnicities, we also show how the new quantitative disease risk scores improve the power of genetic association studies relative to the standard use of disease phenotypes. The results demonstrate the effectiveness of quantitative disease risk scores derived from rich phenotypic EHR databases in providing a more meaningful characterization of clinical risk for diseases of interest than the prevalent binary (case-control) classification.
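The abstract does not spell out the scoring algorithm, so the sketch below shows one common "almost unsupervised" construction that fits the description: use the leading principal component of standardized disease-related phenotypic features as a quantitative risk score. The feature matrix and orientation rule are illustrative assumptions, not the authors' method.

```python
# Illustrative unsupervised risk score: first principal component of
# standardized EHR-derived features (lab values, code counts, vitals).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.poisson(1.5, size=(1000, 20)).astype(float)  # synthetic stand-in

Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=1).fit(Z)
risk_score = Z @ pca.components_[0]

# Orient the score so that higher values mean higher feature burden
# (the PC sign is arbitrary).
if np.corrcoef(risk_score, Z.sum(axis=1))[0, 1] < 0:
    risk_score = -risk_score
```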


Mathematics, 2021, Vol. 9 (8), pp. 891
Author(s): Aurea Grané, Alpha A. Sow-Barry

This work provides a procedure for constructing and visualizing profiles, i.e., groups of individuals with similar characteristics, for weighted and mixed data by combining two classical multivariate techniques: multidimensional scaling (MDS) and the k-prototypes clustering algorithm. The well-known drawback of classical MDS on large datasets, its high computational cost, is circumvented by selecting a small random sample of the dataset, whose individuals are clustered by means of an adapted version of the k-prototypes algorithm and mapped via classical MDS. Gower’s interpolation formula is then used to project the remaining individuals onto this configuration. Throughout the process, Gower’s distance is used to measure the proximity between individuals. The methodology is illustrated on a real dataset obtained from the Survey of Health, Ageing and Retirement in Europe (SHARE), which was carried out in 19 countries and represents over 124 million aged individuals in Europe. The performance of the method was evaluated through a simulation study, whose results indicate that the new proposal overcomes the high computational cost of classical MDS with low error.
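As a hedged sketch of the core pipeline, the snippet below runs classical MDS on a small random sample and then applies Gower's (1968) interpolation formula, y = ½ Λ⁻¹ Xᵀ(b − d) with bᵢ = ‖xᵢ‖², to project the remaining individuals onto the sample's configuration. For brevity it uses plain Euclidean distance on synthetic numeric data; the paper uses Gower's mixed-data distance and k-prototypes, which are not reproduced here.

```python
import numpy as np

def classical_mds(D2, k=2):
    """Principal coordinates from a matrix of squared distances D2."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J                     # double centering
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]          # top-k eigenpairs
    L = vals[idx]
    return vecs[:, idx] * np.sqrt(L), L

def gower_interpolate(X, L, d2_new):
    """Gower's formula: y = 0.5 * L^{-1} X' (b - d), b_i = ||x_i||^2."""
    b = np.sum(X**2, axis=1)
    return 0.5 * (X.T @ (b - d2_new)) / L

rng = np.random.default_rng(1)
data = rng.normal(size=(5000, 6))             # synthetic stand-in
sample_idx = rng.choice(len(data), size=200, replace=False)
S = data[sample_idx]

D2 = np.sum((S[:, None, :] - S[None, :, :])**2, axis=-1)
X, L = classical_mds(D2, k=2)

# Project every remaining individual onto the sample's MDS map.
rest = np.delete(data, sample_idx, axis=0)
proj = np.array([
    gower_interpolate(X, L, np.sum((S - p)**2, axis=1)) for p in rest
])
```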


Author(s): Roman Flury, Reinhard Furrer

We discuss the experiences and results of the AppStatUZH team’s participation in the comprehensive and unbiased comparison of different spatial approximations conducted in the Competition for Spatial Statistics for Large Datasets. In each of the sub-competitions, we estimated the parameters of the covariance model based on a likelihood function and predicted missing observations with simple kriging. We approximated the covariance model either with covariance tapering or with a compactly supported Wendland covariance function.
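To make the two named ingredients concrete, here is a hedged sketch of a compactly supported Wendland covariance and the simple-kriging predictor ŷ = cᵀΣ⁻¹y for a zero-mean field. The Wendland variant (k = 1), the range parameter, and the synthetic locations are illustrative choices, not the team's competition settings.

```python
import numpy as np

def wendland1(d, theta=1.0, sigma2=1.0):
    """Wendland covariance (k=1): (1-r)^4 (1+4r), zero beyond d >= theta."""
    r = np.clip(d / theta, 0.0, 1.0)
    return sigma2 * (1.0 - r) ** 4 * (1.0 + 4.0 * r)

def dist(a, b):
    """Pairwise Euclidean distances between two sets of 2-D locations."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

rng = np.random.default_rng(2)
obs = rng.uniform(size=(100, 2))              # observed locations
new = rng.uniform(size=(10, 2))               # prediction locations

Sigma = wendland1(dist(obs, obs), theta=0.3)  # sparse by construction
Sigma += 1e-8 * np.eye(len(obs))              # jitter for stability
c = wendland1(dist(new, obs), theta=0.3)      # cross-covariances

y = rng.multivariate_normal(np.zeros(len(obs)), Sigma)  # synthetic field
y_hat = c @ np.linalg.solve(Sigma, y)         # simple-kriging prediction
```

Compact support is the computational point here: beyond the range theta the covariance is exactly zero, so Sigma is sparse and the solve scales to large datasets.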

