Reduced Large Datasets by Fuzzy C-Mean Clustering Using Minimal Enclosing Ball

In this article, I discuss three statistical tools that have proven pivotal in linguistic research, particularly those studies that seek to evaluate large datasets. These tools are the Gaussian Curve, significance tests, and hierarchical clustering. I present a brief description of these tools and their general uses. Then, I apply them to an analysis of the variations between the “biblical” DSS and our other witnesses, focusing upon variations involving particles. Finally, I engage the recent debate surrounding the diachronic study of Biblical Hebrew. This article serves a dual function. First, it presents statistical tools that are useful for many linguistic studies. Second, it develops an analysis of the he-locale, as it is used in the “biblical” Dead Sea Scrolls, Masoretic Text, and Samaritan Pentateuch. Through that analysis, this article highlights the value of inferential statistical tools as we attempt to better understand the Hebrew of our ancient witnesses.

mmpdb: An Open Source Matched Molecular Pair Platform for Large Multi-Property Datasets

10.26434/chemrxiv.5999375 ◽

2018 ◽

Author(s):

Andrew Dalke ◽

Jerome Hert ◽

Christian Kramer

Keyword(s):

Open Source ◽

Large Datasets ◽

Molecular Pair ◽

New Algorithms

We present mmpdb, an open source Matched Molecular Pair (MMP) platform to create, compile, store, retrieve, and use MMP rules. mmpdb is suitable for the large datasets typically found in pharmaceutical and agrochemical companies and provides new algorithms for fragment canonicalization and stereochemistry handling. The platform is written in Python and based on the RDKit toolkit. mmpdb is freely available.

Fast Learning of Generalized Minimum Enclosing Ball for Large Datasets

ACTA AUTOMATICA SINICA ◽

10.3724/sp.j.1004.2012.01831 ◽

2012 ◽

Vol 38 (11) ◽

pp. 1831

Author(s):

Wen-Jun HU ◽

Shi-Tong WANG ◽

Juan WANG ◽

Wen-Hao YING

Keyword(s):

Large Datasets ◽

Fast Learning ◽

Minimum Enclosing Ball

Variable Reduction and Variable Selection Methods Using Small, Medium and Large Datasets: A Forecast Comparison for the PEEIs

SSRN Electronic Journal ◽

10.2139/ssrn.2444421 ◽

2014 ◽

Author(s):

George Kapetanios ◽

Massimiliano Giuseppe Marcellino ◽

Fotis Papailias

Keyword(s):

Variable Selection ◽

Large Datasets ◽

Selection Methods ◽

Variable Reduction ◽

Forecast Comparison

Unsupervised dimensionality reduction for very large datasets: Are we going to the right direction?

Knowledge-Based Systems ◽

10.1016/j.knosys.2020.105777 ◽

2020 ◽

Vol 196 ◽

pp. 105777

Author(s):

Jadson Jose Monteiro Oliveira ◽

Robson Leonardo Ferreira Cordeiro

Keyword(s):

Dimensionality Reduction ◽

Large Datasets ◽

Very Large Datasets ◽

The Right

Country of Origin Effects on the Average Annual Values of NHL Player Contracts

International Journal of Financial Studies ◽

10.3390/ijfs7020024 ◽

2019 ◽

Vol 7 (2) ◽

pp. 24

Author(s):

Aju J. Fenn ◽

Lucas Gerdes ◽

Samuel Rothstein

Keyword(s):

Quantile Regression ◽

Fixed Effects ◽

Country Of Origin ◽

Large Datasets ◽

National Hockey League ◽

Dummy Variables ◽

Country Of Origin Effects ◽

Performance Statistics ◽

Career Performance ◽

Using Data

Using data from 2005 to 2016, this paper examines whether players in the National Hockey League (NHL) are paid a positive differential for their services because of competition from the Kontinental Hockey League (KHL) and the Swedish Hockey League (SHL). To control for performance, we use two large datasets (N = 4046 and N = 1717). In keeping with the existing literature, we use lagged performance statistics and dummy variables to control for the type of NHL contract. The first dataset contains lagged career performance statistics, while the performance statistics in the second are based on the years under the player’s previous contract. Fixed effects least squares (FELS) and quantile regression results suggest that player production statistics, contract status, and country of origin are significant determinants of NHL player salaries.

Quantitative disease risk scores from EHR with applications to clinical risk stratification and genetic studies

npj Digital Medicine ◽

10.1038/s41746-021-00488-3 ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Danqing Xu ◽

Chen Wang ◽

Atlas Khan ◽

Ning Shang ◽

Zihuai He ◽

...

Keyword(s):

Risk Stratification ◽

Disease Risk ◽

Association Studies ◽

Large Datasets ◽

Risk Scores ◽

Sequencing Data ◽

Case Definitions ◽

Phenotypic Data ◽

Clinical Risk ◽

Phenotypic Features

Labeling clinical data from electronic health records (EHR) in health systems requires extensive knowledge from human experts and painstaking review by clinicians. Furthermore, existing phenotyping algorithms are not uniformly applied across large datasets and can suffer from inconsistencies in case definitions across different algorithms. We describe here quantitative disease risk scores based on almost unsupervised methods that require minimal input from clinicians, can be applied to large datasets, and alleviate some of the main weaknesses of existing phenotyping algorithms. We show applications to phenotypic data on approximately 100,000 individuals in eMERGE, and focus on several complex diseases, including Chronic Kidney Disease, Coronary Artery Disease, Type 2 Diabetes, Heart Failure, and a few others. We demonstrate that, relative to existing approaches, the proposed methods have higher prediction accuracy, can better identify phenotypic features relevant to the disease under consideration, can perform better at clinical risk stratification, and can identify undiagnosed cases based on phenotypic features available in the EHR. Using genetic data from the eMERGE-seq panel that includes sequencing data for 109 genes on 21,363 individuals from multiple ethnicities, we also show how the new quantitative disease risk scores help improve the power of genetic association studies relative to the standard use of disease phenotypes. The results demonstrate the effectiveness of quantitative disease risk scores derived from rich phenotypic EHR databases to provide a more meaningful characterization of clinical risk for diseases of interest beyond the prevalent binary (case-control) classification.

Visualizing Profiles of Large Datasets of Weighted and Mixed Data

Mathematics ◽

10.3390/math9080891 ◽

2021 ◽

Vol 9 (8) ◽

pp. 891

Author(s):

Aurea Grané ◽

Alpha A. Sow-Barry

Keyword(s):

Multidimensional Scaling ◽

Random Sample ◽

Simulation Study ◽

Clustering Algorithm ◽

Computational Cost ◽

Interpolation Formula ◽

Large Datasets ◽

Mixed Data ◽

Multivariate Techniques ◽

High Computational Cost

This work provides a procedure with which to construct and visualize profiles, i.e., groups of individuals with similar characteristics, for weighted and mixed data by combining two classical multivariate techniques, multidimensional scaling (MDS) and the k-prototypes clustering algorithm. The well-known drawback of classical MDS in large datasets is circumvented by selecting a small random sample of the dataset, whose individuals are clustered by means of an adapted version of the k-prototypes algorithm and mapped via classical MDS. Gower’s interpolation formula is used to project the remaining individuals onto the previous configuration. Throughout the process, Gower’s distance is used to measure the proximity between individuals. The methodology is illustrated on a real dataset, obtained from the Survey of Health, Ageing and Retirement in Europe (SHARE), which was carried out in 19 countries and represents over 124 million aged individuals in Europe. The performance of the method was evaluated through a simulation study, whose results indicate that the new proposal overcomes the high computational cost of classical MDS with low error.
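
The sample-then-interpolate idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain Euclidean distances on synthetic numeric data instead of Gower's distance on weighted mixed data, it omits the k-prototypes clustering step, and the function names `classical_mds`, `gower_interpolate`, and `pairwise_sq_dists` are hypothetical. It runs classical MDS on a small random sample and then places the remaining points with Gower's interpolation formula, x = ½ Λ⁻¹ Xᵀ (b − d²), where X holds the sample coordinates, Λ the retained eigenvalues, b the diagonal of the doubly centered Gram matrix, and d² the squared distances from a new point to the sample.

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between rows of a and rows of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def classical_mds(sq_dists, k):
    """Classical MDS on an (n, n) matrix of squared distances."""
    n = sq_dists.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ sq_dists @ centering  # doubly centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:k]            # k largest eigenvalues
    top_vals = eigvals[order]
    coords = eigvecs[:, order] * np.sqrt(top_vals)   # sample configuration X
    return coords, top_vals, np.diag(gram)

def gower_interpolate(sq_dists_to_sample, coords, top_vals, gram_diag):
    """Place new points using only their squared distances to the sample.

    sq_dists_to_sample has shape (m, n): one row per new point.
    """
    return 0.5 * (gram_diag - sq_dists_to_sample) @ coords / top_vals

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))
sample_idx = rng.choice(200, size=20, replace=False)
rest_idx = np.setdiff1d(np.arange(200), sample_idx)
sample, rest = points[sample_idx], points[rest_idx]

# Expensive eigendecomposition only on the 20 x 20 sample matrix.
coords, top_vals, gram_diag = classical_mds(pairwise_sq_dists(sample, sample), k=2)
# Cheap projection of the remaining 180 points.
projected = gower_interpolate(pairwise_sq_dists(rest, sample), coords, top_vals, gram_diag)
```

The eigendecomposition is applied only to the sample-by-sample matrix, so its cost depends on the sample size rather than on the full dataset; each remaining point needs just one row of distances to the sample.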

Discussion on Competition for Spatial Statistics for Large Datasets

Journal of Agricultural Biological and Environmental Statistics ◽

10.1007/s13253-021-00461-3 ◽

2021 ◽

Author(s):

Roman Flury ◽

Reinhard Furrer

Keyword(s):

Spatial Statistics ◽

Covariance Function ◽

Likelihood Function ◽

Large Datasets ◽

Missing Observations ◽

Covariance Model ◽

Compactly Supported ◽

Covariance Tapering ◽

Simple Kriging ◽

Estimated Parameters

We discuss the experiences and results of the AppStatUZH team’s participation in the comprehensive and unbiased comparison of different spatial approximations conducted in the Competition for Spatial Statistics for Large Datasets. In each of the different sub-competitions, we estimated parameters of the covariance model based on a likelihood function and predicted missing observations with simple kriging. We approximated the covariance model either with covariance tapering or a compactly supported Wendland covariance function.
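
As a rough illustration of the two ingredients named in this abstract, the sketch below multiplies an exponential covariance by a compactly supported Wendland function (covariance tapering) and uses the result for simple kriging. It is an assumption-laden toy and not the team's code: the exponential base model, the parameter values, the zero-mean assumption, the synthetic data, and the function names are all chosen here for illustration, and dense NumPy arrays stand in for the sparse matrices that make tapering worthwhile in practice.

```python
import numpy as np

def wendland_taper(dist, taper_range):
    """Wendland function (1 - r)^4 (4r + 1) for r < 1, else 0, with r = dist / taper_range."""
    scaled = dist / taper_range
    return np.where(scaled < 1.0, (1.0 - scaled) ** 4 * (4.0 * scaled + 1.0), 0.0)

def exponential_cov(dist, sill, range_):
    """Exponential covariance model (assumed base model)."""
    return sill * np.exp(-dist / range_)

def tapered_cov(a, b, sill=1.0, cov_range=0.3, taper_range=0.4):
    """Tapered covariance between location sets a (m, d) and b (n, d)."""
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return exponential_cov(dist, sill, cov_range) * wendland_taper(dist, taper_range)

def simple_kriging(obs_locs, obs_vals, new_locs, **cov_kwargs):
    """Simple kriging predictor with known mean zero (an assumption)."""
    cov_obs = tapered_cov(obs_locs, obs_locs, **cov_kwargs)
    cov_cross = tapered_cov(new_locs, obs_locs, **cov_kwargs)
    weights = np.linalg.solve(cov_obs, obs_vals)
    return cov_cross @ weights

# Synthetic observations on the unit square (illustrative only).
rng = np.random.default_rng(1)
obs_locs = rng.uniform(size=(50, 2))
obs_vals = np.sin(4 * obs_locs[:, 0]) + np.cos(3 * obs_locs[:, 1])
new_locs = rng.uniform(size=(5, 2))
pred = simple_kriging(obs_locs, obs_vals, new_locs)
```

Because the taper is exactly zero beyond `taper_range`, most entries of the covariance matrix vanish for large, spread-out datasets, which is what allows sparse solvers to replace the dense `np.linalg.solve` call used here.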