Biostatistics | ScienceGate

Capturing discrete latent structures: choose LDs over PCs

Summary: High-dimensional biological data collection across heterogeneous groups of samples has become increasingly common, creating high demand for dimensionality reduction techniques that capture the underlying structure of the data. Discovering low-dimensional embeddings that describe the separation of any underlying discrete latent structure in data is an important motivation for applying these techniques, since these latent classes can represent important sources of unwanted variability, such as batch effects, or interesting sources of signal, such as unknown cell types. The features that define this discrete latent structure are often hard to identify in high-dimensional data. Principal component analysis (PCA) is one of the most widely used unsupervised methods for dimensionality reduction. It finds linear transformations of the data that explain total variance. When the goal is detecting discrete structure, PCA is applied with the assumption that classes will be separated in directions of maximum variance. However, PCA will fail to accurately find discrete latent structure if this assumption does not hold. Visualization techniques, such as t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP), attempt to mitigate these problems with PCA by creating a low-dimensional space where, with high probability, similar objects are modeled by nearby points and dissimilar objects are modeled by distant points. However, since t-SNE and UMAP are computationally expensive, a PCA reduction is often done before applying them, which makes them sensitive to PCA's shortcomings. Also, t-SNE is limited to only two or three dimensions as a visualization tool, which may not be adequate for retaining discriminatory information. For interpretable feature weights, the linear transformations of PCA are preferable to the non-linear transformations provided by methods like t-SNE and UMAP.
Here, we propose iterative discriminant analysis (iDA), a dimensionality reduction technique designed to mitigate these limitations. iDA produces an embedding that carries discriminatory information which optimally separates latent clusters using linear transformations that permit post hoc analysis to determine features that define these latent structures.

Corrigendum to: Fast Lasso method for large-scale and ultrahigh-dimensional Cox model with applications to UK Biobank

Biostatistics ◽ 10.1093/biostatistics/kxab019 ◽ 2021

Author(s): Ruilin Li ◽ Christopher Chang ◽ Johanne M Justesen ◽ Yosuke Tanigawa ◽ Junyang Qian ◽ ...

Keyword(s): Large Scale ◽ Cox Model ◽ UK Biobank ◽ Lasso Method

Simultaneous differential network analysis and classification for matrix-variate data with application to brain connectivity

Biostatistics ◽ 10.1093/biostatistics/kxab007 ◽ 2021

Author(s): Hao Chen ◽ Ying Guo ◽ Yong He ◽ Jiadong Ji ◽ Lei Liu ◽ ...

Keyword(s): Network Analysis ◽ Medical Diagnosis ◽ Brain Connectivity ◽ Interaction Patterns ◽ Medical Diagnoses ◽ Differential Network Analysis ◽ Differential Interaction ◽ Clinical Biomarkers ◽ Differential Network ◽ The Matrix

Summary: Growing evidence has shown that the brain connectivity network experiences alterations in complex diseases such as Alzheimer’s disease (AD). Network comparison, also known as differential network analysis, is thus particularly powerful for revealing disease pathologies and identifying clinical biomarkers for medical diagnoses (classification). Data from neurophysiological measurements are multidimensional and in matrix form. A naive vectorization method is not sufficient, as it ignores the structural information within the matrix. In this article, we adopt the Kronecker product covariance matrices framework to capture both spatial and temporal correlations of the matrix-variate data, while the temporal covariance matrix is treated as a nuisance parameter. By recognizing that the strengths of network connections may vary across subjects, we develop an ensemble-learning procedure, which identifies the differential interaction patterns of brain regions between the case group and the control group and simultaneously conducts medical diagnosis (classification) of the disease. Simulation studies are conducted to assess the performance of the proposed method. We apply the proposed procedure to the functional connectivity analysis of a functional magnetic resonance imaging study on AD. The hub nodes and differential interaction patterns identified are consistent with existing experimental studies, and satisfactory out-of-sample classification performance is achieved for medical diagnosis of AD.

Structure-preserving integrated analysis for risk stratification with application to cancer staging

Biostatistics ◽ 10.1093/biostatistics/kxab005 ◽ 2021

Author(s): Tianjie Wang ◽ Rui Chen ◽ Wenshuo Liu ◽ Menggang Yu

Keyword(s): Risk Factors ◽ Survival Data ◽ Regression Tree ◽ Cancer Staging ◽ Classification And Regression Tree ◽ Integrated Analysis ◽ Data Sets ◽ Cancer Data ◽ Node Involvement ◽ Tree Method

Summary: To provide an appropriate and practical level of health care, it is critical to group patients into relatively few strata that have distinct prognoses. Such grouping or stratification is typically based on well-established risk factors and clinical outcomes. A well-known example is the American Joint Committee on Cancer staging for cancer, which uses tumor size, node involvement, and metastasis status. We consider a statistical method for such grouping based on individual patient data from multiple studies. The method encourages a common grouping structure as a basis for borrowing information, but acknowledges data heterogeneity, including unbalanced data structures across multiple studies. We build on the “lasso-tree” method, which is more versatile than the well-known classification and regression tree method in generating possible grouping patterns. In addition, the parametrization of the lasso-tree method makes it very natural to incorporate the underlying order information in the risk factors. In this article, we also strengthen the lasso-tree method by establishing its theoretical properties, which Lin and others (2013. Lasso tree for cancer staging with survival data. Biostatistics 14, 327–339) did not pursue. We evaluate our method in extensive simulation studies and an analysis of multiple breast cancer data sets.

Testing calibration of phenotyping models using positive-only electronic health record data

Biostatistics ◽ 10.1093/biostatistics/kxab003 ◽ 2021

Author(s): Lingjiao Zhang ◽ Yanyuan Ma ◽ Daniel Herman ◽ Jinbo Chen

Keyword(s): Gold Standard ◽ Model Calibration ◽ Electronic Health Record Data ◽ Standard Case ◽ Clinical Assessments ◽ Model Free ◽ Record Data ◽ Chi Squared ◽ Electronic Health ◽ And Control

Summary: Validation of phenotyping models using Electronic Health Records (EHRs) data conventionally requires gold-standard case and control labels. The labeling process requires clinical experts to retrospectively review patients’ medical charts and is therefore labor intensive and time consuming. For some disease conditions, it is prohibitive to identify the gold-standard controls because routine clinical assessments are performed only for selected patients who are deemed to possibly have the condition. To build a model for phenotyping patients in EHRs, the most readily accessible data are often for a cohort consisting of a set of gold-standard cases and a large number of unlabeled patients. Here, we propose methods for assessing model calibration and discrimination using such “positive-only” EHR data, which do not require gold-standard controls, provided that the labeled cases are representative of all cases. For model calibration, we propose a novel statistic that aggregates differences between model-free and model-based estimated numbers of cases across risk subgroups and asymptotically follows a Chi-squared distribution. We additionally demonstrate that the calibration slope can also be estimated using such “positive-only” data. We propose consistent estimators for discrimination measures and derive their large sample properties. We demonstrate the performance of the proposed methods through extensive simulation studies and apply them to Penn Medicine EHRs to validate two preliminary models for predicting the risk of primary aldosteronism.

Dimension constraints improve hypothesis testing for large-scale, graph-associated, brain-image data

Biostatistics ◽ 10.1093/biostatistics/kxab001 ◽ 2021

Author(s): Tien Vo ◽ Akshay Mishra ◽ Vamsi Ithapu ◽ Vikas Singh ◽ Michael A Newton

Keyword(s): Empirical Bayes ◽ Large Scale ◽ Image Data ◽ Imaging Data ◽ False Discovery Rates ◽ Testing Procedures ◽ False Discovery ◽ Large Scale Testing ◽ Brain Changes ◽ Connected Subgraphs

Summary: For large-scale testing with graph-associated data, we present an empirical Bayes mixture technique to score local false-discovery rates (FDRs). Compared to procedures that ignore the graph, the proposed Graph-based Mixture Model (GraphMM) method gains power in settings where non-null cases form connected subgraphs, and it does so by regularizing parameter contrasts between testing units. Simulations show that GraphMM controls the FDR in a variety of settings, though it may lose control with excessive regularization. On magnetic resonance imaging data from a study of brain changes associated with the onset of Alzheimer’s disease, GraphMM produces greater yield than conventional large-scale testing procedures.

OUP accepted manuscript

Biostatistics ◽ 10.1093/biostatistics/kxab013 ◽ 2021

OUP accepted manuscript

Biostatistics ◽ 10.1093/biostatistics/kxab012 ◽ 2021

Biostatistics
Latest Publications

Published By Oxford University Press

OUP accepted manuscript

OUP accepted manuscript

Capturing discrete latent structures: choose LDs over PCs

Corrigendum to: Fast Lasso method for large-scale and ultrahigh-dimensional Cox model with applications to UK Biobank

Simultaneous differential network analysis and classification for matrix-variate data with application to brain connectivity

Structure-preserving integrated analysis for risk stratification with application to cancer staging

Testing calibration of phenotyping models using positive-only electronic health record data

Dimension constraints improve hypothesis testing for large-scale, graph-associated, brain-image data

OUP accepted manuscript

OUP accepted manuscript
