Modern Clinical Text Mining: A Guide and Review

Bethany Percha

doi:10.1146/annurev-biodatasci-030421-030931

Modern Clinical Text Mining: A Guide and Review

Annual Review of Biomedical Data Science ◽ 10.1146/annurev-biodatasci-030421-030931 ◽ 2021 ◽ Vol 4 (1)

Author(s): Bethany Percha

Keyword(s): Machine Learning ◽ Text Mining ◽ Data Science ◽ Annual Review ◽ Publication Date ◽ Biomedical Data ◽ Clinical Text ◽ Quality Improvement Research ◽ Comprehensive Survey ◽ Technical Advances

Electronic health records (EHRs) are becoming a vital source of data for healthcare quality improvement, research, and operations. However, much of the most valuable information contained in EHRs remains buried in unstructured text. The field of clinical text mining has advanced rapidly in recent years, transitioning from rule-based approaches to machine learning and, more recently, deep learning. With new methods come new challenges, however, especially for those new to the field. This review provides an overview of clinical text mining for those who are encountering it for the first time (e.g., physician researchers, operational analytics teams, machine learning scientists from other domains). While not a comprehensive survey, this review describes the state of the art, with a particular focus on new tasks and methods developed over the past few years. It also identifies key barriers between these remarkable technical advances and the practical realities of implementation in health systems and in industry. Expected final online publication date for the Annual Review of Biomedical Data Science, Volume 4 is July 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.

Modern Clinical Text Mining: A Guide and Review

10.20944/preprints202010.0649.v1 ◽ 2020

Author(s): Bethany Percha

Keyword(s): Machine Learning ◽ Text Mining ◽ Health Records ◽ Clinical Text ◽ Quality Improvement Research ◽ The Past ◽ New Methods ◽ Comprehensive Survey ◽ Technical Advances ◽ First Time

Electronic health records (EHRs) are becoming a vital source of data for healthcare quality improvement, research, and operations. However, much of the most valuable information contained in EHRs remains buried in unstructured text. The field of clinical text mining has advanced rapidly in recent years, transitioning from rule-based approaches to machine learning and, more recently, deep learning. With new methods come new challenges, however, especially for those new to the field. This review provides an overview of clinical text mining for those who are encountering it for the first time (e.g. physician researchers, operational analytics teams, machine learning scientists from other domains). While not a comprehensive survey, it describes the state of the art, with a particular focus on new tasks and methods developed over the past few years. It also identifies key barriers between these remarkable technical advances and the practical realities of implementation at health systems and in industry.

Modern Clinical Text Mining: A Guide and Review

10.20944/preprints202010.0649.v2 ◽ 2021

Author(s): Bethany Percha

Keyword(s): Machine Learning ◽ Text Mining ◽ Health Records ◽ Clinical Text ◽ Quality Improvement Research ◽ The Past ◽ New Methods ◽ Comprehensive Survey ◽ Technical Advances ◽ First Time

Electronic health records (EHRs) are becoming a vital source of data for healthcare quality improvement, research, and operations. However, much of the most valuable information contained in EHRs remains buried in unstructured text. The field of clinical text mining has advanced rapidly in recent years, transitioning from rule-based approaches to machine learning and, more recently, deep learning. With new methods come new challenges, however, especially for those new to the field. This review provides an overview of clinical text mining for those who are encountering it for the first time (e.g. physician researchers, operational analytics teams, machine learning scientists from other domains). While not a comprehensive survey, it describes the state of the art, with a particular focus on new tasks and methods developed over the past few years. It also identifies key barriers between these remarkable technical advances and the practical realities of implementation at health systems and in industry.

Probabilistic Machine Learning for Healthcare

Annual Review of Biomedical Data Science ◽ 10.1146/annurev-biodatasci-092820-033938 ◽ 2021 ◽ Vol 4 (1)

Author(s): Irene Y. Chen ◽ Shalmali Joshi ◽ Marzyeh Ghassemi ◽ Rajesh Ranganath

Keyword(s): Machine Learning ◽ Data Science ◽ Model Building ◽ Generative Models ◽ Annual Review ◽ Publication Date ◽ Biomedical Data ◽ Learning Models ◽ Probabilistic Machine Learning ◽ Machine Learning Models

Machine learning can be used to make sense of healthcare data. Probabilistic machine learning models help provide a complete picture of observed data in healthcare. In this review, we examine how probabilistic machine learning can advance healthcare. We consider challenges in the predictive model building pipeline where probabilistic models can be beneficial, including calibration and missing data. Beyond predictive models, we also investigate the utility of probabilistic machine learning models in phenotyping, in generative models for clinical use cases, and in reinforcement learning.

Perspectives on Allele-Specific Expression

Annual Review of Biomedical Data Science ◽ 10.1146/annurev-biodatasci-021621-122219 ◽ 2021 ◽ Vol 4 (1)

Author(s): Siobhan Cleary ◽ Cathal Seoighe

Keyword(s): Gene Expression ◽ Genetic Variants ◽ Data Science ◽ Genetic Diseases ◽ Annual Review ◽ Publication Date ◽ Biomedical Data ◽ Specific Expression ◽ Cis Acting ◽ Gene Copies

Diploidy has profound implications for population genetics and susceptibility to genetic diseases. Although two copies are present for most genes in the human genome, they are not necessarily both active or active at the same level in a given individual. Genomic imprinting, resulting in exclusive or biased expression in favor of the allele of paternal or maternal origin, is now believed to affect hundreds of human genes. A far greater number of genes display unequal expression of gene copies due to cis-acting genetic variants that perturb gene expression. The availability of data generated by RNA sequencing applied to large numbers of individuals and tissue types has generated unprecedented opportunities to assess the contribution of genetic variation to allelic imbalance in gene expression. Here we review the insights gained through the analysis of these data about the extent of the genetic contribution to allelic expression imbalance, the tools and statistical models for gene expression imbalance, and what the results obtained reveal about the contribution of genetic variants that alter gene expression to complex human diseases and phenotypes.

Data Science in the Food Industry

Annual Review of Biomedical Data Science ◽ 10.1146/annurev-biodatasci-020221-123602 ◽ 2021 ◽ Vol 4 (1)

Author(s): George-John Nychas ◽ Emma Sims ◽ Panagiotis Tsakanikas ◽ Fady Mohareb

Keyword(s): Food Safety ◽ Food Chain ◽ Food Industry ◽ Data Science ◽ Annual Review ◽ Publication Date ◽ Biomedical Data ◽ Constant State ◽ Food Integrity ◽ Multi Stakeholder

Food safety is one of the main challenges of the agri-food industry that is expected to be addressed in the current environment of tremendous technological progress, where consumers’ lifestyles and preferences are in a constant state of flux. Food chain transparency and trust are drivers for food integrity control and for improvements in efficiency and economic growth. Similarly, the circular economy has great potential to reduce wastage and improve the efficiency of operations in multi-stakeholder ecosystems. Throughout the food chain cycle, all food commodities are exposed to multiple hazards, resulting in a high likelihood of contamination. Such biological or chemical hazards may be naturally present at any stage of food production, whether accidentally introduced or fraudulently imposed, risking consumers’ health and their faith in the food industry. Nowadays, a massive amount of data is generated, not only from the next generation of food safety monitoring systems and along the entire food chain (primary production included) but also from the internet of things, media, and other devices. These data should be used for the benefit of society, and the scientific field of data science should be a vital player in helping to make this possible.

Integration of Multimodal Data for Deciphering Brain Disorders

Annual Review of Biomedical Data Science ◽ 10.1146/annurev-biodatasci-092820-020354 ◽ 2021 ◽ Vol 4 (1)

Author(s): Jingqi Chen ◽ Guiying Dong ◽ Liting Song ◽ Xingzhong Zhao ◽ Jixin Cao ◽ ...

Keyword(s): Human Brain ◽ Data Science ◽ Annual Review ◽ Publication Date ◽ Biomedical Data ◽ Brain Disorders ◽ Multimodal Data ◽ Future Data ◽ The Brain ◽ Shed Light

The accumulation of vast amounts of multimodal data for the human brain, in both normal and disease conditions, has provided unprecedented opportunities for understanding why and how brain disorders arise. Compared with traditional analyses of single datasets, the integration of multimodal datasets covering different types of data (i.e., genomics, transcriptomics, imaging, etc.) has shed light on the mechanisms underlying brain disorders in greater detail across both the microscopic and macroscopic levels. In this review, we first briefly introduce the popular large datasets for the brain. Then, we discuss in detail how integration of multimodal human brain datasets can reveal the genetic predispositions and the abnormal molecular pathways of brain disorders. Finally, we present an outlook on how future data integration efforts may advance the diagnosis and treatment of brain disorders.

Phenotyping Neurodegeneration in Human iPSCs

Annual Review of Biomedical Data Science ◽ 10.1146/annurev-biodatasci-092820-025214 ◽ 2021 ◽ Vol 4 (1)

Author(s): Jonathan Li ◽ Ernest Fraenkel

Keyword(s): Neurodegenerative Diseases ◽ High Throughput Screening ◽ Data Science ◽ Disease Modeling ◽ Patient Specific ◽ Annual Review ◽ Publication Date ◽ Biomedical Data ◽ Cellular Models ◽ Human iPSCs

Induced pluripotent stem cell (iPSC) technology holds promise for modeling neurodegenerative diseases. Traditional approaches for disease modeling using animal and cellular models require knowledge of disease mutations. However, many patients with neurodegenerative diseases do not have a known genetic cause. iPSCs offer a way to generate patient-specific models and study pathways of dysfunction in an in vitro setting in order to understand the causes and subtypes of neurodegeneration. Furthermore, iPSC-based models can be used to search for candidate therapeutics using high-throughput screening. Here we review how iPSC-based models are currently being used to further our understanding of neurodegenerative diseases, as well as discuss their challenges and future directions.

Metatranscriptomics for the Human Microbiome and Microbial Community Functional Profiling

Annual Review of Biomedical Data Science ◽ 10.1146/annurev-biodatasci-031121-103035 ◽ 2021 ◽ Vol 4 (1)

Author(s): Yancong Zhang ◽ Kelsey N. Thompson ◽ Tobyn Branck ◽ Yan Yan ◽ Long H. Nguyen ◽ ...

Keyword(s): Microbial Community ◽ Data Science ◽ New Technologies ◽ Biomarker Discovery ◽ Human Microbiome ◽ Community Context ◽ Energy Harvest ◽ Annual Review ◽ Publication Date ◽ Biomedical Data

Shotgun metatranscriptomics (MTX) is an increasingly practical way to survey microbial community gene function and regulation at scale. This review begins by summarizing the motivations for community transcriptomics and the history of the field. We then explore the principles, best practices, and challenges of contemporary MTX workflows: beginning with laboratory methods for isolation and sequencing of community RNA, followed by informatics methods for quantifying RNA features, and finally statistical methods for detecting differential expression in a community context. In the second half of the review, we survey important biological findings from the MTX literature, drawing examples from the human microbiome, other (nonhuman) host-associated microbiomes, and the environment. Across these examples, MTX methods prove invaluable for probing microbe–microbe and host–microbe interactions, the dynamics of energy harvest and chemical cycling, and responses to environmental stresses. We conclude with a review of open challenges in the MTX field, including making assays and analyses more robust, accessible, and adaptable to new technologies; deciphering roles for millions of uncharacterized microbial transcripts; and solving applied problems such as biomarker discovery and development of microbial therapeutics.

Neoantigen Controversies

Annual Review of Biomedical Data Science ◽ 10.1146/annurev-biodatasci-092820-112713 ◽ 2021 ◽ Vol 4 (1)

Author(s): Andrea Castro ◽ Maurizio Zanetti ◽ Hannah Carter

Keyword(s): Immune System ◽ Data Science ◽ Annual Review ◽ Publication Date ◽ Biomedical Data ◽ Success Rates ◽ Sequencing Technologies ◽ Conflicting Evidence ◽ Key Aspects

Next-generation sequencing technologies have revolutionized our ability to catalog the landscape of somatic mutations in tumor genomes. These mutations can sometimes create so-called neoantigens, which allow the immune system to detect and eliminate tumor cells. However, efforts that stimulate the immune system to eliminate tumors based on their molecular differences have had less success than has been hoped for, and there are conflicting reports about the role of neoantigens in the success of this approach. Here we review some of the conflicting evidence in the literature and highlight key aspects of the tumor–immune interface that are emerging as major determinants of whether mutation-derived neoantigens will contribute to an immunotherapy response. Accounting for these factors is expected to improve success rates of future immunotherapy approaches.

Single-Cell Analysis for Whole-Organism Datasets

Annual Review of Biomedical Data Science ◽ 10.1146/annurev-biodatasci-092820-031008 ◽ 2021 ◽ Vol 4 (1)

Author(s): Angela Oliveira Pisco ◽ Bruno Tojo ◽ Aaron McGeever

Keyword(s): Single Cell ◽ Cell Fate ◽ Cell Biology ◽ Data Science ◽ Single Cell Analysis ◽ Cell Types ◽ Annual Review ◽ Publication Date ◽ Biomedical Data ◽ A Cell

Cell atlases are essential companions to the genome as they elucidate how genes are used in a cell type–specific manner or how the usage of genes changes over the lifetime of an organism. This review explores recent advances in whole-organism single-cell atlases, which enable understanding of cell heterogeneity and tissue and cell fate, both in health and disease. Here we provide an overview of recent efforts to build cell atlases across species and discuss the challenges that the field is currently facing. Moreover, we propose the concept of having a knowledgebase that can scale with the number of experiments and computational approaches and a new feedback loop for development and benchmarking of computational methods that includes contributions from the users. These two aspects are key for community efforts in single-cell biology that will help produce a comprehensive annotated map of cell types and states with unparalleled resolution.
