Characterization of Cancer Types by Applying Machine Learning Methods on Blood RNA-Sequencing Data

Author(s):  
Cem Bugra Alkan ◽  
Zerrin Isik
2020 ◽  
Vol 18 (1) ◽  
Author(s):  
Qidong Cai ◽  
Boxue He ◽  
Pengfei Zhang ◽  
Zhenyu Zhao ◽  
Xiong Peng ◽  
...  

Background: Alternative splicing (AS) plays critical roles in generating protein diversity and complexity. Dysregulation of AS underlies the initiation and progression of tumors. Machine learning approaches have emerged as efficient tools to identify promising biomarkers. It is meaningful to explore pivotal AS events (ASEs) via machine learning algorithms to deepen understanding and improve prognostic assessment of lung adenocarcinoma (LUAD). Methods: RNA sequencing data and AS data were extracted from The Cancer Genome Atlas (TCGA) database and the TCGA SpliceSeq database. Using several machine learning methods, we identified 24 pairs of LUAD-related ASEs implicated in splicing switches and built a random forest-based classifier, consisting of 12 ASEs, for identifying lymph node metastasis (LNM). Furthermore, we identified key prognosis-related ASEs and established a 16-ASE-based prognostic model to predict overall survival for LUAD patients using a Cox regression model, random survival forest analysis, and a forward selection model. Bioinformatics analyses were also applied to identify underlying mechanisms and associated upstream splicing factors (SFs). Results: Each pair of ASEs was spliced from the same parent gene and exhibited perfect inverse intrapair correlation (correlation coefficient = −1). The 12-ASE-based classifier showed robust ability to evaluate the LNM status of LUAD patients, with an area under the receiver operating characteristic (ROC) curve (AUC) above 0.7 in fivefold cross-validation. The prognostic model performed well at 1, 3, 5, and 10 years in both the training cohort and the internal test cohort. Univariate and multivariate Cox regression indicated that the prognostic model could be used as an independent prognostic factor for patients with LUAD. Further analysis revealed correlations between the prognostic model and American Joint Committee on Cancer stage, T stage, N stage, and living status. The splicing network constructed from survival-related SFs and ASEs depicts the regulatory relationships between them. Conclusion: In summary, our study provides insight into LUAD research and management based on these AS biomarkers.
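The AUC criterion quoted in the Results can be made concrete with a minimal sketch. It is not the authors' pipeline: the function names, the toy labels, and the scores below are hypothetical, and the per-fold predictions are assumed to come from some already-trained classifier. The AUC is computed as the fraction of (positive, negative) sample pairs that the classifier ranks correctly, with ties counted as half.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    labels: 0/1 class labels (1 = positive, e.g. LNM present).
    scores: classifier scores, higher meaning "more likely positive".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


def mean_auc_over_folds(folds):
    """Average the held-out AUC over cross-validation folds.

    folds: list of (labels, scores) pairs, one per held-out fold.
    """
    return sum(roc_auc(y, s) for y, s in folds) / len(folds)


# Hypothetical held-out predictions for a single fold.
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(example_labels, example_scores))  # 0.75
```

In a fivefold setting, `mean_auc_over_folds` would receive five such (labels, scores) pairs, one for each held-out fifth of the cohort.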


2018 ◽  
Author(s):  
Parth Patel ◽  
Sandra Mathioni ◽  
Atul Kakrana ◽  
Hagit Shatkay ◽  
Blake C. Meyers

Summary: Little is known about the characteristics and function of reproductive phased, secondary, small interfering RNAs (phasiRNAs) in the Poaceae, despite the availability of significant genomic resources, experimental data, and a growing number of computational tools. We utilized machine-learning methods to identify sequence-based and structural features that distinguish phasiRNAs in rice and maize from other small RNAs (sRNAs). We developed Random Forest classifiers that can distinguish reproductive phasiRNAs from other sRNAs in complex sets of sequencing data, utilizing sequence-based (k-mer) features and features describing position-specific sequence biases. The classification performance attained is >80% in accuracy, sensitivity, specificity, and positive predictive value. Feature selection identified important features at both ends of phasiRNAs. We demonstrated that phasiRNAs have strand specificity and position-specific nucleotide biases potentially influencing AGO sorting; we also predicted targets to infer functions of phasiRNAs and computationally assessed their sequence characteristics relative to other sRNAs. Our results demonstrate that machine-learning methods effectively identify phasiRNAs despite the lack of characteristic features typically present in precursor loci of other small RNAs, such as sequence conservation or structural motifs. The 5’-end features we identified provide insights into AGO-phasiRNA interactions; we describe a hypothetical model of competition for AGO loading between phasiRNAs of different nucleotide compositions.
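As an illustration of the two feature families named in the summary, the sketch below turns a small RNA sequence into k-mer counts plus one-hot encodings of the nucleotide at chosen positions. It is a minimal example under stated assumptions, not the authors' feature set: the function names, the choice of k, the encoded positions, and the DNA-style alphabet (with U mapped to T) are all hypothetical.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: reads are in DNA alphabet; U is mapped to T


def kmer_counts(sequence, k):
    """Count every k-mer over ALPHABET, in fixed lexicographic order."""
    vocabulary = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts[kmer] for kmer in vocabulary]


def position_one_hot(sequence, position):
    """One-hot encode the nucleotide at a position (negative = from 3' end).

    Returns all zeros if the position lies outside the sequence.
    """
    if -len(sequence) <= position < len(sequence):
        base = sequence[position]
    else:
        base = None
    return [1 if base == b else 0 for b in ALPHABET]


def featurize(sequence, k=2, positions=(0, -1)):
    """Concatenate k-mer counts with position-specific one-hot features."""
    seq = sequence.upper().replace("U", "T")
    features = kmer_counts(seq, k)
    for pos in positions:
        features.extend(position_one_hot(seq, pos))
    return features


# Hypothetical 21-nt read: 16 dinucleotide counts + 4 (5' end) + 4 (3' end).
vector = featurize("UGACCUAGGAUCCGAUUACGA")
print(len(vector))  # 24
```

A vector like this could then be passed to any standard classifier; the summary reports Random Forest classifiers, but the training step is outside the scope of this sketch.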


2020 ◽  
Vol 25 (40) ◽  
pp. 4264-4273 ◽  
Author(s):  
Dan Zhang ◽  
Zheng-Xing Guan ◽  
Zi-Mei Zhang ◽  
Shi-Hao Li ◽  
Fu-Ying Dao ◽  
...  

Bioluminescent proteins (BLPs) are widely distributed among living organisms and play a key role in light emission during bioluminescence. Bioluminescence serves various functions, such as finding food and protecting organisms from predators. With its routine biotechnological application, bioluminescence is recognized as essential for many medical, commercial, and other general technological advances. Therefore, the prediction and characterization of BLPs are significant: they can help reveal more about bioluminescence and promote the development of its applications. Because experimental methods for BLP identification are costly and time-consuming, bioinformatics tools have played an important role in the fast and accurate prediction of BLPs by combining sequence information with machine learning methods. In this review, we summarize and compare the applications of machine learning methods to the prediction of BLPs from different aspects. We hope that this review will provide insights and inspiration for research on BLPs.


Risk Analysis ◽  
2018 ◽  
Author(s):  
Patrick Murigu Kamau Njage ◽  
Clementine Henri ◽  
Pimlapas Leekitcharoenphon ◽  
Michel‐Yves Mistou ◽  
Rene S. Hendriksen ◽  
...  

2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Yu Wang ◽  
Mengru Sun ◽  
Yifan Duan

Human health status can be assessed through research on and analysis of the human microbiome. Acne is a common skin disease whose morbidity is increasing year by year. The lipids that strongly influence acne have been studied with metagenomic methods in recent years. In this paper, machine learning methods are used to analyze metagenomic sequencing data of acne, i.e., the various lipids in facial skin. First, lipid data from the diseased skin (DS) and healthy skin (HS) samples of acne patients and from the normal control (NC) samples of healthy people are each analyzed using principal component analysis (PCA) and kernel principal component analysis (KPCA), yielding the lipids that have the main influence on each kind of sample. In addition, multiset canonical correlation analysis (MCCA) is used to find lipids that can differentiate the facial skin of the three sample types. The experimental results show that the machine learning methods can effectively analyze metagenomic sequencing data of acne. According to the results, lipids that influence only one of the three sample types, or that influence all three to different degrees, can be used as indicators of skin status.
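The idea of using PCA to find the lipids with the main influence on a sample group can be sketched as follows. This is a simplified illustration, not the paper's procedure: it covers only linear PCA (KPCA and MCCA are out of scope), it extracts just the first principal component by power iteration, and the function names, lipid names, and abundance values are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit loading vector, explained-variance ratio) of the first PC.

    rows: list of samples, each a list of feature values (e.g. lipid levels).
    Uses power iteration on the sample covariance matrix. Note: the all-ones
    start vector would fail if it were exactly orthogonal to the first PC.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance left in this direction
        v = [x / norm for x in w]
    # Fix the sign so the largest-magnitude loading is positive.
    if max(v, key=abs) < 0:
        v = [-x for x in v]
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(a * b for a, b in zip(v, cov_v))
    total_variance = sum(cov[i][i] for i in range(d))
    ratio = eigenvalue / total_variance if total_variance > 0 else 0.0
    return v, ratio


def rank_features(names, loadings):
    """Order feature names by absolute loading, largest first."""
    return sorted(zip(names, loadings), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical abundances of three lipids in four samples of one group.
lipid_names = ["lipid_A", "lipid_B", "lipid_C"]
samples = [[1.0, 2.0, 0.5], [2.0, 4.1, 0.4], [3.0, 5.9, 0.6], [4.0, 8.0, 0.5]]
loadings, ratio = first_principal_component(samples)
for name, loading in rank_features(lipid_names, loadings):
    print(name, round(loading, 3))
```

Features with the largest absolute loadings contribute most to the dominant direction of variation, which is one simple reading of "main influence"; in practice the features would usually be standardized first so that scale differences do not dominate.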


Blood ◽  
2021 ◽  
Author(s):  
Hassan Awada ◽  
Arda Durmaz ◽  
Carmelo Gurnari ◽  
Ashwin Kishtagari ◽  
Manja Meggendorfer ◽  
...  

While genomic alterations drive the pathogenesis of acute myeloid leukemia (AML), traditional classifications are largely based on morphology, and prototypic genetic founder lesions define only a small proportion of AML patients. The historical subdivision into primary/de novo AML (pAML) and secondary AML (sAML) has been shown to correlate variably with genetic patterns. The combinatorial complexity and heterogeneity of AML genomic architecture may have so far precluded a genomic-based subclassification that identifies distinct, molecularly defined subtypes more reflective of shared pathogenesis. We integrated cytogenetic and gene sequencing data from a multicenter cohort of 6,788 AML patients and analyzed them using standard and machine learning methods to generate a novel AML molecular subclassification with biological correlates corresponding to underlying pathogenesis. Standard supervised analyses resulted in modest cross-validation accuracy when attempting to use molecular patterns to predict traditional pathomorphological AML classifications. We performed unsupervised analysis by applying a Bayesian latent class method, which identified 4 unique genomic clusters with distinct prognoses. Invariant genomic features driving each cluster were extracted and resulted in 97% cross-validation accuracy when used for genomic subclassification. Subclasses of AML defined by molecular signatures overlapped current pathomorphological and clinically defined AML subtypes. We internally and externally validated our results and share an open-access molecular classification scheme for AML patients. Although the heterogeneity inherent in the genomic changes across nearly 7,000 AML patients is too vast for traditional prediction methods, machine learning methods allowed for the definition of novel genomic AML subclasses, indicating that traditional pathomorphological definitions may be less reflective of overlapping pathogenesis.

