Incorporating Pathway Information into Feature Selection towards Better Performed Gene Signatures

To analyze gene expression data with sophisticated grouping structures and to extract hidden patterns from such data, feature selection is of critical importance. It is well known that genes do not function in isolation but rather work together within various metabolic, regulatory, and signaling pathways. A method that takes the biological knowledge contained within these pathways into account is a pathway-based algorithm. Studies have demonstrated that a pathway-based method usually outperforms its gene-based counterpart, in which no biological knowledge is considered. In this article, pathway-based feature selection is first divided into three major categories, namely, pathway-level selection, bilevel selection, and pathway-guided gene selection. Regarding bilevel selection methods as a special case of pathway-guided gene selection, we discuss pathway-guided gene selection methods in detail, along with the importance of penalization in such methods. Last, we point out potential uses of pathway-guided gene selection in one active research avenue, namely, the analysis of longitudinal gene expression data. We believe this article provides valuable insights for computational biologists and biostatisticians so that they can make biology more computable.
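
As a rough illustration of how penalization can be guided by pathway information, the sketch below applies lasso-style soft-thresholding with a per-gene penalty weight that shrinks as the gene appears in more pathways. This is only a minimal sketch under stated assumptions: the weighting rule `1 / (1 + pathway_count)`, the function names, and the toy coefficients are all hypothetical and are not taken from the article.

```python
def soft_threshold(value, threshold):
    """Shrink a coefficient toward zero; set it to zero if it is within the threshold."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def pathway_weighted_shrinkage(coefs, pathway_counts, lam):
    """Apply an L1-style penalty whose strength depends on pathway membership.

    coefs: dict mapping gene -> unpenalized coefficient (toy values here)
    pathway_counts: dict mapping gene -> number of pathways containing it
    lam: global penalty strength
    The weight 1 / (1 + count) is an assumed, illustrative choice.
    """
    shrunk = {}
    for gene, beta in coefs.items():
        weight = 1.0 / (1 + pathway_counts.get(gene, 0))
        shrunk[gene] = soft_threshold(beta, lam * weight)
    return shrunk


# Toy example: gene_a and gene_b have the same coefficient,
# but only gene_a is annotated to pathways.
coefs = {"gene_a": 0.5, "gene_b": 0.5, "gene_c": -0.25}
pathway_counts = {"gene_a": 3}
selected = pathway_weighted_shrinkage(coefs, pathway_counts, lam=0.5)
kept_genes = [g for g, b in selected.items() if b != 0.0]
```

In this toy run the pathway-annotated gene survives the penalty while the unannotated genes are shrunk to zero, which is the qualitative behavior a pathway-guided penalty aims for.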

Application of Biological Domain Knowledge Based Feature Selection on Gene Expression Data

10.20944/preprints202012.0377.v1 ◽

2020 ◽

Cited By ~ 1

Author(s):

Malik Yousef ◽

Abhishek Kumar ◽

Burcu Bakir-Gungor

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Data Analysis ◽

Gene Expression Data ◽

Gene Selection ◽

Selection Process ◽

Biological Data ◽

Integrative Approach ◽

Biological Knowledge ◽

Expression Data

In the last two decades, massive advancements in high-throughput technologies have resulted in the exponential growth of public repositories of gene expression datasets for various phenotypes. It is possible to unravel biomarkers by comparing gene expression levels under different conditions, such as disease vs. control, treated vs. not treated, drug A vs. drug B, etc. This task corresponds to a well-studied problem in the machine learning domain, i.e., the feature selection problem. In biological data analysis, most computational feature selection methodologies were taken from other fields without considering the nature of the biological data. For gene expression data analysis, most existing feature selection methods rely on expression values alone to select genes, and biological knowledge is integrated only at the end of the analysis in order to gain biological insights or to support the initial findings. Thus, integrative approaches that utilize biological knowledge while performing feature selection are necessary for this kind of data. The main idea behind the integrative gene selection process is to generate a ranked list of genes considering both the statistical metrics applied to the gene expression data and the biological background information provided as external datasets. Since the integrative approach has attracted attention in the gene expression domain, the gene selection process has lately shifted from being purely data-centric to a more incorporative analysis with additional biological knowledge.
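
The idea of ranking genes by both a statistical metric and external biological information can be sketched as follows. This is a minimal sketch, not the authors' method: the min-max rescaling, the linear weighting, the function name `integrative_rank`, and the toy scores are all assumptions made for illustration.

```python
def rescale(scores):
    """Min-max rescale a dict of scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {g: (s - lo) / span if span else 0.0 for g, s in scores.items()}


def integrative_rank(stat_scores, prior_scores, weight=0.5):
    """Rank genes by a weighted mix of a data-driven score and a knowledge-based score.

    stat_scores: gene -> statistic computed from expression data (e.g. |t|)
    prior_scores: gene -> score from an external source (e.g. pathway count)
    weight: share given to the knowledge-based score (assumed parameter)
    """
    stat = rescale(stat_scores)
    prior = rescale(prior_scores)
    combined = {
        g: (1 - weight) * stat[g] + weight * prior.get(g, 0.0)
        for g in stat
    }
    return sorted(combined, key=combined.get, reverse=True)


# Toy values, invented for illustration only.
stat_scores = {"gene_a": 2.0, "gene_b": 3.0, "gene_c": 3.5, "gene_d": 1.0}
prior_scores = {"gene_a": 3, "gene_b": 2, "gene_c": 0, "gene_d": 0}

data_only = integrative_rank(stat_scores, prior_scores, weight=0.0)
integrated = integrative_rank(stat_scores, prior_scores, weight=0.5)
```

With `weight=0.0` the ranking is purely data-centric; raising the weight lets genes with stronger external support move up the list.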

A COMPARATIVE STUDY ON GENE SELECTION METHODS FOR TISSUES CLASSIFICATION ON LARGE SCALE GENE EXPRESSION DATA

Jurnal Teknologi ◽

10.11113/jt.v78.8843 ◽

2016 ◽

Vol 78 (5-10) ◽

Author(s):

Farzana Kabir Ahmad

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Large Scale ◽

Gene Selection ◽

Support Vector ◽

Breast Cancer Dataset ◽

Expression Data ◽

Selection Methods ◽

Normal Tissues

Deoxyribonucleic acid (DNA) microarray technology is a recent invention that provides enormous opportunities to measure the expression of a large number of genes simultaneously. However, interpreting large-scale gene expression data remains challenging because of its inherent “high dimension, low sample size” nature. Microarray data typically involve thousands of genes (p) measured in a very small number of samples (n), which complicates the data analysis. For this reason, feature selection methods, also known as gene selection methods, are needed to select significant genes that offer the maximum discriminative power between cancerous and normal tissues. Feature selection methods can be grouped into three basic categories: a) filter methods, b) wrapper methods, and c) embedded methods. Among these, filter gene selection methods provide an easy way to identify informative genes and can reduce large-scale microarray datasets. Although filter-based gene selection techniques have been commonly used in analyzing microarray datasets, they have been tested separately in different studies. Therefore, this study aims to investigate and compare the effectiveness of four popular filter gene selection methods, namely Signal-to-Noise Ratio (SNR), Fisher Criterion (FC), Information Gain (IG), and t-test, in selecting informative genes that can distinguish cancer from normal tissues. In this experiment, a common classifier, the Support Vector Machine (SVM), is trained on the selected genes. The gene selection methods are tested on three large-scale gene expression datasets, namely a breast cancer dataset, a colon dataset, and a lung dataset. The study found that IG and SNR are more suitable for use with SVM. Furthermore, SVM performance remained relatively unaffected unless a very small number of genes was selected.
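
To make the filter idea concrete, the sketch below scores each gene with two of the criteria named above and keeps the top-ranked genes. It assumes the commonly used textbook definitions, SNR = (mean1 - mean2) / (sd1 + sd2) and Fisher criterion = (mean1 - mean2)^2 / (sd1^2 + sd2^2); the study's exact formulas may differ, and the function names and toy data are hypothetical.

```python
from statistics import mean, stdev


def snr_score(pos, neg):
    """Signal-to-noise ratio: difference of class means over sum of class SDs."""
    denom = stdev(pos) + stdev(neg)
    return (mean(pos) - mean(neg)) / denom if denom else 0.0


def fisher_score(pos, neg):
    """Fisher criterion: squared mean difference over sum of class variances."""
    denom = stdev(pos) ** 2 + stdev(neg) ** 2
    return (mean(pos) - mean(neg)) ** 2 / denom if denom else 0.0


def rank_genes(expression, labels, score_fn, top_k):
    """Return the top_k genes by absolute filter score.

    expression: dict mapping gene -> list of expression values, one per sample
    labels: list of 0/1 class labels aligned with the samples
    """
    scores = {}
    for gene, values in expression.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = abs(score_fn(pos, neg))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data, invented for illustration: three tumor (1) and three normal (0) samples.
labels = [1, 1, 1, 0, 0, 0]
expression = {
    "gene_a": [5.0, 5.2, 4.8, 1.0, 1.2, 0.8],  # clearly separated classes
    "gene_b": [2.0, 3.0, 1.0, 2.1, 2.9, 1.1],  # no real separation
    "gene_c": [4.0, 3.0, 5.0, 2.0, 1.0, 3.0],  # separated but noisy
}
top_snr = rank_genes(expression, labels, snr_score, top_k=2)
top_fisher = rank_genes(expression, labels, fisher_score, top_k=2)
```

The selected genes would then be passed to a classifier such as an SVM; that step is omitted here to keep the sketch dependency-free.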

Application of Biological Domain Knowledge Based Feature Selection on Gene Expression Data

Entropy ◽

10.3390/e23010002 ◽

2020 ◽

Vol 23 (1) ◽

pp. 2

Author(s):

Malik Yousef ◽

Abhishek Kumar ◽

Burcu Bakir-Gungor

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Data Analysis ◽

Gene Expression Data ◽

Gene Selection ◽

Biological Data ◽

Biological Information ◽

Background Information ◽

Biological Knowledge ◽

Expression Data

In the last two decades, massive advancements in high-throughput technologies have resulted in the exponential growth of public repositories of gene expression datasets for various phenotypes. It is possible to unravel biomarkers by comparing gene expression levels under different conditions, such as disease vs. control, treated vs. not treated, drug A vs. drug B, etc. This task corresponds to a well-studied problem in the machine learning domain, i.e., the feature selection problem. In biological data analysis, most computational feature selection methodologies were taken from other fields without considering the nature of the biological data. Thus, integrative approaches that utilize biological knowledge while performing feature selection are necessary for this kind of data. The main idea behind the integrative gene selection process is to generate a ranked list of genes considering both the statistical metrics applied to the gene expression data and the biological background information provided as external datasets. One of the main goals of this review is to explore the existing methods that integrate different types of information in order to improve the identification of the biomolecular signatures of diseases and the discovery of new potential targets for treatment. These integrative approaches are expected to aid the prediction, diagnosis, and treatment of diseases, as well as to shed light on disease state dynamics and the mechanisms of disease onset and progression. The integration of various types of biological information will necessitate the development of novel techniques for integration and data analysis. Another aim of this review is to encourage the bioinformatics community to develop new approaches for searching for and determining significant groups/clusters of features based on one or more biological grouping functions.

Inference of Genetic Networks From Time-Series and Static Gene Expression Data: Combining a Random-Forest-Based Inference Method With Feature Selection Methods

Frontiers in Genetics ◽

10.3389/fgene.2020.595912 ◽

2020 ◽

Vol 11 ◽

Author(s):

Shuhei Kimura ◽

Ryo Fukutomi ◽

Masato Tokuhisa ◽

Mariko Okada

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Random Forest ◽

Gene Expression Data ◽

Computational Cost ◽

Expression Data ◽

Selection Methods ◽

Inference Method ◽

Combined Application ◽

Inference Methods

Several researchers have focused on random-forest-based inference methods because of their excellent performance. Some of these inference methods also have a useful ability to analyze both time-series and static gene expression data. However, they are only of use in ranking all of the candidate regulations by assigning them confidence values. None have been capable of detecting the regulations that actually affect a gene of interest. In this study, we propose a method to remove unpromising candidate regulations by combining the random-forest-based inference method with a series of feature selection methods. In addition to detecting unpromising regulations, our proposed method uses outputs from the feature selection methods to adjust the confidence values of all of the candidate regulations that have been computed by the random-forest-based inference method. Numerical experiments showed that the combined application with the feature selection methods improved the performance of the random-forest-based inference method on 99 of the 100 trials performed on the artificial problems. However, the improvement tends to be small, since our combined method succeeded in removing only 19% of the candidate regulations at most. The combined application with the feature selection methods moreover makes the computational cost higher. While a bigger improvement at a lower computational cost would be ideal, we see no impediments to our investigation, given that our aim is to extract as much useful information as possible from a limited amount of gene expression data.

A Survey on Hybrid Feature Selection Methods in Microarray Gene Expression Data for Cancer Classification

IEEE Access ◽

10.1109/access.2019.2922987 ◽

2019 ◽

Vol 7 ◽

pp. 78533-78548 ◽

Cited By ~ 21

Author(s):

Nada Almugren ◽

Hala Alshamlan

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Microarray Gene Expression Data ◽

Cancer Classification ◽

Expression Data ◽

Selection Methods ◽

Microarray Gene Expression ◽

Microarray Gene

CLASSIFYING TEMPORAL MICROARRAY DATA BY SELECTING INFORMATIVE GENES

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720013410060 ◽

2013 ◽

Vol 11 (03) ◽

pp. 1341006

Author(s):

QIANG LOU ◽

ZORAN OBRADOVIC

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Microarray Data ◽

Data Sets ◽

Temporal Data ◽

Expression Data ◽

Selection Methods ◽

Temporal Gene Expression ◽

Single Matrix

In order to more accurately predict an individual's health status, in clinical applications it is often important to perform analysis of high-dimensional gene expression data that varies with time. A major challenge in predicting from such temporal microarray data is that the number of biomarkers used as features is typically much larger than the number of labeled subjects. One way to address this challenge is to perform feature selection as a preprocessing step and then apply a classification method on selected features. However, traditional feature selection methods cannot handle multivariate temporal data without applying techniques that flatten temporal data into a single matrix in advance. In this study, a feature selection filter that can directly select informative features from temporal gene expression data is proposed. In our approach, we measure the distance between multivariate temporal data from two subjects. Based on this distance, we define the objective function of temporal margin based feature selection to maximize each subject's temporal margin in its own relevant subspace. The experimental results on synthetic and two real flu data sets provide evidence that our method outperforms the alternatives, which flatten the temporal data in advance.

Hybrid feature selection methods for the Classification of Cancer in Micro-array Gene expression data: a Survey

International Journal of Advanced Trends in Computer Science and Engineering ◽

10.30534/ijatcse/2020/275952020 ◽

2020 ◽

Vol 9 (5) ◽

pp. 8819-8827

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Expression Data ◽

Selection Methods ◽

Micro Array

Methods for Gene Selection and Classification of Microarray Dataset

Handbook of Research on Biomimicry in Information Retrieval and Knowledge Management - Advances in Web Technologies and Engineering ◽

10.4018/978-1-5225-3004-6.ch004 ◽

2018 ◽

pp. 66-77

Author(s):

Mekour Norreddine

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Expression Data ◽

Gene Selection ◽

Information Gain ◽

Microarray Dataset ◽

Data Sets ◽

Expression Data ◽

Wrapper Approach ◽

Filter Approach

Feature selection is one of the problems that arise in the analysis of gene expression data. It is an important process for choosing which features matter for prediction, and there are two general approaches to it: the filter approach and the wrapper approach. In this chapter, the authors combine a filter approach, which ranks features by information gain, with a wrapper approach that uses a genetic algorithm as its search method. The authors evaluate their approach on two gene expression data sets: Leukemia and Central Nervous System. The C4.5 decision tree classifier is used to improve the classification performance.
