Entropy (Basel). 2020 Oct 25;22(11):1205.
doi: 10.3390/e22111205.

Statistical Approach for Biologically Relevant Gene Selection from High-Throughput Gene Expression Data

Samarendra Das et al. Entropy (Basel).

Abstract

Selection of biologically relevant genes from high-dimensional expression data is a key research problem in gene expression genomics. Most available gene selection methods are based on either a relevance or a redundancy measure, and are usually judged by post-selection classification accuracy. In these methods, genes are ranked on a single high-dimensional expression dataset, which leads to the selection of spuriously associated and redundant genes. Hence, we developed a statistical approach that combines a support vector machine (SVM) with Maximum Relevance and Minimum Redundancy (MRMR) under a sound statistical setup for the selection of biologically relevant genes. Here, genes were selected on the basis of statistical significance values computed using a nonparametric test statistic under a bootstrap-based subject sampling model. Further, a systematic and rigorous evaluation of the proposed approach against nine existing competitive methods was carried out on six different real crop gene expression datasets. This performance analysis was carried out under three comparison settings, i.e., subject classification, a biological relevance criterion based on quantitative trait loci, and one based on gene ontology. Our analytical results showed that the proposed approach selects genes that are more biologically relevant than those selected by the existing methods. Moreover, the proposed approach was also found to perform better than the competing existing methods. The proposed statistical approach provides a framework for combining filter and wrapper methods of gene selection.

Keywords: MRMR; SVM; biological relevance; bootstrap; gene expression; subject classification.
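
The selection step described in the abstract (resample subjects, weight and rank genes, then test each gene's rank scores) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the test statistic, so a one-sided sign test is assumed here, and the SVM-MRMR weights are replaced by a simple stand-in (absolute difference of class means). The function names, the `cutoff` parameter, and the defaults are all hypothetical.

```python
import math
import random
from statistics import mean


def gene_scores(X, y):
    """Stand-in gene weight (NOT SVM-MRMR): |mean(class 1) - mean(class 0)|."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    n_genes = len(X[0])
    return [abs(mean(r[g] for r in pos) - mean(r[g] for r in neg))
            for g in range(n_genes)]


def rank_scores(weights):
    """Map weights to rank scores in (0, 1]; the top-weighted gene gets 1."""
    n = len(weights)
    order = sorted(range(n), key=lambda g: weights[g])  # ascending by weight
    scores = [0.0] * n
    for position, g in enumerate(order, start=1):
        scores[g] = position / n
    return scores


def sign_test_p(successes, trials):
    """One-sided binomial tail P(X >= successes) under p = 0.5."""
    tail = sum(math.comb(trials, k) for k in range(successes, trials + 1))
    return tail / 2 ** trials


def bootstrap_select(X, y, n_boot=200, cutoff=0.5, alpha=0.05, seed=0):
    """Resample subjects, rank genes per resample, test each gene's rank scores.

    X is a list of subjects (rows), each a list of gene expression values;
    y holds the 0/1 class label of each subject.
    """
    rng = random.Random(seed)
    n_subjects, n_genes = len(X), len(X[0])
    wins = [0] * n_genes  # resamples in which a gene's rank score exceeds cutoff
    done = 0
    while done < n_boot:
        idx = [rng.randrange(n_subjects) for _ in range(n_subjects)]
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # a resample must contain both classes
        Xb = [X[i] for i in idx]
        r = rank_scores(gene_scores(Xb, yb))
        for g in range(n_genes):
            if r[g] > cutoff:
                wins[g] += 1
        done += 1
    pvals = [sign_test_p(w, n_boot) for w in wins]
    selected = [g for g, p in enumerate(pvals) if p < alpha]
    return pvals, selected
```

Calling `bootstrap_select(X, y)` returns one p-value per gene and the indices of genes with p-value below `alpha`. A real analysis would also adjust the p-values for multiple testing, as the adjusted p-values in Figure 2 suggest.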

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Operational procedure for data integration and the use of the proposed BSM approach. (A) Outline of the data integration used in this study for applying the BSM approach. The first step indicates the integration and meta-analysis of GE datasets obtained from various GE studies. Gene selection methods are then applied to the meta GE data. (B) Flowchart depicting the implemented algorithm of the BSM approach. Wi(S)’s and Wi(M)’s are the N-dimensional vectors of weights computed through the SVM and MRMR approaches, respectively. Gi’s and Ri’s are the N-dimensional vectors of gene lists and corresponding gene rank scores. SVM and MRMR stand for the support vector machine and Maximum Relevance and Minimum Redundancy algorithms, respectively. pi-value is the statistical significance value for the ith gene. α is the desired level of statistical significance.
Figure 2
Graphical analysis of the proposed BSM approach with SVM-MRMR approach for abiotic stress datasets. Distribution of gene weights computed from SVM-MRMR approach for the abiotic stresses. The distributions of gene weights from the SVM-MRMR are shown for (A) salinity; (B) cold; and (C) drought stress datasets in rice. Distribution of adjusted p-values computed from the proposed BSM approach for the abiotic stresses. The distributions of the adjusted p-values are shown for (A1) salinity; (B1) cold; and (C1) drought stress datasets.
Figure 3
Classification-based comparative performance analysis of gene selection methods through SVM-LBF and SVM-PBF classifiers for abiotic stress datasets. The horizontal axis represents the gene selection methods. The vertical axis represents post selection classification accuracy obtained by using varying sliding window size technique. The classification accuracies over the window sizes are presented as boxes. The bars on the boxes represent the standard errors. The distributions of classification accuracies are shown for cold stress with SVM-LBF (A1), and SVM-PBF (A2) classifiers. The distributions of classification accuracies are shown for salinity stress with SVM-LBF (B1) and SVM-PBF (B2) classifiers. The distributions of classification accuracies are shown for drought stress with SVM-LBF (C1) and SVM-PBF (C2) classifiers.
Figure 4
Classification-based comparative performance analysis of gene selection methods through SVM-RBF and SVM-SBF classifiers for abiotic stress datasets. The horizontal axis represents the gene selection methods. The vertical axis represents post selection classification accuracy obtained by using varying sliding window size technique. The classification accuracies over the window sizes are presented as boxes. The distributions of classification accuracies are shown for cold stress with SVM-RBF (A1) and SVM-SBF (A2) classifiers. The distributions of classification accuracies are shown for salinity stress with SVM-RBF (B1) and SVM-SBF (B2) classifiers. The distributions of classification accuracies are shown for drought stress with SVM-RBF (C1) and SVM-SBF (C2) classifiers.
Figure 5
Comparative performance analysis of gene selection methods through the distribution of the Qstat statistic. The horizontal axis represents the informative gene sets obtained through gene selection methods. The vertical axis represents the value of the Qstat statistic. The distributions of the Qstat statistic are shown for (A) salinity; (B) cold; (C) drought; (D) bacterial; (E) fungal; and (F) insect stress datasets in rice. The lines in different colors represent different gene selection methods.
Figure 6
Comparative performance analysis of gene selection methods through the distribution of p-values from the QTL-hypergeometric test. The horizontal axis represents the gene sets obtained through gene selection methods. The vertical axis represents the value of −log10(p-value) from the QTL-hypergeometric test. The distributions of −log10(p-value) are shown for (A) salinity; (B) cold; (C) drought; (D) bacterial; (E) fungal; and (F) insect stress datasets in rice. The lines in different colors represent different gene selection methods.
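
The −log10(p-value) plotted in Figure 6 can be illustrated with a standard over-representation test. The sketch below assumes the QTL-hypergeometric test is the usual upper-tail hypergeometric probability of finding at least k QTL-located genes among n selected genes. This page does not state the exact formulation, so the function names and the parameter names (N, K, n, k) are hypothetical.

```python
import math


def hypergeom_upper_tail(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric.

    N: total genes, K: genes located in QTL regions,
    n: genes selected, k: selected genes that fall in QTL regions.
    """
    upper = min(K, n)  # the overlap cannot exceed either set size
    total = math.comb(N, n)
    tail = sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, upper + 1))
    return tail / total


def neg_log10(p):
    """Transform a p-value to the -log10 scale used on the vertical axis."""
    return math.inf if p == 0 else -math.log10(p)


# Example: 10 genes in total, 4 in QTL regions, 3 selected, 2 of them in QTLs.
p = hypergeom_upper_tail(N=10, K=4, n=3, k=2)  # equals 40/120 = 1/3
score = neg_log10(p)
```

A larger −log10(p-value) means the selected gene set overlaps QTL regions more than would be expected by chance.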
