Classification based upon gene expression data: bias and precision of error rates
- PMID: 17392326
- DOI: 10.1093/bioinformatics/btm117
Abstract
Motivation: Gene expression data offer a large number of potentially useful predictors for the classification of tissue samples into classes, such as diseased and non-diseased. The predictive error rate of classifiers can be estimated using methods such as cross-validation. We have investigated issues of interpretation and potential bias in the reporting of error rate estimates. The issues considered here are optimization and selection biases, sampling effects, measures of misclassification rate, baseline error rates, two-level external cross-validation and a novel proposal for detection of bias using the permutation mean.
Results: Reporting an optimal estimated error rate incurs an optimization bias. Downward bias of 3-5% was found in an existing study of classification based on gene expression data and may be endemic in similar studies. Using a simulated non-informative dataset and two example datasets from existing studies, we show how bias can be detected through the use of label permutations and avoided using two-level external cross-validation. Some studies avoid optimization bias by using single-level cross-validation and a test set, but error rates can be more accurately estimated via two-level cross-validation. In addition to estimating the simple overall error rate, we recommend reporting class error rates and, where possible, the conditional risk incorporating prior class probabilities and a misclassification cost matrix. We also describe baseline error rates derived from three trivial classifiers which ignore the predictors.
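The reporting measures named in the Results can be illustrated with a small Python sketch. It is not the authors' code. The risk formula is the usual decision-theoretic expected cost, assumed here because the abstract does not state it, and the three baseline classifiers (always predict the majority class, guess uniformly, guess in proportion to class frequencies) are likewise an assumption about which trivial classifiers are meant. All function names and the toy labels are hypothetical.

```python
from collections import Counter

def class_error_rates(y_true, y_pred):
    """Per-class error rate: fraction of each true class that is misclassified."""
    totals, wrong = Counter(y_true), Counter()
    for t, p in zip(y_true, y_pred):
        if t != p:
            wrong[t] += 1
    return {c: wrong[c] / totals[c] for c in totals}

def conditional_risk(y_true, y_pred, priors, cost):
    """Assumed form: sum_i priors[i] * sum_j cost[i][j] * P(predict j | true i)."""
    totals = Counter(y_true)
    pairs = Counter(zip(y_true, y_pred))
    return sum(priors[i] * cost[i][j] * n / totals[i] for (i, j), n in pairs.items())

def baseline_error_rates(y_true):
    """Expected error of three predictor-free classifiers (assumed choices)."""
    n = len(y_true)
    props = [c / n for c in Counter(y_true).values()]
    return {
        "majority": 1 - max(props),                           # always predict the largest class
        "uniform_guess": 1 - 1 / len(props),                  # guess each class equally often
        "proportional_guess": 1 - sum(p * p for p in props),  # guess by class frequency
    }

# Toy labels: 6 diseased (D) and 14 non-diseased (N) samples.
y_true = ["D"] * 6 + ["N"] * 14
y_pred = ["N", "N", "D", "D", "D", "D"] + ["D"] + ["N"] * 13

overall = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
per_class = class_error_rates(y_true, y_pred)
# Hypothetical priors and costs: disease is rare, and missing it costs 5x a false alarm.
priors = {"D": 0.1, "N": 0.9}
cost = {"D": {"D": 0, "N": 5}, "N": {"D": 1, "N": 0}}
risk = conditional_risk(y_true, y_pred, priors, cost)
base = baseline_error_rates(y_true)
print(overall, per_class, risk, base)
```

In this toy case the overall error rate hides a much higher error rate in the smaller class, which is the kind of detail that class error rates and a cost-weighted risk are meant to expose.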
Availability: R code which implements two-level external cross-validation with the PAMR package, experiment code, dataset details and additional figures are freely available for non-commercial use from http://www.maths.qut.edu.au/profiles/wood/permr.jsp
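For readers without R, the idea of two-level external cross-validation and a label-permutation check can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PAMR-based code referred to above: the classifier is a simple nearest-centroid rule, the tuned parameter is the number of top-ranked features k, and the data are simulated. All function names and settings are hypothetical.

```python
import random

def centroids(X, y, feats):
    """Class centroids restricted to the feature indices in feats."""
    out = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        out[label] = [sum(r[j] for r in rows) / len(rows) for j in feats]
    return out

def top_features(X, y, k):
    """Rank features by the spread of class means, using the given data only."""
    cents = centroids(X, y, range(len(X[0])))
    score = lambda j: max(c[j] for c in cents.values()) - min(c[j] for c in cents.values())
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def fit_predict(Xtr, ytr, Xte, k):
    """Select k features and fit centroids on the training part, then predict."""
    feats = top_features(Xtr, ytr, k)
    cents = centroids(Xtr, ytr, feats)
    def nearest(x):
        return min(cents, key=lambda c: sum((x[j] - m) ** 2 for j, m in zip(feats, cents[c])))
    return [nearest(x) for x in Xte]

def cv_error(X, y, folds, rng, choose_k):
    """Cross-validated error; choose_k sees only the training part of each fold."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    wrong = 0
    for f in range(folds):
        test = set(idx[f::folds])
        Xtr = [X[i] for i in idx if i not in test]
        ytr = [y[i] for i in idx if i not in test]
        Xte = [X[i] for i in idx if i in test]
        yte = [y[i] for i in idx if i in test]
        pred = fit_predict(Xtr, ytr, Xte, choose_k(Xtr, ytr))
        wrong += sum(p != t for p, t in zip(pred, yte))
    return wrong / len(y)

def two_level_cv_error(X, y, ks, folds, rng):
    """Outer level estimates error; inner level tunes k inside each training part."""
    def choose_k(Xtr, ytr):
        return min(ks, key=lambda k: cv_error(Xtr, ytr, folds, rng, lambda *_: k))
    return cv_error(X, y, folds, rng, choose_k)

def permutation_mean(X, y, ks, folds, rng, n_perm):
    """Mean two-level error over random relabellings of the samples."""
    total = 0.0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)
        total += two_level_cv_error(X, y_perm, ks, folds, rng)
    return total / n_perm

# Simulated data (assumption): 40 samples, 20 features, only the first 3 informative.
rng = random.Random(0)
y = [i % 2 for i in range(40)]
X = [[rng.gauss(0, 1) + (3.0 if t == 1 and j < 3 else 0.0) for j in range(20)] for t in y]
ks = [1, 3, 10, 20]

err = two_level_cv_error(X, y, ks, 5, rng)
perm_mean = permutation_mean(X, y, ks, 5, rng, 10)
print("two-level CV error:", err)
print("permutation mean error:", perm_mean)
```

Because the labels carry no information after permutation, the permutation mean of a sound procedure should sit near the baseline error rate (about 50% for two balanced classes). A permutation mean well below that would suggest the estimation procedure itself is optimistic. Feature selection and tuning happen inside each training part, which is what keeps the outer estimate external.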
Similar articles
- Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics. 2006 Feb 23;7:91. doi: 10.1186/1471-2105-7-91. PMID: 16504092. Free PMC article.
- What should be expected from feature selection in small-sample settings. Bioinformatics. 2006 Oct 1;22(19):2430-6. doi: 10.1093/bioinformatics/btl407. Epub 2006 Jul 26. PMID: 16870934.
- The ties problem resulting from counting-based error estimators and its impact on gene selection algorithms. Bioinformatics. 2006 Oct 15;22(20):2507-15. doi: 10.1093/bioinformatics/btl438. Epub 2006 Aug 14. PMID: 16908500.
- How does gene expression clustering work? Nat Biotechnol. 2005 Dec;23(12):1499-501. doi: 10.1038/nbt1205-1499. PMID: 16333293. Review.
- Towards knowledge-based gene expression data mining. J Biomed Inform. 2007 Dec;40(6):787-802. doi: 10.1016/j.jbi.2007.06.005. Epub 2007 Jun 21. PMID: 17683991. Review.
Cited by
- Simpler evaluation of predictions and signature stability for gene expression data. J Biomed Biotechnol. 2009;2009:587405. doi: 10.1155/2009/587405. Epub 2010 Jan 10. PMID: 20111740. Free PMC article.
- Evaluating microarray-based classifiers: an overview. Cancer Inform. 2008;6:77-97. doi: 10.4137/cin.s408. Epub 2008 Feb 29. PMID: 19259405. Free PMC article.
- Challenges and standards in reporting diagnostic and prognostic biomarker studies. Clin Transl Sci. 2009 Apr;2(2):156-61. doi: 10.1111/j.1752-8062.2008.00075.x. PMID: 20443882. Free PMC article. Review.
- A decision-theoretic approach to the evaluation of machine learning algorithms in computational drug discovery. Bioinformatics. 2019 Nov 1;35(22):4656-4663. doi: 10.1093/bioinformatics/btz293. PMID: 31070704. Free PMC article.
- Classification of schizophrenia patients based on resting-state functional network connectivity. Front Neurosci. 2013 Jul 30;7:133. doi: 10.3389/fnins.2013.00133. eCollection 2013. PMID: 23966903. Free PMC article.