Classification based upon gene expression data: bias and precision of error rates

doi:10.1093/bioinformatics/btm117

Review

Bioinformatics. 2007 Jun 1;23(11):1363-70.

doi: 10.1093/bioinformatics/btm117. Epub 2007 Mar 28.

Authors

Ian A Wood¹, Peter M Visscher, Kerrie L Mengersen

Affiliation

¹ School of Mathematical Sciences, Queensland University of Technology, Gardens Point, Brisbane, QLD, Australia. i.wood@qut.edu.au

PMID: 17392326

Abstract

Motivation: Gene expression data offer a large number of potentially useful predictors for the classification of tissue samples into classes, such as diseased and non-diseased. The predictive error rate of classifiers can be estimated using methods such as cross-validation. We have investigated issues of interpretation and potential bias in the reporting of error rate estimates. The issues considered here are optimization and selection biases, sampling effects, measures of misclassification rate, baseline error rates, two-level external cross-validation and a novel proposal for detection of bias using the permutation mean.
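The optimization bias and the two-level external cross-validation named above can be illustrated with a small simulation. The sketch below is not the authors' R/PAMR code: the nearest-centroid classifier, the tuning parameter `k` (number of selected features), the fold counts and all function names are assumptions chosen for illustration. It draws non-informative data, for which any classifier has a true error rate of 0.5, and compares the minimum single-level cross-validated error over `k` with a two-level estimate in which `k` is chosen by an inner cross-validation on each outer training set.

```python
import random

def fit(X, y, k):
    """Nearest-centroid model on the k features with the largest class-mean gap."""
    g0 = [x for x, c in zip(X, y) if c == 0]
    g1 = [x for x, c in zip(X, y) if c == 1]
    p = len(X[0])
    m0 = [sum(x[j] for x in g0) / len(g0) for j in range(p)]
    m1 = [sum(x[j] for x in g1) / len(g1) for j in range(p)]
    feats = sorted(range(p), key=lambda j: -abs(m0[j] - m1[j]))[:k]
    return feats, m0, m1

def predict(model, x):
    """Assign x to the class whose centroid is closer on the selected features."""
    feats, m0, m1 = model
    d0 = sum((x[j] - m0[j]) ** 2 for j in feats)
    d1 = sum((x[j] - m1[j]) ** 2 for j in feats)
    return 0 if d0 <= d1 else 1

def stratified_folds(y, n_folds, rng):
    """Deal the shuffled indices of each class round-robin into n_folds folds."""
    folds = [[] for _ in range(n_folds)]
    pos = 0
    for c in (0, 1):
        idx = [i for i, label in enumerate(y) if label == c]
        rng.shuffle(idx)
        for i in idx:
            folds[pos % n_folds].append(i)
            pos += 1
    return folds

def split_errors(X, y, test, k_chooser):
    """Train on everything outside `test`; return the number of test errors."""
    held = set(test)
    Xtr = [X[i] for i in range(len(X)) if i not in held]
    ytr = [y[i] for i in range(len(X)) if i not in held]
    model = fit(Xtr, ytr, k_chooser(Xtr, ytr))
    return sum(predict(model, X[i]) != y[i] for i in test)

def cv_error(X, y, k, n_folds, rng):
    """Single-level external CV: feature selection is redone inside each fold."""
    folds = stratified_folds(y, n_folds, rng)
    return sum(split_errors(X, y, t, lambda *_: k) for t in folds) / len(X)

def two_level_cv_error(X, y, ks, n_outer, n_inner, rng):
    """Outer folds estimate error; inner CV on each outer training set picks k."""
    def choose(Xtr, ytr):
        return min(ks, key=lambda k: cv_error(Xtr, ytr, k, n_inner, rng))
    folds = stratified_folds(y, n_outer, rng)
    return sum(split_errors(X, y, t, choose) for t in folds) / len(X)

def simulate(n_reps=200, n=24, p=20, ks=(1, 2, 5, 10), seed=0):
    """Average both estimates over non-informative datasets (true error 0.5)."""
    rng = random.Random(seed)
    y = [0, 1] * (n // 2)  # balanced labels, unrelated to the features
    optimized, nested = 0.0, 0.0
    for _ in range(n_reps):
        X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
        optimized += min(cv_error(X, y, k, 4, rng) for k in ks)
        nested += two_level_cv_error(X, y, ks, 4, 3, rng)
    return optimized / n_reps, nested / n_reps

if __name__ == "__main__":
    opt, nest = simulate()
    print(f"optimized single-level CV: {opt:.3f}, two-level CV: {nest:.3f}")
```

Because the features carry no class information, the averaged minimum over `k` should fall below 0.5 while the two-level estimate should stay close to it. The size of the gap depends on the assumed sample size, number of features and candidate values of `k`, so it should not be read as a reproduction of the 3-5% figure reported in the study.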

Results: Reporting an optimal estimated error rate incurs an optimization bias. A downward bias of 3-5% was found in an existing study of classification based on gene expression data and may be endemic in similar studies. Using a simulated non-informative dataset and two example datasets from existing studies, we show how bias can be detected through the use of label permutations and avoided using two-level external cross-validation. Some studies avoid optimization bias by using single-level cross-validation and a test set, but error rates can be more accurately estimated via two-level cross-validation. In addition to estimating the simple overall error rate, we recommend reporting class error rates plus, where possible, the conditional risk incorporating prior class probabilities and a misclassification cost matrix. We also describe baseline error rates derived from three trivial classifiers that ignore the predictors.
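The reporting recommendations can be made concrete with a short sketch. The abstract does not give formulas, so the following uses one common formulation of conditional risk (prior-weighted expected cost computed from class-conditional prediction rates). The abstract also does not name the three trivial classifiers; the three used here (always predict the most probable class, guess uniformly at random, guess in proportion to the class priors) are an assumption. All function names are hypothetical.

```python
def class_error_rates(confusion):
    """Per-class error rates; confusion[i][j] counts true class i predicted as j."""
    return [1 - row[i] / sum(row) for i, row in enumerate(confusion)]

def overall_error(confusion):
    """Simple overall error rate: misclassified samples over all samples."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return 1 - correct / total

def conditional_risk(confusion, priors, costs):
    """Expected cost: sum_i prior_i * sum_j cost[i][j] * P(predict j | true i).

    One common formulation (an assumption here); costs[i][j] is the cost of
    predicting class j when the true class is i.
    """
    risk = 0.0
    for i, row in enumerate(confusion):
        n_i = sum(row)
        risk += priors[i] * sum(costs[i][j] * row[j] / n_i for j in range(len(row)))
    return risk

def baseline_error_rates(priors):
    """Error rates of three predictor-free classifiers (assumed set, see lead-in)."""
    k = len(priors)
    return {
        "majority": 1 - max(priors),                      # always predict the likeliest class
        "uniform": 1 - 1 / k,                             # guess each class with probability 1/k
        "proportional": 1 - sum(p * p for p in priors),   # guess class i with probability prior_i
    }

if __name__ == "__main__":
    confusion = [[40, 10], [5, 45]]   # illustrative counts, not data from the study
    priors = [0.9, 0.1]               # assumed population class probabilities
    costs = [[0, 1], [5, 0]]          # assumed: missing class 1 costs five times more
    print(class_error_rates(confusion))
    print(overall_error(confusion))
    print(conditional_risk(confusion, priors, costs))
    print(baseline_error_rates(priors))
```

With these illustrative numbers the overall error rate is 0.15 and the class error rates are 0.2 and 0.1, while the conditional risk is 0.9 × 0.2 + 0.1 × 5 × 0.1 = 0.23. Under the assumed priors the majority-class baseline already achieves an error rate of 0.1, which shows why an overall error rate is hard to interpret without class-level rates and a baseline for comparison.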

Availability: R code which implements two-level external cross-validation with the PAMR package, experiment code, dataset details and additional figures are freely available for non-commercial use from http://www.maths.qut.edu.au/profiles/wood/permr.jsp


