Instance-based concept learning from multiclass DNA microarray data

BMC Bioinformatics. 2006 Feb 16;7:73.

doi: 10.1186/1471-2105-7-73.

Daniel Berrar¹, Ian Bradbury, Werner Dubitzky

PMID: 16483361
PMCID: PMC1402330
DOI: 10.1186/1471-2105-7-73

Daniel Berrar et al. BMC Bioinformatics. 2006.

Affiliation

¹ School of Biomedical Sciences, University of Ulster at Coleraine, Cromore Road, Northern Ireland, UK. dp.berrar@ulster.ac.uk

Abstract

Background: Various statistical and machine learning methods have been successfully applied to the classification of DNA microarray data. Simple instance-based classifiers such as nearest neighbor (NN) approaches perform remarkably well in comparison to more complex models, and are currently experiencing a renaissance in the analysis of data sets from biology and biotechnology. While binary classification of microarray data has been extensively investigated, studies involving multiclass data are rare. The question remains open whether there exists a significant difference in performance between NN approaches and more complex multiclass methods. Comparative studies in this field commonly assess different models based on their classification accuracy only; however, this approach lacks the rigor needed to draw reliable conclusions and is inadequate for testing the null hypothesis of equal performance. Comparing novel classification models to existing approaches requires focusing on the significance of differences in performance.

Results: We investigated the performance of instance-based classifiers, including an NN classifier able to assign a degree of class membership to each sample. This model alleviates a major problem of conventional instance-based learners, namely the lack of confidence values for predictions. The model translates the distances to the nearest neighbors into 'confidence scores'; the higher the confidence score, the closer the instance under consideration is to a predefined class. We applied the models to three real gene expression data sets and compared them with state-of-the-art methods for classifying microarray data of multiple classes, assessing performance using a statistical significance test that took into account the data resampling strategy. Simple NN classifiers performed as well as, or significantly better than, their more intricate competitors.
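The abstract does not name the significance test it used. As an illustration only, one test that accounts for repeated random learning/test splits is the variance-corrected resampled t-test of Nadeau and Bengio; whether it matches the test applied in this study is an assumption. The sketch below computes the statistic from per-fold error rates. The function name, the error rates, and the set sizes are hypothetical.

```python
import math
from statistics import mean, variance


def corrected_resampled_t(errors_a, errors_b, n_train, n_test):
    """Variance-corrected resampled t statistic for two classifiers.

    errors_a, errors_b: per-fold error rates measured on the same k splits.
    n_train, n_test: number of cases in each learning and test set.
    Returns (t, degrees_of_freedom).
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both classifiers need one error rate per fold")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    k = len(diffs)
    if k < 2:
        raise ValueError("at least two folds are required")
    var = variance(diffs)  # sample variance (divides by k - 1)
    if var == 0:
        raise ValueError("constant differences: statistic is undefined")
    # The n_test / n_train term inflates the variance estimate to
    # compensate for the overlap between resampled learning sets.
    t = mean(diffs) / math.sqrt((1.0 / k + n_test / n_train) * var)
    return t, k - 1


# Hypothetical error rates from four random 70/30 splits.
t_stat, dof = corrected_resampled_t(
    [0.30, 0.25, 0.35, 0.20], [0.20, 0.20, 0.25, 0.20], n_train=70, n_test=30
)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The statistic is compared against a Student t distribution with k - 1 degrees of freedom; for ten folds, the two-sided 5% critical value is about 2.262.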

Conclusion: Given its highly intuitive underlying principles (simplicity, ease of use, and robustness), the k-NN classifier complemented by a suitable distance-weighting regime constitutes an excellent alternative to more complex models for multiclass microarray data sets. Instance-based classifiers using weighted distances are not limited to microarray data sets, but are likely to perform competitively in the classification of high-dimensional biological data sets such as those generated by high-throughput mass spectrometry.

Figures

**Figure 1**
**Prediction errors on the NCI60 data set**. The total numbers of misclassified cases across all ten folds are: 41 by distance-weighted k-NN, 41 by 1-NN, 54 by 3-NN, 31 by SVM, 55 by DT, and 57 by MLP.

**Figure 3**
**Prediction errors on the GCM data set**. The total numbers of misclassified cases across all ten folds are: 122 by distance-weighted k-NN, 120 by 1-NN, 136 by 3-NN, 136 by 5-NN, 115 by SVM, 168 by DT, and 215 by MLP.

**Figure 4**
**Sampling of learning and test set and selection of marker genes**. Depicted is one fold in the ten-fold resampling procedure. From the original data set comprising n cases and p genes, ~70% of the cases are randomly selected for the learning set L_i and ~30% of the cases for the test set T_i. On the learning set L_i with unpermuted class labels, the signal-to-noise weight for each gene and each class is computed as illustrated for class B. The class labels are then randomly permuted 1,000 times and the signal-to-noise weights (for each gene and each class) are recomputed for each permutation to assess the significance of the weights for the unpermuted learning set. Both the learning and the test set are filtered to contain only those genes that are significantly differentially expressed in the learning set.
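The gene-selection step in this caption can be sketched as follows. The caption does not give the formula for the signal-to-noise weight, so the code assumes the common one-versus-rest form (mean inside the class minus mean outside, divided by the sum of the two standard deviations) and a two-sided permutation test. All function names, the significance threshold `alpha`, and the toy data are hypothetical.

```python
import random
from statistics import mean, stdev


def s2n_weight(values, labels, target):
    """One-versus-rest signal-to-noise weight of one gene for class `target`.

    Assumed form: (mean_in - mean_out) / (sd_in + sd_out).
    Each group must contain at least two cases.
    """
    inside = [v for v, y in zip(values, labels) if y == target]
    outside = [v for v, y in zip(values, labels) if y != target]
    spread = stdev(inside) + stdev(outside)
    if spread == 0:
        return 0.0
    return (mean(inside) - mean(outside)) / spread


def permutation_p_value(values, labels, target, n_perm=1000, seed=0):
    """Share of label permutations with |weight| >= the observed |weight|."""
    rng = random.Random(seed)
    observed = abs(s2n_weight(values, labels, target))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(s2n_weight(values, shuffled, target)) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def select_genes(learning_set, labels, alpha=0.05, n_perm=1000, seed=0):
    """Indices of genes that are significant for at least one class.

    learning_set: list of cases, each a list of expression values.
    Only the learning set is used, so the test set stays untouched.
    """
    classes = sorted(set(labels))
    keep = []
    for g in range(len(learning_set[0])):
        column = [row[g] for row in learning_set]
        if any(
            permutation_p_value(column, labels, c, n_perm, seed) <= alpha
            for c in classes
        ):
            keep.append(g)
    return keep


def filter_genes(rows, keep):
    """Restrict every case (learning or test) to the selected genes."""
    return [[row[g] for g in keep] for row in rows]
```

The same index list returned by `select_genes` on the learning set would then be passed to `filter_genes` for both the learning and the test set, so that no information from the test cases influences the selection.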

**Figure 5**
**The distance-weighted k-NN classifier for a binary classification task**. The arrows indicate the three nearest neighbors of the test case. Here it is assumed that k_opt = 3.
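A minimal sketch of a distance-weighted k-NN classifier that returns a confidence score per class is given below. The exact weighting scheme of the paper is not stated on this page, so the code assumes inverse-distance weights normalized to sum to one; the function names, the `eps` guard, and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def knn_confidence(train_x, train_y, query, k=3, eps=1e-12):
    """Return {class: confidence} from the k nearest neighbours of `query`.

    Assumed weighting: each neighbour votes with weight 1 / distance,
    and the votes are normalized so that the confidences sum to 1.
    """
    neighbours = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )[:k]
    scores = defaultdict(float)
    for distance, label in neighbours:
        # eps avoids division by zero when the query equals a training case.
        scores[label] += 1.0 / (distance + eps)
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


def predict(train_x, train_y, query, k=3):
    """Predicted class and the full confidence dictionary."""
    confidence = knn_confidence(train_x, train_y, query, k)
    return max(confidence, key=confidence.get), confidence


# Hypothetical binary task with two features per case and k = 3.
train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
train_y = ["A", "A", "B", "B", "B"]
label, confidence = predict(train_x, train_y, (0.2, 0.1), k=3)
print(label, confidence)
```

Classes that have no case among the k nearest neighbours do not appear in the returned dictionary, which corresponds to a confidence of zero under this weighting.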

**Figure 2**
**Prediction errors on the ALL data set**. The total numbers of misclassified cases across all ten folds are: 247 by distance-weighted k-NN, 257 by 1-NN, 248 by 3-NN, 248 by 5-NN, 250 by SVM, 348 by DT, and 333 by MLP.

Cited by

ANMM4CBR: a case-based reasoning method for gene expression data classification.
Yao B, Li S. Algorithms Mol Biol. 2010 Jan 6;5:14. doi: 10.1186/1748-7188-5-14. PMID: 20051140. Free PMC article.
A hybrid BPSO-CGA approach for gene selection and classification of microarray data.
Chuang LY, Yang CH, Li JC, Yang CH. J Comput Biol. 2012 Jan;19(1):68-82. doi: 10.1089/cmb.2010.0064. Epub 2011 Jan 6. PMID: 21210743. Free PMC article.
Identifying Cancer Biomarkers From Microarray Data Using Feature Selection and Semisupervised Learning.
Chakraborty D, Maulik U. IEEE J Transl Eng Health Med. 2014 Dec 2;2:4300211. doi: 10.1109/JTEHM.2014.2375820. eCollection 2014. PMID: 27170887. Free PMC article.
TACOA: taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach.
Diaz NN, Krause L, Goesmann A, Niehaus K, Nattkemper TW. BMC Bioinformatics. 2009 Feb 11;10:56. doi: 10.1186/1471-2105-10-56. PMID: 19210774. Free PMC article.
Use of yeast chemigenomics and COXEN informatics in preclinical evaluation of anticancer agents.
Smith SC, Havaleshko DM, Moon K, Baras AS, Lee J, Bekiranov S, Burke DJ, Theodorescu D. Neoplasia. 2011 Jan;13(1):72-80. doi: 10.1593/neo.101214. PMID: 21253455. Free PMC article.
