Abstract
As the complexity and heterogeneity of cancer is being increasingly appreciated through genomic analyses, microarray-based cancer classification comprising multiple discriminatory molecular markers is an emerging trend. Such multiclass classification problems pose new methodological and computational challenges for developing novel and effective statistical approaches. In this paper, we introduce a new approach for classifying multiple disease states associated with cancer based on gene expression profiles. Our method focuses on detecting small sets of genes in which the relative comparison of their expression values leads to class discrimination. For an m-class problem, the classification rule typically depends on a small number of m-gene sets, which provide transparent decision boundaries and allow for potential biological interpretations. We first test our approach on seven common gene expression datasets and compare it with popular classification methods including support vector machines and random forests. We then consider an extremely large cohort of leukemia cancer patients to further assess its effectiveness. In both experiments, our method yields comparable or even better results to benchmark classifiers. In addition, we demonstrate that our approach can integrate pathway analysis of gene expression to provide accurate and biological meaningful classification.
Acknowledgments
The authors would like to thank Donald Geman and the reviewers for their insightful comments and suggestions. This work was partially supported by NIH-NCRR Grant UL1 RR 025005.
Appendix
A Bayesian decision-theoretic interpretation
In this section, we provide an interpretation for the scoring equation of the TSS classifier [see (3) in Methods] using the Bayesian decision theory, and derive the optimal Bayesian classifier with a general loss function and class priors. Consider an m-class classification problem, TSS aims to find m-gene set 𝒮={i1, i2, …, im} such that
Now suppose the class conditional probability distribution associated with gene expression comparisons in 𝒮 is given by
Class | ||||
y=1 | y=2 | … | y=m | |
Xi1>max{Xr, r∈𝒮\i1} | p11 | p12 | … | p1m |
Xi2>max{Xr, r∈𝒮\i2} | p21 | p22 | … | p2m |
…… | …… | |||
Xim>max{Xr, r∈𝒮\im} | pm1 | pm2 | … | pmm |
A decision procedure δ can be constructed where the result of each comparison in the table above is considered being indicative of a sample from a distinct class. In this situation, m classes lead to m! possible decision procedures for a given gene set where one of those decision procedures is illustrated below.
X | δ(X) |
Xi1>max{Xr, r∈𝒮\i1} | 3 |
Xi2>max{Xr, r∈𝒮\i2} | 5 |
…… | …… |
Xim>max{Xr, r∈𝒮\im} | 2 |
Next, a loss function can be introduced for δ by specifying the penalties for misclassification as follows
δ=1 | δ=2 | … | δ=m | |
y=1 | l11 | l12 | … | l1m |
y=2 | l21 | l22 | … | l2m |
… | …… | |||
y=m | lm1 | lm2 | … | lmm |
Based on the tables above, R(i, δ), the risk function of δ for class y=i can be written as
Consequently, r(δ), the Bayes risk of δ is given by
where πi is the prior probability for class y=i. Therefore, the Bayes risk associated with δ is given by
and the decision rule δ* satisfying
is the optimal rule referrred to as the Bayes Rule. As mentioned earlier, for a given gene set, there are m! possible decision procedures, and since the number of possible gene sets is also finite, the optimal Bayes rule δ* for the problem can be found by searching for the rule that minimizes r(δ) among all gene sets.
It is important to note that the Bayesian optimality of the decision rule described above only applies when the gene set used for classification has been determined. Otherwise, the development of the Bayes rule requires the joint probability distribution of all genes. In fact, no Bayesian theory is directly related to the choice of gene set for classification.
Equation (A-1) uses the general form of loss function and class prior probabilities. In practice, the choice of loss function and class priors depends on the problem. For example, the empirical estimation nc/N can be considered, where nc is the sample size of class c and N is the total sample size. In the context of this paper, the sample size of microarray datasets is quite limited and the sample proportions in a particular dataset may not reflect the actual distribution in the population. Therefore, we used equal class priors in our approach. In addition, without further information about the relative importance of various misclassifications, we assumed a 0-1 loss function, e.g.,
In this case, R(i, δ) becomes
and r(δ) is
Then minimizing r(δ) is equivalent to finding
The optimal rule for the equation above can be found heuristically using the equation (3) in Section 2.2, and it turns out to be the top scoring sets.
B The acceleration algorithm
The acceleration algorithm described here generalizes the pruning algorithm introduced by Tan et al. (2005) for the TSP classifier to the multiclass case. Similar to the binary TSP method, an important step for the multiclass TSS approach is to search for top scoring gene sets. Once the search process is completed, the decision rule can be immediately derived. However, given the large number of genes for microarray data, the search process is often computationally expensive. We have previously introduced two methods for searching gene sets with high scores. They are significantly more efficient than the exhaustive search, but may not be fast when combined with schemes such as cross-validation. Therefore, the algorithm here aims to accelerate the search process in the cross-validation loop.
In TSS, different methods can be employed in the search process, but only top scoring sets are kept. Therefore, gene sets that are impossible to achieve the top score can be excluded in the search process, which obviously requires a “complete” comparison among all gene sets. In a typically cross-validation loop, one such comparison is needed for each iteration. However, we will demonstrate that the acceleration algorithm can produce a small list of gene sets so that only a comparison among these gene sets is sufficient to find top scoring sets.
Let rg(n) denote the score obtained for a given gene set g when a subset of n samples is left out from N training samples in the cross-validation. The lower bound Lg(n) and the upper bound Ug(n) are defined as
Now suppose the lower and upper bounds are obtained for all possible gene sets {gi, i=1, 2,…}. Rank all lower bounds from largest to smallest and set the largest lower bound to L. Without loss of generality, assume L=Lg1(n). Then the following claim holds:
Claim: If Ugi(n)<L then the gene set gi can not be a top scoring set on N–n samples for any size n subset.
Proof: According to the definition of Ugi(n), rgi(n)≤Ugi(n). If rg(n)≤Ugi(n)<L, the following inequalities satisfy for any size n subset
Therefore, there is at least one gene set g1 scored higher than gi regardless of the choice of the size n subset. This claim follows immediately.
The reduced list Ω typically contains only a few gene sets. The identification of top scoring sets from Ω is extremely fast. The significant improvement in efficiency is hence achieved by repeatedly using Ω in each iteration of the cross-validation. The lower and upper bound for a given gene set can be obtained by calculating all possible scores when any size n subset is left out. Unless a large n and a large number of classes are considered simultaneously, this process is also efficient.
Acceleration Algorithm | |
Input: | N training samples, gene set collection |
Output: | The reduced gene set list Ω. |
1. | For each gene set gi, compute the lower bound Lgi(n) and the upper bound Ugi(n) under all possible situations that n training samples are left out. |
2. | Rank all Lgi(n) in descending order and take L=max{Lgi(n)} |
3. | Generate the list Ω consisting of all gi for which Ugi(n)≥L |
In practice,
The acceleration algorithm here can be immediately extended for the k-TSS classifier that uses top k scoring sets as the decision rule. In this situation, only step 2 in Table A.1 needs to be changed so that L is set to the k-th largest lower bound. This is because any gene set whose upper bound is less than L clearly cannot be one of the top k scoring sets during the cross-validation. As a result, the search process for top k scoring sets can also be quite efficient.
C Confusion matrices for acute leukemia subtypes
Section 3.4 provides the classification results of applying the greedy search-based TSS classifier on a large cohort of acute leukemia samples from the MILE project. The classification uses a two-step decision tree (Figure 4) to decompose the original problem into three sub-classification problems: a three-class problem at the top level, and a seven- and six-class problem at the bottom level. The following confusion matrices are for these three problems, respectively. True/gold standard (GS) classifications of samples are presented in rows and predictions are in columns.
References
Armstrong, S. A., J. E. Staunton, L. B. Silverman, R. Pieters, M. L. den Boer, M. D. Minden, S. E. Sallan, E. S. Lander, T. R. Golub and S. J. Korsmeyer (2002): “Mll translocations specify a distinct gene expression profile that distinguishes a unique leukemia,” Nat. Gene., 30(1), 41–47.Search in Google Scholar
Beer, D. G., S. L. Kardia, C. C. Huang, T. J. Giordano, A. M. Levin, D. E. Misek, L. Lin, G. Chen, T. G. Gharib, D. G. Thomas, M. L. Lizyness, R. Kuick, S. Hayasaka, J. M. Taylor, M. D. Iannettoni, M. B. Orringer and S. Hanash (2002): “Gene-expression profiles predict survival of patients with lung adenocarcinoma,” Nat. Med., 8(8), 816–824.Search in Google Scholar
Breiman, L. (2001): “Random forests,” Mach. Learn., 45(1), 5–32.Search in Google Scholar
Burgess, D. J. (2011): “Cancer genetics: initially complex, always heterogeneous,” Nat. Rev. Cancer, 11, 153.Search in Google Scholar
Cheok, M. H., W. Yang, C. H. Pui, J. R. Downing, C. Cheng, C. W. Naeve, M. V. Relling and W. E. Evans (2003): “Treatment-specific changes in gene expression discriminate in vivo drug response in human leukemia cells,” Nat. Genet., 34(1), 85–90.Search in Google Scholar
Chih-Chung, C. and L. Chih-Jen (2011): “Libsvm: a library for support vector machines,” ACM Transact. Intell. Syst. Technol., 2(3), article 27.Search in Google Scholar
Cortes, C. and V. Vapnik (1995): “Support-vector networks,” Mach. Learn., 20(3), 273–297.Search in Google Scholar
Dehan, E., A. Ben-Dor, W. Liao, D. Lipson, H. Frimer, S. Rienstein, D. Simansky, M. Krupsky, P. Yaron, E. Friedman, G. Rechavi, M. Perlman, A. Aviram-Goldring, S. Izraeli, M. Bittner, Z. Yakhini and N. Kaminski (2007): “Chromosomal aberrations and gene expression profiles in non-small cell lung cancer,” Lung Cancer, 56(2), 157–184.10.1016/j.lungcan.2006.12.010Search in Google Scholar PubMed
Dyrskjot, L., T. Thykjaer, M. Kruhoffer, J. L. Jensen, N. Marcussen, D. S. Hamilton, H. Wolf and T. F. Orntoft (2003): “Identifying distinct classes of bladder carcinoma using microarrays,” Nat. Genet., 33(1), 90–96.Search in Google Scholar
Dyrskjot, L., K. Zieger, F. X. Real, N. Malats, A. Carrato, C. Hurst, S. Kotwal, M. Knowles, P. U. Malmstrom, M. de la Torre, K. Wester, Y. Allory, D. Vordos, A. Caillault, F. Radvanyi, A. M. Hein, J. L. Jensen, K. M. Jensen, N. Marcussen and T. F. Orntoft (2007): “Gene expression signatures predict outcome in non-muscle-invasive bladder carcinoma: a multicenter validation study,” Clin. Cancer Res., 13(12), 3545–3551.Search in Google Scholar
Eddy, J. A., J. Sung, D. Geman and N. D. Price (2010): “Relative expression analysis for molecular cancer diagnosis and prognosis,” Technol. Cancer Res. Treat., 9(2), 149–159.10.1177/153303461000900204Search in Google Scholar PubMed PubMed Central
Edelman, L. B., G. Toia, D. Geman, W. Zhang and N. D. Price (2009): “Two-transcript gene expression classifiers in the diagnosis and prognosis of human diseases,” BMC Genom., 10, 583.Search in Google Scholar
Efron, B. and R. Tibshirani (2006): “On testing the significance of sets of genes,” Technical Report, Stanford University, http://www-stat.stanford.edu/~tibs/GSA/.10.1214/07-AOAS101Search in Google Scholar
Gatza, M. L., J. E. Lucas, W. T. Barry, J. W. Kim, Q. Wang, M. D. Crawford, M. B. Datto, M. Kelley, B. Mathey-Prevot, A. Potti and J. R. Nevins (2010): “A pathway-based classification of human breast cancer,” Proc. Natl. Acad. Sci. USA, 107(15), 6994–6999.10.1073/pnas.0912708107Search in Google Scholar PubMed PubMed Central
Geman, D., C. d’ Avignon, D. Q. Naiman and R. L. Winslow (2004): “Classifying gene expression profiles from pairwise mrna comparisons,” Stat. Appl. Genet. Mol. Biol., 3, article 19.Search in Google Scholar
Gentleman, R. C., V. J. Carey, D. M. Bates and others (2004): “Bioconductor: Open software development for computational biology and bioinformatics,” Genome Biol., 5, R80.Search in Google Scholar
Golub, T., D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield and E. S. Lander (1999): “Molecular classification of cancer: class discovery and class prediction by gene expression monitoring,” Science, 286(5439), 531–537.10.1126/science.286.5439.531Search in Google Scholar PubMed
Grate, L. R. (2005): “Many accurate small-discriminatory feature subsets exist in microarray transcript data: biomarker discovery,” BMC Bioinformatics, 6, 97.10.1186/1471-2105-6-97Search in Google Scholar PubMed PubMed Central
Haferlach, T., A. Kohlmann, L. Wieczorek, G. Basso, G. T. Kronnie, M. C. Bene, J. De Vos, J. M. Hernmandez, W. K. Hofmann, K. I. Mills, A. Gilkes, S. Chiaretti, S. A. Shurtleff, T. J. Kipps, L. Z. Rassenti, A. E. Yeoh, P. R. Papenhausen, W. M. Liu, P. M. Williams and R. Foa (2010): “Clinical utility of microarray-based gene expression profiling in the diagnosis and subclassification of leukemia: report from the international microarray innovations in leukemia study group,” J. Clin. Oncol., 28(15), 2529–2537.10.1200/JCO.2009.23.4732Search in Google Scholar PubMed PubMed Central
Haibe-Kains, B., C. Desmedt, S. Loi, A. C. Culhane, G. Bontempi, J. Quackenbush and C. Sotiriou (2012): “A three-gene model to robustly identify breast cancer molecular subtypes,” J. Natl. Cancer Inst., 104(4), 311–325.10.1093/jnci/djr545Search in Google Scholar PubMed PubMed Central
Irizarry, R., B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis, U. Scherf and T. Speed (2003): “Exploration, normalization, and summaries of high density oligonucleotide array probe level data,” Biostatistics, 4(2), 249–264.10.1093/biostatistics/4.2.249Search in Google Scholar PubMed
Kanehisa, M., S. Goto, S. Kawashima, Y. Okuno and M. Hattori (2004): “The kegg resource for deciphering the genome,” Nucleic Acids Res., 32, D277–D280.Search in Google Scholar
Kaur, P., D. Schlatzer, K. Cooke and M. R. Chance (2012): “Pairwise protein expression classifier for candidate biomarker discovery for early detection of human disease prognosis,” BMC Bioinformatics, 13, 191.10.1186/1471-2105-13-191Search in Google Scholar PubMed PubMed Central
Khan, J., J. S. Wei, M. Ringner, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson and P. S. Meltzer (2001): “Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks,” Nat. Med., 7(6), 673–679.Search in Google Scholar
Kim, S., M. Kon and C. DeLisi (2012): “A pathway-based classification of human breast cancer,” Biol. Direct, 7, 21.10.1186/1745-6150-7-21Search in Google Scholar PubMed PubMed Central
Leban, G., I. Bratko, U. Petrovic, T. Curk and B. Zupan (2005): “Vizrank: finding informative data projections in functional genomics by machine learning,” Bioinformatics, 21(3), 413–414.10.1093/bioinformatics/bti016Search in Google Scholar PubMed
Lin, X. (2008): Rank-based methods for statistical analysis of gene expression microarray data, Ph.D. Thesis. The Johns Hopkins Unversity, Baltimore, MD, USA.Search in Google Scholar
Lin, X., B. Afsari, L. Marchionni, L. Cope, G. Parmigiani, D. Q. Naiman and D. Geman (2009): “The ordering of expression among a few genes can provide simple cancer biomarkers and signal brca1 mutations,” BMC Bioinformatics, 10, 256.10.1186/1471-2105-10-256Search in Google Scholar PubMed PubMed Central
Mramor, M., G. Leban, J. Demsar and B. Zupan (2007): “Visualization-based cancer microarray data classification analysis,” Bioinformatics, 23(16), 2147–2154.10.1093/bioinformatics/btm312Search in Google Scholar PubMed
Patnaik, S. K., E. Kannisto, S. Knudsen and S. Yendamuri (2010): “Evaluation of microrna expression profiles that may predict recurrence of localized stage i non-small cell lung cancer after surgical resection,” Cancer Res., 70(1), 36–45.10.1158/0008-5472.CAN-09-3153Search in Google Scholar PubMed
Quackenbush, J. (2006): “Microarray analysis and tumor classification,” New Engl. J. Med., 354(23), 2463–2472.Search in Google Scholar
Shah, M. A., R. Khanin, L. Tang, Y. Y. Janjigian, D. S. Klimstra, H. Gerdes and D. P. Kelsen (2011): “Molecular classification of gastric cancer: a new paradigm,” Clin. Cancer Res., 17(9), 2693–2701.Search in Google Scholar
Shen, L. and E. C. Tan (2006): “Reducing multiclass cancer classification to binary by output coding and svm,” Comput. Biol. Chem., 30(1), 63–71.Search in Google Scholar
Statnikov, A., C. F. Aliferis, I. Tsamardinos, D. Hardin and S. Levy (2005): “A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis,” Bioinformatics, 21(5), 631–643.10.1093/bioinformatics/bti033Search in Google Scholar PubMed
Statnikov, A., L. Wang and C. F. Aliferis (2008): “A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification,” BMC Bioinformatics, 9, 319.10.1186/1471-2105-9-319Search in Google Scholar PubMed PubMed Central
Subramanian, A., P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander and J. P. Mesirov (2005): “Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles,” Proc. Natl. Acad. Sci. USA, 102(43), 15545–15550.10.1073/pnas.0506580102Search in Google Scholar PubMed PubMed Central
Tan, A. C., D. Q. Naiman, L. Xu, R. L. Winslow and D. Geman (2005): “Simple decision rules for classifying human cancers from gene expression profiles,” Bioinformatics, 21(20), 3896–3904.10.1093/bioinformatics/bti631Search in Google Scholar PubMed PubMed Central
Tibshirani, R., T. Hastie, B. Narasimhan and G. Chu (2002): “Diagnosis of multiple cancer types by shrunken centroids of gene expression,” Proc. Natl. Acad. Sci. USA, 99(10), 6567–6572.10.1073/pnas.082099299Search in Google Scholar PubMed PubMed Central
Xu, L., D. Geman and R. L. Winslow (2007): “Large-scale integration of cancer microarray data identifies a robust common cancer signature,” BMC Bioinformatics, 8, 275.10.1186/1471-2105-8-275Search in Google Scholar PubMed PubMed Central
Zhao, H., C. J. Logothetis and I. P. Gorlov (2010): “Usefulness of the top-scoring pairs of genes for prediction of prostate cancer progression,” Prostate Cancer Prost. Dis., 13(3), 252–259.Search in Google Scholar
© 2014 by De Gruyter