Multiclass cancer classification based  on gene expression comparison

Sitan Yang; Daniel Q. Naiman

doi:10.1515/sagmb-2013-0053

Published by De Gruyter June 11, 2014

Multiclass cancer classification based on gene expression comparison

Sitan Yang and Daniel Q. Naiman

From the journal Statistical Applications in Genetics and Molecular Biology

https://doi.org/10.1515/sagmb-2013-0053

Showing a limited preview of this publication:

Abstract

As the complexity and heterogeneity of cancer is being increasingly appreciated through genomic analyses, microarray-based cancer classification comprising multiple discriminatory molecular markers is an emerging trend. Such multiclass classification problems pose new methodological and computational challenges for developing novel and effective statistical approaches. In this paper, we introduce a new approach for classifying multiple disease states associated with cancer based on gene expression profiles. Our method focuses on detecting small sets of genes in which the relative comparison of their expression values leads to class discrimination. For an m-class problem, the classification rule typically depends on a small number of m-gene sets, which provide transparent decision boundaries and allow for potential biological interpretations. We first test our approach on seven common gene expression datasets and compare it with popular classification methods including support vector machines and random forests. We then consider an extremely large cohort of leukemia cancer patients to further assess its effectiveness. In both experiments, our method yields comparable or even better results to benchmark classifiers. In addition, we demonstrate that our approach can integrate pathway analysis of gene expression to provide accurate and biological meaningful classification.

Keywords: biomarker discovery; gene expression analysis; multiclass cancer classification

Corresponding author: Sitan Yang, Department of Applied Mathematics and Statistics, Johns Hopkins University, 211C Whitehead Hall 3400 N., Charles Street, Baltimore, Maryland 21218, USA, e-mail: syang45@jhu.edu

Acknowledgments

The authors would like to thank Donald Geman and the reviewers for their insightful comments and suggestions. This work was partially supported by NIH-NCRR Grant UL1 RR 025005.

Appendix

A Bayesian decision-theoretic interpretation

In this section, we provide an interpretation for the scoring equation of the TSS classifier [see (3) in Methods] using the Bayesian decision theory, and derive the optimal Bayesian classifier with a general loss function and class priors. Consider an m-class classification problem, TSS aims to find m-gene set 𝒮={i₁, i₂, …, i_m} such that

P[arg max{Xr, r∈𝒮}=ic|y=c]>>P[arg max{Xr, r∈𝒮}=ic|y≠c], c∈{1, 2, …, m}.

Now suppose the class conditional probability distribution associated with gene expression comparisons in 𝒮 is given by

	Class
	y=1	y=2	…	y=m
X_i₁>max{X_r, r∈𝒮\i₁}	p₁₁	p₁₂	…	p_1m
X_i₂>max{X_r, r∈𝒮\i₂}	p₂₁	p₂₂	…	p_2m
……	……
X_{i_m}>max{X_r, r∈𝒮\i_m}	p_m1	p_m2	…	p_mm

A decision procedure δ can be constructed where the result of each comparison in the table above is considered being indicative of a sample from a distinct class. In this situation, m classes lead to m! possible decision procedures for a given gene set where one of those decision procedures is illustrated below.

X	δ(X)
X_i₁>max{X_r, r∈𝒮\i₁}	3
X_i₂>max{X_r, r∈𝒮\i₂}	5
……	……
X_{i_m}>max{X_r, r∈𝒮\i_m}	2

Next, a loss function can be introduced for δ by specifying the penalties for misclassification as follows

	δ=1	δ=2	…	δ=m
y=1	l₁₁	l₁₂	…	l_1m
y=2	l₂₁	l₂₂	…	l_2m
…	……
y=m	l_m1	l_m2	…	l_mm

Based on the tables above, R(i, δ), the risk function of δ for class y=i can be written as

R(i, δ)=∑j=1mlij⋅p(δ=j|y=i).

Consequently, r(δ), the Bayes risk of δ is given by

r(δ)=∑i=1mπi⋅R(i, δ)

where π_i is the prior probability for class y=i. Therefore, the Bayes risk associated with δ is given by

r(δ)=∑i=1mπi⋅∑j=1mlij⋅p(δ=j|y=i),

and the decision rule δ^* satisfying

δ*=argδ min r(δ)

is the optimal rule referrred to as the Bayes Rule. As mentioned earlier, for a given gene set, there are m! possible decision procedures, and since the number of possible gene sets is also finite, the optimal Bayes rule δ^* for the problem can be found by searching for the rule that minimizes r(δ) among all gene sets.

It is important to note that the Bayesian optimality of the decision rule described above only applies when the gene set used for classification has been determined. Otherwise, the development of the Bayes rule requires the joint probability distribution of all genes. In fact, no Bayesian theory is directly related to the choice of gene set for classification.

Equation (A-1) uses the general form of loss function and class prior probabilities. In practice, the choice of loss function and class priors depends on the problem. For example, the empirical estimation n_c/N can be considered, where n_c is the sample size of class c and N is the total sample size. In the context of this paper, the sample size of microarray datasets is quite limited and the sample proportions in a particular dataset may not reflect the actual distribution in the population. Therefore, we used equal class priors in our approach. In addition, without further information about the relative importance of various misclassifications, we assumed a 0-1 loss function, e.g.,

lij={0, if i=j,1, otherwise.

In this case, R(i, δ) becomes

R(i, δ)=∑j=1mlij⋅p(δ=j|y=i)=∑j≠ip(δ=j|y=i)=1−p(δ=i|y=i),

and r(δ) is

r(δ)=∑i=1mπiR(i, δ)=1−1m∑i=1mp(δ=i|y=i).

Then minimizing r(δ) is equivalent to finding

maxδ∑i=1mp(δ=i|y=i).

The optimal rule for the equation above can be found heuristically using the equation (3) in Section 2.2, and it turns out to be the top scoring sets.

B The acceleration algorithm

The acceleration algorithm described here generalizes the pruning algorithm introduced by Tan et al. (2005) for the TSP classifier to the multiclass case. Similar to the binary TSP method, an important step for the multiclass TSS approach is to search for top scoring gene sets. Once the search process is completed, the decision rule can be immediately derived. However, given the large number of genes for microarray data, the search process is often computationally expensive. We have previously introduced two methods for searching gene sets with high scores. They are significantly more efficient than the exhaustive search, but may not be fast when combined with schemes such as cross-validation. Therefore, the algorithm here aims to accelerate the search process in the cross-validation loop.

In TSS, different methods can be employed in the search process, but only top scoring sets are kept. Therefore, gene sets that are impossible to achieve the top score can be excluded in the search process, which obviously requires a “complete” comparison among all gene sets. In a typically cross-validation loop, one such comparison is needed for each iteration. However, we will demonstrate that the acceleration algorithm can produce a small list of gene sets so that only a comparison among these gene sets is sufficient to find top scoring sets.

Let r_g(n) denote the score obtained for a given gene set g when a subset of n samples is left out from N training samples in the cross-validation. The lower bound L_g(n) and the upper bound U_g(n) are defined as

Lg(n)≤min{rg(n) : any size n subset}Ug(n)≥max{rg(n) : any size n subset}.

Now suppose the lower and upper bounds are obtained for all possible gene sets {g_i, i=1, 2,…}. Rank all lower bounds from largest to smallest and set the largest lower bound to L. Without loss of generality, assume L=L_g₁(n). Then the following claim holds:

Claim: If U_{g_i}(n)<L then the gene set g_i can not be a top scoring set on N–n samples for any size n subset.

Proof: According to the definition of U_{g_i}(n), r_{g_i}(n)≤U_gi(n). If r_g(n)≤U_{g_i}(n)<L, the following inequalities satisfy for any size n subset

rgi(n)≤Ugi(n)<L≤rg1(n).

Therefore, there is at least one gene set g₁ scored higher than g_i regardless of the choice of the size n subset. This claim follows immediately.

The reduced list Ω typically contains only a few gene sets. The identification of top scoring sets from Ω is extremely fast. The significant improvement in efficiency is hence achieved by repeatedly using Ω in each iteration of the cross-validation. The lower and upper bound for a given gene set can be obtained by calculating all possible scores when any size n subset is left out. Unless a large n and a large number of classes are considered simultaneously, this process is also efficient.

Table A.1

Description of the acceleration algorithm.

Acceleration Algorithm
Input:	N training samples, gene set collection G={g1, g2, ...}
Output:	The reduced gene set list Ω.
1.	For each gene set g_i, compute the lower bound L_{g_i}(n) and the upper bound U_{g_i}(n) under all possible situations that n training samples are left out.
2.	Rank all L_{g_i}(n) in descending order and take L=max{L_{g_i}(n)}
3.	Generate the list Ω consisting of all g_i for which U_{g_i}(n)≥L

In practice, G in Table A.1 can be any gene set collection considered in the search process. For example, in the greedy search process, there are a number of sub-classification problems generated from the original problem. In each of these sub-problems, only top scoring sets are stored. Therefore, the acceleration algorithm can be applied in each step of the greedy search to yield a reduced list of gene sets that can possibly be identified as top scoring sets in the cross-validation. As a result, the greedy search process only needs to be applied one time on the training set.

The acceleration algorithm here can be immediately extended for the k-TSS classifier that uses top k scoring sets as the decision rule. In this situation, only step 2 in Table A.1 needs to be changed so that L is set to the k-th largest lower bound. This is because any gene set whose upper bound is less than L clearly cannot be one of the top k scoring sets during the cross-validation. As a result, the search process for top k scoring sets can also be quite efficient.

C Confusion matrices for acute leukemia subtypes

Section 3.4 provides the classification results of applying the greedy search-based TSS classifier on a large cohort of acute leukemia samples from the MILE project. The classification uses a two-step decision tree (Figure 4) to decompose the original problem into three sub-classification problems: a three-class problem at the top level, and a seven- and six-class problem at the bottom level. The following confusion matrices are for these three problems, respectively. True/gold standard (GS) classifications of samples are presented in rows and predictions are in columns.

Table A. 2

Confusion matrices for classification of acute leukemia subtypes.

References

Armstrong, S. A., J. E. Staunton, L. B. Silverman, R. Pieters, M. L. den Boer, M. D. Minden, S. E. Sallan, E. S. Lander, T. R. Golub and S. J. Korsmeyer (2002): “Mll translocations specify a distinct gene expression profile that distinguishes a unique leukemia,” Nat. Gene., 30(1), 41–47.Search in Google Scholar

Beer, D. G., S. L. Kardia, C. C. Huang, T. J. Giordano, A. M. Levin, D. E. Misek, L. Lin, G. Chen, T. G. Gharib, D. G. Thomas, M. L. Lizyness, R. Kuick, S. Hayasaka, J. M. Taylor, M. D. Iannettoni, M. B. Orringer and S. Hanash (2002): “Gene-expression profiles predict survival of patients with lung adenocarcinoma,” Nat. Med., 8(8), 816–824.Search in Google Scholar

Breiman, L. (2001): “Random forests,” Mach. Learn., 45(1), 5–32.Search in Google Scholar

Burgess, D. J. (2011): “Cancer genetics: initially complex, always heterogeneous,” Nat. Rev. Cancer, 11, 153.Search in Google Scholar

Cheok, M. H., W. Yang, C. H. Pui, J. R. Downing, C. Cheng, C. W. Naeve, M. V. Relling and W. E. Evans (2003): “Treatment-specific changes in gene expression discriminate in vivo drug response in human leukemia cells,” Nat. Genet., 34(1), 85–90.Search in Google Scholar

Chih-Chung, C. and L. Chih-Jen (2011): “Libsvm: a library for support vector machines,” ACM Transact. Intell. Syst. Technol., 2(3), article 27.Search in Google Scholar

Cortes, C. and V. Vapnik (1995): “Support-vector networks,” Mach. Learn., 20(3), 273–297.Search in Google Scholar

Dehan, E., A. Ben-Dor, W. Liao, D. Lipson, H. Frimer, S. Rienstein, D. Simansky, M. Krupsky, P. Yaron, E. Friedman, G. Rechavi, M. Perlman, A. Aviram-Goldring, S. Izraeli, M. Bittner, Z. Yakhini and N. Kaminski (2007): “Chromosomal aberrations and gene expression profiles in non-small cell lung cancer,” Lung Cancer, 56(2), 157–184.10.1016/j.lungcan.2006.12.010Search in Google Scholar PubMed

Dyrskjot, L., T. Thykjaer, M. Kruhoffer, J. L. Jensen, N. Marcussen, D. S. Hamilton, H. Wolf and T. F. Orntoft (2003): “Identifying distinct classes of bladder carcinoma using microarrays,” Nat. Genet., 33(1), 90–96.Search in Google Scholar

Dyrskjot, L., K. Zieger, F. X. Real, N. Malats, A. Carrato, C. Hurst, S. Kotwal, M. Knowles, P. U. Malmstrom, M. de la Torre, K. Wester, Y. Allory, D. Vordos, A. Caillault, F. Radvanyi, A. M. Hein, J. L. Jensen, K. M. Jensen, N. Marcussen and T. F. Orntoft (2007): “Gene expression signatures predict outcome in non-muscle-invasive bladder carcinoma: a multicenter validation study,” Clin. Cancer Res., 13(12), 3545–3551.Search in Google Scholar

Eddy, J. A., J. Sung, D. Geman and N. D. Price (2010): “Relative expression analysis for molecular cancer diagnosis and prognosis,” Technol. Cancer Res. Treat., 9(2), 149–159.10.1177/153303461000900204Search in Google Scholar PubMed PubMed Central

Edelman, L. B., G. Toia, D. Geman, W. Zhang and N. D. Price (2009): “Two-transcript gene expression classifiers in the diagnosis and prognosis of human diseases,” BMC Genom., 10, 583.Search in Google Scholar

Efron, B. and R. Tibshirani (2006): “On testing the significance of sets of genes,” Technical Report, Stanford University, http://www-stat.stanford.edu/~tibs/GSA/.10.1214/07-AOAS101Search in Google Scholar

Gatza, M. L., J. E. Lucas, W. T. Barry, J. W. Kim, Q. Wang, M. D. Crawford, M. B. Datto, M. Kelley, B. Mathey-Prevot, A. Potti and J. R. Nevins (2010): “A pathway-based classification of human breast cancer,” Proc. Natl. Acad. Sci. USA, 107(15), 6994–6999.10.1073/pnas.0912708107Search in Google Scholar PubMed PubMed Central

Geman, D., C. d’ Avignon, D. Q. Naiman and R. L. Winslow (2004): “Classifying gene expression profiles from pairwise mrna comparisons,” Stat. Appl. Genet. Mol. Biol., 3, article 19.Search in Google Scholar

Gentleman, R. C., V. J. Carey, D. M. Bates and others (2004): “Bioconductor: Open software development for computational biology and bioinformatics,” Genome Biol., 5, R80.Search in Google Scholar

Golub, T., D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield and E. S. Lander (1999): “Molecular classification of cancer: class discovery and class prediction by gene expression monitoring,” Science, 286(5439), 531–537.10.1126/science.286.5439.531Search in Google Scholar PubMed

Grate, L. R. (2005): “Many accurate small-discriminatory feature subsets exist in microarray transcript data: biomarker discovery,” BMC Bioinformatics, 6, 97.10.1186/1471-2105-6-97Search in Google Scholar PubMed PubMed Central

Haferlach, T., A. Kohlmann, L. Wieczorek, G. Basso, G. T. Kronnie, M. C. Bene, J. De Vos, J. M. Hernmandez, W. K. Hofmann, K. I. Mills, A. Gilkes, S. Chiaretti, S. A. Shurtleff, T. J. Kipps, L. Z. Rassenti, A. E. Yeoh, P. R. Papenhausen, W. M. Liu, P. M. Williams and R. Foa (2010): “Clinical utility of microarray-based gene expression profiling in the diagnosis and subclassification of leukemia: report from the international microarray innovations in leukemia study group,” J. Clin. Oncol., 28(15), 2529–2537.10.1200/JCO.2009.23.4732Search in Google Scholar PubMed PubMed Central

Haibe-Kains, B., C. Desmedt, S. Loi, A. C. Culhane, G. Bontempi, J. Quackenbush and C. Sotiriou (2012): “A three-gene model to robustly identify breast cancer molecular subtypes,” J. Natl. Cancer Inst., 104(4), 311–325.10.1093/jnci/djr545Search in Google Scholar PubMed PubMed Central

Irizarry, R., B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis, U. Scherf and T. Speed (2003): “Exploration, normalization, and summaries of high density oligonucleotide array probe level data,” Biostatistics, 4(2), 249–264.10.1093/biostatistics/4.2.249Search in Google Scholar PubMed

Kanehisa, M., S. Goto, S. Kawashima, Y. Okuno and M. Hattori (2004): “The kegg resource for deciphering the genome,” Nucleic Acids Res., 32, D277–D280.Search in Google Scholar

Kaur, P., D. Schlatzer, K. Cooke and M. R. Chance (2012): “Pairwise protein expression classifier for candidate biomarker discovery for early detection of human disease prognosis,” BMC Bioinformatics, 13, 191.10.1186/1471-2105-13-191Search in Google Scholar PubMed PubMed Central

Khan, J., J. S. Wei, M. Ringner, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson and P. S. Meltzer (2001): “Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks,” Nat. Med., 7(6), 673–679.Search in Google Scholar

Kim, S., M. Kon and C. DeLisi (2012): “A pathway-based classification of human breast cancer,” Biol. Direct, 7, 21.10.1186/1745-6150-7-21Search in Google Scholar PubMed PubMed Central

Leban, G., I. Bratko, U. Petrovic, T. Curk and B. Zupan (2005): “Vizrank: finding informative data projections in functional genomics by machine learning,” Bioinformatics, 21(3), 413–414.10.1093/bioinformatics/bti016Search in Google Scholar PubMed

Lin, X. (2008): Rank-based methods for statistical analysis of gene expression microarray data, Ph.D. Thesis. The Johns Hopkins Unversity, Baltimore, MD, USA.Search in Google Scholar

Lin, X., B. Afsari, L. Marchionni, L. Cope, G. Parmigiani, D. Q. Naiman and D. Geman (2009): “The ordering of expression among a few genes can provide simple cancer biomarkers and signal brca1 mutations,” BMC Bioinformatics, 10, 256.10.1186/1471-2105-10-256Search in Google Scholar PubMed PubMed Central

Mramor, M., G. Leban, J. Demsar and B. Zupan (2007): “Visualization-based cancer microarray data classification analysis,” Bioinformatics, 23(16), 2147–2154.10.1093/bioinformatics/btm312Search in Google Scholar PubMed

Patnaik, S. K., E. Kannisto, S. Knudsen and S. Yendamuri (2010): “Evaluation of microrna expression profiles that may predict recurrence of localized stage i non-small cell lung cancer after surgical resection,” Cancer Res., 70(1), 36–45.10.1158/0008-5472.CAN-09-3153Search in Google Scholar PubMed

Quackenbush, J. (2006): “Microarray analysis and tumor classification,” New Engl. J. Med., 354(23), 2463–2472.Search in Google Scholar

Shah, M. A., R. Khanin, L. Tang, Y. Y. Janjigian, D. S. Klimstra, H. Gerdes and D. P. Kelsen (2011): “Molecular classification of gastric cancer: a new paradigm,” Clin. Cancer Res., 17(9), 2693–2701.Search in Google Scholar

Shen, L. and E. C. Tan (2006): “Reducing multiclass cancer classification to binary by output coding and svm,” Comput. Biol. Chem., 30(1), 63–71.Search in Google Scholar

Statnikov, A., C. F. Aliferis, I. Tsamardinos, D. Hardin and S. Levy (2005): “A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis,” Bioinformatics, 21(5), 631–643.10.1093/bioinformatics/bti033Search in Google Scholar PubMed

Statnikov, A., L. Wang and C. F. Aliferis (2008): “A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification,” BMC Bioinformatics, 9, 319.10.1186/1471-2105-9-319Search in Google Scholar PubMed PubMed Central

Subramanian, A., P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander and J. P. Mesirov (2005): “Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles,” Proc. Natl. Acad. Sci. USA, 102(43), 15545–15550.10.1073/pnas.0506580102Search in Google Scholar PubMed PubMed Central

Tan, A. C., D. Q. Naiman, L. Xu, R. L. Winslow and D. Geman (2005): “Simple decision rules for classifying human cancers from gene expression profiles,” Bioinformatics, 21(20), 3896–3904.10.1093/bioinformatics/bti631Search in Google Scholar PubMed PubMed Central

Tibshirani, R., T. Hastie, B. Narasimhan and G. Chu (2002): “Diagnosis of multiple cancer types by shrunken centroids of gene expression,” Proc. Natl. Acad. Sci. USA, 99(10), 6567–6572.10.1073/pnas.082099299Search in Google Scholar PubMed PubMed Central

Xu, L., D. Geman and R. L. Winslow (2007): “Large-scale integration of cancer microarray data identifies a robust common cancer signature,” BMC Bioinformatics, 8, 275.10.1186/1471-2105-8-275Search in Google Scholar PubMed PubMed Central

Zhao, H., C. J. Logothetis and I. P. Gorlov (2010): “Usefulness of the top-scoring pairs of genes for prediction of prostate cancer progression,” Prostate Cancer Prost. Dis., 13(3), 252–259.Search in Google Scholar

Published Online: 2014-6-11

Published in Print: 2014-8-1