Proc Natl Acad Sci U S A. 2002 May 14;99(10):6567-72.
doi: 10.1073/pnas.082099299.

Diagnosis of multiple cancer types by shrunken centroids of gene expression

Robert Tibshirani et al. Proc Natl Acad Sci U S A. 2002.

Abstract

We have devised an approach to cancer class prediction from gene expression profiling, based on an enhancement of the simple nearest prototype (centroid) classifier. We shrink the prototypes and hence obtain a classifier that is often more accurate than competing methods. Our method of "nearest shrunken centroids" identifies subsets of genes that best characterize each class. The technique is general and can be used in many other classification problems. To demonstrate its effectiveness, we show that the method was highly efficient in finding genes for classifying small round blue cell tumors and leukemias.
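
The shrinkage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the per-gene scaling (pooled within-class standard deviation plus its median as an offset), the equal treatment of all classes, and the names `soft_threshold`, `fit_shrunken_centroids`, and `predict` are assumptions made here for illustration. Each class centroid's standardized difference from the overall centroid is moved toward zero by a threshold `delta`; genes whose differences reach zero in every class drop out of the classifier.

```python
import numpy as np


def soft_threshold(d, delta):
    """Move every entry of d toward zero by delta; entries within delta of zero become 0."""
    return np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)


def fit_shrunken_centroids(X, y, delta):
    """X: (n_samples, n_genes) expression matrix; y: (n_samples,) class labels."""
    classes, idx = np.unique(y, return_inverse=True)
    overall = X.mean(axis=0)
    centroids = np.stack([X[idx == k].mean(axis=0) for k in range(len(classes))])
    # Per-gene pooled within-class standard deviation (assumed scaling).
    resid = X - centroids[idx]
    s = np.sqrt((resid ** 2).sum(axis=0) / (len(y) - len(classes)))
    s = s + np.median(s)  # assumed offset so low-variance genes do not dominate
    # Standardized difference of each class centroid from the overall centroid,
    # shrunk toward zero by delta.
    d_shrunk = soft_threshold((centroids - overall) / s, delta)
    return {
        "classes": classes,
        "scale": s,
        "centroids": overall + s * d_shrunk,        # shrunken class centroids
        "selected": np.any(d_shrunk != 0, axis=0),  # genes kept by at least one class
    }


def predict(model, X):
    """Assign each sample to the class with the nearest shrunken centroid."""
    diff = (X[:, None, :] - model["centroids"][None, :, :]) / model["scale"]
    return model["classes"][np.argmin((diff ** 2).sum(axis=2), axis=1)]


# Synthetic toy data (not from the paper): 3 classes, 50 genes, one marker gene per class.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 20)
X = rng.normal(size=(60, 50))
for k in range(3):
    X[y == k, k] += 8.0

model = fit_shrunken_centroids(X, y, delta=0.5)
accuracy = np.mean(predict(model, X) == y)
print(model["selected"].sum(), accuracy)
```

Genes flagged in `model["selected"]` are the ones that still distinguish at least one class after shrinkage. In practice the threshold would be chosen by cross-validation, as the error curves in Figure 2 suggest, rather than fixed by hand as in this toy run.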

Figures

Figure 1
Centroids (grey) and shrunken centroids (red) for the SRBCT dataset. The overall centroid has been subtracted from the centroid from each class. The horizontal units are log ratios of expression. From left to right, the numbers of training samples for each class are 8, 23, 12, and 20. The order of the genes is arbitrary.
Figure 2
SRBCT classification: training (tr, green), cross-validation (cv, red), and test (te, blue) errors are shown as a function of the threshold parameter Δ. The value Δ = 4.34 is chosen and yields a subset of 43 selected genes.
Figure 3
Shrunken differences d_ik for the 43 genes having at least one nonzero difference. The genes with nonzero components in each class are almost mutually exclusive.
Figure 4
(Top) Heat map of the chosen 43 genes. Within each of the horizontal partitions, we have ordered the genes by hierarchical clustering, and similarly for the samples within each vertical partition. (Bottom) Heat map of three genes reported in the literature to characterize SRBCT tumors. They are c-myc (Top), CD45 (Middle) and myogenin (Bottom).
Figure 5
Estimated probabilities for the training data (Upper) and test data (Lower). Samples are partitioned by the true class (Upper) and the predicted class (Lower). All 63 of the training samples and all 20 of the test samples known to be SRBCT are correctly classified. Five of the test samples were non-SRBCT and hence should not be classified as such. The maximum estimated probability for each of these five samples is marked with a circle; they are below the maximum probabilities for the other test samples in each class.
Figure 6
Leukemia classification: training (tr, green), cross-validation (cv, red), and test (te, blue) errors. The value Δ = 4.06 yields a subset of 21 genes.
