Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer

doi:10.1073/pnas.0601231103

Proc Natl Acad Sci U S A. 2006 Apr 11;103(15):5923-8.

doi: 10.1073/pnas.0601231103. Epub 2006 Apr 3.

Liat Ein-Dor¹, Or Zuk, Eytan Domany

PMID: 16585533
PMCID: PMC1458674
DOI: 10.1073/pnas.0601231103

Liat Ein-Dor et al. Proc Natl Acad Sci U S A. 2006.

Affiliation

¹ Department of Physics of Complex Systems, The Weizmann Institute of Science, Rehovot 76100, Israel.

Abstract

Predicting at the time of discovery the prognosis and metastatic potential of cancer is a major challenge in current clinical research. Numerous recent studies searched for gene expression signatures that outperform traditionally used clinical parameters in outcome prediction. Finding such a signature will free many patients of the suffering and toxicity associated with adjuvant chemotherapy given to them under current protocols, even though they do not need such treatment. A reliable set of predictive genes also will contribute to a better understanding of the biological mechanism of metastasis. Several groups have published lists of predictive genes and reported good predictive performance based on them. However, the gene lists obtained for the same clinical types of patients by different groups differed widely and had only very few genes in common. This lack of agreement raised doubts about the reliability and robustness of the reported predictive gene lists, and the main source of the problem was shown to be the small number of samples that were used to generate the gene lists. Here, we introduce a previously undescribed mathematical method, probably approximately correct (PAC) sorting, for evaluating the robustness of such lists. We calculate for several published data sets the number of samples that are needed to achieve any desired level of reproducibility. For example, to achieve a typical overlap of 50% between two predictive lists of genes, breast cancer studies would need the expression profiles of several thousand early discovery patients.
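
The dependence of list overlap on sample size can be illustrated with a small simulation. The following is a minimal sketch on synthetic data, not the authors' PAC sorting method or their data sets. It assumes that genes are ranked by absolute Pearson correlation with a continuous outcome, that the two lists come from disjoint random training sets of n samples each, and that each list holds the top α fraction of genes. All function names, the toy data generator, and the parameter values are hypothetical.

```python
import numpy as np


def top_gene_list(expr, outcome, alpha):
    """Indices of the top alpha-fraction of genes, ranked by absolute
    Pearson correlation with the outcome (an assumed ranking rule)."""
    n_genes = expr.shape[1]
    n_top = max(1, int(round(alpha * n_genes)))
    # Centre both arrays so the sums below give the Pearson correlation.
    xc = expr - expr.mean(axis=0)
    yc = outcome - outcome.mean()
    denom = np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = (xc * yc[:, None]).sum(axis=0) / denom
    order = np.argsort(-np.abs(corr))
    return set(order[:n_top].tolist())


def overlap(list_a, list_b):
    """Fraction of genes shared by two equally sized top-gene lists."""
    return len(list_a & list_b) / len(list_a)


def mean_overlap(expr, outcome, n, alpha, n_trials, rng):
    """Average overlap over repeated draws of two disjoint training sets
    of n samples each (disjoint sampling is an assumption of this sketch)."""
    n_samples = expr.shape[0]
    if 2 * n > n_samples:
        raise ValueError("need at least 2 * n samples for disjoint sets")
    values = []
    for _ in range(n_trials):
        perm = rng.permutation(n_samples)
        first, second = perm[:n], perm[n:2 * n]
        list_a = top_gene_list(expr[first], outcome[first], alpha)
        list_b = top_gene_list(expr[second], outcome[second], alpha)
        values.append(overlap(list_a, list_b))
    return float(np.mean(values))


def make_synthetic(n_samples, n_genes, n_informative, effect, rng):
    """Toy data: Gaussian noise plus a weak outcome signal in the
    first n_informative genes (purely illustrative)."""
    outcome = rng.standard_normal(n_samples)
    expr = rng.standard_normal((n_samples, n_genes))
    expr[:, :n_informative] += effect * outcome[:, None]
    return expr, outcome


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    expr, outcome = make_synthetic(2000, 1000, 50, 0.3, rng)
    for n in (50, 200, 1000):
        f = mean_overlap(expr, outcome, n=n, alpha=0.05, n_trials=20, rng=rng)
        print(f"n = {n:4d}  mean overlap = {f:.2f}")
```

With weakly informative genes, the two lists in this toy setting tend to share few genes when n is small and to converge as n grows, which is the qualitative behaviour the abstract describes. The numbers it prints say nothing about the real data sets.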

Conflict of interest statement

No conflicts declared.

Figures

**Fig. 1.**
The overlap f of two top-gene lists derived from data of van de Vijver *et al.* (10), with α = 0.0046 (corresponding to predictive lists of 70 genes). (a) The mean and standard deviation (represented by vertical bars) of f for various values of n. (b) The probability distribution of f for the same values of n.

**Fig. 2.**
The mean overlap f*_n as a function of the number of samples, for six different data sets, for α = 0.012. The vertical bars indicate one standard deviation. Analytic estimations are in blue, and the results of simulations are in red. For each data set, the range of n for which results are presented reflects the number of samples of the particular experiment. Numbers in parentheses refer to the reference from which the data were taken.

**Fig. 3.**
The typical overlap f*_n as a function of the number of samples, for the six different data sets (α = 0.012 was used). All curves except lung cancer (3) were produced using the analytical results. Because no agreement was found between simulation and analytical results for lung cancer (3), this curve was produced using extrapolation of simulation results (see *Materials and Methods*). Numbers in parentheses refer to the reference from which the data were taken.

Cited by

Reproducibility and concordance of differential DNA methylation and gene expression in cancer.
Yao C, Li H, Shen X, He Z, He L, Guo Z. PLoS One. 2012;7(1):e29686. doi: 10.1371/journal.pone.0029686. Epub 2012 Jan 3. PMID: 22235325. Free PMC article.
Algebraic comparison of partial lists in bioinformatics.
Jurman G, Riccadonna S, Visintainer R, Furlanello C. PLoS One. 2012;7(5):e36540. doi: 10.1371/journal.pone.0036540. Epub 2012 May 17. PMID: 22615778. Free PMC article.
scFed: federated learning for cell type classification with scRNA-seq.
Wang S, Shen B, Guo L, Shang M, Liu J, Sun Q, Shen B. Brief Bioinform. 2023 Nov 22;25(1):bbad507. doi: 10.1093/bib/bbad507. PMID: 38221903. Free PMC article.
All (remains) in the family? Using healthy relatives to define Crohn's gut microbiome alterations.
Amir A, Haberman Y. Cell Rep Med. 2024 Jul 16;5(7):101651. doi: 10.1016/j.xcrm.2024.101651. PMID: 39019007. Free PMC article.
GPDRP: a multimodal framework for drug response prediction with graph transformer.
Yang Y, Li P. BMC Bioinformatics. 2023 Dec 17;24(1):484. doi: 10.1186/s12859-023-05618-0. PMID: 38105227. Free PMC article.

References

1. Early Breast Cancer Trialists’ Collaborative Group. Lancet. 1998;352:930–942.
2. Bair E., Tibshirani R. PLoS Biol. 2004;2:5011–5022.
3. Beer D. G., Kardia S. L., Huang C. C., Giordano T. J., Levin A. M., Misek D. E., Lin L., Chen G., Gharib T. G., Thomas D. G., et al. Nat. Med. 2002;8:816–824.
4. Khan J., Wei J. S., Ringner M., Saal L. H., Ladanyi M., Westermann F., Berthold F., Schwab M., Antonescu C. R., Peterson C., et al. Nat. Med. 2001;7:673–679.
5. Rosenwald A., Wright G., Chan W. C., Connors J. M., Campo E., Fisher R. I., Gascoyne R. D., Muller-Hermelink H. K., Smeland E. B., Giltnane J. M., et al. N. Engl. J. Med. 2002;346:1937–1947.
