Proc Natl Acad Sci U S A. 2019 May 7;116(19):9501-9510.
doi: 10.1073/pnas.1901695116. Epub 2019 Apr 23.

In silico learning of tumor evolution through mutational time series

Noam Auslander et al. Proc Natl Acad Sci U S A. 2019.

Abstract

Cancer arises through the accumulation of somatic mutations over time. Understanding the sequence of mutation occurrence during cancer progression can assist early and accurate diagnosis and improve clinical decision-making. Here we employ long short-term memory (LSTM) networks, a class of recurrent neural network, to learn the evolution of a tumor through an ordered sequence of mutations. We demonstrate the capacity of LSTMs to learn complex dynamics of the mutational time series governing tumor progression, allowing accurate prediction of the mutational burden and the occurrence of mutations in the sequence. Using the probabilities learned by the LSTM, we simulate mutational data and show that the simulation results are statistically indistinguishable from the empirical data. We identify passenger mutations that are significantly associated with established cancer drivers in the sequence and demonstrate that the genes carrying these mutations are substantially enriched in interactions with the corresponding driver genes. Breaking the network into modules consisting of driver genes and their interactors, we show that these interactions are associated with poor patient prognosis, thus likely conferring growth advantage for tumor progression. Thus, application of LSTM provides for prediction of numerous additional conditional drivers and reveals hitherto unknown aspects of cancer evolution.

Keywords: cancer progression; driver mutations; machine learning; neural networks; passenger mutations.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
Prediction of tumor mutational load from the mutational time series. (A) The test AUCs (y axis) obtained for training LSTMs on different lengths of mutation sequences (x axis) when starting from the latest ordered mutation, for colon and lung test sets. (B–D) Correlation between the score assigned by LSTMs with the 20 latest mutations in the time series (x axis) and the true mutational load (y axis, log transformed). (E–G) Spearman correlation between the scores assigned by different learning models and the observed mutational load (y axis) when using different numbers of mutations from those ordered latest in the sequence (up to 50 mutations, x axis). The dashed lines show the results for classifiers trained on randomly selected mutations (rather than the ordered sequence of mutations, which is shown by solid lines). (H–K) Scores assigned by LSTM using the last 20 mutations for colon cancer patients with different clinical characteristics. (L) Progression-free survival (PFS) of lung cancer patients in the test set, for samples assigned high vs. low scores (split at the median) with the last 20 mutations. COAD, colon adenocarcinoma; LUAD, lung adenocarcinoma.
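The Spearman correlation reported in panels E–G compares the ranks of the model scores with the ranks of the observed mutational load. The following is a minimal sketch of that statistic in plain Python, assuming tied values receive their average rank; the function names and the toy numbers are illustrative and are not taken from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical model scores and log-transformed mutational loads.
scores = [0.10, 0.40, 0.35, 0.80]
log_load = [1.0, 2.3, 2.0, 3.1]
rho = spearman(scores, log_load)
```

Because the statistic depends only on ranks, a monotone transformation such as the log transform of the mutational load leaves it unchanged.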
Fig. 2.
Prediction and generation of the sequence of mutations. (A) Histogram of mutation counts (y axis) for each performance level (AUC; x axis) for mutation prediction in a sequence for the colon cancer test sets (pink and purple bars for test sets 1 and 2, respectively) and for the lung cancer test set (green bars). (B) Mean AUC of mutation prediction in the sequence for the two colon test sets: comparison of drivers with all other genes. (C) AUC of mutation prediction in the sequence for the lung test set: comparison of drivers with all other genes. (D and E) Scatter plots of PC1–PC3 obtained by PCA applied to the combined mutational data from all datasets used and the simulated samples, for colon and lung cancers, respectively. The percentage of variance explained by each PC is indicated in parentheses. (F and G) Presence–absence patterns for the high-frequency cancer drivers in the reconstructed mutational samples (red) and the TCGA mutational data (blue) for colon and lung cancers, respectively. The samples are ordered by the hierarchical clustering results, with the Euclidean distance metric and average linkage.
Fig. 3.
STRING-validated interactions of major drivers in colon and lung cancers. (A and B) Heatmaps showing, for each major colon and lung cancer driver, respectively, the number of STRING interactions within the mutational data (first row), the number of predicted interactions (second row), the number of interactions in the intersection (third row), and the log-transformed hypergeometric P value (fourth row). (C and D) The networks of STRING-validated interactions for colon and lung cancers, respectively.
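The hypergeometric P value in panels A and B asks how likely it is to see an overlap at least as large as the observed one between a driver's STRING interactions and its predicted interactions by chance. The sketch below computes that upper-tail probability with the standard library only. It assumes a one-sided test over a fixed gene universe; the function name, parameter names, and numbers are hypothetical and are not the paper's values.

```python
from math import comb


def hypergeom_upper_tail(overlap, universe, string_hits, predicted):
    """P(X >= overlap) for X ~ Hypergeometric.

    universe    -- number of genes considered in total
    string_hits -- genes that interact with the driver according to STRING
    predicted   -- genes predicted to interact with the driver
    overlap     -- genes found in both sets
    """
    upper = min(string_hits, predicted)
    total = comb(universe, predicted)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    tail = sum(
        comb(string_hits, i) * comb(universe - string_hits, predicted - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Toy example: 10 genes, 4 STRING interactors, 3 predicted, 2 in common.
p_value = hypergeom_upper_tail(2, 10, 4, 3)
```

A small P value indicates that the predicted interactors overlap the STRING interactors more than random gene sets of the same size would.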
Fig. 4.
GO enrichment of the predicted interactions of major cancer drivers. (A and B) The gray bars show the fraction of GO processes associated with each major cancer driver that are significantly shared with its predicted interactors, for colon and lung cancers, respectively. The dot plots show the percentage of overlap of GO processes between each major driver and its predicted interactors (the red bar shows the mean of this distribution). (C) Heatmaps for GO processes enriched with the shared colon and lung major drivers and their interactors (presented are the interactors that are most strongly associated with these GO processes; for full information, see Dataset S6).
Fig. 5.
Modules of drivers and interactors in colon cancer. (A–C) The complete networks of interactions between modules of major colon cancer drivers (modules I–III, respectively) and their predicted shared interactors. (D–F) Kaplan–Meier survival curves of TCGA colon cancer samples with high mutation rate of driver modules I–III, respectively, with high vs. low number of mutations in the interactors of these modules (defined by the median). (G–I) Kaplan–Meier survival curves of TCGA colon cancer samples with vs. without mutations in individual interactors of these modules.
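The survival comparisons in panels D–F split samples at the median number of interactor mutations and plot a Kaplan–Meier curve for each group. The following is a minimal sketch of both steps, assuming that samples strictly above the median count as "high" (the caption does not say how ties are handled). All names and data are hypothetical.

```python
from statistics import median


def split_by_median(counts):
    """Return a list of booleans: True where the count is above the median."""
    cutoff = median(counts)
    return [c > cutoff for c in counts]


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival) at each event time.

    events[i] is 1 if the event was observed at times[i], 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Process every sample that shares this time point together.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Toy data: interactor mutation counts, follow-up times, and event flags.
counts = [1, 7, 3, 9]
high = split_by_median(counts)
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

In practice each group selected by `split_by_median` would get its own curve, and the two curves would be compared with a test such as the log-rank test, which is not shown here.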
Fig. 6.
Modules of drivers and interactors in lung cancer. (A and B) The complete networks of interactions between modules of major lung cancer drivers (modules I and II, respectively) and their predicted shared interactors. (C–F) Kaplan–Meier survival curves of TCGA colon cancer samples (overall survival; C and E) and the lung cancer test set samples (PFS; D and F) with high mutation rate of driver modules I and II, respectively, with high vs. low number of mutations in the interactors of these modules (defined by the median). (G) Kaplan–Meier survival curves of TCGA colon cancer samples with high mutation rate of driver module I (defined by the median), with vs. without mutations of individual interactors of module I.
