Abstract
Large reference datasets of protein-coding variation in human populations have allowed us to determine which genes and genic subregions are intolerant to germline genetic variation. There is also a growing number of genes implicated in severe Mendelian diseases that overlap with genes implicated in cancer. We hypothesized that cancer-driving mutations might be enriched in genic subregions that are depleted of germline variation relative to somatic variation. We introduce a new metric, OncMTR (oncology missense tolerance ratio), which uses 125,748 exomes in the Genome Aggregation Database (gnomAD) to identify these genic subregions. We demonstrate that OncMTR can significantly predict driver mutations implicated in hematologic malignancies. Divergent OncMTR regions were enriched for cancer-relevant protein domains, and overlaying OncMTR scores on protein structures identified functionally important protein residues. Last, we performed a rare variant, gene-based collapsing analysis on an independent set of 394,694 exomes from the UK Biobank and found that OncMTR markedly improves genetic signals for hematologic malignancies.
Analysis of ~125,000 human exomes reveals that cancer-causing mutations accrue in regions where inherited mutations are depleted.
INTRODUCTION
The availability of large-scale human genetic variation reference datasets has revolutionized our ability to identify disease-causing mutations (1, 2). Through the effective process of natural selection, variants with severe clinical outcomes are generally depleted in these datasets. We and others have leveraged this paradigm to develop intolerance metrics that quantify the extent to which natural selection constrains germline variation in genes and genic subregions (3–6). These methods have proven invaluable in prioritizing which of the roughly 20,000 protein-coding variants observed in any given individual are most likely to contribute to disease. Interpreting variants in the context of cancer suffers from similar challenges as interpreting germline variation: Cancer cells often carry thousands of somatic mutations, but only some of these drive the oncogenic process. Despite their success in prioritizing germline variants, population genetics–based approaches have yet to be applied in the context of distinguishing between somatic cancer driving mutations and neutral “passenger” mutations.
Many developmental disorder-causing germline mutations occur in essential genic subregions, leading to dysfunction of crucial cellular biology pathways. We postulated that if these same mutations arise mitotically later in life, then they will not cause the same developmental disease because of the more limited expression of the mutation but could have equally profound impacts on cellular biology. Consistent with this, there are several examples whereby identical point mutations that cause severe developmental syndromes when mutated in the germ line cause cancer when mutated somatically (7, 8), including identical mutations in PTEN, ASXL1 (9), EZH2 (10), and others (11). Many of these genes are involved in cell proliferation, chromatin remodeling, genome maintenance, and signal transduction pathways. This convergence highlights a subset of genes in the human genome that are crucial to cell biology, whereby disruptive mutations can cause different clinical outcomes depending on their timing, localization, and cellular context.
Here, we hypothesized that regions of genes that are under strong negative selection for germline variation but are exceptionally mitotically mutable would be enriched for variants that increase cancer risk. Identifying these genic subregions could help prioritize cancer-driving mutations. We focus on missense variants because they are the most commonly observed protein-coding variant class and are becoming increasingly clinically actionable (12) but are also more difficult to interpret than protein-truncating variants. We previously introduced the missense tolerance ratio (MTR), a sliding window–based approach that detects genic subregions depleted of missense variation (6). In this study, we extended this method to produce a score [OncMTR (oncology missense tolerance ratio)] that identifies genic subregions depleted of germline variation relative to somatic variation, using exome data from 125,748 individuals in the Genome Aggregation Database (gnomAD) (1). We demonstrate that OncMTR effectively predicts driver mutations of hematologic malignancies. We also use 394,694 UK Biobank (UKB) exomes to illustrate the utility of OncMTR in prioritizing variants in genetic discovery for cancer phenotypes. This work introduces a population genetics approach to identifying genic subregions enriched for cancer-related somatic missense mutations. We accompany our work with a web app that enables easy visualization of OncMTR scores for each protein-coding gene: http://oncmtr.public.cgr.astrazeneca.com.
RESULTS
Putative somatic variants in gnomAD
Population-level catalogs of human genetic variation allow for the investigation of selective constraint and mutational patterns in the exome. We used the gnomAD database of 125,748 human exomes to survey both germline and somatic variants (1). Although the gnomAD variant calling pipeline was tuned to detect germline variation, we reasoned that we may also be able to identify somatic variants that reach a sufficiently high variant allele frequency to be detected through their germline variant caller. Inherited heterozygous germline variants are expected to have an allelic ratio close to 50%. We observed that the distribution of median allelic balance (AB_median) values for gnomAD variants followed a bimodal distribution, with one distribution centered around 50% and another smaller distribution centered around 20% (Fig. 1A).
Defining OncMTR
We previously introduced a sliding window–based metric, the MTR, which measures purifying selection on missense variation in genic subregions (6). This score demonstrably detects crucial functional domains of proteins that can cause Mendelian disease when mutated in the germ line. Motivated by the overlap between mutations associated with Mendelian disease and cancer, we set out to create a cancer-relevant version of MTR (see Methods) that captures regions that are depleted of germline variation relative to somatic variation. In this study, we defined another variation of the MTR score, namely, MTRgermline. In its construction, MTRgermline is restricted to only those variants achieving an AB_median > 0.3. Taking the well-known cancer gene TP53 as an example, we can observe those genic subregions where the two MTR formulations diverged (Fig. 1B). We then define OncMTR as the difference between these two MTR formulations for each codon, using a 31-codon sliding window (Fig. 1B). Negative scores correspond to regions with the greatest divergence between germline intolerance and somatic variant enrichment. Overlaying OncMTR scores on the AlphaFold-predicted structure of TP53 (13) illustrated that the strongest negative scores correspond to the DNA binding domain, which is the domain enriched for mutations known to drive hematologic malignancies (Fig. 1C).
Using OncMTR to prioritize driver mutations in hematologic malignancies
Motivated by the positive proof of concept demonstrated for TP53, we next tested whether the MTR and MTRgermline distributions differed across other oncogenes and tumor suppressor genes included in the Catalogue of Somatic Mutations in Cancer (COSMIC) Cancer Gene Census (CGC). The CGC is divided into two tiers, with tier 1 containing bona fide cancer genes (n = 556) and tier 2 containing genes that have strong indications of playing a role in cancer but with less expansive evidence than tier 1 (n = 137). The difference between MTR and MTRgermline distributions per gene, calculated via cross-entropy, was significantly higher for tier 1 genes than a random selection of 556 non-CGC genes (P = 5.7 × 10−31), the remainder of the exome (P = 2.8 × 10−67), and tier 2 genes (P = 1.1 × 10−7) (Fig. 2A). The cross-entropy was also significantly larger for tier 2 genes than the remaining genes in the exome (P = 2.6 × 10−4) (Fig. 2A). We also compared the cross-entropy between genes annotated in IntOGen as loss of function (LoF) versus “activating” (i.e., gain of function) and explored the signals from the CGC tier 1 and 2 gene sets, excluding genes acting only through an LoF mechanism (fig. S10). Overall, there is a significant difference between MTR and MTRgermline distributions for all genes regardless of the mechanism of action of the mutational cancer driver genes, when compared against the rest of the exome. However, when restricting the analysis to genes with a gain-of-function mechanism (i.e., excluding those with an LoF mechanism), we observe a stronger enrichment signal compared to the rest of the exome or between activating and “LoF” gene sets. For example, the IntOGen activating cross-entropy was significantly more divergent from the rest of the exome than the IntOGen LoF cross-entropy (P = 0.01, Mann-Whitney U test; fig. S10A). 
The same pattern was observed for the CGC tier 1 and 2 datasets, although it was significant only for CGC tier 1 (P = 0.037 and P = 0.28, respectively, Mann-Whitney U test; fig. S10, B and C). Together, these results support the hypothesis that mitotically mutable genic subregions that are intolerant to germline variation are broadly relevant to cancer.
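As a rough illustration of how a per-gene cross-entropy between two score distributions can be computed, the sketch below bins the MTR and MTRgermline scores of one gene into a shared histogram and evaluates H(p, q) = −Σ p log q. The bin range, bin count, pseudocount, and the direction of the comparison are all assumptions made for illustration; the text does not specify them, and the function names are hypothetical.

```python
import math

def histogram(values, lo=0.0, hi=1.5, bins=15, pseudo=1e-6):
    """Normalized histogram over a fixed range (range, bins, pseudocount are assumed)."""
    counts = [pseudo] * bins  # pseudocount avoids log(0) for empty bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(bins - 1, max(0, int((v - lo) / width)))  # clamp out-of-range values
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i) for two discrete distributions on the same bins."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# Toy per-codon scores for one hypothetical gene (not real data).
mtr_scores = [0.5, 0.6, 0.9, 1.0]
mtr_germline_scores = [0.1, 0.2, 0.3, 0.4]

p = histogram(mtr_scores)
q = histogram(mtr_germline_scores)
divergent = cross_entropy(p, q)
identical = cross_entropy(p, p)  # equals the entropy of p, the minimum possible value
```

By Gibbs' inequality, H(p, q) is smallest when q equals p, so larger values indicate that the two formulations disagree more strongly within the gene.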
Distinguishing between cancer-causing driver mutations and neutral passenger mutations remains a central challenge in cancer genomics. We thus tested whether OncMTR could help prioritize somatic mutations that cause hematologic malignancies. We found that the OncMTR scores of a previously defined list of 546 unique leukemogenic driver mutations (table S1) (14) were significantly lower than those of a size-matched set of random variants (P = 2.97 × 10−86, Mann-Whitney U test; fig. S1A). A random forest model using OncMTR achieved an area under the receiver operating characteristic curve (AUC) of 0.74 in discriminating between these leukemogenic variants and the random set (Fig. 2B). We also calculated transcript-level percentiles for the OncMTR scores, in which lower percentiles corresponded to lower OncMTR scores. The AUC for the OncMTR transcript percentiles was 0.76, and a combined model that incorporated both the raw OncMTR scores and transcript percentiles achieved an even higher AUC of 0.78 (Fig. 2B).
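For intuition on how a single score separates two variant sets, the AUC of the raw score can be computed directly from pairwise comparisons, which is equivalent to the normalized Mann-Whitney U statistic. The sketch below is a simplification and not the random forest model used here: it treats lower scores as more driver-like and uses toy values; the function name is hypothetical.

```python
def auc_lower_is_positive(driver_scores, neutral_scores):
    """AUC of a single score where lower values indicate the positive (driver) class.

    Equals the probability that a randomly chosen driver scores below a randomly
    chosen neutral variant, counting ties as one half (normalized Mann-Whitney U).
    """
    wins = 0.0
    for d in driver_scores:
        for n in neutral_scores:
            if d < n:
                wins += 1.0
            elif d == n:
                wins += 0.5
    return wins / (len(driver_scores) * len(neutral_scores))

# Toy scores (not real data): three of the four driver/neutral pairs are ordered correctly.
toy_auc = auc_lower_is_positive([-0.3, -0.1], [-0.2, 0.1])
```

The quadratic loop is adequate for a few hundred variants per set; a rank-based implementation would be preferable for larger sets.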
We next compared the performance of OncMTR with other population- and evolutionary-based scores, including CADD (15), ncER (16), LINSIGHT (17), and phyloP (18), in predicting these leukemogenic variants. Here, we defined the putatively neutral variant set as a random set that was both size- and transcript-matched with the leukemogenic set. Transcript matching ensures that we can detect the power of each score to distinguish driver from neutral mutations within the same genes, rather than with neutral variants in genes that may have different score profiles than the carrier genes. In comparing the mean AUC across fivefold cross-validation via logistic regression, OncMTR outperforms the other scores, achieving an AUC of 0.71, compared to AUCs of 0.61, 0.58, 0.52, and 0.50, achieved by ncER, LINSIGHT, CADD, and phyloP, respectively. This highlights OncMTR’s ability to capture orthogonal and previously unidentified information compared to existing scores in picking up leukemogenic variants, which were not a focus set of variants in the training and development of many existing variant effect scores.
As constructed, OncMTR compares the ratio of observed missense variants to synonymous variants over an expected ratio. Because the expected number of mutations could be influenced by local sequence context, we also constructed a score that accounts for trimer mutability rate in its construction (“OncMTR-mutrate”; see Methods). OncMTR-mutrate was considerably correlated with the original (nonmutability rate–informed) version of OncMTR, achieving a Pearson’s correlation of 0.75 across all transcripts. We also assessed the predictive power of OncMTR-mutrate in classifying leukemogenic variants against a random size- and transcript-matched set of putative neutral variants. OncMTR-mutrate achieved an AUC of 0.69 compared to the original OncMTR, which achieved an AUC of 0.71 on the same dataset (Fig. 2C). We also make the OncMTR-mutrate version of scores across all transcripts publicly available through the OncMTR Viewer web app (http://oncmtr.public.cgr.astrazeneca.com).
To further assess the capacity of OncMTR to prioritize driver mutations, we trained random forest models with raw OncMTR scores using fivefold cross-validation. The mean AUC for predicting leukemogenic variants was 0.74 (fig. S2). We next assessed the performance of OncMTR in distinguishing between a set of random variants and 200 established driver mutations implicated in acute lymphocytic leukemia (ALL), acute myeloid leukemia (AML), chronic lymphocytic leukemia (CLL), diffuse large B cell lymphoma (DLBCL), or multiple myeloma (MM), achieving an AUC of 0.65 (fig. S2), with the two sets showing significantly different OncMTR distributions (P = 4.89 × 10−5, Mann-Whitney U test; fig. S1B and table S2). Logistic regression–based classifiers achieved similar, albeit marginally lower, AUCs than the random forest models (with AUCs of 0.73 and 0.62 for the two variant sets, respectively), likely due to a small degree of nonlinear distribution of OncMTR scores (fig. S6). To determine which OncMTR cutoff could maximize sensitivity and specificity, we calculated the Youden index from each of the above learning tasks (see Methods). The optimal cutoff was identified as OncMTR < −0.05. Together, these results demonstrate the utility of our population genetics–based approach in identifying genic subregions relevant to hematologic malignancies.
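The Youden index, J = sensitivity + specificity − 1, can be evaluated at every candidate cutoff and the maximizing cutoff retained. The sketch below applies this directly to raw scores with the rule "score < cutoff means predicted driver"; applying it to raw scores rather than to model outputs is a simplifying assumption, and the toy values and function name are illustrative only.

```python
def youden_cutoff(driver_scores, neutral_scores):
    """Return (cutoff, J) maximizing Youden's J for the rule: score < cutoff -> driver."""
    best_cutoff, best_j = None, -1.0
    # Every observed score is a candidate cutoff.
    for t in sorted(set(driver_scores) | set(neutral_scores)):
        sensitivity = sum(s < t for s in driver_scores) / len(driver_scores)
        specificity = sum(s >= t for s in neutral_scores) / len(neutral_scores)
        j = sensitivity + specificity - 1.0
        if j > best_j:  # the first (lowest) cutoff wins on ties
            best_cutoff, best_j = t, j
    return best_cutoff, best_j

# Toy scores (not real data): the two sets are perfectly separated at 0.0.
cutoff, j_stat = youden_cutoff([-0.3, -0.2, -0.1], [0.0, 0.02, 0.1])
```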
Because the somatic mutations used to calculate OncMTR arose in the blood, we expected that OncMTR would more reliably prioritize driver mutations in hematologic malignancies than in solid tumors. As expected, the OncMTR scores of driver mutations implicated in heme malignancies were significantly lower than those of driver mutations implicated in solid tumors (P = 2.53 × 10−9, Mann-Whitney U test; Fig. 2D). More generally, we find that 7.2% of bases in hematological malignancy genes and 24.7% of the driver missense variants among them correspond to low OncMTR regions (<−0.05). This is in comparison to 3.5% of bases in other genes in the genome (P = 2.50 × 10−182 and P = 5.10 × 10−41, respectively, Fisher’s exact test). To determine whether OncMTR performs better for certain subtypes of heme malignancies, we compared OncMTR distributions of putative driver and passenger mutations identified in a recent comprehensive in silico saturation mutagenesis experiment (19). This dataset includes simulated variants across three genes for CLL, nine genes for AML, two genes for non-Hodgkin lymphoma, five genes for lymphoma, six genes for MM, and two genes for ALL (table S10). The OncMTR scores of predicted driver mutations were significantly lower than those of passenger mutations for each cancer subtype, although we observed the strongest separation in CLL (P < 2 × 10−308, Wilcoxon test) and AML (P = 1.4 × 10−155, Wilcoxon test) (fig. S3).
We next assessed whether OncMTR can successfully distinguish between ClinVar pathogenic and benign somatic variants. Logistic regression classification between pathogenic and benign or random variants across all protein-coding genes reached an AUC of 0.60 and 0.58, respectively [P = 815 unique pathogenic versus B = 58 unique benign variants; a set R (random) of equal size to P was sampled to compile the random variants; see also Methods] (fig. S4). We next restricted the set of pathogenic somatic variants to those occurring in genes associated with hematologic malignancies and compared them to benign or random variants. The AUC was 0.62 in distinguishing between pathogenic and benign variants in hematologic malignancy genes (P = 64 versus B = 20) and 0.67 when comparing to benign variants across the entire exome (P = 64 versus B = 58). The AUCs for pathogenic hematologic malignancy variants versus random variants were 0.61 for random variants restricted to heme genes (P = 64 versus R = 64) and 0.64 for random variants pulled from all protein-coding genes (P = 815 versus R = 815) (fig. S4). These results support the view that this blood-based sequencing version of OncMTR is more powerful in identifying pathogenic mutations implicated in heme malignancies.
Last, to further explore OncMTR’s power to agnostically detect putative oncogenic regions, we scanned all protein-coding genes in ClinVar in search of transcripts that are preferentially enriched for ClinVar pathogenic somatic variants in regions with OncMTR scores in the bottom 20th percentile of the full OncMTR distribution (see Methods). We identified 101 such transcripts from 24 unique genes (P < 0.05, Fisher’s exact test; table S11), with several known cancer driver genes captured, such as TP53, IDH1, ALK, and HNRNPA1 (20). Many of the top-ranked genes are implicated in hematologic malignancies, including MYC, MSH2, and FBXW7 (fig. S5) (21–23).
Genes carrying mutations implicated in both human Mendelian disease and cancer
There have been an increasing number of observations in which germline mutations in certain genic regions cause severe Mendelian disease, whereas identical somatic mutations in these regions, occurring later in life and localized to specific tissue(s), can cause cancer. Here, we examined OncMTR distributions for three genes implicated in both neurodevelopmental disease and adult leukemias: GNB1, NRAS, and DNMT3A (Fig. 3, A to C, and table S4).
Germline de novo mutations in GNB1 cause a severe developmental syndrome characterized by intellectual disability and other variable features, including hypotonia, seizures, and poor growth (8). Somatic mutations in this gene have been associated with ALL, CLL, and myelodysplastic syndrome (24). Three of four somatic driver mutations in this gene overlap with de novo mutations implicated in developmental delay (p.Asp76Gly, p.Ile80Thr, and p.Ile80Asn) (Fig. 3A) (8). All four mutated residues reside in a low OncMTR region (OncMTR < −0.05) of the gene, which corresponds to the Gβ protein surface that interacts with Gα subunits and downstream effectors (Fig. 3A).
NRAS encodes a RAS protein with intrinsic guanosine triphosphatase activity that has been implicated in multiple hematologic and solid malignancies (25). There are 28 somatic missense variants in this gene at four distinct amino acid positions associated with juvenile myelomonocytic leukemia and AML, all of which reside in low OncMTR regions (Fig. 3B) (14). Two of these mutations have also been reported as causal germline de novo mutations for Noonan syndrome, a developmental delay syndrome that includes congenital heart defects, short stature, and other features (p.Gly13Asp and p.Gly60Glu) (Fig. 3B) (26, 27).
DNMT3A encodes a DNA methyltransferase essential for DNA methylation during human embryogenesis and, when mutated somatically, increases risk of AML (28). In a large study on clonal hematopoiesis of indeterminate potential (CHIP), DNMT3A was found to harbor the largest proportion of CHIP variants of all CHIP-associated genes (29), suggesting that it is highly mitotically mutable. In line with this, the OncMTR distribution of this gene is highly enriched for negative values, even compared to GNB1 and NRAS (Fig. 3C). The R882 amino acid residue of DNMT3A corresponds to a DNA binding residue that is a major somatic mutation hotspot in CHIP and AML (28). De novo germline mutations at this residue are associated with an overgrowth syndrome called Tatton-Brown-Rahman syndrome characterized by tall stature and impaired intellectual development (30). Mutations at the R882 residue are thought to interfere with DNA binding, resulting in functional impairment of the protein and aberrant DNA methylation patterns (31). As expected, we identify that the leukemogenic variants in this gene are enriched in low OncMTR regions (Fig. 3C). Together, these results support the notion that some critically important genic subregions are exceptionally mitotically mutable, and mutations in these regions result in different phenotypic outcomes depending on timing and cellular context (9).
Enrichment of low OncMTR scores in protein domains
One strength of the sliding window approach implemented in OncMTR is that its estimates are independent of biological boundaries, such as annotated protein domains, which are not always well annotated. However, it is known that cancer-causing missense mutations tend to cluster in certain functional domains. We thus tested whether Pfam domains and domain superfamilies were enriched for low OncMTR regions (defined as OncMTR < −0.05). Across human protein-coding genes, low OncMTR regions were significantly enriched for several protein domains previously implicated in cancer, such as homeodomains (Bonferroni adjusted P = 4.9 × 10−46, Fisher’s exact test), protein kinase domains (Bonferroni adjusted P = 5.25 × 10−110, Fisher’s exact test), RING domains (Bonferroni adjusted P = 3.22 × 10−48, Fisher’s exact test), and others (Fig. 4, A and B, and tables S5 and S6). Furthermore, we found that proteins that had functional domains enriched for low OncMTR scores are significantly enriched in genes with TOPMed leukemogenic variants and known cancer hotspots (Fig. 4C and tables S1 to S3 and S7 to S9) (32). Among these two lists of genes, zinc finger motifs were found to be the most strongly enriched for low OncMTR scores (most significant adjusted P = 2.3 × 10−52 from the union list, based on Fisher’s exact test; Fig. 4, D to F), in line with their well-established role in cancer development (33). Although the calculation of OncMTR is agnostic to domain annotations, it independently identifies cancer-relevant functional genic subregions.
As low OncMTR scores can indicate functionally important sites within proteins, we sought to understand whether low OncMTR regions residing outside of annotated protein domains were enriched for any functional motifs. We found that low OncMTR regions share characteristics with structurally disordered regions and were not enriched for currently known sequence motifs (fig. S9).
Because of the high enrichment of protein domains within low OncMTR regions, we next explored whether OncMTR captures additional information orthogonal to the presence of a protein domain. To test this, we used a logistic regression classifier based on the presence or absence of a protein domain in predicting the leukemogenic variants. This classifier achieved an AUC of 0.62 (with fivefold cross-validation) as compared to the superior 0.71 achieved by OncMTR on the same dataset.
Informing rare variant collapsing analysis with OncMTR
With increasing adoption of next-generation sequencing to generate case-control cohorts, rare variant collapsing analysis has emerged as a powerful approach to detecting disease-associated genes for both rare and complex disorders. In this approach, the proportion of cases with a qualifying variant (QV) is compared to the proportion of controls with a QV in the same gene. We have previously shown that incorporating an MTR filter in defining QVs markedly improves rare variant collapsing analyses (2). In that phenome-wide association study (PheWAS) on approximately 300,000 exomes in the UKB, the collapsing analyses detected seven genes associated with hematologic malignancies (2). Here, we sought to test whether OncMTR would further improve collapsing analysis signals for hematologic malignancy associations by performing a collapsing analysis on 394,694 European exomes contained in the UKB focused on 1394 chapter IX (neoplasm) phenotypes. We defined a total of eight collapsing models with and without OncMTR filters (table S12). Imposing an OncMTR filter of −0.05 (i.e., only considering missense QVs that fall below this threshold) significantly increased the effect sizes of gene-phenotype associations (P < 0.0001) for each model (Fig. 5A and table S13). We observed genome-wide significant (P < 1 × 10−8) associations between several heme malignancies and DNMT3A, FBXW7, IDH2, IGLL5, JAK2, SF3B1, SRSF2, TET2, and TP53; in certain cases, the effect sizes were 10-fold greater than without adopting the OncMTR filter (Fig. 5B). We also found that the association between TP53 and CLL only reached significance in models including our OncMTR filter; for example, in the “raredmg” model, this association had a P value of 1.2 × 10−7 [odds ratio (OR) = 8.8; 95% confidence interval (CI): 4.8 to 16.0], whereas in the “raredmgoncmtr” model, the same association reached a P value of 3.4 × 10−10 (OR = 33.2; 95% CI: 16.1 to 68.7). 
Thus, applying the OncMTR filter effectively reduces background variation in the setting of gene-level collapsing analysis for hematological malignancy phenotypes, and we advise future large-scale hematological malignancy discovery studies to consider adopting an OncMTR filter for improved signal detection.
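A gene-level collapsing test reduces to a 2 × 2 table of QV carriers and noncarriers among cases and controls. The sketch below computes the odds ratio and a two-sided Fisher's exact P value for one such table. The choice of Fisher's exact test is an assumption (the test used for the collapsing analysis is not named in this section), and the counts and function names are toy illustrations. Tightening the QV definition with an OncMTR filter would change the carrier counts that enter the table.

```python
from math import comb

def odds_ratio(a, b, c, d):
    """Sample odds ratio for [[a, b], [c, d]]; None when it is undefined."""
    return None if b * c == 0 else (a * d) / (b * c)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value.

    a: cases with a QV      b: cases without a QV
    c: controls with a QV   d: controls without a QV
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x QV carriers among the cases.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)

# Toy table (not real data): 3 of 4 cases and 1 of 4 controls carry a QV.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
toy_or = odds_ratio(3, 1, 1, 3)
```

For biobank-scale counts this exhaustive enumeration is slow, and an established statistics library would be the practical choice.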
DISCUSSION
Determining the clinical relevance of missense variants in oncogenes remains a central challenge in cancer genetics (12, 32). Motivated by the observation that missense variants in certain genic subregions can cause severe Mendelian disease when mutated in the germline and cancer when mutated somatically, we introduced a population genetics–based framework called OncMTR to quantitate the divergence between germline constraint and somatic mutability across the human exome.
First, we demonstrated that oncogenes are enriched for these critically important regions that do not tolerate germline missense variants but harbor somatic mutations. We then illustrated that OncMTR can effectively distinguish between leukemogenic driver mutations and passenger mutations. Although OncMTR is calculated using a sliding window without any input of domain annotations, we found that genic subregions that have low OncMTR scores are significantly enriched for protein domains known to be relevant to cancer. Illustrative of our hypothesis was the observation that identical point mutations implicated in both severe Mendelian disease and leukemia in the genes GNB1, NRAS, and DNMT3A occur in low OncMTR regions. Last, we found that incorporating OncMTR in a gene-level collapsing analysis on hematologic malignancy phenotypes using 394,694 UKB exomes improved the signal-to-noise ratio for detecting hematologic malignancy associations. We have also developed a web server for visualization of OncMTR scores for each human protein-coding gene: http://oncmtr.public.cgr.astrazeneca.com.
Our findings have important implications for the disease biology of both severe Mendelian disorders and cancer. The convergence of genes and genic subregions between these two disease areas suggests that similar biological processes play a fundamental role in these two groups of phenotypes. Cellular proliferation, chromatin remodeling, cell migration, and other cancer-relevant processes have been implicated in neurodevelopmental diseases (11, 34–36). Furthermore, our work supports the notion that mutations in these genes have different phenotypic manifestations based on timing (i.e., zygote versus adulthood), localization (systemic versus hematological), and cellular context.
There exist many other approaches that aim to predict which genes and genic subregions are relevant to cancer. Many of these methods consider nonrandom clustering patterns of somatic mutations in either the linear protein sequence or three-dimensional space (37). OncMTR could improve the predictive performance of other, orthogonal driver mutation prediction approaches (38), as a recent in silico saturation mutagenesis experiment demonstrated the strength of incorporating multiple lines of evidence in prioritizing driver mutations (19).
One limitation of OncMTR in its current formulation is that it does not reflect the broader range of solid tumor malignancies because it is based on somatic mutations observed in blood-based sequencing. The general framework introduced in this study could also be applied to tumor-normal sequence datasets when sample sizes for those datasets increase to comparable numbers. Other future work should also focus on extending OncMTR to the noncoding genome, as population-level whole-genome sequencing data become available (39, 40). Furthermore, we used gnomAD because it represents the largest collection of publicly available aggregated allele frequency data. However, gnomAD variants were all called using a germline variant caller. While we demonstrated that we could detect somatic variants in this database, we were likely limited to those that reached a sufficiently high variant allele frequency to be detected. Applying somatic variant callers to these large-scale datasets could further improve the sensitivity of OncMTR.
METHODS
Reconstructing the MTR with 125,000 samples from gnomAD
We first reconstruct the MTR using a cohort of 125,748 exomes from the gnomAD Consortium (v2, GRCh38 liftover). The formula for deriving the window-based MTR scores was introduced in the original paper (6):
\[ \mathrm{MTR} = \frac{\mathrm{missense}_{\mathrm{obs}} / (\mathrm{missense}_{\mathrm{obs}} + \mathrm{synonymous}_{\mathrm{obs}})}{\mathrm{missense}_{\mathrm{poss}} / (\mathrm{missense}_{\mathrm{poss}} + \mathrm{synonymous}_{\mathrm{poss}})} \]
where the numerator represents the observed proportion of missense variants among the total observed protein-coding variation. The numerator is then scaled by the same proportion computed from the collection of all possible protein-coding variants in the corresponding protein-coding window. A window size of 31 codons was used for calculating MTR based on the gnomAD cohort, in agreement with the previously published score (6).
The expected proportion of missense variants in a given protein-coding window was calculated by annotating all possible variants of a protein-coding transcript with SnpEff 4.3t using GRCh38.92 as the reference annotation and assuming that all events were equally likely to occur. Annotation with SnpEff focused on single-nucleotide variants (SNVs) that were flagged as PASS variants in the original gnomAD release (v2). Variants annotated as “missense_variant” or “missense_variant&splice_region_variant” by SnpEff represent the set of “missense” variants in the MTR formula. Variants annotated as “synonymous_variant,” “stop_retained_variant,” “splice_region_variant&stop_retained_variant,” or “splice_region_variant&synonymous_variant” by SnpEff were considered as the “synonymous” variants in the same formula.
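The window-based calculation described above can be sketched as follows, taking per-codon counts of observed and possible missense and synonymous variants as input. How windows are handled at transcript ends (truncated here) and the return of None for windows with no observed variants are assumptions made for illustration, as are the function and argument names.

```python
def mtr_windows(obs_mis, obs_syn, poss_mis, poss_syn, window=31):
    """Per-codon MTR from per-codon variant counts over a centered sliding window.

    obs_*:  observed missense / synonymous variant counts per codon
    poss_*: counts of all possible missense / synonymous variants per codon
    Windows are truncated at transcript ends (an assumption of this sketch).
    """
    n = len(obs_mis)
    half = window // 2
    scores = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        om, osyn = sum(obs_mis[lo:hi]), sum(obs_syn[lo:hi])
        pm, psyn = sum(poss_mis[lo:hi]), sum(poss_syn[lo:hi])
        if om + osyn == 0 or pm == 0:
            scores.append(None)  # ratio undefined for this window
            continue
        observed = om / (om + osyn)   # observed missense proportion
        expected = pm / (pm + psyn)   # proportion among all possible variants
        scores.append(observed / expected)
    return scores

# Toy transcript of three codons with a three-codon window (not real data).
toy_mtr = mtr_windows([1, 0, 1], [1, 1, 0], [6, 6, 6], [3, 3, 3], window=3)
```

Values below 1 indicate fewer observed missense variants than expected from the sequence alone, which is the signal of missense intolerance.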
OncMTR score construction
Using MTR as our basis, we construct the OncMTR score (i.e., Oncology MTR score) to capture protein-coding subregions that are depleted of germline variation relative to somatic variation. We observe that the total distribution of AB_median values across all gnomAD variants (Fig. 1A) is bimodal, with the main peak centered close to 0.5 and a second peak at approximately 0.2. The AB_median metric represents the allelic ratio for each variant: Values close to 0.5 reflect an equal number of copies being inherited from each parent in heterozygous settings, whereas, among true biological variants, values approaching zero increasingly indicate variants that more likely arose somatically.
We leverage this observation to construct an alternative version of the original MTR score that excludes any putative somatic variants and uses only germline variants from the gnomAD dataset. We achieve this by selecting only variants with AB_median > 0.3, constructing the MTRgermline version of the score. OncMTR is then defined as the difference between the MTRgermline score and the original MTR score:
\[ \mathrm{OncMTR} = \mathrm{MTR}_{\mathrm{germline}} - \mathrm{MTR} \]
Negative OncMTR values (i.e., MTRgermline < MTR) represent regions that are depleted of germline variation relative to somatic variation, thus highlighting putative oncogenic subregions in protein-coding genes.
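A minimal sketch of this construction is given below. It computes the observed missense proportion in each window twice, once from all variants and once from variants with AB_median above the cutoff, and divides both by the same expected proportion, since the expected term depends only on the sequence. The Variant record, the precomputed expected_fraction input, and the handling of empty windows are assumptions made for illustration.

```python
from typing import List, NamedTuple, Optional

class Variant(NamedTuple):
    codon: int        # 0-based codon index in the transcript
    missense: bool    # True for missense, False for synonymous
    ab_median: float  # median allelic balance of the variant

def missense_fraction(variants: List[Variant], center: int, half: int) -> Optional[float]:
    """Observed missense proportion among variants within `half` codons of `center`."""
    in_window = [v for v in variants if abs(v.codon - center) <= half]
    if not in_window:
        return None
    return sum(v.missense for v in in_window) / len(in_window)

def oncmtr(variants, n_codons, expected_fraction, window=31, ab_cutoff=0.3):
    """Per-codon OncMTR = MTRgermline - MTR.

    expected_fraction[c] is the expected missense proportion for the window
    centered on codon c (assumed precomputed from all possible variants).
    """
    half = window // 2
    germline = [v for v in variants if v.ab_median > ab_cutoff]
    scores = []
    for c in range(n_codons):
        p_all = missense_fraction(variants, c, half)
        p_germ = missense_fraction(germline, c, half)
        if p_all is None or p_germ is None:
            scores.append(None)
        else:
            scores.append(p_germ / expected_fraction[c] - p_all / expected_fraction[c])
    return scores

# Toy example (not real data): one low-AB_median missense variant is dropped
# from the germline set, which lowers MTRgermline relative to MTR.
toy_variants = [Variant(0, True, 0.5), Variant(0, False, 0.5), Variant(1, True, 0.2)]
toy_scores = oncmtr(toy_variants, n_codons=2, expected_fraction=[0.75, 0.75], window=3)
```

In the toy example the putative somatic missense variant inflates MTR but not MTRgermline, so the resulting score is negative, which is the pattern the score is designed to flag.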
Sensitivity analysis on the definition of putative somatic variants
We base the selection of the AB_median cutoff for defining putative somatic variants on the empirical distribution of AB_median values derived from the gnomAD dataset (Fig. 1A). Specifically, we observe a bimodal distribution, with the two peaks centered at p1 = 0.186 and p2 = 0.477 on the x axis. The midpoint between the two peaks is m = 0.332 (fig. S8A). To be more conservative with regard to characterizing variants as putative somatic, we chose a cutoff slightly lower than this midpoint, picking 0.30 as our starting cutoff value.
Furthermore, we calculated additional versions of OncMTR scores resulting from different AB_median thresholds, specifically 0.20, 0.25, 0.28, 0.30, 0.32, and 0.35. Let us define OncMTRk as the set of OncMTR scores calculated using k as the AB_median threshold across all canonical transcripts. We calculated Pearson’s correlations for all available {OncMTRi, OncMTRj} pairs and observed that OncMTR0.30 has a correlation of 0.86 and 0.88 with OncMTR0.28 and OncMTR0.32, respectively, showing that the score remains robust around the 0.30 threshold (fig. S8B). The correlation also remains considerably high with OncMTR0.25 (r = 0.74), while it falls more rapidly when the threshold is shifted in the other direction (r = 0.68 with OncMTR0.35). This may indicate that increasing the AB_median threshold above 0.32 leads to a noticeable mixture of putative somatic and nonsomatic variants. On the basis of the above observations, we select 0.30 as the default AB_median cutoff to calculate OncMTR. We also report Pearson’s correlations between all {OncMTRi, OncMTRj} pairs for reference (fig. S8B).
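The sensitivity analysis can be illustrated with a short sketch that computes the midpoint between the two reported peaks and the pairwise Pearson correlations between score vectors obtained at different thresholds. The per-codon scores below are hypothetical placeholders, not values from the study, and the helper name pearson is an assumption.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Peak positions reported in the text; their midpoint is about 0.33.
p1, p2 = 0.186, 0.477
midpoint = (p1 + p2) / 2

# Hypothetical OncMTR scores for the same five codons at three thresholds.
scores_by_cutoff = {
    0.28: [-0.10, 0.00, -0.04, 0.02, -0.07],
    0.30: [-0.12, 0.00, -0.05, 0.01, -0.06],
    0.32: [-0.11, 0.01, -0.05, 0.00, -0.08],
}

# One correlation per unordered pair of thresholds.
correlations = {
    (i, j): pearson(scores_by_cutoff[i], scores_by_cutoff[j])
    for i, j in combinations(sorted(scores_by_cutoff), 2)
}
```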
Compilation of variant sets
We used a precompiled set of variants known to be drivers of hematologic malignancies in a total of 160 genes (41). This list was generated from recurrent hematologic somatic mutations in the literature and COSMIC, excluding genes with a relatively high proportion of LoF germline mutations. A second, smaller precompiled list focused on genes that were recurrent drivers specifically for myeloid malignancies (14). A third validation set included a list of annotated driver mutations provided through the IntOGen database (42). We restricted this set to “tier 1” (highest confidence) driver mutations observed in hematologic malignancies, which included ALL, AML, CLL, DLBCL, and MM. To identify variants overlapping between hematologic malignancy and neurodevelopmental disease, we identified genes from the leukemogenic set that had entries in the Online Mendelian Inheritance in Man for both adult leukemias and neurodevelopmental disease, which included GNB1, DNMT3A, and NRAS (43).
Classification of oncogenic variant sets with OncMTR
We classified different oncogenic variant sets (TOPMed leukemogenic and IntOGen drivers) against random variant sets of equal size. We used two supervised models for the binary classification task: logistic regression with “max_iter” = 1000 and a random forest classifier with “max_depth” = 2, the latter to avoid overfitting on the training set. Each classification was performed as a fivefold cross-validation task, and the mean AUC across all folds is reported to reflect the average performance of each learning task. The implementations of logistic regression and random forest were taken from the sklearn Python package (v0.22.1).
We also estimated the optimal OncMTR cut point for each classification by calculating Youden’s index for each learning task. The average cut point selected by Youden’s index across all classification tasks performed with logistic regression was YLR = −0.0409 (SD: 0.00126), while for random forest, it was YRF = −0.0614 (SD: 0.00057). The mean of the two averages is −0.05115, or −0.05 after rounding to two decimal places for simplicity. We thus consider OncMTR values of less than −0.05 to have the most discriminative power.
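For illustration, Youden's index J = sensitivity + specificity − 1 can be maximized over candidate thresholds as in the sketch below. It assumes that lower OncMTR scores indicate the positive (driver) class and applies the rule directly to raw scores rather than to classifier outputs; the function name and toy data are hypothetical.

```python
def youden_cutpoint(scores, labels):
    """Return (threshold, J) maximizing J when score <= threshold is called positive."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_threshold, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s <= t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s <= t)
        # sensitivity + specificity - 1 simplifies to TPR - FPR.
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_threshold, best_j = t, j
    return best_threshold, best_j


# Hypothetical scores: 1 = driver variant, 0 = random variant.
scores = [-0.20, -0.10, -0.06, -0.01, 0.00, 0.02]
labels = [1, 1, 1, 0, 0, 0]
threshold, j_stat = youden_cutpoint(scores, labels)

# Averaging the two reported mean cut points reproduces the value in the text.
mean_cut = (-0.0409 + -0.0614) / 2
rounded_cut = round(mean_cut, 2)
```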
Identifying OncMTR regions significantly enriched for ClinVar somatic variants
For this analysis, we use all ClinVar somatic variants (ORIGIN = 2) from the GRCh38 release (last accessed on 9 June 2019), focusing on those annotated as missense or synonymous. We consider as pathogenic variants those annotated as “Pathogenic” or “Likely_pathogenic” and as benign those annotated as “Benign” or “Likely_benign” (based on ClinVar). Classification between pathogenic and benign (or random) variant sets was performed with a logistic regression classifier with fivefold cross validation (sklearn, Python package v0.22.1). When restricting the classification to heme-implicated genes, we derived those gene sets based on the IntOGen annotation (table S10).
To identify genes/transcripts across the exome that are preferentially enriched for ClinVar somatic pathogenic variants in regions with low OncMTR scores, we use Fisher’s exact test. Specifically, we scan across each transcript and identify what percentage of the codons in each transcript achieves an OncMTR score in the bottom 20th percentile of the full OncMTR distribution (across the entire transcript). Then, we check whether known pathogenic or likely pathogenic ClinVar missense variants preferentially land in these codons (i.e., corresponding to low OncMTR scores) compared to the rest of the transcript. We apply Fisher’s exact test to evaluate the enrichment of each set of regions, i.e., those with low OncMTR scores versus the rest of the transcript. Finally, we rank each transcript based on the OR and significance of the Fisher’s exact test enrichments (table S11).
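The per-transcript enrichment test can be sketched with a standard-library implementation of the two-sided Fisher's exact test. The codon-level construction of the 2 × 2 table, the function names, and the toy transcript are assumptions made for illustration; the percentile cutoff is passed in as a precomputed value.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more likely than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def enrichment_table(codon_scores, pathogenic_codons, cutoff):
    """Count pathogenic and other codons inside and outside the low-score set."""
    low = {i for i, s in enumerate(codon_scores) if s <= cutoff}
    a = len(low & pathogenic_codons)       # pathogenic, low OncMTR
    b = len(low) - a                       # other, low OncMTR
    c = len(pathogenic_codons - low)       # pathogenic, rest of transcript
    d = len(codon_scores) - len(low) - c   # other, rest of transcript
    return a, b, c, d


# Hypothetical 10-codon transcript with three low-scoring codons.
codon_scores = [-0.20, -0.12, -0.08, 0.00, 0.01, 0.02, 0.00, 0.03, 0.01, 0.02]
pathogenic_codons = {0, 1, 5}
table = enrichment_table(codon_scores, pathogenic_codons, cutoff=-0.05)
p_value = fisher_exact_two_sided(*table)
odds_ratio = (table[0] * table[3]) / (table[1] * table[2])
```

The odds ratio shown here is the simple cross-product estimate and is undefined when either off-diagonal cell is zero; a real analysis would need to handle that case.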
Enrichment of low OncMTR scores in protein domains
To describe the functional context of OncMTR, we calculated enrichment of constrained regions in protein domain families. Residues within each canonical transcript (as defined by UniProtKB) were divided into two classes on the basis of their corresponding OncMTR scores: less than −0.05 (constrained; as defined by Youden’s index) and greater than or equal to −0.05 (relaxed). Domain and clan annotations for the human proteome were taken from the Pfam 34.0 database. DNA binding domains were pulled from a previous compendium (44). The final set of the canonical human proteome consisted of 18,313 annotated proteins. Enrichments of the constrained regions in protein domains were tested with Fisher’s exact test followed by Bonferroni correction, at an adjusted P value significance level of 0.05.
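The residue classification and the multiple-testing step can be sketched as follows. The Bonferroni adjustment multiplies each P value by the number of tests and caps the result at 1; the function names and the example P values are hypothetical.

```python
def classify_residue(score, cutoff=-0.05):
    """Label a residue by its OncMTR score relative to the cut point."""
    return "constrained" if score < cutoff else "relaxed"


def bonferroni(p_values):
    """Bonferroni-adjusted P values: p * number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical per-domain enrichment P values from Fisher's exact test.
raw_p = [0.01, 0.04, 0.5]
adjusted_p = bonferroni(raw_p)
significant = [p < 0.05 for p in adjusted_p]
```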
Analysis of low OncMTR regions outside of protein domains
Low OncMTR scores can indicate functionally important protein regions. We analyzed whether any genes had low OncMTR regions (i.e., spanning at least 30 residues with OncMTR lower than −0.05) residing outside of protein domains, but we did not identify any unannotated regions that could constitute previously unknown domains. According to UniProt annotations, 38% of these regions (372 of 978 in total) overlap with predicted disordered regions.
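A minimal sketch of how such regions could be extracted from per-residue scores is given below. It assumes that "spanning at least 30 residues" means a contiguous run of residues below the cut point, and it uses 0-based inclusive coordinates; the function names and inputs are hypothetical.

```python
def low_oncmtr_regions(scores, cutoff=-0.05, min_len=30):
    """Return (start, end) index pairs of contiguous runs with score < cutoff."""
    regions, start = [], None
    for i, s in enumerate(scores):
        if s is not None and s < cutoff:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_len:
                regions.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the protein.
    if start is not None and len(scores) - start >= min_len:
        regions.append((start, len(scores) - 1))
    return regions


def outside_domains(regions, domains):
    """Keep regions that do not overlap any annotated domain interval."""
    return [
        (s, e)
        for s, e in regions
        if all(e < d_start or s > d_end for d_start, d_end in domains)
    ]


# Hypothetical protein: one 30-residue run and one 10-residue run of low scores.
scores = [0.0] * 5 + [-0.1] * 30 + [0.0] * 5 + [-0.1] * 10
regions = low_oncmtr_regions(scores)
```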
To find common sequence motifs that would be indicative of functionally important sites, we first obtained clusters of low OncMTR regions of similar sequences (using CD-HIT, at 60% sequence identity). We found only three clusters consisting of multiple (i.e., two) short sequences. To find sequence motifs shared within those clusters, we ran MEME searches, which did not return any significant motifs.
Last, to show that those low OncMTR regions do not significantly differ from other regions outside of Pfam domains in terms of structural disorder, we analyzed their per-residue confidence scores [pLDDT (predicted local distance difference test)] derived from AlphaFold2 models (13). In general, regions with lower pLDDT scores have a higher chance of being unstructured. As shown in fig. S9, pLDDTs for Pfam domains are significantly higher than for the regions outside of annotated domains.
UKB collapsing analysis
Collapsing analyses were performed using the 394,694 exomes available in the UKB cohort (45). The UKB is a prospective study of approximately 500,000 participants aged 40 to 69 years old at time of recruitment. Participants were recruited in the United Kingdom between 2006 and 2010 and are continuously followed. The average age at recruitment for sequenced individuals was 56.5 years, and 54% of the sequenced cohort is of female genetic sex. Participant data include health records that are periodically updated by the UKB, self-reported survey information, linkage to death and cancer registries, collection of urine and blood biomarkers, imaging data, accelerometer data, and various other phenotypic end points. All study participants provided informed consent, and the UKB has approval from the North-West Multi-centre Research Ethics Committee (11/NW/0382).
We performed a gene-based collapsing analysis on 1394 chapter II (neoplasm) phenotypes adopting our previously described approach (2). We implemented a total of eight dominant collapsing models with and without OncMTR filters (table S12). Using SnpEff (46), we defined protein truncating variants (PTVs) as variants annotated as exon_loss_variant, frameshift_variant, start_lost, stop_gained, stop_lost, splice_acceptor_variant, splice_donor_variant, gene_fusion, bidirectional_gene_fusion, rare_amino_acid_variant, and transcript_ablation. We defined missense as follows: missense_variant_splice_region_variant and missense_variant. Nonsynonymous variants included the following: exon_loss_variant, frameshift_variant, start_lost, stop_gained, stop_lost, splice_acceptor_variant, splice_donor_variant, gene_fusion, bidirectional_gene_fusion, rare_amino_acid_variant, transcript_ablation, conservative_inframe_deletion, conservative_inframe_insertion, disruptive_inframe_insertion, disruptive_inframe_deletion, missense_variant_splice_region_variant, missense_variant, and protein_altering_variant. We derived allele frequencies from gnomAD (1). The raredmg, raredmg_OncMTR, flexdmg, and flexdmg_oncMTR models incorporated a REVEL cutoff of REVEL ≥ 0.5 to restrict to putatively damaging missense variants (47).
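The variant classes listed above can be encoded as simple sets, as in the following sketch. The set and function names are hypothetical; only the annotation strings and the REVEL ≥ 0.5 cutoff are taken from the text.

```python
# Protein-truncating annotations as listed in the text.
PTV = {
    "exon_loss_variant", "frameshift_variant", "start_lost", "stop_gained",
    "stop_lost", "splice_acceptor_variant", "splice_donor_variant",
    "gene_fusion", "bidirectional_gene_fusion", "rare_amino_acid_variant",
    "transcript_ablation",
}
MISSENSE = {"missense_variant_splice_region_variant", "missense_variant"}
# Nonsynonymous = PTV + missense + in-frame and protein-altering annotations.
NONSYNONYMOUS = PTV | MISSENSE | {
    "conservative_inframe_deletion", "conservative_inframe_insertion",
    "disruptive_inframe_insertion", "disruptive_inframe_deletion",
    "protein_altering_variant",
}


def is_damaging_missense(effect, revel, cutoff=0.5):
    """True for missense variants with a REVEL score at or above the cutoff."""
    return effect in MISSENSE and revel is not None and revel >= cutoff
```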
To compute P values, the carriers of at least one qualifying variant (QV) in a gene were compared to the noncarriers. The difference in the proportion of cases and controls carrying QVs in a gene was tested using a Fisher’s exact two-sided test. Variants were required to pass the following quality control criteria: minimum coverage of 10×; annotation in consensus coding sequence (CCDS) transcripts (release 22; approximately 34 Mb); at most 80% alternate reads in homozygous genotypes; percent of alternate reads in heterozygous variants ≥0.25 and ≤0.8; binomial test of alternate allele proportion departure from 50% in heterozygous state P > 1 × 10^−6; genotype quality score (GQ) ≥ 20; Fisher’s strand bias score (FS) ≤ 200 (indels) ≤ 60 (SNVs); mapping quality score (MQ) ≥ 40; quality score (QUAL) ≥ 30; read position rank sum score (RPRS) ≥ −2; mapping quality rank sum score (MQRS) ≥ −8; DRAGEN variant status = PASS; the variant site achieved 10-fold coverage in ≥25% of gnomAD exomes; and if the variant was observed in gnomAD exomes, then the variant achieved exome z score ≥ −2.0 and exome MQ ≥ 30. We excluded 46 genes that we previously found associated with batch effects (2).
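The carrier versus noncarrier comparison for a single gene under a dominant model can be sketched as the construction of a 2 × 2 table, which would then be passed to a two-sided Fisher's exact test. The data structures and names below are hypothetical and only illustrate the counting step.

```python
def carrier_table(qv_counts, cases):
    """Return (case carriers, case noncarriers, control carriers, control noncarriers).

    qv_counts maps sample ID -> number of qualifying variants in the gene;
    cases is the set of sample IDs with the phenotype.
    """
    a = b = c = d = 0
    for sample, n_qv in qv_counts.items():
        carrier = n_qv >= 1  # dominant model: at least one QV in the gene
        if sample in cases:
            if carrier:
                a += 1
            else:
                b += 1
        elif carrier:
            c += 1
        else:
            d += 1
    return a, b, c, d


# Hypothetical cohort of five samples, two of which are cases.
qv_counts = {"s1": 1, "s2": 0, "s3": 2, "s4": 0, "s5": 0}
cases = {"s1", "s2"}
table = carrier_table(qv_counts, cases)
```

A sample carrying several QVs in the same gene is counted once, which is what makes the analysis gene based rather than variant based.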
We defined the study-wide significance threshold as P < 1 × 10^−8. We have previously shown, using an n-of-1 permutation approach and the empirical null synonymous model, that this threshold corresponds to a false positive rate of 9 and 2 per ∼346.5 million tests, respectively, for binary traits in the setting of a collapsing analysis PheWAS (2).
Acknowledgments
We thank L. Middleton for assistance in developing the OncMTR web portal. We thank O. Backhouse for useful discussions and feedback on this work.
Funding: The authors acknowledge that they received no funding in support of this research.
Author contributions: D.V., R.S.D., and S.P. designed the study. D.V., R.S.D., D.M., J.M., J.A., Q.W., B.S., A.R.H., and S.P. performed analyses and statistical interpretation. Q.W. and F.H. performed bioinformatic processing. D.V., R.S.D., and S.P. wrote the manuscript. D.V. and D.M. developed the web portal. D.V., R.S.D., D.M., J.M., X.Z., J.A., F.H., Q.W., B.S., A.R.H., and S.P. reviewed the manuscript.
Competing interests: D.V., R.S.D., D.M., J.M., X.Z., J.A., F.H., Q.W., B.S., A.R.H., and S.P. are current employees and/or stockholders of AstraZeneca. The authors declare that they have no other competing interests.
Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. OncMTR scores are publicly available through the online portal: http://oncmtr.public.cgr.astrazeneca.com. Code is publicly available on GitHub: https://github.com/astrazeneca-cgr-publications/OncMTR. OncMTR scores and code are also available at a public Zenodo repository: https://doi.org/10.5281/zenodo.6817251. gnomAD data are accessible at https://gnomad.broadinstitute.org. All whole-exome sequencing data described in this paper are publicly available to registered researchers through the UKB data access protocol. Exomes can be found in the UKB showcase portal: https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=170. Additional information about registration for access to the data is available at www.ukbiobank.ac.uk/register-apply. Data for this study were obtained under resource application number 26041. Association statistics generated on this cohort are also publicly available through our AstraZeneca Centre for Genomics Research (CGR) PheWAS Portal: http://azphewas.com.
REFERENCES AND NOTES
- 1.Karczewski K. J., Francioli L. C., Tiao G., Cummings B. B., Alföldi J., Wang Q., Collins R. L., Laricchia K. M., Ganna A., Birnbaum D. P., Gauthier L. D., Brand H., Solomonson M., Watts N. A., Rhodes D., Singer-Berk M., England E. M., Seaby E. G., Kosmicki J. A., Walters R. K., Tashman K., Farjoun Y., Banks E., Poterba T., Wang A., Seed C., Whiffin N., Chong J. X., Samocha K. E., Pierce-Hoffman E., Zappala Z., O’Donnell-Luria A. H., Minikel E. V., Weisburd B., Lek M., Ware J. S., Vittal C., Armean I. M., Bergelson L., Cibulskis K., Connolly K. M., Covarrubias M., Donnelly S., Ferriera S., Gabriel S., Gentry J., Gupta N., Jeandet T., Kaplan D., Llanwarne C., Munshi R., Novod S., Petrillo N., Roazen D., Ruano-Rubio V., Saltzman A., Schleicher M., Soto J., Tibbetts K., Tolonen C., Wade G., Talkowski M. E.; Genome Aggregation Database Consortium, Aguilar Salinas C. A., Ahmad T., Albert C. M., Ardissino D., Atzmon G., Barnard J., Beaugerie L., Benjamin E. J., Boehnke M., Bonnycastle L. L., Bottinger E. P., Bowden D. W., Bown M. J., Chambers J. C., Chan J. C., Chasman D., Cho J., Chung M. K., Cohen B., Correa A., Dabelea D., Daly M. J., Darbar D., Duggirala R., Dupuis J., Ellinor P. T., Elosua R., Erdmann J., Esko T., Färkkilä M., Florez J., Franke A., Getz G., Glaser B., Glatt S. J., Goldstein D., Gonzalez C., Groop L., Haiman C., Hanis C., Harms M., Hiltunen M., Holi M. M., Hultman C. M., Kallela M., Kaprio J., Kathiresan S., Kim B. J., Kim Y. J., Kirov G., Kooner J., Koskinen S., Krumholz H. M., Kugathasan S., Kwak S. H., Laakso M., Lehtimäki T., Loos R. J. F., Lubitz S. A., Ma R. C. W., MacArthur D. G., Marrugat J., Mattila K. M., McCarroll S., McCarthy M. I., McGovern D., McPherson R., Meigs J. B., Melander O., Metspalu A., Neale B. M., Nilsson P. M., O’Donovan M. C., Ongur D., Orozco L., Owen M. J., Palmer C. N. A., Palotie A., Park K. S., Pato C., Pulver A. E., Rahman N., Remes A. M., Rioux J. D., Ripatti S., Roden D. M., Saleheen D., Salomaa V., Samani N. 
J., Scharf J., Schunkert H., Shoemaker M. B., Sklar P., Soininen H., Sokol H., Spector T., Sullivan P. F., Suvisaari J., Tai E. S., Teo Y. Y., Tiinamaija T., Tsuang M., Turner D., Tusie-Luna T., Vartiainen E., Vawter M. P., Ware J. S., Watkins H., Weersma R. K., Wessman M., Wilson J. G., Xavier R. J., Neale B. M., Daly M. J., MacArthur D. G., The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
- 2.Wang Q., Dhindsa R. S., Carss K., Harper A. R., Nag A., Tachmazidou I., Vitsios D., Deevi S. V. V., Mackay A., Muthas D., Hühn M., Monkley S., Olsson H.; Astra Zeneca Genomics Initiative, Wasilewski S., Smith K. R., March R., Platt A., Haefliger C., Petrovski S., Rare variant contribution to human disease in 281,104 UK Biobank exomes. Nature 597, 527–532 (2021).
- 3.Dhindsa R. S., Copeland B. R., Mustoe A. M., Goldstein D. B., Natural selection shapes codon usage in the human genome. Am. J. Hum. Genet. 107, 83–95 (2020).
- 4.Petrovski S., Wang Q., Heinzen E. L., Allen A. S., Goldstein D. B., Genic Intolerance to functional variation and the interpretation of personal genomes. PLOS Genet. 9, e1003709 (2013).
- 5.Samocha K. E., Robinson E. B., Sanders S. J., Stevens C., Sabo A., McGrath L. M., Kosmicki J. A., Rehnström K., Mallick S., Kirby A., Wall D. P., MacArthur D. G., Gabriel S. B., DePristo M., Purcell S. M., Palotie A., Boerwinkle E., Buxbaum J. D., Cook E. H. Jr., Gibbs R. A., Schellenberg G. D., Sutcliffe J. S., Devlin B., Roeder K., Neale B. M., Daly M. J., A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).
- 6.Traynelis J., Silk M., Wang Q., Berkovic S. F., Liu L., Ascher D. B., Balding D. J., Petrovski S., Optimizing genomic medicine in epilepsy through a gene-customized approach to missense variant interpretation. Genome Res. 27, 1715–1729 (2017).
- 7.Hoischen A., Krumm N., Eichler E. E., Prioritization of neurodevelopmental disease genes by discovery of new mutations. Nat. Neurosci. 17, 764–772 (2014).
- 8.Petrovski S., Küry S., Myers C. T., Anyane-Yeboa K., Cogné B., Bialer M., Xia F., Hemati P., Riviello J., Mehaffey M., Besnard T., Becraft E., Wadley A., Politi A. R., Colombo S., Zhu X., Ren Z., Andrews I., Dudding-Byth T., Schneider A. L., Wallace G.; University of Washington Center for Mendelian Genomics, Rosen A. B. I., Schelley S., Enns G. M., Corre P., Dalton J., Mercier S., Latypova X., Schmitt S., Guzman E., Moore C., Bier L., Heinzen E. L., Karachunski P., Shur N., Grebe T., Basinger A., Nguyen J. M., Bézieau S., Wierenga K., Bernstein J. A., Scheffer I. E., Rosenfeld J. A., Mefford H. C., Isidor B., Goldstein D. B., Germline de novo mutations in GNB1 cause severe neurodevelopmental disability, hypotonia, and seizures. Am. J. Hum. Genet. 98, 1001–1010 (2016).
- 9.Hoischen A., van Bon B. W. M., Rodríguez-Santiago B., Gilissen C., Vissers L. E. L. M., de Vries P., Janssen I., van Lier B., Hastings R., Smithson S. F., Newbury-Ecob R., Kjaergaard S., Goodship J., McGowan R., Bartholdi D., Rauch A., Peippo M., Cobben J. M., Wieczorek D., Gillessen-Kaesbach G., Veltman J. A., Brunner H. G., de Vries B. B. B. A., De novo nonsense mutations in ASXL1 cause Bohring-Opitz syndrome. Nat. Genet. 43, 729–731 (2011).
- 10.Gibson W. T., Hood R. L., Zhan S. H., Bulman D. E., Fejes A. P., Moore R., Mungall A. J., Eydoux P., Babul-Hirji R., An J., Marra M. A.; FORGE Canada Consortium, Chitayat D., Boycott K. M., Weaver D. D., Jones S. J., Mutations in EZH2 cause weaver syndrome. Am. J. Hum. Genet. 90, 110–118 (2012).
- 11.Kaplanis J., Samocha K. E., Wiel L., Zhang Z., Arvai K. J., Eberhardt R. Y., Gallone G., Lelieveld S. H., Martin H. C., McRae J. F., Short P. J., Torene R. I., de Boer E., Danecek P., Gardner E. J., Huang N., Lord J., Martincorena I., Pfundt R., Reijnders M. R. F., Yeung A., Yntema H. G.; Deciphering Developmental Disorders Study, Borras S., Clark C., Dean J., Miedzybrodzka Z., Ross A., Tennant S., Dabir T., Donnelly D., Humphreys M., Magee A., McConnell V., McKee S., McNerlan S., Morrison P. J., Rea G., Stewart F., Cole T., Cooper N., Cooper-Charles L., Cox H., Islam L., Jarvis J., Keelagher R., Lim D., McMullan D., Morton J., Naik S., O’Driscoll M., Ong K. R., Osio D., Ragge N., Turton S., Vogt J., Williams D., Bodek S., Donaldson A., Hills A., Low K., Newbury-Ecob R., Norman A. M., Roberts E., Scurr I., Smithson S., Tooley M., Abbs S., Armstrong R., Dunn C., Holden S., Park S. M., Paterson J., Raymond L., Reid E., Sandford R., Simonic I., Tischkowitz M., Woods G., Bradley L., Comerford J., Green A., Lynch S., McQuaid S., Mullaney B., Berg J., Goudie D., Mavrak E., McLean J., McWilliam C., Reavey E., Azam T., Cleary E., Jackson A., Lam W., Lampe A., Moore D., Porteous M., Baple E., Baptista J., Brewer C., Castle B., Kivuva E., Owens M., Rankin J., Shaw-Smith C., Turner C., Turnpenny P., Tysoe C., Bradley T., Davidson R., Gardiner C., Joss S., Kinning E., Longman C., McGowan R., Murday V., Pilz D., Tobias E., Whiteford M., Williams N., Barnicoat A., Clement E., Faravelli F., Hurst J., Jenkins L., Jones W., Kumar V. K. 
A., Lees M., Loughlin S., Male A., Morrogh D., Rosser E., Scott R., Wilson L., Beleza A., Deshpande C., Flinter F., Holder M., Irving M., Izatt L., Josifova D., Mohammed S., Molenda A., Robert L., Roworth W., Ruddy D., Ryten M., Yau S., Bennett C., Blyth M., Campbell J., Coates A., Dobbie A., Hewitt S., Hobson E., Jackson E., Jewell R., Kraus A., Prescott K., Sheridan E., Thomson J., Bradshaw K., Dixit A., Eason J., Haines R., Harrison R., Mutch S., Sarkar A., Searle C., Shannon N., Sharif A., Suri M., Vasudevan P., Canham N., Ellis I., Greenhalgh L., Howard E., Stinton V., Swale A., Weber A., Banka S., Breen C., Briggs T., Burkitt-Wright E., Chandler K., Clayton-Smith J., Donnai D., Douzgou S., Gaunt L., Jones E., Kerr B., Langley C., Metcalfe K., Smith A., Wright R., Bourn D., Burn J., Fisher R., Hellens S., Henderson A., Montgomery T., Splitt M., Straub V., Wright M., Zwolinski S., Allen Z., Bernhard B., Brady A., Brooks C., Busby L., Clowes V., Ghali N., Holder S., Ibitoye R., Wakeling E., Blair E., Carmichael J., Cilliers D., Clasper S., Gibbons R., Kini U., Lester T., Nemeth A., Poulton J., Price S., Shears D., Stewart H., Wilkie A., Albaba S., Baker D., Balasubramanian M., Johnson D., Parker M., Quarrell O., Stewart A., Willoughby J., Crosby C., Elmslie F., Homfray T., Jin H., Lahiri N., Mansour S., Marks K., McEntagart M., Saggar A., Tatton-Brown K., Butler R., Clarke A., Corrin S., Fry A., Kamath A., McCann E., Mugalaasi H., Pottinger C., Procter A., Sampson J., Sansbury F., Varghese V., Baralle D., Callaway A., Cassidy E. J., Daniels S., Douglas A., Foulds N., Hunt D., Kharbanda M., Lachlan K., Mercer C., Side L., Temple I. K., Wellesley D., Vissers L. E. L. M., Juusola J., Wright C. F., Brunner H. G., Firth H. V., FitzPatrick D. R., Barrett J. C., Hurles M. E., Gilissen C., Retterer K., Evidence for 28 genetic disorders discovered by combining healthcare and research data. Nature 586, 757–762 (2020).
- 12.Hyman D. M., Taylor B. S., Baselga J., Implementing genome-driven oncology. Cell 168, 584–599 (2017).
- 13.Jumper J., Evans R., Pritzel A., Green T., Figurnov M., Ronneberger O., Tunyasuvunakool K., Bates R., Žídek A., Potapenko A., Bridgland A., Meyer C., Kohl S. A. A., Ballard A. J., Cowie A., Romera-Paredes B., Nikolov S., Jain R., Adler J., Back T., Petersen S., Reiman D., Clancy E., Zielinski M., Steinegger M., Pacholska M., Berghammer T., Bodenstein S., Silver D., Vinyals O., Senior A. W., Kavukcuoglu K., Kohli P., Hassabis D., Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
- 14.Bick A. G., Weinstock J. S., Nandakumar S. K., Fulco C. P., Bao E. L., Zekavat S. M., Szeto M. D., Liao X., Leventhal M. J., Nasser J., Chang K., Laurie C., Burugula B. B., Gibson C. J., Niroula A., Lin A. E., Taub M. A., Aguet F., Ardlie K., Mitchell B. D., Barnes K. C., Moscati A., Fornage M., Redline S., Psaty B. M., Silverman E. K., Weiss S. T., Palmer N. D., Vasan R. S., Burchard E. G., Kardia S. L. R., He J., Kaplan R. C., Smith N. L., Arnett D. K., Schwartz D. A., Correa A., de Andrade M., Guo X., Konkle B. A., Custer B., Peralta J. M., Gui H., Meyers D. A., McGarvey S. T., Chen I. Y. D., Shoemaker M. B., Peyser P. A., Broome J. G., Gogarten S. M., Wang F. F., Wong Q., Montasser M. E., Daya M., Kenny E. E., North K. E., Launer L. J., Cade B. E., Bis J. C., Cho M. H., Lasky-Su J., Bowden D. W., Cupples L. A., Mak A. C. Y., Becker L. C., Smith J. A., Kelly T. N., Aslibekyan S., Heckbert S. R., Tiwari H. K., Yang I. V., Heit J. A., Lubitz S. A., Johnsen J. M., Curran J. E., Wenzel S. E., Weeks D. E., Rao D. C., Darbar D., Moon J. Y., Tracy R. P., Buth E. J., Rafaels N., Loos R. J. F., Durda P., Liu Y., Hou L., Lee J., Kachroo P., Freedman B. I., Levy D., Bielak L. F., Hixson J. E., Floyd J. S., Whitsel E. A., Ellinor P. T., Irvin M. R., Fingerlin T. E., Raffield L. M., Armasu S. M., Wheeler M. M., Sabino E. C., Blangero J., Williams L. K., Levy B. D., Sheu W. H. H., Roden D. M., Boerwinkle E., Manson J. A. E., Mathias R. A., Desai P., Taylor K. D., Johnson A. D.; NHLBI Trans-Omics for Precision Medicine Consortium, Abe N., Albert C., Almasy L., Alonso A., Ament S., Anderson P., Anugu P., Applebaum-Bowden D., Arking D., Ashley-Koch A., Aslibekyan S., Assimes T., Avramopoulos D., Barnard J., Barr R. 
G., Barron-Casella E., Barwick L., Beaty T., Beck G., Becker D., Beer R., Beitelshees A., Benjamin E., Benos P., Bezerra M., Bielak L., Bowler R., Brody J., Broeckel U., Bunting K., Bustamante C., Cardwell J., Carey V., Carty C., Casaburi R., Casella J., Castaldi P., Chaffin M., Chang C., Chang Y. C., Chasman D., Chavan S., Chen B. J., Chen W. M., Choi S. H., Chuang L. M., Chung M., Chung R. H., Clish C., Comhair S., Cornell E., Crandall C., Crapo J., Curtis J., Damcott C., Das S., David S., Davis C., DeBaun M., Deka R., DeMeo D., Devine S., Duan Q., Duggirala R., Dutcher S., Eaton C., Ekunwe L., Boueiz A. E., Emery L., Erzurum S., Farber C., Flickinger M., Franceschini N., Frazar C., Fu M., Fullerton S. M., Fulton L., Gabriel S., Gan W., Gao S., Gao Y., Gass M., Gelb B., Geng X., Geraci M., Germer S., Gerszten R., Ghosh A., Gibbs R., Gignoux C., Gladwin M., Glahn D., Gong D. W., Goring H., Graw S., Grine D., Gu C. C., Guan Y., Gupta N., Haessler J., Hall M., Harris D., Hawley N. L., Heavner B., Hernandez R., Herrington D., Hersh C., Hidalgo B., Hobbs B., Hokanson J., Hong E., Hoth K., Hsiung C., Hung Y. J., Huston H., Hwu C. M., Jackson R., Jain D., Jaquish C., Jhun M. A., Johnson C., Johnston R., Jones K., Kang H. M., Kelly S., Kessler M., Khan A., Kim W., Kinney G., Kramer H., Lange C., LeBoff M., Lee S. S., Lee W. J., LeFaive J., Levine D., Lewis J., Li X., Li Y., Lin H., Lin H., Lin K. H., Lin X., Liu S., Liu Y., Lunetta K., Luo J., Mahaney M., Make B., Manichaikul A., Margolin L., Martin L., Mathai S., May S., McArdle P., McDonald M. L., McFarland S., McGoldrick D., McHugh C., Mei H., Mestroni L., Mikulla J., Min N., Minear M., Minster R. L., Moll M., Montgomery C., Musani S., Mwasongwe S., Mychaleckyj J. C., Nadkarni G., Naik R., Naseri T., Nekhai S., Nelson S. C., Neltner B., Nickerson D., O’Connell J., O’Connor T., Ochs-Balcom H., Paik D., Pankow J., Papanicolaou G., Parsa A., Perez M., Perry J., Peters U., Peyser P., Phillips L. 
S., Pollin T., Post W., Becker J. P., Boorgula M. P., Preuss M., Qasba P., Qiao D., Qin Z., Rasmussen-Torvik L., Ratan A., Reed R., Regan E., Sefuiva Reupena M., Rice K., Roselli C., Ruczinski I., Russell P., Ruuska S., Ryan K., Saleheen D., Salimi S., Salzberg S., Sandow K., Scheller C., Schmidt E., Schwander K., Sciurba F., Seidman C., Seidman J., Sheehan V., Sherman S. L., Shetty A., Shetty A., Silver B., Smith J., Smith T., Smoller S., Snively B., Snyder M., Sofer T., Sotoodehnia N., Stilp A. M., Storm G., Streeten E., Su J. L., Sung Y. J., Sylvia J., Szpiro A., Sztalryd C., Taliun D., Tang H., Taylor M., Taylor S., Telen M., Thornton T. A., Threlkeld M., Tinker L., Tirschwell D., Tishkoff S., Tiwari H., Tong C., Tsai M., Vaidya D., Berg D. V. D., VandeHaar P., Vrieze S., Walker T., Wallace R., Walts A., Wang H., Watson K., Weir B., Weng L. C., Wessel J., Willer C., Williams K., Wilson C., Wu J., Xu H., Yanek L., Yang R., Zaghloul N., Zhang Y., Zhao S. X., Zhao W., Zhi D., Zhou X., Zhu X., Zody M., Zoellner S., Auer P. L., Kooperberg C., Laurie C. C., Blackwell T. W., Smith A. V., Zhao H., Lange E., Lange L., Rich S. S., Rotter J. I., Wilson J. G., Scheet P., Kitzman J. O., Lander E. S., Engreitz J. M., Ebert B. L., Reiner A. P., Jaiswal S., Abecasis G., Sankaran V. G., Kathiresan S., Natarajan P., Inherited causes of clonal haematopoiesis in 97,691 whole genomes. Nature 586, 763–768 (2020).
- 15.Kircher M., Witten D. M., Jain P., O’Roak B. J., Cooper G. M., Shendure J., A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
- 16.Wells A., Heckerman D., Torkamani A., Yin L., Sebat J., Ren B., Telenti A., di Iulio J., Ranking of non-coding pathogenic variants and putative essential regions of the human genome. Nat. Commun. 10, 5241 (2019).
- 17.Huang Y.-F., Gulko B., Siepel A., Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nat. Genet. 49, 618–624 (2017).
- 18.Pollard K. S., Hubisz M. J., Rosenbloom K. R., Siepel A., Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121 (2010).
- 19.Muiños F., Martínez-Jiménez F., Pich O., Gonzalez-Perez A., Lopez-Bigas N., In silico saturation mutagenesis of cancer genes. Nature 596, 428–432 (2021).
- 20.Martínez-Jiménez F., Muiños F., Sentís I., Deu-Pons J., Reyes-Salazar I., Arnedo-Pac C., Mularoni L., Pich O., Bonet J., Kranas H., Gonzalez-Perez A., Lopez-Bigas N., A compendium of mutational cancer driver genes. Nat. Rev. Cancer 20, 555–572 (2020).
- 21.Bhatia K., Huppi K., Spangler G., Siwarski D., Iyer R., Magrath I., Point mutations in the c-Myc transactivation domain are common in Burkitt’s lymphoma and mouse plasmacytomas. Nat. Genet. 5, 56–61 (1993).
- 22.King B., Trimarchi T., Reavie L., Xu L., Mullenders J., Ntziachristos P., Aranda-Orgilles B., Perez-Garcia A., Shi J., Vakoc C., Sandy P., Shen S. S., Ferrando A., Aifantis I., The ubiquitin ligase FBXW7 modulates leukemia-initiating cell activity by regulating MYC stability. Cell 153, 1552–1566 (2013).
- 23. Whiteside D., McLeod R., Graham G., Steckley J. L., Booth K., Somerville M. J., Andrew S. E., A homozygous germ-line mutation in the human MSH2 gene predisposes to hematological malignancy and multiple café-au-lait spots. Cancer Res. 62, 359–362 (2002).
- 24. Yoda A., Adelmant G., Tamburini J., Chapuy B., Shindoh N., Yoda Y., Weigert O., Kopp N., Wu S. C., Kim S. S., Liu H., Tivey T., Christie A. L., Elpek K. G., Card J., Gritsman K., Gotlib J., Deininger M. W., Makishima H., Turley S. J., Javidi-Sharifi N., Maciejewski J. P., Jaiswal S., Ebert B. L., Rodig S. J., Tyner J. W., Marto J. A., Weinstock D. M., Lane A. A., Mutations in G protein β subunits promote transformation and kinase inhibitor resistance. Nat. Med. 21, 71–75 (2015).
- 25. Oliveira J. B., Bidère N., Niemela J. E., Zheng L., Sakai K., Nix C. P., Danner R. L., Barb J., Munson P. J., Puck J. M., Dale J., Straus S. E., Fleisher T. A., Lenardo M. J., NRAS mutation causes a human autoimmune lymphoproliferative syndrome. Proc. Natl. Acad. Sci. U.S.A. 104, 8953–8958 (2007).
- 26. Cirstea I. C., Kutsche K., Dvorsky R., Gremer L., Carta C., Horn D., Roberts A. E., Lepri F., Merbitz-Zahradnik T., König R., Kratz C. P., Pantaleoni F., Dentici M. L., Joshi V. A., Kucherlapati R. S., Mazzanti L., Mundlos S., Patton M. A., Silengo M. C., Rossi C., Zampino G., Digilio C., Stuppia L., Seemanova E., Pennacchio L. A., Gelb B. D., Dallapiccola B., Wittinghofer A., Ahmadian M. R., Tartaglia M., Zenker M., A restricted spectrum of NRAS mutations causes Noonan syndrome. Nat. Genet. 42, 27–29 (2010).
- 27. Matsuda K., Shimada A., Yoshida N., Ogawa A., Watanabe A., Yajima S., Iizuka S., Koike K., Yanai F., Kawasaki K., Yanagimachi M., Kikuchi A., Ohtsuka Y., Hidaka E., Yamauchi K., Tanaka M., Yanagisawa R., Nakazawa Y., Shiohara M., Manabe A., Kojima S., Koike K., Spontaneous improvement of hematologic abnormalities in patients having juvenile myelomonocytic leukemia with specific RAS mutations. Blood 109, 5477–5480 (2007).
- 28. Kosaki R., Terashima H., Kubota M., Kosaki K., Acute myeloid leukemia-associated DNMT3A p.Arg882His mutation in a patient with Tatton-Brown-Rahman overgrowth syndrome as a constitutional mutation. Am. J. Med. Genet. A 173, 250–253 (2017).
- 29. Jaiswal S., Natarajan P., Silver A. J., Gibson C. J., Bick A. G., Shvartz E., Conkey M. M., Gupta N., Gabriel S., Ardissino D., Baber U., Mehran R., Fuster V., Danesh J., Frossard P., Saleheen D., Melander O., Sukhova G. K., Neuberg D., Libby P., Kathiresan S., Ebert B. L., Clonal hematopoiesis and risk of atherosclerotic cardiovascular disease. N. Engl. J. Med. 377, 111–121 (2017).
- 30. Tatton-Brown K., Seal S., Ruark E., Harmer J., Ramsay E., Duarte S. D. V., Zachariou A., Hanks S., O’Brien E., Aksglaede L., Baralle D., Dabir T., Gener B., Goudie D., Homfray T., Kumar A., Pilz D. T., Selicorni A., Temple I. K., Van Maldergem L., Yachelevich N.; Childhood Overgrowth Consortium, van Montfort R., Rahman N., Mutations in the DNA methyltransferase gene DNMT3A cause an overgrowth syndrome with intellectual disability. Nat. Genet. 46, 385–388 (2014).
- 31. Zhang Z.-M., Lu R., Wang P., Yu Y., Chen D., Gao L., Liu S., Ji D., Rothbart S. B., Wang Y., Wang G. G., Song J., Structural basis for DNMT3A-mediated de novo DNA methylation. Nature 554, 387–391 (2018).
- 32. Chang M. T., Bhattarai T. S., Schram A. M., Bielski C. M., Donoghue M. T. A., Jonsson P., Chakravarty D., Phillips S., Kandoth C., Penson A., Gorelick A., Shamu T., Patel S., Harris C., Gao J. J., Sumer S. O., Kundra R., Razavi P., Li B. T., Reales D. N., Socci N. D., Jayakumaran G., Zehir A., Benayed R., Arcila M. E., Chandarlapaty S., Ladanyi M., Schultz N., Baselga J., Berger M. F., Rosen N., Solit D. B., Hyman D. M., Taylor B. S., Accelerating discovery of functional mutant alleles in cancer. Cancer Discov. 8, 174–183 (2018).
- 33. Cassandri M., Smirnov A., Novelli F., Pitolli C., Agostini M., Malewicz M., Melino G., Raschellà G., Zinc-finger proteins in health and disease. Cell Death Discov. 3, 17071 (2017).
- 34. De Rubeis S., He X., Goldberg A. P., Poultney C. S., Samocha K., Cicek A. E., Kou Y., Liu L., Fromer M., Walker S., Singh T., Klei L., Kosmicki J., Shih-Chen F., Aleksic B., Biscaldi M., Bolton P. F., Brownfeld J. M., Cai J., Campbell N. G., Carracedo A., Chahrour M. H., Chiocchetti A. G., Coon H., Crawford E. L., Curran S. R., Dawson G., Duketis E., Fernandez B. A., Gallagher L., Geller E., Guter S. J., Hill R. S., Ionita-Laza J., Gonzalez P. J., Kilpinen H., Klauck S. M., Kolevzon A., Lee I., Lei I., Lei J., Lehtimäki T., Lin C.-F., Ma’ayan A., Marshall C. R., McInnes A. L., Neale B., Owen M. J., Ozaki N., Parellada M., Parr J. R., Purcell S., Puura K., Rajagopalan D., Rehnström K., Reichenberg A., Sabo A., Sachse M., Sanders S. J., Schafer C., Schulte-Rüther M., Skuse D., Stevens C., Szatmari P., Tammimies K., Valladares O., Voran A., Li-San W., Weiss L. A., Willsey A. J., Yu T. W., Yuen R. K. C.; DDD Study; Homozygosity Mapping Collaborative for Autism; UK10K Consortium; The Autism Sequencing Consortium, Cook E. H., Freitag C. M., Gill M., Hultman C. M., Lehner T., Palotie A., Schellenberg G. D., Sklar P., State M. W., Sutcliffe J. S., Walsh C. A., Scherer S. W., Zwick M. E., Barett J. C., Cutler D. J., Roeder K., Devlin B., Daly M. J., Buxbaum J. D., Synaptic, transcriptional and chromatin genes disrupted in autism. Nature 515, 209–215 (2014).
- 35. Dhindsa R. S., Zoghbi A. W., Krizay D. K., Vasavda C., Goldstein D. B., A transcriptome-based drug discovery paradigm for neurodevelopmental disorders. Ann. Neurol. 89, 199–211 (2021).
- 36. Feng Y.-C. A., Howrigan D. P., Abbott L. E., Tashman K., Cerrato F., Singh T., Heyne H., Byrnes A., Churchhouse C., Watts N., Solomonson M., Lal D., Heinzen E. L., Dhindsa R. S., Stanley K. E., Cavalleri G. L., Hakonarson H., Helbig I., Krause R., May P., Weckhuysen S., Petrovski S., Kamalakaran S., Sisodiya S. M., Cossette P., Cotsapas C., de Jonghe P., Dixon-Salazar T., Guerrini R., Kwan P., Marson A. G., Stewart R., Depondt C., Dlugos D. J., Scheffer I. E., Striano P., Freyer C., McKenna K., Regan B. M., Bellows S. T., Leu C., Bennett C. A., Johns E. M. C., Macdonald A., Shilling H., Burgess R., Weckhuysen D., Bahlo M., O’Brien T. J., Todaro M., Stamberger H., Andrade D. M., Sadoway T. R., Mo K., Krestel H., Gallati S., Papacostas S. S., Kousiappa I., Tanteles G. A., Štěrbová K., Vlčková M., Sedláčková L., Laššuthová P., Klein K. M., Rosenow F., Reif P. S., Knake S., Kunz W. S., Zsurka G., Elger C. E., Bauer J., Rademacher M., Pendziwiat M., Muhle H., Rademacher A., van Baalen A., von Spiczak S., Stephani U., Afawi Z., Korczyn A. D., Kanaan M., Canavati C., Kurlemann G., Müller-Schlüter K., Kluger G., Häusler M., Blatt I., Lemke J. R., Krey I., Weber Y. G., Wolking S., Becker F., Hengsbach C., Rau S., Maisch A. F., Steinhoff B. J., Schulze-Bonhage A., Schubert-Bast S., Schreiber H., Borggräfe I., Schankin C. J., Mayer T., Korinthenberg R., Brockmann K., Kurlemann G., Dennig D., Madeleyn R., Kälviäinen R., Auvinen P., Saarela A., Linnankivi T., Lehesjoki A. E., Rees M. I., Chung S. K., Pickrell W. O., Powell R., Schneider N., Balestrini S., Zagaglia S., Braatz V., Johnson M. R., Auce P., Sills G. J., Baum L. W., Sham P. C., Cherny S. S., Lui C. H. T., Barišić N., Delanty N., Doherty C. P., Shukralla A., McCormack M., el-Naggar H., Canafoglia L., Franceschetti S., Castellotti B., Granata T., Zara F., Iacomino M., Madia F., Vari M. S., Mancardi M. M., Salpietro V., Bisulli F., Tinuper P., Licchetta L., Pippucci T., Stipa C., Minardi R., Gambardella A., Labate A., Annesi G., Manna L., Gagliardi M., Parrini E., Mei D., Vetro A., Bianchini C., Montomoli M., Doccini V., Marini C., Suzuki T., Inoue Y., Yamakawa K., Tumiene B., Sadleir L. G., King C., Mountier E., Caglayan S. H., Arslan M., Yapıcı Z., Yis U., Topaloglu P., Kara B., Turkdogan D., Gundogdu-Eken A., Bebek N., Uğur-İşeri S., Baykan B., Salman B., Haryanyan G., Yücesan E., Kesim Y., Özkara Ç., Poduri A., Shiedley B. R., Shain C., Buono R. J., Ferraro T. N., Sperling M. R., Lo W., Privitera M., French J. A., Schachter S., Kuzniecky R. I., Devinsky O., Hegde M., Khankhanian P., Helbig K. L., Ellis C. A., Spalletta G., Piras F., Piras F., Gili T., Ciullo V., Reif A., McQuillin A., Bass N., McIntosh A., Blackwood D., Johnstone M., Palotie A., Pato M. T., Pato C. N., Bromet E. J., Carvalho C. B., Achtyes E. D., Azevedo M. H., Kotov R., Lehrer D. S., Malaspina D., Marder S. R., Medeiros H., Morley C. P., Perkins D. O., Sobell J. L., Buckley P. F., Macciardi F., Rapaport M. H., Knowles J. A., Fanous A. H., McCarroll S. A., Gupta N., Gabriel S. B., Daly M. J., Lander E. S., Lowenstein D. H., Goldstein D. B., Lerche H., Berkovic S. F., Neale B. M., Ultra-rare genetic variation in the epilepsies: A whole-exome sequencing study of 17,606 individuals. Am. J. Hum. Genet. 105, 267–282 (2019).
- 37. Porta-Pardo E., Kamburov A., Tamborero D., Pons T., Grases D., Valencia A., Lopez-Bigas N., Getz G., Godzik A., Comparison of algorithms for the detection of cancer drivers at subgene resolution. Nat. Methods 14, 782–788 (2017).
- 38. Wang T., Ruan S., Zhao X., Shi X., Teng H., Zhong J., You M., Xia K., Sun Z., Mao F., OncoVar: An integrated database and analysis platform for oncogenic driver variants in cancers. Nucleic Acids Res. 49, D1289–D1301 (2021).
- 39. Vitsios D., Dhindsa R. S., Middleton L., Gussow A. B., Petrovski S., Prioritizing non-coding regions based on human genomic constraint and sequence context with deep learning. Nat. Commun. 12, 1504 (2021).
- 40. Gussow A. B., Copeland B. R., Dhindsa R. S., Wang Q., Petrovski S., Majoros W. H., Allen A. S., Goldstein D. B., Orion: Detecting regions of the human non-coding genome that are intolerant to variation using population genetics. PLOS ONE 12, e0181604 (2017).
- 41. Jaiswal S., Fontanillas P., Flannick J., Manning A., Grauman P. V., Mar B. G., Lindsley R. C., Mermel C. H., Burtt N., Chavez A., Higgins J. M., Moltchanov V., Kuo F. C., Kluk M. J., Henderson B., Kinnunen L., Koistinen H. A., Ladenvall C., Getz G., Correa A., Banahan B. F., Gabriel S., Kathiresan S., Stringham H. M., McCarthy M. I., Boehnke M., Tuomilehto J., Haiman C., Groop L., Atzmon G., Wilson J. G., Neuberg D., Altshuler D., Ebert B. L., Age-related clonal hematopoiesis associated with adverse outcomes. N. Engl. J. Med. 371, 2488–2498 (2014).
- 42. Tamborero D., Rubio-Perez C., Deu-Pons J., Schroeder M. P., Vivancos A., Rovira A., Tusquets I., Albanell J., Rodon J., Tabernero J., de Torres C., Dienstmann R., Gonzalez-Perez A., Lopez-Bigas N., Cancer Genome Interpreter annotates the biological and clinical relevance of tumor alterations. Genome Med. 10, 25 (2018).
- 43. Amberger J. S., Bocchini C. A., Schiettecatte F., Scott A. F., Hamosh A., OMIM.org: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders. Nucleic Acids Res. 43, D789–D798 (2015).
- 44. Bahrami S., Ehsani R., Drabløs F., A property-based analysis of human transcription factors. BMC Res. Notes 8, 82 (2015).
- 45. Bycroft C., Freeman C., Petkova D., Band G., Elliott L. T., Sharp K., Motyer A., Vukcevic D., Delaneau O., O’Connell J., Cortes A., Welsh S., Young A., Effingham M., McVean G., Leslie S., Allen N., Donnelly P., Marchini J., The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
- 46. Cingolani P., Platts A., Wang L. L., Coon M., Nguyen T., Wang L., Land S. J., Lu X., Ruden D. M., A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 6, 80–92 (2012).
- 47. Ioannidis N. M., Rothstein J. H., Pejaver V., Middha S., McDonnell S. K., Baheti S., Musolf A., Li Q., Holzinger E., Karyadi D., Cannon-Albright L. A., Teerlink C. C., Stanford J. L., Isaacs W. B., Xu J., Cooney K. A., Lange E. M., Schleutker J., Carpten J. D., Powell I. J., Cussenot O., Cancel-Tassin G., Giles G. G., MacInnis R. J., Maier C., Hsieh C. L., Wiklund F., Catalona W. J., Foulkes W. D., Mandal D., Eeles R. A., Kote-Jarai Z., Bustamante C. D., Schaid D. J., Hastie T., Ostrander E. A., Bailey-Wilson J. E., Radivojac P., Thibodeau S. N., Whittemore A. S., Sieh W., REVEL: An ensemble method for predicting the pathogenicity of rare missense variants. Am. J. Hum. Genet. 99, 877–885 (2016).
Associated Data