Sequential regulatory activity prediction across chromosomes with convolutional neural networks

Genome Res. 2018 May;28(5):739-750. doi: 10.1101/gr.227819.117. Epub 2018 Mar 27.

David R Kelley¹, Yakir A Reshef², Maxwell Bileschi³, David Belanger³, Cory Y McLean³, Jasper Snoek³

Affiliations

¹ Calico Labs, South San Francisco, California 94080, USA.
² Department of Computer Science, Harvard University, Cambridge, Massachusetts 02138, USA.
³ Google Brain, Cambridge, Massachusetts 02142, USA.

PMID: 29588361
PMCID: PMC5932613
DOI: 10.1101/gr.227819.117

David R Kelley et al. Genome Res. 2018 May.

Abstract

Models for predicting phenotypic outcomes from genotypes have important applications to understanding genomic function and improving human health. Here, we develop a machine-learning system to predict cell-type-specific epigenetic and transcriptional profiles in large mammalian genomes from DNA sequence alone. By use of convolutional neural networks, this system identifies promoters and distal regulatory elements and synthesizes their content to make effective gene expression predictions. We show that model predictions for the influence of genomic variants on gene expression align well to causal variants underlying eQTLs in human populations and can be useful for generating mechanistic hypotheses to enable fine mapping of disease loci.

Figures

**Figure 1.**
Sequential regulatory activity prediction. DNA sequences enter the model one-hot encoded as four rows representing A, C, G, and T. The annotations are fabrications, included to convey the reasons for the various elements of the architecture. We apply several layers of convolution and max pooling, similar to previous methods (Kelley et al. 2016), to obtain representations that describe 128-bp bins. To share information across large distances, we apply several layers of dilated convolutions. The purple squares indicate the columns that the convolution directly sees; the teal shade is drawn proportional to the number of operations performed on that column with respect to the center position. The outputs of the dilated convolution layers are densely passed on to the final prediction layer, where a width-one convolution layer makes predictions across the sequence. We compare these predictions to the experimental counts via a Poisson regression loss function and use stochastic gradient descent with backpropagation to fit the model parameters.
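The input encoding and the loss named in this caption can be illustrated with a minimal sketch. It is not the authors' code: the function names, the all-zero handling of non-ACGT symbols, and the small `eps` constant are assumptions made for illustration, and the exact form of the Poisson loss used in training is not stated here.

```python
import math

NUCLEOTIDES = "ACGT"  # row order taken from the caption: A, C, G, T


def one_hot(seq):
    """Return four rows (A, C, G, T) with one column per base.

    Assumption: any other symbol (for example N) becomes an all-zero column.
    """
    seq = seq.upper()
    return [[1 if base == nt else 0 for base in seq] for nt in NUCLEOTIDES]


def poisson_nll(predicted, observed):
    """Mean Poisson negative log-likelihood across bins.

    The log(observed!) term is dropped because it does not depend on the
    prediction and therefore does not affect the fit.
    """
    eps = 1e-8  # assumption: guards against log(0)
    total = 0.0
    for pred, count in zip(predicted, observed):
        total += pred - count * math.log(pred + eps)
    return total / len(predicted)
```

For a fixed set of observed counts, this loss is smallest when each predicted rate equals the observed count, which is why it suits count-valued coverage tracks.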

**Figure 2.**
Basenji predicts diverse epigenetic and transcriptional profiles from DNA sequence. (A) The *AKT2* locus exemplifies the genome-wide accuracy of Basenji predictions; gene promoters and the strongest distal regulatory elements are easily identified, with some false-positive and -negative predictions for weaker elements. For each track, the darker version on *top* represents the experimental coverage, and the lighter version *below* represents Basenji predictions. (B) We computed the variance explained (R²) for each experiment and plot here the distributions classified by data set type. Basenji predicts punctate peak data, but broad chromatin marks remain challenging. (C) For the median accuracy DNase-seq experiment, mobilized CD34 cells, we plotted the log₂ predictions versus log₂ experiment coverage in 128-bp bins. (D) For all replicated experiments, we plotted log–log Pearson correlation between the replicate experiments versus the correlation between the experiment and its replicate's prediction (averaged across replicates). Both the mean and median Basenji prediction accuracy exceed the replicate accuracy.
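The two accuracy measures named in panels B through D can be sketched as follows. This is an illustrative sketch under stated assumptions: R² is computed here as one minus the ratio of residual to total sum of squares, and a pseudocount of 1 is added before taking log₂ so that empty bins are defined. The caption specifies neither choice, and the helper names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def r_squared(observed, predicted):
    """Variance explained: 1 - SS_residual / SS_total (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def log2_pearson(observed, predicted, pseudocount=1.0):
    """Pearson correlation after a log2 transform of both tracks.

    Assumption: the pseudocount value is a placeholder, not the paper's.
    """
    log_obs = [math.log2(v + pseudocount) for v in observed]
    log_pred = [math.log2(v + pseudocount) for v in predicted]
    return pearson(log_obs, log_pred)
```

Applied per experiment over 128-bp bins, `r_squared` would yield one value per track, which is the quantity whose distribution panel B summarizes.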

**Figure 3.**
Basenji predicts cell-type–specific gene expression. (A) We computed Pearson correlation between the log₂ prediction and experiment across all nonzero expressed test set genes for each CAGE data set. We plotted those correlations against the total number of reads aligned to test gene TSSs, which measures the relevant sequencing depth. (B) For the median accuracy cell, artery endothelial cells, we plotted the experiment coverage versus Basenji prediction. (C) For both the experimental measurement and Basenji prediction, the gene expression by CAGE data set matrix displays clusters. We measured the similarity of those clusters between the experimental and predicted data by bootstrap sampling gene subsets, clustering both the experimental and predicted data and computing the adjusted Rand index between the cluster sets (Methods). The adjusted Rand index is significantly greater than the null model value zero (P-value <1 × 10⁻²⁶). (D) We plotted gene expression versus prediction after quantile normalization across cell types for the genes ranked in the 95th, 75th, 50th, and 25th percentiles by Pearson correlation.
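The adjusted Rand index used in panel C compares two clusterings of the same items while correcting for chance agreement. The sketch below implements the standard formula from the contingency table. The bootstrap sampling and the clustering step described in the caption are not reproduced, and the function name and the handling of degenerate inputs are assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two label assignments of the same items."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # assumption: trivially identical for fewer than two items

    # Pairs of items placed together in both clusterings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each clustering separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)
    maximum = 0.5 * (together_a + together_b)
    if maximum == expected:
        return 1.0  # assumption: both clusterings are trivial and identical
    return (together_both - expected) / (maximum - expected)
```

A value of 1 means the two clusterings agree up to relabeling, and a value near 0 is what random assignments would produce, which is the null value the caption tests against.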

**Figure 4.**
Basenji identifies distal regulatory elements. (A) ENCODE enhancer annotations for *PIM1* in GM12878 specify two downstream regulatory elements. Basenji saliency scores and FDR < 0.05 peaks (see Methods) mark these elements, in addition to a variety of others that lack typical enhancer chromatin. In silico saturation mutagenesis of these elements with respect to Basenji's *PIM1* GM12878 CAGE prediction outlines the driving motifs. The quantities in the heatmap display the change in Basenji prediction “Δ pred” (summed across the sequence) after substituting the row's specified nucleotide into the sequence. The line plots display the minimum (loss) and maximum (gain) change among the possible substitutions. The upstream *cis*-regulatory module most prominently features a POU2F factor motif, while the downstream element consists solely of two adjacent PU.1 motifs. (B) We plotted the cumulative distributions of the maximum saliency score for elements of various regulatory annotation classes in GM12878 released by ENCODE. Genome-wide, each annotation class differs significantly from the background scores by Kolmogorov–Smirnov test.
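The in silico saturation mutagenesis in panel A can be sketched as a loop over every position and every nucleotide, recording the change in a scalar prediction. This is a sketch only: `toy_predict` is a hypothetical stand-in that counts occurrences of an arbitrary motif, not the trained Basenji model, and the function names are assumptions.

```python
NUCLEOTIDES = "ACGT"


def saturation_mutagenesis(seq, predict):
    """Score every single-nucleotide substitution against the reference.

    Returns (deltas, loss, gain): deltas[nt][pos] is the change in prediction
    after placing nt at pos, and loss/gain are the per-position minimum and
    maximum over the four nucleotides, as in the caption's line plots.
    """
    reference = predict(seq)
    deltas = {nt: [] for nt in NUCLEOTIDES}
    for pos in range(len(seq)):
        for nt in NUCLEOTIDES:
            mutated = seq[:pos] + nt + seq[pos + 1:]
            deltas[nt].append(predict(mutated) - reference)
    positions = range(len(seq))
    loss = [min(deltas[nt][pos] for nt in NUCLEOTIDES) for pos in positions]
    gain = [max(deltas[nt][pos] for nt in NUCLEOTIDES) for pos in positions]
    return deltas, loss, gain


def toy_predict(seq, motif="GGAA"):
    """Hypothetical model: the prediction is the number of motif occurrences."""
    width = len(motif)
    return float(sum(seq[i:i + width] == motif for i in range(len(seq) - width + 1)))


deltas, loss, gain = saturation_mutagenesis("ACGGAAGT", toy_predict)
print(loss)  # the four motif positions show a loss; flanking positions do not
```

With a real model, `predict` would return the prediction summed across the sequence for one target, so that each heatmap cell corresponds to one entry of `deltas`.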

**Figure 5.**
Basenji gene-specific variant scores enrich for eQTLs. (A) We defined SNP expression difference (SED) scores for each biallelic variant and gene combination as the difference between the model prediction for the two alleles at that gene's TSSs. (B) We computed the signed LD profile of the SED annotations (denoted by SED-LD) to more readily compare to eQTL measurements in human populations (Methods). |SED-LD| shows a strong relationship with eQTL statistics from GTEx. Here, we binned variants into five quantiles by the difference between their regression predictions including and excluding |SED-LD| and plotted the proportion of variants called significant eQTLs in pancreas. We chose five quantiles to represent the observed statistical trend parsimoniously and aesthetically. The proportion rises with greater |SED-LD| to 4.2× in the highest quantile over the average of the bottom three quantiles, which represented the median enrichment in a range of 3.2–5.8× across the 19 tissues. See Supplemental Figure S8 for all tissues and TSS-controlled analysis. (C) Plotting |SED-LD| versus the χ² statistics reveals a highly significant correlation.
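The SED score defined in panel A can be sketched as the difference between two model predictions, one per allele, restricted to the bins that contain the gene's TSSs. The sketch makes several assumptions that the caption does not settle: the sign convention (alternate minus reference), summation over TSS bins, and the toy per-bin predictor, which simply counts G and C bases in 4-bp bins. All names are hypothetical.

```python
def sed_score(ref_seq, variant_pos, alt_allele, tss_bins, predict_track):
    """SNP expression difference for one variant and one gene.

    predict_track maps a sequence to a list of per-bin predictions;
    tss_bins lists the bin indices that contain the gene's TSSs.
    """
    alt_seq = ref_seq[:variant_pos] + alt_allele + ref_seq[variant_pos + 1:]
    ref_track = predict_track(ref_seq)
    alt_track = predict_track(alt_seq)
    ref_total = sum(ref_track[b] for b in tss_bins)
    alt_total = sum(alt_track[b] for b in tss_bins)
    return alt_total - ref_total  # assumed sign convention: alt minus ref


def toy_track(seq, bin_size=4):
    """Hypothetical predictor: G+C count per bin, standing in for a model."""
    bins = [seq[i:i + bin_size] for i in range(0, len(seq), bin_size)]
    return [float(sum(base in "GC" for base in chunk)) for chunk in bins]


score = sed_score("AAAACCCCGGGGTTTT", 0, "G", [0], toy_track)
```

Because the score is computed per gene, the same variant can receive different SED values for different genes, depending on which TSS bins are summed.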

**Figure 6.**
Basenji gene-specific variant scores illuminate a multiple sclerosis–associated locus. Lead variant rs74796499 is associated with multiple sclerosis (International Multiple Sclerosis Genetics Consortium 2013). Among the credible set of linked variants, Basenji predicts that rs78461372 would alter transcription of the nearby genes *GPR65* and *GALC*. In immune cells, such as treated CD14⁺ cells depicted here, both genes are transcribed and the C>G introduces an ETS factor motif that enhances transcription. In contrast, in other cell types, e.g., in the insular cortex, where *GPR65* is far less transcribed, Basenji predicts the same motifs play a role in repressing the gene.

Cited by

Machine-guided design of cell-type-targeting cis-regulatory elements.
Gosai SJ, Castro RI, Fuentes N, Butts JC, Mouri K, Alasoadura M, Kales S, Nguyen TTL, Noche RR, Rao AS, Joy MT, Sabeti PC, Reilly SK, Tewhey R. Nature. 2024 Oct;634(8036):1211-1220. doi: 10.1038/s41586-024-08070-z. Epub 2024 Oct 23. PMID: 39443793. Free PMC article.
Computational Characterization of Undifferentially Expressed Genes with Altered Transcription Regulation in Lung Cancer.
Xin R, Cheng Q, Chi X, Feng X, Zhang H, Wang Y, Duan M, Xie T, Song X, Yu Q, Fan Y, Huang L, Zhou F. Genes (Basel). 2023 Dec 1;14(12):2169. doi: 10.3390/genes14122169. PMID: 38136991. Free PMC article.
Variants in tubule epithelial regulatory elements mediate most heritable differences in human kidney function.
Loeb GB, Kathail P, Shuai R, Chung R, Grona RJ, Peddada S, Sevim V, Federman S, Mader K, Chu A, Davitte J, Du J, Gupta AR, Ye CJ, Shafer S, Przybyla L, Rapiteanu R, Ioannidis N, Reiter JF. bioRxiv [Preprint]. 2024 Jun 22:2024.06.18.599625. doi: 10.1101/2024.06.18.599625. Update in: Nat Genet. 2024 Oct;56(10):2078-2092. doi: 10.1038/s41588-024-01904-6. PMID: 38948875. Free PMC article. Updated. Preprint.
Genome annotation across species using deep convolutional neural networks.
Khodabandelou G, Routhier E, Mozziconacci J. PeerJ Comput Sci. 2020 Jun 15;6:e278. doi: 10.7717/peerj-cs.278. eCollection 2020. PMID: 33816929. Free PMC article.
Current genomic deep learning models display decreased performance in cell type specific accessible regions.
Kathail P, Shuai RW, Chung R, Ye CJ, Loeb GB, Ioannidis NM. bioRxiv [Preprint]. 2024 Jul 10:2024.07.05.602265. doi: 10.1101/2024.07.05.602265. Update in: Genome Biol. 2024 Aug 1;25(1):202. doi: 10.1186/s13059-024-03335-2. PMID: 39026761. Free PMC article. Updated. Preprint.

Grants and funding

T32 GM007753/GM/NIGMS NIH HHS/United States
