Genome Res. 2018 May;28(5):739-750.
doi: 10.1101/gr.227819.117. Epub 2018 Mar 27.

Sequential regulatory activity prediction across chromosomes with convolutional neural networks

David R Kelley et al. Genome Res. 2018 May.

Abstract

Models for predicting phenotypic outcomes from genotypes have important applications to understanding genomic function and improving human health. Here, we develop a machine-learning system to predict cell-type-specific epigenetic and transcriptional profiles in large mammalian genomes from DNA sequence alone. By use of convolutional neural networks, this system identifies promoters and distal regulatory elements and synthesizes their content to make effective gene expression predictions. We show that model predictions for the influence of genomic variants on gene expression align well to causal variants underlying eQTLs in human populations and can be useful for generating mechanistic hypotheses to enable fine mapping of disease loci.

Figures

Figure 1.
Sequential regulatory activity prediction. DNA sequences enter the model one-hot encoded as four rows representing A, C, G, and T. The annotations are fabrications to help convey the reasons for the various elements of the architecture. We apply several layers of convolution and max pooling, similar to previous methods (Kelley et al. 2016), to obtain representations that describe 128-bp bins. To share information across large distances, we apply several layers of dilated convolutions. The purple squares indicate the columns that the convolution directly sees; the teal shade is drawn proportional to the number of operations performed on that column with respect to the center position. Dilated convolution layers are densely passed on to the final prediction layer, where a width-one convolution layer makes predictions across the sequence. We compare these predictions to the experimental counts via a Poisson regression loss function and use stochastic gradient descent with backpropagation to fit the model parameters.
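The input encoding and loss named in this caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `one_hot` and `poisson_nll` are hypothetical, treating an unknown base such as N as an all-zero column is an assumption, and the loss assumes strictly positive predicted rates.

```python
import math

BASES = "ACGT"  # row order assumed to match the caption: A, C, G, T


def one_hot(seq):
    """Encode a DNA string as 4 rows (A, C, G, T) by len(seq) columns.

    Assumption: bases outside ACGT (for example N) become all-zero columns.
    """
    seq = seq.upper()
    return [[1.0 if base == b else 0.0 for base in seq] for b in BASES]


def poisson_nll(predicted, observed):
    """Mean Poisson negative log-likelihood of observed counts.

    `predicted` holds strictly positive rates, one per bin;
    `observed` holds the matching non-negative read counts.
    """
    total = 0.0
    for rate, count in zip(predicted, observed):
        # -log P(count | rate) = rate - count*log(rate) + log(count!)
        total += rate - count * math.log(rate) + math.lgamma(count + 1)
    return total / len(observed)


encoded = one_hot("ACGTN")
loss_good = poisson_nll([2.0, 5.0, 1.0], [2, 5, 1])  # rates match counts
loss_bad = poisson_nll([5.0, 1.0, 2.0], [2, 5, 1])   # rates are shuffled
```

The loss is lowest when each predicted rate equals its observed count, which is the property a gradient-based fit exploits.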
Figure 2.
Basenji predicts diverse epigenetic and transcriptional profiles from DNA sequence. (A) The AKT2 locus exemplifies the genome-wide accuracy of Basenji predictions; gene promoters and the strongest distal regulatory elements are easily identified, with some false-positive and -negative predictions for weaker elements. For each track, the darker version on top represents the experimental coverage, and the lighter version below represents Basenji predictions. (B) We computed the variance explained (R²) for each experiment and plot here the distributions classified by data set type. Basenji predicts punctate peak data, but broad chromatin marks remain challenging. (C) For the median accuracy DNase-seq experiment, mobilized CD34 cells, we plotted the log₂ predictions versus log₂ experiment coverage in 128-bp bins. (D) For all replicated experiments, we plotted log–log Pearson correlation between the replicate experiments versus the correlation between the experiment and its replicate's prediction (averaged across replicates). Both the mean and median Basenji prediction accuracy exceed the replicate accuracy.
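The two accuracy measures in panels B through D, variance explained and Pearson correlation on log-transformed coverage, can be sketched as follows. This is an illustrative sketch only: the pseudocount of 1 inside the logarithm is an assumption (the caption does not state one), the helper names are hypothetical, and the coverage values are invented toy numbers rather than data from the paper.

```python
import math


def log2p1(values):
    """log2 transform with a pseudocount of 1 (an assumed choice)."""
    return [math.log2(v + 1.0) for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def r_squared(observed, predicted):
    """Variance explained: 1 - residual sum of squares / total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Toy per-bin coverage values, invented for illustration.
experiment = [0.0, 3.0, 15.0, 1.0, 7.0]
prediction = [0.5, 2.0, 12.0, 1.5, 6.0]

r = pearson(log2p1(experiment), log2p1(prediction))
r2 = r_squared(log2p1(experiment), log2p1(prediction))
```

Note that R² defined this way penalizes systematic offsets between prediction and experiment, whereas Pearson correlation does not.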
Figure 3.
Basenji predicts cell-type–specific gene expression. (A) We computed Pearson correlation between the log₂ prediction and experiment across all nonzero expressed test set genes for each CAGE data set. We plotted those correlations against the total number of reads aligned to test gene TSSs, which measures the relevant sequencing depth. (B) For the median accuracy cell, artery endothelial cells, we plotted the experiment coverage versus Basenji prediction. (C) For both the experimental measurement and Basenji prediction, the gene expression by CAGE data set matrix displays clusters. We measured the similarity of those clusters between the experimental and predicted data by bootstrap sampling gene subsets, clustering both the experimental and predicted data and computing the adjusted Rand index between the cluster sets (Methods). The adjusted Rand index is significantly greater than the null model value zero (P-value <1 × 10⁻²⁶). (D) We plotted gene expression versus prediction after quantile normalization across cell types for the genes ranked in the 95th, 75th, 50th, and 25th percentiles by Pearson correlation.
Figure 4.
Basenji identifies distal regulatory elements. (A) ENCODE enhancer annotations for PIM1 in GM12878 specify two downstream regulatory elements. Basenji saliency scores and FDR < 0.05 peaks (see Methods) mark these elements, in addition to a variety of others that lack typical enhancer chromatin. In silico saturation mutagenesis of these elements with respect to Basenji's PIM1 GM12878 CAGE prediction outlines the driving motifs. The quantities in the heatmap display the change in Basenji prediction “Δ pred” (summed across the sequence) after substituting the row's specified nucleotide into the sequence. The line plots display the minimum (loss) and maximum (gain) change among the possible substitutions. The upstream cis-regulatory module most prominently features a POU2F factor motif, while the downstream element consists solely of two adjacent PU.1 motifs. (B) We plotted the cumulative distributions of the maximum saliency score for elements of various regulatory annotation classes in GM12878 released by ENCODE. Genome-wide, each annotation class differs significantly from the background scores by Kolmogorov–Smirnov test.
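The in silico saturation mutagenesis described in panel A can be sketched as a loop over positions and nucleotides. This is a minimal sketch under stated assumptions, not the Basenji code: `toy_predict` is a hypothetical stand-in for a trained model that simply counts occurrences of an arbitrary motif and returns one scalar (standing in for a prediction summed across the sequence), and `saturation_mutagenesis` is an illustrative name.

```python
BASES = "ACGT"


def toy_predict(seq, motif="GGAA"):
    """Hypothetical stand-in for a trained model: counts motif occurrences."""
    width = len(motif)
    return float(sum(seq[i:i + width] == motif
                     for i in range(len(seq) - width + 1)))


def saturation_mutagenesis(seq, predict):
    """Return per-nucleotide delta rows plus per-position loss and gain.

    deltas[b][p] is predict(seq with b at position p) - predict(seq),
    so the reference nucleotide always has a delta of 0.
    """
    ref_pred = predict(seq)
    deltas = {b: [] for b in BASES}
    for pos in range(len(seq)):
        for b in BASES:
            mutated = seq[:pos] + b + seq[pos + 1:]
            deltas[b].append(predict(mutated) - ref_pred)
    positions = range(len(seq))
    loss = [min(deltas[b][p] for b in BASES) for p in positions]  # most negative change
    gain = [max(deltas[b][p] for b in BASES) for p in positions]  # most positive change
    return deltas, loss, gain


deltas, loss, gain = saturation_mutagenesis("TTGGAATT", toy_predict)
```

In this toy case the loss track dips only over the four motif positions, which is how the heatmap and line plots outline a driving motif.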
Figure 5.
Basenji gene-specific variant scores enrich for eQTLs. (A) We defined SNP expression difference (SED) scores for each biallelic variant and gene combination as the difference between the model prediction for the two alleles at that gene's TSSs. (B) We computed the signed LD profile of the SED annotations (denoted by SED-LD) to more readily compare to eQTL measurements in human populations (Methods). |SED-LD| shows a strong relationship with eQTL statistics from GTEx. Here, we binned variants into five quantiles by the difference between their regression predictions including and excluding |SED-LD| and plotted the proportion of variants called significant eQTLs in pancreas. We chose five quantiles to represent the observed statistical trend parsimoniously and aesthetically. The proportion rises with greater |SED-LD| to 4.2× in the highest quantile over the average of the bottom three quantiles, which represented the median enrichment in a range of 3.2–5.8× across the 19 tissues. See Supplemental Figure S8 for all tissues and TSS-controlled analysis. (C) Plotting |SED-LD| versus the χ² statistics reveals a highly significant correlation.
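The SED definition in panel A, a difference between model predictions for two alleles at a gene's TSSs, can be sketched as below. Several details are assumptions made for illustration: the sign convention (alternate minus reference), summing the difference over the bins that contain the gene's TSSs, and the hypothetical `toy_bin_predictions` stand-in, which scores each bin by its G/C count and therefore cannot capture the distal effects a trained model would.

```python
def toy_bin_predictions(seq, bin_size=4):
    """Hypothetical stand-in for a trained model: G/C count per bin."""
    return [float(sum(c in "GC" for c in seq[i:i + bin_size]))
            for i in range(0, len(seq), bin_size)]


def sed_score(ref_seq, pos, alt_allele, tss_bins, predict):
    """SNP expression difference for one variant and one gene.

    Assumptions: the score is alternate minus reference, summed over
    `tss_bins`, the indices of the bins that contain the gene's TSSs.
    """
    alt_seq = ref_seq[:pos] + alt_allele + ref_seq[pos + 1:]
    ref_pred = predict(ref_seq)
    alt_pred = predict(alt_seq)
    return sum(alt_pred[b] - ref_pred[b] for b in tss_bins)


reference = "ATATATATGCGCATAT"
# C>A substitution at position 9, scored at a gene whose TSS lies in bin 2.
score_near = sed_score(reference, 9, "A", [2], toy_bin_predictions)
# The same variant scored at a gene whose TSS lies in bin 0.
score_far = sed_score(reference, 9, "A", [0], toy_bin_predictions)
```

Because the score is defined per variant and gene pair, the same variant receives a different value for each gene, as the two calls show.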
Figure 6.
Basenji gene-specific variant scores illuminate a multiple sclerosis–associated locus. Lead variant rs74796499 is associated with multiple sclerosis (International Multiple Sclerosis Genetics Consortium 2013). Among the credible set of linked variants, Basenji predicts that rs78461372 would alter transcription of the nearby genes GPR65 and GALC. In immune cells, such as treated CD14+ cells depicted here, both genes are transcribed and the C>G introduces an ETS factor motif that enhances transcription. In contrast, in other cell types, e.g., in the insular cortex, where GPR65 is far less transcribed, Basenji predicts the same motifs play a role in repressing the gene.

References

The 1000 Genomes Project Consortium. 2015. A global reference for human genetic variation. Nature 526: 68–74.
Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, et al. 2016. TensorFlow: a system for large-scale machine learning. In USENIX symposium on operating systems design and implementation, Savannah, GA.
Albert FW, Kruglyak L. 2015. The role of regulatory variation in complex traits and disease. Nat Rev Genet 16: 197–212.
Alipanahi B, Delong A, Weirauch MT, Frey BJ. 2015. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol 33: 831–838.
Ayoubi TA, Van De Ven WJ. 1996. Regulation of gene expression by alternative promoters. FASEB J 10: 453–460.