Nucleic Acids Res. 2020 Mar 18;48(5):e27.
doi: 10.1093/nar/gkaa009.

Uncovering tissue-specific binding features from differential deep learning

Mike Phuycharoen et al.

Abstract

Transcription factors (TFs) can bind DNA in a cooperative manner, enabling a mutual increase in occupancy. Through this type of interaction, alternative binding sites can be preferentially bound in different tissues to regulate tissue-specific expression programmes. Recently, deep learning models have become state-of-the-art in various pattern analysis tasks, including applications in the field of genomics. We therefore investigate the application of convolutional neural network (CNN) models to the discovery of sequence features determining cooperative and differential TF binding across tissues. We analyse ChIP-seq data from MEIS, TFs which are broadly expressed across mouse branchial arches, and HOXA2, which is expressed in the second and more posterior branchial arches. By developing models predictive of MEIS differential binding in all three tissues, we are able to accurately predict HOXA2 co-binding sites. We evaluate transfer-like and multitask approaches to regularizing the high-dimensional classification task with a larger regression dataset, allowing for the creation of deeper and more accurate models. We test the performance of perturbation and gradient-based attribution methods in identifying the HOXA2 sites from differential MEIS data. Our results show that deep regularized models significantly outperform shallow CNNs as well as k-mer methods in the discovery of tissue-specific sites bound in vivo.

Figures

Figure 1.
(A) Location of branchial arch (BA) tissues marked on a cartoon of a mouse during embryonic development. (B) Amount of RNA measured by RNA-seq in BA tissues. For ChIP-seq experiments, a pan-MEIS antibody was used to immunoprecipitate MEIS 1-3; CPM, counts per million sequenced reads.
Figure 2.
(A) MEIS RPKM (reads per kilobase of transcript per million mapped reads) regression features attributed by a deep model using mutagenesis. Values indicate the sum of predicted RPKM change if the base was mutated to all of its alternatives. Each line indicates features of a single replicate output for a tissue. Colours match cartoon labelling in Figure 1A. MEIS-binding site (CTGTCAG) is a feature in all tissues. (B) BA1-downbinding features from a differential model. A dimeric site containing HOXA2 and MEIS binding motifs is identified as a differential feature enhancing MEIS binding in BA2 and PBA, but not BA1.
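The mutagenesis attribution described in this caption can be illustrated with a minimal sketch. It assumes a one-hot encoded input and scores each position by summing the change in prediction over its three alternative bases. The `toy_model` function is a hypothetical stand-in for a trained network (it simply counts exact CTGTCAG matches), and all function names are illustrative rather than taken from the paper.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot array in A, C, G, T order."""
    x = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq):
        x[i, BASES.index(base)] = 1.0
    return x

def toy_model(x):
    """Hypothetical stand-in for a trained model: counts exact CTGTCAG matches."""
    motif = one_hot("CTGTCAG")
    n = len(motif)
    hits = [float((x[i:i + n] * motif).sum() == n) for i in range(len(x) - n + 1)]
    return float(sum(hits))

def mutagenesis_scores(x, predict):
    """Per position, sum the prediction change over the three alternative bases."""
    reference = predict(x)
    scores = np.zeros(len(x))
    for pos in range(len(x)):
        original = int(x[pos].argmax())
        for alt in range(4):
            if alt == original:
                continue
            mutated = x.copy()
            mutated[pos] = 0.0
            mutated[pos, alt] = 1.0
            scores[pos] += predict(mutated) - reference
    return scores

seq = "AACTGTCAGTT"
scores = mutagenesis_scores(one_hot(seq), toy_model)
# Positions 2-8 (the CTGTCAG site) each score -3.0; flanking positions score 0.0.
```

With a real model, `predict` would return the predicted RPKM (or class score) for one output, so a strongly negative score marks a base whose mutation reduces predicted binding.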
Figure 3.
(A) Schematic of a deep architecture. Pooling and dilation blocks are repeated the desired number of times, with the outputs of the dilation blocks concatenated. The number of blocks and other hyper-parameters are optimized automatically during model selection. The input is a one-hot encoded nucleotide sequence. One or more outputs can be specified for classification and/or regression, with task-dependent activations and loss functions. (B–D) Modes of regularization in which latent variables of a larger dataset (regression, N targets) are used to regularize training with a smaller dataset (classification, C classes). (B) Typical use of transfer learning in a 1-layer CNN. A convolutional layer is copied and frozen for training of the second model, allowing for inference in terms of previously learned intermediate latent variables. (C) The serial architecture uses the output of the trained model as its input, performing non-linear weighting of regression targets for classification. (D) The parallel architecture alternates between training model outputs in each batch. In this architecture, latent variables are shared throughout model depth. The non-differential target is shared between up-binding and down-binding task outputs if both are used.
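As a rough illustration of the building blocks named in panel A, the sketch below one-hot encodes a sequence and applies dilated convolutions whose outputs are concatenated along the channel axis. It is a NumPy toy under stated assumptions, not the paper's architecture: the kernel width, channel count, zero padding, random weights, and the choice to feed every dilated layer the same input are all hypothetical.

```python
import numpy as np

def one_hot(seq, alphabet="ACGT"):
    """Encode a DNA string as a (length, 4) one-hot array."""
    x = np.zeros((len(seq), len(alphabet)), dtype=np.float32)
    for i, base in enumerate(seq):
        x[i, alphabet.index(base)] = 1.0
    return x

def dilated_conv1d(x, kernel, dilation=1):
    """Valid 1D cross-correlation.

    x: (length, in_channels); kernel: (width, in_channels, out_channels).
    """
    width, _, out_channels = kernel.shape
    span = (width - 1) * dilation + 1          # receptive field of one layer
    out_len = x.shape[0] - span + 1
    out = np.zeros((out_len, out_channels), dtype=np.float32)
    for k in range(width):
        start = k * dilation
        out += x[start:start + out_len] @ kernel[k]
    return out

def dilation_block(x, kernels, dilations):
    """Apply each dilated layer to zero-padded x and concatenate the outputs."""
    outputs = []
    for kernel, d in zip(kernels, dilations):
        pad = (kernel.shape[0] - 1) * d // 2   # keeps length for odd widths
        padded = np.pad(x, ((pad, pad), (0, 0)))
        outputs.append(np.maximum(dilated_conv1d(padded, kernel, d), 0.0))  # ReLU
    return np.concatenate(outputs, axis=1)

rng = np.random.default_rng(0)
x = one_hot("ACGTACGTAC")
dilations = [1, 2, 4]
kernels = [rng.standard_normal((3, 4, 8)).astype(np.float32) for _ in dilations]
features = dilation_block(x, kernels, dilations)   # shape (10, 24)
```

Increasing the dilation widens the span of sequence each filter sees without adding parameters, which is one common motivation for dilated layers in sequence models.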
Figure 4.
Test set precision-recall curves for the down-binding task. 1-layer CNN and deep, direct models were trained with classification dataset only. Transfer, parallel and serial models used MEIS regression data for regularization.
Figure 5.
(A) Validation loss of MEIS RPKM model selection when the 1× bottleneck layer is omitted, or used with linear or ReLU activation. (B) Validation loss for varying amounts of dimensionality reduction (proportion of channels of the preceding layer) caused by the bottleneck using ReLU activation. (C) Validation loss as a function of the total number of convolutional layers (including 1× with ReLU activation) for the MEIS RPKM regression model.
Figure 6.
Proportion of the most confident features identified by differential analysis that pass a Poisson test for alignment with both HOXA2 ChIP-seq replicates. Regions labelled as BA1-down are tested. The single strongest feature in each region is selected. Random indicates the chance that a randomly selected location in the regions passes the Poisson test. (A) Comparison of CNNs with k-mer counting. Mutagenesis is used with CNN models. Homer known indicates using Homer with published HOXA2 ChIP-seq data (shown for reference). (B) Attribution method comparison using a 1-layer CNN. (C) Attribution method comparison using a deep parallel 2-task model.
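The caption does not say how the Poisson test is constructed, so the following is only one plausible reading: compare the read count observed at a feature with an expected background count, and require the upper-tail probability to fall below a threshold in every replicate. The function names, the background rates and the default threshold of 0.05 are assumptions made for illustration.

```python
import math

def poisson_upper_tail(observed, expected):
    """Return P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    term = math.exp(-expected)      # P(X = 0)
    cdf = term
    for i in range(1, observed):    # accumulate P(X = 1) .. P(X = observed - 1)
        term *= expected / i
        cdf += term
    return max(0.0, 1.0 - cdf)

def passes_in_all_replicates(counts, expected, alpha=0.05):
    """True if every replicate's count is significant at level alpha."""
    return all(poisson_upper_tail(c, e) < alpha for c, e in zip(counts, expected))

# Hypothetical read counts at one feature in two replicates,
# against hypothetical background expectations.
enriched = passes_in_all_replicates([12, 15], [3.0, 4.0])
not_enriched = passes_in_all_replicates([12, 5], [3.0, 4.0])
```

Requiring significance in both replicates is stricter than testing pooled reads, since a feature supported by only one replicate is rejected.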
Figure 7.
(A) MEIS ChIP-seq profiles in a region differentially down-bound in BA1 compared to BA2 and PBA. (B) Nucleotide-level mutation map (and its 1D channel sum), shown in the central region marked with black rectangles. Attribution of MEIS BA1-down differential class using 2-task parallel model identifies HOXA2-binding site (ATCAATC). (C) Reference HOXA2 ChIP-seq profile (not used for model training).
Figure 8.
(A) MEIS ChIP-seq profiles in a region differentially up-bound in PBA compared to BA1 and BA2. (B) GATA-binding site (AGATAAG) is identified as a feature of differential MEIS up-binding in PBA. Attribution was performed using mutagenesis and 3-task parallel model, and shown in the central region marked with black rectangles. (C) Reference GATA ChIP-seq profile in PBA (not used for model training).
Figure 9.
Overfitting effects in BA1-down mutagenesis attribution validated with HOXA2 ChIP-seq (Poisson test, P < 0.05 in two ChIP replicates). Five models were trained, each holding out a different fold of randomly shuffled data. Held out indicates that each peak was attributed with the model that held out that region during training. Ensemble indicates the mean attribution from all models. All data indicates a single model trained on all the data.
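The difference between the held-out and ensemble settings can be sketched as an indexing exercise. The example assumes attributions have already been computed by each of the five models for every region and stored in a single array; the array layout, function names and placeholder values are hypothetical.

```python
import numpy as np

def assign_folds(n_regions, n_folds=5, seed=0):
    """Randomly shuffle regions and assign each to one of n_folds folds."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_regions)
    folds = np.empty(n_regions, dtype=int)
    folds[order] = np.arange(n_regions) % n_folds
    return folds

def held_out_attribution(attributions, folds):
    """Pick, for each region, the attribution from the model that never saw it.

    attributions: (n_models, n_regions, length); model m held out fold m.
    """
    return attributions[folds, np.arange(len(folds))]

def ensemble_attribution(attributions):
    """Mean attribution over all models, including those trained on the region."""
    return attributions.mean(axis=0)

n_models, n_regions, length = 5, 10, 4
folds = assign_folds(n_regions, n_folds=n_models)
# Placeholder attributions: model m outputs the constant m everywhere.
attributions = np.broadcast_to(
    np.arange(n_models, dtype=float)[:, None, None], (n_models, n_regions, length)
).copy()
held_out = held_out_attribution(attributions, folds)   # shape (10, 4)
ensemble = ensemble_attribution(attributions)          # shape (10, 4)
```

Because the held-out selection never uses a model that trained on the region being attributed, comparing it with the ensemble and all-data settings isolates the effect of overfitting on attribution.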

References

    1. Zhang Y., Liu T., Meyer C.A., Eeckhoute J., Johnson D.S., Bernstein B.E., Nussbaum C., Myers R.M., Brown M., Li W. et al. Model-based analysis of ChIP-Seq (MACS). Genome Biology. 2008; 9:R137.
    2. Bailey T.L., Boden M., Buske F.A., Frith M., Grant C.E., Clementi L., Ren J., Li W.W., Noble W.S. MEME Suite: Tools for motif discovery and searching. Nucleic Acids Res. 2009; 37:W202–W208.
    3. Heinz S., Benner C., Spann N., Bertolino E., Lin Y.C., Laslo P., Cheng J.X., Murre C., Singh H., Glass C.K. Simple combinations of Lineage-Determining transcription factors prime cis-Regulatory elements required for macrophage and B cell identities. Mol. Cell. 2010; 38:576–589.
    4. Guo Y., Mahony S., Gifford D.K. High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints. PLoS Computat. Biol. 2012; 8:e1002638.
    5. Guo Y., Tian K., Zeng H., Guo X., Gifford D.K. A novel k-mer set memory (KSM) motif representation improves regulatory variant prediction. Genome Res. 2018; 28:891–900.