Uncovering tissue-specific binding features from differential deep learning

Nucleic Acids Res. 2020 Mar 18;48(5):e27. doi: 10.1093/nar/gkaa009.

Mike Phuycharoen¹, Peyman Zarrineh², Laure Bridoux³, Shilu Amin³, Marta Losa⁴, Ke Chen¹, Nicoletta Bobola³, Magnus Rattray²

Affiliations

¹ Department of Computer Science, The University of Manchester, Oxford Rd, Manchester M13 9PL, UK.
² School of Health Sciences, The University of Manchester, Oxford Rd, Manchester M13 9PL, UK.
³ School of Medical Sciences, The University of Manchester, Oxford Rd, Manchester M13 9PL, UK.
⁴ Department of Orofacial Sciences and Department of Anatomy, University of California San Francisco, 513 Parnassus Avenue, HSW 740, San Francisco, CA 94143, USA.

PMID: 31974574
PMCID: PMC7049686
DOI: 10.1093/nar/gkaa009

Mike Phuycharoen et al. Nucleic Acids Res. 2020.

Abstract

Transcription factors (TFs) can bind DNA in a cooperative manner, enabling a mutual increase in occupancy. Through this type of interaction, alternative binding sites can be preferentially bound in different tissues to regulate tissue-specific expression programmes. Recently, deep learning models have become state-of-the-art in various pattern analysis tasks, including applications in the field of genomics. We therefore investigate the application of convolutional neural network (CNN) models to the discovery of sequence features determining cooperative and differential TF binding across tissues. We analyse ChIP-seq data for MEIS TFs, which are broadly expressed across mouse branchial arches, and for HOXA2, which is expressed in the second and more posterior branchial arches. By developing models predictive of MEIS differential binding in all three tissues (BA1, BA2 and PBA), we are able to accurately predict HOXA2 co-binding sites. We evaluate transfer-like and multitask approaches to regularizing the high-dimensional classification task with a larger regression dataset, allowing for the creation of deeper and more accurate models. We test the performance of perturbation and gradient-based attribution methods in identifying the HOXA2 sites from differential MEIS data. Our results show that deep regularized models significantly outperform shallow CNNs as well as k-mer methods in the discovery of tissue-specific sites bound in vivo.


Figures

**Figure 1.**
(A) Location of BA tissues marked on a cartoon of mouse during embryonic development. (B) Amount of RNA measured by RNA-seq in BA tissues. For ChIP-seq experiments, a pan-MEIS antibody was used to immunoprecipitate MEIS 1-3; CPM, counts per million sequenced reads.

**Figure 2.**
(A) MEIS RPKM (reads per kilobase of transcript per million mapped reads) regression features attributed by a deep model using mutagenesis. Values indicate the sum of predicted RPKM change if the base was mutated to all of its alternatives. Each line indicates features of a single replicate output for a tissue. Colours match cartoon labelling in Figure 1A. MEIS-binding site (CTGTCAG) is a feature in all tissues. (B) BA1-downbinding features from a differential model. A dimeric site containing HOXA2 and MEIS binding motifs is identified as a differential feature enhancing MEIS binding in BA2 and PBA, but not BA1.
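The mutagenesis attribution described in this caption (summing the predicted change when a base is mutated to each of its alternatives) can be sketched as follows. This is a minimal illustration only: `toy_model` is a hypothetical stand-in that counts occurrences of the MEIS site CTGTCAG, not the trained CNN from the paper, and the function names are assumptions.

```python
BASES = "ACGT"

def toy_model(seq: str) -> float:
    """Hypothetical stand-in for a trained regression model.

    Scores 1.0 per occurrence of the MEIS site CTGTCAG; a real model
    would return a predicted RPKM value instead.
    """
    motif = "CTGTCAG"
    return float(sum(seq[i:i + len(motif)] == motif
                     for i in range(len(seq) - len(motif) + 1)))

def mutagenesis_attribution(seq, model):
    """For each position, sum the predicted change over the three alternative bases."""
    reference = model(seq)
    scores = []
    for i, base in enumerate(seq):
        total = 0.0
        for alt in BASES:
            if alt == base:
                continue  # only the three alternatives are scored
            mutated = seq[:i] + alt + seq[i + 1:]
            total += model(mutated) - reference
        scores.append(total)
    return scores

seq = "AACTGTCAGTT"
scores = mutagenesis_attribution(seq, toy_model)
```

With this toy model, positions inside the motif receive a strongly negative score because every substitution removes the site, while flanking positions score zero.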

**Figure 3.**
(A) Schematic of a deep architecture. Pooling and dilation blocks are repeated a desired number of times, with the outputs of the dilation blocks concatenated. The number of blocks and other hyper-parameters are automatically optimized during model selection. Input is a one-hot encoded nucleotide sequence. One or more outputs can be specified for classification and/or regression with task-dependent activations and loss functions. (B–D) Modes of regularization in which latent variables of a larger dataset (regression, N targets) are used to regularize training with a smaller dataset (classification, C classes). (B) Typical use of transfer learning in a 1-layer CNN. A convolutional layer is copied and frozen for training of the second model, allowing for inference in terms of previously learned intermediate latent variables. (C) The serial architecture uses the output of the trained model as its input, performing non-linear weighting of regression targets for classification. (D) The parallel architecture alternates between training model outputs in each batch. In this architecture, latent variables are shared throughout the model depth. The non-differential target is shared between up-binding and down-binding task outputs if both are used.
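The one-hot input encoding mentioned in panel A can be illustrated with a short sketch. The channel order A, C, G, T and the all-zero row for ambiguous bases such as N are assumptions made here for illustration; the caption does not specify either.

```python
ALPHABET = "ACGT"  # assumed channel order

def one_hot(seq):
    """Encode a nucleotide string as a list of 4-element rows.

    Bases outside ACGT (for example N) become an all-zero row,
    which is an assumption of this sketch.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in seq.upper():
        row = [0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows

encoded = one_hot("ACGTN")
```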

**Figure 4.**
Test set precision-recall curves for the down-binding task. 1-layer CNN and deep, direct models were trained with classification dataset only. Transfer, parallel and serial models used MEIS regression data for regularization.
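For readers unfamiliar with the metric, a precision-recall curve can be computed from binary labels and model scores roughly as below. This is a generic sketch, not the evaluation code used in the paper; it ranks examples by score and ignores tied scores for simplicity.

```python
def precision_recall_curve(labels, scores):
    """Return (precision, recall) pairs, one per rank cutoff.

    labels: 1 for positive (e.g. a down-bound region), 0 otherwise.
    scores: model confidence for the positive class.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    curve = []
    for rank, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        curve.append((true_pos / rank, true_pos / total_pos))
    return curve

curve = precision_recall_curve([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```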

**Figure 5.**
(A) Validation loss of MEIS RPKM model selection when the 1× bottleneck layer is omitted, or used with linear or ReLU activation. (B) Validation loss for varying amount of dimensionality reduction (proportion of channels of preceding layer) caused by the bottleneck using ReLU activation. (C) Validation loss as a function of total number of convolutional layers (including 1× with ReLU activation) for MEIS RPKM regression model.

**Figure 6.**
Proportion of the most confident features identified by differential analysis that pass a Poisson test for alignment with both HOXA2 ChIP-seq replicates. Regions labelled as BA1-down are tested. The single strongest feature in each region is selected. Random indicates the chance of a randomly selected location in the regions passing the Poisson test. (A) Comparison of CNNs with k-mer counting. Mutagenesis is used with CNN models. Homer known indicates using Homer with published HOXA2 ChIP-seq data (shown for reference). (B) Attribution method comparison using the 1-layer CNN. (C) Attribution method comparison using the deep parallel 2-task model.
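One way such a Poisson test could work is sketched below: a location passes if its read count is unlikely under a background Poisson rate in both replicates. The exact test statistic, window size and background model are not given on this page, so `background_rate` and the function names are assumptions; the 0.05 threshold follows the Figure 9 caption.

```python
import math

def poisson_sf(k, lam):
    """Upper tail P(X >= k) for X ~ Poisson(lam)."""
    cdf_below_k = sum(math.exp(-lam) * lam ** i / math.factorial(i)
                      for i in range(k))
    return 1.0 - cdf_below_k

def passes_both_replicates(counts, background_rate, alpha=0.05):
    """True if every replicate's read count is significant at level alpha.

    counts: observed HOXA2 read counts at the feature, one per replicate.
    background_rate: hypothetical expected count under no binding.
    """
    return all(poisson_sf(c, background_rate) < alpha for c in counts)

strong = passes_both_replicates([12, 9], background_rate=2.0)
weak = passes_both_replicates([12, 3], background_rate=2.0)
```

Requiring significance in both replicates makes the validation more conservative than testing pooled reads.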

**Figure 7.**
(A) MEIS ChIP-seq profiles in a region differentially down-bound in BA1 compared to BA2 and PBA. (B) Nucleotide-level mutation map (and its 1D channel sum), shown in the central region marked with black rectangles. Attribution of MEIS BA1-down differential class using 2-task parallel model identifies HOXA2-binding site (ATCAATC). (C) Reference HOXA2 ChIP-seq profile (not used for model training).

**Figure 8.**
(A) MEIS ChIP-seq profiles in a region differentially up-bound in PBA compared to BA1 and BA2. (B) GATA-binding site (AGATAAG) is identified as a feature of differential MEIS up-binding in PBA. Attribution was performed using mutagenesis and 3-task parallel model, and shown in the central region marked with black rectangles. (C) Reference GATA ChIP-seq profile in PBA (not used for model training).

**Figure 9.**
Overfitting effects in BA1-down mutagenesis attribution validated with HOXA2 ChIP-seq (Poisson test, P < 0.05 in two ChIP replicates). Five models were trained, each holding out a different fold of randomly shuffled data. Held out indicates that each peak was attributed with the model that held out that region during training. Ensemble indicates using the mean attribution from all models. All data indicates using a single model trained on all the data.
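The held-out and ensemble settings compared in this figure can be sketched as follows. This is an illustrative outline only: the fold assignment, the seed, and the representation of attributions as per-model score lists are assumptions, not details from the paper.

```python
import random

def assign_folds(n_regions, n_folds=5, seed=0):
    """Randomly shuffle regions and assign each to one of n_folds folds."""
    order = list(range(n_regions))
    random.Random(seed).shuffle(order)
    fold_of = [0] * n_regions
    for rank, region in enumerate(order):
        fold_of[region] = rank % n_folds
    return fold_of

def held_out_attribution(per_model_scores, fold):
    """Use only the model that never saw this region during training."""
    return per_model_scores[fold]

def ensemble_attribution(per_model_scores):
    """Mean attribution per position across all models."""
    n_models = len(per_model_scores)
    return [sum(column) / n_models for column in zip(*per_model_scores)]

fold_of = assign_folds(10)
mean_scores = ensemble_attribution([[1.0, 2.0], [3.0, 4.0]])
held_out = held_out_attribution([[1.0, 2.0], [3.0, 4.0]], fold=1)
```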


Cited by

HOX paralogs selectively convert binding of ubiquitous transcription factors into tissue-specific patterns of enhancer activation.
Bridoux L, Zarrineh P, Mallen J, Phuycharoen M, Latorre V, Ladam F, Losa M, Baker SM, Sagerstrom C, Mace KA, Rattray M, Bobola N. PLoS Genet. 2020 Dec 14;16(12):e1009162. doi: 10.1371/journal.pgen.1009162. eCollection 2020 Dec. PMID: 33315856.
Host-pathogen protein-nucleic acid interactions: A comprehensive review.
Jain A, Mittal S, Tripathi LP, Nussinov R, Ahmad S. Comput Struct Biotechnol J. 2022 Aug 4;20:4415-4436. doi: 10.1016/j.csbj.2022.08.001. eCollection 2022. PMID: 36051878. Review.
JASPAR 2024: 20th anniversary of the open-access database of transcription factor binding profiles.
Rauluseviciute I, Riudavets-Puig R, Blanc-Mathieu R, Castro-Mondragon JA, Ferenc K, Kumar V, Lemma RB, Lucas J, Chèneby J, Baranasic D, Khan A, Fornes O, Gundersen S, Johansen M, Hovig E, Lenhard B, Sandelin A, Wasserman WW, Parcy F, Mathelier A. Nucleic Acids Res. 2024 Jan 5;52(D1):D174-D182. doi: 10.1093/nar/gkad1059. PMID: 37962376.
Identifying transcription factors with cell-type specific DNA binding signatures.
Awdeh A, Turcotte M, Perkins TJ. BMC Genomics. 2024 Oct 14;25(1):957. doi: 10.1186/s12864-024-10859-1. PMID: 39402535.
Transcription factor-binding k-mer analysis clarifies the cell type dependency of binding specificities and cis-regulatory SNPs in humans.
Tahara S, Tsuchiya T, Matsumoto H, Ozaki H. BMC Genomics. 2023 Oct 7;24(1):597. doi: 10.1186/s12864-023-09692-9. PMID: 37805453.


