Predicting the human epigenome from DNA motifs

Abstract

The epigenome is established and maintained by the site-specific recruitment of chromatin-modifying enzymes and their cofactors. Identifying the cis elements that regulate epigenomic modification is critical for understanding the regulatory mechanisms that control gene expression patterns. We present Epigram, an analysis pipeline that predicts histone modification and DNA methylation patterns from DNA motifs. The identified cis elements represent interactions with the site-specific DNA-binding factors that establish and maintain epigenomic modifications. We cataloged the cis elements in embryonic stem cells and four derived lineages and found numerous motifs that have location preference, such as at the center of H3K27ac or at the edges of H3K4me3 and H3K9me3, which provides mechanistic insight about the shaping of the epigenome. The Epigram pipeline and predictive motifs are at http://wanglab.ucsd.edu/star/epigram/.

Figure 1: Identifying motifs that are predictive of epigenomic modifications.
Figure 2: Predicting epigenomic modification from DNA motifs.
Figure 3: The specificities of interplay between DNA motifs and the epigenome.
Figure 4: Predictive motifs have location preferences.
Figure 5: De novo motif disruption and H3K27ac levels are correlated.

Acknowledgements

This work was partially supported by the US National Institutes of Health (U01 ES017166 to W.W., principal investigator, B. Ren). The authors wish to thank B. Ren, D.R. Westhead and M.H. Sherman for discussion of this work. We are grateful to M. Snyder for providing the SNP data of the 19 individuals.

Author information

Contributions

J.W.W. and W.W. conceived of and designed the project, J.W.W. performed all the analyses, Z.C. contributed to data analysis, W.W. analyzed the data, and J.W.W. and W.W. wrote the manuscript.

Corresponding author

Correspondence to Wei Wang.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Cross-validation procedures used to assess the prediction of epigenome from DNA motifs.

A schematic of the cross-validation procedures used to validate our predictions. The standard five-fold cross-validation procedure is shown at the top left. Beneath it, we show how test sequences are shuffled during cross-validation. This control ensures that the prediction performance comes from the motifs and not from simple sequence biases, such as GC content. At the top right, we show how sequences that are removed during sequence-set balancing (SSB) are still tested, by training the model on the entire set of sequences that were selected during SSB.
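The shuffled-test control described in this legend can be sketched as follows. This is a minimal illustration, not the Epigram implementation: the function names, the toy sequences, and the choice of a simple per-sequence base permutation are assumptions. The point it shows is that permuting the bases of a test sequence destroys motifs while leaving GC content unchanged.

```python
import random

def five_fold_splits(n_items, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = random.Random(seed)
    order = list(range(n_items))
    rng.shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k disjoint folds covering all items
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train, test

def shuffle_sequence(seq, rng):
    """Permute the bases of one sequence (assumed control; composition is preserved)."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

# Toy sequences, purely illustrative
sequences = ["ACGTGGCCAT", "TTTTAACGCG", "GGGCCCATAT", "ATATATATCG", "CGCGCGTTAA",
             "ACACACGTGT", "TGCATGCATG", "GATTACAGAT", "CCCCGGGGAA", "AAGGTTCCAA"]
rng = random.Random(1)
for train_idx, test_idx in five_fold_splits(len(sequences)):
    real_test = [sequences[i] for i in test_idx]
    shuffled_test = [shuffle_sequence(s, rng) for s in real_test]
    # A model trained on train_idx would be scored on both real_test and
    # shuffled_test; performance that survives shuffling reflects sequence bias.
```

If a model scores the shuffled test sequences no better than chance, its performance on the real test sequences can be attributed to motif content rather than to base composition.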

Supplementary Figure 2 The effect of SSB on prediction performance.

The left ROC curve shows the prediction performance when the SSB step is carried out. The right ROC curve shows the performance on the same dataset but with the omission of the SSB step.
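For readers unfamiliar with the summary statistic behind ROC curves, the area under the curve (AUC) can be computed directly from prediction scores. The sketch below uses the standard rank-based definition (the probability that a randomly chosen positive scores higher than a randomly chosen negative); the function name and the example scores are illustrative assumptions and are not taken from the study.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve from raw scores; ties count as half a win."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical classifier scores for modified (positive) and background (negative) regions
print(auc([0.9, 0.3], [0.5, 0.1]))  # 0.75
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation of the two sets.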

Supplementary Figure 3 The levels of histone modification ChIP-seq reads inside peaks and background regions in H1.

The bar plot shows the AUC from the ‘single mark analysis’ in H1. The violin plots show the normalized ChIP-seq read counts (ChIP RPKM − input RPKM) for each of the modifications, both inside and outside the modification peaks. The read counts were adjusted for region size and total reads, and the corresponding input score was then subtracted. The violin plots with dotted borderlines show the enrichment within modification peaks, while the violin plots with solid borderlines show the enrichment outside modification peaks.
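The normalization in this legend can be illustrated with a short calculation. The sketch assumes the conventional RPKM definition (reads per kilobase of region per million mapped reads); the function names and the read counts are hypothetical.

```python
def rpkm(read_count, region_length_bp, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads."""
    return read_count * 1e9 / (region_length_bp * total_mapped_reads)

def normalized_signal(chip_reads, chip_total, input_reads, input_total, region_length_bp):
    """ChIP RPKM minus input RPKM for one region."""
    chip = rpkm(chip_reads, region_length_bp, chip_total)
    control = rpkm(input_reads, region_length_bp, input_total)
    return chip - control

# Hypothetical 1-kb region: 200 ChIP reads of 20 million, 50 input reads of 25 million
print(normalized_signal(200, 20_000_000, 50, 25_000_000, 1000))  # 8.0
```

Here the ChIP track gives 10 RPKM and the input track 2 RPKM, so the normalized signal is 8.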

Supplementary Figure 4 Controls to confirm DNA motifs are predictive of histone modification.

(a) As Fig. 2a–b but showing the results of the ‘mark-specific analysis’. In the left-hand schematic, differently colored stars represent different histone modifications. (b) As Fig. 2a–c but showing the results of the ‘typical background analysis’. (c) As Fig. 2a–b but showing the results of the ‘cell-type-specific analysis’. In the left-hand schematic, pink represents the epigenome of a different cell type. The middle plot shows the H1 and MSC comparison. (d) The average AUC for each histone modification in each of the four analyses.

Supplementary Figure 5 Comparison of cell type–specific analyses.

The average cell-type-specific prediction performance for each mark is shown. On the left are the results from comparing H1 to each of the four derived cell types. On the right is the performance when H1 is compared to IMR90 and seven ENCODE cell types (A549, CD14+, GM12878, HeLa, HepG2, HUVEC and K562).

Supplementary Figure 6 Combined clustering of cell type– and modification–specific interplay between DNA motifs and the epigenome.

The heat map shows the clustering of 589 motif groups. The dendrogram shows the motif groups clustered by their interplay with epigenomic modification and cell type. Motifs that are enriched in modification peaks and motifs that are depleted from them are both shown. On the side of the heat map furthest from the dendrogram, a histogram shows the number of motifs in each cluster. Furthest from the dendrogram are two plots that show the locations of (i) motif groups that contain both positive and negative interplay for the same modification and (ii) motif groups that correspond to certain groups of proteins that share DNA-binding motifs.

Supplementary Figure 7 A modification-specific summary of interplay with DNA motifs.

Schematics represent the types of interplay that were identified for each of the histone modifications. Pie charts represent the proportion of a modification's predictive motifs that are unique to that modification or that overlap with other modifications.

Supplementary Figure 8 Overview of the motif location preference profiles.

A motif's profile is constructed using the complete (pre-SSB) set of sequences that were identified for that particular cell type, modification and analysis type.

Supplementary Figure 9 Sequence-set balancing (SSB).

The figure illustrates the SSB process. (a) The sequences from each set are separately binned by region length and GC content. In the figure, only a subset of the bins is shown: region lengths from 500 to 700 bp and GC content from 45% to 47%. (b) Bins with uneven numbers of sequences are highlighted in red. (c) Sequences are randomly removed from bins that possess more sequences than their corresponding bin in the other set.
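The three steps in this legend can be sketched in a few lines. This is an illustration under stated assumptions, not the Epigram code: the function names are hypothetical, and the bin widths (100 bp for length, 2 percentage points for GC content) are placeholders, since the legend does not state the widths used.

```python
import random
from collections import defaultdict

def gc_percent(seq):
    """GC content of a sequence as a percentage."""
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def bin_key(seq, length_step=100, gc_step=2):
    """Assign a sequence to a (length bin, GC bin); the bin widths are assumptions."""
    return (len(seq) // length_step, int(gc_percent(seq) // gc_step))

def balance_sets(set_a, set_b, seed=0):
    """Randomly drop sequences so both sets hold equal counts in every bin."""
    rng = random.Random(seed)
    bins_a, bins_b = defaultdict(list), defaultdict(list)
    for s in set_a:
        bins_a[bin_key(s)].append(s)
    for s in set_b:
        bins_b[bin_key(s)].append(s)
    kept_a, kept_b = [], []
    for key in sorted(set(bins_a) | set(bins_b)):
        in_a, in_b = bins_a.get(key, []), bins_b.get(key, [])
        n = min(len(in_a), len(in_b))  # the smaller bin sets the quota
        kept_a.extend(rng.sample(in_a, n))
        kept_b.extend(rng.sample(in_b, n))
    return kept_a, kept_b

# Toy sets, purely illustrative
foreground = ["GC" * 5, "GC" * 5, "AT" * 5, "ACGT" * 3]
background = ["CG" * 5, "TA" * 5, "TA" * 5]
kept_fg, kept_bg = balance_sets(foreground, background)
print(len(kept_fg), len(kept_bg))  # 2 2
```

After balancing, the two sets have identical joint distributions of region length and GC content at the chosen bin resolution, so a classifier cannot separate them on those properties alone.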

Supplementary Figure 10 Overview of Epigram.

(a) A general overview of the Epigram workflow. (b) On the left, 9-mers that differ from the seed (shown as the top 9-mer) at one or two positions are aligned. Differing positions are highlighted in red. Immediately to the right of the 9-mers are their weight scores (W). To the right of the arrow, a PWM is produced from the alignment. When making the PWM, the 9-mers are weighted by their W scores. (c) The alignment shown in (b) is expanded by one position.
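Panel (b) describes building a PWM from aligned 9-mers weighted by their W scores. The sketch below shows one simple way to do that, as weighted column-wise base frequencies. The function names, the example 9-mers, and the weights are hypothetical, and the absence of pseudocounts is an assumption; how Epigram computes W is not stated in this legend.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def weighted_pwm(kmers, weights, alphabet="ACGT"):
    """Column-wise weighted base frequencies from aligned, equal-length k-mers."""
    length = len(kmers[0])
    columns = [{base: 0.0 for base in alphabet} for _ in range(length)]
    for kmer, w in zip(kmers, weights):
        for pos, base in enumerate(kmer):
            columns[pos][base] += w  # each k-mer contributes its weight, not a count of 1
    total = sum(weights)
    return [{base: col[base] / total for base in alphabet} for col in columns]

seed = "ACGTACGTA"
candidates = ["ACGTACGTA", "TCGTACGTA", "ACGTACGCC", "TTTTACGTA"]
# Keep only 9-mers within two mismatches of the seed, as in the legend
aligned = [k for k in candidates if hamming(seed, k) <= 2]
w_scores = [3.0, 1.0, 1.0]  # hypothetical W scores, one per aligned 9-mer
pwm = weighted_pwm(aligned, w_scores)
print(pwm[0])  # {'A': 0.8, 'C': 0.0, 'G': 0.0, 'T': 0.2}
```

Weighting means that a 9-mer with a high W score pulls each column toward its own bases more strongly than a low-scoring variant does.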

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–10 and Supplementary Note (PDF 4459 kb)

Supplementary Data 1

Comparison of YY1 ChIP-seq peaks to the ChIP-seq peaks of other factors (XLSX 14 kb)

Supplementary Data 2

A list of datasets that were used in this study (XLSX 16 kb)

Supplementary Data 3

The motif with the greatest information content from each of the 589 groups (meme format) (TXT 170 kb)

Supplementary Data 4

Annotation of the de novo motif groups with known motifs (XLSX 75 kb)

About this article

Cite this article

Whitaker, J., Chen, Z. & Wang, W. Predicting the human epigenome from DNA motifs. Nat Methods 12, 265–272 (2015). https://doi.org/10.1038/nmeth.3065

  • DOI: https://doi.org/10.1038/nmeth.3065
