Nucleic Acids Res. 2013 Jan;41(2):827-41.
doi: 10.1093/nar/gks1284. Epub 2012 Dec 5.

Integrative annotation of chromatin elements from ENCODE data

Michael M Hoffman et al. Nucleic Acids Res. 2013 Jan.

Abstract

The ENCODE Project has generated a wealth of experimental information mapping diverse chromatin properties in several human cell lines. Although each such data track is independently informative toward the annotation of regulatory elements, their interrelations contain much richer information for the systematic annotation of regulatory elements. To uncover these interrelations and to generate an interpretable summary of the massive datasets of the ENCODE Project, we apply unsupervised learning methodologies, converting dozens of chromatin datasets into discrete annotation maps of regulatory regions and other chromatin elements across the human genome. These methods rediscover and summarize diverse aspects of chromatin architecture, elucidate the interplay between chromatin activity and RNA transcription, and reveal that a large proportion of the genome lies in a quiescent state, even across multiple cell types. The resulting annotations of non-coding regulatory elements correlate strongly with mammalian evolutionary constraint and provide an unbiased approach for evaluating metrics of evolutionary constraint in humans. Lastly, we use the regulatory annotations to revisit previously uncharacterized disease-associated loci, resulting in focused, testable hypotheses through the lens of the chromatin landscape.

Figures

Figure 1.
Enrichment of various segment labels (vertically, labeled by green panels) from (A) Segway and (B) ChromHMM K562 segmentations over positions on an idealized p300 binding site, gene, CTCF binding site, and LaminB1 binding site. We calculated enrichment as the base-2 logarithm of the observed frequency of a label at a particular position along an annotation divided by the expected frequency of the label from its prevalence in the genome overall. Enriched positions are shown in red, and depleted positions are shown in blue. The labels for idealized gene components at the top include the mean length of that component in parentheses. (C) Heat map of parameters from Segway training for 14 GM12878 signal tracks against 25 segment labels. Color indicates the mean of a Gaussian according to the color bar on the right. (D) Heat map of parameters from ChromHMM concatenated training on 84 signal tracks from 6 ENCODE Tier 1–2 cell types. Color indicates the probability of a present mark, as a percentage, according to the color bar on the right.
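The enrichment measure described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, its parameters, and the choice to return negative infinity for a label that is never observed are all assumptions made for the example.

```python
import math


def log2_enrichment(label_hits, position_total, label_bases, genome_bases):
    """Base-2 log of observed over expected label frequency.

    All argument names are hypothetical and chosen for this sketch:
      label_hits      number of annotation instances carrying the label at this position
      position_total  number of annotation instances covering this position
      label_bases     bases assigned to the label genome-wide
      genome_bases    total bases in the segmented genome
    """
    observed = label_hits / position_total
    expected = label_bases / genome_bases
    if observed == 0:
        # Assumption: a label never seen at this position is treated as
        # maximally depleted rather than raising a math domain error.
        return float("-inf")
    return math.log2(observed / expected)


# A label seen in 40% of instances at a position but covering 10% of the
# genome is enriched; one seen in 5% but covering 20% is depleted.
enriched = log2_enrichment(40, 100, 10, 100)
depleted = log2_enrichment(5, 100, 20, 100)
```

Positive values correspond to the red (enriched) cells in the figure and negative values to the blue (depleted) cells.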
Figure 2.
View of the ENC1 locus on the minus strand using the ENCODE GM12878 segmentations. The unusual state pattern in the middle of the gene in all three segmentations reveals a potential intronic regulatory element, which is confirmed by H3K4me1, H3K27ac, DNaseI hypersensitivity and transcription factor binding, and overlaps a putative GENCODE processed transcript.
Figure 3.
(A) Enrichment or depletion of GWAS SNPs (and several comparison SNP sets) in function-associated segments. The bars extend to the level of enrichment or depletion of each SNP set in the 25 segmentation classes from Segway (top) and ChromHMM (bottom) in GM12878. The results for 1000 random samplings of the SNPs matched to the phenotype-associated SNPs are displayed as a box plot, with the box extending from the 25th to the 75th percentiles, the whiskers extending to 1.5 times the interquartile range, and any outliers beyond shown as circles. If the enrichment for the phenotype-associated GWAS lead SNPs exceeded the 95th percentile of the results from the matched SNPs, then the bar is colored red (orange otherwise). (B) An example of Crohn’s disease SNPs in non-coding sequences that could serve to regulate expression of NOD2. The figures show gene models from the GENCODE group (version 12), locations of SNPs associated with Crohn’s disease by GWAS, results of ChromHMM and Segway segmentations, selected histone modifications measured in GM12878 and HUVEC cells, locations of DNase hypersensitive sites in several cell types, and sites of occupancy by selected transcription factors. Regions discussed in the text are outlined by blue rectangles.
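The red/orange coloring rule in Figure 3A compares one observed enrichment against the distribution obtained from matched random SNP samplings. The sketch below shows that comparison under stated assumptions: the function names are hypothetical, and the nearest-rank percentile definition is chosen for simplicity because the caption does not say which percentile method was used.

```python
def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of values using the nearest-rank rule.

    Assumption: pct is an integer from 1 to 100, so the rank can be
    computed with exact integer arithmetic (ceiling of pct * n / 100).
    """
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def bar_color(observed, matched_enrichments, pct=95):
    """Color a bar red when the observed enrichment exceeds the chosen
    percentile of the matched-sample enrichments, orange otherwise."""
    threshold = percentile_nearest_rank(matched_enrichments, pct)
    return "red" if observed > threshold else "orange"


# Stand-in null distribution: 1000 made-up values, one per random sampling.
null_values = list(range(1, 1001))
strong = bar_color(951, null_values)
weak = bar_color(950, null_values)
```

With 1000 samplings, the nearest-rank 95th percentile is the 950th smallest matched value, so an observed enrichment must be strictly larger than that value to be drawn in red.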
Figure 4.
Distribution of various classes of transcripts in the segmentations. Enrichment (red) or depletion (blue) of RNA-seq transcript categories (‘biotypes’) in each state for two 25-state segmentations: (A) Segway GM12878 and (B) ChromHMM GM12878. White cells indicate an absence of an RNA biotype in the corresponding state. Distribution of expression levels in segmentation states. The level of expression of each protein-coding RNA-seq contig intersecting a protein-coding gene in each state for (C) Segway GM12878 and (D) ChromHMM GM12878 was extracted from the data in Djebali et al. (29). The distribution of those values for all RNA contigs in the DNA segments for each state is shown as a box plot.
Figure 5.
(A) Average log2 enrichment or depletion of four different conserved element sets—PhastCons (33), SiPhy-ω, SiPhy-π (16,35), and GERP (34)—for the 25 ChromHMM states averaged across all 6 cell types. (B) The same comparison for Segway states, but restricted to the K562 segmentation.

References

    1. ENCODE Project Consortium. A user's guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol. 2011;9:e1001046.
    2. Birney E, Clamp M, Durbin R. GeneWise and Genomewise. Genome Res. 2004;14:988–995.
    3. Schweikert G, Zien A, Zeller G, Behr J, Dieterich C, Ong CS, Philips P, De Bona F, Hartmann L, Bohlen A, et al. mGene: accurate SVM-based gene finding with an application to nematode genomes. Genome Res. 2009;19:2133–2143.
    4. Abeel T, Van de Peer Y, Saeys Y. Toward a gold standard for promoter prediction evaluation. Bioinformatics. 2009;25:i313–i320.
    5. Yip KY, Cheng C, Bhardwaj N, Brown JB, Leng J, Kundaje A, Rozowsky J, Birney E, Bickel P, Snyder M, et al. Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors. Genome Biol. 2012;13:R48.
