PLoS Comput Biol. 2012;8(7):e1002613.
doi: 10.1371/journal.pcbi.1002613. Epub 2012 Jul 26.

Genome-wide localization of protein-DNA binding and histone modification by a Bayesian change-point method with ChIP-seq data

Haipeng Xing et al. PLoS Comput Biol. 2012.

Abstract

Next-generation sequencing (NGS) technologies have matured considerably since their introduction and a focus has been placed on developing sophisticated analytical tools to deal with the amassing volumes of data. Chromatin immunoprecipitation sequencing (ChIP-seq), a major application of NGS, is a widely adopted technique for examining protein-DNA interactions and is commonly used to investigate epigenetic signatures of diffuse histone marks. These datasets have notoriously high variance and subtle levels of enrichment across large expanses, making enriched regions exceedingly difficult to define. Window-based heuristic models and finite-state hidden Markov models (HMMs) have been used with some success in analyzing ChIP-seq data but with lingering limitations. To improve the ability to detect broad regions of enrichment, we developed a stochastic Bayesian Change-Point (BCP) method, which addresses some of these unresolved issues. BCP makes use of recent advances in infinite-state HMMs by obtaining explicit formulas for posterior means of read densities. These posterior means can be used to categorize the genome into enriched and unenriched segments, as is customarily done, or examined for more detailed relationships since the underlying subpeaks are preserved rather than simplified into a binary classification. BCP performs a near-exhaustive search of all possible change points between different posterior means at high resolution to minimize the subjectivity of window sizes and is computationally efficient, due to a speed-up algorithm and the explicit formulas it employs. In the absence of a well-established "gold standard" for diffuse histone mark enrichment, we corroborated BCP's island detection accuracy and reproducibility using various forms of empirical evidence. We show that BCP is especially suited for analysis of diffuse histone ChIP-seq data but is also effective in analyzing punctate transcription factor ChIP datasets, making it widely applicable for numerous experiment types.
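
The final segmentation step mentioned in the abstract, turning posterior means into enriched and unenriched segments, can be illustrated with a minimal sketch. It is not the BCP change-point inference itself: it assumes the posterior means have already been estimated, one value per fixed-size bin, and the function name `call_islands`, the bin size, and the threshold value are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def call_islands(
    posterior_means: List[float],
    threshold: float,
    bin_size: int = 200,  # hypothetical bin width in base pairs
) -> List[Tuple[int, int]]:
    """Merge consecutive bins whose posterior mean exceeds the threshold.

    Returns half-open (start, end) coordinates in base pairs.
    """
    islands: List[Tuple[int, int]] = []
    start = None  # start coordinate of the island currently being extended
    for i, value in enumerate(posterior_means):
        if value > threshold:
            if start is None:
                start = i * bin_size
        elif start is not None:
            # The enriched run ended at the previous bin; close the island.
            islands.append((start, i * bin_size))
            start = None
    if start is not None:
        # The last bin was enriched, so close the island at the end of the data.
        islands.append((start, len(posterior_means) * bin_size))
    return islands


# Made-up posterior means for eight bins.
means = [0.5, 0.4, 3.1, 2.8, 3.5, 0.6, 0.5, 4.0]
print(call_islands(means, threshold=1.0))
```

Because the posterior means themselves are kept, the same input can also be inspected for subpeaks inside an island rather than only for the binary call.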

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. H3K27me3 and H3K36me3 diffuse histone marks.
ChIP-seq was used to identify regions of enrichment based on read density profiles, visualized here in the UCSC genome browser (http://genome.ucsc.edu/). The enriched islands identified by BCP (blue) and SICER (red) are indicated. Additionally, posterior mean estimates used in BCP island detection are shown along with a line (orange) illustrating how thresholds are used to segment the signal. The correlation between H3K36me3 and gene bodies (outlined in green) and the mutual exclusivity of H3K27me3 and H3K36me3 were evident. The signal fluctuations caused by the highly variable read densities common to ChIP-seq data of diffuse marks are one of the notable difficulties for standard peak-calling algorithms, causing them to fragment the broader regions of enrichment into smaller, discontiguous peaks.
Figure 2. The distance from H3K36me3 island boundaries to the nearest gene boundary was used as a measure of accuracy.
H3K36me3 islands have been shown to correspond to actively transcribed gene bodies, so we expected the boundaries of islands and genes to coincide. The sum of the distances from the upstream and downstream island boundaries to the nearest gene boundaries was used as a per-island error and is illustrated in the histogram for BCP (left) and SICER (right).
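
The per-island error described in this caption can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes gene start and end coordinates are pooled into one set of boundaries, ignores strand, and uses the hypothetical names `nearest_distance` and `island_error`.

```python
from bisect import bisect_left
from typing import List, Tuple


def nearest_distance(position: int, sorted_boundaries: List[int]) -> int:
    """Distance from position to the closest value in a non-empty sorted list."""
    i = bisect_left(sorted_boundaries, position)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_boundaries[max(i - 1, 0): i + 1]
    return min(abs(position - b) for b in candidates)


def island_error(island: Tuple[int, int], genes: List[Tuple[int, int]]) -> int:
    """Sum of distances from both island boundaries to the nearest gene boundary.

    Assumption: gene starts and ends are treated alike and strand is ignored.
    """
    boundaries = sorted({b for gene in genes for b in gene})
    start, end = island
    return nearest_distance(start, boundaries) + nearest_distance(end, boundaries)


# Made-up coordinates: the island sits 200 bp inside the gene start
# and 300 bp inside the gene end.
genes = [(1000, 5000), (8000, 12000)]
print(island_error((1200, 4700), genes))
```

An island whose boundaries match a gene exactly scores 0, so a histogram of these values concentrated near zero indicates accurate boundaries.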
Figure 3. BCP was robust, providing consistent results in replicate and at various coverage depths.
Using a second H3K36me3 dataset and sub-samplings of the full replicate-one dataset (30–90% randomly selected reads), we evaluated the reproducibility of BCP island calls. A) Enriched regions coinciding with gene coordinates were captured by the large, contiguous BCP islands (blue), while SICER islands (red) were more fractionated. B) We quantified the reproducible fraction of the full dataset results versus the sub-samples (the number of full dataset island bases covered by a replicate/sub-sample island divided by total bases in full dataset islands, averaged across all islands) and vice versa. Also, we computed the fraction of island base pairs overlapping genic and intergenic regions (number of island bases covered by genic/intergenic annotation divided by total bases in the island, averaged across all islands).
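
The coverage fractions in panel B can be sketched as a per-island ratio averaged over islands. The sketch below assumes half-open intervals on a single chromosome and assumes the covering intervals do not overlap one another (overlapping annotations would need merging first); the function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end) in base pairs


def covered_bases(island: Interval, others: List[Interval]) -> int:
    """Number of bases in island covered by the intervals in others.

    Assumption: the intervals in others do not overlap each other.
    """
    start, end = island
    return sum(
        max(0, min(end, o_end) - max(start, o_start))
        for o_start, o_end in others
    )


def mean_covered_fraction(islands: List[Interval], others: List[Interval]) -> float:
    """Per-island covered fraction, averaged across all islands."""
    fractions = [
        covered_bases(island, others) / (island[1] - island[0])
        for island in islands
    ]
    return sum(fractions) / len(fractions)


# Made-up island calls from a full dataset and from one sub-sample.
full = [(0, 1000), (2000, 3000)]
sub = [(0, 500), (2500, 3500)]
print(mean_covered_fraction(full, sub))  # full islands covered by sub-sample islands
print(mean_covered_fraction(sub, full))  # the "vice versa" direction
```

Passing gene or intergenic annotation as `others` gives the genic and intergenic fractions with the same function.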
Figure 4. BCP dynamically adapted to many different types of data.
To demonstrate its versatility, we compiled a set of several histone modifications and analyzed each under the default parameters for BCP and SICER. Regardless of the histone mark characteristics, whether more punctate as in acetylation marks and H3K4me3 or broad as in H3K27me3, H3K36me3, and H3K9me3, BCP (black) was able to make reasonable island calls that effectively described the underlying read profiles. SICER (grey) seemed more primed to identify smaller, sharper islands and so often fragmented broader regions of enrichment.
Figure 5. BCP showed strong performance in punctate transcription factor ChIP-seq data.
Compared to MACS, a representative peak-calling algorithm designed for punctate peak detection, BCP showed a comparable false-discovery rate (FDR) and rate of motif occurrence in both CTCF and STAT1 datasets. We applied the empirical FDR described in the Methods, dividing the number of negative peaks (detected when the input control sample was set as the test and the ChIP sample was set as the control) by the number of test peaks (detected when the ChIP sample was set as the test and the input control sample was set as the control). Peaks were ranked according to p-value. Additionally, BCP displayed a slightly improved motif occurrence rate (the fraction of peaks containing a match to the TRANSFAC consensus motifs, as determined by STORM).
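
The empirical FDR in this caption is a ratio of peak counts from a swapped and an unswapped run. A minimal sketch follows, assuming both runs report a p-value per peak and that the ratio is evaluated at a common p-value cutoff; the exact procedure is given in the paper's Methods and is not reproduced here. The function name, the cutoff, and the p-values are hypothetical.

```python
from typing import List


def empirical_fdr(
    test_pvalues: List[float],
    negative_pvalues: List[float],
    cutoff: float,
) -> float:
    """Negative peaks divided by test peaks, both counted at the same cutoff.

    test_pvalues: peaks called with ChIP as the test and input as the control.
    negative_pvalues: peaks called with the two samples swapped.
    """
    n_test = sum(p <= cutoff for p in test_pvalues)
    n_negative = sum(p <= cutoff for p in negative_pvalues)
    if n_test == 0:
        raise ValueError("no test peaks pass the cutoff")
    return n_negative / n_test


# Made-up p-values: four test peaks and one negative peak pass 1e-4.
test = [1e-10, 1e-8, 1e-6, 1e-4, 1e-3]
negative = [1e-5, 1e-3]
print(empirical_fdr(test, negative, cutoff=1e-4))
```

Sweeping the cutoff down the p-value ranking gives the FDR as a function of the number of peaks retained.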
