Nucleic Acids Res. 2023 Feb 28;51(4):1608-1624.
doi: 10.1093/nar/gkac1272.

Predictive model of transcriptional elongation control identifies trans regulatory factors from chromatin signatures

Toray S Akcan et al. Nucleic Acids Res.

Abstract

Promoter-proximal Polymerase II (Pol II) pausing is a key rate-limiting step for gene expression. DNA- and RNA-binding trans-acting factors that regulate the extent of pausing have been identified. However, we lack a quantitative model of how interactions among these factors determine pausing, so the relative importance of the implicated factors is unknown. Moreover, previously unknown regulators might exist. Here we address this gap with a machine learning model that accurately predicts the extent of promoter-proximal Pol II pausing from large-scale genome and transcriptome binding maps together with gene annotation and sequence composition features. We demonstrate the high accuracy and generalizability of the model by validating it on an independent cell line, which reveals the model's cell-line-agnostic character. Model interpretation in light of prior knowledge about the molecular functions of regulatory factors confirms the interconnection of pausing with other RNA processing steps. Using the underlying feature contributions, we assess the relative importance of each factor, quantify its predictive effect and systematically identify previously unknown regulators of pausing. We additionally identify 16 previously unknown 7SK ncRNA-interacting RNA-binding proteins predictive of pausing. Our work provides a framework to further our understanding of the regulation of the critical early steps in transcriptional elongation.

Figures

Graphical Abstract
Predictive modeling of transcriptional elongation control based on large-scale protein binding maps, gene annotation and sequence composition features in multiple cell lines, followed by model interpretation, reveals trans regulatory factors.
Figure 1.
(A) Central question: which specific factors are implicated in the transition of promoter-proximally paused Polymerase II into its elongating phase of nascent RNA synthesis? (B) Integration of large-scale genomic data sets to build the chromatin context of transcriptional pausing (A) from protein binding events, gene annotation and sequence composition features for the task of predicting promoter-proximal Pol II pausing. Pausing is quantified by relating GRO-seq read densities at the TSS to GRO-seq read densities in the gene body. (C) Machine learning approach to predict promoter-proximal Pol II pausing from chromatin signatures (B), followed by the integration of prior knowledge and the selection of factors as regulators of promoter-proximal Pol II pausing.
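The pausing quantification described in (B) can be sketched in a few lines. This is a minimal illustration only, assuming that "relating" the two densities means taking the log2 ratio of reads per base pair in a TSS-proximal window to reads per base pair in the gene body. The function name, the pseudocount and the window lengths are hypothetical and are not taken from the paper.

```python
import math


def pausing_index(tss_reads, tss_length, body_reads, body_length, pseudocount=1.0):
    """Return log2(TSS read density / gene-body read density).

    Densities are reads per base pair. The pseudocount (an assumption of
    this sketch) keeps the ratio defined when a window has zero reads.
    """
    if tss_length <= 0 or body_length <= 0:
        raise ValueError("window lengths must be positive")
    tss_density = (tss_reads + pseudocount) / tss_length
    body_density = (body_reads + pseudocount) / body_length
    return math.log2(tss_density / body_density)


# Hypothetical counts: 199 reads in a 100 bp TSS window,
# 999 reads in a 4000 bp gene body.
example = pausing_index(199, 100, 999, 4000)
```

A positive value indicates a higher read density near the TSS than in the gene body, which is the signature of a paused polymerase under this assumed definition.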
Figure 2.
(A) Observed versus predicted pausing indices (log2 scale) of a 5-fold cross-validated and regularized XGB regression model in the K562 cell line, applied to an independent 50% hold-out test dataset from the same cell line that was set aside prior to training. Pearson's correlation coefficient (ρ) with the associated P-value is shown in the upper left. (B) Observed versus predicted pausing indices of a 5-fold cross-validated and regularized XGB regression model in the K562 cell line, applied to the independent test dataset from the cross cell line (HepG2). The model was trained with features common to both cell lines. Pearson's correlation coefficient (ρ) with the associated P-value is shown in the upper left. (C) Venn diagram of transcripts between cell lines. (D) Observed versus predicted pausing indices of a 5-fold cross-validated and regularized XGB regression model from each cell line, applied to data of genes exclusively expressed in the cross cell line. Pearson's correlation coefficients (ρ) with the associated P-values are shown in the upper left. (E) Observed pausing indices in the K562 versus the HepG2 cell line. Transcripts with at least a 2-fold higher pausing index in one cell line than in the other are colored either green (HepG2-specific transcripts) or blue (K562-specific transcripts). Transcripts with similar pausing indices (less than a 2-fold change) in both cell lines, and thus not specific to either cell line, are colored orange. Pearson's correlation coefficients (ρ) for each group with the associated P-values are shown in the upper left. (F) Observed pausing index differences between cell lines plotted against differences of predicted pausing indices obtained from models trained in each cell line and applied to data from the cross cell line. Models were trained on features common to both cell lines. Differences are shown for genes with at least a 2-fold change between cell lines, as identified in (E).
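Two quantities used throughout this figure are simple enough to sketch: the Pearson correlation between observed and predicted pausing indices, and the 2-fold rule that labels a transcript as cell-line specific in (E). The sketch below uses only the Python standard library. It assumes pausing indices are given on the log2 scale, so a 2-fold change corresponds to a difference of 1. The function names and category labels are hypothetical, and the P-value computation is omitted.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(ss_x * ss_y)


def classify_transcript(log2_pi_k562, log2_pi_hepg2, log2_threshold=1.0):
    """Label a transcript by its log2 pausing-index difference.

    A difference of at least log2_threshold (1.0 = 2-fold) in either
    direction marks the transcript as specific to that cell line.
    """
    diff = log2_pi_k562 - log2_pi_hepg2
    if diff >= log2_threshold:
        return "K562-specific"
    if diff <= -log2_threshold:
        return "HepG2-specific"
    return "shared"
```

For example, `classify_transcript(3.0, 1.5)` falls in the K562-specific group because the log2 difference of 1.5 exceeds the threshold of 1.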
Figure 3.
(A) Individual feature contributions (SHAP feature contributions, y-axis) for each transcript (x-axis), with a zoom-in on a subset of transcripts for closer visual inspection. Only the five most influential features are colored; the remaining features are aggregated in ‘Other’ (see legend). Feature ‘ChIP RBFOX2 5′’ refers to the binary indicator variable for the presence of an RBFOX2 binding site, determined by ChIP-seq, in the 5′ region of the transcript (see Methods section on feature engineering). The other ChIP-seq data sets are labeled analogously. (B) Aggregate absolute contributions of factor classes based on prior knowledge, further divided into sequence-specific and non-sequence-specific binding factors. The process ‘Processing’ refers to mRNA polyadenylation and export from the nucleus. The number of factors is given beside each bar; only factors with non-zero contributions were counted. (C) R² performance of individual models of factor classes based on prior knowledge on the 50% hold-out test dataset. The number of factors associated with each functional process is given beside each bar irrespective of contribution score, i.e. the same factor sets as in (B), which in turn shows only factors with contributions >0. (D) Aggregate absolute contributions of factors based on their binding modes.
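The aggregation behind panels (B) and (D) can be illustrated as follows. This is a sketch under stated assumptions: per-transcript SHAP values are taken as already computed and supplied as one dictionary per transcript, and "aggregate absolute contribution" is read as the sum of absolute SHAP values over all transcripts and all features in a class (the paper may normalize differently). The function name, the fallback class ‘Other’ and the feature-to-class mapping are hypothetical.

```python
from collections import defaultdict


def aggregate_by_class(shap_rows, feature_class):
    """Sum absolute SHAP values per factor class.

    shap_rows:     list of {feature name: SHAP value}, one dict per transcript.
    feature_class: {feature name: class label}; unmapped features are
                   counted under "Other" (an assumption of this sketch).
    """
    totals = defaultdict(float)
    for row in shap_rows:
        for feature, value in row.items():
            totals[feature_class.get(feature, "Other")] += abs(value)
    return dict(totals)


# Illustrative values only; the class assignment is not from the paper.
rows = [
    {"ChIP RBFOX2 5′": 0.5, "feature_x": -0.25},
    {"ChIP RBFOX2 5′": -0.5, "feature_x": 0.25, "feature_y": 1.0},
]
classes = {"ChIP RBFOX2 5′": "Splicing", "feature_x": "Splicing"}
by_class = aggregate_by_class(rows, classes)
```

Taking absolute values before summing prevents positive and negative contributions of the same factor from cancelling, so the total reflects influence on the prediction rather than its direction.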
Figure 4.
(A) Aggregate contributions, in increasing order, of the factors that together make up at least 50% of model contributions. Established pausing/elongation factors are colored red. The bar fill colors identify DNA-binding (DBP; dark red), RNA-binding (RBP; orange), or DNA- and RNA-binding (DBP/RBP; grey) factors. (B) A conceptual view of the interconnection and interplay of the identified transcriptional pause regulatory proteins with associated transcriptional regulatory processes (Chromatin Remodelling, Transcription Activation/Repression, Transcriptional Pausing, R-Loop resolution and Splicing).
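The selection shown in panel (A) can be sketched as a cumulative cut-off. The sketch assumes that the displayed factors are the smallest set of top-ranked factors whose aggregate contributions reach at least 50% of the total, and that contributions are non-negative aggregate values; the function name and the factor names in the example are hypothetical.

```python
def top_factors(contributions, fraction=0.5):
    """Return the top-ranked factors covering at least `fraction` of the total.

    contributions: {factor name: non-negative aggregate contribution}.
    The result is a list of (name, value) pairs in increasing order of
    contribution, matching the ordering described for the panel.
    """
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("total contribution must be positive")
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    selected = []
    running = 0.0
    for name, value in ranked:
        selected.append((name, value))
        running += value
        if running >= fraction * total:
            break
    return selected[::-1]


# Illustrative values only.
example = top_factors({"factor_A": 4.0, "factor_B": 3.0, "factor_C": 2.0, "factor_D": 1.0})
```

In the example the total is 10, so the two largest factors (4 + 3 = 7) are enough to pass the 50% mark and the remaining two are dropped.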
