Plant J. 2022 Oct;112(2):583-596.
doi: 10.1111/tpj.15957. Epub 2022 Oct 2.

Simple and accurate transcriptional start site identification using Smar2C2 and examination of conserved promoter features

Andrew Murray et al. Plant J. 2022 Oct.

Abstract

The precise and accurate identification and quantification of transcriptional start sites (TSSs) is key to understanding the control of transcription. The core promoter consists of the TSS and proximal non-coding sequences, which are critical in transcriptional regulation. Therefore, the accurate identification of TSSs is important for understanding the molecular regulation of transcription. Existing protocols for TSS identification are challenging and expensive, leaving high-quality data available for a small subset of organisms. This sparsity of data impairs study of TSS usage across tissues or in an evolutionary context. To address these shortcomings, we developed Smart-Seq2 Rolling Circle to Concatemeric Consensus (Smar2C2), which identifies and quantifies TSSs and transcription termination sites. Smar2C2 incorporates unique molecular identifiers that allowed for the identification of as many as 70 million sites, with no known upper limit. We have also generated TSS data sets from as little as 40 pg of total RNA, which was the smallest input tested. In this study, we used Smar2C2 to identify TSSs in Glycine max (soybean), Oryza sativa (rice), Sorghum bicolor (sorghum), Triticum aestivum (wheat) and Zea mays (maize) across multiple tissues. This wide panel of plant TSSs facilitated the identification of evolutionarily conserved features, such as novel patterns in the dinucleotides that compose the initiator element (Inr), that correlated with promoter expression levels across all species examined. We also discovered sequence variations in known promoter motifs that are positioned reliably close to the TSS, such as differences in the TATA box and in the Inr that may prove significant to our understanding and control of transcription initiation. Smar2C2 allows for the easy study of these critical sequences, providing a tool to facilitate discovery.

Keywords: cis-regulatory elements; promoter; rolling circle amplification; technical advance; template switching reverse transcriptase; transcription start site.

Conflict of interest statement

The authors declare that they have no conflicts of interest associated with this work.

Figures

Figure 1
Design overview of Smar2C2. (a) cDNA is generated with a template‐switching reverse transcriptase using extracted RNA (light green) and a poly‐dT primer (light blue) with an adaptor (orange) (b). (c) A template‐switching oligo containing an adaptor (blue) and unique molecular identifier (UMI) (purple) is bound to the deposited cytosines and used to add the second adaptor and UMI to the cDNA (d). The final construct (e) is circularized using a linker (dark green) (f) and amplified using rolling circle amplification (g). Rolling circle amplification generates a linear strand of repeating segments (h), and Tn5 is used to generate a final library for sequencing (i). This places the transcription start site (TSS), identifying adaptor and UMI in variable locations within the read (j), allowing for them to be sequenced and extracted bioinformatically.
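Because the adaptor and UMI land at variable positions within a read, the bioinformatic extraction step amounts to locating the adaptor and reading the bases that follow it. The following is a minimal sketch of that idea, not the authors' pipeline. The adaptor sequence, the UMI length, the `GGG` linker left by template switching, and the function name `extract_umi_and_start` are all assumptions made for illustration.

```python
from typing import Optional, Tuple

# Hypothetical values for illustration only; the actual Smar2C2 adaptor
# sequence and UMI length are not stated in this text.
ADAPTOR = "GATTACAGATTACA"
UMI_LEN = 8
LINKER = "GGG"  # assumed guanines pairing with the deposited cytosines


def extract_umi_and_start(read: str) -> Optional[Tuple[str, str]]:
    """Return (umi, sequence starting at the transcript 5' end), or None.

    Assumes the read layout ...ADAPTOR + UMI + LINKER + transcript...
    with the adaptor at an arbitrary offset within the read.
    """
    read = read.upper()
    start = read.find(ADAPTOR)
    if start == -1:
        return None  # adaptor not present in this read
    umi_start = start + len(ADAPTOR)
    umi_end = umi_start + UMI_LEN
    linker_end = umi_end + len(LINKER)
    if linker_end > len(read):
        return None  # read ends before the full UMI and linker
    if read[umi_end:linker_end] != LINKER:
        return None  # layout does not match the assumed structure
    return read[umi_start:umi_end], read[linker_end:]


if __name__ == "__main__":
    example = "TTTT" + ADAPTOR + "ACGTACGT" + LINKER + "CATGCATG"
    print(extract_umi_and_start(example))
```

In practice the first base after the linker would be aligned to the genome to give the TSS coordinate, and reads sharing a UMI and coordinate would be collapsed before counting.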
Figure 2
Smar2C2 overlap with CAGE. (a) Browser tracks showing the overlap between Smar2C2 and existing CAGE data in Zea mays (maize) at a single base‐pair resolution relative to existing annotations. The box indicates a magnified image of the browser track to highlight single base‐pair resolution. (b) A heat map of CAGE reads and Smar2C2 reads centered on the genic transcription start site (TSS) identified by Smar2C2 shows that when CAGE reads are present at a Smar2C2 TSS they show a high degree of precise and concentrated overlap. (c) A comparison of the location of TSS reads in CAGE and Smar2C2 using the same processing pipeline.
Figure 3
Validation of Smar2C2 TSS using epigenomic data. (a) The expected orientation of epigenomic data relative to the transcription start site (TSS) in plants with accessible chromatin identified via ATAC‐seq (gray) present upstream in the promoter, histone modifications of transcription initiation H3K56ac (pink) and H3K4me3 (purple) directly downstream, and histone modifications of transcription elongation H3K36me3 (red) and H3K4me1 (burgundy) further downstream in the gene body. (b) Heat maps of epigenomic and CAGE data centered on the primary TSS identified with Smar2C2 show that these histone modification patterns are consistent across the entire genome, with genes ranked by the volume of ChIP data.
Figure 4
Smar2C2 transcription start site (TSS) compared against ChIP‐corrected annotation. Previous work has established that genome annotations can be improved by using histone modification ChIP‐seq data to predict the location of TSSs within the genome, which might differ significantly from the existing annotation. Smar2C2 TSSs can be used to corroborate the corrected annotations. An example browser shot of the existing annotation, the corrected annotation predicted via ChIP‐seq data and the relevant ChIP‐seq data tracks is shown here. The triangles represent the single base‐pair TSS, as determined by Smar2C2, whereas the red arrows indicate the direction of transcription.
Figure 5
TATA‐box motifs identified by Smar2C2. Sequence logos show sequence patterns surrounding the transcription start site (TSS), including possible initiator element (Inr) and TATA‐box sequences (a). The TATA‐box motifs discovered close to the TSS display precise positional enrichment relative to the TSS (c), and are found in the classic region from positions −28 to −35 of the promoter (c, d). TATA‐box motifs discovered in the highest expression decile show some sequence deviation from the classic TATA‐box motif, with Oryza sativa (rice), Sorghum bicolor (sorghum) and Triticum aestivum (wheat) displaying enrichment for a C preceding the more classic TA‐rich motif (b).
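The positional enrichment described above can be checked for a single promoter by asking whether a TATA-like match starts within the −35 to −28 window. The sketch below is illustrative only: the pattern `TATA[AT]A` is a simplified stand-in for a real motif model, anchoring on the start of the match is an assumption, and `tata_positions` is a hypothetical helper rather than part of any published tool.

```python
import re
from typing import List

# Simplified TATA-like pattern (an assumption); the lookahead lets
# overlapping matches be reported.
TATA_PATTERN = re.compile(r"(?=TATA[AT]A)")


def tata_positions(seq: str, tss_index: int, lo: int = -35, hi: int = -28) -> List[int]:
    """Return TSS-relative start positions of TATA-like matches within [lo, hi].

    tss_index is the 0-based index of the first transcribed base (+1).
    The base at index tss_index - k is reported as position -k.
    Only the given strand is scanned.
    """
    hits = []
    for match in TATA_PATTERN.finditer(seq.upper()):
        rel = match.start() - tss_index
        if lo <= rel <= hi:
            hits.append(rel)
    return hits


if __name__ == "__main__":
    # Toy promoter: TATAAA starts 30 bases upstream of the TSS at index 40.
    promoter = "G" * 10 + "TATAAA" + "G" * 23 + "CA" + "G" * 10
    print(tata_positions(promoter, tss_index=40))
```

Running this over every Smar2C2 TSS and tallying the returned positions would give a positional histogram comparable in spirit to the one in the figure.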
Figure 6
Nucleotide trends directly flanking the transcription start site (TSS). (a) A nucleotide heat map in Zea mays (maize) centered on the 10 nucleotides flanking the TSS shows a clear sequence bias, with C/T being more frequent at the −1 nucleotide, directly upstream of the TSS, and A/G being more frequent at the +1 nucleotide, directly downstream of the TSS. This general pattern is more prevalent at higher expression deciles, becoming less apparent as the expression levels decrease. (b) These sequence patterns can also be examined as dinucleotide ratios flanking the TSS. Although the pattern of C/T upstream and A/G downstream is consistent, there is a clear bias towards CA, followed by CG and then closely by TG. These patterns are most apparent at the highest expression deciles and become less pronounced as expression decreases.
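The dinucleotide ratios flanking the TSS can be tallied directly from a sequence and a list of TSS coordinates. This is a minimal sketch under stated assumptions: each TSS is given as the 0-based index of the first transcribed base, all TSSs lie on the forward strand (minus-strand sites would need reverse complementing first), and `inr_dinucleotide_fractions` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, Iterable


def inr_dinucleotide_fractions(seq: str, tss_indices: Iterable[int]) -> Dict[str, float]:
    """Return the fraction of TSSs carrying each flanking dinucleotide.

    The dinucleotide is the base immediately upstream of the TSS followed
    by the first transcribed base. TSSs without an upstream base or
    outside the sequence are skipped.
    """
    s = seq.upper()
    counts: Counter = Counter()
    for t in tss_indices:
        if 1 <= t < len(s):
            counts[s[t - 1 : t + 1]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {dinuc: n / total for dinuc, n in counts.items()}


if __name__ == "__main__":
    toy_seq = "GGCAGGTGGGCATT"
    print(inr_dinucleotide_fractions(toy_seq, [3, 7, 11]))
```

Grouping the TSSs by expression decile before calling the function would reproduce the kind of per-decile comparison the figure describes.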
