Genome Res. 2021 Jan;31(1):88-100.
doi: 10.1101/gr.265173.120. Epub 2020 Dec 21.

Transcript assembly improves expression quantification of transposable elements in single-cell RNA-seq data

Wanqing Shao et al. Genome Res. 2021 Jan.

Abstract

Transposable elements (TEs) are an integral part of the host transcriptome. TE-containing noncoding RNAs (ncRNAs) show considerable tissue specificity and play important roles during development, including stem cell maintenance and cell differentiation. Recent advances in single-cell RNA-seq (scRNA-seq) revolutionized cell type-specific gene expression analysis. However, effective scRNA-seq quantification tools tailored for TEs are lacking, limiting our ability to dissect TE expression dynamics at single-cell resolution. To address this issue, we established a TE expression quantification pipeline that is compatible with scRNA-seq data generated across multiple technology platforms. We constructed TE-containing ncRNA references using bulk RNA-seq data and showed that quantifying TE expression at the transcript level effectively reduces noise. As proof of principle, we applied this strategy to mouse embryonic stem cells and successfully captured the expression profile of endogenous retroviruses in single cells. We further expanded our analysis to scRNA-seq data from early stages of mouse embryogenesis. Our results illustrated the dynamic TE expression at preimplantation stages and revealed 146 TE-containing ncRNA transcripts with substantial tissue specificity during gastrulation and early organogenesis.

Figures

Figure 1.
Counting scRNA-seq signal at individual TEs results in large numbers of false positive candidates. (A) Distribution of mappable reads in 16 bulk RNA-seq and 36 scRNA-seq data sets. Compared to bulk RNA-seq, scRNA-seq data have a higher percentage of reads mapped to TEs. Samples were arranged by study. Data sets used in this figure are summarized in Supplemental Table S1. (PC) Protein-coding exons defined by RefSeq; (TE) transposable elements that do not overlap with protein-coding exons; (Other) other genomic locations; (mESC) mouse embryonic stem cell; (PBMC) human peripheral blood mononuclear cell; (GM12878 and GM12891) human lymphoblastoid cell lines. (B) Number of expressed (counts per million, CPM ≥ 1) protein-coding genes and TEs in mESC bulk RNA-seq and Smart-seq samples. On average, 12,000 protein-coding genes and 6000 TEs were detected in each bulk RNA-seq sample. In contrast, scRNA-seq captured 7000 protein-coding genes and 20,000 TEs per cell. (C) Number of candidates as a function of cell number cutoff. (Cell number cutoff) Minimum number of cells each candidate is expressed in; (expression cutoff) CPM ≥ 1. A cell number cutoff of 10 requires a candidate to have at least 1 CPM in at least 10 cells. Although the majority of protein-coding gene candidates were consistently detected in mESC Smart-seq data, a large number of TE candidates were detected in fewer than 10 cells. (D) Correlation between bulk RNA-seq and averaged scRNA-seq signal at protein-coding genes and TEs (Teichmann laboratory, mESC). Low correlation between bulk RNA-seq and averaged Smart-seq signal was observed at TEs regardless of expression cutoff. (Cell cutoff) Minimum number of cells each candidate is expressed in; (CPM cutoff) minimum CPM value for one candidate to be considered as expressed. Color scale represents the number of candidates. (E) TE-family enrichment analysis using TE candidates identified from mESC bulk RNA-seq and Smart-seq. Enrichment of ERV elements was observed with bulk RNA-seq data, but not in single cells. Smart-seq data of four single cells with different percentages of TE reads and merged Smart-seq data from 10 cells were included.
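The expression and cell number cutoffs described in panels B through D can be illustrated with a small sketch. This is not the authors' code: the function names, the dictionary-based count format, and the toy data are all assumptions made for illustration. It only shows how a candidate is kept when it reaches at least 1 CPM in at least a given number of cells.

```python
def counts_per_million(counts):
    """Scale one cell's raw read counts so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        return {feature: 0.0 for feature in counts}
    return {feature: c * 1_000_000 / total for feature, c in counts.items()}


def candidates_passing_cutoffs(cells, cpm_cutoff=1.0, cell_cutoff=10):
    """Return features with CPM >= cpm_cutoff in at least cell_cutoff cells."""
    n_expressing = {}
    for counts in cells:
        for feature, cpm in counts_per_million(counts).items():
            if cpm >= cpm_cutoff:
                n_expressing[feature] = n_expressing.get(feature, 0) + 1
    return sorted(f for f, n in n_expressing.items() if n >= cell_cutoff)


# Hypothetical toy data: "geneA" is detected in all 12 cells,
# while "TE_1" is detected in only the first 3 cells.
toy_cells = [
    {"geneA": 50, "TE_1": 1 if i < 3 else 0, "other": 949}
    for i in range(12)
]

# With a cell number cutoff of 10, the sporadically detected TE is dropped.
passing = candidates_passing_cutoffs(toy_cells, cpm_cutoff=1.0, cell_cutoff=10)
print(passing)
```

Lowering `cell_cutoff` to 3 in this toy example would let `TE_1` through, which mirrors how the number of TE candidates depends strongly on the chosen cell number cutoff.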
Figure 2.
Transcript assembly improves scRNA-seq TE expression analysis. Data sets used in this figure are summarized in Supplemental Table S1. (A) Flowchart of scRNA-seq TE quantification pipeline. In short, transcript assembly was performed with bulk RNA-seq data, and transcripts that overlap with TEs but not protein-coding exons were used for expression quantification in scRNA-seq data. (B) Transcript assembly using three mESC bulk RNA-seq data sets (Wang laboratory) yielded 692 TE transcripts. Among these TE transcripts, 179 overlap with ncRNAs annotated by RefSeq. (C) FANTOM5 CAGE peaks, ATAC-seq signals, and CpG methylation signals at the promoter region of TE transcripts with RPKM (reads per kilobase per million mapped reads) ≥ 1. (D) Correlation between mESC bulk RNA-seq and averaged Smart-seq (Teichmann laboratory) signals at TE transcripts. Color scale represents the number of candidates. (E) TE-family enrichment analysis using expressed TE transcripts. Enrichment of ERV elements was observed with both bulk RNA-seq and Smart-seq samples. (F) Examples of TE transcripts. Assembled TE transcripts, uniquely mapped reads of mESC bulk RNA-seq, Smart-seq, merged Smart-seq, ATAC-seq, and CpG methylation were included. (Left) A TE transcript that initiates from RLTR16b_MM. This TE transcript overlaps Platr14, a long ncRNA known to affect mESC differentiation-associated genes. (Right) A TE transcript that initiates from RLTRETN_Mm. This transcript is largely composed of TEs and reflects the transcription unit of an ERV.
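The selection rule in panel A (keep assembled transcripts that overlap TEs but not protein-coding exons) can be sketched as a simple interval filter. The excerpt does not say which tools the authors used for this step, so everything below is an assumption for illustration: the function names, the `(chrom, start, end)` half-open interval format, the choice to test overlap exon by exon, and the toy coordinates. A real pipeline would more likely use an interval index or a dedicated genomics tool than this brute-force loop.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def select_te_transcripts(transcripts, te_intervals, pc_exons):
    """Keep transcripts whose exons overlap a TE but no protein-coding exon."""
    selected = []
    for name, exons in transcripts.items():
        hits_te = any(overlaps(e, te) for e in exons for te in te_intervals)
        hits_pc = any(overlaps(e, pc) for e in exons for pc in pc_exons)
        if hits_te and not hits_pc:
            selected.append(name)
    return selected


# Hypothetical toy annotation; coordinates are made up.
transcripts = {
    "tx_te": [("chr1", 100, 200), ("chr1", 300, 400)],        # TE only
    "tx_coding": [("chr1", 1000, 1100), ("chr1", 1200, 1300)],  # TE and coding
    "tx_plain": [("chr2", 50, 150)],                           # neither
}
te_intervals = [("chr1", 150, 350), ("chr1", 1050, 1080)]
pc_exons = [("chr1", 1250, 1400)]

te_transcripts = select_te_transcripts(transcripts, te_intervals, pc_exons)
print(te_transcripts)
```

Only `tx_te` survives here: `tx_coding` touches a TE but also a protein-coding exon, and `tx_plain` touches no TE at all.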
Figure 3.
Dynamic TE expression in mouse preimplantation embryos. (A) Using 37 bulk RNA-seq samples, 5299 TE transcripts were constructed. Of these, 770 TE transcripts overlap with ncRNAs annotated by RefSeq. (B) More than half of all the assembled TE transcripts either initiate from TEs or have >50% of their exons composed of TEs. (C, upper) UMAP of scRNA-seq data from mouse zygote to E6.5 embryos. Cells were colored based on developmental stages. (Lower) Expression of cell type–specific markers. (D) Examples of developmental stage– and tissue-specific TE transcripts. (E) TE transcripts were grouped into six clusters based on their expression pattern across preimplantation stages. (F) TE subfamily enrichment analysis using TE transcripts within each of the six clusters.
Figure 4.
Tissue-specific TE expression during mouse gastrulation and early organogenesis. (A, upper left) Fewer unique molecular identifiers (UMIs) were mapped to TE transcripts than to protein-coding genes. (Upper right) The averaged expression level of TE transcripts across all the cells was lower than that of protein-coding genes. (Lower left) TE transcripts lack the extreme standardized variance observed at protein-coding genes. (Lower right) TE transcripts account for 73 of the top 1000 variable features. (B) UMAP of scRNA-seq data. Cells were colored based on tissue information provided by the original study. (C) Examples of tissue-specific TE transcripts. (D) Normalized expression pattern (center, heatmap) of 146 TE transcripts (columns) across 37 tissues (rows). Transcript length, annotation status (top, bar plot), and TE composition (bottom, bar plot) are shown for each TE transcript. (E) Genome browser view of two TE transcripts with strong tissue enrichment. Assembled TE transcripts, uniquely mapped reads of merged bulk RNA-seq (from the 37 samples that were used for transcript assembly), and scRNA-seq signal for selected tissues are shown. (Left) A TE transcript that initiates from an L2a element; the second exon of this transcript is composed of non-TE sequences. (Right) A TE transcript that is almost exclusively composed of ERV sequences.
