Nucleic Acids Res. 2023 Oct 13;51(18):9764-9784.
doi: 10.1093/nar/gkad671.

ChimeraTE: a pipeline to detect chimeric transcripts derived from genes and transposable elements

Daniel S Oliveira et al. Nucleic Acids Res. 2023.

Abstract

Transposable elements (TEs) produce structural variants and are considered an important source of genetic diversity. Notably, TE-gene fusion transcripts, i.e. chimeric transcripts, have been associated with adaptation in several species. However, the identification of these chimeras remains hindered by the lack of detection tools at a transcriptome-wide scale, and by the reliance on a reference genome, even though different individuals/cells/strains have different TE insertions. Therefore, we developed ChimeraTE, a pipeline that uses paired-end RNA-seq reads to identify chimeric transcripts through two different modes. Mode 1 is the reference-guided approach that employs canonical genome alignment, and Mode 2 identifies chimeras derived from fixed or insertionally polymorphic TEs without any reference genome. We have validated both modes using RNA-seq data from four Drosophila melanogaster wild-type strains. We found that ∼1.12% of all genes generate chimeric transcripts, most of them from TE-exonized sequences. Approximately 23% of all detected chimeras were absent from the reference genome, indicating that TEs belonging to chimeric transcripts may be recent, polymorphic insertions. ChimeraTE is the first pipeline able to automatically uncover chimeric transcripts without a reference genome. Its two running modes can be used to investigate the contribution of TEs to transcriptome plasticity.

Figures

Graphical Abstract
Figure 1.
ChimeraTE Mode 1 (genome-guided) workflow. Round white boxes: input data; square boxes: pipeline step; round gray boxes: thresholds that can be modified. (A) Input data: fasta file with the genome assembly, gtf files with gene and TE annotations, as well as stranded paired-end reads from RNA-seq (fastq). (B) Alignment: RNA-seq alignment to the genome is used to calculate gene expression levels. Genes with FPKM < 1 are removed from downstream analyses. A subsequent list of reads that have aligned against genes or TE insertions is created. (C) Chimeric read detection & filtering: both read lists are then compared and read pairs that have common reads between the two lists are named chimeric reads, i.e. paired-end reads mapping to a gene and a TE copy. The average of these reads between replicates is used as chimeric read coverage for each putative chimeric transcript. All putative chimeras are then processed with three ChimeraTE scripts to categorize them into TE-initiated, TE-exonized, and TE-terminated transcripts. These steps are performed for all RNA-seq replicates. Finally, all chimeric transcripts present in at least 2 replicates and with at least 2 chimeric reads on average between replicates are maintained. (D) Chimeric transcripts: Three predictions obtained from Mode 1. Blue boxes: exons; red boxes: TEs; arrowhead in between TE and exon boxes: transcription sense; blue and red boxes linked by a line: chimeric reads. The ChimeraTE mode 1 output is divided into three predictions: (i) TE-initiated transcript: the TE insertion is located upstream of the gene region; (ii) TE-exonized transcript: the TE insertion is present within exons (embedded), overlapping exons (overlapped), or introns (intronic); (iii) TE-terminated transcript: the TE insertion is located downstream of the gene region.
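The filtering logic described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not ChimeraTE's own code: the function names (`fpkm`, `chimeric_pairs`, `filter_chimeras`), the input layout (one dictionary per read list, keyed by read-pair ID) and the choice to average chimeric reads over all replicates, counting absent replicates as zero, are assumptions made for illustration. The FPKM formula is the standard definition; the thresholds (FPKM < 1 removed, at least 2 replicates, at least 2 chimeric reads on average) are the ones stated in the caption.

```python
from collections import defaultdict


def fpkm(fragments, length_bp, total_fragments):
    """Standard FPKM: fragments per kilobase of feature per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_fragments)


def chimeric_pairs(gene_hits, te_hits):
    """Count read pairs that appear in both the gene list and the TE list.

    gene_hits: dict mapping read-pair ID -> gene ID (hypothetical layout)
    te_hits:   dict mapping read-pair ID -> TE ID   (hypothetical layout)
    Returns a dict mapping (gene, TE) -> number of chimeric read pairs.
    """
    counts = defaultdict(int)
    # A pair is treated as chimeric when its ID is present in both lists.
    for pair_id in gene_hits.keys() & te_hits.keys():
        counts[(gene_hits[pair_id], te_hits[pair_id])] += 1
    return dict(counts)


def filter_chimeras(per_replicate, min_replicates=2, min_mean_reads=2.0):
    """Keep chimeras seen in enough replicates with enough reads on average.

    per_replicate: list of dicts as returned by chimeric_pairs, one per replicate.
    Assumption: the mean is taken over all replicates, with 0 where absent.
    """
    kept = {}
    candidates = set().union(*per_replicate)  # union of the dict keys
    for chimera in candidates:
        counts = [rep.get(chimera, 0) for rep in per_replicate]
        n_present = sum(1 for c in counts if c > 0)
        mean_reads = sum(counts) / len(counts)
        if n_present >= min_replicates and mean_reads >= min_mean_reads:
            kept[chimera] = mean_reads
    return kept


if __name__ == "__main__":
    rep1 = chimeric_pairs(
        {"r1": "geneA", "r2": "geneA", "r3": "geneB", "r4": "geneA"},
        {"r1": "roo", "r2": "roo", "r3": "INE-1", "r5": "roo"},
    )
    rep2 = {("geneA", "roo"): 4}
    rep3 = {("geneB", "INE-1"): 2}
    print(filter_chimeras([rep1, rep2, rep3]))
```

In this toy input, the geneA/roo chimera is supported in two of three replicates with a mean of 2 chimeric reads and is kept, whereas geneB/INE-1 is seen in two replicates but averages only 1 read and is discarded.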
Figure 2.
ChimeraTE Mode 2 (genome-blind) workflow. Round white boxes: input data; square boxes: pipeline step; round gray boxes: thresholds that can be modified. (A) Input data: two fasta files containing reference transcripts and TE insertions, as well as stranded paired-end reads from RNA-seq (fastq). (B) Alignment and chimeric reads: The alignment against transcripts is performed and their expression is calculated. Transcripts with FPKM < 1 are removed from the downstream analysis. Next, a list of reads aligned against transcripts is created. Through the alignment of reads against TE insertions, a second list with reads stemming from TEs is also created. Then, mapped paired-end reads and singletons are identified, generating the list of chimeric reads, for all replicates. All chimeric transcripts that have an average of chimeric reads >= 2 and are present in >= 2 replicates are maintained as true chimeras. (C) Transcriptome assembly and chimeric reads: The de novo transcriptome assembly is a non-default option of ChimeraTE Mode 2. It performs a transcriptome assembly and aligns reads against the assembled transcripts. Then, TE insertions in the assembled transcripts are identified with RepeatMasker and the TE reads are recovered. Using the two lists of reads (transcripts and TEs), the chimeric read list is generated and the putative assembled chimeric transcripts are predicted. Next, a blastn is performed between these transcripts and the reference transcripts provided in the input. All transcripts with length >= 80% are selected. The process is repeated for all RNA-seq replicates and chimeric transcripts assembled in >= 2 replicates are maintained as true chimeras. (D) Chimeric transcripts: if the assembly is activated, ChimeraTE Mode 2 provides three outputs: (1) Chimeric reads: predicted only by the method depicted in B; (2) Assembled transcripts: predicted only by the transcriptome assembly method depicted in C; and (3) Double evidence: predicted by both methods (B and C).
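The three Mode 2 output categories amount to a partition of two sets of predictions. The following Python sketch shows that partition under stated assumptions: the function name `classify_evidence`, the dictionary keys and the use of plain string identifiers for chimeras are hypothetical and do not reflect ChimeraTE's actual output format.

```python
def classify_evidence(by_reads, by_assembly):
    """Split chimeras into the three Mode 2 evidence categories.

    by_reads:    set of chimera IDs supported by the chimeric-read method (B)
    by_assembly: set of chimera IDs supported by the assembly method (C)
    """
    return {
        "chimeric_reads": by_reads - by_assembly,          # method B only
        "assembled_transcripts": by_assembly - by_reads,   # method C only
        "double_evidence": by_reads & by_assembly,         # both methods
    }


if __name__ == "__main__":
    # Toy identifiers, invented for illustration only.
    reads = {"geneA_roo", "geneB_INE-1", "geneC_hobo"}
    assembly = {"geneB_INE-1", "geneD_412"}
    for category, chimeras in classify_evidence(reads, assembly).items():
        print(category, sorted(chimeras))
```

Because the categories are disjoint by construction, each chimera is counted once, which matches the caption's wording that the first two outputs are predicted "only" by their respective method.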
Figure 3.
General results from ChimeraTE Mode 1. (A) Five examples of chimeric transcripts manually curated with the IGV genome browser. Red boxes: TE insertions; blue boxes: exons and UTRs; black density graphs: coverage of RNA-seq reads; arrowheads: transcription sense; blue and red boxes linked by a line: chimeric reads. (B) Total genes generating chimeric transcripts, following the TE position classification in the four wild-type strains.
Figure 4.
(A) Total number of chimeric transcripts found by ChimeraTE Mode 2. “Assembled transcripts”: chimeric transcripts detected only by the method of transcriptome assembly (Figure 2C). “Chimeric reads”: chimeric transcripts detected only by the method of chimeric reads (Figure 2B). “Double evidence”: chimeric transcripts detected by both methods. (B) Correlation between chimeric read coverage and gene expression of chimeric transcripts found by Mode 2. In all strains, we did not observe a correlation between gene expression (FPKM) and the coverage of chimeric reads in any of the three categories of evidence. All double evidence chimeras were found with high coverage of chimeric reads, suggesting high reliability of chimeras when both methods (chimeric reads and transcriptome assembly) are considered together.
Figure 5.
Mps1 gene and its downstream region in the genomes of the four wild-type strains. (A) Mps1 in the dm6 reference genome and the alt gene located downstream of it, on opposite strands and with overlapping 3’ UTRs. (B) In dmsj23 and dmsj7, Mps1 and alt are arranged as in the reference genome. (C) In dmgoth101 and dmgoth63, there is an FB insertion located downstream of Mps1, with chimeric reads supporting a TE-terminated downstream transcript in both strains.
Figure 6.
The frequency of the 76 TE families generating chimeric transcripts in the four wild-type strains. In chimeric transcripts derived from TEs near genes, INE-1 was the most frequent (15%) in TE-terminated downstream, whereas for TE-initiated upstream, S2, INE-1, hobo, 1360 and 412 had the same frequency (11%). Regarding TEs inside genes, the roo element has the highest frequency of TE-exonized embedded, representing 50% of all chimeras. In TE-exonized intronic and overlapped, INE-1 was the most frequent TE family.
Figure 7.
(A) Completeness of protein domains identified in TE insertions generating TE-exonized transcripts, with p-value < 0.05. (B) Alignment depth with roo consensus and embedded roo insertions generating TE-exonized transcripts. At the top, the scheme of the full-length roo element: brown boxes: LTRs; yellow box: first tandem repeats at 5' UTR; blue box: second tandem repeat at 5' UTR; red box: Open reading frame (ORF). The coverage depth of the multiple alignments between embedded roo insertions and the consensus is separated by strain.
Figure 8.
The 76 chimeric transcripts derived from TE insertions that are absent in the dm6 reference genome. Of these, 5.19% correspond to TEs located upstream, 74.03% to TEs located inside genes (introns and exons), and 20.78% to TEs located downstream. (A) TE upstream: Chimeric transcripts in which the TE is located up to 3 kb upstream of the gene. (B) TE inside: Chimeric transcripts with TE insertions located inside the gene region (exons and introns). There are four chimeric transcripts found in all strains, and one specific to French strains. (C) TE downstream: Chimeric transcripts in which the TE is located up to 3 kb downstream of the gene. Only Mps1-FB was specific to French strains, whereas all the other 15 chimeras are strain-specific.
Figure 9.
Normalized counts from DESeq2 indicating gene expression of genes generating polymorphic chimeric transcripts among the four wild-type strains. Colored shapes represent differentially expressed genes (adj. p-value < 0.05); transparent shapes represent non-differentially expressed genes (adj. p-value ≥ 0.05). (A) Expression level of genes producing chimeric transcripts only in dmgoth101, in comparison to dmgoth63, dmsj23, and dmsj7. (B) Expression level of genes producing chimeric transcripts only in dmgoth63, in comparison to dmgoth101, dmsj23, and dmsj7. (C) Expression level of genes producing chimeric transcripts only in dmsj23, in comparison to dmgoth101, dmgoth63, and dmsj7. (D) Expression level of genes producing chimeric transcripts only in dmsj7, in comparison to dmgoth101, dmgoth63, and dmsj23. Overall, gene expression does not change when comparing strains with and without chimeric transcripts.
