TopHat: discovering splice junctions with RNA-Seq

doi:10.1093/bioinformatics/btp120

Bioinformatics. 2009 May 1;25(9):1105-11.

doi: 10.1093/bioinformatics/btp120. Epub 2009 Mar 16.

Cole Trapnell¹, Lior Pachter, Steven L Salzberg

PMID: 19289445
PMCID: PMC2672628
DOI: 10.1093/bioinformatics/btp120

Cole Trapnell et al. Bioinformatics. 2009.

Affiliation

¹ Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA. cole@cs.umd.edu

Abstract

Motivation: A new protocol for sequencing the messenger RNA in a cell, known as RNA-Seq, generates millions of short sequence fragments in a single run. These fragments, or 'reads', can be used to measure levels of gene expression and to identify novel splice variants of genes. However, current software for aligning RNA-Seq data to a genome relies on known splice junctions and cannot identify novel ones. TopHat is an efficient read-mapping algorithm designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites.

Results: We mapped the RNA-Seq reads from a recent mammalian RNA-Seq experiment and recovered more than 72% of the splice junctions reported by the annotation-based software from that study, along with nearly 20,000 previously unreported junctions. The TopHat pipeline is much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is sufficient to process an entire RNA-Seq experiment in less than a day on a standard desktop computer. We describe several challenges unique to ab initio splice site discovery from RNA-Seq reads that will require further algorithm development.

Availability: TopHat is free, open-source software available from http://tophat.cbcb.umd.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
The TopHat pipeline. RNA-Seq reads are mapped against the whole reference genome, and those reads that do not map are set aside as initially unmapped (IUM) reads. An initial consensus of mapped regions is computed by Maq. Sequences flanking potential donor/acceptor splice sites within neighboring regions are joined to form potential splice junctions. The IUM reads are indexed and aligned to these splice junction sequences.
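
The junction-building step in this caption can be illustrated with a minimal Python sketch. It is not TopHat's implementation: it assumes canonical GT donor and AG acceptor dinucleotides, half-open `(start, end)` island coordinates, and a hypothetical `flank` parameter for the amount of exon sequence kept on each side of the junction.

```python
from typing import Iterator, List, Tuple

def find_sites(genome: str, start: int, end: int, motif: str) -> List[int]:
    """Return every position p in [start, end) where genome[p:p+2] equals motif."""
    last = min(end, len(genome) - 1)
    return [p for p in range(start, last) if genome[p:p + 2] == motif]

def candidate_junctions(
    genome: str,
    left_island: Tuple[int, int],
    right_island: Tuple[int, int],
    flank: int = 4,
) -> Iterator[Tuple[int, int, str]]:
    """Yield (intron_start, intron_end, junction_seq) for each GT..AG pairing.

    Assumption: donors are searched in the left island and acceptors in the
    right island; the junction sequence is `flank` bases upstream of the
    donor joined to `flank` bases downstream of the acceptor.
    """
    donors = find_sites(genome, left_island[0], left_island[1], "GT")
    acceptors = find_sites(genome, right_island[0], right_island[1], "AG")
    for d in donors:
        if d - flank < 0:
            continue  # not enough upstream sequence for the left flank
        for a in acceptors:
            intron_end = a + 2  # first base of the downstream exon
            if intron_end <= d + 2 or intron_end + flank > len(genome):
                continue  # acceptor must follow the donor and leave room for the flank
            yield d, intron_end, genome[d - flank:d] + genome[intron_end:intron_end + flank]

# Toy genome: exon "AAAACCCC", intron "GTTTTTTTAG", exon "GGGGTTTT"
genome = "AAAACCCC" + "GT" + "TTTTTT" + "AG" + "GGGGTTTT"
junctions = list(candidate_junctions(genome, (0, 10), (14, 26), flank=4))
```

On the toy genome, the only candidate joins the last four bases of the first exon to the first four bases of the second. A real genome would yield many candidates per island pair, which is why the reads themselves are then used to confirm or reject each one.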

**Fig. 2.**
An intron entirely overlapped by the 5′-UTR of another transcript. Both isoforms are present in the brain tissue RNA sample. The top track is the normalized uniquely mappable read coverage reported by ERANGE for this region (Mortazavi *et al.*, 2008). The lack of a large coverage gap causes TopHat to report a single island containing both exons. TopHat looks for introns within single islands in order to detect this junction.

**Fig. 3.**
The seed and extend alignment used to match reads to possible splice sites. For each possible splice site, a seed is formed by combining a small amount of sequence upstream of the donor and downstream of the acceptor. This seed, shown in dark gray, is used to query the index of reads that were not initially mapped by Bowtie. Any read containing the seed is checked for a complete alignment to the exons on either side of the possible splice. In the light gray portion of the alignment, TopHat allows a user-specified number of mismatches. Because reads typically contain low-quality base calls on their 3′ ends, TopHat only examines the first 28 bp on the 5′ end of each read by default.
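
The seed-and-extend idea described in this caption can be sketched as follows. This is an illustrative sketch only, not TopHat's code: the k-mer dictionary index, the function names, and the `seed_half` parameter are assumptions made for the example, while the 28 bp prefix and the mismatch allowance outside the seed come from the caption.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def build_seed_index(reads: List[str], k: int, prefix_len: int = 28) -> Dict[str, List[Tuple[int, int]]]:
    """Map every k-mer in the first prefix_len bases of each read to (read_id, offset)."""
    index: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for rid, read in enumerate(reads):
        prefix = read[:prefix_len]
        for off in range(len(prefix) - k + 1):
            index[prefix[off:off + k]].append((rid, off))
    return index

def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def align_to_junction(reads, index, left_exon, right_exon, seed_half,
                      max_mismatches=0, prefix_len=28):
    """Return (read_id, split) for reads spanning the left_exon|right_exon junction.

    The seed (seed_half bases from each side) must match exactly because it
    is looked up in the index; mismatches are tolerated only outside it.
    """
    seed = left_exon[-seed_half:] + right_exon[:seed_half]
    hits = []
    for rid, off in index.get(seed, []):
        read = reads[rid][:prefix_len]   # same prefix length as used for the index
        split = off + seed_half          # read position where the junction falls
        left_part, right_part = read[:split], read[split:]
        if len(left_part) > len(left_exon) or len(right_part) > len(right_exon):
            continue  # read extends past the available exon sequence
        ref = left_exon[len(left_exon) - len(left_part):] + right_exon[:len(right_part)]
        if hamming(read, ref) <= max_mismatches:
            hits.append((rid, split))
    return hits

left_exon, right_exon = "ACGTACGTAC", "TTGGCCAATT"
reads = ["CGTACTTGGC", "CGTACTTGGA", "GGGGGGGGGG"]  # exact, one mismatch, unrelated
index = build_seed_index(reads, k=6)
exact_hits = align_to_junction(reads, index, left_exon, right_exon, seed_half=3)
relaxed_hits = align_to_junction(reads, index, left_exon, right_exon, seed_half=3, max_mismatches=1)
```

Indexing the reads rather than the junctions means each candidate junction costs one dictionary lookup, so the number of candidates can be large without rescanning the read set.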

**Fig. 4.**
TopHat sensitivity as RPKM varies. For genes transcribed above 15.0 RPKM, TopHat detects more than 80% of the junctions reported by ERANGE in the *M. musculus* brain tissue study. TopHat detects more than 72% of all junctions observed by ERANGE, including those in genes expressed at only a single transcript per cell. A *de novo* assembly of the RNA-Seq reads, followed by spliced alignment of the assembled transcripts, produces markedly poorer sensitivity, detecting around 40% of junctions in genes transcribed above 25.0 RPKM, but comparatively few junctions in more highly transcribed genes.
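
For readers unfamiliar with the unit, RPKM (reads per kilobase of exon model per million mapped reads, the measure introduced by Mortazavi *et al.*, 2008) can be computed as in the short sketch below. The function and argument names are illustrative, and the example counts are made up for the arithmetic rather than taken from the study.

```python
def rpkm(read_count: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of exon model per million mapped reads.

    RPKM = 10^9 * C / (N * L), where C is the number of reads mapped to the
    gene, N the total number of mapped reads, and L the exon length in bp.
    """
    return read_count * 1e9 / (total_mapped_reads * exon_length_bp)

# Hypothetical gene: 150 reads on 1,500 bp of exon, 10 million mapped reads overall
example_rpkm = rpkm(150, 1500, 10_000_000)
```

With these made-up numbers the gene sits at 10 RPKM, below the 15.0 RPKM threshold quoted in the caption.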

**Fig. 5.**
The BLAT E-value distribution of known, previously unreported, and randomly generated splice junction sequences when searched against GenBank mouse ESTs. As expected, known junctions have high-quality BLAT hits to the EST database, whereas randomly generated junction sequences do not. High-quality BLAT hits for more than 11% of the junctions identified by TopHat suggest that the UCSC gene models for mouse are incomplete. These junctions are almost certainly genuine, and because the mouse EST database is not complete, 11% is only a lower bound on the specificity of TopHat.

**Fig. 6.**
TopHat detects junctions in genes transcribed at very low levels. The gene *Pnlip* was transcribed at only 7.88 RPKM in the brain tissue according to ERANGE, and yet TopHat reports the complete known gene model.

**Fig. 7.**
A previously unreported splice junction detected by TopHat is shown as the topmost horizontal line. This junction skips two exons in the ADP-ribosylation gene *Arfgef1*. As explained in Section 2, islands of read coverage in the Bowtie mapping are extended by 45 bp on either side.
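
The island extension mentioned in this caption can be sketched as below. The 45 bp default comes from the caption; the per-base coverage list, the `min_cov` threshold, and the merging of islands that overlap after extension are assumptions made for illustration and may differ from what TopHat does.

```python
from typing import List, Tuple

def coverage_islands(coverage: List[int], extend: int = 45, min_cov: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) islands of covered bases, extended on both sides."""
    raw: List[Tuple[int, int]] = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth >= min_cov and start is None:
            start = pos                      # island opens
        elif depth < min_cov and start is not None:
            raw.append((start, pos))         # island closes
            start = None
    if start is not None:
        raw.append((start, len(coverage)))   # island runs to the end of the sequence

    merged: List[Tuple[int, int]] = []
    for s, e in raw:
        s, e = max(0, s - extend), min(len(coverage), e + extend)
        if merged and s <= merged[-1][1]:
            # Assumption: islands that touch or overlap after extension are merged
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

# Toy per-base coverage with three covered stretches
coverage = [0] * 10 + [3] * 5 + [0] * 4 + [2] * 5 + [0] * 20 + [1] * 3 + [0] * 10
islands_small = coverage_islands(coverage, extend=3)
islands_raw = coverage_islands(coverage, extend=0)
```

Extending islands gives the junction search some sequence beyond the observed coverage, which helps when the ends of an exon are thinly covered. The trade-off is that nearby exons may fuse into one island, the situation that Fig. 2 describes.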

References

1. Abouelhoda M, et al. Replacing suffix trees with enhanced suffix arrays. J. Discrete Alg. 2004;2:53–86.
2. Adams MD, et al. Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nat. Genet. 1993;4:373–380.
3. Burrows M, Wheeler D. A block sorting lossless data compression algorithm. Technical Report 124. Palo Alto, California: DEC, Digital Systems Research Center; 1994.
4. Cloonan N, et al. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat. Meth. 2008;5:613–619.
5. De Bona F, et al. Optimal spliced alignments of short sequence reads. Bioinformatics. 2008;24:i174–i180.


