Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks

doi:10.1038/nprot.2012.016

Nat Protoc. 2012 Mar 1;7(3):562-78.

Cole Trapnell¹, Adam Roberts, Loyal Goff, Geo Pertea, Daehwan Kim, David R Kelley, Harold Pimentel, Steven L Salzberg, John L Rinn, Lior Pachter

PMID: 22383036
PMCID: PMC3334321
DOI: 10.1038/nprot.2012.016

Cole Trapnell et al. Nat Protoc. 2012.

Affiliation

¹ Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA. cole@broadinstitute.org

Erratum in

Nat Protoc. 2014 Oct;9(10):2513

Abstract

Recent advances in high-throughput cDNA sequencing (RNA-seq) can reveal new genes and splice variants and quantify expression genome-wide in a single assay. The volume and complexity of data from RNA-seq experiments necessitate scalable, fast and mathematically principled analysis software. TopHat and Cufflinks are free, open-source software tools for gene discovery and comprehensive expression analysis of RNA-seq data. Together, they allow biologists to identify new genes and new splice variants of known ones, as well as compare gene and transcript expression under two or more conditions. This protocol describes in detail how to use TopHat and Cufflinks to perform such analyses. It also covers several accessory tools and utilities that aid in managing data, including CummeRbund, a tool for visualizing RNA-seq analysis results. Although the procedure assumes basic informatics skills, these tools assume little to no background with RNA-seq analysis and are meant for novices and experts alike. The protocol begins with raw sequencing reads and produces a transcriptome assembly, lists of differentially expressed and regulated genes and transcripts, and publication-quality visualizations of analysis results. The protocol's execution time depends on the volume of transcriptome sequencing data and available computing resources but takes less than 1 d of computer time for typical experiments and ∼1 h of hands-on time.

Figures

**Figure 1**
Software components used in this protocol. Bowtie forms the algorithmic core of TopHat, which aligns millions of RNA-seq reads to the genome per CPU hour. TopHat's read alignments are assembled by Cufflinks and its associated utility program to produce a transcriptome annotation of the genome. Cuffdiff quantifies this transcriptome across multiple conditions using the TopHat read alignments. CummeRbund helps users rapidly explore and visualize the gene expression data produced by Cuffdiff, including differentially expressed genes and transcripts.

**Figure 2**
An overview of the Tuxedo protocol. In an experiment involving two conditions, reads are first mapped to the genome with TopHat. The reads for each biological replicate are mapped independently. These mapped reads are provided as input to Cufflinks, which produces one file of assembled transfrags for each replicate. The assembly files are merged with the reference transcriptome annotation into a unified annotation for further analysis. This merged annotation is quantified in each condition by Cuffdiff, which produces expression data in a set of tabular files. These files are indexed and visualized with CummeRbund to facilitate exploration of genes identified by Cuffdiff as differentially expressed, spliced or transcriptionally regulated. FPKM, fragments per kilobase of transcript per million fragments mapped.
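
The order of steps in this overview can be sketched as a small Python helper that only builds the command lines and runs nothing. This is an illustration under stated assumptions, not the protocol's own script: the function name `tuxedo_commands`, the read-file naming (`<replicate>_1.fq`, `<replicate>_2.fq`), the output directory names and the default file names are hypothetical, and the flags shown should be checked against the help output of the installed TopHat and Cufflinks versions.

```python
from typing import Dict, List


def tuxedo_commands(
    conditions: Dict[str, List[str]],
    genome_index: str = "genome",     # hypothetical Bowtie index prefix
    genome_fasta: str = "genome.fa",  # hypothetical genome FASTA
    annotation: str = "genes.gtf",    # hypothetical reference annotation
    threads: int = 8,
) -> List[List[str]]:
    """Build (but do not run) argv lists for map -> assemble -> merge -> quantify."""
    p = str(threads)
    commands: List[List[str]] = []

    # Map and assemble each biological replicate independently.
    for replicates in conditions.values():
        for rep in replicates:
            commands.append([
                "tophat", "-p", p, "-G", annotation, "-o", f"{rep}_thout",
                genome_index, f"{rep}_1.fq", f"{rep}_2.fq",
            ])
            commands.append([
                "cufflinks", "-p", p, "-o", f"{rep}_clout",
                f"{rep}_thout/accepted_hits.bam",
            ])

    # Merge the per-replicate assemblies with the reference annotation.
    # assemblies.txt is assumed to list each <rep>_clout/transcripts.gtf.
    commands.append([
        "cuffmerge", "-g", annotation, "-s", genome_fasta, "-p", p,
        "assemblies.txt",
    ])

    # Quantify the merged annotation: one comma-separated BAM list per condition.
    bam_groups = [
        ",".join(f"{rep}_thout/accepted_hits.bam" for rep in reps)
        for reps in conditions.values()
    ]
    commands.append([
        "cuffdiff", "-o", "diff_out", "-b", genome_fasta, "-p", p,
        "-L", ",".join(conditions), "-u", "merged_asm/merged.gtf",
        *bam_groups,
    ])
    return commands


if __name__ == "__main__":
    for argv in tuxedo_commands({"C1": ["C1_R1", "C1_R2"], "C2": ["C2_R1", "C2_R2"]}):
        print(" ".join(argv))
```

Building the commands as lists keeps the sketch side-effect free; each list could later be passed to `subprocess.run` once the paths have been adapted to a real experiment.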

**Figure 3**
Merging sample assemblies with a reference transcriptome annotation. Genes with low expression may receive insufficient sequencing depth to permit full reconstruction in each replicate. However, merging the replicate assemblies with Cuffmerge often recovers the complete gene. Newly discovered isoforms are also integrated with known ones at this stage into more complete gene models.

**Figure 4**
Analyzing groups of transcripts identifies differentially regulated genes. (a) Genes may produce multiple splice variants (labeled A–C) at different abundances through alternative transcription start sites (TSS), alternative cleavage and polyadenylation of 3′ ends, or by alternative splicing of primary transcripts. (b) Grouping isoforms by TSS and looking for changes in relative abundance between and within these groups yield mechanistic clues into how genes are differentially regulated. (c) For example, in the above hypothetical gene, changes in the relative abundance between isoforms A and B within TSS I group across conditions may be attributable to differential splicing of the primary transcript from which they are both produced. (d) Adding their expression levels yields a proxy expression value for this primary transcript. (e) Changes in this level relative to the gene's other primary transcript (i.e., isoform C) indicate possible differential promoter preference across conditions. (**f,g**) Similarly, genes with multiple annotated coding sequences (CDS) (f) can be analyzed for differential output of protein-coding sequences (g).
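
The grouping described in panels b to e can be illustrated with a short Python sketch. It sums isoform expression within each TSS group as a proxy for the primary transcript, then reports each isoform's share of its group (a splicing signal) and each group's share of the gene (a promoter-preference signal). The function name, the input dictionaries and the FPKM values are hypothetical, and the sketch computes proportions only; it does not reproduce Cuffdiff's statistical tests.

```python
from collections import defaultdict
from typing import Dict, Tuple


def tss_group_summary(
    isoform_fpkm: Dict[str, float],
    isoform_tss: Dict[str, str],
) -> Tuple[Dict[str, float], Dict[str, Dict[str, float]], Dict[str, float]]:
    """Summarize one gene in one condition by TSS group."""
    groups: Dict[str, Dict[str, float]] = defaultdict(dict)
    for isoform, value in isoform_fpkm.items():
        groups[isoform_tss[isoform]][isoform] = value

    # Panel d: the sum over a TSS group stands in for its primary transcript.
    group_total = {tss: sum(members.values()) for tss, members in groups.items()}
    gene_total = sum(group_total.values())

    # Panel c: relative abundance of isoforms within each TSS group.
    within = {
        tss: {
            iso: (value / group_total[tss] if group_total[tss] > 0 else 0.0)
            for iso, value in members.items()
        }
        for tss, members in groups.items()
    }

    # Panel e: relative abundance of each primary transcript within the gene.
    between = {
        tss: (total / gene_total if gene_total > 0 else 0.0)
        for tss, total in group_total.items()
    }
    return group_total, within, between


if __name__ == "__main__":
    # Hypothetical FPKM values for isoforms A-C of the gene in the figure.
    fpkm = {"A": 30.0, "B": 10.0, "C": 60.0}
    tss = {"A": "TSS I", "B": "TSS I", "C": "TSS II"}
    totals, within, between = tss_group_summary(fpkm, tss)
    print(totals, within, between)
```

Running the same function on each condition and comparing the `within` and `between` proportions across conditions gives the two kinds of change the caption distinguishes.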

**Figure 5**
CummeRbund helps users rapidly explore their expression data and create publication-ready plots of differentially expressed and regulated genes. With just a few lines of plotting code, CummeRbund can visualize differential expression at the isoform level, as well as broad patterns among large sets of genes. (a) A myoblast differentiation time-course experiment reveals the emergence of a skeletal muscle-specific isoform of tropomyosin I. (b) This same time-course data capture the dynamics of hundreds of other genes in the mouse transcriptome during muscle development. FPKM, fragments per kilobase of transcript per million fragments mapped.

**Figure 6**
CummeRbund plots of the expression level distribution for all genes in simulated experimental conditions C1 and C2. FPKM, fragments per kilobase of transcript per million fragments mapped.

**Figure 7**
CummeRbund scatter plots highlight general similarities and specific outliers between conditions C1 and C2. Scatter plots can be created from expression data for genes, splice isoforms, TSS groups or CDS groups.

**Figure 8**
CummeRbund volcano plots reveal genes, transcripts, TSS groups or CDS groups that differ significantly between conditions C1 and C2.
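
For readers unfamiliar with volcano plots, the coordinates of a single point can be sketched in Python: the x axis is the log2 fold change between the two conditions and the y axis is the negative log10 of the test's P value. The function name and the `pseudocount` parameter are assumptions made for this illustration (the pseudocount only avoids division by zero) and do not describe how CummeRbund computes its plots.

```python
import math
from typing import Tuple


def volcano_point(
    fpkm_c1: float,
    fpkm_c2: float,
    p_value: float,
    pseudocount: float = 1.0,  # hypothetical guard against zero expression
) -> Tuple[float, float]:
    """Return (log2 fold change of C2 over C1, -log10 of the P value)."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in (0, 1]")
    log2_fold_change = math.log2((fpkm_c2 + pseudocount) / (fpkm_c1 + pseudocount))
    neg_log10_p = -math.log10(p_value)
    return log2_fold_change, neg_log10_p


if __name__ == "__main__":
    # Hypothetical feature: FPKM 3 in C1, FPKM 15 in C2, P = 0.001.
    print(volcano_point(3.0, 15.0, 0.001))
```

Features with large changes and small P values land in the upper left or upper right of the plot, which is why significant outliers are easy to spot.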

**Figure 9**
Differential analysis results for *regucalcin*. (a) Expression plot shows clear differences in the expression of *regucalcin* across conditions C1 and C2, measured in FPKM (**Box 2**). Expression of a transcript is proportional to the number of reads sequenced from that transcript after normalizing for that transcript's length. Each gene and transcript expression value is annotated with error bars that capture both cross-replicate variability and measurement uncertainty as estimated by Cuffdiff's statistical model of RNA-seq. (b) Changes in *regucalcin* expression are attributable to a large increase in the expression of one of four alternative isoforms. (c) The read coverage, viewed through the genome browsing application IGV, shows an increase in sequencing reads originating from the gene in condition C2.
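
The FPKM unit used in panel a follows directly from its definition (fragments per kilobase of transcript per million fragments mapped) and can be written as a one-line calculation. The Python sketch below is the plain arithmetic only, with a hypothetical function name and example numbers; it assumes the fragment count for the transcript is already known, whereas Cufflinks and Cuffdiff estimate that quantity with a statistical model.

```python
def fpkm(fragments: float, transcript_length_bp: float, total_mapped_fragments: float) -> float:
    """Fragments per kilobase of transcript per million fragments mapped."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    # Divide by length in kilobases (bp / 1e3) and by library size in millions
    # (fragments / 1e6), which together contribute the factor of 1e9.
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


if __name__ == "__main__":
    # Hypothetical transcript: 500 fragments, 2,000 bp long, 20 million mapped fragments.
    print(fpkm(500, 2_000, 20_000_000))
```

Doubling the transcript length or the library size halves the value, which is the normalization the caption refers to.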

**Figure 10**
Differential analysis results for *Rala*. (a) This gene has four isoforms in the merged assembly. (b) Cuffdiff identifies TCONS_00024713 and TCONS_00024715 as being significantly differentially expressed. The relatively modest overall change in gene-level expression, combined with high isoform-level measurement variability, leaves Cuffdiff unable to reject the null hypothesis that the observed gene-level change is attributable to measurement or cross-replicate variability.
