Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown

Nat Protoc. 2016 Sep;11(9):1650-67. doi: 10.1038/nprot.2016.095. Epub 2016 Aug 11.

Mihaela Pertea¹ ², Daehwan Kim¹, Geo M Pertea¹, Jeffrey T Leek³, Steven L Salzberg¹ ² ³ ⁴

Affiliations

¹ Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland, USA.
² Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland, USA.
³ Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland, USA.
⁴ Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, USA.

PMID: 27560171
PMCID: PMC5032908
DOI: 10.1038/nprot.2016.095

Mihaela Pertea et al. Nat Protoc. 2016 Sep.

Abstract

High-throughput sequencing of mRNA (RNA-seq) has become the standard method for measuring and comparing the levels of gene expression in a wide variety of species and conditions. RNA-seq experiments generate very large, complex data sets that demand fast, accurate and flexible software to reduce the raw read data to comprehensible results. HISAT (hierarchical indexing for spliced alignment of transcripts), StringTie and Ballgown are free, open-source software tools for comprehensive analysis of RNA-seq experiments. Together, they allow scientists to align reads to a genome, assemble transcripts including novel splice variants, compute the abundance of these transcripts in each sample and compare experiments to identify differentially expressed genes and transcripts. This protocol describes all the steps necessary to process a large set of raw sequencing reads and create lists of gene transcripts, expression levels, and differentially expressed genes and transcripts. The protocol's execution time depends on the computing resources, but it typically takes under 45 min of computer time. HISAT, StringTie and Ballgown are available from http://ccb.jhu.edu/software.shtml.
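
The workflow outlined in the abstract (align, assemble, merge, compare to annotation, re-estimate abundances) can be sketched as a set of command builders. This is a minimal illustration, not the protocol's own script: it assumes the `hisat2`, `samtools`, `stringtie` and `gffcompare` command-line programs, paired-end input files named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`, and hypothetical sample, index and annotation paths. The flags shown should be checked against the installed versions of each tool. The functions only build argument lists; nothing is executed.

```python
import shlex


def per_sample_commands(sample, index, annotation, threads=8):
    """Build the align, sort and assemble commands for one paired-end sample."""
    sam, bam = f"{sample}.sam", f"{sample}.bam"
    return [
        # Map reads to the genome; --dta reports alignments tailored for assemblers.
        ["hisat2", "-p", str(threads), "--dta", "-x", index,
         "-1", f"{sample}_1.fastq.gz", "-2", f"{sample}_2.fastq.gz", "-S", sam],
        # Sort the alignments and write them as BAM for StringTie.
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        # Assemble and quantify transcripts, guided by the reference annotation.
        ["stringtie", "-p", str(threads), "-G", annotation,
         "-o", f"{sample}.gtf", "-l", sample, bam],
    ]


def merge_command(merge_list, annotation, merged="stringtie_merged.gtf", threads=8):
    """Build the command that merges per-sample assemblies into one transcript set.

    merge_list is a text file naming one per-sample GTF per line.
    """
    return ["stringtie", "--merge", "-p", str(threads), "-G", annotation,
            "-o", merged, merge_list]


def compare_command(annotation, merged="stringtie_merged.gtf", prefix="merged"):
    """Build the command that compares merged transcripts with the annotation."""
    return ["gffcompare", "-r", annotation, "-o", prefix, merged]


def quantify_command(sample, transcripts, out_dir="ballgown", threads=8):
    """Build the command that re-estimates abundances and writes Ballgown tables."""
    return ["stringtie", "-e", "-B", "-p", str(threads), "-G", transcripts,
            "-o", f"{out_dir}/{sample}/{sample}.gtf", f"{sample}.bam"]


if __name__ == "__main__":
    # Hypothetical sample names and paths, for illustration only.
    samples = ["sampleA", "sampleB"]
    index, annotation = "indexes/genome", "genes/annotation.gtf"
    commands = []
    for s in samples:
        commands += per_sample_commands(s, index, annotation)
    commands.append(merge_command("mergelist.txt", annotation))
    commands.append(compare_command(annotation))
    for s in samples:
        commands.append(quantify_command(s, "stringtie_merged.gtf"))
    for cmd in commands:
        print(shlex.join(cmd))
```

Passing the reference annotation instead of the merged GTF to `quantify_command` corresponds to quantifying only known, annotated transcripts without the assembly steps.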

Figures

**Figure 1**
An overview of the "new Tuxedo" protocol. In an experiment involving multiple RNA-seq data sets, reads are first mapped to the genome using HISAT (Steps 1–2). Annotation of reference genes and transcripts can be provided as input, but this is optional, as indicated by the dotted line. The alignments are then passed to StringTie (Step 3), which assembles and quantifies the transcripts in each sample. (In the alternative protocol, the alignments from Step 2 are passed directly to Step 6, skipping all assembly steps. Step 6 will then estimate abundance only for known, annotated transcripts.) After initial assembly, the assembled transcripts are merged together (Step 4) by a special StringTie module, which creates a uniform set of transcripts for all samples. StringTie can use annotation in both of these steps, as shown by the dotted lines. The gffcompare program then compares the genes and transcripts to the annotation and reports statistics on this comparison (Step 5). In Step 6, StringTie processes the read alignments and either the merged transcripts or the reference annotation (through the diamond labeled "OR"). Using this input, StringTie re-estimates abundances where necessary and creates new transcript tables for input to Ballgown. Ballgown then compares all transcripts across conditions and produces tables and plots of differentially expressed genes and transcripts (Steps 7–21). Black and curved blue lines in the figure represent input to and output from the programs, respectively. Optional inputs are represented by dotted lines.

**Figure 2**
Merging transcript assemblies using StringTie’s merge function. In this example, four partial assemblies from four different samples are merged into two transcripts A and B. Samples 1 and 2 are both consistent with the reference annotation, which is used here to merge and extend them to create transcript A. Samples 3 and 4 are consistent with each other but not with the annotation, and these are merged to create transcript B.

**Figure 3**
Distribution of FPKM values across the 12 samples. Samples from the same sex are shown in the same color: males in blue, and females in orange.

**Figure 4**
FPKM distributions in males and females for transcript NM_012227 from gene GTPBP6 (GTP Binding Protein 6), a gene that is known to be more highly expressed in males.

**Figure 5**
Structure and expression levels of five distinct isoforms of the XIST gene in sample ERR188234. Expression levels are shown in varying shades of yellow. The third isoform is expressed at a much higher level than the others, as indicated by the darker color.

**Figure 6**
Overall distribution of differential expression p-values in females and males. Frequencies of p-values are shown for (a) transcripts and (b) genes.
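
Figures 3 and 4 report expression in FPKM (fragments per kilobase of transcript per million mapped fragments). As a reminder of what the unit means, the sketch below computes FPKM from a raw fragment count. It illustrates the definition of the unit only and is not how StringTie estimates abundance; the function name and the example numbers are hypothetical. The `log2(FPKM + 1)` transform is a common convention for plotting such distributions and is an assumption here, not something stated in the captions.

```python
import math


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # Dividing by (length / 1e3) and by (library size / 1e6) gives the 1e9 factor.
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def log2_fpkm(value):
    """Log-transform an FPKM value; the +1 keeps zero expression finite."""
    return math.log2(value + 1)


if __name__ == "__main__":
    # Hypothetical transcript: 500 fragments, 2,000 bp long, 20 million mapped fragments.
    value = fpkm(500, 2000, 20_000_000)
    print(value)  # 12.5
    print(log2_fpkm(value))
```

Normalizing by both transcript length and library size is what makes values comparable across transcripts and across samples of different sequencing depth.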

Cited by

Gene Expression Profiling Regulated by lncRNA H19 Using Bioinformatic Analyses in Glioma Cell Lines.
Chae Y, Roh J, Im M, Jang W, Kim B, Kang J, Youn B, Kim W. Cancer Genomics Proteomics. 2024 Nov-Dec;21(6):608-621. doi: 10.21873/cgp.20477. PMID: 39467632. Free PMC article.

Proteomic and Transcriptomic Analyses Reveal Pathological Changes in the Entorhinal Cortex Region that Correlate Well with Dysregulation of Ion Transport in Patients with Alzheimer's Disease.
Jia Y, Wang X, Chen Y, Qiu W, Ge W, Ma C. Mol Neurobiol. 2021 Aug;58(8):4007-4027. doi: 10.1007/s12035-021-02356-3. Epub 2021 Apr 27. PMID: 33904022.

Viral genomic, metagenomic and human transcriptomic characterization and prediction of the clinical forms of COVID-19.
Rodriguez C, de Prost N, Fourati S, Lamoureux C, Gricourt G, N'debi M, Canoui-Poitrine F, Désveaux I, Picard O, Demontant V, Trawinski E, Lepeule R, Surgers L, Vindrios W, Lelièvre JD, Mongardon N, Langeron O, Cohen JL, Mekontso-Dessap A, Woerther PL, Pawlotsky JM. PLoS Pathog. 2021 Mar 29;17(3):e1009416. doi: 10.1371/journal.ppat.1009416. eCollection 2021 Mar. PMID: 33780519. Free PMC article. Clinical Trial.

Integrated Analysis of Long Non-Coding RNA and mRNA to Reveal Putative Candidate Genes Associated with Backfat Quality in Beijing Black Pig.
Liu X, Tian W, Wang L, Zhang L, Liang J, Wang L. Foods. 2022 Nov 15;11(22):3654. doi: 10.3390/foods11223654. PMID: 36429246. Free PMC article.

SgRVE6, a LHY-CCA1-Like Transcription Factor From Fine-Stem Stylo, Upregulates NB-LRR Gene Expression and Enhances Cold Tolerance in Tobacco.
Chen S, Huang HA, Chen JH, Fu CC, Zhan PL, Ke SW, Zhang XQ, Zhong TX, Xie XM. Front Plant Sci. 2020 Aug 19;11:1276. doi: 10.3389/fpls.2020.01276. eCollection 2020. PMID: 32973836. Free PMC article.
