Abstract
Methods used to sequence the transcriptome often produce more than 200 million short sequences. We introduce StringTie, a computational method that applies a network flow algorithm originally developed in optimization theory, together with optional de novo assembly, to assemble these complex data sets into transcripts. When used to analyze both simulated and real data sets, StringTie produces more complete and accurate reconstructions of genes and better estimates of expression levels than other leading transcript assembly programs, including Cufflinks, IsoLasso, Scripture and Traph. For example, on 90 million reads from human blood, StringTie correctly assembled 10,990 transcripts, whereas the next best program, Cufflinks, correctly assembled 7,187, a 53% increase for StringTie. On a simulated data set, StringTie correctly assembled 7,559 transcripts, 20% more than the 6,310 assembled by Cufflinks. In addition to producing a more complete transcriptome assembly, StringTie runs faster than other assembly software, including Cufflinks, on all data sets tested to date.
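The two percentage gains quoted in the abstract follow directly from the transcript counts. The short Python sketch below recomputes them; the helper name `percent_increase` and the variable names are illustrative choices, not part of StringTie or of the paper.

```python
def percent_increase(new: int, baseline: int) -> float:
    """Return the relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Correctly assembled transcript counts as quoted in the abstract
# (StringTie versus Cufflinks, the next best program).
blood = percent_increase(10_990, 7_187)     # 90 million reads from human blood
simulated = percent_increase(7_559, 6_310)  # simulated data set

print(f"human blood: {blood:.1f}% more transcripts than Cufflinks")
print(f"simulated:   {simulated:.1f}% more transcripts than Cufflinks")
```

Rounded to whole percentages, the two values agree with the 53% and 20% figures stated in the abstract.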
Acknowledgements
These studies were supported in part by US National Institutes of Health grants R01-HG006677 (S.L.S.), R01-HG006102 (S.L.S.), R01-GM105705 (G.M.P.), R01-CA120185 (J.T.M.), P01-CA134292 (J.T.M.), and the Cancer Prevention and Research Institute of Texas (J.T.M.).
Author information
Contributions
M.P. designed the StringTie method with input from S.L.S. M.P. and G.M.P. implemented the algorithms. C.M.A. ran all programs on the RNA-seq data and tuned their performance. J.T.M. and T.-C.C. produced the kidney cell line data and gave feedback on StringTie's performance. M.P. and S.L.S. wrote the paper. S.L.S. supervised the entire project. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–13, Supplementary Tables 1–11 and Supplementary Discussion (PDF 1024 kb)
Supplementary Software 1
StringTie code (ZIP 351 kb)
About this article
Cite this article
Pertea, M., Pertea, G., Antonescu, C. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 33, 290–295 (2015). https://doi.org/10.1038/nbt.3122
DOI: https://doi.org/10.1038/nbt.3122