Full-length transcriptome assembly from RNA-Seq data without a reference genome

Abstract

Massively parallel sequencing of cDNA has enabled deep and efficient probing of transcriptomes. Current approaches for transcript reconstruction from such data often rely on aligning reads to a reference genome, and are thus unsuitable for samples with a partial or missing reference genome. Here we present the Trinity method for de novo assembly of full-length transcripts and evaluate it on samples from fission yeast, mouse and whitefly, whose reference genome is not yet available. By efficiently constructing and analyzing sets of de Bruijn graphs, Trinity fully reconstructs a large fraction of transcripts, including alternatively spliced isoforms and transcripts from recently duplicated genes. Compared with other de novo transcriptome assemblers, Trinity recovers more full-length transcripts across a broad range of expression levels, with a sensitivity similar to methods that rely on genome alignments. Our approach provides a unified solution for transcriptome reconstruction in any sample, especially in the absence of a reference genome.
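
The de Bruijn graph idea mentioned in the abstract can be illustrated with a minimal sketch. This is not Trinity's implementation: the function names `build_de_bruijn` and `branching_nodes`, the toy reads and the choice of k = 4 are all assumptions made for illustration. Each k-mer in a read becomes an edge between its (k-1)-mer prefix and suffix, and nodes with more than one successor mark points where an assembler would have to choose between alternative paths.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: each k-mer adds an edge prefix -> suffix.

    Returns a mapping from each (k-1)-mer to its successor (k-1)-mers,
    with the number of times the connecting k-mer was observed.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def branching_nodes(graph):
    """Return the nodes that have more than one outgoing edge, sorted."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


# Hypothetical toy reads; real RNA-Seq reads are far longer and noisier.
reads = ["ACGTAC", "CGTACG", "ACGTTG"]
graph = build_de_bruijn(reads, k=4)
branches = branching_nodes(graph)
```

In this toy input the only branch is at node `CGT`, which is followed by either `GTA` or `GTT`. In real data such a branch could reflect a sequencing error, a polymorphism or an alternative isoform, and the k-mer counts on the edges are one signal a method could use to tell these apart.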

Figure 1: Overview of Trinity.
Figure 2: Trinity correctly reconstructs the majority of full-length transcripts in fission yeast and mouse.
Figure 3: Trinity improves the yeast annotation.
Figure 4: Trinity resolves closely paralogous genes.
Figure 5: Comparison of Trinity to other mapping-first and assembly-first methods.
Figure 6: Trinity reconstructs polymorphic transcripts in whitefly.

Accession codes

Primary accessions

Gene Expression Omnibus

Sequence Read Archive

Acknowledgements

We thank L. Gaffney for help with figure preparation, J. Bochicchio for project management, the Broad Sequencing Platform for all sequencing work, A. Papanicolaou and M. Ott for Inchworm software testing and code enhancements, and F. Ribeiro for helpful discussions regarding error pruning. The work was supported in part by a grant from the National Human Genome Research Institute (NIH 1 U54 HG03067, Lander), the Howard Hughes Medical Institute, a National Institutes of Health PIONEER award, a Burroughs Wellcome Fund Career Award at the Scientific Interface (A.R.), the US-Israel Binational Science Foundation (N.F. and A.R.), and funds from the National Institute of Allergy and Infectious Diseases under contract no. HHSN27220090018C. M.Y. was supported by a Clore Fellowship. K.L.-T. is a recipient of the European Young Investigator Award (EURYI) funded by the European Science Foundation. A.R. is a researcher of the Merkin Foundation for Stem Cell Research at the Broad Institute.

Author information

Contributions

M.G.G., M.Y., B.J.H., K.L.-T., N.F. and A.R. conceived and designed the study. B.J.H., M.G.G. and M.Y. developed the Inchworm, Chrysalis and Butterfly components, respectively. N.R., F.D.P., B.W.B., C.N. and K.L.-T. contributed to the study's conception and execution. J.Z.L., D.A.T., X.A., L.F., R.R., I.A., N.H., A.R. and A.G. designed and performed all experiments. Q.Z., Z.C. and E.M. contributed computational analyses. M.G.G., B.J.H. and M.Y. designed, implemented and evaluated all methods. A.R., N.F., M.G.G., B.J.H. and M.Y. wrote the manuscript, with input from all authors. A.R. and N.F. contributed equally to this paper.

Corresponding authors

Correspondence to Nir Friedman or Aviv Regev.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Tables 1–3, Supplementary Methods, Supplementary Note and Supplementary Figures 1–9 (PDF 394 kb)

About this article

Cite this article

Grabherr, M., Haas, B., Yassour, M. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29, 644–652 (2011). https://doi.org/10.1038/nbt.1883
