A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies

doi:10.1371/journal.pone.0017915

Comparative Study

PLoS One. 2011 Mar 14;6(3):e17915.

Wenyu Zhang¹, Jiajia Chen, Yang Yang, Yifei Tang, Jing Shang, Bairong Shen

Wenyu Zhang et al. PLoS One. 2011.

¹ Center for Systems Biology, Soochow University, Suzhou, Jiangsu, China.

PMID: 21423806
PMCID: PMC3056720
DOI: 10.1371/journal.pone.0017915

Abstract

The advent of next-generation sequencing technologies has been accompanied by the development of many whole-genome sequence assembly methods and software tools, especially for de novo fragment assembly. Because little is known about the applicability and performance of these tools, choosing a suitable assembler is a difficult task. Here, we describe the applicability of each program and, above all, compare the performance of eight distinct tools on eight groups of simulated datasets from the Solexa sequencing platform. Considering computational time, maximum random access memory (RAM) occupancy, assembly accuracy and integrity, our study indicates that string-based assemblers and overlap-layout-consensus (OLC) assemblers are well suited to very short reads and to longer reads of small genomes, respectively. For large datasets of more than a hundred million short reads, De Bruijn graph-based assemblers would be more appropriate. In terms of software implementation, string-based assemblers are superior to graph-based ones, among which SOAPdenovo requires a complex configuration file. Our comparison will assist researchers in selecting a well-suited assembler and offers essential information for improving existing assemblers or developing novel ones.
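To make the term "De Bruijn graph-based" concrete, the following is a minimal sketch of how such a graph can be built from reads. It is an illustration only, not the implementation of any assembler compared in the study: the function name `de_bruijn_graph`, the toy reads and the choice of k = 3 are all assumptions, and real assemblers add error correction, reverse complements and graph simplification.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a De Bruijn graph: nodes are (k-1)-mers, edges come from k-mers.

    Hypothetical helper for illustration; ignores reverse complements,
    sequencing errors and k-mer multiplicities.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its (k-1)-prefix to its (k-1)-suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Toy reads (made up) that overlap on "GTC".
graph = de_bruijn_graph(["ACGTC", "GTCA"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Because the graph is indexed by k-mers rather than by reads, its size grows with the number of distinct k-mers instead of the number of reads, which is one common explanation for why this strategy scales to very large read sets.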

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Overview of *de novo* short reads assemblers.**
Programs developed from 2005 to 2010 are classified by assembly strategy. There are four main classes of assemblers; the remaining ones are grouped under “Other Strategies”. Different box symbols distinguish assemblers designed for short reads from different platforms.

**Figure 2. Computational running time and maximum memory occupancy of 36-mer short reads assembly procedures.**
(A) Computational time of each assembler on the different datasets. (B) Maximum RAM used during the assembly process. No data are shown when the RAM was insufficient or the assembly tool was not suitable for the dataset.

**Figure 3. Computational running time and maximum memory occupancy of 75-mer short reads assembly procedures.**
(A) Computational time of each assembler on the different datasets. (B) Maximum RAM used during the assembly process. No data are shown when the RAM was insufficient or the assembly tool was not suitable for the dataset.

**Figure 4. Accuracy and integrity for 36-mer datasets assembly.**
Short-read assembly is expected to produce accurate contigs with high genome coverage. The quality of the resulting contigs is shown as (A) the accuracy of the assembled contigs and (B) the genome coverage of the assembled contigs. No data are shown when the RAM was insufficient or the assembly tool was not suitable for the dataset.

**Figure 5. Accuracy and integrity for 75-mer datasets assembly.**
Short-read assembly is expected to produce accurate contigs with high genome coverage. The quality of the resulting contigs is shown as (A) the accuracy of the assembled contigs and (B) the genome coverage of the assembled contigs. No data are shown when the RAM was insufficient or the assembly tool was not suitable for the dataset.

**Figure 6. Statistics for assembled contigs of 36-mer short reads.**
Indicators that describe the contig size distribution are used for the analysis. “#” denotes that the machine's RAM was insufficient, and “N/A” means that the data are not available. The N50 and N80 sizes are the largest contig length for which all contigs of at least that length together cover 50% or 80% of the reference genome.

**Figure 7. Statistics for assembled contigs of 75-mer short reads.**
Indicators that describe the contig size distribution are used for the analysis. “#” denotes that the machine's RAM was insufficient, and “N/A” means that the data are not available. The N50 and N80 sizes are the largest contig length for which all contigs of at least that length together cover 50% or 80% of the reference genome.
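The N50 and N80 definition used in the captions for Figures 6 and 7 can be sketched in a few lines, reading the threshold as a contig length measured against the reference genome size. This is a minimal sketch under that reading, not the authors' script; the function name `n_stat` and the contig lengths are made up for illustration.

```python
from typing import Optional, Sequence


def n_stat(contig_lengths: Sequence[int], genome_size: int,
           fraction: float) -> Optional[int]:
    """Largest length L such that contigs of length >= L together span at
    least `fraction` of `genome_size`; None if the contigs never reach it.

    Hypothetical helper: the target is a share of the reference genome,
    as in the caption, not a share of the total assembly length.
    """
    target = fraction * genome_size
    covered = 0
    # Walk contigs from longest to shortest, accumulating their lengths.
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return None


contigs = [400, 300, 200, 100]    # made-up contig lengths (bp)
n50 = n_stat(contigs, 1000, 0.5)  # 400 + 300 reaches 500 bp
n80 = n_stat(contigs, 1000, 0.8)  # 400 + 300 + 200 reaches 800 bp
print(n50, n80)
```

Returning `None` when the contigs fall short of the target keeps an incomplete assembly from being reported with a misleading value.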

**Figure 8. Size comparison of datasets used in this study.**
The relative sizes of the short-read datasets are shown, with a separate legend entry for each. SE denotes a single-end short-read dataset, while PE denotes a paired-end short-read dataset.

**Figure 9. Pipeline for evaluation of short reads assembly programs.**
Four reference genomes of different sizes are used to generate short reads containing base errors. The performance of the assemblers is evaluated by computational time, accuracy, integrity, contig size, etc.
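The first step of this pipeline, generating error-bearing short reads from a reference, might look like the sketch below. The record does not say which simulator or error model the authors used, so everything here is an assumption: the function name `simulate_reads`, uniform read positions, single-end reads, substitution-only errors at a fixed per-base rate, and a random stand-in for the reference genome.

```python
import random


def simulate_reads(reference, read_length, coverage, error_rate, seed=0):
    """Sample fixed-length single-end reads uniformly from `reference`
    and substitute each base with probability `error_rate`.

    Hypothetical helper; real simulators also model indels, quality
    scores, paired ends and position-dependent error rates.
    """
    rng = random.Random(seed)
    # Number of reads needed to reach the requested average coverage.
    n_reads = int(coverage * len(reference) / read_length)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_length + 1)
        read = list(reference[start:start + read_length])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Replace the base with one of the three other bases.
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    return reads


# Stand-in reference: 1000 random bases (not one of the study's genomes).
ref_rng = random.Random(1)
reference = "".join(ref_rng.choice("ACGT") for _ in range(1000))
reads = simulate_reads(reference, read_length=36, coverage=10, error_rate=0.01)
print(len(reads), reads[0])
```

Fixing the seed makes the simulated dataset reproducible, so that every assembler can be run on exactly the same reads.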

**Figure 10. Analysis of complexity of reference genome sequences.**
Tandem Repeats Finder (version 4.04) is used to count repeat elements shorter than 2000 bp; the “minimum alignment score” parameter is set to 70 and 150 for the two types of short reads. Increases in genome size, repeat number and GC content may imply increasing genome assembly complexity.

**Figure 11. Features of selected short reads assembly programs.**
Noncommercial programs based on various assembly approaches were selected for testing on synthetic Solexa short reads. “*” indicates any operating system with a Perl interpreter, “OLC” stands for overlap-layout-consensus, and “N/A” for not available. The features of each program were obtained from the related references and the documentation of the latest software version (version information is not listed here).

Cited by

Unitig-centered pan-genome machine learning approach for predicting antibiotic resistance and discovering novel resistance genes in bacterial strains.
Do DT, Yang MR, Vo TNS, Le NQK, Wu YW. Comput Struct Biotechnol J. 2024 Apr 16;23:1864-1876. doi: 10.1016/j.csbj.2024.04.035. eCollection 2024 Dec. PMID: 38707536. Free PMC article.
Eukaryotic genomes from a global metagenomic data set illuminate trophic modes and biogeography of ocean plankton.
Alexander H, Hu SK, Krinos AI, Pachiadaki M, Tully BJ, Neely CJ, Reiter T. mBio. 2023 Dec 19;14(6):e0167623. doi: 10.1128/mbio.01676-23. Epub 2023 Nov 10. PMID: 37947402. Free PMC article.
Using a combination of short- and long-read sequencing to investigate the diversity in plasmid- and chromosomally encoded extended-spectrum beta-lactamases (ESBLs) in clinical Shigella and Salmonella isolates in Belgium.
Berbers B, Vanneste K, Roosens NHCJ, Marchal K, Ceyssens PJ, De Keersmaecker SCJ. Microb Genom. 2023 Jan;9(1):mgen000925. doi: 10.1099/mgen.0.000925. PMID: 36748573. Free PMC article.
Tool evaluation for the detection of variably sized indels from next generation whole genome and targeted sequencing data.
Wang N, Lysenkov V, Orte K, Kairisto V, Aakko J, Khan S, Elo LL. PLoS Comput Biol. 2022 Feb 17;18(2):e1009269. doi: 10.1371/journal.pcbi.1009269. eCollection 2022 Feb. PMID: 35176018. Free PMC article.
Genome-wide transcriptome analysis of the early developmental stages of Echinococcus granulosus protoscoleces reveals extensive alternative splicing events in the spliceosome pathway.
Mohammadi MA, Harandi MF, McManus DP, Mansouri M. Parasit Vectors. 2021 Nov 12;14(1):574. doi: 10.1186/s13071-021-05067-9. PMID: 34772444. Free PMC article.
