Comparative Study
PLoS One. 2011 Mar 14;6(3):e17915.
doi: 10.1371/journal.pone.0017915.

A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies

Wenyu Zhang et al. PLoS One. 2011.

Abstract

The advent of next-generation sequencing technologies has been accompanied by the development of many whole-genome sequence assembly methods and software tools, especially for de novo fragment assembly. Because little is known about the applicability and performance of these tools, choosing a suitable assembler is difficult. Here, we describe the applicability of each program and, above all, compare the performance of eight distinct tools on eight groups of simulated datasets from the Solexa sequencing platform. Considering computational time, maximum random access memory (RAM) occupancy, assembly accuracy and integrity, our study indicates that string-based assemblers and overlap-layout-consensus (OLC) assemblers are well suited to very short reads and to longer reads of small genomes, respectively. For large datasets of more than one hundred million short reads, De Bruijn graph-based assemblers would be more appropriate. In terms of software implementation, string-based assemblers are superior to graph-based ones, among which SOAPdenovo is complicated by the need to create a configuration file. Our comparison will assist researchers in selecting a well-suited assembler and offers essential information for improving existing assemblers or developing novel ones.
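
As a rough illustration of what "De Bruijn graph-based" means in the abstract above, the following minimal sketch builds a De Bruijn graph from a handful of reads. It is a toy under stated assumptions, not the method of any assembler compared in the study: the function name `de_bruijn_graph`, the example reads, and the choice of k = 3 are all hypothetical, and real assemblers add error correction, reverse complements, and graph simplification.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a toy De Bruijn graph (hypothetical helper, for illustration only).

    Each k-mer found in a read contributes one edge from its (k-1)-mer
    prefix to its (k-1)-mer suffix. Duplicate edges are collapsed.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two overlapping example reads (made up for this sketch).
reads = ["ACGTC", "CGTCA"]
graph = de_bruijn_graph(reads, 3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Because nodes are (k-1)-mers rather than whole reads, the graph size depends on the number of distinct k-mers and not on the number of reads, which is one commonly cited reason this strategy scales to very large short-read datasets.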

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Overview of de novo short-read assemblers.
Programs developed from 2005 to 2010 are classified according to their assembly strategies. There are currently four main classes of assemblers; the remaining ones are grouped under “Other Strategies”. Different box symbols distinguish assemblers designed for short reads from different platforms.
Figure 2. Computational running time and maximum memory occupancy of 36-mer short-read assembly procedures.
(A) The computational time of each assembler for the different datasets. (B) The maximum RAM used during the assembly process. No data are shown when the RAM was insufficient or the assembly tool was not suitable for the dataset.
Figure 3. Computational running time and maximum memory occupancy of 75-mer short-read assembly procedures.
(A) The computational time of each assembler for the different datasets. (B) The maximum RAM used during the assembly process. No data are shown when the RAM was insufficient or the assembly tool was not suitable for the dataset.
Figure 4. Accuracy and integrity of 36-mer dataset assemblies.
A short-read assembly is expected to yield accurate contigs with high genome coverage. Here, the quality of the resulting contigs is shown as (A) the accuracy of the assembled contigs and (B) the genome coverage of the assembled contigs. No data are shown when the RAM was insufficient or the assembly tool was not suitable for the dataset.
Figure 5. Accuracy and integrity of 75-mer dataset assemblies.
A short-read assembly is expected to yield accurate contigs with high genome coverage. Here, the quality of the resulting contigs is shown as (A) the accuracy of the assembled contigs and (B) the genome coverage of the assembled contigs. No data are shown when the RAM was insufficient or the assembly tool was not suitable for the dataset.
Figure 6. Statistics for assembled contigs of 36-mer short reads.
Indicators that characterize the contig size distribution are used for the analysis. “#” denotes that the machine's RAM was insufficient, and “N/A” means that the data are not available. The N50 size and N80 size are the largest length thresholds for which all contigs at or above the threshold together cover 50% or 80% of the reference genome, respectively.
Figure 7. Statistics for assembled contigs of 75-mer short reads.
Indicators that characterize the contig size distribution are used for the analysis. “#” denotes that the machine's RAM was insufficient, and “N/A” means that the data are not available. The N50 size and N80 size are the largest length thresholds for which all contigs at or above the threshold together cover 50% or 80% of the reference genome, respectively.
Figure 8. Size comparison of datasets used in this study.
This figure compares the relative sizes of the short-read datasets, which are distinguished by different legends. SE denotes a single-end short-read dataset, and PE denotes a paired-end short-read dataset.
Figure 9. Pipeline for evaluation of short-read assembly programs.
Four reference genomes of different sizes are used to generate short reads containing base errors. The performance of the assemblers is evaluated by computational time, accuracy, integrity, contig size, and other measures.
Figure 10. Analysis of the complexity of the reference genome sequences.
Tandem Repeats Finder (version 4.04) is used to count repeat elements shorter than 2000 bp; the parameter “minimum alignment score” is set to 70 and 150 for the two types of short reads. Increases in genome size, repeat number, and GC content may imply increasing genome assembly complexity.
Figure 11. Features of selected short-read assembly programs.
Noncommercial programs based on various assembly approaches were selected for testing on synthetic Solexa short reads. “*” indicates any operating system with a Perl interpreter, “OLC” stands for overlap-layout-consensus, and “N/A” for not available. The features of the programs were obtained from the related references and from the documentation of the latest software versions (version information is not listed here).
