Velvet: algorithms for de novo short read assembly using de Bruijn graphs

doi:10.1101/gr.074492.107

Genome Res. 2008 May;18(5):821-9.

doi: 10.1101/gr.074492.107. Epub 2008 Mar 18.

Daniel R Zerbino¹, Ewan Birney

Daniel R Zerbino et al. Genome Res. 2008 May.

¹ EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.

PMID: 18349386
PMCID: PMC2336801
DOI: 10.1101/gr.074492.107

Abstract

We have developed a new set of algorithms, collectively called "Velvet," to manipulate de Bruijn graphs for genomic sequence assembly. A de Bruijn graph is a compact representation based on short words (k-mers) that is ideal for high coverage, very short read (25-50 bp) data sets. Applying Velvet to very short reads and paired-ends information only, one can produce contigs of significant length, up to 50-kb N50 length in simulations of prokaryotic data and 3-kb N50 on simulated mammalian BACs. When applied to real Solexa data sets without read pairs, Velvet generated contigs of approximately 8 kb in a prokaryote and 2 kb in a mammalian BAC, in close agreement with our simulated results without read-pair information. Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies.
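The contig lengths quoted above are N50 values. As a minimal sketch, assuming the commonly used definition (the largest length L such that contigs of length at least L contain at least half of all assembled bases), N50 can be computed as follows. The function name `n50` and the sample lengths are illustrative and do not come from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if it is empty)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum past half is the N50.
        if running >= half_total:
            return length
    return 0


# Illustrative lengths only: total is 54, and 10 + 9 + 8 = 27 reaches half.
example = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example))
```

Because the statistic is weighted by length, a few long contigs dominate it, which is why N50 is reported on a log scale when comparing assemblies.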

Figures

**Figure 1.**
Schematic representation of our implementation of the de Bruijn graph. Each node, represented by a single rectangle, represents a series of overlapping k-mers (in this case, k = 5), listed directly *above* or *below*. (Red) The last nucleotide of each k-mer. The sequence of those final nucleotides, copied in large letters in the rectangle, is the sequence of the node. The twin node, directly attached to the node, either *below* or *above*, represents the reverse series of reverse complement k-mers. Arcs are represented as arrows *between* nodes. The last k-mer of an arc’s origin overlaps with the first of its destination. Each arc has a symmetric arc. Note that the two nodes on the *left* could be merged into one without loss of information, because they form a chain.
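The relationships described in the Figure 1 caption can be illustrated with a small sketch. It is not Velvet's implementation: it works at the level of individual k-mers, does not merge chains, and all function names (`kmers`, `node_sequence`, `twin_series`, `arcs_from_reads`) are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Series of overlapping k-mers read along seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def node_sequence(kmer_series):
    """Sequence of a node: the last nucleotide of each k-mer in its series."""
    return "".join(kmer[-1] for kmer in kmer_series)


def twin_series(kmer_series):
    """Twin node: the reverse series of reverse-complement k-mers."""
    return [reverse_complement(kmer) for kmer in reversed(kmer_series)]


def arcs_from_reads(reads, k):
    """Arcs between consecutive k-mers, each stored with its symmetric arc."""
    arcs = set()
    for read in reads:
        series = kmers(read, k)
        for origin, destination in zip(series, series[1:]):
            arcs.add((origin, destination))
            # Symmetric arc: from the twin of the destination to the twin of the origin.
            arcs.add((reverse_complement(destination), reverse_complement(origin)))
    return arcs


series = kmers("ACGTAG", 5)          # ['ACGTA', 'CGTAG']
print(node_sequence(series))         # last nucleotides of the two k-mers
print(twin_series(series))           # reverse-complement k-mers, reversed order
print(sorted(arcs_from_reads(["ACGTAG"], 5)))
```

Storing only the final nucleotide of each k-mer is what keeps a node compact: consecutive k-mers overlap by k - 1 bases, so the remaining bases are already implied by the preceding k-mer.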

**Figure 2.**
Example of Tour Bus error correction. (A) Original graph. (B) The search starts from A and spreads toward the *right*. The progression of the *top* path (through B′ and C′) is stopped because D was previously visited. The nucleotide sequences corresponding to the alternate paths B′C′ and BC are extracted from the graph, aligned, and compared. (C) The two paths are judged similar, so the longer one, B′C′, is merged into the shorter one, BC. The merging is directed by the alignment of the consensus sequences, indicated in red lines in B. Note that node X, which was connected to node B′, is now connected to node B. The search progresses, and the *bottom* path (through C′ and D′) arrives second in E. Once again, the corresponding paths, C′D′ and CD are compared. (D) CD and C′D′ are judged similar enough. The longer path is merged into the shorter one.
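The search described in the Figure 2 caption can be approximated by the following simplified sketch. It uses a plain breadth-first search and only reports which path would be merged into which; it does not modify the graph. The similarity measure (one minus normalized edit distance), the 0.8 threshold, the use of sequence length to decide which path is "longer", and the names `find_bubbles` and `edit_distance` are all assumptions made for illustration, not the criteria used by Velvet.

```python
from collections import deque


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        new_row = [i]
        for j, cb in enumerate(b, 1):
            new_row.append(min(row[j] + 1, new_row[j - 1] + 1, row[j - 1] + (ca != cb)))
        row = new_row
    return row[-1]


def path_from_start(node, previous):
    path = [node]
    while previous[path[-1]] is not None:
        path.append(previous[path[-1]])
    return path[::-1]


def find_bubbles(arcs, sequences, start, min_similarity=0.8):
    """Report pairs of similar alternate paths that converge on a visited node."""
    previous = {start: None}
    queue = deque([start])
    bubbles = []
    while queue:
        node = queue.popleft()
        for successor in arcs.get(node, []):
            if successor not in previous:
                previous[successor] = node
                queue.append(successor)
                continue
            # successor was reached before: compare the two converging paths.
            earlier = path_from_start(successor, previous)
            later = path_from_start(node, previous) + [successor]
            if successor in later[:-1] or earlier == later:
                continue  # cycle or duplicate arc, not a bubble
            shared = 0
            while earlier[shared] == later[shared]:
                shared += 1
            branches = [earlier[shared:-1], later[shared:-1]]
            seqs = ["".join(sequences[n] for n in branch) for branch in branches]
            longest = max(len(seqs[0]), len(seqs[1]), 1)
            if 1 - edit_distance(*seqs) / longest >= min_similarity:
                # Longer path is merged into the shorter one (ties keep the first arrival).
                keep, merge = sorted(branches, key=lambda p: sum(len(sequences[n]) for n in p))
                bubbles.append({"keep": keep, "merge": merge})
    return bubbles


arcs = {"A": ["B", "B'"], "B": ["C"], "B'": ["C'"], "C": ["D"], "C'": ["D"]}
sequences = {"A": "GG", "B": "ACGT", "C": "TTGA", "B'": "ACGT", "C'": "TTGCA", "D": "CC"}
print(find_bubbles(arcs, sequences, "A"))
```

In this toy graph the path through B′ and C′ carries one extra base, so it is the one flagged for merging into BC, mirroring panel C of the figure.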

**Figure 3.**
Simulations of Tour Bus. The genome of *E. coli* and 5-Mb samples of DNA from three other species (*S. cerevisiae*, *C. elegans*, and *H. sapiens*, respectively) were used to generate 35-bp read sets of varying read depths (X-axis of each plot). We measured the contig length N50 (Y-axis, log scale) after tip-clipping (black curve) then after the subsequent bubble smoothing (red curve). In the first column are the results for perfect, error-free reads. In the second column, we inserted errors in the reads at a rate of 1%. In the third column, we generated a slightly variant genome from the original by inserting random SNPs at a rate of 1 in 500. The reads were then generated with errors from both variants, thus simulating a diploid assembly.
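A read set of the kind described in the Figure 3 caption could be generated along the following lines. This is a sketch under stated assumptions: read start positions are drawn uniformly, errors are substitutions only, and only one strand is sampled. The caption does not specify these details, and the function name `simulate_reads` and its parameters are hypothetical.

```python
import random


def simulate_reads(genome, read_length=35, depth=10, error_rate=0.01, seed=0):
    """Sample fixed-length reads uniformly from genome, with substitution errors."""
    rng = random.Random(seed)
    # Read depth = (number of reads * read length) / genome length.
    n_reads = depth * len(genome) // read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        read = list(genome[start:start + read_length])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Replace the base with one of the three other nucleotides.
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    return reads


# Toy 1-kb random genome standing in for the real sequences used in the paper.
toy_rng = random.Random(42)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(1000))
reads = simulate_reads(toy_genome, depth=10, error_rate=0.01)
print(len(reads), len(reads[0]))
```

Varying `depth` while holding the other parameters fixed corresponds to moving along the X-axis of each plot.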

**Figure 4.**
Effect of coverage on contig length with experimental *Streptococcus* data.

**Figure 5.**
Breadcrumb algorithm. Two long contigs produced after error correction, A and B, are joined by several paired reads (red and blue arcs). The path between the two can be broken up because of a repeat internal to the connecting sequence, because of an overlap with a distinct part of the genome, or because of some unresolved errors. The small square nodes represent either nodes of the path between A and B, or other nodes of the graph connected to the former. Finding the exact path in the graph from A to B is not straightforward because of all the alternate paths that need to be explored. However, if we mark all the nodes that are paired up to either A or B (with a blue circle), we can define a subgraph much simpler to explore. Ideally, only a linear path connects both nodes.
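The idea in the Figure 5 caption, restricting the search between A and B to marked nodes, can be sketched as below. This is a simplified illustration that assumes the set of marked nodes has already been computed from the read pairs and that gives up as soon as the walk becomes ambiguous. The name `breadcrumb_path` and the toy graph are hypothetical.

```python
def breadcrumb_path(arcs, marked, start, end):
    """Walk from start to end through marked nodes only.

    Returns the path if each step has exactly one admissible successor,
    otherwise None (ambiguous, dead end, or loop).
    """
    path = [start]
    seen = {start}
    current = start
    while current != end:
        # Only successors that are marked, or the target itself, are admissible.
        candidates = {n for n in arcs.get(current, []) if n == end or n in marked}
        if len(candidates) != 1:
            return None
        current = candidates.pop()
        if current in seen:
            return None
        seen.add(current)
        path.append(current)
    return path


# Toy graph: n1 and n2 lie between A and B; x1, x2, x3 belong to other regions.
arcs = {"A": ["n1", "x1"], "n1": ["n2", "x2"], "n2": ["B"], "x1": ["x3"], "x2": ["n2"]}
print(breadcrumb_path(arcs, {"n1", "n2"}, "A", "B"))
print(breadcrumb_path(arcs, {"n1", "n2", "x1"}, "A", "B"))
```

With only n1 and n2 marked, the unmarked branches are ignored and a single linear path remains; marking x1 as well reintroduces a choice at A, and the sketch declines to pick one.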

**Figure 6.**
Breadcrumb performance on simulated data sets. As in Figure 3, we sampled 5-Mb DNA sequences from four different species (*E. coli*, *S. cerevisiae*, *C. elegans*, and *H. sapiens*, respectively) and generated 50× read sets. The horizontal lines represent the N50 reached at the end of Tour Bus (see Fig. 3) (broken black line) and after applying a 4× coverage cutoff (broken red line). Note how the difference in N50 between the graph of perfect reads and that of erroneous reads is significantly reduced by this last cutoff. (Black curves) The results after the basic Breadcrumb algorithm; (red curves) the results after super-contigging.

Grants and funding

G0300762/MRC_/Medical Research Council/United Kingdom
