Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
Comparative Study
. 2001 Aug 14;98(17):9748-53.
doi: 10.1073/pnas.171285098.

An Eulerian path approach to DNA fragment assembly

Affiliations
Comparative Study

An Eulerian path approach to DNA fragment assembly

P A Pevzner et al. Proc Natl Acad Sci U S A. .

Abstract

For the last 20 years, fragment assembly in DNA sequencing followed the "overlap-layout-consensus" paradigm that is used in all currently available assembly tools. Although this approach proved useful in assembling clones, it faces difficulties in genomic shotgun assembly. We abandon the classical "overlap-layout-consensus" approach in favor of a new euler algorithm that, for the first time, resolves the 20-year-old "repeat problem" in fragment assembly. Our main result is the reduction of the fragment assembly to a variation of the classical Eulerian path problem that allows one to generate accurate solutions of large-scale sequencing problems. euler, in contrast to the celera assembler, does not mask such repeats but uses them instead as a powerful fragment assembly tool.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Comparative analysis of euler, phrap, cap, and tigr assemblers (NM sequencing project). Every box corresponds to a contig in NM assembly produced by these programs with colored boxes corresponding to assembly errors. Boxes in the ideal assembly correspond to islands in the read coverage. Boxes of the same color show misassembled contigs; for example, two identically colored boxes in different places show the positions of contigs that were incorrectly assembled into a single contig. In some cases, a single colored box shows a contig that was assembled incorrectly (i.e., there was a rearrangement within this contig). Repeats with similarity higher than 95% are indicated by numbered boxes at the solid line showing the genome. To check the accuracy of the assembled contigs, we fit each assembled contig into the genomic sequence. Inability to fit a contig into the genomic sequence indicates that the contig is misassembled. For example, phrap misassembles 17 contigs in the NM sequencing project, each contig containing from two to four fragments from different parts of the genome.
Figure 2
Figure 2
(a) DNA sequence with a triple repeat R; (b) the layout graph; (c) construction of the de Bruijn graph by gluing repeats; (d) de Bruijn graph.
Figure 3
Figure 3
A repeat v1vn and a system of paths overlapping with this repeat. The uppermost path contains the repeat and defines the correct pairing between the corresponding entrance and exit. If this path were not present, the repeat v1vn would become a tangle.
Figure 4
Figure 4
Equivalent transformations: (a) x, y-detachment and (b) x-cut.
Figure 5
Figure 5
(a) P is consistent with Px,y1 but inconsistent with Px,y2; (b) P is inconsistent with both Px,y1 and Px,y2; (c) P is consistent with both Px,y1 and Px,y2.

Similar articles

Cited by

References

    1. Green P. Documentation for phrap. 1994. (http://www.genome.washington.edu/UWGC/analysis-tools/phrap.htm).
    1. Bonfield J K, Smith K F, Staden R. Nucleic Acids Res. 1995;23:4992–4999. - PMC - PubMed
    1. Sutton G, White O, Adams M, Kerlavage A. Genome Sci Technol. 1995;1:9–19.
    1. Kececioglu J, Myers E. Algorithmica. 1995;13:7–51.
    1. Myers E M. J Comput Biol. 1995;2:275–290. - PubMed

Publication types

LinkOut - more resources