Comparative Study
Genome Res. 2003 Apr;13(4):721-31.
doi: 10.1101/gr.926603. Epub 2003 Mar 12.

LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA

Michael Brudno et al. Genome Res. 2003 Apr.

Abstract

To compare entire genomes from different species, biologists increasingly need alignment methods that are efficient enough to handle long sequences, and accurate enough to correctly align the conserved biological features between distant species. We present LAGAN, a system for rapid global alignment of two homologous genomic sequences, and Multi-LAGAN, a system for multiple global alignment of genomic sequences. We tested our systems on a data set consisting of greater than 12 Mb of high-quality sequence from 12 vertebrate species. All the sequence was derived from the genomic region orthologous to an approximately 1.5-Mb region on human chromosome 7q31.3. We found that both LAGAN and Multi-LAGAN compare favorably with other leading alignment methods in correctly aligning protein-coding exons, especially between distant homologs such as human and chicken, or human and fugu. Multi-LAGAN produced the most accurate alignments, while requiring just 75 minutes on a personal computer to obtain the multiple alignment of all 12 sequences. Multi-LAGAN is a practical method for generating multiple alignments of long genomic sequences at any evolutionary distance. Our systems are publicly available at http://lagan.stanford.edu.

Figures

Figure 1.
The LAGAN algorithm. (A) A global alignment between two sequences is a path between the top-left and the bottom-right corner of their alignment matrix. (B) LAGAN first finds all local alignments between the two sequences. (C) LAGAN computes a maximal-scoring ordered subset of the alignments, the anchors, and puts together a rough global map. (D) LAGAN limits the search for an optimal alignment to the area included in the boxes and around the anchors, and computes the optimal Needleman-Wunsch alignment limited to that area. LAGAN uses memory proportional to the area of the largest box plus the memory to hold the optimal alignment.
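Step (C) of the caption, choosing a maximal-scoring ordered subset of local alignments, can be illustrated with a small chaining routine. This is a minimal quadratic-time sketch, not LAGAN's actual implementation: the `LocalAlignment` record, its field names, and the strict-ordering rule (each hit must end before the next one starts in both sequences) are assumptions made for illustration.

```python
from typing import List, NamedTuple


class LocalAlignment(NamedTuple):
    """Hypothetical record for one local alignment (inclusive coordinates)."""
    start1: int   # start in sequence 1
    end1: int     # end in sequence 1
    start2: int   # start in sequence 2
    end2: int     # end in sequence 2
    score: float


def chain_anchors(hits: List[LocalAlignment]) -> List[LocalAlignment]:
    """Return a maximal-scoring chain of hits ordered in both sequences."""
    if not hits:
        return []
    # Any valid predecessor ends before its successor starts in sequence 1,
    # so sorting by start1 places every predecessor earlier in the list.
    hits = sorted(hits, key=lambda h: (h.start1, h.start2))
    best = [h.score for h in hits]   # best chain score ending at each hit
    prev = [-1] * len(hits)          # back-pointer to the previous hit
    for k, h in enumerate(hits):
        for p in range(k):
            q = hits[p]
            compatible = q.end1 < h.start1 and q.end2 < h.start2
            if compatible and best[p] + h.score > best[k]:
                best[k] = best[p] + h.score
                prev[k] = p
    # Trace back from the highest-scoring chain end.
    k = max(range(len(hits)), key=best.__getitem__)
    chain = []
    while k != -1:
        chain.append(hits[k])
        k = prev[k]
    return chain[::-1]
```

In this sketch a high-scoring off-diagonal hit is dropped whenever keeping it would exclude a set of collinear hits with a larger combined score.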
Figure 2.
Visualization of a multiple alignment using VISTA. (A) MLAGAN alignments can be visualized using VISTA, if they are projected to pairwise alignments with respect to one reference sequence. This plot shows the conservation between human and chimpanzee, cow, mouse, and fugu around the first intron of the cMet gene. The human/chimpanzee conservation is uniformly very high; human/cow and human/mouse show varying levels of conservation. The human/chicken alignment also shows some conservation in the non-coding areas. The human/fugu alignment shows conservation only within the first coding exon, and to a lesser degree within the regions upstream and downstream of that exon. (B) First introns of cMet, comparison of CLUSTALW and MLAGAN alignments. We compared the alignment generated by LAGAN and CLUSTALW for the first intron of the cMet gene in eight mammalian sequences (human, baboon, cat, dog, cow, pig, mouse, and rat). The alignments between all of the species except rodents were similar. VISTA plots of the projections to human and mouse are shown. CLUSTALW (top) misaligned the mouse sequence around 4 Kb and 10 Kb, whereas MLAGAN (bottom) found significant conservation in these regions.
Figure 3.
Multiple alignment of a misannotated exon of the cMet gene. According to the MLAGAN alignment, the RefSeq-annotated start position of the exon is not well conserved among the mammals. There is also a large gap (∼250 positions) in the alignment after the first 20 nucleotides of this exon, generated by sequence in cat that has no clear homolog in any of the other species aligned. However, a consensus splice acceptor site and the start position of an Acembly gene prediction exon (exon 9 in Met.f and Met.e, exon 10 in Met.a and Met.h), beginning 54 nucleotides downstream of the RefSeq exon 10, are neatly conserved across all the mammals studied. Additionally, no human ESTs correspond to the RefSeq annotation, whereas all ESTs (as annotated in the UCSC genome browser) align precisely with the Acembly gene prediction exon.
Figure 4.
Limited area of dynamic programming around an anchor. An anchor between (i, j) and (i′, j′) in the rough global alignment limits the search area of Needleman-Wunsch. The alignment is required to pass through the diagonal from (i − r, j + r) to (i + r, j − r), stay within the diagonals from (i − r, j + r) to (i′ − r, j′ + r) and from (i + r, j − r) to (i′ + r, j′ − r), and exit through the diagonal from (i′ − r, j′ + r) to (i′ + r, j′ − r).
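The region described in this caption can be written as a membership test on matrix cells. The sketch below assumes a gap-free anchor, meaning the anchor spans the same length in both sequences, so that the two side boundaries are parallel to the main diagonal. The function name `in_search_area` and the tuple layout of `anchor` are hypothetical, and the code only tests membership; it does not run Needleman-Wunsch.

```python
from typing import Tuple


def in_search_area(x: int, y: int, anchor: Tuple[int, int, int, int], r: int) -> bool:
    """Check whether cell (x, y) lies in the restricted area around an anchor.

    anchor = (i, j, i2, j2) runs from (i, j) to (i2, j2); r is the band radius.
    Assumption: the anchor is gap-free, i.e. i2 - i == j2 - j.
    """
    i, j, i2, j2 = anchor
    if i2 - i != j2 - j:
        raise ValueError("this sketch assumes a gap-free (diagonal) anchor")
    # Entry boundary: the anti-diagonal x + y = i + j through (i - r, j + r) and (i + r, j - r).
    after_entry = x + y >= i + j
    # Exit boundary: the anti-diagonal x + y = i2 + j2 through the end of the anchor.
    before_exit = x + y <= i2 + j2
    # Side boundaries: the corner (i - r, j + r) is offset by 2r in x - y from the anchor diagonal.
    within_band = abs((x - y) - (i - j)) <= 2 * r
    return after_entry and before_exit and within_band
```

A dynamic-programming fill restricted to cells passing this test visits roughly 4r + 1 cells per anti-diagonal instead of the full matrix width.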
Figure 5.
Generation of anchors during progressive alignment. Multi-sequence X/Y is aligned to sequence Z. Anchors between X and Z (top) and anchors between Y and Z (middle) are remapped to coordinates in the X/Y multi-sequence, and given a new score. Then, the Longest Increasing Subsequence algorithm is applied to select a subset of the remapped anchors, as the anchors between X/Y and Z.
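The selection step named in this caption can be sketched with the classic O(n log n) Longest Increasing Subsequence routine. This illustration makes several simplifying assumptions: each remapped anchor is reduced to a hypothetical `(pos_in_xy, pos_in_z)` point, the new anchor scores mentioned in the caption are ignored (the subset is chosen by length only), and the remapping itself is not shown.

```python
from bisect import bisect_left
from typing import List, Tuple


def select_increasing_anchors(anchors: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Pick a longest subset of anchors strictly increasing in both coordinates."""
    # Sort by X/Y position; ties use descending Z so that two anchors sharing
    # an X/Y position can never both be selected.
    ordered = sorted(anchors, key=lambda a: (a[0], -a[1]))
    tails: List[int] = []      # tails[k]: smallest Z ending an increasing run of length k + 1
    tail_idx: List[int] = []   # index in `ordered` of the anchor stored at tails[k]
    prev = [-1] * len(ordered)
    for idx, (_, z) in enumerate(ordered):
        k = bisect_left(tails, z)
        if k == len(tails):
            tails.append(z)
            tail_idx.append(idx)
        else:
            tails[k] = z
            tail_idx[k] = idx
        # The predecessor is whichever anchor currently ends the run one shorter.
        prev[idx] = tail_idx[k - 1] if k > 0 else -1
    # Walk the back-pointers from the end of the longest run.
    chain = []
    idx = tail_idx[-1] if tail_idx else -1
    while idx != -1:
        chain.append(ordered[idx])
        idx = prev[idx]
    return chain[::-1]
```

A score-aware variant would maximize the total anchor score rather than the anchor count, at the cost of a slightly more involved data structure.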
Figure 6.
The stacking effect. (A) The correct multiple alignment of seven sequences. (B) The multiple alignment of the same seven sequences, with the stacking effect. When only gap-open and gap-continue penalties are used, the stacked alignment (B) incurs one gap-open penalty less than the correct alignment (A), and therefore is optimal. When gap-end penalties are used, the correct alignment (A) is optimal because it incurs two gap-end penalties less than the stacked alignment (B).
Figure 7.
Part of a multiple alignment between eight mammalian sequences, before and after iterative refinement. Refinement improves the overall alignment by correctly placing the two T’s in the fifth and sixth sequence, and changing a 1-bp deletion in sequence 5 into an insertion in sequence 6.
