LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA

doi:10.1101/gr.926603

Comparative Study

Genome Res. 2003 Apr;13(4):721-31.

doi: 10.1101/gr.926603. Epub 2003 Mar 12.

Michael Brudno¹, Chuong B Do, Gregory M Cooper, Michael F Kim, Eugene Davydov; NISC Comparative Sequencing Program; Eric D Green, Arend Sidow, Serafim Batzoglou

PMID: 12654723
PMCID: PMC430158
DOI: 10.1101/gr.926603

Michael Brudno et al. Genome Res. 2003 Apr.

Affiliation

¹ Department of Computer Science, Stanford University, Stanford, California 94305-9010, USA.

Abstract

To compare entire genomes from different species, biologists increasingly need alignment methods that are efficient enough to handle long sequences, and accurate enough to correctly align the conserved biological features between distant species. We present LAGAN, a system for rapid global alignment of two homologous genomic sequences, and Multi-LAGAN, a system for multiple global alignment of genomic sequences. We tested our systems on a data set consisting of greater than 12 Mb of high-quality sequence from 12 vertebrate species. All the sequence was derived from the genomic region orthologous to an approximately 1.5-Mb region on human chromosome 7q31.3. We found that both LAGAN and Multi-LAGAN compare favorably with other leading alignment methods in correctly aligning protein-coding exons, especially between distant homologs such as human and chicken, or human and fugu. Multi-LAGAN produced the most accurate alignments, while requiring just 75 minutes on a personal computer to obtain the multiple alignment of all 12 sequences. Multi-LAGAN is a practical method for generating multiple alignments of long genomic sequences at any evolutionary distance. Our systems are publicly available at http://lagan.stanford.edu.

Figures

**Figure 1.**
The LAGAN algorithm. (A) A global alignment between two sequences is a path between the top-left and the bottom-right corner of their alignment matrix. (B) LAGAN first finds all local alignments between the two sequences. (C) LAGAN computes a maximal-scoring ordered subset of the alignments, the *anchors*, and puts together a rough global map. (D) LAGAN limits the search for an optimal alignment to the area included in the boxes and around the anchors, and computes the optimal Needleman-Wunsch alignment limited to that area. LAGAN uses memory proportional to the area of the largest box plus the memory to hold the optimal alignment.
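
Step (C) of the caption above, selecting a maximal-scoring ordered subset of local alignments, can be illustrated with a simple quadratic-time chaining sketch. This is not the LAGAN implementation: the `LocalAlignment` record, the `chain_anchors` name, and the strict "ends before the next one starts" ordering rule are assumptions made for illustration only.

```python
from typing import List, NamedTuple


class LocalAlignment(NamedTuple):
    """Hypothetical record for one local alignment (coordinates are inclusive)."""
    x_start: int
    y_start: int
    x_end: int
    y_end: int
    score: float


def chain_anchors(hits: List[LocalAlignment]) -> List[LocalAlignment]:
    """Return a maximal-scoring chain of hits that is ordered in both sequences.

    Assumption: hit b may follow hit a only if a ends strictly before b
    starts in both sequences. Runs in O(n^2) time for n hits.
    """
    if not hits:
        return []
    # Sorting by start guarantees every valid predecessor appears earlier.
    hits = sorted(hits, key=lambda h: (h.x_start, h.y_start))
    n = len(hits)
    best = [h.score for h in hits]  # best[i]: top chain score ending at hit i
    prev = [-1] * n                 # prev[i]: predecessor of hit i in that chain
    for i in range(n):
        for j in range(i):
            compatible = (hits[j].x_end < hits[i].x_start
                          and hits[j].y_end < hits[i].y_start)
            if compatible and best[j] + hits[i].score > best[i]:
                best[i] = best[j] + hits[i].score
                prev[i] = j
    # Trace back from the hit that ends the highest-scoring chain.
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(hits[i])
        i = prev[i]
    return chain[::-1]
```

The returned chain plays the role of the anchors in the rough global map; a production tool would likely use a faster sparse chaining method than this quadratic loop.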

**Figure 2.**
Visualization of a multiple alignment using VISTA. (A) MLAGAN alignments can be visualized using VISTA, if they are projected to pairwise alignments with respect to one reference sequence. This plot shows the conservation between human and chimpanzee, cow, mouse, and fugu around the first intron of the *cMet* gene. The human/chimpanzee conservation is uniformly very high; human/cow and human/mouse show varying levels of conservation. The human/chicken alignment also shows some conservation in the non-coding areas. The human/fugu alignment shows conservation only within the first coding exon, and to a lesser degree within the regions upstream and downstream of that exon. (B) First introns of *cMet*, comparison of CLUSTALW and MLAGAN alignments. We compared the alignment generated by LAGAN and CLUSTALW for the first intron of the *cMet* gene in eight mammalian sequences (human, baboon, cat, dog, cow, pig, mouse, and rat). The alignments between all of the species except rodents were similar. VISTA plots of the projections to human and mouse are shown. CLUSTALW (*top*) misaligned the mouse sequence around 4 Kb and 10 Kb, whereas MLAGAN (*bottom*) found significant conservation in these regions.

**Figure 3.**
Multiple alignment of a misannotated exon of the *cMet* gene. According to the MLAGAN alignment, the RefSeq-annotated start position of the exon is not well conserved among the mammals. There is also a large gap (∼250 positions) in the alignment after the first 20 nucleotides of this exon, generated by sequence in cat that has no clear homolog in any of the other species aligned. However, a consensus splice acceptor site and the start position of an Acembly gene prediction exon (exon 9 in *Met.f* and *Met.e*, exon 10 in *Met.a* and *Met.h*), beginning 54 nucleotides downstream of the RefSeq exon 10, is neatly conserved across all the mammals studied. Additionally, no human ESTs correspond to the RefSeq annotation, whereas all ESTs (as annotated in the UCSC genome browser) align precisely with the Acembly gene prediction exon.

**Figure 4.**
Limited area of dynamic programming around an anchor. An anchor between (i, j) and (i′, j′) in the rough global alignment limits the search area of Needleman-Wunsch. The alignment is required to pass through the diagonal from (i − r, j + r) to (i + r, j − r), stay within the diagonals from (i − r, j + r) to (i′ − r, j′ + r) and from (i + r, j − r) to (i′ + r, j′ − r), and exit through the diagonal from (i′ − r, j′ + r) to (i′ + r, j′ − r).
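
The restricted region in this caption can be expressed as a membership test on matrix cells. The sketch below is a simplification, not the authors' code: it assumes a gap-free anchor, so that i′ − i = j′ − j and the two side boundaries are true diagonals. The function name `in_anchor_band` and its parameters are hypothetical.

```python
def in_anchor_band(x: int, y: int, i: int, j: int,
                   i2: int, j2: int, r: int) -> bool:
    """Return True if cell (x, y) lies in the band around the anchor (i, j)-(i2, j2).

    Assumption: the anchor is gap-free (i2 - i == j2 - j), so it runs along
    the diagonal x - y == i - j.
    """
    if i2 - i != j2 - j:
        raise ValueError("this sketch only handles gap-free anchors")
    # Side boundaries: the diagonals through (i - r, j + r) and (i + r, j - r)
    # are offset by 2r from the anchor's own diagonal.
    within_sides = abs((x - y) - (i - j)) <= 2 * r
    # Entry and exit boundaries: the anti-diagonals through the anchor ends.
    within_ends = (i + j) <= (x + y) <= (i2 + j2)
    return within_sides and within_ends
```

For an anchor from (10, 10) to (20, 20) with r = 2, the corner cells (8, 12) and (22, 18) fall on the boundary and are accepted, while (15, 25) lies outside the band.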

**Figure 5.**
Generation of anchors during progressive alignment. Multi-sequence X/Y is aligned to sequence Z. Anchors between X and Z (*top*) and anchors between Y and Z (*middle*) are remapped to coordinates in the X/Y multi-sequence, and given a new score. Then, the Longest Increasing Subsequence algorithm is applied to select a subset of the remapped anchors, as the anchors between X/Y and Z.
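
The selection step named in this caption, a Longest Increasing Subsequence over remapped anchors, can be sketched as follows. This is the classic unweighted O(n log n) algorithm and is only illustrative: it treats each anchor as a single (position in X/Y, position in Z) point and ignores the new anchor scores that the caption mentions, both of which are simplifying assumptions.

```python
from bisect import bisect_left
from typing import List, Tuple

Anchor = Tuple[int, int]  # hypothetical: (position in X/Y, position in Z)


def longest_increasing_anchors(anchors: List[Anchor]) -> List[Anchor]:
    """Return a longest subset of anchors strictly increasing in both coordinates."""
    # Sort by the first coordinate; break ties by descending second coordinate
    # so two anchors sharing a first coordinate can never both be selected.
    order = sorted(anchors, key=lambda a: (a[0], -a[1]))
    tail_vals: List[int] = []  # smallest Z value ending a subsequence of length k + 1
    tail_idx: List[int] = []   # index in `order` of the anchor holding that value
    prev = [-1] * len(order)   # back-pointers for reconstructing the subsequence
    for idx, (_, z) in enumerate(order):
        k = bisect_left(tail_vals, z)
        if k == len(tail_vals):
            tail_vals.append(z)
            tail_idx.append(idx)
        else:
            tail_vals[k] = z
            tail_idx[k] = idx
        prev[idx] = tail_idx[k - 1] if k > 0 else -1
    # Walk the back-pointers from the end of the longest subsequence.
    result = []
    idx = tail_idx[-1] if tail_idx else -1
    while idx != -1:
        result.append(order[idx])
        idx = prev[idx]
    return result[::-1]
```

A score-aware variant would instead maximize the summed anchor scores, in the spirit of the chaining step used for pairwise alignment.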

**Figure 6.**
The stacking effect. (A) The correct multiple alignment of seven sequences. (B) The multiple alignment of the same seven sequences, with the stacking effect. When only gap-open and gap-continue penalties are used, the stacked alignment (B) incurs one gap-open penalty less than the correct alignment (A), and therefore is optimal. When gap-end penalties are used, the correct alignment (A) is optimal because it incurs two gap-end penalties less than the stacked alignment (B).

**Figure 7.**
Part of a multiple alignment between eight mammalian sequences, before and after iterative refinement. Refinement improves the overall alignment by correctly placing the two T’s in the fifth and sixth sequence, and changing a 1-bp deletion in sequence 5 into an insertion in sequence 6.

Cited by

Evolutionary dynamics of the accessory genome of Listeria monocytogenes.
den Bakker HC, Desjardins CA, Griggs AD, Peters JE, Zeng Q, Young SK, Kodira CD, Yandava C, Hepburn TA, Haas BJ, Birren BW, Wiedmann M. PLoS One. 2013 Jun 25;8(6):e67511. doi: 10.1371/journal.pone.0067511. Print 2013. PMID: 23825666.
An independent genome duplication inferred from Hox paralogs in the American paddlefish--a representative basal ray-finned fish and important comparative reference.
Crow KD, Smith CD, Cheng JF, Wagner GP, Amemiya CT. Genome Biol Evol. 2012;4(9):937-53. doi: 10.1093/gbe/evs067. Epub 2012 Jul 31. PMID: 22851613.
GLADX: an automated approach to analyze the lineage-specific loss and pseudogenization of genes.
Dainat J, Paganini J, Pontarotti P, Gouret P. PLoS One. 2012;7(6):e38792. doi: 10.1371/journal.pone.0038792. Epub 2012 Jun 18. PMID: 22723889.
Two Korean Endemic Clematis Chloroplast Genomes: Inversion, Reposition, Expansion of the Inverted Repeat Region, Phylogenetic Analysis, and Nucleotide Substitution Rates.
Choi KS, Ha YH, Gil HY, Choi K, Kim DK, Oh SH. Plants (Basel). 2021 Feb 19;10(2):397. doi: 10.3390/plants10020397. PMID: 33669616.
Conserved Noncoding Sequences Regulate lhx5 Expression in the Zebrafish Forebrain.
Sun L, Chen F, Peng G. PLoS One. 2015 Jul 6;10(7):e0132525. doi: 10.1371/journal.pone.0132525. eCollection 2015. PMID: 26147098.
