progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement

Aaron E Darling¹, Bob Mau, Nicole T Perna

PLoS One. 2010 Jun 25;5(6):e11147.

Affiliation

¹ Genome Center and Department of Computer Science, University of Wisconsin, Madison, Wisconsin, United States of America. aarondarling@ucdavis.edu

PMID: 20593022
PMCID: PMC2892488
DOI: 10.1371/journal.pone.0011147

Abstract

Background: Multiple genome alignment remains a challenging problem. Effects of recombination including rearrangement, segmental duplication, gain, and loss can create a mosaic pattern of homology even among closely related organisms.

Methodology/principal findings: We describe a new method to align two or more genomes that have undergone rearrangements due to recombination and substantial amounts of segmental gain and loss (flux). We demonstrate that the new method can accurately align regions conserved in some, but not all, of the genomes, an important case not handled by our previous work. The method uses a novel alignment objective score called a sum-of-pairs breakpoint score, which facilitates accurate detection of rearrangement breakpoints when genomes have unequal gene content. We also apply a probabilistic alignment filtering method to remove erroneous alignments of unrelated sequences, which are commonly observed in other genome alignment methods. We describe new metrics for quantifying genome alignment accuracy which measure the quality of rearrangement breakpoint predictions and indel predictions. The new genome alignment algorithm demonstrates high accuracy in situations where genomes have undergone biologically feasible amounts of genome rearrangement, segmental gain and loss. We apply the new algorithm to a set of 23 genomes from the genera Escherichia, Shigella, and Salmonella. Analysis of whole-genome multiple alignments allows us to extend the previously defined concepts of core- and pan-genomes to include not only annotated genes, but also non-coding regions with potential regulatory roles. The 23 enterobacteria have an estimated core-genome of 2.46 Mbp conserved among all taxa and a pan-genome of 15.2 Mbp. We document substantial population-level variability among these organisms driven by segmental gain and loss. Interestingly, much variability lies in intergenic regions, suggesting that the Enterobacteriaceae may exhibit regulatory divergence.

Conclusions: The multiple genome alignments generated by our software provide a platform for comparative genomic and population genomic studies. Free, open-source software implementing the described genome alignment approach is available from http://gel.ahabs.wisc.edu/mauve.
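The core- and pan-genome sizes quoted in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the segment representation (a length in bp plus the set of genomes containing that segment), the function name `core_and_pan_size`, and the toy genome labels are all assumptions made for illustration. The sketch also assumes each aligned segment is counted once toward the pan-genome.

```python
from typing import FrozenSet, Iterable, Tuple

# Hypothetical representation: (segment length in bp, genomes that contain it).
Segment = Tuple[int, FrozenSet[str]]


def core_and_pan_size(
    segments: Iterable[Segment], all_genomes: FrozenSet[str]
) -> Tuple[int, int]:
    """Return (core_bp, pan_bp) for a set of aligned segments."""
    core_bp = 0
    pan_bp = 0
    for length_bp, present_in in segments:
        if not present_in:
            continue  # a segment found in no genome contributes nothing
        pan_bp += length_bp  # counted once, however many genomes share it
        if present_in >= all_genomes:
            core_bp += length_bp  # conserved among all taxa
    return core_bp, pan_bp


# Toy labels, not the 23 genomes analyzed in the paper.
genomes = frozenset({"K12", "O157", "LT2"})
segments = [
    (1000, frozenset({"K12", "O157", "LT2"})),  # present in every genome
    (400, frozenset({"K12", "O157"})),          # missing from one genome
    (250, frozenset({"LT2"})),                  # lineage-specific
]
print(core_and_pan_size(segments, genomes))  # (1000, 1650)
```

Because segments rather than annotated genes are the unit here, non-coding regions contribute to both totals, which is the extension the abstract describes.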

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. The difference between *positional homology* alignment and *glocal* alignment.**
Three example linear genomes are broken into genes labeled A, B, C, D, and R. R is a multi-copy (repetitive) gene, with different copies labeled using numeric subscripts. Each copy of R is assumed to be identical in sequence, so that orthology/paralogy is unknowable from nucleotide substitution (as is often the case with mobile DNA repeat elements). Genes shifted downward in a given genome are inverted (reverse complement) relative to the reference genome. The *positional homology* alignment would ideally create two local alignment blocks where each block has exactly one alignment row for each genome. Only positionally-conserved copies of the repetitive gene family R become aligned to each other. The *glocal* alignment would ideally create four local alignment blocks wherein all copies of the repetitive gene family become aligned to each other.

**Figure 2. Overview of the alignment algorithm using three example genomes A, B, and C.**

**Figure 3. Illustration of terminology used for tree nodes.**
Rotating the figure 90 degrees counterclockwise explains the descriptive use of left and right.

**Figure 4. Quantifying indel accuracy.**
The correct alignment is shown at left and four possible predicted alignments are shown as A, B, C, and D. Nucleotides have been assigned a numerical identifier. The correct alignment has a single indel which partitions the alignment into three sections: the left aligned block, the indel, and the right aligned block. Predicted alignments must have one correctly aligned nucleotide pair in each of the three sections to count a true positive indel prediction.
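The true-positive rule in this caption can be sketched for a pairwise alignment as follows. This is one possible reading, not the authors' code: it assumes an alignment is a list of columns `(a, b)` holding numerical nucleotide identifiers, with `None` marking a gap, and that a "correctly aligned pair" inside the indel section is a nucleotide correctly placed against a gap. The function name `is_true_positive_indel` is hypothetical.

```python
from typing import List, Optional, Tuple

# One alignment column: nucleotide ids in sequence 1 and 2, None for a gap.
Column = Tuple[Optional[int], Optional[int]]


def is_true_positive_indel(true_cols: List[Column], pred_cols: List[Column]) -> bool:
    """Check the three-section rule for a true alignment with a single indel."""
    gap_idx = [i for i, (a, b) in enumerate(true_cols) if a is None or b is None]
    if not gap_idx:
        raise ValueError("true alignment contains no indel")
    first, last = gap_idx[0], gap_idx[-1]
    sections = [
        true_cols[:first],            # left aligned block
        true_cols[first:last + 1],    # the indel itself
        true_cols[last + 1:],         # right aligned block
    ]
    predicted = set(pred_cols)
    # Every section needs at least one column reproduced by the prediction.
    # An empty flanking block can never be matched, so it yields False.
    return all(any(col in predicted for col in section) for section in sections)


true_alignment = [(1, 1), (2, 2), (3, None), (4, None), (5, 3), (6, 4)]
shifted_gap = [(1, 1), (2, None), (3, None), (4, 2), (5, 3), (6, 4)]
misplaced_gap = [(1, 1), (2, 2), (3, 3), (4, 4), (5, None), (6, None)]

print(is_true_positive_indel(true_alignment, true_alignment))  # True
print(is_true_positive_indel(true_alignment, shifted_gap))     # True
print(is_true_positive_indel(true_alignment, misplaced_gap))   # False
```

The `shifted_gap` case shows that the rule tolerates an imprecise gap boundary as long as each section is partly recovered.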

**Figure 5. The accuracy of aligners on sequences evolved without rearrangement and with increasing nucleotide substitution and indel rates.**
Aligners were tested on 100 combinations of indel and substitution rate, with performance averaged over three replicates. All methods lose accuracy as mutation rates grow, and the most accurate alignment method depends on the particular mutation rates. progressiveMauve and MLAGAN exhibit the best indel sensitivity and positive predictive value (PPV), while TBA is more sensitive than other methods at extremely high mutation rates. MLAGAN did not align genomes without indels within the allotted 10 hours, resulting in the black row at the bottom. The asterisk in this figure indicates the combination of indel rate and substitution rate expected to be similar to our 23 target genomes.
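Sensitivity and positive predictive value, as used in this caption, follow the standard definitions TP / (TP + FN) and TP / (TP + FP). The sketch below computes both from two sets of items. Treating each item as an aligned nucleotide pair is an assumption made here for illustration, and `sensitivity_and_ppv` is a hypothetical helper rather than part of any aligner.

```python
import math
from typing import Hashable, Iterable, Tuple


def sensitivity_and_ppv(
    true_items: Iterable[Hashable], predicted_items: Iterable[Hashable]
) -> Tuple[float, float]:
    """Return (sensitivity, PPV); NaN where the denominator is zero."""
    true_set = set(true_items)
    pred_set = set(predicted_items)
    tp = len(true_set & pred_set)  # items that are both true and predicted
    sensitivity = tp / len(true_set) if true_set else math.nan
    ppv = tp / len(pred_set) if pred_set else math.nan
    return sensitivity, ppv


# Toy aligned pairs (position in genome 1, position in genome 2).
true_pairs = {(1, 1), (2, 2), (3, 3), (4, 4)}
predicted_pairs = {(1, 1), (2, 2), (3, 4)}

sens, ppv = sensitivity_and_ppv(true_pairs, predicted_pairs)
print(f"sensitivity={sens:.3f} ppv={ppv:.3f}")  # sensitivity=0.500 ppv=0.667
```

A method can score high PPV with low sensitivity simply by predicting few pairs, which is why the two figures are reported together.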

**Figure 6. Accuracy of mauveAligner, progressiveMauve, and TBA when aligning genomes with inversions and segmental gain and loss.**
In the experiments shown at top, the inversion rate increases along one axis and the substitution rate along the other. The most distant taxa have 0.05 indels per site. progressiveMauve clearly outperforms mauveAligner 1.3.0 over the entire space of inversion rates. It should be noted that in applications such as the UCSC browser alignments, TBA was used in conjunction with a separate synteny-mapping method to identify rearrangements, so the performance results given here are not cause for alarm. Experiments at bottom quantify aligner performance in the presence of small- and large-scale gain and loss events. One axis gives the average number of large gain and loss events [length ~ Unif(10 kbp, 50 kbp)] between the most distant taxa, while the other gives small gain and loss events [length ~ Geo(200 bp)]. Substitution and indel rates are those indicated by the asterisk in Figure 5, and the most distant taxa have 42 inversions on average. The asterisk in this figure indicates a simulation scenario expected to be similar to our 23 target genomes. Once again progressiveMauve outperforms other methods, but all methods break down when faced with substantial large-scale gain and loss. Of note, when mauveAligner 1.3.0 attains high PPV it usually does so with very poor sensitivity.
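The two event-length distributions named in this caption can be sampled as in the sketch below. It rests on assumptions not confirmed by the caption: that Geo(200 bp) denotes a geometric distribution with a mean of 200 bp, and that the uniform range includes both endpoints. The function names are hypothetical and this is not the simulator used in the paper.

```python
import math
import random


def sample_large_event_length(rng: random.Random) -> int:
    """Large gain/loss event: uniform on 10 kbp to 50 kbp (inclusive, assumed)."""
    return rng.randint(10_000, 50_000)


def sample_small_event_length(rng: random.Random, mean_bp: float = 200.0) -> int:
    """Small gain/loss event: geometric on {1, 2, ...} with mean mean_bp (assumed)."""
    p = 1.0 / mean_bp               # success probability of the geometric
    u = rng.random()                # uniform draw in [0, 1)
    # Inverse-transform sampling; 1 - u lies in (0, 1], so the log is defined.
    return int(math.log(1.0 - u) / math.log(1.0 - p)) + 1


rng = random.Random(0)              # fixed seed so the run is repeatable
large = [sample_large_event_length(rng) for _ in range(1_000)]
small = [sample_small_event_length(rng) for _ in range(20_000)]
print(min(large), max(large))
print(sum(small) / len(small))      # close to 200 under the stated assumption
```

The long right tail of the geometric means occasional "small" events run to a kilobase or more, while large events never fall below 10 kbp.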

**Figure 7. Frequency with which gaps are correctly predicted as a singular gap as a function of gap size.**
**Left** Average size distribution of gaps in an alignment of the nine genomes evolved at mutation rates which correspond to previous estimates for *E. coli*, *Shigella*, and *Salmonella*. The gap size distribution was averaged over 10 simulations. **Right** Fraction of TP indel predictions that are singular TP indel predictions by true gap size. Ten replicate simulations of evolution with gene gain, gene loss, indels, and nucleotide substitution were performed and alignments were computed using each aligner. Predicted indels were classified according to the definitions given in Figure 4, namely, a singular True Positive implies the true gap is predicted as a single gap. Remaining True Positive indels have the true gap broken up into two or more predicted gaps. For each aligner, the fraction of singular predicted gaps is shown as a function of gap size. Missing points indicate a lack of TP indel predictions in that size category. All aligners do well in predicting small gaps, but large gaps present problems. Most aligners, including Pecan which uses an extra pair-HMM state to model long gaps, tend to predict long gaps as a series of short gaps interspersed with alignments of unrelated sequence. We refer to such behavior as “gap dribble.” progressiveMauve was run with default parameters (proMauve), without the Homology HMM (proMauve_no_HMM), with the option to assume genomes are collinear (proMauve_col), and finally assuming collinearity and without the HMM (proMauve_col_no_HMM).
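The quantity plotted in the right panel, the fraction of TP indels that are singular within each gap-size category, can be sketched as below. The input format (one `(true_gap_size, n_predicted_gaps)` pair per TP indel), the bin edges, and the name `singular_fraction_by_size` are assumptions for illustration; the binning used in the paper is not stated in this caption.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def singular_fraction_by_size(
    tp_indels: Iterable[Tuple[int, int]], bin_edges: List[int]
) -> Dict[int, float]:
    """Map size-bin index to the fraction of TP indels predicted as one gap.

    bin_edges must be sorted; bin 0 holds sizes below bin_edges[0].
    """
    totals: Dict[int, int] = defaultdict(int)
    singular: Dict[int, int] = defaultdict(int)
    for true_gap_size, n_predicted_gaps in tp_indels:
        b = bisect_right(bin_edges, true_gap_size)
        totals[b] += 1
        if n_predicted_gaps == 1:   # singular TP: the gap was not broken up
            singular[b] += 1
    # Bins without any TP indel are omitted, like the missing points in the plot.
    return {b: singular[b] / totals[b] for b in sorted(totals)}


# Toy data: small gaps recovered whole, large gaps split into several pieces.
tp_indels = [(3, 1), (5, 1), (50, 1), (60, 2), (5000, 3), (7000, 4)]
print(singular_fraction_by_size(tp_indels, [10, 100, 1000]))
# {0: 1.0, 1: 0.5, 3: 0.0}
```

A fraction that falls toward zero in the large-size bins is the signature of the "gap dribble" behavior described in the caption.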

**Figure 8. Venn diagram of the pan-genome (left) and core genome (right) of *E. coli/Shigella* and *S. enterica*.**

**Figure 9. A Mauve visualization of the hypervariable intergenic regions surrounding *yhdE*.**
Each genome is laid out in a horizontal track, with annotated coding regions shown as white boxes. A colored similarity plot is shown for each genome, the height of which is proportional to the level of sequence identity in that region. When the similarity plot points downward it indicates an alignment to the reverse strand of the genome. Colors in the similarity plot indicate the combination of organisms containing a particular segment of the genome. Segments colored pink/mauve are conserved among all organisms, while purple segments are conserved in everything but *Salmonella*, and segments colored in olive green are conserved among non-uropathogenic *E. coli*. The visualization system is interactive and written in Java, and works on all computers supporting Java 1.4 or later.

Cited by

Draft Genome Sequence of Erythromycin-Resistant Streptococcus gallolyticus subsp. gallolyticus NTS 31106099 Isolated from a Patient with Infective Endocarditis and Colorectal Cancer.
Kambarev S, Caté C, Corvec S, Pecorari F. Genome Announc. 2015 Apr 23;3(2):e00370-15. doi: 10.1128/genomeA.00370-15. PMID: 25908147. Free PMC article.
Whole-Genome Sequence Analysis and Genome-Wide Virulence Gene Identification of Riemerella anatipestifer Strain Yb2.
Wang X, Ding C, Wang S, Han X, Yu S. Appl Environ Microbiol. 2015 Aug;81(15):5093-102. doi: 10.1128/AEM.00828-15. Epub 2015 May 22. PMID: 26002892. Free PMC article.
Mapping the Evolution of Hypervirulent Klebsiella pneumoniae.
Struve C, Roe CC, Stegger M, Stahlhut SG, Hansen DS, Engelthaler DM, Andersen PS, Driebe EM, Keim P, Krogfelt KA. mBio. 2015 Jul 21;6(4):e00630. doi: 10.1128/mBio.00630-15. PMID: 26199326. Free PMC article.
Type 1 and type 2 strains of Mycoplasma pneumoniae form different biofilms.
Simmons WL, Daubenspeck JM, Osborne JD, Balish MF, Waites KB, Dybvig K. Microbiology (Reading). 2013 Apr;159(Pt 4):737-747. doi: 10.1099/mic.0.064782-0. Epub 2013 Feb 14. PMID: 23412845. Free PMC article.
Characterization of the complete mitochondrial genome of Miamiensis avidus causing flatfish scuticociliatosis.
Kim NY, Ahn SJ, Seo JS, Jeon EJ, Cho MY, Choi HS. Genetica. 2022 Dec;150(6):407-420. doi: 10.1007/s10709-022-00167-5. Epub 2022 Oct 21. PMID: 36269500.
