Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2005 Mar 10:6:49.
doi: 10.1186/1471-2105-6-49.

A configuration space of homologous proteins conserving mutual information and allowing a phylogeny inference based on pair-wise Z-score probabilities

Affiliations

A configuration space of homologous proteins conserving mutual information and allowing a phylogeny inference based on pair-wise Z-score probabilities

Olivier Bastien et al. BMC Bioinformatics. .

Abstract

Background: Popular methods to reconstruct molecular phylogenies are based on multiple sequence alignments, in which addition or removal of data may change the resulting tree topology. We have sought a representation of homologous proteins that would conserve the information of pair-wise sequence alignments, respect probabilistic properties of Z-scores (Monte Carlo methods applied to pair-wise comparisons) and be the basis for a novel method of consistent and stable phylogenetic reconstruction.

Results: We have built up a spatial representation of protein sequences using concepts from particle physics (configuration space) and respecting a frame of constraints deduced from pair-wise alignment score properties in information theory. The obtained configuration space of homologous proteins (CSHP) allows the representation of real and shuffled sequences, and thereupon an expression of the TULIP theorem for Z-score probabilities. Based on the CSHP, we propose a phylogeny reconstruction using Z-scores. Deduced trees, called TULIP trees, are consistent with multiple-alignment based trees. Furthermore, the TULIP tree reconstruction method provides a solution for some previously reported incongruent results, such as the apicomplexan enolase phylogeny.

Conclusion: The CSHP is a unified model that conserves mutual information between proteins in the way physical models conserve energy. Applications include the reconstruction of evolutionary consistent and robust trees, the topology of which is based on a spatial representation that is not reordered after addition or removal of sequences. The CSHP and its assigned phylogenetic topology, provide a powerful and easily updated representation for massive pair-wise genome comparisons based on Z-score computations.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Geometric and probabilistic representation of the configuration space of homologous proteins (CSHP). For any sequence a taken as a reference (aref), one can build a configuration space (CSHP, aref) where all sequences that are homologous to a can be set. When two sequences a and b are aligned with a score s(a,b), then b is positioned in (CSHP, aref) and a in (CSHP, bref). The sequence alignment length determines the number of configuration dimensions; pair-wise amino acid scores determine the unique solution for its positioning. The q-dissimilarity (q = e-s) defines a proximity between sequences allowing a geometric representation (CSHP, q). Remarkable properties are i) the conservation of mutual information, [I(a;b) = I(b;a) ⇒ q(a,b) = q(b,a)], between (CSHP, aref) and (CSHP, bref), ii) a probabilistic representation of homologies based on q-dissimilarities by Venn diagrams (A and B) and iii) the assignment of a topology relying on protein evolution assumptions. Evolutionary paths for a and b lineages, sharing an unknown ancestor u, have a probabilistic expression, bounded above (see text), supporting a phylogenetic topology (TULIP trees).
Figure 2
Figure 2
Glucose-6-phosphate isomerase phylogeny. (A) Multiple alignment based (MAB) tree. (B) TULIP tree. Both trees were constructed using the BLOSUM 62 similarity matrix. For MAB tree construction, bootstrap support was estimated using 1000 replicates. To build TULIP trees, Z-scores were estimated with 2000 sequence shuffling. Topology supported by high bootstrap results in the MAB tree (figures in black), are consistently recovered in the corresponding pair-wise alignement based TULIP tree.
Figure 3
Figure 3
Enolase phylogenic incongruence. When aligned, the enolase region corresponding to amino acids 73–118 of the Oryza sativa gene, exhibit two insertions (red boxes) that are only present in land plants, charophytes and alveolates. In alveolates, these insertions are consistent with a horizontal gene transfer. However, to date, evolutionary reconstructions based on enolase sequences did not allow any phylogenetic branch gathering for these clades [36].
Figure 4
Figure 4
Solution of the enolase phylogenic incongruence. (A) Multiple alignment based (MAB) tree. (B) TULIP tree. Both trees were constructed using the BLOSUM 62 similarity matrix. For MAB tree construction, bootstrap support was estimated using 1000 replicates. To build TULIP trees, Z-scores were estimated with 2000 sequence shuffling. Clades that contain a unique insertional signature (Land plants, green box; Charophytes and Chlorarachnion, blue box; Alveolates; yellow box) are not gathered in the MAB tree, as previously reported [36]. By contrast, in the TULIP tree, the phylogeny of enolase proteins is reconciled with the insertional signature detection in Land plants, Charophytes, Chlorarachnion and Alveolates.
Figure 5
Figure 5
(A) Gumbel score distribution simulated for enolases used in the present paper (B) graphical determination of ψ

Similar articles

Cited by

References

    1. Zuckerkandl E, Pauling L. Molecules as documents of evolutionary history. J Theor Biol. 1965;8:357–366. - PubMed
    1. Zukerkandl E. The evolution of hemoglobin. Sci Am. 1965;212:110–118. - PubMed
    1. Fitch WM, Margoliash E. Construction of phylogenetic trees. Science. 1967;155:279–284. - PubMed
    1. Arnheim N, Taylor CE. Non-Darwinian evolution: consequences for neutral allelic variation. Nature. 1969;223:900–903. - PubMed
    1. Dayhoff MO. Computer analysis of protein evolution. Sci Am. 1969;221:86–95. - PubMed

Publication types

MeSH terms

LinkOut - more resources