Nucleic Acids Res. 2008 Oct;36(17):e110.
doi: 10.1093/nar/gkn485. Epub 2008 Aug 1.

An approach of orthology detection from homologous sequences under minimum evolution

Kyung Mo Kim et al. Nucleic Acids Res. 2008 Oct.

Abstract

In phylogenetics and comparative genomics, it is important to establish orthologous relationships when comparing homologous sequences. Because orthologs and paralogs often differ only slightly in sequence, paralogs are easily mistaken for orthologs. For this reason, several methods based on evolutionary distance, phylogeny and BLAST have tried to detect orthologs with more precision. Depending on its algorithmic implementation, each of these methods can suffer from an elevated false-negative or false-positive rate. Here, we developed a novel algorithm for orthology detection that uses a distance method based on the phylogenetic criterion of minimum evolution. Our algorithm assumes that sets of sequences exhibiting orthologous relationships are evolutionarily less costly than sets that include one or more paralogous relationships. Calculation of evolutionary cost requires the reconstruction of a neighbor-joining (NJ) tree, but calculations are unaffected by the topology of any given NJ tree. Unlike tree reconciliation, our algorithm appears free from the problem of incorrect topologies of species and gene trees. The reliability of the algorithm was tested in a comparative analysis with two other orthology detection methods using 95 manually curated KOG datasets and 21 experimentally verified EXProt datasets. Sensitivity and specificity estimates indicate that the concept of minimum evolution could be valuable for the detection of orthologs.
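Under the minimum evolution criterion, the cost of a tree is usually taken to be the sum of its branch lengths. The sketch below computes that sum for a neighbor-joining tree built from a pairwise distance matrix. It is a generic textbook NJ routine written for illustration, not the authors' implementation. The function name `nj_tree_length` and the list-of-lists input format are assumptions, and the exact definition of the score used in the paper may differ.

```python
def nj_tree_length(dist):
    """Sum of branch lengths of the neighbor-joining tree for a symmetric
    distance matrix given as a list of lists (hypothetical helper)."""
    size = len(dist)
    # Distances stored as nested dicts so that new internal nodes can be added.
    d = {i: {j: dist[i][j] for j in range(size) if j != i} for i in range(size)}
    nodes = list(range(size))
    next_id = size
    total = 0.0

    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each active node.
        r = {i: sum(d[i][j] for j in nodes if j != i) for i in nodes}

        # Choose the pair that minimizes the NJ Q criterion.
        best = None
        for a in range(n):
            for b in range(a + 1, n):
                i, j = nodes[a], nodes[b]
                q = (n - 2) * d[i][j] - r[i] - r[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best

        # Branch lengths from the joined pair to their new parent node.
        len_i = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        len_j = d[i][j] - len_i
        total += len_i + len_j

        # Distances from the new internal node to every remaining node.
        u = next_id
        next_id += 1
        d[u] = {}
        for k in nodes:
            if k != i and k != j:
                d_uk = 0.5 * (d[i][k] + d[j][k] - d[i][j])
                d[u][k] = d_uk
                d[k][u] = d_uk
        nodes = [k for k in nodes if k != i and k != j] + [u]

    # The last two nodes are connected by a single remaining branch.
    if len(nodes) == 2:
        total += d[nodes[0]][nodes[1]]
    return total


# Toy additive distances for four sequences (illustrative values only).
toy = [
    [0, 3, 5, 6],
    [3, 0, 6, 7],
    [5, 6, 0, 7],
    [6, 7, 7, 0],
]
print(nj_tree_length(toy))
```

A set of sequences with a smaller tree length would, under the assumption stated in the abstract, be the more likely orthologous set.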

Figures

Figure 1.
A conceptual representation describing orthologous and paralogous relationships related to gene duplication. (a) A simple phyletic history of gene duplication. In the hypothetical phylogenetic tree, the primary ancestral gene X duplicated into two ancestral descendants α (red) and β (blue). It follows that they have diverged along with speciation into three species. (b) Unfolded phylogenetic trees showing orthologous and paralogous relationships. There are two orthologous clusters [(αA, αB), αC] and [(βA, βB), βC]. Within each orthologous cluster, the letters marked beside a branch indicate the length of that branch. (c) Comparison of minimum evolution scores between orthologous and paralogous relationships.
Figure 2.
Conceptual representation of the detection of orthologs in a given alignment. (a) In an input alignment, a sequence name consists of a sequence identifier and the species information. The letters before an underbar denote a sequence identifier, while those after the underbar indicate an abbreviation of the scientific name of a species. (b) The upper box includes sequences with one occurrence per species, while the lower box includes paralogous sequences with more than one occurrence per species. (c) All possible combinations of sequences in which each species is represented exactly once by one of its sequences. (d) Datasets formed by merging the dataset from (b) into each of the datasets from (c). (e) Collection of phylogenetic trees reconstructed from the merged datasets. (f) Calculation of minimum evolution scores for the obtained phylogenetic trees. (g) Selection of the smallest minimum evolution score. (h) Determination of orthologous sequences.
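The enumeration and selection steps in this caption can be sketched roughly as follows. This is an illustrative outline, not the Mestortho source. The function names, the example sequence names and the `score` callable are hypothetical, and splitting on the last underbar is an assumption about the naming scheme. In practice `score` would return the minimum evolution score of the tree built from the candidate set.

```python
from collections import defaultdict
from itertools import product


def split_by_species(names):
    """Group names of the form '<identifier>_<species>' by species.
    Splitting on the last underbar is an assumption."""
    groups = defaultdict(list)
    for name in names:
        _identifier, species = name.rsplit("_", 1)
        groups[species].append(name)
    return groups


def candidate_sets(names):
    """Yield merged datasets: the single-copy sequences plus one sequence
    from every species that occurs more than once."""
    groups = split_by_species(names)
    single = [seqs[0] for seqs in groups.values() if len(seqs) == 1]
    multi = [seqs for seqs in groups.values() if len(seqs) > 1]
    for combo in product(*multi):
        yield single + list(combo)


def pick_orthologs(names, score):
    """Return the candidate set with the smallest score.
    `score` is a hypothetical callable standing in for the
    minimum evolution score of the corresponding tree."""
    return min(candidate_sets(names), key=score)


# Hypothetical input: two species carry two copies each.
names = ["a1_Hsap", "a2_Hsap", "b1_Mmus", "c1_Hpor", "c2_Hpor"]
for candidate in candidate_sets(names):
    print(candidate)
```

The number of candidate sets is the product of the copy numbers of the multi-copy species, so it grows quickly as more species carry paralogs.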
Figure 3.
Real running time of Mestortho according to the number of sequences for each dataset. To plot the graph, 21 EXProt and 90 KOG alignments were used. The remaining five KOG alignments with sequence sizes of 67, 76, 49, 60 and 71 had running times of 26301, 18200, 274, 539 and 4540 s, respectively. The equation of the fit curve is y = 0.002 x^2.365 (R^2 = 0.798).
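As a quick way to read the fitted curve, the snippet below evaluates it for a given number of sequences. It assumes that x is the number of sequences and y the running time in seconds, which the caption suggests but does not state outright. The function name is hypothetical. With R^2 = 0.798 and the five excluded alignments running far longer, the result is only a rough estimate.

```python
def predicted_runtime_seconds(num_sequences):
    """Evaluate the reported fit y = 0.002 * x**2.365.
    Units of seconds are an assumption based on the caption."""
    return 0.002 * num_sequences ** 2.365


# Doubling the number of sequences multiplies the estimate by 2**2.365.
for n in (10, 20, 40):
    print(n, predicted_runtime_seconds(n))
```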
Figure 4.
Tree reconciliation versus Mestortho. (a) Phylogenetic tree reconstructed from the sequences of myoglobin, and α- and β-hemoglobins. Boxed sequences have no genetic distance between them. The sequences within the three dotted ellipses have orthologous relationships according to previous reports. (b) Results of Mestortho for the aligned dataset. The letters C and N in the first row of each table denote ‘co-orthology’ and ‘no genetic distance’, respectively. Sequences with co-orthology and no-genetic-distance relationships are marked by red and blue bars, respectively. The abbreviations Hpor, Hsap and Mmus indicate H. portusjacksoni (Port Jackson shark), H. sapiens (human) and Mus musculus (mouse), respectively.
Figure 5.
The confidence interval of True datasets at a one-tailed 95% significance level. Among 116 datasets (KOG+EXProt), 33 alignments were used to calculate the confidence interval. For each of 33 datasets, the black dot indicates the MES of False sequences, while the shaded rectangle indicates the MES range of the one-tailed 95% confidence interval of True sequences. A number in parentheses is the number of False sequences in each of the datasets.
Figure 6.
Two examples of errors when running Mestortho. (a and b) In the phylogenetic trees, A, B and C indicate species, and α and β denote two descendants after gene duplication. In each of the trees, red branches indicate the lineages detected as orthologous sequences by Mestortho. (c) The phylogenetic tree of (b) is obtained from an EXProt dataset (EC 1.15.1.1) corresponding to the model (b). The monophyletic sequence group is marked by a triangle. The branches of genes detected as orthologs by Mestortho are indicated in bold. Among taxa of the phylogenetic tree, True sequences are also marked in bold and red. The asterisk symbol indicates the reference sequence of the EXProt dataset. The dotted line shows the paralogous relationships between sequences.
