Genome Res. 2016 Sep;26(9):1257-67.
doi: 10.1101/gr.203950.115. Epub 2016 Jul 19.

In silico phylogenomics using complete genomes: a case study on the evolution of hominoids

Igor Rodrigues Costa et al. Genome Res. 2016 Sep.

Abstract

The increasing availability of complete genome data is facilitating the acquisition of phylogenomic data sets, but the process of obtaining orthologous sequences from other genomes and assembling multiple sequence alignments remains piecemeal and arduous. We designed software that performs these tasks and outputs anonymous loci (AL) or anchored enrichment/ultraconserved element loci (AE/UCE) data sets in ready-to-analyze formats. We demonstrate our program by applying it to the hominoids. Starting with human, chimpanzee, gorilla, and orangutan genomes, our software generated an exhaustive data set of 292 ALs (∼1 kb each) in ∼3 h. Not only did analyses of our AL data set validate the program by yielding a portrait of hominoid evolution in agreement with previous studies, but the accuracy and precision of our estimated ancestral effective population sizes and speciation times represent improvements. We also used our program with a published set of 512 vertebrate-wide AE "probe" sequences to generate data sets consisting of 171 and 242 independent loci (∼1 kb each) in 11 and 13 min, respectively. The former data set consisted of flanking sequences 500 bp from adjacent AEs, while the latter contained sequences bordering AEs. Although our AE data sets produced the expected hominoid species tree, coalescent-based estimates of ancestral population sizes and speciation times based on these data were considerably lower than estimates from our AL data set and previous studies. Accordingly, we suggest that loci subjected to direct or indirect selection may not be appropriate for coalescent-based methods. Complete in silico approaches, combined with the burgeoning genome databases, will accelerate the pace of phylogenomics.

Figures

Figure 1.
ALFIE software pipeline. (A) Anonymous loci (AL) finding module: User inputs complete genome sequences in a FASTA format and a general feature format (GFF) file for the query genome. Program first applies a user-defined “distance filter,” which removes all known functional elements + flanking sequences of user-specified lengths (purple color blocks). Remaining (presumably neutral) intergenic regions (orange color blocks), called candidate ALs, are retrieved and cut into consecutive segments of user-defined length and saved in FASTA files. (B) Anchor loci (AE/UCE) finding module: User inputs genome sequences in FASTA format. Program finds locations of target AEs/UCEs in a reference human genome with a coordinate file that currently contains 512 vertebrate AEs (included in package). Module retrieves flanking regions with user-defined length (e.g., 500 bp). User also specifies distance (in base pairs) between flanking sequences and their AEs/UCEs. Paired flanking sequences (i.e., candidate AE/UCE loci) are saved in FASTA files. (C) Downstream analyses: AL or AE/UCE candidate loci are used as query sequences in BLAST searches against target genomes. Single-copy loci are retained and subsequently aligned. A user-specified distance filter retains loci that are likely independent from other sampled loci. Each pair of AE/UCE flanking sequences is concatenated to form independent loci. Lastly, ALFIE outputs ready-to-analyze data sets.
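The AL finding step in panel A (mask functional elements plus flanks, keep the remaining intergenic regions, cut them into fixed-length segments) can be sketched as simple interval arithmetic. This is a minimal illustration, not ALFIE's actual code: the function names, the half-open 0-based coordinates, and the choice to discard leftover pieces shorter than the segment length are all assumptions made for the example.

```python
from typing import List, Tuple

# Half-open, 0-based interval [start, end) on one chromosome (assumed convention).
Interval = Tuple[int, int]


def mask_features(features: List[Interval], flank: int, chrom_len: int) -> List[Interval]:
    """Expand each annotated feature by `flank` bp on both sides and merge overlaps."""
    expanded = sorted(
        (max(0, start - flank), min(chrom_len, end + flank)) for start, end in features
    )
    merged: List[Interval] = []
    for start, end in expanded:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous masked block: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def candidate_regions(masked: List[Interval], chrom_len: int) -> List[Interval]:
    """Return the gaps between masked blocks (the presumably neutral regions)."""
    regions: List[Interval] = []
    pos = 0
    for start, end in masked:
        if start > pos:
            regions.append((pos, start))
        pos = max(pos, end)
    if pos < chrom_len:
        regions.append((pos, chrom_len))
    return regions


def cut_segments(regions: List[Interval], seg_len: int) -> List[Interval]:
    """Cut each region into consecutive segments of exactly `seg_len` bp.

    Leftover pieces shorter than `seg_len` are dropped (an assumption).
    """
    segments: List[Interval] = []
    for start, end in regions:
        for seg_start in range(start, end - seg_len + 1, seg_len):
            segments.append((seg_start, seg_start + seg_len))
    return segments


if __name__ == "__main__":
    # Hypothetical toy chromosome: two features, 50-bp flanks, 300-bp segments.
    masked = mask_features([(100, 200), (1000, 1100)], flank=50, chrom_len=2000)
    regions = candidate_regions(masked, chrom_len=2000)
    print(cut_segments(regions, seg_len=300))
```

In a real run the feature intervals would come from the GFF file for the query genome and the segment coordinates would be used to slice the FASTA sequence; both steps are omitted here.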
Figure 2.
Mutational profiles observed in three phylogenomic data sets obtained from human, chimpanzee, gorilla, and orangutan genomes. The data sets include 292 AL, 171 anchored enrichment (AE) loci, and 242 AE loci (for descriptions of each data set, see main text). (A–C) Distributions of nucleotide substitution models for each data set (for details about the different models, see Posada 2008). (D–F) Distributions of transition/transversion (Ti/Tv) rate ratios for each data set. Note, Ti/Tv values are only shown for loci that had a best-fitting nucleotide substitution model containing this parameter (i.e., 185, 133, and 230 loci, respectively). Also, five of the 171 AE loci and six of the 242 AE loci exhibited unusually high Ti/Tv values and thus were not included in these analyses. Source data are in Supplemental Tables S3 through S5.
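For readers unfamiliar with the Ti/Tv distinction, the sketch below counts transitions (purine to purine or pyrimidine to pyrimidine) and transversions (purine to pyrimidine or the reverse) between two aligned sequences. It is only an illustration of the two classes of change: the values in the figure are rate ratios estimated as parameters of the best-fitting substitution model, not raw counts like these, and the function name and the handling of gaps are assumptions.

```python
from typing import Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ti_tv_counts(seq_a: str, seq_b: str) -> Tuple[int, int]:
    """Count observed transitions and transversions between two aligned sequences.

    Columns with a gap or an ambiguous base in either sequence are skipped
    (an assumption for this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = 0
    transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    return transitions, transversions


if __name__ == "__main__":
    # Toy alignment: one A/G transition, then A/C and G/T transversions.
    print(ti_tv_counts("ACGTACGT", "GCGTCCTT"))
```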
Figure 3.
Bayesian estimation of ancestral effective population sizes and speciation times in the hominoids based on four phylogenomic data sets. (A) Inferred ancestral effective population sizes (Na) for the human–chimpanzee ancestor (NHC), human–chimpanzee–gorilla ancestor (NHCG), and human–chimpanzee–gorilla–orangutan ancestor (NHCGO). (B) Inferred speciation time in millions of years ago (Mya) between: human and chimpanzee lineages (τH-C), divergence leading to gorilla lineage (τHC-G), and divergence leading to orangutan lineage (τHCG-O). “AL 53” = 53 AL of Chen and Li (2001); “AL 292” = 292 AL; “AE 171” = 171 AE loci with each flanking sequence 500 bp from the adjacent AE; and “AE 242” = 242 AE loci with each flanking sequence 0 bp from the adjacent AE. Shown are posterior means and 95% credibility intervals for each parameter, which were estimated using five different sets of priors: P1 (blue), P2 (red), P3 (green), P4 (yellow), and P5 (orange; see Methods). Analyses for each parameter are contained within black boxes, and vertical dashed gray lines separate the results of each data set. Source data are in Supplemental Tables S6 through S13.

