Genome Res. 2005 Aug;15(8):1034-50.
doi: 10.1101/gr.3715005. Epub 2005 Jul 15.

Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes

Adam Siepel et al. Genome Res. 2005 Aug.

Abstract

We have conducted a comprehensive search for conserved elements in vertebrate genomes, using genome-wide multiple alignments of five vertebrate species (human, mouse, rat, chicken, and Fugu rubripes). Parallel searches have been performed with multiple alignments of four insect species (three species of Drosophila and Anopheles gambiae), two species of Caenorhabditis, and seven species of Saccharomyces. Conserved elements were identified with a computer program called phastCons, which is based on a two-state phylogenetic hidden Markov model (phylo-HMM). PhastCons works by fitting a phylo-HMM to the data by maximum likelihood, subject to constraints designed to calibrate the model across species groups, and then predicting conserved elements based on this model. The predicted elements cover roughly 3%-8% of the human genome (depending on the details of the calibration procedure) and substantially higher fractions of the more compact Drosophila melanogaster (37%-53%), Caenorhabditis elegans (18%-37%), and Saccharomyces cerevisiae (47%-68%) genomes. From yeasts to vertebrates, in order of increasing genome size and general biological complexity, increasing fractions of conserved bases are found to lie outside of the exons of known protein-coding genes. In all groups, the most highly conserved elements (HCEs), by log-odds score, are hundreds or thousands of bases long. These elements share certain properties with ultraconserved elements, but they tend to be longer and less perfectly conserved, and they overlap genes of somewhat different functional categories. In vertebrates, HCEs are associated with the 3' UTRs of regulatory genes, stable gene deserts, and megabase-sized regions rich in moderately conserved noncoding sequences. Noncoding HCEs also show strong statistical evidence of an enrichment for RNA secondary structure.
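The prediction step described in the abstract can be illustrated with a small sketch. It is not the phastCons implementation: it assumes the per-column log-likelihoods under the conserved and nonconserved phylogenetic models are already available (toy numbers here), assumes that `mu` is the conserved-to-nonconserved transition probability and `nu` the reverse, and uses the Viterbi algorithm to pick the most likely state path. The function names and all input values are hypothetical.

```python
import math

def viterbi_two_state(loglik_c, loglik_n, mu, nu):
    """Most likely path of 'c'/'n' states given per-column log-likelihoods."""
    # Assumption: mu = P(c -> n), nu = P(n -> c).
    log_trans = {
        ("c", "c"): math.log(1 - mu), ("c", "n"): math.log(mu),
        ("n", "c"): math.log(nu), ("n", "n"): math.log(1 - nu),
    }
    # Each state is entered first with its stationary probability.
    start = {"c": math.log(nu / (mu + nu)), "n": math.log(mu / (mu + nu))}
    emit = {"c": loglik_c, "n": loglik_n}
    states = ("c", "n")
    score = {s: start[s] + emit[s][0] for s in states}
    back = []
    for i in range(1, len(loglik_c)):
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] + log_trans[(p, s)])
            pointers[s] = prev
            new_score[s] = score[prev] + log_trans[(prev, s)] + emit[s][i]
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))

def conserved_elements(path, loglik_c, loglik_n):
    """Return (start, end, log_odds) per maximal run of 'c'; end is exclusive."""
    elements, start = [], None
    for i, state in enumerate(path + "n"):  # sentinel closes a trailing run
        if state == "c" and start is None:
            start = i
        elif state != "c" and start is not None:
            log_odds = sum(loglik_c[j] - loglik_n[j] for j in range(start, i))
            elements.append((start, i, log_odds))
            start = None
    return elements

# Toy input: columns 4-8 are better explained by the conserved model.
loglik_c = [-3] * 4 + [-1] * 5 + [-3] * 3
loglik_n = [-2] * 4 + [-4] * 5 + [-2] * 3
path = viterbi_two_state(loglik_c, loglik_n, mu=0.1, nu=0.05)
elements = conserved_elements(path, loglik_c, loglik_n)
```

In this sketch the log-odds score of an element is the summed difference between the conserved and nonconserved log-likelihoods over its columns, which is one natural reading of the "log-odds score" used to rank HCEs.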

Figures

Figure 1.
State-transition diagram for the phylo-HMM used by phastCons, which consists of a state for conserved regions (c) and a state for nonconserved regions (n). Each state is associated with a phylogenetic model (ψc and ψn); these models are identical except for a scaling parameter ρ (0 ≤ ρ ≤ 1), which is applied to the branch lengths of ψc and represents the average rate of substitution in conserved regions as a fraction of the average rate in nonconserved regions (see Methods). Two parameters, μ and ν (0 ≤ μ, ν ≤ 1), define all state-transition probabilities, as illustrated. The probability of visiting each state first (indicated by arcs from the node labeled “begin”) is simply set equal to the probability of that state at equilibrium (stationarity). The model can be thought of as a probabilistic machine that “generates” a multiple alignment, consisting of alternating sequences of conserved (dark gray) and nonconserved (light gray) alignment columns (see example at bottom).
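As a rough illustration of the state-transition structure in this caption, the sketch below builds the two-state transition table, computes the equilibrium distribution used for the first state, and samples a state path. The caption does not spell out which arc carries which parameter, so the assignment here (μ for leaving the conserved state, ν for entering it) is an assumption, as are the function names and example values.

```python
import random

def transition_matrix(mu, nu):
    """Transition probabilities, assuming mu = P(c -> n) and nu = P(n -> c)."""
    return {"c": {"c": 1 - mu, "n": mu}, "n": {"c": nu, "n": 1 - nu}}

def stationary(mu, nu):
    """Equilibrium distribution of the two-state chain (requires mu + nu > 0)."""
    return {"c": nu / (mu + nu), "n": mu / (mu + nu)}

def sample_path(length, mu, nu, seed=0):
    """Sample a path of 'c'/'n' states, starting from the equilibrium distribution."""
    rng = random.Random(seed)
    trans = transition_matrix(mu, nu)
    pi = stationary(mu, nu)
    state = "c" if rng.random() < pi["c"] else "n"
    path = [state]
    for _ in range(length - 1):
        state = "c" if rng.random() < trans[state]["c"] else "n"
        path.append(state)
    return "".join(path)

mu, nu = 0.1, 0.05          # hypothetical values
trans = transition_matrix(mu, nu)
pi = stationary(mu, nu)
example = sample_path(50, mu, nu)
```

Under this parameterization, conserved runs have geometric length with mean 1/μ, and the long-run fraction of conserved columns is ν/(μ + ν).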
Figure 2.
The assumed tree topologies for the vertebrate, insect, worm, and yeast data sets (top to bottom) and the branch lengths estimated for the conserved (left) and nonconserved (right) states of the phylo-HMM. The conserved and nonconserved phylogenies are identical, except for the scaling constant ρ, which was estimated at 0.33, 0.24, 0.36, and 0.32 (top to bottom). Horizontal lines indicate branch lengths and are drawn to scale, both within and between species groups. The estimated trees were unrooted; arbitrary roots were chosen for display purposes. Note that some distortions in the branch lengths occur due to alignment-related ascertainment biases (see text and Supplemental material).
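The relationship between the two phylogenies can be sketched in a few lines: the conserved tree is the nonconserved tree with every branch length multiplied by ρ. The ρ values below are the estimates quoted in the caption; the branch lengths are hypothetical placeholders, and the Jukes-Cantor formula is used only to show the effect of scaling on a per-branch substitution probability (it is a stand-in, not necessarily the substitution model used in the study).

```python
import math

# Estimates of rho reported in the caption (top to bottom).
RHO_ESTIMATES = {"vertebrate": 0.33, "insect": 0.24, "worm": 0.36, "yeast": 0.32}

def scale_branches(branch_lengths, rho):
    """Return conserved-state branch lengths: each length multiplied by rho."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return {branch: rho * length for branch, length in branch_lengths.items()}

def jc_diff_probability(t):
    """Jukes-Cantor probability that a site differs across a branch of length t."""
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))

# Hypothetical nonconserved branch lengths (substitutions per site).
nonconserved = {"human": 0.15, "mouse": 0.35, "chicken": 0.50}
conserved = scale_branches(nonconserved, RHO_ESTIMATES["vertebrate"])

for branch in nonconserved:
    print(branch,
          round(jc_diff_probability(nonconserved[branch]), 3),
          round(jc_diff_probability(conserved[branch]), 3))
```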
Figure 3.
Fractions of bases of various annotation types covered by predicted conserved elements (left) and fractions of bases in conserved elements belonging to various annotation types (right). Annotation types include coding regions of known genes (CDS), 5′ and 3′ UTRs of known genes, other regions aligned to mRNAs or spliced ESTs from GenBank (other mRNA), other transcribed regions according to data from Phase 2 of the Affymetrix/NCI Human Transcriptome project (other trans; see Methods), introns of known genes, and other regions (unannotated). All annotations were for the reference genome of each species group and all fractions were computed with respect to these genomes (see Methods). Dashed lines in column graphs indicate expected coverage if conserved elements were distributed uniformly. Transcriptome data was available for the vertebrates only, and UTRs and other mRNAs were omitted for yeast because of sparse data. Note that these graphs change somewhat (but not dramatically) under alternative calibration methods (see Supplemental material).
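The two kinds of fractions plotted in this figure can be computed from interval overlaps. The sketch below is a minimal illustration with half-open `(start, end)` coordinates on a single sequence; the interval values, the genome length, and the function names are hypothetical, and the study's actual pipeline is not described here.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]

def total_length(intervals):
    """Number of bases covered by a set of intervals."""
    return sum(end - start for start, end in merge(intervals))

def overlap_length(a, b):
    """Number of bases covered by both interval sets."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total

# Hypothetical data on a 100-base sequence.
genome_length = 100
conserved = [(10, 30), (50, 60)]
cds = [(0, 20), (55, 80)]

shared = overlap_length(conserved, cds)
frac_cds_covered = shared / total_length(cds)          # left-hand plots
frac_conserved_in_cds = shared / total_length(conserved)  # right-hand plots
expected_if_uniform = total_length(conserved) / genome_length  # dashed line
```

If conserved elements were spread uniformly, every annotation type would be covered at the genome-wide rate, which is what the dashed lines in the column graphs represent.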
Figure 4.
Screen shots of the conservation tracks in the (A) human and (B) S. cerevisiae UCSC Genome Browsers. Each conservation track has two parts, a plot of conservation scores, and beneath it, a display showing where each of the other genomes aligns to the reference genome. (Darker shading indicates higher BLASTZ scores; white indicates no alignment.) A separate track labeled “PhastCons Conserved Elements” shows predicted conserved elements and log-odds scores. In A, exons 7–11 of the RNA-edited human gene GRIA2 are shown. Peaks in the conservation plot generally correspond to exons and valleys to noncoding regions, but a 158-bp conserved noncoding element can be seen near the 3′ end of exon 11. This conserved element includes the editing complementary sequence (ECS) of the RNA editing site in exon 11. The displays seen when zooming in to the base level at a typical exon (left) and in the region of the RNA editing site (right; see arrow) are shown as insets. On the left, several synonymous substitutions are visible (highlighted bases) and the elevated conservation abruptly ends after the splice site, while on the right, there are fewer synonymous substitutions and the elevated conservation extends into the intron. In the base-level display, the vertical orange bars and numbers above them indicate “hidden” indels and their lengths—i.e., deletions in the human genome or insertions in other genomes. In B, the S. cerevisiae GAL1 gene and 5′-flanking region are shown. Strong cross-species conservation can be seen in the regulatory region upstream of the promoter, as well as in the protein-coding portion of the gene. The conserved element shown at bottom overlaps three GAL4-binding sites (highlighted in base-level view). A fourth GAL4-binding site also is reflected by a small bump in the conservation scores (left arrow), as is the promoter itself (right arrow).
Figure 5.
Extreme conservation at the 3′ end of the ELAVL4 (HuD) gene, an RNA-binding gene associated with paraneoplastic encephalomyelitis sensory neuropathy and homologous to Drosophila genes with established roles in neurogenesis and sex determination (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM). The 3117-bp conserved element that overlaps the 3′ UTR of this gene (arrow) is the fifth highest scoring conserved element in the human genome (log odds score 2475). Several conserved elements in introns are also visible.
Figure 6.
Histograms of folding potential scores (FPSs) for (A) highly conserved elements (HCEs) in 3′ UTRs vs. a random sample of 3′ UTRs without HCEs, (B) HCEs in 3′ UTRs vs. HCEs in 5′ UTRs, and (C) HCEs in introns vs. HCEs in coding regions (vertebrate data in all cases). Scores are based on a phylogenetic stochastic context-free grammar, and represent the potential for local secondary structure in a sliding window of 150 bp (see Methods). In all three cases, the difference between the distributions is highly statistically significant (P = 8.8e-66, P = 1.1e-8, and P = 4.4e-215, respectively; Wilcoxon rank sum test).
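For readers unfamiliar with the test named in this caption, the following is a minimal sketch of a two-sided Wilcoxon rank sum test using the normal approximation with a tie correction and no continuity correction. It is not the authors' code, and the two score samples are hypothetical; for real analyses an established statistics library would be the safer choice.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test; returns (z, p) by normal approximation."""
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the block of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mean = n1 * (n + 1) / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (rank_sum_x - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return z, p

# Hypothetical folding potential scores for two groups of elements.
group_a = [1.2, 1.5, 1.9, 2.3, 2.8, 3.1]
group_b = [0.1, 0.4, 0.6, 0.9, 1.0, 1.4]
z, p = rank_sum_test(group_a, group_b)
```

The normal approximation is reasonable for samples as large as those in a genome-wide comparison; for very small samples an exact test would be more appropriate.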

Web site references

    1. http://www.cse.ucsc.edu/~acs/conservation; Supplemental data for this study.
    1. http://genome.ucsc.edu; UC Santa Cruz Genome Browser.
    1. http://genome.ucsc.edu/cgi-bin/hgTables; UC Santa Cruz Table Browser.
    1. http://www.genetics.wustl.edu/saccharomycesgenomes/Contigs; download page for yeast sequence data, Washington University, St. Louis.
    1. http://www.broad.mit.edu/ftp/pub/annotation/fungi/comp_yeasts; download page for yeast sequence data, Broad Institute.
