Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes

Genome Res. 2005 Aug;15(8):1034-50. doi: 10.1101/gr.3715005. Epub 2005 Jul 15.

Adam Siepel¹, Gill Bejerano, Jakob S Pedersen, Angie S Hinrichs, Minmei Hou, Kate Rosenbloom, Hiram Clawson, John Spieth, Ladeana W Hillier, Stephen Richards, George M Weinstock, Richard K Wilson, Richard A Gibbs, W James Kent, Webb Miller, David Haussler

PMID: 16024819
PMCID: PMC1182216
DOI: 10.1101/gr.3715005

Adam Siepel et al. Genome Res. 2005 Aug.

Affiliation

¹ Center for Biomolecular Science and Engineering, University of California, Santa Cruz, Santa Cruz, California 95064, USA. acs@soe.ucsc.edu

Abstract

We have conducted a comprehensive search for conserved elements in vertebrate genomes, using genome-wide multiple alignments of five vertebrate species (human, mouse, rat, chicken, and Fugu rubripes). Parallel searches have been performed with multiple alignments of four insect species (three species of Drosophila and Anopheles gambiae), two species of Caenorhabditis, and seven species of Saccharomyces. Conserved elements were identified with a computer program called phastCons, which is based on a two-state phylogenetic hidden Markov model (phylo-HMM). PhastCons works by fitting a phylo-HMM to the data by maximum likelihood, subject to constraints designed to calibrate the model across species groups, and then predicting conserved elements based on this model. The predicted elements cover roughly 3%-8% of the human genome (depending on the details of the calibration procedure) and substantially higher fractions of the more compact Drosophila melanogaster (37%-53%), Caenorhabditis elegans (18%-37%), and Saccharomyces cerevisiae (47%-68%) genomes. From yeasts to vertebrates, in order of increasing genome size and general biological complexity, increasing fractions of conserved bases are found to lie outside of the exons of known protein-coding genes. In all groups, the most highly conserved elements (HCEs), by log-odds score, are hundreds or thousands of bases long. These elements share certain properties with ultraconserved elements, but they tend to be longer and less perfectly conserved, and they overlap genes of somewhat different functional categories. In vertebrates, HCEs are associated with the 3' UTRs of regulatory genes, stable gene deserts, and megabase-sized regions rich in moderately conserved noncoding sequences. Noncoding HCEs also show strong statistical evidence of an enrichment for RNA secondary structure.

Figures

**Figure 1.**
State-transition diagram for the phylo-HMM used by phastCons, which consists of a state for conserved regions (c) and a state for nonconserved regions (n). Each state is associated with a phylogenetic model (ψ_c and ψ_n); these models are identical except for a scaling parameter ρ (0 ≤ ρ ≤ 1), which is applied to the branch lengths of ψ_c and represents the average rate of substitution in conserved regions as a fraction of the average rate in nonconserved regions (see Methods). Two parameters, μ and ν (0 ≤ μ, ν ≤ 1), define all state-transition probabilities, as illustrated. The probability of visiting each state first (indicated by arcs from the node labeled “begin”) is simply set equal to the probability of that state at equilibrium (stationarity). The model can be thought of as a probabilistic machine that “generates” a multiple alignment, consisting of alternating sequences of conserved (dark gray) and nonconserved (light gray) alignment columns (see example at *bottom*).
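
The two-state chain described in this caption can be sketched in a few lines of Python. The diagram itself is not reproduced here, so the assignment of μ to the conserved-to-nonconserved transition and ν to the reverse transition is an assumption, as are all function names; the phylogenetic emission models ψ_c and ψ_n are left out entirely. Under that assumption the equilibrium probabilities used for the "begin" arcs are ν/(μ+ν) for the conserved state and μ/(μ+ν) for the nonconserved state.

```python
import random


def transition_matrix(mu, nu):
    """Transition probabilities with rows/columns ordered (conserved, nonconserved).

    Assumption: mu = P(c -> n) and nu = P(n -> c).
    """
    return [[1.0 - mu, mu],
            [nu, 1.0 - nu]]


def stationary(mu, nu):
    """Equilibrium probabilities (P(c), P(n)) of the two-state chain."""
    total = mu + nu
    return (nu / total, mu / total)


def sample_states(mu, nu, length, seed=0):
    """Sample a state path such as 'nnnccccnn' of the requested length."""
    if length <= 0:
        return ""
    rng = random.Random(seed)
    p_c, _ = stationary(mu, nu)
    # The first state is drawn from the equilibrium distribution.
    state = "c" if rng.random() < p_c else "n"
    path = [state]
    for _ in range(length - 1):
        if state == "c":
            state = "n" if rng.random() < mu else "c"
        else:
            state = "c" if rng.random() < nu else "n"
        path.append(state)
    return "".join(path)
```

With this parameterization the expected length of a conserved run is 1/μ columns and the expected fraction of conserved columns is ν/(μ+ν), which is one way to see why two parameters are enough to control both element length and coverage.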

**Figure 2.**
The assumed tree topologies for the vertebrate, insect, worm, and yeast data sets (*top* to *bottom*) and the branch lengths estimated for the conserved (*left*) and nonconserved (*right*) states of the phylo-HMM. The conserved and nonconserved phylogenies are identical, except for the scaling constant ρ, which was estimated at 0.33, 0.24, 0.36, and 0.32 (*top* to *bottom*). Horizontal lines indicate branch lengths and are drawn to scale, both within and between species groups. The estimated trees were unrooted; arbitrary roots were chosen for display purposes. Note that some distortions in the branch lengths occur due to alignment-related ascertainment biases (see text and Supplemental material).
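
As a rough illustration of what the scaling constant ρ does, the sketch below multiplies every branch length of a nonconserved tree by ρ to obtain the conserved tree. The branch lengths are made-up placeholder values, not the estimates plotted in the figure, and the Jukes-Cantor formula is swapped in only to turn a branch length into an easily interpreted mismatch probability; it is not claimed to be the substitution model used in the study.

```python
import math


def scale_branches(branch_lengths, rho):
    """Return a copy of the branch lengths multiplied by rho (0 <= rho <= 1)."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return {name: rho * t for name, t in branch_lengths.items()}


def jc_mismatch_probability(t):
    """Jukes-Cantor probability that a site differs across a branch of length t
    (t in expected substitutions per site)."""
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))


# Hypothetical nonconserved branch lengths (substitutions per site).
nonconserved = {"human": 0.10, "mouse": 0.35, "chicken": 0.50}

# rho = 0.33 is the vertebrate estimate quoted in the caption.
conserved = scale_branches(nonconserved, rho=0.33)

for name in nonconserved:
    print(name,
          round(jc_mismatch_probability(nonconserved[name]), 3),
          round(jc_mismatch_probability(conserved[name]), 3))
```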

**Figure 3.**
Fractions of bases of various annotation types covered by predicted conserved elements (*left*) and fractions of bases in conserved elements belonging to various annotation types (*right*). Annotation types include coding regions of known genes (CDS), 5′ and 3′ UTRs of known genes, other regions aligned to mRNAs or spliced ESTs from GenBank (other mRNA), other transcribed regions according to data from Phase 2 of the Affymetrix/NCI Human Transcriptome project (other trans; see Methods), introns of known genes, and other regions (unannotated). All annotations were for the reference genome of each species group and all fractions were computed with respect to these genomes (see Methods). Dashed lines in column graphs indicate expected coverage if conserved elements were distributed uniformly. Transcriptome data was available for the vertebrates only, and UTRs and other mRNAs were omitted for yeast because of sparse data. Note that these graphs change somewhat (but not dramatically) under alternative calibration methods (see Supplemental material).
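
The two kinds of fraction plotted in this figure can be computed from interval overlaps. The sketch below is a minimal version under stated assumptions: intervals are half-open `(start, end)` pairs on a single chromosome, and the function names and example coordinates are hypothetical rather than taken from the study's pipeline.

```python
def merge(intervals):
    """Sort intervals and merge any that overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals):
    """Number of bases covered by a set of intervals."""
    return sum(end - start for start, end in merge(intervals))


def overlap_length(a, b):
    """Number of bases covered by both interval sets."""
    a, b = merge(a), merge(b)
    i = j = covered = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            covered += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return covered


def coverage_fractions(annotation, conserved):
    """Return (fraction of annotation bases in conserved elements,
    fraction of conserved bases belonging to the annotation).
    Both inputs must be non-empty."""
    shared = overlap_length(annotation, conserved)
    return shared / total_length(annotation), shared / total_length(conserved)


# Hypothetical coordinates for one annotation type and the predicted elements.
cds = [(100, 200), (300, 400)]
elements = [(150, 250), (350, 360)]
left_panel, right_panel = coverage_fractions(cds, elements)
```

In this toy case 60 of the 200 annotated bases are conserved, and those 60 bases make up 60 of the 110 conserved bases, corresponding to the left and right panels respectively.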

**Figure 4.**
Screen shots of the conservation tracks in the (A) human and (B) *S. cerevisiae* UCSC Genome Browsers. Each conservation track has two parts, a plot of conservation scores, and *beneath* it, a display showing where each of the other genomes aligns to the reference genome. (Darker shading indicates higher BLASTZ scores; white indicates no alignment.) A separate track labeled “PhastCons Conserved Elements” shows predicted conserved elements and log-odds scores. In A, exons 7–11 of the RNA-edited human gene *GRIA2* are shown. Peaks in the conservation plot generally correspond to exons and valleys to noncoding regions, but a 158-bp conserved noncoding element can be seen near the 3′ end of exon 11. This conserved element includes the editing complementary sequence (ECS) of the RNA editing site in exon 11. The displays seen when zooming in to the base level at a typical exon (*left*) and in the region of the RNA editing site (*right*; see arrow) are shown as *insets*. On the *left*, several synonymous substitutions are visible (highlighted bases) and the elevated conservation abruptly ends after the splice site, while on the *right*, there are fewer synonymous substitutions and the elevated conservation extends into the intron. In the base-level display, the vertical orange bars and numbers above them indicate “hidden” indels and their lengths—i.e., deletions in the human genome or insertions in other genomes. In B, the *S. cerevisiae GAL1* gene and 5′-flanking region are shown. Strong cross-species conservation can be seen in the regulatory region upstream of the promoter, as well as in the protein-coding portion of the gene. The conserved element shown at *bottom* overlaps three *GAL4*-binding sites (highlighted in base-level view). A fourth *GAL4*-binding site also is reflected by a small bump in the conservation scores (*left* arrow), as is the promoter itself (*right* arrow).

**Figure 5.**
Extreme conservation at the 3′ end of the *ELAVL4* (*HuD*) gene, an RNA-binding gene associated with paraneoplastic encephalomyelitis sensory neuropathy and homologous to *Drosophila* genes with established roles in neurogenesis and sex determination (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM). The 3117-bp conserved element that overlaps the 3′ UTR of this gene (arrow) is the fifth highest scoring conserved element in the human genome (log odds score 2475). Several conserved elements in introns are also visible.

**Figure 6.**
Histograms of folding potential scores (FPSs) for (A) highly conserved elements (HCEs) in 3′ UTRs vs. a random sample of 3′ UTRs without HCEs, (B) HCEs in 3′ UTRs vs. HCEs in 5′ UTRs, and (C) HCEs in introns vs. HCEs in coding regions (vertebrate data in all cases). Scores are based on a phylogenetic stochastic context-free grammar, and represent the potential for local secondary structure in a sliding window of 150 bp (see Methods). In all three cases, the difference between the distributions is highly statistically significant (P = 8.8 e-66, P = 1.1 e-8, and P = 4.4 e-215, respectively; Wilcoxon rank sum test).
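
For readers unfamiliar with the test named in this caption, the following is a minimal standard-library sketch of a two-sided Wilcoxon rank-sum test using the normal approximation with a tie correction. It is not the authors' code, the function name is hypothetical, and for real analyses a library routine such as `scipy.stats.ranksums` would normally be used instead.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p) where z is the standardized rank sum of sample x.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the block of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mean = n1 * (n + 1) / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (rank_sum_x - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return z, p
```

Calling `rank_sum_test(scores_a, scores_b)` on two lists of folding potential scores would give a negative z when the first sample tends to have lower scores than the second.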

Web site references

1. http://www.cse.ucsc.edu/~acs/conservation; Supplemental data for this study.
2. http://genome.ucsc.edu; UC Santa Cruz Genome Browser.
3. http://genome.ucsc.edu/cgi-bin/hgTables; UC Santa Cruz Table Browser.
4. http://www.genetics.wustl.edu/saccharomycesgenomes/Contigs; download page for yeast sequence data, Washington University, St. Louis.
5. http://www.broad.mit.edu/ftp/pub/annotation/fungi/comp_yeasts; download page for yeast sequence data, Broad Institute.
