Genome Biol Evol. 2012;4(11):1176-87.
doi: 10.1093/gbe/evs081.

Population diversity of ORFan genes in Escherichia coli

Guoqin Yu et al. Genome Biol Evol. 2012.

Abstract

The origin and evolution of "ORFans" (suspected genes without known relatives) remain unclear. Here, we take advantage of a unique opportunity to examine the population diversity of thousands of ORFans, based on a collection of 35 complete genomes of isolates of Escherichia coli and Shigella (which is included phylogenetically within E. coli). As expected from previous studies, ORFans are shorter and more AT-rich in sequence than non-ORFans. We find that ORFans often are very narrowly distributed: the most common pattern is for an ORFan to be found in only one genome. We compared the within-species population diversity of ORFan genes with that of two control groups of non-ORFan genes. Patterns of population variation suggest that most ORFans are not artifacts, but are real genes whose protein-coding capacity is conserved, reflecting selection against nonsynonymous mutations. Nevertheless, nonsynonymous nucleotide diversity is higher for ORFans than for non-ORFans, whereas synonymous diversity is roughly the same. In particular, there is a several-fold excess of ORFans in the highest decile of diversity relative to controls, which might be due to weaker purifying selection, positive selection, or a subclass of ORFans that are decaying.

Figures

Fig. 1.
Clades used to define comparison groups with different phylogenetic depths. The t1 and t2 clades were chosen due to high bootstrap support (>96%) in a phylogeny of species computed as described (Materials and Methods) and available as supplementary figure S1, Supplementary Material online.
Fig. 2.
Scheme for creating matching control clusters for each ORFan cluster. The pseudocode shown here describes the method used to generate customized control clusters. If ORFan_list is not equal to the intersection of t_list and ORFan_list, then the putative control cluster cannot be used because it does not have the right set of strains.
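The strain check described in this caption can be sketched as follows. This is a minimal illustration, not the authors' pseudocode: the function name `matching_control`, the dictionary representation of a cluster, and the final step of trimming the control to the ORFan's strains are assumptions made for the example. Only the names `ORFan_list` and `t_list` and the intersection test come from the caption.

```python
def matching_control(orfan_strains, candidate_cluster):
    """Return a control cluster restricted to the ORFan's strains, or None.

    orfan_strains: iterable of strain names in which the ORFan occurs.
    candidate_cluster: dict mapping strain name -> gene identifier for a
        putative non-ORFan control cluster (hypothetical representation).
    """
    orfan_list = set(orfan_strains)
    t_list = set(candidate_cluster)

    # Caption rule: the candidate is unusable unless ORFan_list equals
    # the intersection of t_list and ORFan_list, i.e. every ORFan strain
    # is also present in the candidate cluster.
    if orfan_list != (t_list & orfan_list):
        return None

    # Assumption: the matched control keeps only the ORFan's strains so
    # that both clusters have the same strain composition.
    return {strain: candidate_cluster[strain] for strain in sorted(orfan_list)}
```

A candidate that lacks even one of the ORFan's strains is rejected outright rather than partially matched, which keeps the ORFan and its control comparable strain for strain.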
Fig. 3.
ORFan composition as a function of genome size in 35 Escherichia coli strains. For each genome, counts are shown for three categories of putative protein-coding genes, along with regression lines. Two of the categories are mutually exclusive: each gene in a genome is either from a cluster (in the NCBI Protein Clusters database) that has a curated functional annotation (solid circles), or it is from a cluster annotated as “hypothetical protein” (plus symbols). The solid squares show the counts of ORFans, the vast majority of which are noncurated (see text). As genome size increases, the number of proteins with assigned functions remains nearly constant. The increase in genome size is not mainly attributable to ORFans, but to other genes for which functions are unknown.
Fig. 4.
Genic features of ORFans compared with non-ORFans. (A) Average size (in base pairs). ORFans are shorter than non-ORFans. (B) Average percent GC at the first (GC1), second (GC2), and third (GC3) positions of codons. Except for GC2 in (B), the three classes of clusters differ significantly in genic features. ORFans have lower GC content at the first and third positions of codons. Bars represent 95% confidence intervals. The letters denote significantly different results by the Wilcoxon test (results from Student’s t-test are the same).
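For readers unfamiliar with GC1, GC2, and GC3, the quantities in panel (B) can be computed per gene roughly as below. This is a sketch under stated assumptions, not the authors' code: the function name `gc_by_codon_position` is hypothetical, the input is assumed to be an in-frame coding sequence, and any trailing partial codon is ignored.

```python
def gc_by_codon_position(cds):
    """Return (GC1, GC2, GC3) as percentages for an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3   # drop a trailing partial codon, if any
    n_codons = usable // 3
    if n_codons == 0:
        raise ValueError("sequence contains no complete codon")

    counts = [0, 0, 0]
    for i in range(usable):
        if cds[i] in "GC":
            counts[i % 3] += 1         # i % 3 is the position within the codon

    return tuple(100.0 * c / n_codons for c in counts)


print(gc_by_codon_position("ATGGCC"))  # (50.0, 50.0, 100.0)
```

Averaging these per-gene values within each class of clusters would give bars comparable in kind to those in the figure.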
Fig. 5.
Distribution of ORFan and non-ORFan genes among genomes of Escherichia coli strains. (A) The average number of E. coli strains per protein cluster in the ORFan and non-ORFan cluster groups; (B) the frequency distribution of the number of E. coli strains per cluster used in the ORFan and non-ORFan comparison groups (this excludes ORFan clusters with only one member, which is the most common size of a cluster). ORFans typically have narrow distributions, while non-ORFans in the t2 comparison group are present in most genomes. Non-ORFans in the t1 group have an intermediate distribution. The letters in (A) denote significantly different results by the Wilcoxon test (results from Student’s t-test are the same).
Fig. 6.
Mean population statistics compared between ORFans and non-ORFans. This figure shows a comparison of means for pi (upper panel), dS (middle panel), and dN (lower panel), whereas the complete distributions are compared, via deciles, in figure 7. There are two different sets of mean values for ORFan clusters, because the comparisons with t1 and t2 use overlapping but nonidentical sets of ORFan clusters, due to the need to create matching controls with the same strain composition (see Materials and Methods). Although synonymous diversity is not much different, nonsynonymous diversity, as well as total diversity (Pi), is significantly different between ORFans and non-ORFans in the t2 control group. The letters denote significantly different results by the Wilcoxon test; Student’s t-test gives a slightly different result for dN, not shown, with no significant distinction between t1 and the ORFan clusters (i.e., the pattern is a–a–a–b).
Fig. 7.
Distribution of diversity statistics compared between ORFans and non-ORFan controls. The three rows show the distributions for pi (upper), dS (middle) and dN (lower), for both t1 (left column) and t2 (right column) control sets. To understand the shape of the distribution of population statistics for ORFans, values for ORFans were gathered into decile bins defined by the non-ORFan control clusters, that is, each bin comprises 10% of the distribution of non-ORFan values. The value on the Y axis for the first decile bin in (A), for instance, represents the frequency with which the Pi value for an ORFan ranks in the top 10% of the values in its customized t1 control group. (C, E) The same comparison for dN and dS; (B, D, and F) the distribution of Pi, dN, and dS (respectively) relative to the t2 comparison group. The null expectation is a straight line at a value of 10%, with a slight anomaly at the low (right) end of the distribution due to zero values (in cases where zero values exceed 10% of the control distribution, zero values in the ORFan distribution will be placed in whichever bin is counted first, which in this case tends to leave a shortage in the last bin). The symmetric but slightly U-shaped distribution of dS values indicates that ORFans exhibit greater variance, but otherwise have the same distribution of synonymous differences as non-ORFans. However, the deviation from the distribution of dN (and Pi) values is asymmetric, with a 2-fold or more excess of ORFan clusters with diversity in the top 10% or 20% of the distribution relative to non-ORFan controls.
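The decile binning described in this caption can be illustrated with a short sketch. It is a simplified stand-in, not the authors' procedure: the function name `decile_frequencies` is hypothetical, each ORFan value is ranked against a single pooled list of control values rather than a customized control group, and ties (including zeros) are handled by a plain "count of strictly greater controls" rule, which may differ from the handling described above.

```python
from bisect import bisect_right


def decile_frequencies(orfan_values, control_values):
    """Percent of ORFan values falling in each decile of the control values.

    Bin 0 is the top 10% of the control distribution, bin 9 the bottom 10%.
    Under the null expectation every bin holds about 10% of the ORFans.
    """
    controls = sorted(control_values)
    n = len(controls)
    if n == 0 or not orfan_values:
        raise ValueError("both value lists must be non-empty")

    bins = [0] * 10
    for value in orfan_values:
        # Number of control values strictly greater than this ORFan value.
        greater = n - bisect_right(controls, value)
        # Integer arithmetic avoids floating-point rounding at bin edges.
        bins[min((greater * 10) // n, 9)] += 1

    return [100.0 * count / len(orfan_values) for count in bins]


controls = list(range(1, 101))            # toy control values 1..100
freqs = decile_frequencies([100.5, 95, 50, 0], controls)
print(freqs[0], freqs[5], freqs[9])       # 50.0 25.0 25.0
```

An excess of ORFans in the highest-diversity bins would show up here as values well above 10% in the first one or two entries of the returned list.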
