Comparative Study
J Virol. 2017 Mar 29;91(8):e02381-16.
doi: 10.1128/JVI.02381-16. Print 2017 Apr 15.

Dinucleotide Composition in Animal RNA Viruses Is Shaped More by Virus Family than by Host Species

Francesca Di Giallonardo et al. J Virol.

Abstract

Viruses use the cellular machinery of their hosts for replication. It has therefore been proposed that the nucleotide and dinucleotide compositions of viruses should match those of their host species. If this is upheld, it may then be possible to use dinucleotide composition to predict the true host species of viruses sampled in metagenomic surveys. However, it is also clear that different taxonomic groups of viruses tend to have distinctive patterns of dinucleotide composition that may be independent of host species. To determine the relative strength of the effect of host versus virus family in shaping dinucleotide composition, we performed a comparative analysis of 20 RNA virus families from 15 host groupings, spanning two animal phyla and more than 900 virus species. In particular, we determined the odds ratios for the 16 possible dinucleotides and performed a discriminant analysis to evaluate the capability of virus dinucleotide composition to predict the correct virus family or host taxon from which it was isolated. Notably, while 81% of the data analyzed here were predicted to the correct virus family, only 62% of these data were predicted to their correct subphylum/class host and a mere 32% to their correct mammalian order. Similarly, dinucleotide composition has a weak predictive power for different hosts within individual virus families. We therefore conclude that dinucleotide composition is generally uniform within a virus family but less well reflects that of its host species. This has obvious implications for attempts to accurately predict host species from virus genome sequences alone.

IMPORTANCE Determining the processes that shape virus genomes is central to understanding virus evolution and emergence. One question of particular importance is why nucleotide and dinucleotide frequencies differ so markedly between viruses. In particular, it is currently unclear whether host species or virus family has the biggest impact on dinucleotide frequencies and whether dinucleotide composition can be used to accurately predict host species. Using a comparative analysis, we show that dinucleotide composition has a strong phylogenetic association across different RNA virus families, such that dinucleotide composition can predict the family from which a virus sequence has been isolated. Conversely, dinucleotide composition has a poorer predictive power for the different host species within a virus family and across different virus families, indicating that the host has a relatively small impact on the dinucleotide composition of a virus genome.

Keywords: dinucleotide bias; evolution.

Figures

FIG 1
Schematic depiction of the dinucleotide odds ratio bias across the animal RNA virus data sets analyzed here. The figure shows both dinucleotide underrepresentation (cool colors) and overrepresentation (warm colors). The degree of under- or overrepresentation is depicted by the different shadings: light, 50 to 74% of component virus data sets; medium, 75 to 89%; dark, 90 to 100%. (A) Virus families; (B) host categories.
FIG 2
Dinucleotide odds ratios. The figure shows observed over expected ratios (odds ratios) from the aggregated set of 1,024 data sets for the four dinucleotides CpA, CpG, UpA, and UpG. Dinucleotides are regarded as underrepresented if the odds ratio is below 0.78 and overrepresented if it is over 1.23 (dashed lines). Boxplots show the 25 to 75% data range and the median, whiskers indicate the 99.3% data coverage, and outliers are shown as dots. (A) ssRNA(+) viruses are in purple shades (Arteriviridae, Astroviridae, Caliciviridae, Coronaviridae, Dicistroviridae, Flaviviridae, Hepeviridae, Nodaviridae, Picornaviridae, and Togaviridae), ssRNA(−) viruses are in green shades (Arenaviridae, Bunyaviridae, bunya-arena-like, Filoviridae, Orthomyxoviridae, Paramyxoviridae, Rhabdoviridae, and mononega-like), and dsRNA viruses are in red shades (Birnaviridae and Reoviridae). (B) Data sets from nonmammalian vertebrates are shown in pink colors (“Birds,” “Fish,” and “Reptiles”), those from mammalian hosts are shown in brown colors (“Bats,” “Carnivores,” “Cetartiodactyla,” “Equines,” “Insectivores,” “Lagomorphs,” “Primates,” and “Rodents”), those from arthropod hosts are shown in blue colors (“Chelicerata,” “Crustacea,” and “Insecta”), and vector-borne viruses were grouped into their own category and shown in gray.
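As an illustration of the observed-over-expected ratio described in the FIG 2 caption, the following is a minimal Python sketch, not the authors' pipeline. It assumes that the expected frequency of a dinucleotide XpY is the product of the single-nucleotide frequencies of X and Y in the same sequence, which is a common convention but is not stated on this page. The function names `dinucleotide_odds_ratios` and `classify` are hypothetical; only the thresholds 0.78 and 1.23 come from the caption.

```python
from collections import Counter
from itertools import product

BASES = "ACGU"
# Under-/overrepresentation thresholds quoted in the FIG 2 caption.
UNDER, OVER = 0.78, 1.23


def dinucleotide_odds_ratios(seq):
    """Return {dinucleotide: observed/expected} for one RNA sequence.

    Assumption: expected frequency of XpY = f(X) * f(Y), with f taken
    from the same sequence. Ambiguous bases (e.g. N) are skipped.
    """
    seq = seq.upper().replace("T", "U")
    bases = [b for b in seq if b in BASES]
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)
             if seq[i] in BASES and seq[i + 1] in BASES]
    if not pairs:
        raise ValueError("sequence contains no unambiguous dinucleotide")
    mono, di = Counter(bases), Counter(pairs)
    n_mono, n_di = len(bases), len(pairs)
    ratios = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        observed = di[x + y] / n_di
        # The ratio is undefined when one of the two bases never occurs.
        ratios[x + y] = observed / expected if expected > 0 else float("nan")
    return ratios


def classify(ratio):
    """Label a ratio using the FIG 2 thresholds."""
    if ratio != ratio:  # NaN check
        return "undefined"
    if ratio < UNDER:
        return "under"
    if ratio > OVER:
        return "over"
    return "neutral"
```

Under these assumptions, a call such as `classify(dinucleotide_odds_ratios(genome)["CG"])` would label CpG in a single sequence. How sequences were aggregated into the data sets shown in the figure is not described here, so that step is left out.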
FIG 3
Confusion plots for the discriminant analysis of dinucleotide odds ratios across virus families and virus hosts. The heat maps show the mean percentages of data sets that were predicted into each category. Rows represent the true categories and columns the predicted categories. The correctly predicted sensitivities for each category are shown in bold and positioned along the diagonal. Dark red and black indicate high sensitivity rates and yellow and white low sensitivity rates. On the right side of the heat maps the number of data sets per category (n) and the false-discovery rate (FDR) are indicated. (A) Heat map for the virus families. Overall, 81% of the data were predicted correctly, compared to 10% with the baseline random model. (B) Heat map for the virus host separated by subphylum/class (left) and mammalian orders (right). For the analysis at the subphylum/class level, 62% of the data were predicted correctly overall, compared to 29% with the baseline random model. In the case of the mammalian orders, 32% of the data were predicted correctly, compared to 12% with the baseline random model.
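The quantities named in the FIG 3 caption can be sketched from a list of true and predicted labels. The Python below is a hedged illustration only: it assumes that sensitivity is the row-normalized diagonal entry (rows are true categories, columns are predicted categories, as in the caption) and that the false-discovery rate of a category is the share of predictions for that category that are wrong. The page does not define FDR, so that definition is an assumption. The function name `confusion_summary` is hypothetical, and the discriminant analysis that produces the predictions is not reproduced.

```python
from collections import Counter


def confusion_summary(true_labels, predicted_labels):
    """Summarize predictions as percentages.

    Returns (matrix, sensitivity, fdr, overall) where
    matrix[t][p] is the percentage of true category t predicted as p.
    """
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and equally long")
    labels = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    row_totals = Counter(true_labels)        # n per true category
    col_totals = Counter(predicted_labels)   # n per predicted category

    matrix = {
        t: {p: 100.0 * counts[(t, p)] / row_totals[t] if row_totals[t] else 0.0
            for p in labels}
        for t in labels
    }
    # Diagonal of the row-normalized matrix.
    sensitivity = {c: matrix[c][c] for c in labels}
    # Assumed definition: wrong predictions / all predictions for category c.
    fdr = {
        c: (100.0 * (col_totals[c] - counts[(c, c)]) / col_totals[c]
            if col_totals[c] else float("nan"))
        for c in labels
    }
    overall = 100.0 * sum(counts[(c, c)] for c in labels) / len(true_labels)
    return matrix, sensitivity, fdr, overall
```

The `overall` value corresponds to the kind of headline percentage quoted in the caption (for example, the share of data sets predicted to the correct virus family), while `sensitivity` and `fdr` correspond to the per-category diagonal and right-hand columns of the heat maps.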
FIG 4
Canonical score plot of the host categories by class. The figure shows a scatterplot of the two linear discriminant functions that explain the largest amount of variability from the linear discriminant analysis (50% and 19% for LD1 and LD2, respectively).
