mSphere. 2020 Jun 24;5(3):e00408-20.
doi: 10.1128/mSphere.00408-20.

Rampant C→U Hypermutation in the Genomes of SARS-CoV-2 and Other Coronaviruses: Causes and Consequences for Their Short- and Long-Term Evolutionary Trajectories

P Simmonds. mSphere.

Abstract

The pandemic of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has motivated an intensive analysis of its molecular epidemiology following its worldwide spread. To understand the early evolutionary events following its emergence, a data set of 985 complete SARS-CoV-2 sequences was assembled. Variants showed a mean of 5.5 to 9.5 nucleotide differences from each other, consistent with a midrange coronavirus substitution rate of 3 × 10⁻⁴ substitutions/site/year. Almost one-half of sequence changes were C→U transitions, with an 8-fold base frequency normalized directional asymmetry between C→U and U→C substitutions. Elevated ratios were observed in other recently emerged coronaviruses (SARS-CoV, Middle East respiratory syndrome [MERS]-CoV), and decreasing ratios were observed in other human coronaviruses (HCoV-NL63, -OC43, -229E, and -HKU1) proportionate to their increasing divergence. C→U transitions underpinned almost one-half of the amino acid differences between SARS-CoV-2 variants and occurred preferentially in both 5′ U/A and 3′ U/A flanking sequence contexts comparable to favored motifs of human APOBEC3 proteins. Marked base asymmetries observed in nonpandemic human coronaviruses (U ≫ A > G ≫ C) and low G+C contents may represent long-term effects of prolonged C→U hypermutation in their hosts. The evidence that much of sequence change in SARS-CoV-2 and other coronaviruses may be driven by a host APOBEC-like editing process has profound implications for understanding their short- and long-term evolution. Repeated cycles of mutation and reversion in favored mutational hot spots and the widespread occurrence of amino acid changes with no adaptive value for the virus represent a quite different paradigm of virus sequence change from neutral and Darwinian evolutionary frameworks and are not incorporated by standard models used in molecular epidemiology investigations.

IMPORTANCE The wealth of accurately curated sequence data for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), its long genome, and its low substitution rate provides a relatively blank canvas with which to investigate effects of mutational and editing processes imposed by the host cell. The finding that a large proportion of sequence change in SARS-CoV-2 in the initial months of the pandemic comprised C→U mutations in a host APOBEC-like context provides evidence for a potent host-driven antiviral editing mechanism against coronaviruses more often associated with antiretroviral defense. In evolutionary terms, the contribution of biased, convergent, and context-dependent mutations to sequence change in SARS-CoV-2 is substantial, and these processes are not incorporated by standard models used in molecular epidemiology investigations.
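
The abstract's statement that 5.5 to 9.5 pairwise nucleotide differences are consistent with a rate of 3 × 10⁻⁴ substitutions/site/year can be checked with rough arithmetic. The sketch below is an illustration, not the author's method: it assumes a genome length of 29,903 nucleotides (the length of the prototype sequence MN908947) and the simple approximation that two sequences sharing an ancestor t years ago differ at about 2 × rate × length × t sites. The function names are hypothetical.

```python
# Rough consistency check between pairwise differences and substitution rate.
# Assumptions (not from the abstract): genome length of 29,903 nt and the
# approximation d = 2 * rate * length * t for two lineages diverging for t years.

GENOME_LENGTH = 29_903   # assumed length of MN908947, in nucleotides
RATE = 3e-4              # substitutions/site/year, as quoted in the abstract


def expected_pairwise_differences(years, rate=RATE, length=GENOME_LENGTH):
    """Expected nucleotide differences between two sequences whose
    common ancestor existed `years` ago (both lineages accumulate changes)."""
    return 2 * rate * length * years


def years_since_ancestor(differences, rate=RATE, length=GENOME_LENGTH):
    """Invert the relationship above: time implied by a pairwise difference count."""
    return differences / (2 * rate * length)


if __name__ == "__main__":
    for d in (5.5, 9.5):
        print(f"{d} differences -> about {years_since_ancestor(d):.2f} years")
```

Under these assumptions the quoted differences correspond to a few months of divergence, which is the order of magnitude expected for sequences collected in the first months of the pandemic.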

Keywords: APOBEC; COVID-19; SARS; SARS coronavirus 2; SARS-CoV-2; coronavirus; hypermutation.

Figures

FIG 1
Association between sequence divergence and dN/dS ratio. A comparison of dN/dS ratios in recently emerged coronaviruses (red circles), other human coronaviruses and relatives infecting other species (blue circles), and a collection of bat sarbecoviruses (SARS-like) (pink circle). A power law line of best fit showed a significant correlation between divergence and dN/dS ratio (P = 0.000006). Sequences of the three data sets of EBOV control sequences were included (gray triangles).
FIG 2
Association of excess C→U transitions with divergence. (A) Numbers of sites in the SARS-CoV-2 genome with each of the four transitions. Bar heights represent the means from the three sequence samples; error bars show one standard deviation (SD). (B) Relationship between sequence diversity and a normalized metric of asymmetry between the numbers of C→U and U→C transitions (where 1.0 is the expected number). Power law regression line was significant at a P value of <0.0001. (C) Association of dN/dS ratio with C→U/U→C asymmetry. The power law regression lines were significant at P values of 0.001 and 0.0004, respectively. Points are colored as in Fig. 1.
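
The normalized asymmetry metric in panel B can be illustrated with a short sketch. It assumes the normalization is the ratio of C→U to U→C counts multiplied by the ratio of U to C base frequencies, so that a value of 1.0 means each transition occurs at the same per-site rate. That formula is an assumption based on the phrase "base frequency normalized"; the function names are hypothetical.

```python
from collections import Counter


def base_frequencies(seq):
    """Return the fraction of A, C, G and U in a sequence (T is read as U;
    ambiguous characters are ignored)."""
    seq = seq.upper().replace("T", "U")
    counts = Counter(b for b in seq if b in "ACGU")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return {b: counts[b] / total for b in "ACGU"}


def normalized_asymmetry(n_c_to_u, n_u_to_c, freq_c, freq_u):
    """Assumed normalization: (C->U / U->C) * (f_U / f_C).

    A value of 1.0 means both transitions occur at the same rate per
    available site; values above 1.0 indicate an excess of C->U."""
    if n_u_to_c == 0 or freq_c == 0:
        raise ValueError("cannot normalize with zero U->C count or zero C frequency")
    return (n_c_to_u / n_u_to_c) * (freq_u / freq_c)


if __name__ == "__main__":
    # Illustrative counts only; these are not values from the paper.
    print(normalized_asymmetry(40, 10, freq_c=0.2, freq_u=0.3))
```

The multiplication by f_U / f_C matters because a U-rich, C-poor genome offers more sites for U→C than for C→U changes, so raw counts alone would understate any C→U excess.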
FIG 3
Positions of C→U transitions in the SARS-CoV-2 genome in each of the three replicate SARS-CoV-2 sequence data sets were matched to a genome diagram of SARS-CoV-2 (using the annotation from the prototype sequence MN908947). The numbers of transitions at each site are shown on a log scale, with the shortest bars indicating individual substitutions.
FIG 4
Phylogeny of SARS-CoV-2 and positions of sequences with C→U changes. A neighbor-joining tree of 865 SARS-CoV-2 complete genome sequences was constructed in MEGA6 (41). Labels show the position of sequences containing a selection of C→U transitions at the genome positions indicated in the key.
FIG 5
Amino acid changes induced by different nucleotide substitutions. Numbers of individual amino acid changes observed in the combined SARS-CoV-2 data set (864 sequences) at a 5% variability threshold. Bars are colored based on the underlying nucleotide changes. Inset graph shows the relative proportions of transitions leading to amino acid changes.
FIG 6
Influence of 5′ and 3′ base contexts on C↔U and G↔A transition frequencies. Totals of each transition in the SARS-CoV-2 sequence data set split into subtotals based on the identity of the 5′ (left) and 3′ (right) base. Bar heights represent the means from the three sequence samples; error bars show standard deviations. A further division into the 16 combinations of 5′ and 3′ base contexts is provided in Fig. S1 in the supplemental material.
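
Splitting transitions by flanking base, as described in this caption, amounts to a simple tally. The sketch below is a minimal illustration, assuming substitutions are supplied as 0-based positions in a reference sequence together with the variant base. The input format and the function name are hypothetical and are not taken from the paper.

```python
from collections import Counter


def context_counts(reference, substitutions, ref_base="C", alt_base="U"):
    """Tally one transition type by its 5' and 3' neighboring bases.

    reference     -- reference sequence (T is read as U)
    substitutions -- iterable of (position, variant_base), 0-based positions
    Returns two Counters: counts keyed by 5' base and by 3' base.
    """
    ref = reference.upper().replace("T", "U")
    five_prime, three_prime = Counter(), Counter()
    for pos, alt in substitutions:
        # Skip the first and last positions, which lack one neighbor.
        if pos <= 0 or pos >= len(ref) - 1:
            continue
        alt = alt.upper().replace("T", "U")
        if ref[pos] != ref_base or alt != alt_base:
            continue
        five_prime[ref[pos - 1]] += 1
        three_prime[ref[pos + 1]] += 1
    return five_prime, three_prime


if __name__ == "__main__":
    # Toy reference and substitutions, for illustration only.
    five, three = context_counts("ACUGCAUCG", [(1, "U"), (4, "U"), (7, "T"), (3, "A")])
    print(dict(five), dict(three))
```

Calling the function again with `ref_base="U", alt_base="C"` (or the G and A pair) gives the comparison transitions; the raw tallies would still need to be related to how often each context occurs in the genome before drawing conclusions about preference.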
FIG 7
Base frequencies in different coronaviruses. Relationship between G+C content and frequencies of individual bases in coronaviruses. The associations between C depletion and U enrichment with G+C content were both significant by linear regression at P = 5 × 10⁻⁷ and P = 5 × 10⁻⁶, respectively. No significant associations were observed between G+C content and G (P = 0.05) or A (P = 0.62) frequencies. Arrows are color coded as for Fig. 1.
FIG 8
Suppression of CpG dinucleotides in SARS-CoV-2 and other coronaviruses. Comparison of CpG frequencies of SARS-CoV-2, other coronaviruses, and a set of other mammalian RNA viruses; each data point represents an individual currently classified species; accession numbers are listed in Table S1. CpG frequencies were expressed as the ratio of their observed frequency to the expected frequency based on their G+C content (y axis).
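
An observed-to-expected CpG ratio of the kind plotted here can be sketched as follows. The caption states only that the expectation is based on G+C content; this example assumes the common convention that the expected CpG frequency is the product of the C and G base frequencies, which may differ from the exact calculation used for the figure. The function name is hypothetical.

```python
def cpg_observed_expected(seq):
    """Ratio of observed CpG dinucleotide frequency to the frequency
    expected from base composition, assumed here to be f(C) * f(G).

    Values below 1.0 indicate CpG suppression."""
    seq = seq.upper().replace("U", "T")
    n_bases = sum(1 for b in seq if b in "ACGT")
    # Only count dinucleotides where both bases are unambiguous.
    pairs = [
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in "ACGT" and seq[i + 1] in "ACGT"
    ]
    if n_bases == 0 or not pairs:
        raise ValueError("sequence too short or fully ambiguous")
    f_c = seq.count("C") / n_bases
    f_g = seq.count("G") / n_bases
    if f_c == 0 or f_g == 0:
        raise ValueError("sequence lacks C or G; ratio is undefined")
    f_cpg = pairs.count("CG") / len(pairs)
    return f_cpg / (f_c * f_g)


if __name__ == "__main__":
    # Toy sequence, for illustration only.
    print(cpg_observed_expected("ACGUACGGUUCA"))
```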

References

    1. Li Q, Guan X, Wu P, Wang X, Zhou L, Tong Y, Ren R, Leung KSM, Lau EHY, Wong JY, Xing X, Xiang N, Wu Y, Li C, Chen Q, Li D, Liu T, Zhao J, Liu M, Tu W, Chen C, Jin L, Yang R, Wang Q, Zhou S, Wang R, Liu H, Luo Y, Liu Y, Shao G, Li H, Tao Z, Yang Y, Deng Z, Liu B, Ma Z, Zhang Y, Shi G, Lam TTY, Wu JT, Gao GF, Cowling BJ, Yang B, Leung GM, Feng Z. 2020. Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia. N Engl J Med 382:1199–1207. doi:10.1056/NEJMoa2001316.
    2. Zhou P, Yang XL, Wang XG, Hu B, Zhang L, Zhang W, Si HR, Zhu Y, Li B, Huang CL, Chen HD, Chen J, Luo Y, Guo H, Jiang RD, Liu MQ, Chen Y, Shen XR, Wang X, Zheng XS, Zhao K, Chen QJ, Deng F, Liu LL, Yan B, Zhan FX, Wang YY, Xiao GF, Shi ZL. 2020. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579:270–273. doi:10.1038/s41586-020-2012-7.
    3. Zhu N, Zhang D, Wang W, Li X, Yang B, Song J, Zhao X, Huang B, Shi W, Lu R, Niu P, Zhan F, Ma X, Wang D, Xu W, Wu G, Gao GF, Tan W, China Novel Coronavirus Investigating and Research Team. 2020. A novel coronavirus from patients with pneumonia in China, 2019. N Engl J Med 382:727–733. doi:10.1056/NEJMoa2001017.
    4. Wu F, Zhao S, Yu B, Chen YM, Wang W, Song ZG, Hu Y, Tao ZW, Tian JH, Pei YY, Yuan ML, Zhang YL, Dai FH, Liu Y, Wang QM, Zheng JJ, Xu L, Holmes EC, Zhang YZ. 2020. A new coronavirus associated with human respiratory disease in China. Nature 579:265–269. doi:10.1038/s41586-020-2008-3.
    5. Mavian C, Marini S, Manes C, Capua I, Prosperi M, Salemi M. 20 March 2020. Regaining perspective on SARS-CoV-2 molecular tracing and its implications. medRxiv https://www.medrxiv.org/content/10.1101/2020.03.16.20034470v1.
