The Viral Genomes Project aims to provide molecular standards for viral genomic research. The project has produced over 1,600 records for more than 1,200 different species. The National Center for Biotechnology Information (NCBI) provides access to this data through the Entrez search and retrieval engine and offers visualization of the sequence information at various levels of detail. Taxonomically organized displays, precomputed sequence comparison data, and direct access to analytical tools provide researchers with the ability to analyze and compare viral genomes and proteomes in a fast and convenient manner. The Viral Genomes Project is a collaborative effort between NCBI staff and many dedicated scientists worldwide. The URL for the database is http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/viruses.html.
As the number of viral records in the public sequence databases (GenBank, EMBL, and DDBJ) grows, retrieving a viral genomic sequence of interest with associated information is becoming increasingly complex. High redundancy in the databases is a common problem for all organisms; in the case of viruses, however, the large number of available strains, isolates, and mutants further exacerbates the problem. For example, a search of Entrez Nucleotide currently retrieves more than 95,500 records for Human immunodeficiency virus 1 (HIV-1) and more than 22,500 records for Hepatitis C virus (HCV) alone; the total number of viral nucleotide records exceeds 220,000. Among these are both partial and complete genomic sequences, including partial sequences marked as a complete genome by submitters. Historically, sequence databases were merely archives of sequences directly submitted by users. Although a stricter submission procedure has been applied in recent years and therefore the quality of sequence records has greatly improved, a significant number of records are still underannotated, and the information in the old sequence records is often outdated. Furthermore, viral genomes are remarkably variable, consisting of either single-stranded or double-stranded DNA or RNA in either linear or circular form and comprising one or more segments. This variability makes viral records especially prone to inaccuracies in molecular information annotation.
To cope with these problems, NCBI has created the Viral Genomes Project as a part of the NCBI Genomes Project (19). Only complete or, occasionally, nearly complete viral genomic sequences missing only nontranslated portions (usually the ends of a genomic molecule) are being collected for this project, thereby greatly reducing redundancy. All available complete viral genomic sequences are being collected in order to faithfully represent the great genome variability found in many viruses. For example, 314 complete genome sequences of HIV-1 from various strains and isolates are included in the Entrez Genome collection, but only one sequence (NC_001802) has been selected as a reference (RefSeq) to serve as a molecular standard.
RefSeq records are manually curated to correct and update content in the original sequence records, which often involves consultations with the original submitters and/or other outside experts. The collection of preselected reference sequences greatly facilitates comparison of the genomes of different viruses. As of December 2003, the Viral Genomes Project contained 1,677 viral reference genomic sequences representing 1,223 virus species, which make a significant contribution to the NCBI RefSeq collection (13). Figure 1 shows the growth of the viral RefSeq collection during the past 3 years.
While a number of databases provide information on viral sequences, most of them are limited to certain families or groups (reviewed in references 4, 3, 6, and 12; http://www.dpvweb.net/index.php). The most comprehensive and well-established viral database, ICTVdB, provides “searchable descriptions of virus isolates, species, genera, families, orders; images of many viruses; and links to genomic and protein databanks” (2). ICTVdB has been a primary resource for information about biological properties of viruses. It plays a major role in viral taxonomic classification, on which our project relied heavily. ICTVdB does present links to viral sequences, but these sequences are original records from public sequence databases and therefore may contain inaccurate or outdated information. The Viral Genomes Project described here is the first comprehensive resource that provides access to the curated set of complete viral genomes in an easily navigable way and offers a collection of tools and precomputed results which greatly facilitate viral genome analysis. These precomputed analyses and tools include the global alignment of genome neighbors, available in both text and graphical forms; viral protein clusters (VOGs), (putative) functional and evolutionary groups of viral proteins derived from RefSeq genomes (which eliminates redundancy) and classified by sequence similarity; convenient VOG displays, including those integrated with the Conserved Domain Database (CDD), an NCBI collection of conserved protein domains; and a BLAST search against a selected set of viral proteins. To start exploring the Viral Genomes resources, go to http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/viruses.html.
CONSTRUCTION OF VIRAL GENOMES RESOURCE
Identification of a complete viral genome.
The creation of an entry in the Viral Reference Genome collection begins with the identification of a complete genome source record in GenBank. Source sequences are initially selected by an automatic screening if either of the following conditions applies: (i) the topology of the sequence is circular or (ii) the definition of the sequence contains any of the following phrases: “complete genome,” “complete chromosome,” “sequence of the genome of,” or “complete genomic sequence.” The candidate sequences are then added to the Viral Reference Genome collection once the NCBI curators verify that they are complete or contain complete coding regions with only 5′ and/or 3′ untranslated regions missing.
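The automatic screening rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `is_candidate` and its two string arguments are hypothetical, and the matching is done case-insensitively, which the text does not specify. The four phrases are the ones listed above.

```python
# Phrases taken from the screening rule in the text.
COMPLETE_PHRASES = (
    "complete genome",
    "complete chromosome",
    "sequence of the genome of",
    "complete genomic sequence",
)


def is_candidate(topology: str, definition: str) -> bool:
    """Return True if a record passes the automatic screen.

    A record is a candidate if (i) its topology is circular or
    (ii) its definition line contains one of the key phrases.
    Case-insensitive matching is an assumption of this sketch.
    """
    if topology.strip().lower() == "circular":
        return True
    text = definition.lower()
    return any(phrase in text for phrase in COMPLETE_PHRASES)
```

A record that passes this screen is still only a candidate; as stated above, curators verify completeness before it enters the collection.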
Some complete viral genome sequences are not detected by this automatic procedure because the source record either did not correctly indicate a circular topology or did not include the key words that NCBI curators use for automatic screening. To overcome this problem, viral sequences undergo an additional screening based on the sequence length. In this procedure, all viral sequences are retrieved from GenBank and grouped by virus genera. If there are viral reference sequences in a genus, sequences longer than 90% of the shortest reference sequence in the genus are selected. If there are no viral reference sequences in a genus, sequences longer than 90% of the longest sequence in the genus are selected. NCBI curators then check the selected sequences for their completeness.
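The length-based screen can be illustrated as follows. This is a minimal sketch, not the project's actual pipeline: the function name `length_screen`, the tuple layout of the input records, and the accessions and genus labels in the usage example are all hypothetical. Only the two 90% thresholds come from the text.

```python
from collections import defaultdict


def length_screen(records, ref_lengths_by_genus, fraction=0.9):
    """Select accessions that pass the length-based screen.

    records: iterable of (accession, genus, length) tuples (hypothetical layout).
    ref_lengths_by_genus: dict mapping genus -> list of reference sequence lengths.
    """
    by_genus = defaultdict(list)
    for accession, genus, length in records:
        by_genus[genus].append((accession, length))

    selected = []
    for genus, members in by_genus.items():
        ref_lengths = ref_lengths_by_genus.get(genus)
        if ref_lengths:
            # Genus has reference sequences: compare with the shortest one.
            threshold = fraction * min(ref_lengths)
        else:
            # No reference yet: compare with the longest sequence in the genus.
            threshold = fraction * max(length for _, length in members)
        selected.extend(acc for acc, length in members if length > threshold)
    return sorted(selected)


# Usage with made-up data.
records = [
    ("A1", "GenusOne", 1000),
    ("A2", "GenusOne", 850),
    ("A3", "GenusTwo", 500),
    ("A4", "GenusTwo", 460),
    ("A5", "GenusTwo", 100),
]
picked = length_screen(records, {"GenusOne": [1000, 1200]})
```

In this example `picked` contains A1, A3, and A4; the selected records would then go to curators for a completeness check.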
Additionally, complete viral genome sequences are identified with the aid of external scientific advisors, experts on particular families or groups of viruses, who also assist in the curatorial process. The list of advisors and their contact information is available at http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/viradvisors.html.
More than 200 viral sequences have been identified and added to the viral genomes collection with the advisors' help. We invite all virologists to contribute to the Viral Genomes public resource by sending us their suggestions concerning any virus or virus group (including those for which we already have advisors). To help the NCBI identify complete viral genomes, it is strongly recommended that submitters use the term “complete genome” in the definition of their sequences.
Taxonomic classification.
The Viral Genomes Project is tightly linked with the taxonomy database. In this database, each organism or taxonomy node has its own unique identification number, a tax_id. Viral Genomes and GenBank sequence records contain tax_id information as well as organism names and taxonomic lineages. The Viral Genomes Project makes use of the tax_ids to build a taxonomical hierarchy in its tools and views (see Visualization).
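To make the role of tax_ids concrete, the sketch below rebuilds a lineage by following parent links from a node up to the root. The function name, the two dictionaries, and the small integer identifiers are hypothetical and are not real NCBI tax_ids; only the example lineage (Viruses, ssDNA viruses, Geminiviridae) is taken from this article.

```python
def lineage(tax_id, parent_of, name_of):
    """Return the names from the root down to tax_id.

    parent_of: dict mapping a tax_id to its parent tax_id (root has no entry).
    name_of: dict mapping a tax_id to its scientific name.
    """
    path = []
    seen = set()  # guards against accidental cycles in the input
    while tax_id is not None and tax_id not in seen:
        seen.add(tax_id)
        path.append(name_of[tax_id])
        tax_id = parent_of.get(tax_id)
    return list(reversed(path))


# Hypothetical identifiers used only for illustration.
name_of = {1: "Viruses", 2: "ssDNA viruses", 3: "Geminiviridae"}
parent_of = {3: 2, 2: 1}
path = lineage(3, parent_of, name_of)
```

Grouping records by the lineage returned for their tax_id is one simple way to obtain a taxonomically organized listing.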
The NCBI taxonomy database contains names and classifications of more than 100,000 organisms for which sequence data are available. As of December 2003, it included the names of about 6,700 viral species and a total of about 9,300 individual viral names for individual strains, serotypes, isolates, and genotypes.
The names and classifications of viruses in the taxonomy database follow, to a large extent, the most recent report of the International Committee on the Taxonomy of Viruses, ICTV7 (18). As the ICTV reports appear infrequently, the NCBI taxonomy database attempts to stay current by also accepting new names and classification schemes on a case-by-case basis as provided in the reports of the ICTV executive meetings (9) and based on the advice of outside experts.
However, many sequence submissions are for viruses that are not listed in the ICTV report and sometimes not even described in the published literature. In spite of this, the taxonomy database can index these organisms and associated records. For example, the taxonomy database lists the TT virus (11) and bacteriophage Mx8 (unpublished), although they are not present in the ICTV report. Finding an appropriate taxonomic position for a virus usually involves comparative sequence analysis. The complete inventory of viruses and their classifications can be explored at the NCBI taxonomy web site at http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/.
REFERENCE SEQUENCES AND GENOME NEIGHBORS
Each virus species is represented by a single reference genome sequence. In the cases of multiple variants, strains, and isolates, preference is given to a sequence of a well-studied and practically important strain or isolate. For example, the genome of Poliovirus, NC_002058, is represented by the sequence of the notorious pathogenic strain Mahoney (GenBank accession no. V01149), published in 1981 (14). Other sequences of the same virus (regardless of types, strains, or isolates) become Genome neighbors for the reference sequence, provided that they meet all of the criteria that were used to select complete genomic sequences. Please note that Genome neighbors are not the same as GenBank Related Sequences, which represent records selected by sequence similarity.
RefSeq records are continually reviewed. Any existing RefSeq can be replaced with a better-studied, more practically important, and/or better-annotated genomic sequence. Recent review of the Flaviviridae family revealed that an old RefSeq record for HCV based on the GenBank record D90208 of 1993 was missing an important 82-base sequence at the 3′ end, which was first detected in HCV in 1996 by Kolykhalov et al. (5). Therefore, a new RefSeq record (NC_004102) was created, based on GenBank sequence AF009606, which represents the first cDNA clone to produce infectious and pathogenic HCV RNA.
MULTICOMPONENT (SEGMENTED) VIRUSES
Multicomponent viral genomes are manually assembled by matching strain and isolate information for available sequences of complete components. When several sequences are available for the same segment of the same strain and/or isolate, preference is given to a sequence obtained in the same laboratory as those of the other components. For example, the bipartite genome of Tobacco rattle virus was assembled as follows. From 11 available complete sequences, 3 sequences of RNA 1 (AF166084, AF31416, and AF406990) and one sequence of RNA 2 (Z36974) were primary candidates for RefSeq since they belong to the same strain. Of these, AF406990 and Z36974 were submitted by the same laboratory and therefore were chosen to represent the genome.
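Although this assembly is done manually, the selection logic can be sketched in code. The example below is an assumption-laden illustration: the function name `assemble`, the record fields, and the strain and laboratory labels are hypothetical. The accessions are from the Tobacco rattle virus example above, restricted to two of the RNA 1 candidates for brevity.

```python
from itertools import product


def assemble(records, segments):
    """Pick one record per segment from a single strain.

    Among strains that have every segment, prefer the combination whose
    records come from the fewest distinct laboratories.
    Returns a list of accessions in segment order, or None.
    """
    by_strain = {}
    for rec in records:
        by_strain.setdefault(rec["strain"], {}).setdefault(rec["segment"], []).append(rec)

    best = None  # (number of distinct labs, accessions)
    for by_segment in by_strain.values():
        if not all(seg in by_segment for seg in segments):
            continue  # this strain lacks a complete component
        for combo in product(*(by_segment[seg] for seg in segments)):
            n_labs = len({rec["lab"] for rec in combo})
            if best is None or n_labs < best[0]:
                best = (n_labs, [rec["accession"] for rec in combo])
    return best[1] if best else None


# Strain and lab labels are placeholders, not curated metadata.
records = [
    {"accession": "AF166084", "segment": "RNA1", "strain": "S", "lab": "lab_A"},
    {"accession": "AF406990", "segment": "RNA1", "strain": "S", "lab": "lab_B"},
    {"accession": "Z36974", "segment": "RNA2", "strain": "S", "lab": "lab_B"},
]
genome = assemble(records, ["RNA1", "RNA2"])
```

With these inputs the sketch selects AF406990 and Z36974, the pair submitted by the same laboratory.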
Only about half of provisional records indicate strain or isolate data; in other cases, NCBI gathers the information from the current literature or the ICTV resource.
CURATORIAL PROCESS FOR A VIRUS REFSEQ RECORD
The curatorial process includes the correction and update of the record, along with the addition of relevant biological information taken from the literature, other sequence records, original submitters, and outside advisors. The most common corrections are made to the type and topology of the genomic molecule (double stranded or single stranded, linear or circular) as well as to taxonomy lineage. The NCBI also adds a “specific host” field to all phage RefSeq records for users interested in studying phage-host relationships.
A large part of the curatorial process involves improvement of genome annotation, which includes searches for missing genes, assignment of functional roles to protein products, correction of annotations for proteins expressed by frame shifting or read through, restoration of proteins disrupted by sequencing errors, and addition of information on the processing of viral polyproteins.
In collaboration with Mark Borodovsky, the GeneMark program (http://opal.biology.gatech.edu/GeneMark/VIOLIN) was used to predict open reading frames (ORFs) in all viral RefSeq genomes and to compare them with the original annotations. To date, almost 100 records have been manually annotated with additional GeneMark-predicted ORFs (10). Whenever possible, putative protein functions were inferred from the results of BLAST searches against Viral Clusters of Related Proteins (VOG; see next section) or the NCBI nonredundant protein database (nr). The new annotations were usually confirmed by additional information retrieved from the current literature. Examples include the complete genomic sequences of two large double-stranded DNA (dsDNA) viruses, Lymphocystis disease virus 1 and Sheeppox virus. The original GenBank records, L63545 and AY077832, respectively, contained no annotation, with the exception of a coat protein in the former record. Subsequently, 157 and 147 protein coding genes were predicted by the GeneMark program in these genomes, respectively, and added to the corresponding reference sequences (records NC_001824 and NC_004002). The annotation of the Lymphocystis disease virus 1 RefSeq record (NC_001824) was further reviewed manually and compared with author-supplied annotation kindly provided by C. A. Tidona (17).
A significant number of problematic annotations existed in source records where viral proteins were expressed by frame-shifting or stop codon read-through mechanisms. For example, the correct precursor polyproteins were missing from many records of Retroid viruses (including the HIV-1 RefSeq NC_001802) and from a few families of positive-strand single-stranded RNA (ssRNA) viruses, such as Arteriviridae, Coronaviridae, and Astroviridae. Subsequently, corrections were made to the records, allowing for the mature peptides to be curated.
The annotation of viral polyproteins is a satellite project that will be described elsewhere. Briefly, it involves the following: (i) the production of alignments of related viral polyproteins from both RefSeq and Genome neighbor records, usually grouped by species or genera; (ii) the incorporation of cleavage sites available from the sequence databases or from the current literature into the alignment; (iii) the analysis of the alignment for (potential) cleavage sites in the reference sequences; and (iv) the annotation of the missing (predicted) cleavage sites in corresponding RefSeq records. For example, for the genus Flavivirus, 62 nonredundant polyproteins were aligned and 438 previously annotated cleavage sites were indicated in the alignment, which allowed for the prediction of an additional 368 putative cleavage sites. As many as 17 (of 20) flavivirus RefSeq records have been updated accordingly and provided with appropriate comments. Many RefSeq records for other viruses that employ the strategy of polyprotein processing have been processed this way. The Poliovirus RefSeq NC_002058 now has all 11 mature peptides (polyprotein processing products), whereas its source record, V01149, has only 3 mature peptides. Similarly, no mature peptides are present in the original record of Equine arteritis virus, X53459, the best-studied (in terms of molecular biology) representative of the order Nidovirales, while the corresponding RefSeq entry NC_002532 now contains 12 mature peptides. More data on alternative polyprotein processing will be added in the future.
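Steps (ii) to (iv) amount to carrying a known cleavage position from one aligned polyprotein to another through shared alignment columns. The sketch below shows that coordinate transfer only; it is a simplification, since the actual analysis also involves curator review. All function names and the toy gapped sequences are hypothetical, and a site is represented by the 1-based position of the residue preceding the cleavage.

```python
def residue_to_column(aligned, pos):
    """Alignment column (0-based) of the pos-th residue (1-based)."""
    count = 0
    for col, ch in enumerate(aligned):
        if ch != "-":
            count += 1
            if count == pos:
                return col
    raise ValueError("position beyond sequence end")


def column_to_residue(aligned, col):
    """Residue position (1-based) at a column, or None if it is a gap."""
    if aligned[col] == "-":
        return None
    return sum(1 for ch in aligned[: col + 1] if ch != "-")


def transfer_sites(aligned_source, source_sites, aligned_target):
    """Map annotated sites in the source onto the target sequence."""
    predicted = []
    for pos in source_sites:
        col = residue_to_column(aligned_source, pos)
        target_pos = column_to_residue(aligned_target, col)
        if target_pos is not None:  # skip sites that fall in a target gap
            predicted.append(target_pos)
    return predicted


# Toy alignment rows of equal length; not real polyprotein sequences.
neighbor = "MK-TAYG"
reference = "MKQTA-G"
sites = transfer_sites(neighbor, [3, 5], reference)
```

Here the neighbor's site at residue 3 maps to residue 4 of the reference, while the site at residue 5 falls opposite a gap and is not transferred.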
VIRAL GENOME ANALYSIS (VOG)
Viral COG—clusters of related viral proteins.
To facilitate comparative analysis of viral genomes and classify their proteins into functional and phylogenetic patterns, proteins from complete viral genomes were clustered, based on BLASTP (1) pairwise alignments, using the Clusters of Orthologous Groups (COG) approach (15). Clusters of related viral proteins (VOG) are being constructed in a group-by-group fashion and are available at http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/vog.html. Clusters for dsDNA viruses, ssDNA viruses, and phages are publicly available, while the VOGs for other groups are under construction. At least one of the six functional categories is assigned to each VOG, and taxonomic affiliations are listed. The VOG web pages are cross-linked with the virus family pages and/or individual genome views.
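As a rough illustration of clustering proteins from pairwise BLASTP results, the sketch below links proteins from different genomes that are each other's best hit and then merges linked proteins into groups. This is a deliberately simplified stand-in, not the COG procedure itself, which builds clusters from mutually consistent best hits across at least three genomes. The function names, the "genome|protein" identifier format, and the scores are hypothetical.

```python
def genome_of(protein_id):
    """Genome label, assuming identifiers of the form 'genome|protein'."""
    return protein_id.split("|", 1)[0]


def cluster_reciprocal_best_hits(hits):
    """hits: iterable of (query, subject, score). Returns sorted clusters."""
    best = {}  # (query, subject genome) -> (score, subject)
    for query, subject, score in hits:
        if genome_of(query) == genome_of(subject):
            continue  # ignore hits within the same genome
        key = (query, genome_of(subject))
        if key not in best or score > best[key][0]:
            best[key] = (score, subject)

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (query, _), (_, subject) in best.items():
        back = best.get((subject, genome_of(query)))
        if back is not None and back[1] == query:
            parent[find(query)] = find(subject)  # merge reciprocal pair

    groups = {}
    for protein in list(parent):
        groups.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(group) for group in groups.values())


hits = [
    ("g1|rep", "g2|rep", 200), ("g2|rep", "g1|rep", 200),
    ("g2|rep", "g3|rep", 150), ("g3|rep", "g2|rep", 150),
    ("g1|cp", "g2|cp", 90), ("g2|cp", "g1|cp", 90),
    ("g1|cp", "g2|rep", 10),
]
clusters = cluster_reciprocal_best_hits(hits)
```

With these made-up hits the three rep proteins form one cluster and the two cp proteins form another; the weak cp-to-rep hit is ignored because it is not a best hit.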
PRECOMPUTED CONSERVED DOMAINS
Along with the classical form of COG representation, the VOG web pages feature a new graphical interface that displays conserved protein domains found in the CDD database (8) for VOG members. Another VOG option—BLAST against viral proteins—allows one to determine if a query protein belongs to any of the existing clusters. A detailed description of the VOG tool will be provided elsewhere.
VISUALIZATION
The NCBI Viral Genomes resource was created to accommodate the variety of viral genomes and genomic segments, which can range from a few hundred nucleotides to a few hundred thousand nucleotides in length and from linear to circular in topology. The data are organized into eight large taxonomic groups (excluding unclassified viruses and unclassified bacteriophages; Fig. 2), each group containing families or floating genera for which complete genomes are available.
The architecture of the NCBI Viral Genomes website allows one to easily navigate between viral genome listings along the taxonomy tree (e.g., Viruses↔ssDNA viruses↔Geminiviridae) and within the same taxonomy level (e.g., from one family to another). In addition, it allows one to study the genome of interest at different levels of detail, from an entire genome↔[genome segment]↔gene↔translation product↔protein domains (or mature peptides). The VOG pages, which provide access to curated clusters of related viral proteins, are cross-linked with the virus family and group pages and the genomic view pages, thereby showing the functional or evolutionary relationships of the viral genomes covered.
To analyze a genome or to compare several genomes, one can select the group of interest and apply the tools and precomputed results provided by the Viral Genomes resource. For example, start with the ssDNA virus group that consists of six families. Figure 3 shows the steps taken in such an analysis for the family Geminiviridae.
Geminiviridae is the largest family of the ssDNA viruses and currently contains as many as 166 reference sequences and 181 Genome neighbors, all of which are available from the same page (Fig. 3A). The genomes may consist of one component (all mastreviruses and curtoviruses, some begomoviruses, and the topocuvirus) or of two components (most begomoviruses), each approximately 2.5 to 3.0 kb in size (Fig. 3B). The DNA A segment typically encodes five to six proteins (Fig. 3C), including replication protein Rep, coat protein, and transport and/or regulatory proteins. The DNA B segment encodes two different movement proteins. Precomputed VOG results help to compare these proteins to each other and to those of other ssDNA viruses. Figure 3D shows the clusters which contain at least one geminivirus protein. The first cluster, VOGs0100, includes the Rep proteins from all the Geminiviridae as well as from all the Circoviridae and Nanoviridae. Indeed, the Rep proteins of all these viruses are indispensable for replication of their circular ssDNA genomes and were previously shown to share common amino acid sequence motifs (reference 7 and references therein). Other geminivirus proteins form clusters within the family. The size of each cluster depends primarily on the level of conservation of the corresponding protein. For example, the geminivirus coat protein (VOGs0101) is present in all genomes, the replication enhancer protein REn (VOGs0102) is present in the Begomovirus and Curtovirus genomes, and the nuclear shuttle protein (VOGs106) is present in the Begomovirus genomes. Thus, VOGs allow one to see at a glance which proteins are shared by only a portion of the genomes within a taxonomy group, which are group specific, and which extend across several families. Mapping this information back to sequence records (under construction) helps visualize the distribution of related genes and gene arrays among viral genomes.
Graphical and textual alignments of BLAST results are available for each cluster (Fig. 3E), allowing for quick and in-depth analysis of sequence similarities among the members of the cluster.
The Graphic View page of a viral RefSeq is a good starting point for exploring a viral genome. The graphical view can be accessed directly by typing the name of a virus into one of the two search boxes (Fig. 2). Figure 4A shows part of such a page for Enterobacteria phage P2. This page displays a graphical view of genome features such as CDSs, RNA genes, protein coding genes, signals, and more. From the graphical view, one can go to the “protein view,” which summarizes the coding region information for each DNA or RNA strand (Fig. 4B). Each protein is shown on this page as a rectangle, colored according to the nucleic acid strand, which is hyperlinked to a corresponding BLINK page (Fig. 4C) that displays precomputed BLAST neighbors for this protein. A complementary list of VOG affiliations (if any) for all of the proteins encoded by this phage genome (Fig. 4D) allows one to get an instant impression of the functions of these proteins and to review each of the relevant VOGs in order to learn more about the functions and evolution of these and related proteins. Nucleotide and protein records in FASTA format or a protein table can be downloaded from a “Coding Regions” view (link shown on the left tool bar, Fig. 4A and B). To continue the comparison with other related genomes, one can return to the family page in one click.
OTHER RELATED RESOURCES
Additional viral genome web resources devoted to specific viruses or viral groups are being developed and maintained at NCBI as well. These include resources for retroviruses (http://www.ncbi.nlm.nih.gov/retroviruses; C. Chappey and U. Plikat, unpublished data) and the SARS coronavirus (http://www.ncbi.nlm.nih.gov/genomes/SARS; unpublished data). Both resources provide easy access to the curated genomic data (that are part of the general Viral Genomes collection described above) as well as to specialized tools and results of precomputed analysis. For example, the retroviruses resource includes the genotyping tool for HIV-1, HIV-2, primate T-lymphotropic viruses, Hepatitis B virus, and HCV.
The Viral Genomes Project is regularly updated as new data and more tools become available. Proteins from all viral reference sequences are to be subjected to VOG analysis.
Acknowledgments
We acknowledge Pavel Bolotov for technical database support, Boris Kiryutin for cluster analysis, Andrei Kochergin for programming help, and Alexandre Souvorov for graphical alignment displays.
The views expressed in this Commentary do not necessarily reflect the views of the journal or of ASM.
REFERENCES
- 1. Altschul, S. F., T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
- 2. Buchen-Osmond, C. 2003. The universal virus database ICTVdB. Comput. Sci. Eng. 5:16-25.
- 3. Hiscock, D., and C. Upton. 2000. Viral Genomes DataBase: storing and analyzing genes and proteins from complete viral genomes. Bioinformatics 16:484-485.
- 4. Kellam, P., and M. M. Alba. 2002. Virus bioinformatics: databases and recent applications. Appl. Bioinform. 1:37-42.
- 5. Kolykhalov, A. A., S. M. Feinstone, and C. M. Rice. 1996. Identification of a highly conserved sequence element at the 3′ terminus of hepatitis C virus genome RNA. J. Virol. 70:3363-3371.
- 6. Lafontaine, D., S. Mercure, and J. P. Perreault. 1997. Update of the viroid and viroid-like sequence database: addition of a hepatitis delta virus RNA section. Nucleic Acids Res. 25:123-125.
- 7. Mankertz, A., and B. Hillenbrand. 2001. Replication of porcine circovirus type 1 requires two proteins encoded by the viral rep gene. Virology 279:429-438.
- 8. Marchler-Bauer, A., J. B. Anderson, C. DeWeese-Scott, N. D. Fedorova, L. Y. Geer, S. He, D. I. Hurwitz, J. D. Jackson, A. R. Jacobs, C. J. Lanczycki, C. A. Liebert, C. Liu, T. Madej, G. H. Marchler, R. Mazumder, A. N. Nikolskaya, A. R. Panchenko, B. S. Rao, B. A. Shoemaker, V. Simonyan, J. S. Song, P. A. Thiessen, S. Vasudevan, Y. Wang, R. A. Yamashita, J. J. Yin, and S. H. Bryant. 2003. CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Res. 31:383-387.
- 9. Mayo, M. A. 2002. Virus taxonomy - Houston 2002. Arch. Virol. 147:1071-1076.
- 10. Mills, R., M. Rozanov, A. Lomsadze, T. Tatusova, and M. Borodovsky. 2003. Improving gene annotation of complete viral genomes. Nucleic Acids Res. 31:7041-7055.
- 11. Okamoto, H., T. Nishizawa, N. Kato, M. Ukita, H. Ikeda, H. Iizuka, Y. Miyakawa, and M. Mayumi. 1998. Molecular cloning and characterization of a novel DNA virus (TTV) associated with posttransfusion hepatitis of unknown etiology. Hepatol. Res. 10:1-16.
- 12. Pelchat, M., L. Rocheleau, J. Perreault, and J. P. Perreault. 2003. SubViral RNA: a database of the smallest known auto-replicable RNA species. Nucleic Acids Res. 31:444-445.
- 13. Pruitt, K. D., T. Tatusova, and D. R. Maglott. 2003. NCBI Reference Sequence Project: update and current status. Nucleic Acids Res. 31:34-37.
- 14. Racaniello, V. R., and D. Baltimore. 1981. Cloned poliovirus complementary DNA is infectious in mammalian cells. Science 214:916-919.
- 14a. Rozanov, M., U. Plikat, C. Chappey, A. Kochergin, and T. Tatusova. 2004. A web-based genotyping resource for viral sequences. Nucleic Acids Res., in press.
- 15. Tatusov, R. L., E. V. Koonin, and D. J. Lipman. 1997. A genomic perspective on protein families. Science 278:631-637.
- 16. Tatusova, T. A., I. Karsch-Mizrachi, and J. A. Ostell. 1999. Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics 15:536-543.
- 17. Tidona, C. A., and G. Darai. 1997. The complete DNA sequence of lymphocystis disease virus. Virology 230:207-216.
- 18. van Regenmortel, M. H. V., C. M. Fauquet, D. H. L. Bishop, E. B. Carstens, M. K. Estes, S. M. Lemon, J. Maniloff, M. A. Mayo, D. J. McGeoch, C. R. Pringle, and R. B. Wickner. 2000. Virus taxonomy: seventh report of the international committee on taxonomy of viruses. Academic Press, New York, N.Y.
- 19. Wheeler, D. L., D. M. Church, S. Federhen, A. E. Lash, T. L. Madden, J. U. Pontius, G. D. Schuler, L. M. Schriml, E. Sequeira, T. A. Tatusova, and L. Wagner. 2003. Database resources of the National Center for Biotechnology. Nucleic Acids Res. 31:28-33.