NAR Genom Bioinform. 2021 Aug 5;3(3):lqab067.
doi: 10.1093/nargab/lqab067. eCollection 2021 Sep.

PHROG: families of prokaryotic virus proteins clustered using remote homology

Paul Terzian et al. NAR Genom Bioinform.

Abstract

Viruses are abundant, diverse and ancestral biological entities. Their diversity is high, both in the number of different protein families encountered and in the sequence heterogeneity within each protein family. The recent increase in sequenced viral genomes is a great opportunity to gain new insights into this diversity and calls for annotation resources that support functional and comparative analysis. Here, we introduce PHROG (Prokaryotic Virus Remote Homologous Groups), a library of viral protein families generated using a new clustering approach based on remote homology detection by HMM profile-profile comparisons. Considering 17 473 reference (pro)viruses of prokaryotes, 868 340 of the total 938 864 proteins were grouped into 38 880 clusters, a clustering that proved 2-fold deeper than one obtained with a classical strategy based on BLAST-like similarity searches while remaining homogeneous. Manual inspection of similarities to various reference sequence databases led to the annotation of 5108 clusters (containing 50.6% of the total protein dataset) with 705 different annotation terms, grouped into 9 functional categories specifically designed for viruses. We hope that PHROG will be a useful tool to better annotate future prokaryotic viral sequences, thus helping the scientific community to better understand the evolution and ecology of these entities.

Figures

Figure 1.
Overview of the dataset, the clustering procedure, the results and the website main page. (A) Protein sets of reference viruses from the NCBI (RefVirus) and of (pro)viruses detected using VirSorter (ProVirus) were collected. Note that ProVirus proteins (green) were not initially annotated. (B) The four steps of the clustering procedure: (i) A protein network is built from pairwise sequence similarities, each dot/vertex representing a protein (green for ProVirus proteins, red for annotated RefVirus proteins and pink for unannotated RefVirus proteins), with an edge linking two proteins if they are similar. (ii) Protein clusters are identified by applying MCL to this network; clusters are depicted as gray circles. (iii) Clusters (and singletons) are compared using protein profiles, and edges are drawn between pairs of protein clusters that are similar. (iv) This network of clusters is in turn clustered into PHROGs, depicted in dark brown for unannotated PHROGs and in blue for annotated PHROGs. For example, PHROG2 is made of two protein clusters and one singleton. (C) Number of annotated and unannotated PHROGs and singletons, with the number and origin of the proteins involved (red and green for RefVirus and ProVirus, respectively). (D) The PHROGs website main page, where users can search for PHROGs or viruses of interest.
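The two-level structure of panel (B) can be illustrated with a toy sketch. This is not the authors' pipeline: connected components (via union-find) stand in for MCL at both levels, and the protein names, the similarity edges and the profile-similarity edges are all hypothetical values chosen only to reproduce the "two clusters plus one singleton" shape of the PHROG2 example.

```python
from collections import defaultdict

def connected_components(nodes, edges):
    """Group nodes linked by edges (union-find). A simple stand-in for MCL."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    # Sort for a deterministic order of the resulting groups.
    return sorted((frozenset(g) for g in groups.values()), key=lambda g: sorted(g))

# Steps (i)-(ii): hypothetical protein network, clustered on sequence similarity.
proteins = ["p1", "p2", "p3", "p4", "p5", "p6"]
sequence_edges = [("p1", "p2"), ("p3", "p4")]
clusters = connected_components(proteins, sequence_edges)
# clusters: {p1,p2}, {p3,p4}, and the singletons {p5}, {p6}

# Steps (iii)-(iv): hypothetical profile-profile similarities between clusters,
# given as pairs of indices into `clusters`, then a second round of clustering.
profile_edges = [(0, 1), (1, 2)]
groups = connected_components(range(len(clusters)), profile_edges)
toy_phrogs = [frozenset().union(*(clusters[i] for i in g)) for g in groups]
```

Here the first toy group merges two protein clusters and one singleton, while `p6` remains a singleton because no profile similarity links it to anything.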
Figure 2.
Number of viruses according to their viral family and the class of their host (when not specified, viral families have a dsDNA genome). The size of each balloon is proportional to the number of viruses, and the color reflects the proportion of proviruses.
Figure 3.
Cumulative number of clustered proteins. For example, point a means that with the standard clustering procedure, ∼234 000 proteins are in clusters that contain at least 200 proteins, whereas with the PHROG procedure, ∼390 000 proteins are in clusters of that size (point b). The inset at the top right is a zoom of the left part of the curve. The 5 largest PHROGs are highlighted by crosses at the bottom right (the two largest PHROGs gather 5795 and 5879 proteins).
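As a reading aid for this curve, the quantity plotted at a given cluster size can be sketched as follows. The function names and the toy cluster sizes are illustrative assumptions, not data or code from the paper.

```python
from typing import Iterable, List, Tuple

def proteins_in_clusters_of_at_least(cluster_sizes: Iterable[int], min_size: int) -> int:
    """Total number of proteins that sit in clusters of size >= min_size."""
    return sum(size for size in cluster_sizes if size >= min_size)

def cumulative_curve(cluster_sizes: Iterable[int]) -> List[Tuple[int, int]]:
    """One (threshold, protein count) point per distinct cluster size."""
    sizes = list(cluster_sizes)
    thresholds = sorted(set(sizes))
    return [(t, proteins_in_clusters_of_at_least(sizes, t)) for t in thresholds]

# Toy example (made-up sizes): one singleton, two clusters of 3, one of 10, one of 250.
toy_sizes = [1, 3, 3, 10, 250]
curve = cumulative_curve(toy_sizes)
```

With these toy sizes, only the cluster of 250 proteins counts at a threshold of 200, which is the kind of value read off at points a and b.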
Figure 4.
Identity percent (A) and coverage (B) for protein pairs in the same clusters. The clusters were separated according to size intervals, the first value «3» representing clusters that contain 3, 4, 5 or 6 proteins, «7» being clusters containing between 7 and 19 proteins, and the last interval «2000» being clusters of more than 2000 proteins. Using the multiple alignment of each cluster, (i) the identity percent between two proteins is the number of amino acids that are identical in the two aligned proteins divided by the length of the shorter of the two proteins, and (ii) the coverage is the proportion of amino acids of one protein that are aligned to an amino acid (not to a gap) of the other protein. Subsamples of 1000 values were taken to draw each boxplot.
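The two definitions in this caption can be written out as a minimal sketch. It assumes that each protein is given as one row of the multiple alignment, that rows have equal length, and that gaps are written as "-"; these conventions, the function names and the example rows are assumptions for illustration, not taken from the paper.

```python
GAP = "-"  # assumed gap symbol in the alignment rows

def identity_percent(row_a: str, row_b: str) -> float:
    """Identical aligned residues divided by the length of the shorter protein, in percent."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    identical = sum(1 for a, b in zip(row_a, row_b) if a != GAP and a == b)
    # Ungapped lengths of the two proteins.
    len_a = len(row_a) - row_a.count(GAP)
    len_b = len(row_b) - row_b.count(GAP)
    shorter = min(len_a, len_b)
    if shorter == 0:
        raise ValueError("a row contains no residues")
    return 100.0 * identical / shorter

def coverage(row_a: str, row_b: str) -> float:
    """Proportion of residues of protein A aligned to a residue (not a gap) of protein B."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    residues_a = len(row_a) - row_a.count(GAP)
    if residues_a == 0:
        raise ValueError("row_a contains no residues")
    aligned = sum(1 for a, b in zip(row_a, row_b) if a != GAP and b != GAP)
    return aligned / residues_a

# Hypothetical aligned pair: A has 6 residues, B has 5, and 3 columns are identical.
row_a = "MKT-LLV"
row_b = "MRTAL--"
```

For this pair, the identity percent is 3 identical residues over the 5 residues of the shorter protein. Coverage is not symmetric: 4 of the 6 residues of A face a residue of B, while 4 of the 5 residues of B face a residue of A.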
Figure 5.
For each annotation term, the number of PHROGs with this annotation and the number of proteins in these PHROGs, each term being colored according to its functional category.
Figure 6.
Each vertex represents an annotation, and two annotations are linked if they are considered significantly colocalized (see Materials and Methods section). Each edge was assigned a weight equal to its significance score. Only the 112 most frequent annotations (used for >650 proteins) are displayed. Annotations whose genes were colocalized with genes having the same annotation are drawn as squares. Among these 112, the 21 annotations not significantly colocalized with any other are displayed on the right.
