The Pfam protein families database

Nucleic Acids Res. 2012 Jan;40(Database issue):D290-301. doi: 10.1093/nar/gkr1065. Epub 2011 Nov 29.

Marco Punta¹, Penny C Coggill, Ruth Y Eberhardt, Jaina Mistry, John Tate, Chris Boursnell, Ningze Pang, Kristoffer Forslund, Goran Ceric, Jody Clements, Andreas Heger, Liisa Holm, Erik L L Sonnhammer, Sean R Eddy, Alex Bateman, Robert D Finn

Affiliation

¹ Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK. mp13@sanger.ac.uk

PMID: 22127870
PMCID: PMC3245129
DOI: 10.1093/nar/gkr1065

Abstract

Pfam is a widely used database of protein families, currently containing more than 13,000 manually curated protein families as of release 26.0. Pfam is available via servers in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/). Here, we report on changes that have occurred since our 2010 NAR paper (release 24.0). Over the last 2 years, we have generated 1840 new families and increased coverage of the UniProt Knowledgebase (UniProtKB) to nearly 80%. Notably, we have taken the step of opening up the annotation of our families to the Wikipedia community, by linking Pfam families to relevant Wikipedia pages and encouraging the Pfam and Wikipedia communities to improve and expand those pages. We continue to improve the Pfam website and add new visualizations, such as the 'sunburst' representation of taxonomic distribution of families. In this work we additionally address two topics that will be of particular interest to the Pfam community. First, we explain the definition and use of family-specific, manually curated gathering thresholds. Second, we discuss some of the features of domains of unknown function (also known as DUFs), which constitute a rapidly growing class of families within Pfam.

Figures

**Figure 1.**
New Pfam features since release 24.0. (A) The Pfam-A family page for Avidin (PF01382), showing the embedded contents of the associated Wikipedia article. The ‘infobox’ is highlighted. (B) The ‘sunburst’ representation of the tree showing the species distribution of the Pfam-A family Peptidase_M10 (PF00413). (C) The PfamAlyzer applet, showing the results of searching for all architectures that include the domains IMPDH and CBS. The PfamAlyzer applet allows querying of Pfam for proteins with particular domains, domain combinations or architectures.

**Figure 2.**
Pfam users in the world. A world map showing usage of the Pfam website at the Wellcome Trust Sanger Institute, UK. Usage statistics were obtained from our Google Urchin tracking database and plotted using the Google Maps API. Circle size is proportional to the number of visits from each country with >5000 visits; countries contributing <5000 visits are all shown with the same-sized marker. Data refer to the period between 1 and 30 June 2011.

**Figure 3.**
Heat map showing sequence gathering threshold (GA) changes between Pfam releases 24.0 and 26.0. Yellow squares represent high density; red squares represent low density. Squares on the diagonal correspond to GAs that are unchanged; squares in the region above the diagonal are GAs that have increased; and squares below the diagonal are GAs that have decreased. For the sake of clarity, we chose to show a zoomed-in version of the complete plot, which also includes a number of points outside of the range seen here. The plot was created using R (21).
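How a point is read relative to the diagonal, and how points are pooled into heat-map squares, can be sketched as follows. This is a minimal illustration only, not the authors' R code: the function names, the family labels and the GA values are hypothetical, and placing release 24.0 on the x-axis and release 26.0 on the y-axis is an assumption inferred from "above the diagonal = increased".

```python
from collections import Counter


def classify_ga_change(ga_old: float, ga_new: float, tol: float = 1e-9) -> str:
    """Say where the point (ga_old, ga_new) lies relative to the diagonal y = x."""
    if abs(ga_new - ga_old) <= tol:
        return "unchanged"  # on the diagonal
    return "increased" if ga_new > ga_old else "decreased"


def heatmap_counts(pairs, bin_width: float = 1.0) -> Counter:
    """Count points per square cell; the count per cell is the plotted density."""
    cells = Counter()
    for ga_old, ga_new in pairs:
        cells[(int(ga_old // bin_width), int(ga_new // bin_width))] += 1
    return cells


# Hypothetical (release 24.0 GA, release 26.0 GA) pairs in bits; not real Pfam data.
example_gas = {
    "family_a": (25.0, 25.0),
    "family_b": (20.5, 22.1),
    "family_c": (27.0, 21.3),
    "family_d": (20.5, 22.4),
}

summary = Counter(classify_ga_change(old, new) for old, new in example_gas.values())
cells = heatmap_counts(example_gas.values())
```

With these made-up values, `family_b` and `family_d` fall into the same square above the diagonal, which is the kind of cell that would be drawn with a higher density.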

**Figure 4.**
Distribution of sequence gathering (GA) thresholds and of corresponding E-values. (A) Distribution of sequence GAs for all Pfam-A families. Note that intervals are such that, for example, ‘25–26’ translates into 25 ≤ sequence GA(bits) < 26. (B) Same as the histogram in panel (A), with log10(E-values) in place of GAs. E-values are calculated from GAs according to the following formula: E = N × exp[−λ·(x − τ)], where x is the bit score GA, λ and τ are parameters derived from the HMM model (λ is the slope parameter, τ is the location parameter) and N is the database size (in this case the size of UniProtKB) (22). (C) Box-plot of all Pfam families’ GAs (left side; median = 22.1, 25th percentile = 20.8, 75th percentile = 25.0), and for all families excluding those where both sequence and domain thresholds equal 25.0 or 27.0 (right side; median = 21.2, 25th percentile = 20.6, 75th percentile = 22.8). (D) Same as (C) with log10(E-values) in place of GAs. E-values calculated as in panel (B). Left side: median = 0.096, 25th percentile = 0.012, 75th percentile = 0.24. Right side: median = 0.18, 25th percentile = 0.057, 75th percentile = 0.27. Note that values reported here for median and percentiles are for E-values and not log10(E-values).
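The conversion in panel (B) and the binning convention in panel (A) can be sketched in a few lines. This is an illustration under stated assumptions rather than the code used for the figure: the function names are made up, and the values chosen for λ, τ and N are placeholders, not parameters of any real Pfam HMM or the actual size of UniProtKB.

```python
import math


def evalue_from_bit_score(x: float, lam: float, tau: float, n: float) -> float:
    """E = N * exp(-lambda * (x - tau)), as given in the caption."""
    return n * math.exp(-lam * (x - tau))


def bit_score_for_evalue(e: float, lam: float, tau: float, n: float) -> float:
    """Inverse of the formula above: x = tau - ln(E / N) / lambda."""
    return tau - math.log(e / n) / lam


def ga_bin_label(ga: float) -> str:
    """Half-open one-bit interval: '25-26' means 25 <= GA < 26."""
    lower = math.floor(ga)
    return f"{lower}-{lower + 1}"


# Placeholder parameters (assumptions for illustration only).
LAM = math.log(2)   # assumed slope; with this value one extra bit halves E
TAU = -5.0          # assumed location parameter
N = 15_000_000      # assumed database size

e_at_ga = evalue_from_bit_score(22.1, LAM, TAU, N)
log10_e = math.log10(e_at_ga)  # the quantity plotted in panels (B) and (D)
```

Because E decreases exponentially with the bit score, a modest spread in GAs translates into a spread of several orders of magnitude in E-values, which is why the panels plot log10(E-values).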

**Figure 5.**
DUF families’ statistics. (A) Comparison between number of DUFs added (blue) and number of DUFs renamed or otherwise removed (red) since Pfam 22.0 (data shown for releases 23.0–26.0, as indicated by labels on the graph). (B) Number of PIR representative clusters of genomes (23) in DUF families. We used Representative Proteomes version 2.0, comprising a total of 671 clusters for a 35% membership cut-off. (C) Co-occurrence between DUFs and other families. The term ‘architecture’ refers to a combination of families occurring within the same protein sequence. Note that we only considered architectures with at least five member sequences. (D) DUF families and protein structure. ‘Families that have structure’ means that a PDB structure is available for a member of the family; ‘families in a clan that has structure’ means that a PDB structure is available for a member of the same clan.
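The notion of an architecture in panel (C), together with the five-member cut-off, can be sketched as below. This is a simplified illustration and not the authors' pipeline: the function names, protein identifiers and DUF names are hypothetical, and treating any family whose name starts with "DUF" as a DUF is an assumption made for brevity.

```python
from collections import Counter


def is_duf(family: str) -> bool:
    """Assumed naming convention: DUF families have names starting with 'DUF'."""
    return family.startswith("DUF")


def count_architectures(protein_families: dict, min_members: int = 5) -> dict:
    """Group proteins by their ordered family combination; keep frequent ones."""
    counts = Counter(tuple(families) for families in protein_families.values())
    return {arch: n for arch, n in counts.items() if n >= min_members}


def duf_partners(architectures) -> dict:
    """For each DUF, collect the non-DUF families it co-occurs with."""
    partners = {}
    for arch in architectures:
        others = {fam for fam in arch if not is_duf(fam)}
        for fam in arch:
            if is_duf(fam):
                partners.setdefault(fam, set()).update(others)
    return partners


# Hypothetical proteins and family assignments; not real Pfam data.
proteins = {f"prot{i}": ["DUF0001", "CBS"] for i in range(5)}
proteins.update({f"rare{i}": ["DUF0002", "IMPDH"] for i in range(2)})
proteins.update({f"solo{i}": ["DUF0003"] for i in range(6)})

kept = count_architectures(proteins)
partners = duf_partners(kept)
```

In this toy data set the `DUF0002` architecture has only two member sequences and is therefore dropped, while `DUF0003` is kept but has no partner families.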

Cited by

Genome-Wide Identification of the Maize Chitinase Gene Family and Analysis of Its Response to Biotic and Abiotic Stresses.
Wang T, Wang C, Liu Y, Zou K, Guan M, Wu Y, Yue S, Hu Y, Yu H, Zhang K, Wu D, Du J. Genes (Basel). 2024 Oct 15;15(10):1327. doi: 10.3390/genes15101327. PMID: 39457451.
High-quality chromosome-level genome assembly of female Artemia franciscana reveals sex chromosome and Hox gene organization.
Jo E, Cho M, Choi S, Lee SJ, Choi E, Kim J, Kim JY, Kwon S, Lee JH, Park H. Heliyon. 2024 Sep 28;10(19):e38687. doi: 10.1016/j.heliyon.2024.e38687. eCollection 2024 Oct 15. PMID: 39435060.
Transcriptional rewiring in CD8⁺ T cells: implications for CAR-T cell therapy against solid tumours.
Srinivasan S, Armitage J, Nilsson J, Waithman J. Front Immunol. 2024 Sep 27;15:1412731. doi: 10.3389/fimmu.2024.1412731. eCollection 2024. PMID: 39399500. Review.
Regulatory logic and transposable element dynamics in nematode worm genomes.
Fierst JL, Eggers VK. bioRxiv [Preprint]. 2024 Sep 16:2024.09.15.613132. doi: 10.1101/2024.09.15.613132. PMID: 39345564.
Candidate gene analysis of rice grain shape based on genome-wide association study.
Xin W, Chen N, Wang J, Liu Y, Sun Y, Han B, Wang X, Liu Z, Liu H, Zheng H, Yang L, Zou D, Wang J. Theor Appl Genet. 2024 Sep 29;137(10):241. doi: 10.1007/s00122-024-04724-8. PMID: 39342533.

References

1. Heger A, Holm L. Exhaustive enumeration of protein domain families. J. Mol. Biol. 2003;328:749–767.
2. The UniProt Consortium. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res. 2010;38:D142–D148.
3. Coggill P, Finn RD, Bateman A. Identifying protein domains with the Pfam database. Curr. Protoc. Bioinformatics. 2008;Chapter 2:Unit 2.5.
4. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, et al. The Pfam protein families database. Nucleic Acids Res. 2010;38:D211–D222.
5. Daub J, Gardner PP, Tate J, Ramskold D, Manske M, Scott WG, Weinberg Z, Griffiths-Jones S, Bateman A. The RNA WikiProject: community annotation of RNA families. RNA. 2008;14:2462–2464.
