Automatic, context-specific generation of Gene Ontology slims

BMC Bioinformatics. 2010 Oct 7;11:498. doi: 10.1186/1471-2105-11-498.

Authors

Melissa J Davis¹, Muhammad Shoaib B Sehgal, Mark A Ragan

Affiliation

¹ The University of Queensland, Brisbane, QLD 4072, Australia. m.ragan@imb.uq.edu.au

PMID: 20929524
PMCID: PMC3098080
DOI: 10.1186/1471-2105-11-498

Abstract

Background: The use of ontologies to control vocabulary and structure annotation has added value to genome-scale data, and contributed to the capture and re-use of knowledge across research domains. Gene Ontology (GO) is widely used to capture detailed expert knowledge in genomic-scale datasets and as a consequence has grown to contain many terms, making it unwieldy for many applications. To increase its ease of manipulation and efficiency of use, subsets called GO slims are often created by collapsing terms upward into more general, high-level terms relevant to a particular context. Creation of a GO slim currently requires manipulation and editing of GO by an expert (or community) familiar with both the ontology and the biological context. Decisions about which terms to include are necessarily subjective, and the creation process itself and subsequent curation are time-consuming and largely manual.

Results: Here we present an objective framework for generating customised ontology slims for specific annotated datasets, exploiting information latent in the structure of the ontology graph and in the annotation data. This framework combines ontology engineering approaches with a data-driven algorithm that draws on graph and information theory. We illustrate this method by application to GO, generating GO slims at different information thresholds, characterising their depth of semantics and demonstrating the resulting gains in statistical power.

Conclusions: Our GO slim creation pipeline is available for use in conjunction with any GO-annotated dataset, and creates dataset-specific, objectively defined slims. This method is fast and scalable for application to other biomedical ontologies.

Figures

**Figure 1**
**Frequency distribution of Information Content (I) values obtained for BP (triangle), CC (square) and MF (diamond) terms annotated in the SGD Yeast data**. The number of terms with values of I between -0.5 and 0.5 is shown, and includes 96.6% of BP terms, 96.3% of CC terms, and 97.5% of MF terms.

**Figure 2**
GO slim generated using an information content threshold of 0.1, showing the top level of the Biological Process hierarchy with full expansion of the children of *response to stimulus*.

**Figure 3**
GO slim generated using an information content threshold of 0.2, showing the top level of the Biological Process hierarchy with full expansion of the children of *response to stimulus*.

**Figure 4**
GO slim generated using an information content threshold of 0.3, showing the top level of the Biological Process hierarchy with full expansion of the children of *response to stimulus*.

**Figure 5**
**Comparison of the goslim_yeast maintained by SGD with the GO slim generated by our method across a range of threshold values**. Threshold τ is given on the x axis, while the y axis presents the number of GO terms on a log scale. The dotted line represents the percentage of terms in the IC goslim that are also found in the SGD goslim_yeast set (minimum: 1.5%, maximum: 22.4%).
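
The percentage described in this caption is taken relative to the generated slim, not the SGD slim. A minimal Python sketch of that calculation follows; the function name and the term identifiers are hypothetical placeholders and do not come from the paper or from any GO release.

```python
def percent_shared(generated_slim, reference_slim):
    """Percentage of generated-slim terms that also occur in the reference slim."""
    generated = set(generated_slim)
    if not generated:
        return 0.0
    shared = generated & set(reference_slim)
    return 100.0 * len(shared) / len(generated)


# Placeholder identifiers, invented for illustration only.
generated = {"GO:A", "GO:B", "GO:C", "GO:D"}
reference = {"GO:B", "GO:D", "GO:E"}

# Two of the four generated terms are also in the reference set.
print(percent_shared(generated, reference))
```

Because the denominator is the size of the generated slim, a larger generated slim can lower the percentage even when the number of shared terms stays the same.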

**Figure 6**
Specification of GO slim subsets in the GO OBO file: (A) GO slim subsets are initially specified with a subsetdef: statement in the header; (B) each individual term that is a member of the GO slim is annotated with a subset: property in the term definition statement.
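
The two-part specification described in this caption (a `subsetdef:` line in the header, then a `subset:` tag inside each member term) can be read back with a few lines of Python. The sketch below is illustrative only: the OBO fragment, the term IDs and the subset name `example_slim` are invented placeholders, and the parser handles only the tags it needs rather than the full OBO format.

```python
from collections import defaultdict

# Hypothetical OBO fragment; IDs, names and the subset label are placeholders.
OBO_TEXT = """\
format-version: 1.2
subsetdef: example_slim "Example GO slim"

[Term]
id: GO:0000001
name: example term A
subset: example_slim

[Term]
id: GO:0000002
name: example term B

[Term]
id: GO:0000003
name: example term C
subset: example_slim
"""


def read_subsets(obo_text):
    """Return (declared subsets, subset name -> list of member term IDs)."""
    declared = {}
    members = defaultdict(list)
    stanza = None      # None while still reading the file header
    current_id = None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("!"):
            continue   # skip blank lines and OBO comment lines
        if line.startswith("[") and line.endswith("]"):
            stanza = line
            current_id = None
            continue
        tag, _, value = line.partition(":")
        value = value.strip()
        if stanza is None and tag == "subsetdef":
            # (A) header declaration: subset name followed by a quoted description
            name, _, description = value.partition(" ")
            declared[name] = description.strip().strip('"')
        elif stanza == "[Term]" and tag == "id":
            current_id = value
        elif stanza == "[Term]" and tag == "subset" and current_id:
            # (B) per-term membership
            members[value.split()[0]].append(current_id)
    return declared, dict(members)


declared, members = read_subsets(OBO_TEXT)
print(declared)
print(members)
```

Keeping the declaration and the membership in separate structures mirrors the two locations in the file, so a term tagged with an undeclared subset can be detected by comparing the two sets of keys.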

**Figure 7**
**Selection of a specific GO slim subset in the OBO Edit application: screen shot from OBO Edit showing the term filter set to view the human_scl category**. The Search & Filter tab is selected in the top right panel, and the term filter is set to select terms for which the category contains human_scl. The term *cell* is highlighted in the viewing panel on the left, and the term definition and other information (e.g. category membership) is displayed on the right. Categories to which a term belongs are ticked.

**Figure 8**
**DAG fragment to illustrate mapping from terms used in annotation to terms selected for the GO slim, using all paths through the graph to the root term**. In this DAG fragment, terms 1-3 are used to annotate gene products, and term 11 represents the root of the graph fragment. Paths from terms used in annotating gene products back to the root term are used to map from the full graph to the slim graph. All possible mappings are created, and account for the kinds of relations (either *is_a* or *part_of*) used to construct the graph.
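
The path-based mapping in this caption can be sketched in Python as follows. Only the annotated terms 1-3 and the root term 11 come from the caption; the intermediate terms, the edges, the choice of slim terms and the function names are hypothetical. The rule that any path containing a *part_of* edge yields a *part_of* mapping is the usual GO relation-composition convention and is assumed here, not taken from the paper.

```python
# Hypothetical DAG fragment: child -> list of (parent, relation).
# Terms 1-3 (annotated) and 11 (root) follow the caption; all other
# terms and edges are invented for illustration.
PARENTS = {
    1: [(4, "is_a")],
    2: [(4, "is_a"), (5, "part_of")],
    3: [(6, "is_a")],
    4: [(7, "is_a")],
    5: [(8, "part_of")],
    6: [(8, "is_a")],
    7: [(11, "is_a")],
    8: [(11, "is_a")],
    11: [],
}
SLIM = {7, 8, 11}  # assumed slim terms


def paths_to_root(term, parents):
    """List every path from term to a root as (term, relation-to-next-term) steps."""
    edges = parents.get(term, [])
    if not edges:
        return [[(term, None)]]  # a root has no outgoing relation
    paths = []
    for parent, relation in edges:
        for tail in paths_to_root(parent, parents):
            paths.append([(term, relation)] + tail)
    return paths


def map_to_slim(term, parents, slim):
    """Map term to every slim term on any path to the root, with the composed relation."""
    mappings = set()
    for path in paths_to_root(term, parents):
        seen_part_of = False
        for node, relation in path:
            if node in slim:
                mappings.add((node, "part_of" if seen_part_of else "is_a"))
            if relation == "part_of":
                seen_part_of = True
    return mappings


for annotated in (1, 2, 3):
    print(annotated, sorted(map_to_slim(annotated, PARENTS, SLIM)))
```

Enumerating paths explicitly is exponential in the worst case; it is used here because it matches the caption's wording, whereas a production implementation would more likely propagate relations with a memoised traversal.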
