BMC Bioinformatics. 2010 Oct 7:11:498.
doi: 10.1186/1471-2105-11-498.

Automatic, context-specific generation of Gene Ontology slims

Melissa J Davis et al. BMC Bioinformatics. 2010.

Abstract

Background: The use of ontologies to control vocabulary and structure annotation has added value to genome-scale data, and contributed to the capture and re-use of knowledge across research domains. Gene Ontology (GO) is widely used to capture detailed expert knowledge in genomic-scale datasets and as a consequence has grown to contain many terms, making it unwieldy for many applications. To increase its ease of manipulation and efficiency of use, subsets called GO slims are often created by collapsing terms upward into more general, high-level terms relevant to a particular context. Creation of a GO slim currently requires manipulation and editing of GO by an expert (or community) familiar with both the ontology and the biological context. Decisions about which terms to include are necessarily subjective, and the creation process itself and subsequent curation are time-consuming and largely manual.

Results: Here we present an objective framework for generating customised ontology slims for specific annotated datasets, exploiting information latent in the structure of the ontology graph and in the annotation data. This framework combines ontology engineering approaches with a data-driven algorithm that draws on graph and information theory. We illustrate this method by application to GO, generating GO slims at different information thresholds, characterising their depth of semantics and demonstrating the resulting gains in statistical power.

Conclusions: Our GO slim creation pipeline is available for use in conjunction with any GO-annotated dataset, and creates dataset-specific, objectively defined slims. This method is fast and scalable for application to other biomedical ontologies.

Figures

Figure 1
Frequency distribution of Information Content (I) values obtained for BP (triangle), CC (square) and MF (diamond) terms annotated in the SGD Yeast data. The number of terms with values of I between -0.5 and 0.5 is shown, and includes 96.6% of BP terms, 96.3% of CC terms, and 97.5% of MF terms.
Figure 2
GO slim generated using an information content threshold of 0.1, showing the top level of the Biological Process hierarchy with full expansion of the children of response to stimulus.
Figure 3
GO slim generated using an information content threshold of 0.2, showing the top level of the Biological Process hierarchy with full expansion of the children of response to stimulus.
Figure 4
GO slim generated using an information content threshold of 0.3, showing the top level of the Biological Process hierarchy with full expansion of the children of response to stimulus.
Figure 5
Comparison of the goslim_yeast maintained by SGD with the GO slim generated by our method across a range of threshold values. Threshold τ is given on the x axis, while the y axis presents the number of GO terms on a log scale. The dotted line represents the percentage of terms in the IC GO slim that are also found in the SGD goslim_yeast set (minimum: 1.5%, maximum: 22.4%).
Figure 6
Specification of GO slim subsets in the GO OBO file: (A) GO slim subsets are initially specified with a subsetdef: statement in the header; (B) each individual term that is a member of the GO slim is annotated with a subset: property in the term definition statement.
Figure 7
Selection of a specific GO slim subset in the OBO Edit application: screen shot from OBO Edit showing the term filter set to view the human_scl category. The Search & Filter tab is selected in the top right panel, and the term filter is set to select terms for which the category contains human_scl. The term cell is highlighted in the viewing panel on the left, and the term definition and other information (e.g. category membership) is displayed on the right. Categories to which a term belongs are ticked.
Figure 8
DAG fragment to illustrate mapping between terms used in annotation to terms selected for GO slim using all paths through the graph to the root term. In this DAG fragment, terms 1-3 are used to annotate gene products, and term 11 represents the root of the graph fragment. Paths from terms used in annotating gene products back to the root term are used to map from the full graph to the slim graph. All possible mappings are created, and account for the kinds of relations (either is_a, or part_of) used to construct the graph.
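
The all-paths mapping described for Figure 8 can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the term identifiers, the edge list, the slim membership, and the function name `map_to_slim` are all hypothetical. The rule that a path counts as part_of when any edge along it is part_of, and the choice to stop at the nearest slim term on each path, are assumptions made for the example.

```python
# Hypothetical DAG fragment: child term -> list of (parent term, relation).
# "t11" plays the role of the root; "t1".."t3" are the annotated terms.
PARENTS = {
    "t1": [("t4", "is_a")],
    "t2": [("t4", "is_a"), ("t5", "part_of")],
    "t3": [("t5", "is_a")],
    "t4": [("t11", "is_a")],
    "t5": [("t11", "part_of")],
    "t11": [],
}

# Hypothetical slim: the terms retained after thresholding.
SLIM = {"t4", "t11"}


def map_to_slim(term, slim, parents):
    """Follow every upward path from `term` and collect slim mappings.

    Returns a set of (slim_term, relation) pairs. On each path the walk
    stops at the first slim term it meets (an assumption), and the path
    relation is "part_of" if any edge on the path is part_of, else "is_a"
    (also an assumption).
    """
    mappings = set()
    stack = [(term, "is_a")]  # a term maps to itself via is_a
    while stack:
        node, relation = stack.pop()
        if node in slim:
            mappings.add((node, relation))
            continue  # nearest slim term found on this path
        for parent, edge in parents.get(node, []):
            combined = "part_of" if "part_of" in (relation, edge) else "is_a"
            stack.append((parent, combined))
    return mappings


if __name__ == "__main__":
    for annotated in ("t1", "t2", "t3"):
        print(annotated, sorted(map_to_slim(annotated, SLIM, PARENTS)))
```

In this toy fragment, "t2" reaches two slim terms by different paths, so both mappings are kept, each with the relation implied by its path.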
