Comparative Study
BMC Bioinformatics. 2003 Mar 27:4:11.
doi: 10.1186/1471-2105-4-11. Epub 2003 Mar 27.

PreBIND and Textomy--mining the biomedical literature for protein-protein interactions using a support vector machine

Ian Donaldson et al. BMC Bioinformatics. 2003.

Abstract

Background: The majority of experimentally verified molecular interaction and biological pathway data are present in the unstructured text of biomedical journal articles, where they are inaccessible to computational methods. The Biomolecular Interaction Network Database (BIND) seeks to capture these data in a machine-readable format. We hypothesized that the formidable task of backfilling the database could be reduced by using support vector machine (SVM) technology to first locate interaction information in the literature. We present an information extraction system that was designed to locate protein-protein interaction data in the literature and present these data to curators and the public for review and entry into BIND.

Results: Cross-validation estimated the support vector machine's test-set precision, accuracy and recall for classifying abstracts describing interaction information at 92%, 90% and 92%, respectively. We estimated that the system would be able to recall up to 60% of all non-high-throughput interactions present in another yeast-protein interaction database. Finally, this system was applied to a real-world curation problem, where its use reduced the task duration by 70%, saving 176 days.
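
For readers unfamiliar with the three metrics quoted above, the standard definitions are given here in terms of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). This is offered as background only; the abstract does not spell out its formulas, so it is an assumption that these conventional definitions are the ones the authors used.

```latex
\begin{align}
\text{precision} &= \frac{TP}{TP + FP} \\
\text{recall}    &= \frac{TP}{TP + FN} \\
\text{accuracy}  &= \frac{TP + TN}{TP + FP + TN + FN}
\end{align}
```

Under these definitions, a "positive" is an abstract classified as describing interaction information.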

Conclusions: Machine learning methods are useful as tools to direct interaction and pathway database back-filling; however, this potential can only be realized if these techniques are coupled with human review and entry into a factual database such as BIND. The PreBIND system described here is available to the public at http://bind.ca. Current capabilities allow searching for human, mouse and yeast protein-interaction information.

Figures

Figure 1
Representation of a pathway using the BIND data-model. Each letter represents a molecular object such as DNA, RNA, protein, complex or small molecule. A simple interaction record (dotted box) describes an interface between two molecular objects. This paper focuses on information extraction for this type of record. Interaction records may describe a new molecule(s) created as a result of the interaction (dashed box). Molecular results of one interaction record can become the interactors in subsequent interaction records. In this way, multiple interaction records can be strung together to describe a biological pathway (solid box).
Figure 2
The PreBIND information extraction system. Details are provided in the text of the Results and Discussion section.
Figure 3
The PreBIND interface. Users can freely view and submit feedback about all potential interactions present in PreBIND. Potential interactions that are judged by users to be legitimate may be submitted to the BIND database for review by curators. Information gathered in this way will be used to further train the support vector machine used in the initial search and help develop natural language analysis algorithms.
Figure 4
Performance of SVM and naïve-Bayes classifiers. The performance of the SVM for identifying interaction abstracts was evaluated using 10-fold cross-validation on a set of 1094 abstracts. Performance on this task is measured in precision and recall. There is an implicit tradeoff between precision and recall, which can be adjusted by setting the decision boundary to some value other than 0. In this evaluation, when the decision boundary for the SVM is set to 1, recall and precision are 0.57 and 0.99 respectively. When the decision boundary is set to -0.99, recall and precision are 0.997 and 0.71 respectively. Finally, if the decision boundary is set to zero, precision and recall are both 0.92. In other words, when the decision boundary is set to zero and the SVM is applied to all abstracts in PubMed, it is expected to miss approximately 8% of interaction documents (recall), and approximately 8% of the documents it identifies as interaction documents will not be interaction documents (precision). Under similar conditions, the naïve-Bayes classifier described here would only have a precision and recall of 0.87.
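
The tradeoff described in this caption can be illustrated with a minimal Python sketch. It is not the PreBIND implementation: the function name `precision_recall`, the decision values and the labels below are all hypothetical, chosen only to show how moving the decision boundary away from 0 trades recall for precision.

```python
from typing import List, Tuple


def precision_recall(
    scores: List[float], labels: List[bool], threshold: float = 0.0
) -> Tuple[float, float]:
    """Return (precision, recall) when items scoring above `threshold`
    are classified as interaction abstracts."""
    tp = fp = fn = 0
    for score, is_interaction in zip(scores, labels):
        predicted = score > threshold
        if predicted and is_interaction:
            tp += 1
        elif predicted and not is_interaction:
            fp += 1
        elif not predicted and is_interaction:
            fn += 1
    # Convention (an assumption): precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical classifier decision values and true labels (not data from the paper).
scores = [1.8, 1.2, 0.6, 0.3, -0.2, -0.5, -0.9, -1.4]
labels = [True, True, True, False, True, False, False, False]

# The three boundaries mentioned in the caption: 1, 0 and -0.99.
for boundary in (1.0, 0.0, -0.99):
    p, r = precision_recall(scores, labels, boundary)
    print(f"boundary={boundary:+.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the boundary to 1 gives perfect precision but halves recall, while lowering it to -0.99 recovers every true interaction abstract at the cost of more false positives. This is the same qualitative pattern the caption reports for the SVM.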
