Review
Biochem J. 2009 Dec 10;424(3):317-33.
doi: 10.1042/BJ20091474.

Calling International Rescue: knowledge lost in literature and data landslide!

Teresa K Attwood et al. Biochem J.

Abstract

We live in interesting times. Portents of impending catastrophe pervade the literature, calling us to action in the face of unmanageable volumes of scientific data. But it isn't so much data generation per se, but the systematic burial of the knowledge embodied in those data that poses the problem: there is so much information available that we simply no longer know what we know, and finding what we want is hard - too hard. The knowledge we seek is often fragmentary and disconnected, spread thinly across thousands of databases and millions of articles in thousands of journals. The intellectual energy required to search this array of data-archives, and the time and money this wastes, has led several researchers to challenge the methods by which we traditionally commit newly acquired facts and knowledge to the scientific record. We present some of these initiatives here - a whirlwind tour of recent projects to transform scholarly publishing paradigms, culminating in Utopia and the Semantic Biochemical Journal experiment. With their promises to provide new ways of interacting with the literature, and new and more powerful tools to access and extract the knowledge sequestered within it, we ask what advances they make and what obstacles to progress still exist? We explore these questions, and, as you read on, we invite you to engage in an experiment with us, a real-time test of a new technology to rescue data from the dormant pages of published documents. We ask you, please, to read the instructions carefully. The time has come: you may turn over your papers...

Figures

Figure 1. Graphical illustration of the growth of biomedical research publications (red; current total >19 million), alongside the accumulation of research data, including nucleic acid sequences (black; current total ~163 million), computer-annotated protein sequences (magenta; current total 9 million), manually annotated protein sequences (green; current total 500000) and protein structures (blue; current total 60000)
Figure 2. Illustration of the use of COHSE
GO terms are highlighted in a webpage; clicking on these reveals glossary information from GO; link targets to PubMed abstracts (such as the one here from Current Opinion in Plant Biology [45]) are provided by modifying the preferences to use an appropriate Google search. (http://cohse.cs.manchester.ac.uk/). The ‘Cellular Respiration’ panel is reproduced from Kimball's Biology Pages (http://biology-pages.info) with permission from Professor John W. Kimball. The PubMed record of Weber, A.P. (2004) Solute transporters as connecting elements between cytosol and plastid stroma. Current Opinion in Plant Biology 7, 247–253, has been reproduced with permission from the National Library of Medicine and Elsevier.
Figure 3. Illustration of Prospect mark-up in part of a Molecular BioSystems article
Terms found in the source ontologies, which may be toggled on or off via the greyed-out Tools and Resources toolbar to the right of the page, are highlighted in different colours: e.g. pink highlights denote compound terms, which link out to diagrams of their structures, synonyms, Simplified Molecular Input Line Entry Specification (SMILES) nomenclature, etc.; yellow highlights link to definitions from the Gold Book; blue highlights are biomedical terms and green highlights are chemical terms, both of which link out to relevant definitions, synonyms and ontologies. Fragments of linked webpages are overlaid on this Figure as ‘callouts’. (http://www.rsc.org/Publishing/Journals/ProjectProspect/). The extract from Molecular BioSystems ([48]; Koenigs, M.B., Richardson, E.A. and Dube, D.H. (2009) Metabolic profiling of Helicobacter pylori glycosylation. Volume 5, 909–912; http://dx.doi.org/10.1039/b902178g) has been reproduced by permission of The Royal Society of Chemistry.
Figure 4. Example output from the ChemSpider Journal of Chemistry
Marked-up chemical entities include chemical families, chemical names (pale orange highlights), chemical groups (dark green) and reaction types, with links out to Wikipedia where appropriate (e.g. overlaid here as a ‘callout’). Displayed mark-up is controlled via the Article Mark-up toolbar, shown on the right-hand side of the screen-shot. (http://www.chemmantis.com). The extract from The ChemSpider Journal of Chemistry ([49]; Walker, M.A. (2009) Some highlights in synthetic organic methodology, article 895), has been reproduced by permission of The Royal Society of Chemistry.
Figure 5. The structured summary for one of the pilot articles in the FEBS Letters experiment [54]
Two interactions are described, with relevant references to their MINT and UniProtKB entries.
Figure 6. A PLoS Computational Biology article marked up using BioLit
Terms found in the source ontologies are highlighted in different colours (blue, GO terms; pink, physicochemical methods and properties ontology; purple, physicochemical process ontology). PDB identifiers are underlined. Clicking on the marked-up entities invokes pop-up menus displaying term definitions, and sequence and structural details from the PDB, as appropriate. (http://biolit.ucsd.edu/doc/). Reproduced from [57]; Gu, J., Gribskov, M. and Bourne, P.E. (2006) Wiggle-predicting functionally flexible regions from primary sequence. PLoS Computational Biology 2, e90.
Figure 7. The PLoS NTD article marked up using the system developed by Shotton et al. [34]
Users may select from the coloured tabs at the top of the page to reveal entities of interest in the text: here, the protein (purple), disease (red), habitat (green) and organism (blue) tabs have been chosen. Organism terms are linked to uBio, a community initiative to create a comprehensive catalogue of the names of all (living and once-living) organisms (e.g. overlaid here as a ‘callout’). (http://www.ubio.org). Reproduced from [58]; Reis, R.B., Ribeiro, G.S., Felzemburgh, R.D., Santana, F.S., Mohr, S., Melendez, A.X., Queiroz, A., Santos, A.C., Ravines, R.R., Tassinari, W.S. et al. (2008) Impact of environment and social gradient on Leptospira infection in urban slums. PLoS Neglected Tropical Diseases 2, e228.
Figure 8. Illustration of Reflect mark-up of a Biochemical Journal article
The text, from [59], shows tagged protein (blue) and chemical (gold) entities, and those for which both protein and chemical names are available (purple); clicking on a tagged entity invokes a pop-up summary, including links to features such as the structure of the protein (or chemical), its domain composition, its sequence, etc. The system is tuned for speed over accuracy, so users need to be aware of likely errors. (http://reflect.ws/).
Figure 9. Lynch imagines being able to toggle between a published table of numerical values and their graphical representation
For readers viewing this article using UD, from this typical table of data from the European Journal of Pharmaceutical Sciences [62], explore the result of clicking on the UD logo. Reproduced from Corti, G., Maestrelli, F., Cirri, M., Zerrouk, N. and Mura, P. (2006) Development and evaluation of an in vitro method for prediction of human drug absorption II. Demonstration of the method suitability. European Journal of Pharmaceutical Sciences 27, 354–362, Copyright (2006) with permission from Elsevier.
Figure 10. Bourne imagines reading a description of a molecule's active site, being instantly able to access its atomic co-ordinates, and thence to explore the interactions described in the paper
In this 2009 BJ paper, Vandermarliere et al. [64] describe the catalytic site of Bacillus subtilis arabinoxylan arabinofuranohydrolase. The catalytic domain is shown in blue and the carbohydrate-binding module in green. For readers viewing this article using UD, explore further by clicking on the UD logo.
Figure 11. Bourne imagines being able to find all papers that reference a particular sequence motif described in a paper
In this 2008 Biophysical Chemistry article [66], Illingworth et al. describe the GXXG motifs characteristic of the LanC (lanthionine synthetase C)-like proteins (a), and also reference them elsewhere in the literature (b), including their appearance in nisin cyclase, whose three-dimensional structure was determined by Li et al. [67], and in the putative G protein-coupled receptor (GPCR) GCR2 [68] (c). For readers viewing this article using UD, to bring life to this image and visualize the GXXG motifs, click on the UD logo. Reproduced from Illingworth, C.J.R., Parkes, K.E., Snell, C.R., Mullineaux, P.M. and Reynolds, C.A. (2008) Criteria for confirming sequence periodicity identified by Fourier transform analysis: application to GCR2, a candidate plant GPCR? Biophysical Chemistry 133, 28–35, Copyright (2008), with permission from Elsevier; and from Gao, Y., Zeng, Q., Guo, J., Cheng, J., Ellis, B. E. and Chen, J.-G. (2007) Genetic characterization reveals no role for the reported ABA receptor, GCR2, in ABA control of seed germination and early seedling development in Arabidopsis. The Plant Journal 52, 1001–1013 with permission from Wiley-Blackwell.
Figure 12. Comparison of a page from a ‘naked’ 2003 BJ article [59] (a) with a semantically enriched counterpart (b), annotated using more than 100 different ontologies
The colour overlay denotes the number of semantic relationships for particular areas (green areas having the least and red the most), illustrating the extent of the opportunities for mark-up that exist on a single page, and hence the need to balance both appropriate mark-up tools and appropriate levels of manual intervention to make this information usefully accessible to readers: mark-up too much information, and the reader is overwhelmed; mark-up too little, and the reader is denied access to the full semantic richness of the article. For readers viewing this article using UD, click on the UD logo.
Figure 13. Tools that could support the discovery of errors and inconsistencies could have profound consequences for the evolution of knowledge
In 2007, Liu et al. [71] reported in Science the discovery of a novel plant G protein-coupled receptor (GPCR), so-called GCR2 (a). Much of the supporting evidence rested on a ‘characteristic’ hydropathy profile (reported as a Supplementary Figure), which showed seven peaks, apparently consistent with known GPCR transmembrane (TM) domain topology (b). Illingworth et al. challenged this result, pointing to the clear similarity of GCR2 with LanC-like proteins and showing that the topology of the hydropathy profile was the result of the seven-fold symmetry of the inner helical toroid (the blue/green region in the centre of the structure) of this globular protein (c) [66]. It is interesting to compare a hydropathy plot (d) with that reported by Liu et al. (b), generated using the same DAS TM prediction server [72] – note the omission of the significance bars in the latter, which in the former show that only one of the seven peaks scores above the significance threshold for TM domains and hence argues strongly against this being a membrane protein. Compare the structure of a bona fide GPCR [bovine rhodopsin, PDB code 1F88 (e)] with the nisin cyclase structure shown in Illingworth's paper [PDB code 2G0D (c)]. Despite the obvious lack of sequence and structural similarity of GCR2 to genuine GPCRs, and its clear affiliation with the LanC-like proteins, this error has been propagated to the description line of its UniProt entry, even though the entry contains database cross-references to LanC-like proteins rather than GPCRs (f). For readers viewing this article using UD, click on the UD logos in the Figure to explore this scenario further. Reproduced from Illingworth, C.J.R., Parkes, K.E., Snell, C.R., Mullineaux, P.M. and Reynolds, C.A. (2008) Criteria for confirming sequence periodicity identified by Fourier transform analysis: application to GCR2, a candidate plant GPCR? Biophysical Chemistry 133, 28–35, Copyright (2008), with permission from Elsevier; and from Liu, X. G., Yue, Y. L., Li, B., Nie, Y. L., Li, W., Wu, W. H. and Ma, L. G. (2007) A G protein-coupled receptor is a plasma membrane receptor for the plant hormone abscisic acid. Science 315, 1712–1716 (http://www.sciencemag.org/cgi/content/abstract/315/5819/1712), with permission from AAAS.

References

    1. Roos D. Bioinformatics: trying to swim in a sea of data. Science. 2001;291:1260–1261.
    2. Gerhold D., Rushmore T., Caskey C. T. DNA chips: promising toys have become powerful tools. Trends Biochem. Sci. 1999;24:168–173.
    3. Andrade M., Sander C. Bioinformatics: from genome data to biological knowledge. Curr. Opin. Biotechnol. 1997;8:675–683.
    4. Hess K. R., Zhang W., Baggerly K. A., Stivers D. N., Coombes K. R., Zhang W. Micro-arrays: handling the deluge of data and extracting reliable information. Trends Biotechnol. 2001;19:463–468.
    5. Editorial. Prepare for the deluge. Nat. Biotechnol. 2008;26:1099.
