PLoS One. 2012;7(1):e29509.
doi: 10.1371/journal.pone.0029509. Epub 2012 Jan 3.

Connecting the dots between PubMed abstracts

M Shahriar Hossain et al. PLoS One. 2012.

Abstract

Background: There are now a multitude of articles published in a diversity of journals providing information about genes, proteins, pathways, and diseases. Each article investigates subsets of a biological process, but to gain insight into the functioning of a system as a whole, we must integrate information from multiple publications. Particularly, unraveling relationships between extra-cellular inputs and downstream molecular response mechanisms requires integrating conclusions from diverse publications.

Methodology: We present an automated approach to biological knowledge discovery from PubMed abstracts, suitable for "connecting the dots" across the literature. We describe a storytelling algorithm that, given a start and end publication, typically with little or no overlap in content, identifies a chain of intermediate publications from one to the other, such that neighboring publications have significant content similarity. The quality of discovered stories is measured using local criteria such as the size of supporting neighborhoods for each link and the strength of individual links connecting publications, as well as global metrics of dispersion. To ensure that the story stays coherent as it meanders from one publication to another, we demonstrate the design of novel coherence and overlap filters for use as post-processing steps.
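The chaining idea can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes each abstract is reduced to a set of terms, uses Jaccard distance as the similarity measure, and swaps in a plain breadth-first search, ignoring the neighborhood-size and link-strength criteria described above. The names `jaccard_distance`, `find_story`, and the toy documents are hypothetical.

```python
from collections import deque


def jaccard_distance(a, b):
    """Jaccard distance between two term sets (0 = identical, 1 = disjoint)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def find_story(docs, start, end, threshold):
    """Breadth-first search for a shortest chain from start to end in which
    every pair of neighboring documents is within `threshold` distance.
    Returns the chain as a list of document ids, or None if no chain exists."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == end:
            chain = []
            while current is not None:
                chain.append(current)
                current = parent[current]
            return chain[::-1]
        for candidate in docs:
            if candidate in parent:
                continue
            if jaccard_distance(docs[current], docs[candidate]) <= threshold:
                parent[candidate] = current
                queue.append(candidate)
    return None


# Toy term sets (assumed for illustration, not taken from real abstracts).
docs = {
    "A": {"il-6", "cytokine", "inflammation"},
    "B": {"inflammation", "brain", "injury", "cytokine"},
    "C": {"brain", "injury", "parp"},
    "D": {"parp", "poly-adp-ribose", "injury"},
}

story = find_story(docs, "A", "D", threshold=0.7)
print(story)  # ['A', 'B', 'C', 'D']
```

In this toy data, documents A and D share no terms, yet a chain exists because each neighboring pair overlaps. Tightening the threshold to 0.5 leaves A without any neighbor, so no story is found.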

Conclusions: We demonstrate the application of our storytelling algorithm to three case studies: i) a many-to-one study exploring relationships between multiple cellular inputs and a molecule responsible for cell-fate decisions, ii) a many-to-many study exploring the relationships between multiple cytokines and multiple downstream transcription factors, and iii) a one-to-one study to showcase the ability to recover a cancer-related association, viz. the Warburg effect, from past literature. The storytelling pipeline helps narrow down a scientist's focus from several hundreds of thousands of relevant documents to only around a hundred stories. We argue that our approach can serve as a valuable discovery aid for hypothesis generation and connection exploration in large unstructured biological knowledge bases.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Constructing stories out of clique chains between PubMed IDs 19166837 (a document about IL-6) and 3000421 (a document about Poly ADP-ribose).
Panel (a) shows a path with the least stringent requirements. As the requirements become more stringent in (b) and (c), the story becomes longer. Panel (d) shows a story that is significant from both statistical and biological viewpoints. It connects IL-6, one of the inflammatory cytokines, with brain injury and Poly(ADP-ribose) Polymerase (PARP) deficiency or inhibition. The generated hypothesis suggests that PAR/PARP could impact signaling of IL-6 or other interleukins. This story is covered in detail in Table 4. Fig. 9 depicts the dispersion plots of the four stories derived from the clique chains of this figure.
Figure 2. Storytelling pipeline.
The pipeline takes a set of input and output molecules, applies algorithmic approaches to handle the research issues in story generation, and at the end outputs significant stories.
Figure 3. The concept lattice for a given dataset.
Each concept is a pair: (document set, term set). We can find an approximate set of nearest neighbors for a document d from the document list of the concept containing d and the longest term set.
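The neighbor lookup described in this caption can be sketched as follows. This is a minimal illustration that assumes the concepts are already available as (document set, term set) pairs; building the lattice itself is not shown, and the function name `approximate_neighbors` and the sample concepts are hypothetical.

```python
def approximate_neighbors(concepts, doc_id):
    """Among the concepts whose document set contains doc_id, pick the one
    with the longest term set and return its other documents."""
    containing = [c for c in concepts if doc_id in c[0]]
    if not containing:
        return set()
    documents, _terms = max(containing, key=lambda c: len(c[1]))
    return set(documents) - {doc_id}


# Toy concepts (assumed): each is a (document set, term set) pair.
concepts = [
    ({"d1", "d2", "d3"}, {"parp"}),
    ({"d1", "d2"}, {"parp", "injury"}),
    ({"d3"}, {"parp", "il-6"}),
]

print(approximate_neighbors(concepts, "d1"))  # {'d2'}
```

The concept with the longest term set is the most specific one containing the document, so its document list holds the documents that share the most terms with it.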
Figure 4. Classification of context/result sentences.
The training set contains 100 randomly selected documents (449 context sentences, 565 result sentences). (a) Bin probabilities of the training dataset. Sentences closer to the title have a higher probability of being context sentences, and the sentences furthest from the title tend to be result sentences. (b) Phrase probabilities in the training set. Note that some phrases favor classifying a sentence as a context sentence and others favor classifying it as a result sentence.
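As a rough sketch of how bin probabilities like those in panel (a) might be estimated, the following assumes that a "bin" is an equal-width slice of sentence positions within an abstract and that each training sentence carries a label of c (context) or r (result). The binning scheme, the function names, and the toy labels are assumptions, not details from the source.

```python
from collections import defaultdict


def bin_index(position, total, n_bins):
    """Map a 0-based sentence position to one of n_bins equal-width bins."""
    return min(position * n_bins // total, n_bins - 1)


def context_probability_by_bin(labeled_abstracts, n_bins=5):
    """Estimate, for each position bin, the fraction of sentences labeled 'c'."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [context count, total count]
    for labels in labeled_abstracts:
        for position, label in enumerate(labels):
            b = bin_index(position, len(labels), n_bins)
            counts[b][1] += 1
            if label == "c":
                counts[b][0] += 1
    return {b: c / t for b, (c, t) in sorted(counts.items())}


# Toy training data (assumed): one label per sentence, in abstract order.
labeled_abstracts = [
    ["c", "c", "r", "r"],
    ["c", "r", "r", "r"],
]

probabilities = context_probability_by_bin(labeled_abstracts, n_bins=2)
print(probabilities)  # {0: 0.75, 1: 0.0}
```

The toy output mirrors the trend in the caption: early sentences are mostly context, late sentences are mostly results.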
Figure 5. An example of a story with broken context.
The story, connecting documents PubMedID: 8593581 and PubMedID: 6311640, is removed by our context overlap filter. The filter recognizes that the last pair of documents has no overlap in its context sentences. Each sentence in the abstracts of the story is marked with either c or r to indicate its role as a context or result sentence. Important terms are highlighted in color (red for terms in context sentences and blue for terms in result sentences).
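The filtering rule in this caption can be sketched under simple assumptions: each document is a list of (label, terms) pairs, one per sentence, and a story is rejected when any consecutive pair of documents shares no term across its context sentences. The data layout and the names `context_terms` and `passes_context_overlap` are hypothetical; the actual filter may use a different overlap criterion.

```python
def context_terms(document):
    """Union of the terms in the sentences labeled 'c' (context)."""
    terms = set()
    for label, sentence_terms in document:
        if label == "c":
            terms |= set(sentence_terms)
    return terms


def passes_context_overlap(story):
    """True if every consecutive pair of documents shares a context term."""
    for left, right in zip(story, story[1:]):
        if not (context_terms(left) & context_terms(right)):
            return False
    return True


# Toy documents (assumed): one (label, terms) pair per sentence.
doc1 = [("c", {"cytokine", "il-6"}), ("r", {"stat3"})]
doc2 = [("c", {"il-6", "brain"}), ("r", {"injury"})]
doc3 = [("c", {"yeast"}), ("r", {"brain", "injury"})]

print(passes_context_overlap([doc1, doc2]))        # True
print(passes_context_overlap([doc1, doc2, doc3]))  # False
```

In the second story, the last pair overlaps only through result-sentence terms, so the story is rejected, which matches the broken-context case the caption describes.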
Figure 6. An example story that was pruned by our sentence cohesion filter.
The figure shows that a story between PubMedIDs 19563795 and 18619441 is pruned because it does not have a cohesive sentence-level path.
Figure 7. An example story that passes our sentence cohesion filter.
The figure shows a story between documents with PubMedIDs 19166837 and 3000421. The story contains many sentence-level cohesive paths from the start document to the end. One of these sentence-level paths is highlighted in this figure.
Figure 8. Dispersion plots.
The dispersion plot of a story gives an overall idea of the overlap of terms between each pair of documents in the story. (a) shows the dispersion plot of a quality story with dispersion coefficient [formula image], and (b) shows the dispersion plot of a story with dispersion coefficient [formula image].
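To make the idea of pairwise term overlap concrete, here is a simplified stand-in score. It is explicitly not the paper's dispersion coefficient, whose formula is not given on this page. The sketch assumes each document is a term set, uses Jaccard distance, and scores a story as 1 minus the fraction of non-adjacent document pairs that still overlap within a threshold. All names are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Jaccard distance between two term sets (0 = identical, 1 = disjoint)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def simple_dispersion(story, threshold):
    """Stand-in score: 1.0 when no non-adjacent pair of documents overlaps
    within the threshold, lower when the story doubles back on itself."""
    pairs = [(i, j) for i, j in combinations(range(len(story)), 2) if j - i >= 2]
    if not pairs:
        return 1.0
    linked = sum(
        1 for i, j in pairs if jaccard_distance(story[i], story[j]) <= threshold
    )
    return 1.0 - linked / len(pairs)


# Toy stories (assumed term sets).
spread_out = [
    {"il-6", "cytokine", "inflammation"},
    {"inflammation", "brain", "injury", "cytokine"},
    {"brain", "injury", "parp"},
    {"parp", "poly-adp-ribose", "injury"},
]
clustered = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "e"}]

print(simple_dispersion(spread_out, threshold=0.7))  # 1.0
print(simple_dispersion(clustered, threshold=0.7))   # 0.0
```

A story whose documents only overlap with their immediate neighbors scores high, while a story that stays inside one tight group of similar documents scores low.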
Figure 9. Dispersion plots and dispersion coefficients.
Dispersion plots a, b, c, and d of this figure are associated with the respective stories of Figure 1.
Figure 10. Variation in number of stories.
The number of stories generated decreases monotonically as the clique size and distance threshold requirements become more stringent. Plot (a) shows the number of stories for different parameter settings over the same 50,751 start-end document pairs of case study 1. Plot (b) shows the corresponding graph for the 5,646 start-end document pairs of case study 2.
Figure 11. Relation between clique chain and distance threshold.
This plot shows, for different distance thresholds, the largest k for which at least one k-clique chain is discovered. Note that the stricter the distance threshold, the smaller this k becomes. Both case studies exhibit the same monotonic relationship between clique size and distance threshold requirements.
Figure 12. A screenshot of the Storygrapher interface.
The screenshot displays the 10 most significant stories beginning with documents labeled by stanniocalcin and ending at documents marked by poly-ADP-ribose. Storygrapher helps in analyzing significant stories in a single spatial environment.
Figure 13. A screenshot of the Storygrapher interface.
The screenshot displays the 10 most significant stories starting with documents marked by nicotinamide and ending in documents marked by poly-ADP-ribose.
Figure 14. Distributions of dispersion.
The plots show the distributions of dispersion of the final set of stories for the three case studies. The distributions show that the resulting stories of all the case studies have high dispersion coefficients overall.
Figure 15. Distribution of clusters in the stories.
Each of the plots shows the number of traditional clusters in the stories. The peak is found at four clusters in each of the case studies. The plots show that in each of the case studies, more than 40% of the stories pass through four clusters.
