Connecting the dots between PubMed abstracts

PLoS One. 2012;7(1):e29509.

doi: 10.1371/journal.pone.0029509. Epub 2012 Jan 3.

M Shahriar Hossain¹, Joseph Gresock, Yvette Edmonds, Richard Helm, Malcolm Potts, Naren Ramakrishnan

PMID: 22235301
PMCID: PMC3250456
DOI: 10.1371/journal.pone.0029509

M Shahriar Hossain et al. PLoS One. 2012.

Affiliation

¹ Department of Computer Science, Virginia Tech, Blacksburg, Virginia, United States of America. msh@vt.edu

Abstract

Background: A multitude of articles, published in a wide range of journals, now provide information about genes, proteins, pathways, and diseases. Each article investigates a subset of a biological process, but to gain insight into the functioning of a system as a whole, we must integrate information from multiple publications. In particular, unraveling relationships between extra-cellular inputs and downstream molecular response mechanisms requires integrating conclusions from diverse publications.

Methodology: We present an automated approach to biological knowledge discovery from PubMed abstracts, suitable for "connecting the dots" across the literature. We describe a storytelling algorithm that, given a start and end publication, typically with little or no overlap in content, identifies a chain of intermediate publications from one to the other, such that neighboring publications have significant content similarity. The quality of discovered stories is measured using local criteria such as the size of supporting neighborhoods for each link and the strength of individual links connecting publications, as well as global metrics of dispersion. To ensure that the story stays coherent as it meanders from one publication to another, we demonstrate the design of novel coherence and overlap filters for use as post-processing steps.
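The chaining idea in the methodology can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes each abstract has already been reduced to a set of terms, uses Jaccard similarity as a stand-in for "significant content similarity", and finds a shortest chain by breadth-first search. The names `jaccard`, `find_story`, and `min_sim`, and the toy documents, are hypothetical.

```python
from collections import deque


def jaccard(a, b):
    """Jaccard similarity between two term sets (an assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_story(docs, start, end, min_sim=0.3):
    """Breadth-first search for a shortest chain from start to end in which
    every pair of neighboring documents has similarity >= min_sim."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == end:
            # Walk back through the parent links to recover the chain.
            story = []
            while current is not None:
                story.append(current)
                current = parents[current]
            return story[::-1]
        for other in docs:
            if other not in parents and jaccard(docs[current], docs[other]) >= min_sim:
                parents[other] = current
                queue.append(other)
    return None  # no chain satisfies the similarity requirement


# Toy term sets; the document IDs and terms are made up for illustration.
docs = {
    "A": {"il-6", "cytokine", "inflammation"},
    "B": {"cytokine", "inflammation", "brain", "injury"},
    "C": {"brain", "injury", "parp", "inhibition"},
    "D": {"parp", "inhibition", "poly-adp-ribose"},
}
story = find_story(docs, "A", "D")
print(story)
```

In this toy data the start and end documents share no terms, yet a chain exists through intermediate documents. Raising `min_sim` makes the requirement more stringent and can leave no chain at all.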

Conclusions: We demonstrate the application of our storytelling algorithm to three case studies: i) a many-to-one study exploring relationships between multiple cellular inputs and a molecule responsible for cell-fate decisions, ii) a many-to-many study exploring the relationships between multiple cytokines and multiple downstream transcription factors, and iii) a one-to-one study to showcase the ability to recover a cancer-related association, viz. the Warburg effect, from past literature. The storytelling pipeline helps narrow down a scientist's focus from several hundreds of thousands of relevant documents to only around a hundred stories. We argue that our approach can serve as a valuable discovery aid for hypothesis generation and connection exploration in large unstructured biological knowledge bases.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Constructing stories out of clique chains between PubMed IDs 19166837 (a document about IL-6) and 3000421 (a document about Poly ADP-ribose).**
Figure (a) shows a path with the least stringent requirements. As the requirements become more and more stringent in (b) and (c), the story becomes longer. Figure (d) shows a significant story from both statistical and biological viewpoints. It connects one of the inflammatory cytokines IL-6 with brain injury and Poly(ADP-ribose) Polymerase (PARP) deficiency or inhibition. The generated hypothesis suggests that PAR/PARP could impact signaling of IL-6 or other interleukins. This story is covered in detail in Table 4. Fig. 9 depicts the dispersion plots of the four stories derived from the clique chains of this figure.

**Figure 2. Storytelling pipeline.**
The pipeline takes a set of input and output molecules, applies algorithmic approaches to handle the research issues in story generation, and at the end outputs significant stories.

**Figure 3. The concept lattice for a given dataset.**
Each concept is a pair: (document set, term set). We can find an approximate set of nearest neighbors for a document d from the document list of the concept containing d and the longest term set.
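The neighbor lookup described in this caption could be sketched as follows. This is an illustration under assumptions: concepts are given as a plain list of (document set, term set) pairs, the toy list is hand-made and not necessarily a complete lattice, and the function name `approximate_neighbors` is hypothetical.

```python
def approximate_neighbors(concepts, doc):
    """Return the other documents of the concept that contains `doc`
    and has the longest term set; an empty set if no concept contains it."""
    containing = [c for c in concepts if doc in c[0]]
    if not containing:
        return set()
    doc_set, _terms = max(containing, key=lambda c: len(c[1]))
    return set(doc_set) - {doc}


# Hand-made toy concepts: (document set, term set).
concepts = [
    ({"d1", "d2", "d3", "d4"}, {"cell"}),
    ({"d1", "d2", "d3"}, {"cell", "cytokine"}),
    ({"d1", "d2"}, {"cell", "cytokine", "il-6"}),
    ({"d3", "d4"}, {"cell", "parp"}),
]
neighbors = approximate_neighbors(concepts, "d1")
print(neighbors)
```

Choosing the concept with the longest term set favors documents that share the most terms with `d`, which is why the result is only an approximate neighbor set.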

**Figure 4. Classification of context/result sentences.**
The training set contains 100 randomly selected documents (449 context sentences, 565 result sentences). (a) Bin probabilities of the training dataset. Sentences closer to the title have a higher probability of being context sentences, and the sentences furthest from the title tend to be result sentences. (b) Phrase probabilities in the training set. Note that some phrases favor classifying a sentence as a context sentence and others favor classifying it as a result sentence.

**Figure 5. An example of a story with broken context.**
The story, connecting documents PubMedID: 8593581 and PubMedID: 6311640, is removed by our context overlap filter. The filter recognizes that the last pair of documents has no overlap in its context sentences. Each sentence of the abstracts in the story is marked by either c or r to indicate its role as a *context* or *result* sentence. Important terms are highlighted by colors (red is used for the terms in context sentences and blue for the terms in result sentences).
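One way to read the context overlap filter is that every pair of neighboring documents must share at least one term among their context sentences. The sketch below implements that reading only; the data layout (each sentence as a `(role, terms)` pair) and the name `passes_context_overlap` are assumptions, and the actual filter may use a different overlap criterion.

```python
def passes_context_overlap(story):
    """story: list of documents; each document is a list of (role, terms)
    pairs, where role is 'c' (context) or 'r' (result).
    Returns False if any neighboring pair shares no context terms."""

    def context_terms(document):
        terms = set()
        for role, sentence_terms in document:
            if role == "c":
                terms |= set(sentence_terms)
        return terms

    contexts = [context_terms(document) for document in story]
    return all(a & b for a, b in zip(contexts, contexts[1:]))


# Toy stories with made-up terms.
doc1 = [("c", {"cytokine", "signaling"}), ("r", {"il-6"})]
doc2 = [("c", {"signaling", "brain"}), ("r", {"injury"})]
doc3 = [("c", {"enzyme"}), ("r", {"brain", "parp"})]

kept = passes_context_overlap([doc1, doc2])
pruned = passes_context_overlap([doc1, doc2, doc3])
print(kept, pruned)
```

In the second story, the last pair overlaps only through a result sentence of the final document, so the context check fails and the story would be pruned.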

**Figure 6. An example story that was pruned by our sentence cohesion filter.**
The figure shows that a story between PubMedIDs 19563795 and 18619441 is pruned because it does not have a cohesive sentence-level path.

**Figure 7. An example story that passes our sentence cohesion filter.**
The figure shows a story between documents with PubMedIDs 19166837 and 3000421. The story contains many sentence-level cohesive paths from the start document to the end. One of these sentence-level paths is highlighted in this figure.

**Figure 8. Dispersion plots.**
The dispersion plot of a story gives an overall idea about overlap of terms between each pair of documents in the story. (a) shows a dispersion plot of a quality story with dispersion coefficient [formula image], and (b) shows the dispersion plot of a story with dispersion coefficient [formula image].
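The data behind such a plot can be sketched as a matrix of pairwise term overlap between all documents of a story. The sketch assumes Jaccard similarity as the overlap measure and uses made-up term sets; it does not compute the dispersion coefficient, whose definition is not given here. The name `dispersion_matrix` is hypothetical.

```python
def dispersion_matrix(story_terms):
    """Pairwise Jaccard overlap between the term sets of a story's documents.
    Entry [i][j] is the overlap between document i and document j."""
    n = len(story_terms)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            union = story_terms[i] | story_terms[j]
            if union:
                matrix[i][j] = len(story_terms[i] & story_terms[j]) / len(union)
    return matrix


# Toy story of four documents, in story order.
story_terms = [
    {"il-6", "cytokine", "inflammation"},
    {"cytokine", "inflammation", "brain", "injury"},
    {"brain", "injury", "parp", "inhibition"},
    {"parp", "inhibition", "poly-adp-ribose"},
]
matrix = dispersion_matrix(story_terms)
for row in matrix:
    print(["%.2f" % value for value in row])
```

In this toy story, overlap appears only on the diagonal and between consecutive documents, which is the pattern one would expect from a chain that moves steadily from the start topic to the end topic.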

**Figure 9. Dispersion plots and dispersion coefficients.**
Dispersion plots a, b, c, and d of this figure are associated with the respective stories of Figure 1.

**Figure 10. Variation in number of stories.**
The number of stories generated decreases monotonically as the clique size and distance threshold requirements become more stringent. Plot (a) shows the number of stories under different parameter settings for the same 50,751 start-end document pairs of case study 1. Plot (b) shows the corresponding graph for the 5,646 start-end document pairs of case study 2.

**Figure 11. Relation between clique chain and distance threshold.**
This plot shows, for each distance threshold, the largest k for which at least one k-clique chain is discovered. The stricter the distance threshold, the smaller this k. Both case studies exhibit the same monotonic relationship between clique size and distance threshold requirements.

**Figure 12. A screenshot of the Storygrapher interface.**
The 10 most significant stories beginning with documents labeled by *stanniocalcin* and ending at documents marked by *poly-ADP-ribose* are displayed. Storygrapher helps in analyzing significant stories in a single spatial environment.

**Figure 13. A screenshot of the Storygrapher interface.**
The screenshot displays the 10 most significant stories starting with documents marked by *nicotinamide* and ending in documents marked by *poly-ADP-ribose*.

**Figure 14. Distributions of dispersion.**
The plots show the distributions of dispersion of the final set of stories for the three case studies. The distributions show that the resulting stories of all the case studies have high dispersion coefficients overall.

**Figure 15. Distribution of clusters in the stories.**
Each of the plots shows the number of traditional clusters in the stories. The peak is found at four clusters in each of the case studies. The plots show that in each of the case studies, more than 40% of the stories pass through four clusters.

Cited by

Narratives in the network: interactive methods for mining cell signaling networks.
Hossain MS, Akbar M, Polys NF. J Comput Biol. 2012 Sep;19(9):1043-59. doi: 10.1089/cmb.2011.0244. Epub 2012 Aug 16. PMID: 22897227 Free PMC article.
A systematic review on literature-based discovery workflow.
Thilakaratne M, Falkner K, Atapattu T. PeerJ Comput Sci. 2019 Nov 18;5:e235. doi: 10.7717/peerj-cs.235. eCollection 2019. PMID: 33816888 Free PMC article.
Rediscovering Don Swanson: the Past, Present and Future of Literature-Based Discovery.
Smalheiser NR. J Data Inf Sci. 2017 Dec;2(4):43-64. doi: 10.1515/jdis-2017-0019. PMID: 29355246 Free PMC article.
