Finding Query Suggestions for PubMed

Zhiyong Lu; W. John Wilbur; Johanna R McEntyre; Alexey Iskhakov; Lee Szilagyi

AMIA Annu Symp Proc. 2009; 2009: 396–400.

Published online 2009 Nov 14.

PMCID: PMC2815412

PMID: 20351887

Finding Query Suggestions for PubMed

Zhiyong Lu, PhD, W. John Wilbur, MD, PhD, Johanna R McEntyre, PhD, Alexey Iskhakov, BS, and Lee Szilagyi, BA

Author information Copyright and License information PMC Disclaimer

Abstract

It is common for PubMed users to repeatedly modify their queries (search terms) before retrieving documents relevant to their information needs. To assist users in reformulating their queries, we report the implementation and usage analysis of a new component in PubMed called Related Queries, which automatically produces query suggestions in response to the original user’s input. The proposed method is based on query log analysis and focuses on finding popular queries that contain the initial user search term with a goal of helping users describe their information needs in a more precise manner. This work has been integrated into PubMed since January 2009. Automatic assessment using clickthrough data show that each day, the new feature is used consistently between 6% and 10% of the time when it is shown, suggesting that it has quickly become a popular new feature in PubMed.

Introduction

PubMed, a literature search system maintained at the National Library of Medicine (NLM), is widely accessed by millions of users each day to seek biomedical information¹. However, due to the large and rapidly growing literature, the process of retrieving relevant information in PubMed is challenging. Given a user query to PubMed, the average number of returned citations is over 10,000². Although generally speaking, more recent publications are of greater interest to users than older ones, PubMed’s default sorting algorithm – reverse chronological order – makes it difficult for users to retrieve citations that are most relevant to their information needs but are not returned at the top positions. Another difficulty is that a user’s initial query may not be a perfect description of their information need. Indeed, our own analysis reveals that query modification/reformulation is the most common behavior in user search history. Given the challenges to PubMed searching, an information need was identified for assisting users in reformulating their queries.

From the perspective of a practical application where suggestions will daily be displayed to millions of PubMed users, the generated query suggestions need to meet the following requirements: a) they are highly related to user input; b) they are mostly error free. That is, we are more concerned with high precision than recall in this application; and c) they resemble real user queries.

To this end, we propose to identify query suggestions among the most popular PubMed queries that contain the current search terms. This corresponds to one particular type of query modification: specification (increase of specificity relative to the user input). For instance, breast cancer (Queries are highlighted in SimSun font throughout this paper) is a more specific query than cancer. Although there are many ways a query can be modified³, offering more precise queries will not only make the size of the retrieval list shorter, but also let more relevant documents be returned earlier.

At present, the result of this work is implemented as a new component in PubMed, called Related Queries. It is, when available, shown in the search results page under the heading “Also Try”, along with other changes to the PubMed database as part of a larger effort to reach the full potential of different Web services maintained by the NLM.

Related Work

Automatically finding query suggestions is a well-known and important problem in the field of information retrieval. Most existing solutions to this problem comprise query modification or expansion techniques from two sources: a) query logs or b) retrieved documents based on (pseudo-) relevance feedback.

Query logs contain rich information of real user search habits. Jones et al., (2006) proposed to generate highly relevant query substitutions based on typical modifications observed in Web logs³. More recently, Shi and Yang (2007) proposed a method to mine related queries based on the log of previously submitted queries by using association rules⁴.

Another approach for finding query suggestions is through the subset of the initial retrieved results. By using user clickthrough data as (pseudo-) relevance feedback, terms in those retrieved documents are considered as relevant to the user search intent and can be correlated with user input queries by various methods. Much work has been devoted to this approach⁵.

In the biomedical domain, research on query expansion techniques is most closely related to our work. For example, by default PubMed employs a process called Automatic Term Mapping (ATM) that compares and maps user’s search terms to lists of pre-indexed terms. Query expansion techniques have also been extensively studied in the TREC Genomics Tracks⁶.

Methods

As illustrated in Figure 1, the overall architecture of our system comprises three separate components. First, we process raw PubMed logs and collect all user queries, followed by discarding such problematic queries as author names. The second step involves aggregating different queries and adjusting query frequencies. Finally, we rank query suggestions based on their frequency and subsequently return and display the top k ranked suggestions.

An external file that holds a picture, illustration, etc.
Object name is amia-f2009-396f1.jpg

Open in a separate window

Figure 1

Steps for generating query suggestions.

1. Collection User Queries

Our goal in this step is to compute query frequency: how many times specific queries are entered by different users. During system development, we collected a sample of 30 days worth of PubMed logs. The basic unit of the PubMed logs is the user session, in which different transactions (e.g. searches) during a particular user’s visit over a specified time period are recorded. The longest time span for a single user session in our data is one day. To eliminate individual bias, each unique query in a user session is counted only once despite the fact that the query may be entered repeatedly by the same user.

As the primary gateway for retrieving MEDLINE articles, PubMed handles approximately 2 million user queries each day. As shown in a previous study¹, a sizeable number of queries are problematic, thus they are disqualified for generating high-quality query suggestions:

Misspelled queries (e.g. diebetes)
Queries with no results (e.g. salts iron rash)
Queries with irregular characters (e.g. foreign characters)

Both our own experience and the previous study indicated that queries of these three classes account for over 20% of the total queries.

Furthermore, since our goal is to find generally useful queries that contain the user’s search term(s), we discarded the following classes of queries from consideration:

Single-term queries (e.g. cancer)
Queries with bibliographic information (e.g. Smith M)
Queries with search tags (e.g. Smith[au])
Queries ≥ 70 characters long

Single-term queries are not usable for our purpose, thus they are removed. In most cases, queries with bibliographic information represent users’ needs for searching specific MEDLINE citations. Such queries are sometimes called navigational¹ and indeed they account for a significant portion of PubMed queries. In response to such needs, two separate features have already been implemented in PubMed: one is the citation matcher and the other a citation sensor. Thus, they are not considered in this work. In practice, if a query contains a recognized author name, it is removed from further analysis.

In PubMed, users are allowed to attach tags after the search terms in order to search specific indexed fields. For example, gene[au] would only retrieve articles with gene as an author name but not as a text word in a title or abstract. Such queries are filtered because: a) the most commonly used tags are typically related to bibliographic searches (e.g. [au] for search author names); b) search tags are seldom used by PubMed users; and c) PubMed tags user queries automatically through its Automatic Term Mapping (ATM) process by default.

Finally, it is reported that PubMed queries have an average length of three terms¹^,², where terms are defined as sequences of characters separated by whitespace. In this regard, very long queries are likely to be of limited interest. Thus, they are filtered as we prefer the query suggestions to resemble real user queries.

2. Aggregating queries and adjusting query frequency

After processing the raw logs and collecting qualified user queries, a list of possible suggestions is obtained. We sort this list by query frequency. Some of the most frequent submitted terms are: breast cancer; multiple sclerosis; lung cancer; stem cells; stem cell; and myocardial infarction.

For the 30 days worth of PubMed logs, our list contained approximately 14 million unique queries. Consistent with previous analysis of the queries submitted to AltaVista⁷, we found most unique queries occurred only a few times: once (83.2%), twice (10.6%); three times (2.9%); four times (1.2%); and five times or more (2.1%). For query suggestions to be most useful and of high quality, we empirically determined to use only queries that occurred at least five times over a predetermined time period.

After removing the less frequent queries (occurrences less than five) from our list, we further adjust a query’s frequency by taking its occurrences in a longer query into account. For instance, we would add the frequency of breast cancer treatment to the frequency of breast cancer. Table 1 shows such an example where the frequency of breast cancer increased significantly (from 6,479 to 7,689: an 18.7% increase) after the adjustment step. Subsequently, we use the adjusted frequency to re-rank our suggestion list.

Table 1

An example of the adjustment of frequency for breast cancer after considering its occurrence embedded in longer queries.

Query	Frequency
breast cancer (before adjustment)	6,479
triple negative breast cancer	224
breast cancer screening	205
male breast cancer	205
inflammatory breast cancer	205
breast cancer treatment	202
breast cancer stem cells	169
breast cancer (after adjustment)	7,689

Open in a separate window

Adjusting query frequency also includes steps to consolidate queries with minor differences. For example, the following five queries are essentially identical with a single information need despite their differences in strings:

breast cancer stem cell
breast cancer stem cells
breast cancer and stem cell
stem cell breast cancer
stem cells breast cancer

Thus, we modify them by removing Boolean operators and stop words; stemming individual words; and switching word orders. Of all of the similar queries, we keep the most frequent one and remove the others from the list. In addition, we update the frequency of the one we kept.

3. Generating a list of ranked suggestions

The final step involves producing pairs of related queries: q_i → q_j where q_j is a query suggestion for the user input query q_i.

For each query q_j in the suggestion list obtained above, we generate its corresponding q_i by selecting phrases of length from one to the number of terms minus one. This is similar to the method used in a previous study⁸. Taking the query breast cancer stem cell for example, we would generate the following q_i:

breast
cancer
stem
cell
breast cancer
cancer stem
stem cell
breast cancer stem
cancer stem cell

If a user types any of the above queries (e.g. breast cancer), breast cancer stem cell would be a potential suggestion. At run time, the top k suggestions ranked by their frequency are returned and displayed to the user.

Results on Query Suggestions

As previously discussed, only a small percentage of queries occurred five or more times in the query logs over the thirty-day period. Thus, in order to increase the number of query suggestions, we expanded the number of days for query log collection. In the current implementation, query suggestions were generated based on 180 days’ worth of query logs.

Over the 180 days period, our method produced approximately two million suggestions (q_j) for 1.2 million unique queries (q_i). For each query q_i, the number of suggestions varies from one to ten with an average of 1.7 suggestions. Since only a small percentage (~5%) of queries have more than 5 suggestions, we set the k = 5. That is, only the top 5 suggestions are returned and displayed in PubMed.

Figure 2 shows the distribution of queries (q_i) in terms of length (number of terms). As can be seen, the majority of the queries are two-word queries (e.g. breast cancer). This suggests that our method targets short queries rather than long queries. According to Figure 2, queries with four or more terms account for less than 10% of total queries. This is indeed by design because we mainly target queries whose length is shorter than the average (3 terms) based on our assumption that long queries represent user’s intent for specific searches. Thus less help would be needed.

An external file that holds a picture, illustration, etc.
Object name is amia-f2009-396f2.jpg

Open in a separate window

Figure 2

The distribution of queries (q_i) in terms of their length (number of terms).

Evaluation and Usage Analysis

Our method is currently integrated in PubMed as Related Queries under “Also try”, which displays query suggestions, when available, at the top right position of the result page. Figure 3 shows the top five suggestions for the query p53: p53 mutation; p53 apoptosis; p53 mdm2; p53 review; and p53 cancer.

An external file that holds a picture, illustration, etc.
Object name is amia-f2009-396f3.jpg

Open in a separate window

Figure 3

Screenshot of PubMed where query suggestions are displayed for a sample query p53.

Before the new feature was released to all of PubMed users, two analyses were performed for quality control and system optimization: first, we randomly selected 100 queries and their corresponding suggestions for human inspection. In this process, we attempted to limit the number of problematic queries to less than 5%. Despite our efforts in the data cleanup steps (see details in the Method Section), there remained some problematic suggestions such as author names (altschul lipman) and similar suggestions (collapsin response mediator 1 vs. collapsing response mediator–1).

Second, the Related Queries feature was evaluated using clickthrough data, an unbiased and automatic approach for assessing retrieval performance⁹. During the period from September 2008 to January 2009, the new feature was released to 5% of the PubMed users (randomly selected) and its corresponding usage was recorded. Based on the logged usage, we are able to compute the clickthrough rate (CTR), a widely used metric for measuring the success of online ads. In the context of Related Queries, CTR is calculated as the number of total clicks on suggested queries divided by the number of times query suggestions are displayed. As shown in Figure 4, the CTR of query suggestions falls consistently between 6% and 10%, with an average of 7.9%. Although there is no absolute gold standard for CTR, a 2% CTR would usually be considered as very successfully in Web advertising¹⁰. This implies that the Related Queries feature is highly clicked by PubMed users. In addition, the CTR of 7.9% also ranks Related Queries higher than many other new PubMed features (e.g. Recent Activity) in the context of MEDLINE retrieval.

An external file that holds a picture, illustration, etc.
Object name is amia-f2009-396f4.jpg

Open in a separate window

Figure 2

Clickthrough rate (CTR) for Related Queries during the period from 08 September 2008 to 08 March 2009. The new feature was released to 100% of PubMed users on 26 January 2009.

Since 26 January 2009, the new feature has been made available to all PubMed users. We have kept monitoring its daily CTR since then. For the forty-two-day period (from 01/26/09 to 03/08/09), the average CTR is 6.6%. The CTR decline from 7.9% (5% of users) to 6.6% (100% of users) is statistically significant and can be partly attributed to the fact that we included a small number of additional users on top of the randomly selected 5% of users – all users of our institute – in the test phase, during which the new feature was frequently clicked internally for evaluation purposes. Moreover, our internal records show that there is a list of other changes (e.g. other feature right below the “Also Try’) taking effect in PubMed website on the very same day when the Related Query feature went into production. Those changes can affect user attention and subsequently cause a change in the CTR of Related Query.

In addition to CTR, we also measured the percentage of time query suggestions are triggered and displayed upon user input. We monitored this for the first thirty-day period after the “Also Try” feature became available to all of the users (not applicable during the testing phase). On average, 17.5% of user requests triggered the display of the query suggestions daily. Taking together with the clickthrough rate and overall number of daily requests, there are approximately 20,000 clicks in PubMed on displayed query suggestions per working day.

Discussion and Conclusions

By taking advantage of rich information in the query log, we developed a practical application for offering query suggestions based on the most popular queries in PubMed. Usage analysis shows that such a new feature is highly clicked by PubMed users and using these suggested queries may provide more precise results than user’s initial search.

Even with an excellent source of related queries, we still observe poor suggestions such as author names occasionally. In the future, we plan to further clean such data during the collection step. Furthermore, we are expanding query suggestion from specification to many other types of query modifications such as generalization and synonym substitution. In addition to improving the coverage rate, these hold promise in cases where users find no or little relevant information in PubMed.

Acknowledgments

This research was supported by the Intramural Research Program of the NIH, National Library of Medicine. The authors are grateful to David J. Lipman for valuable discussions.

References

1. Herskovic JR, Tanaka LY, Hersh W, Bernstam EV. A day in the life of PubMed: analysis of a typical day’s query log. J Am Med Inform Assoc. 2007 Mar-Apr;14(2):212–20. [PMC free article] [PubMed] [Google Scholar]

2. Islamaj-Dogan R, Neveol A, Murray GC, Lu Z.Understanding PubMed user search behaviors through log analysisSubmitted, 2009 [PMC free article] [PubMed]

3. Jones R, Rey B, Madani O, Greiner W. Generating query substitutions. Proceedings of the 15th international conference on World Wide Web; Edinburgh, Scotland: ACM; 2006. [Google Scholar]

4. Shi XD, Yang CC. Mining related queries from web search engine query logs using an improved association rule mining model. Journal of the American Society for Information Science and Technology. 2007 Oct;58(12):1871–83. [Google Scholar]

5. Manning CD, Raghavan P, Schtze H. Introduction to Information Retrieval. Cambridge University Press; 2008. [Google Scholar]

6. Hersh WR, Bhupatiraju RT, Ross L, Roberts P, Cohen AM, Kraemer DF. Enhancing access to the Bibliome: the TREC 2004 Genomics Track. J Biomed Discov Collab. 2006;1:3. [PMC free article] [PubMed] [Google Scholar]

7. Silverstein C, Marais H, Henzinger M, Moricz M. Analysis of a very large web search engine query log. SIGIR Forum. 1999;33(1):6–12. [Google Scholar]

8. Kraft R, Zien J. Proceedings of the 13th international conference on World Wide Web. New York, NY, USA: ACM; 2004. Mining anchor text for query refinement. [Google Scholar]

9. Joachims T. Evaluating retrieval performance using clickthrough data. In: Franke J, Nakhaeizadeh G, Renz I, editors. Text Mining: Physica Verlag; 2003. [Google Scholar]

10. Lee Sherman JD. Banner advertising: Measuring effectiveness and optimizing placement. Journal of Interactive Marketing. 2001;15(2):60–4. [Google Scholar]

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association