Automated GenAI-driven search quality evaluation
For more than 1 billion members on LinkedIn, the global search bar is a tool for looking up jobs, company pages, member profiles and posts, LinkedIn Learning courses, and much more. As soon as members start typing in the search bar, their experience on LinkedIn begins to take shape with the assistance of our Flagship Blended Typeahead, which presents auto-completed suggestions that blend various downstream search results, such as company entities, people entities, and plain text suggestions.
The quality of these typeahead suggestions can significantly impact the member experience when interacting with LinkedIn's search ecosystem. High-quality suggestions can motivate members to discover and engage with valuable content from across LinkedIn. On the other hand, low-quality suggestions can sidetrack, overwhelm, or even dissatisfy members, compromising the ideal experience we want to provide across our search ecosystem. This is why one of the most powerful steps we can take to maintain a valuable search experience is to consistently assess and refine the quality of our suggestions.
Search quality assessment has historically hinged on human evaluation to meet our high bar for quality. However, that hands-on approach is difficult to sustain for typeahead suggestions given the growth of our platform and the limited scalability of human evaluation. Thankfully, the advent of advanced large language models (LLMs) has made automated evaluation possible. In this blog, we demonstrate how we leveraged those capabilities, using an OpenAI GPT model served through Azure, to build the GenAI Typeahead Quality Evaluator, and we walk through its key components.
Setting a foundation with typeahead quality measurement guidelines
Before diving into the complexities of evaluating typeahead suggestions, one of the very first steps we took was to establish clear measurement guidelines. Starting with well-defined guidelines is essential, as they provide a consistent framework for assessing the quality of suggestions and help align automated evaluations with our desired outcomes. These guidelines serve as a foundation for prompt engineering and ensure that we are measuring the right aspects of typeahead quality across varied suggestion types.
However, developing these guidelines wasn’t straightforward due to the unique complexities involved in evaluating typeahead suggestions:
- Vertical Intent Diversity: Typeahead displays varied types of results, such as People entities, Company entities, Job suggestions, plain text suggestions, etc.
- Personalization: Typeahead suggestions are highly personalized for different members. For instance, the same query ("Andrew") will yield different suggestions for different members, and judging the relevance of those suggestions can be subjective.
To tackle these challenges, we crafted guidance and examples for each suggestion type. We then categorized suggestion quality as either high or low to reduce the gray area and uncertainty caused by subjective judgment. We also designed the evaluation to run once the member has finished typing, ensuring sufficient query context. These guidelines ultimately laid the foundation for our subsequent prompt engineering.
Below are examples of People entities and Job suggestions.
Creating a golden test set to calibrate for overall quality
For our GenAI-driven evaluations, we used a golden test set, which is a pre-defined query set sampled from platform traffic, with the goal of replicating the overall quality of the typeahead search experience for all members. Because typeahead suggestions are personalized, the golden test set comprises {query, member id} pairs, and we adhered to the following principles to ensure it accurately reflects the typeahead experience for all members (a simplified sampling sketch follows the list):
- Includes comprehensive coverage of search intents: We sample 200 queries from each of the search intent categories covering People, Company, Product, Job, Skill Queries, Trending Topics, and Knowledge-Oriented questions. We leverage the click signal on typeahead to identify the query intent. For example, if a typeahead session has a click on a People entity result, we label the query intent as People. Likewise, we categorize the queries of other clicked sessions into their corresponding intent categories.
- Uses sufficient samples from bypass and abandoned sessions: In addition to clicked sessions, bypass and abandoned sessions are crucial for assessing and improving typeahead quality. Bypass sessions occur when a member presses “Enter” without selecting a typeahead suggestion, while abandoned sessions occur when a member leaves the LinkedIn platform after initiating a typeahead search without clicking any suggestion. We sampled 200 queries each from bypass and abandoned sessions.
- Identifies frequent members by member life cycles (MLC): When sampling member ids for the golden test set, we leverage Member Life Cycle (MLC) data, which represents the frequency and consistency of a member's visits. In particular, we focus on weekly-active members (who visit roughly once per week) and daily-active members (who visit every day). We apply this member filter when sampling for each search intent as well as for the bypass and abandoned sessions.
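To make these principles concrete, here is a minimal sketch of how such a golden test set could be assembled. The table, column names, and file paths below are hypothetical placeholders, not LinkedIn's actual data pipeline:

```python
import pandas as pd

# Hypothetical session log with one row per typeahead session:
# columns: query, member_id, intent, session_type ("clicked", "bypass", "abandoned"),
# and mlc ("daily", "weekly", ...) derived from Member Life Cycle data.
sessions = pd.read_parquet("typeahead_sessions.parquet")  # hypothetical path

INTENTS = ["People", "Company", "Product", "Job", "Skill",
           "Trending Topic", "Knowledge-Oriented"]
SAMPLES_PER_BUCKET = 200

# Keep only daily- and weekly-active members, per the MLC filter.
frequent = sessions[sessions["mlc"].isin(["daily", "weekly"])]

buckets = []
# 200 clicked sessions per search intent category.
for intent in INTENTS:
    pool = frequent[(frequent["session_type"] == "clicked") &
                    (frequent["intent"] == intent)]
    buckets.append(pool.sample(n=SAMPLES_PER_BUCKET, random_state=42))

# 200 sessions each from bypass and abandoned sessions.
for session_type in ["bypass", "abandoned"]:
    pool = frequent[frequent["session_type"] == session_type]
    buckets.append(pool.sample(n=SAMPLES_PER_BUCKET, random_state=42))

# The golden test set is the resulting set of {query, member_id} pairs.
golden_test_set = pd.concat(buckets)[["query", "member_id"]].drop_duplicates()
golden_test_set.to_csv("golden_test_set.csv", index=False)
```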
Developing prompt engineering templates for quality evaluation
It was at this stage that we could really start leveraging the new capabilities LLMs provide for automated evaluations. We began by considering the most efficient and effective way to engineer prompts for this process. In general, each prompt is constructed from the following sections:
- IDENTITY
- TASK GUIDELINES
- (Optional) EXAMPLES
- INPUT
- OUTPUT
With this general prompt structure, we asked an OpenAI GPT model served through Azure to evaluate the appropriateness of a typeahead suggestion based on a given prefix and the suggestion’s detailed information. It would then output a score of either 1 (high quality) or 0 (low quality) for the suggestion. We crafted specialized evaluation prompt templates for each result type, ensuring they align with that type's specific quality guidelines and data structure. The key differences in prompt design across result types included:
- The IDENTITY and TASK GUIDELINES sections differ substantially between Entity suggestions and Job or plain text suggestions, given the different nature of the search task for each result type.
- The INPUT also varies by result type. For instance, for Job suggestions the INPUT contains only the completed query, while for entity suggestions it contains detailed suggestion page information.
- The optional EXAMPLES section contains few-shot examples we provide in the prompt. For instance, in the People Entities prompt template, we included examples where the search queries follow patterns such as “name + job title” and “name + geo location”. These examples improve the accuracy of GPT evaluations for complicated use cases.
In the OUTPUT section, we also ask GPT to generate the reasoning behind its score. This simple chain-of-thought technique has been shown to improve GPT performance. A high-level example of the prompt template is shown below, followed by a sketch of how such a template could be assembled in code.
##IDENTITY
You are an evaluator whose role is to evaluate suggestions.
[[Evaluate whether the suggestion is a good suggestion to complete the partial search query based on the searcher information below]]
[[Evaluate whether the suggestion is appropriate for an entity search based on the searcher information below]]
[[Evaluate whether the suggestion is appropriate for a job search based on the searcher information below]]
##TASK GUIDELINES
Input Format: you will be provided with the following:
- A partial search query
- Information about the user inputting the partial search query
- A list of suggestions based on the partial search query
- Information about the suggestions
Output Format:
After your assessment, provide a score for the quality of the suggestion. Please score the suggestion using a binary evaluation of whether the suggestion is a high-quality suggestion or a low-quality suggestion. Explain why you gave the suggestion the score.
##EXAMPLES
Example 1:
Searcher Information: Name: ABC. Information: ABC is a UX designer at XX.
Query: Dav
Suggestion: Name: David W. Information: David W. is an Engineering Lead at XX.
Reasoning: The user might be looking for a first-degree connection whose name starts with “Dav”.
Score: 1
Example 2:
Searcher Information: Name: DEF. Information: DEF is a Sales representative at YY.
Query: imposter syndrome
Suggestion: imposter syndrome in jobs
Reasoning: "Imposter syndrome" is not a legitimate keyword for searching for jobs on LinkedIn because it is a psychological phenomenon.
Score: 0
##SEARCHER INFORMATION
- Name: XYZ
- Searcher Information: XYZ has been a manager at ZZ for 10 years. ZZ was founded in 2007 and engineers widgets.
##INPUTS
Query: Alex
Suggestion 1:
Suggestion name: Alex V.
Suggestion information: Alex V. has worked at AA for 4 years. At AA, Alex V works on engineering widgets. Alex V lives in BB.
Suggestion 2:
Suggestion name: Alexa
Suggestion information: Alexa is a product that can be purchased at www.alexa.com
##OUTPUTS
Follow the above instructions to deduce the quality of the above suggestions. Provide reasoning for your score and then give the score ONLY AS either 1 or 0:
Suggestion 1:
Reasoning
Score
Suggestion 2:
Reasoning
Score
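To illustrate how such templates come together, here is a minimal sketch of assembling a prompt from its sections. The per-result-type identity lines, guideline text, and field names below are illustrative rather than the production templates:

```python
# Hypothetical per-result-type IDENTITY lines; the production templates
# differ for People entities, Company entities, Job, and plain text suggestions.
IDENTITY = {
    "people": "Evaluate whether the suggestion is appropriate for an entity search "
              "based on the searcher information below",
    "job": "Evaluate whether the suggestion is appropriate for a job search "
           "based on the searcher information below",
    "text": "Evaluate whether the suggestion is a good suggestion to complete the "
            "partial search query based on the searcher information below",
}

TASK_GUIDELINES = (
    "Input Format: you will be provided with a partial search query, information "
    "about the user, a list of suggestions, and information about the suggestions.\n"
    "Output Format: provide reasoning, then a binary score "
    "(1 = high quality, 0 = low quality)."
)

def build_prompt(result_type, searcher_info, query, suggestions, examples=""):
    """Assemble the evaluation prompt from its sections for one typeahead session."""
    suggestion_lines = "\n".join(
        f"Suggestion {i + 1}:\nSuggestion name: {s['name']}\n"
        f"Suggestion information: {s['info']}"
        for i, s in enumerate(suggestions)
    )
    sections = [
        "##IDENTITY\nYou are an evaluator whose role is to evaluate suggestions.\n"
        + IDENTITY[result_type],
        "##TASK GUIDELINES\n" + TASK_GUIDELINES,
    ]
    if examples:
        sections.append("##EXAMPLES\n" + examples)
    sections += [
        "##SEARCHER INFORMATION\n" + searcher_info,
        f"##INPUTS\nQuery: {query}\n" + suggestion_lines,
        "##OUTPUTS\nFollow the above instructions to deduce the quality of the above "
        "suggestions. Provide reasoning for your score and then give the score "
        "ONLY AS either 1 or 0.",
    ]
    return "\n\n".join(sections)

# Example usage with placeholder searcher and suggestion data.
prompt = build_prompt(
    "people",
    "Name: XYZ. Information: XYZ has been a manager at ZZ for 10 years.",
    "Alex",
    [{"name": "Alex V.", "info": "Alex V. has worked at AA for 4 years."}],
)
```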
Evaluating the typeahead quality score
Since typeahead is a ranked surface, the quality of top-ranked suggestions matters more than that of lower-ranked suggestions because of their greater visibility. We defined four typeahead quality scores to ground our evaluation:
- TyahQuality1: quality score of the top suggestion
- TyahQuality3: average quality score of the top three suggestions
- TyahQuality5: average quality score of the top five suggestions
- TyahQuality10: average quality score of all (maximum 10) typeahead suggestions
GPT scores each suggestion in a typeahead session as either 1 (high quality) or 0 (low quality), and for each session we define a TyahQualityn score over the top n suggestions.
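Based on the definitions above, one way to write the per-session score (where $s_i \in \{0, 1\}$ is the GPT score of the suggestion at rank $i$ and $N$ is the number of suggestions returned in the session) is:

$$\mathrm{TyahQuality}_n = \frac{1}{\min(n, N)} \sum_{i=1}^{\min(n, N)} s_i$$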
By averaging the typeahead quality scores over all sessions in the golden test set, we obtain an overall typeahead quality score: a comprehensive metric that reflects the quality of typeahead suggestions across member sessions. This metric allows us to monitor and benchmark the health of the typeahead experience over time, quickly identify areas for improvement, and measure the impact of new experiments.
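In code, the per-session and overall scores could be computed with a short helper like the following sketch, assuming session_scores holds the ordered 0/1 GPT scores of each session's suggestions (the example numbers are illustrative only):

```python
def tyah_quality(scores, n):
    """Average 0/1 quality score of the top-n suggestions in one session."""
    top = scores[:n]
    return sum(top) / len(top) if top else 0.0

def overall_tyah_quality(session_scores, n):
    """Average the per-session TyahQuality_n over all sessions in the golden test set."""
    per_session = [tyah_quality(scores, n) for scores in session_scores]
    return sum(per_session) / len(per_session)

# Example: three sessions with GPT scores for their ranked suggestions.
session_scores = [
    [1, 1, 0, 1, 0],
    [0, 1, 1],
    [1, 0, 0, 0, 1, 1, 0, 1, 1, 0],
]
for n in (1, 3, 5, 10):
    print(f"TyahQuality{n}: {overall_tyah_quality(session_scores, n):.2%}")
```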
Laying out our evaluation pipelines
To help visualize many of the steps we’ve covered in this blog so far, below we’ve included a diagram of the GenAI Typeahead Quality Evaluation pipelines:
For a new experiment on typeahead relevance, we perform GenAI quality evaluation via the following steps (the GPT scoring step is sketched in code after the list):
- Generate requests with the experiment configs on the golden test set, then call the typeahead backend
- Collect the typeahead responses for the golden test set
- Generate prompts for GPT-3.5 Turbo from the response suggestions
- Batch call the GPT API to perform quality evaluations on the response suggestions
- Post-process the GPT responses to calculate TypeaheadQuality scores
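The GPT scoring step could look like the following minimal sketch, which assumes the openai Python package's AzureOpenAI client. The endpoint, deployment name, and response parsing are placeholders, and calls are shown sequentially here for simplicity, whereas the production pipeline batches them:

```python
import re
from openai import AzureOpenAI  # assumes the openai Python package (v1+)

# Placeholder Azure OpenAI configuration; not LinkedIn's actual deployment.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    api_key="<api-key>",
    api_version="2024-02-01",
)
DEPLOYMENT = "<gpt-deployment-name>"  # e.g., a GPT-3.5 Turbo deployment

def evaluate_suggestion(prompt: str) -> int:
    """Call the GPT deployment with one evaluation prompt and parse the 0/1 score."""
    response = client.chat.completions.create(
        model=DEPLOYMENT,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring
    )
    text = response.choices[0].message.content
    # Simplified parsing: assumes the model echoes a "Score: <0|1>" line
    # as in the prompt template above.
    match = re.search(r"Score:?\s*([01])", text)
    return int(match.group(1)) if match else 0

def evaluate_session(prompts: list[str]) -> list[int]:
    """Score every ranked suggestion in one typeahead session."""
    return [evaluate_suggestion(p) for p in prompts]
```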
This evaluation pipeline not only accelerates our ability to assess the impact of new experiments but also ensures a high degree of consistency and objectivity in our quality assessments. By structuring the process into clear and automated steps, we’ve created a simple yet effective framework that allows us to iterate quickly and maintain high-quality typeahead suggestions at scale.
A representative experiment
To better understand the evaluation process at work, let's walk through an example of how the typeahead quality score evaluator helps typeahead experiments move fast. A major initiative in typeahead focuses on improving the liquidity of plain text suggestions by expanding the inventory with short phrases summarized from high-quality User Generated Content (UGC) in LinkedIn posts. Below are screenshots showing the before and after effects of this initiative.
To evaluate the impact of this experiment on typeahead quality, we used the GenAI Typeahead Quality Evaluator. Below are the typeahead quality scores before (Control) and after (Experiment) implementing this initiative.
|               | Control | Experiment |
| ------------- | ------- | ---------- |
| TyahQuality10 | 66.70%  | 73.50%     |
| TyahQuality5  | 69.20%  | 75.00%     |
| TyahQuality3  | 70.60%  | 75.70%     |
| TyahQuality1  | 73.20%  | 77.20%     |
The results reveal significant improvements in typeahead quality from this experiment. We observed a lift in typeahead quality scores across all positions, including a notable 6.8% absolute improvement in TyahQuality10, which corresponds to roughly a 20% reduction in low-quality suggestions.
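To see where the roughly 20% figure comes from: the share of low-quality suggestions in the top 10 falls from $100\% - 66.7\% = 33.3\%$ under Control to $100\% - 73.5\% = 26.5\%$ under Experiment, so the relative reduction is

$$\frac{33.3\% - 26.5\%}{33.3\%} \approx 20\%$$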
These insights are invaluable because they increase our visibility into the ongoing quality of typeahead search suggestions, and they do so with improved speed and efficiency. Traditional manual search quality evaluations, often involving multiple human evaluators, can take days or even weeks to complete. In contrast, the GenAI automated search quality evaluator accomplishes the task in just a few hours. This rapid turnaround enables a more immediate assessment of a project's impact, allowing the project to maintain momentum and progress quickly.
Key learnings
Based on our experiment, the GenAI Typeahead Quality Evaluator not only speeds up the evaluation process but also helps us improve the quality of typeahead suggestions for members. However, achieving these results required rigorous quality standards and several cycles of cross-evaluation on GPT outputs and prompt iterations.
Defining the evaluation criteria for the GenAI Typeahead Quality Evaluator proved particularly challenging due to the inherent complexity and diversity of typeahead results. Addressing this required precise guidelines that eliminate ambiguity across the suggestion types and entities that appear in typeahead.
Prompt engineering is also an iterative process. Through several cycles of cross-evaluating GPT outputs against human judgments, we learned from the instances where GPT evaluations did not align with human evaluation, refined the prompts based on those discrepancies, and significantly improved the accuracy of GPT evaluations.
Acknowledgements
We would like to extend our heartfelt thanks to the following individuals and teams for their invaluable support: Search Leads Abhi Lad and Alice Xiong, Project Manager Partners Jeffery Wang and Fatoumata Diallo, as well as the GaitWay and Search Federation teams. We would also like to express our gratitude to Rupesh Gupta, Jagadeesan Sundaresan, and Yafei Wang for their thorough review and insightful feedback.