
InPars: Inquisitive Parrots for Search

InPars is a simple yet effective approach to efficiently using large LMs in retrieval tasks. For more information, check out our paper.

In this work, we use large LMs to generate labeled data in a few-shot manner for IR tasks. We then finetune retrieval models on this synthetic data and use them to rerank the search results of a first-stage retrieval system.

Illustration of our method
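To make the few-shot setup concrete, here is a minimal sketch of how such a prompt could be assembled; the example pairs and the template wording are hypothetical, not the exact prompt from the paper:

# A minimal sketch of a few-shot prompt for query generation.
# The example pairs and template wording are illustrative only.
FEW_SHOT_EXAMPLES = [
    ("The Manhattan Project was a World War II research effort that produced "
     "the first nuclear weapons.",
     "what was the manhattan project"),
    ("Photosynthesis is the process by which plants convert sunlight into "
     "chemical energy.",
     "how does photosynthesis work"),
]

def build_prompt(document: str) -> str:
    """Condition the LM on (document, query) examples, then ask it to
    complete a query for the new document."""
    parts = []
    for doc, query in FEW_SHOT_EXAMPLES:
        parts.append(f"Document: {doc}\nRelevant Query: {query}\n")
    parts.append(f"Document: {document}\nRelevant Query:")
    return "\n".join(parts)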

How to Generate

Download the data, including the document collection you want to generate synthetic queries from. Here, we provide data from the MS MARCO dataset.

bash download_data.sh

To generate synthetic queries using the OpenAI models, you need to provide your API key:

export API_KEY=<YOUR_KEY>

You can generate synthetic queries using the Curie model by running:

python generate_queries_openai.py \
    --collection data/msmarco/collection.tsv \
    --output data/msmarco/synthetic_queries.jsonl \
    --engine curie
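For reference, a minimal sketch of the API call behind each generated query, assuming the legacy openai Python SDK (pre-1.0); the exact decoding parameters used by generate_queries_openai.py may differ:

import os
import openai  # legacy SDK (< 1.0); newer releases use a different interface

openai.api_key = os.environ["API_KEY"]

def generate_query(prompt: str):
    """Complete one few-shot prompt with the Curie engine and return the
    generated query together with its per-token log-probabilities, which
    the filtering step uses later. Decoding parameters are illustrative."""
    response = openai.Completion.create(
        engine="curie",
        prompt=prompt,
        max_tokens=64,
        temperature=0.0,
        logprobs=0,   # also return log-probs of the sampled tokens
        stop=["\n"],  # stop at the end of the generated query
    )
    choice = response["choices"][0]
    return choice["text"].strip(), choice["logprobs"]["token_logprobs"]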

Filtering and creating training data

As reported in the paper, after generating the queries, we filter them down to a smaller set by score.

In this filtering step, you can choose one of three scoring functions: sum_log_probs, mean_log_probs, and mean_probs. For each synthetic query, the LM assigns a probability to every generated token, and these token probabilities are aggregated into a single query score.

python filter_queries_by_score.py \
    --input data/msmarco/synthetic_queries.jsonl \
    --output data/msmarco/filtered_synthetic_queries.jsonl \
    --top_k 10000 \
    --scoring_function mean_log_probs
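The three scoring functions can be sketched as follows, assuming each query carries the list of per-token log-probabilities returned by the LM (the function names match the --scoring_function options; the implementations here are illustrative):

import math

def sum_log_probs(token_log_probs):
    # Joint log-probability of the whole query; tends to favor shorter queries.
    return sum(token_log_probs)

def mean_log_probs(token_log_probs):
    # Length-normalized log-probability.
    return sum(token_log_probs) / len(token_log_probs)

def mean_probs(token_log_probs):
    # Average per-token probability (in probability space, not log space).
    return sum(math.exp(lp) for lp in token_log_probs) / len(token_log_probs)

Queries are ranked by the chosen score and only the top_k highest-scoring ones are kept.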

Training

To train a monoT5 model using the filtered synthetic queries, you need to generate the training pairs by creating a positive and a negative example for each query. For the MS MARCO synthetic queries generated above, using BM25 to select the negative examples, you can create the training data by running:

python generate_triples_train.py \
    --input data/msmarco/filtered_synthetic_queries.jsonl \
    --output data/msmarco/synthetic.triples.train.tsv \
    --output_ids data/msmarco/synthetic.triples.train.ids.tsv \
    --corpus data/msmarco/collection.tsv \
    --index msmarco-passage
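Conceptually, the negative example for each query is a passage that BM25 retrieves highly but that is not the query's source passage. A minimal sketch using Pyserini's prebuilt msmarco-passage index (the actual sampling strategy in generate_triples_train.py may differ):

import random
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index("msmarco-passage")

def sample_negative(query: str, positive_doc_id: str, depth: int = 100) -> str:
    """Run BM25, drop the positive passage from the top `depth` results,
    and sample one of the remaining passages as the negative."""
    hits = [hit.docid for hit in searcher.search(query, k=depth)
            if hit.docid != positive_doc_id]
    return random.choice(hits)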

Finally, train a monoT5 model on the synthetic data:

python train_t5.py \
    --base_model t5_base \
    --corpus data/msmarco/collection.tsv \
    --triples_train data/msmarco/synthetic.triples.train.tsv \
    --queries data/msmarco/topics.msmarco-passage.dev-subset.txt \
    --qrels data/msmarco/qrels.msmarco-passage.dev-subset.txt \
    --run data/msmarco/run.beir-v1.0.0-trec-covid-flat.trec \
    --relevance_threshold 2 \
    --output_dir data/msmarco/ \
    --save_every_n_steps 156 \
    --eval_steps 156 \
    --max_eval_queries 54 \
    --max_eval_docs_per_query 1000
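monoT5 casts reranking as a sequence-to-sequence task: each (query, document) pair is rendered into a text prompt, and the model is trained to output "true" for positives and "false" for negatives. A sketch of how one training triple becomes two seq2seq examples (the template follows the usual monoT5 convention; the helper name is hypothetical):

def triple_to_examples(query: str, positive: str, negative: str):
    """Turn one (query, positive, negative) triple into two seq2seq
    training examples in the monoT5 input format."""
    template = "Query: {q} Document: {d} Relevant:"
    return [
        (template.format(q=query, d=positive), "true"),
        (template.format(q=query, d=negative), "false"),
    ]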

Generated datasets

Download synthetic datasets generated by InPars: