Official repository for the paper "Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models".
This repository contains the code and resources for Promptriever, which demonstrates that retrieval models can be controlled with prompts on a per-instance basis, similar to language models.
| Binary | Description |
|---|---|
| samaya-ai/promptriever-llama2-7b-v1 | A Promptriever bi-encoder model based on LLaMA 2 (7B parameters). |
| samaya-ai/promptriever-llama3.1-8b-instruct-v1 | A Promptriever bi-encoder model based on LLaMA 3.1 Instruct (8B parameters). |
| samaya-ai/promptriever-llama3.1-8b-v1 | A Promptriever bi-encoder model based on LLaMA 3.1 (8B parameters). |
| samaya-ai/promptriever-mistral-v0.1-7b-v1 | A Promptriever bi-encoder model based on Mistral v0.1 (7B parameters). |
| samaya-ai/RepLLaMA-reproduced | A reproduction of the RepLLaMA model (no instructions). A bi-encoder based on LLaMA 2, trained on the tevatron/msmarco-passage-aug dataset. |
| samaya-ai/msmarco-w-instructions | A dataset of MS MARCO with added instructions and instruction-negatives, used for training the above models. |
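All of these artifacts are hosted on the Hugging Face Hub under the `samaya-ai` organization. As a minimal sketch (assuming you have the `huggingface_hub` CLI available), you can fetch them locally with something like:

```bash
# grab a bi-encoder checkpoint (model repo)
huggingface-cli download samaya-ai/promptriever-llama2-7b-v1

# grab the instruction-augmented MS MARCO training data (dataset repo)
huggingface-cli download samaya-ai/msmarco-w-instructions --repo-type dataset
```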
To initialize your research environment:
```bash
bash setup/install_conda.sh  # if you don't have conda already
bash setup/install_req.sh
pip install git+https://github.com/orionw/tevatron
```
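As an optional sanity check that the install worked — a sketch, assuming a CUDA-capable machine and that the tevatron fork exposes the usual `tevatron` package:

```bash
# confirm PyTorch sees a GPU and the tevatron fork is importable
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import tevatron; print('tevatron import OK')"
```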
Run an MS MARCO experiment (DL19, DL20, Dev) with:
```bash
bash msmarco/encode_corpus.sh <output_path> <model_name>
bash msmarco/encode_queries.sh <output_path> <model_name>
bash msmarco/search.sh <output_path>
```
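For example, a full MS MARCO run with the Llama 2 Promptriever checkpoint might look like the following (the output directory name is an arbitrary illustrative choice):

```bash
MODEL=samaya-ai/promptriever-llama2-7b-v1
OUT=runs/promptriever-llama2   # any writable directory

bash msmarco/encode_corpus.sh $OUT $MODEL
bash msmarco/encode_queries.sh $OUT $MODEL
bash msmarco/search.sh $OUT
```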
To reproduce the BEIR experiments, you can either use the batch method (running all models):
```bash
bash scripts/beir/matrix_of_corpus.sh
bash scripts/beir/matrix_of_prompts.sh
bash scripts/beir/search_all_prompts.sh <output_path>
```
Or you can run just one model with:
```bash
bash beir/run_all.sh <model_name> <output_nickname>
bash beir/run_all_prompts.sh <model_name> <output_nickname>
bash beir/search_all_prompts.sh <output_path>
```
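For instance, to evaluate the Llama 2 Promptriever checkpoint with an output nickname of your choosing (here `promptriever-llama2`, purely illustrative):

```bash
bash beir/run_all.sh samaya-ai/promptriever-llama2-7b-v1 promptriever-llama2
bash beir/run_all_prompts.sh samaya-ai/promptriever-llama2-7b-v1 promptriever-llama2
```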
The `beir/bm25` subfolder contains scripts for BM25 baseline experiments, using BM25S.
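BM25S is a standalone Python package; if the setup scripts did not already install it (an assumption, so check your environment first), you can add it with:

```bash
pip install bm25s
```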
To train a Promptriever model, you can use the scripts in `scripts/training/*`:

```bash
bash scripts/training/train.sh <output_name> <dataset_name> <gpu_ids> <port>
```
Available training scripts:

- `train_instruct.sh` (Llama 2)
- `train_instruct_llama3_instruct.sh`
- `train_instruct_llama3.sh`
- `train_instruct_mistral_v1.sh`
- `train_instruct_mistral.sh` (v0.3)
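For example, a training run on the instruction-augmented MS MARCO data might look like the following (the output name, GPU list, and port are illustrative placeholders, and the GPU id format may need adjusting to what the script expects):

```bash
bash scripts/training/train.sh my-promptriever-run samaya-ai/msmarco-w-instructions 0,1,2,3 29500
```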
There are a variety of utilities to symlink corpus files (to avoid double storage when doing the dev set optimization), to upload models to Hugging Face, and to filter out bad instruction-negatives.
- `utils/symlink_dev.sh` and `utils/symlink_msmarco.sh`: Optimize storage usage
- `utils/upload_to_hf_all.py` and `utils/upload_to_hf.py`: Upload models to the Hugging Face Hub
- `utils/validate_all_present.py`: Validate dataset completeness
- `filtering/filter_query_doc_pairs_from_batch_gpt.py`: Implement advanced filtering using GPT model outputs
If you found the code, data, or models useful, feel free to cite:
```bibtex
@article{weller2024promptriever,
  title={Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models},
  author={Weller, Orion and Van Durme, Benjamin and Lawrie, Dawn and Paranjape, Ashwin and Zhang, Yuhao and Hessel, Jack},
  journal={arXiv preprint TODO},
  year={2024}
}
```