
Promptriever: Retrieval models can be controlled with prompts, just like language models

Official repository for the paper Promptriever: Retrieval models can be controlled with prompts, just like language models.

This repository contains the code and resources for Promptriever, which demonstrates that retrieval models can be controlled with prompts on a per-instance basis, similar to language models.
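For example, a Promptriever query can pair the query text with a free-form instruction describing what counts as relevant. The example below is illustrative; the exact input format follows the training templates in this repository:

query: what causes the northern lights? A relevant document explains the physical mechanism behind the aurora; news reports of individual sightings are not relevant.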

Table of Contents

  • Links
  • Setup
  • Experiments
  • Training
  • Utilities
  • Citation

Links

  • samaya-ai/promptriever-llama2-7b-v1: a Promptriever bi-encoder based on LLaMA 2 (7B parameters).
  • samaya-ai/promptriever-llama3.1-8b-instruct-v1: a Promptriever bi-encoder based on LLaMA 3.1 Instruct (8B parameters).
  • samaya-ai/promptriever-llama3.1-8b-v1: a Promptriever bi-encoder based on LLaMA 3.1 (8B parameters).
  • samaya-ai/promptriever-mistral-v0.1-7b-v1: a Promptriever bi-encoder based on Mistral v0.1 (7B parameters).
  • samaya-ai/RepLLaMA-reproduced: a reproduction of the RepLLaMA model (no instructions); a bi-encoder based on LLaMA 2, trained on the tevatron/msmarco-passage-aug dataset.
  • samaya-ai/msmarco-w-instructions: MS MARCO augmented with instructions and instruction-negatives, used to train the models above.
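Any of the model names above can be passed to the Hugging Face CLI to fetch a checkpoint locally (a minimal sketch; requires the huggingface_hub package, and the target directory is illustrative):

huggingface-cli download samaya-ai/promptriever-llama2-7b-v1 --local-dir models/promptriever-llama2-7b-v1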

Setup

To set up the environment:

bash setup/install_conda.sh # skip if you already have conda
bash setup/install_req.sh # install Python dependencies
pip install git+https://github.com/orionw/tevatron # Tevatron fork used by the scripts in this repo
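To sanity-check that the Tevatron fork installed correctly, import it and print its location:

python -c "import tevatron; print(tevatron.__file__)"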

Experiments

MS MARCO Experiments

Run an MS MARCO experiment (DL19, DL20, Dev) with:

bash msmarco/encode_corpus.sh <output_path> <model_name>
bash msmarco/encode_queries.sh <output_path> <model_name>
bash msmarco/search.sh <output_path>
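For example, to evaluate the LLaMA 2 Promptriever (the output path here is illustrative):

bash msmarco/encode_corpus.sh results/promptriever-llama2 samaya-ai/promptriever-llama2-7b-v1
bash msmarco/encode_queries.sh results/promptriever-llama2 samaya-ai/promptriever-llama2-7b-v1
bash msmarco/search.sh results/promptriever-llama2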

BEIR Experiments

To reproduce the BEIR experiments, you can either run all models in batch:

bash scripts/beir/matrix_of_corpus.sh
bash scripts/beir/matrix_of_prompts.sh
bash scripts/beir/search_all_prompts.sh <output_path>

Or run just one model with:

bash beir/run_all.sh <model_name> <output_nickname>
bash beir/run_all_prompts.sh <model_name> <output_nickname>
bash beir/search_all_prompts.sh <output_path>
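For example (the output nickname and path are illustrative):

bash beir/run_all.sh samaya-ai/promptriever-llama2-7b-v1 promptriever-llama2
bash beir/run_all_prompts.sh samaya-ai/promptriever-llama2-7b-v1 promptriever-llama2
bash beir/search_all_prompts.sh results/promptriever-llama2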

The beir/bm25 subfolder contains scripts for BM25 baseline experiments, built on the BM25S library.

Training

To train a Promptriever model, you can use the scripts in scripts/training/*:

bash scripts/training/train.sh <output_name> <dataset_name> <gpu_ids> <port>
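For example, to train the LLaMA 2 variant on the instruction-augmented MS MARCO data with two GPUs (the output name, GPU IDs, and port below are illustrative):

bash scripts/training/train_instruct.sh promptriever-llama2 samaya-ai/msmarco-w-instructions 0,1 29500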

Available training scripts:

  • train_instruct.sh (LLaMA 2)
  • train_instruct_llama3_instruct.sh (LLaMA 3.1 Instruct)
  • train_instruct_llama3.sh (LLaMA 3.1)
  • train_instruct_mistral_v1.sh (Mistral v0.1)
  • train_instruct_mistral.sh (Mistral v0.3)

Utilities

Several utilities are included: scripts to symlink corpus files (avoiding duplicate storage during dev-set optimization), to upload models to the Hugging Face Hub, and to filter out bad instruction-negatives.

  • utils/symlink_dev.sh and utils/symlink_msmarco.sh: Optimize storage usage
  • utils/upload_to_hf_all.py and utils/upload_to_hf.py: Upload models to Hugging Face Hub
  • utils/validate_all_present.py: Validate dataset completeness
  • filtering/filter_query_doc_pairs_from_batch_gpt.py: Filter out bad instruction-negatives using batched GPT outputs

Citation

If you found the code, data, or models useful, feel free to cite:

@article{weller2024promptriever,
  title={Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models},
  author={Weller, Orion and Van Durme, Benjamin and Lawrie, Dawn and Paranjape, Ashwin and Zhang, Yuhao and Hessel, Jack},
  journal={arXiv preprint TODO},
  year={2024}
}
