WARNING: This branch has been deprecated! Please use the main branch instead.
ColBERT is a fast and accurate retrieval model, enabling scalable BERT-based search over large text collections in tens of milliseconds.
Figure 1: ColBERT's late interaction, efficiently scoring the fine-grained similarity between a query and a passage.
As Figure 1 illustrates, ColBERT relies on fine-grained contextual late interaction: it encodes each passage into a matrix of token-level embeddings (shown above in blue). Then, at search time, it embeds every query into another matrix (shown in green) and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators.
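The MaxSim scoring described above can be sketched in a few lines. This is a minimal illustration using NumPy on toy 2-dimensional embeddings, not the repository's implementation; the function name `maxsim_score` and the example matrices are assumptions made up for this sketch. Each query token is matched to its most similar passage token, and those maxima are summed.

```python
import numpy as np

def maxsim_score(query_embs: np.ndarray, passage_embs: np.ndarray) -> float:
    """Sum, over query tokens, of the best cosine similarity to any passage token."""
    # Normalize each token embedding so dot products are cosine similarities.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    # Similarity matrix of shape (num_query_tokens, num_passage_tokens).
    sim = q @ d.T
    # MaxSim: take the best passage token per query token, then sum.
    return float(sim.max(axis=1).sum())

# Toy data (hypothetical): two query tokens, two candidate passages.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
passage_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # matches both query tokens exactly
passage_b = np.array([[1.0, 1.0]])                          # matches each only partially

score_a = maxsim_score(query, passage_a)
score_b = maxsim_score(query, passage_b)
print(score_a, score_b)
```

Because every query token finds its own best match, `passage_a` scores higher than `passage_b`, which has only a single token that partially matches both.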
These rich interactions allow ColBERT to surpass the quality of single-vector representation models, while scaling efficiently to large corpora. You can read more in our papers:
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (SIGIR'20).
- Relevance-guided Supervision for OpenQA with ColBERT (TACL'21).
- Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval (NeurIPS'21).
- ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction (preprint).
ColBERT (currently: v2.0.2) requires Python 3.7+ and PyTorch 1.9+ and uses the Hugging Face Transformers library.
We strongly recommend creating a conda environment using the commands below. (If you don't have conda, follow the official conda installation guide.)
conda env create -f conda_env.yml
conda activate colbert-v0.4.2
If you face any problems, please open a new issue and we'll help you promptly!
The Jupyter notebook docs/intro.ipynb illustrates the key features of ColBERT with the new Python API.
It also shows how to download the ColBERTv2 model checkpoint trained on MS MARCO Passage Ranking and our new LoTTE benchmark.
We have included a new environment file specifically for CPU-only environments (conda_env_cpu.yml), but note that if you are testing CPU execution on a machine that includes GPUs you might need to specify CUDA_VISIBLE_DEVICES="" as part of your command.
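If you launch commands from Python rather than a shell, one way to apply the same setting is to pass a modified environment to the child process. The sketch below uses only the standard library; the helper name `cpu_only_env` is hypothetical, and the child command is a placeholder that merely echoes the variable, so substitute your own ColBERT command.

```python
import os
import subprocess
import sys

def cpu_only_env() -> dict:
    """Return a copy of the current environment with all CUDA devices hidden."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ""  # an empty value exposes no GPUs to CUDA
    return env

# Placeholder child command (an assumption for illustration): it only prints the variable.
command = [
    sys.executable,
    "-c",
    "import os; print(repr(os.environ['CUDA_VISIBLE_DEVICES']))",
]
result = subprocess.run(
    command, env=cpu_only_env(), capture_output=True, text=True, check=True
)
print(result.stdout.strip())
```

Setting the variable in the child's environment, instead of in the current process, avoids any dependence on whether CUDA has already been initialized in the parent.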