ColBERT (v2)

WARNING: This branch has been deprecated! Please use the main branch instead.

ColBERT is a fast and accurate retrieval model, enabling scalable BERT-based search over large text collections in tens of milliseconds.

Figure 1: ColBERT's late interaction, efficiently scoring the fine-grained similarity between a query and a passage.

As Figure 1 illustrates, ColBERT relies on fine-grained contextual late interaction: it encodes each passage into a matrix of token-level embeddings (shown above in blue). Then at search time, it embeds every query into another matrix (shown in green) and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators.
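The MaxSim scoring described above can be sketched in a few lines. This is a toy, pure-Python illustration and not the library's implementation: the function names `normalize` and `maxsim_score` are hypothetical, the 2-dimensional vectors are made up, and it assumes embeddings are compared by dot product after L2 normalization. For each query token it takes the best-matching passage token, then sums those maxima.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def maxsim_score(query_embs, passage_embs):
    """Late-interaction score: for each query token embedding, take the
    maximum similarity over all passage token embeddings, then sum."""
    score = 0.0
    for q in query_embs:
        score += max(sum(a * b for a, b in zip(q, d)) for d in passage_embs)
    return score


# Toy token-level embedding matrices (one row per token); values are illustrative.
query = [normalize(v) for v in [[1.0, 0.0], [0.0, 1.0]]]
passage_a = [normalize(v) for v in [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]]
passage_b = [normalize(v) for v in [[3.0, 0.0]]]

score_a = maxsim_score(query, passage_a)  # both query tokens find an exact match
score_b = maxsim_score(query, passage_b)  # only the first query token matches
ranking = sorted([("a", score_a), ("b", score_b)], key=lambda t: t[1], reverse=True)
```

Because passage embeddings do not depend on the query, they can be computed and indexed offline; only the query matrix is computed at search time.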

These rich interactions allow ColBERT to surpass the quality of single-vector representation models, while scaling efficiently to large corpora. You can read more in our papers:

ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (SIGIR'20).
Relevance-guided Supervision for OpenQA with ColBERT (TACL'21).
Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval (NeurIPS'21).
ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction (preprint).

Installation

ColBERT (currently: v2.0.2) requires Python 3.7+ and PyTorch 1.9+ and uses the HuggingFace Transformers library.

We strongly recommend creating a conda environment using the commands below. (If you don't have conda, follow the official conda installation guide.)

conda env create -f conda_env.yml
conda activate colbert-v0.4.2

If you face any problems, please open a new issue and we'll help you promptly!

UPDATED 2022/02/02: API Usage Notebook

The Jupyter notebook docs/intro.ipynb illustrates the key features of ColBERT with the new Python API.

It also shows how to download the ColBERTv2 model checkpoint trained on MS MARCO Passage Ranking and our new LoTTE benchmark.

CPU execution

We have included a new environment file specifically for CPU-only environments (conda_env_cpu.yml). Note that if you are testing CPU execution on a machine that has GPUs, you might need to set CUDA_VISIBLE_DEVICES="" as part of your command.
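One way to apply this setting from Python is to launch the command in a child process whose environment has CUDA_VISIBLE_DEVICES set to the empty string. The sketch below is only an illustration: the helper `cpu_only_env` is a hypothetical name, and the child command is a stand-in that prints the variable rather than a real ColBERT entry point.

```python
import os
import subprocess
import sys


def cpu_only_env(base_env=None):
    """Return a copy of the environment with all GPUs hidden from CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ""  # empty string: no GPU is visible to the process
    return env


# Stand-in child command; replace it with the script you actually want to run.
child_code = "import os; print(repr(os.environ.get('CUDA_VISIBLE_DEVICES')))"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=cpu_only_env(),
    capture_output=True,
    text=True,
)
print(result.stdout.strip())
```

The variable has to be set before the process initializes CUDA, which is why it is passed through the child's environment here instead of being changed after startup.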
