
Commit: update
MXueguang committed Mar 13, 2024
1 parent c2b436c commit 24cd99a
Showing 1 changed file (README.md) with 32 additions and 15 deletions.
@@ -1,23 +1,25 @@
# Tevatron
Tevatron is a simple and efficient toolkit for training and running dense retrievers with deep language models.
The toolkit has a modularized design for easy research; a set of command line tools is also provided for fast
development and testing. A set of easy-to-use interfaces to Huggingface's state-of-the-art pre-trained transformers
ensures Tevatron's superior performance.
# Tevatron V2
Tevatron aims to provide a flexible and efficient toolkit that enables training and inference for neural retrieval models at scale.

*Tevatron is currently in its initial development stage. We will be actively adding new features, and API changes
may happen. Suggestions, feature requests, and PRs are welcome.*
> Some of the features in Tevatron v1 are not yet migrated to Tevatron v2. We are working on it.
> If you are looking for the Tevatron v1 features, please check out the [v1 branch]().
## Features
- Command line interface for dense retriever training/encoding and dense index search.
- Flexible and extensible PyTorch retriever models.
- Highly efficient Trainer, a subclass of the Huggingface Trainer, that natively supports training performance features like mixed precision and distributed data parallel training.
- Fast and memory-efficient training/inference data access based on memory mapping with Apache Arrow through Huggingface datasets.
- JAX/Flax training/encoding on TPUs.
- Training billion-scale LLM neural retrievers on GPUs and TPUs.
- Parameter-efficient tuning with LoRA.
- Integration with DeepSpeed, flash attention, gradient accumulation, and other efficient training techniques.
- Self-contained datasets for neural retrieval and open-domain QA tasks (see the loading sketch below).
- Direct loading and fine-tuning of SoTA pre-trained models (BGE-Embedding, Instruct-E5) from HuggingFace.
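
As a quick illustration of the self-contained datasets, the sketch below loads one of the training sets referenced later in this README through Huggingface `datasets`. The `train` split name is an assumption; the example inspects the schema rather than assuming particular field names.

```python
# Minimal sketch: load a self-contained Tevatron dataset from the Huggingface hub.
# The "train" split is an assumption; print the column names to see the actual fields.
from datasets import load_dataset

dataset = load_dataset("Tevatron/msmarco-passage-aug", split="train")
print(len(dataset))          # number of training examples
print(dataset.column_names)  # e.g. query and positive/negative passage fields
```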

## Installation

## Toolkit Usage

Usage examples are provided below for three setups: PyTorch (GPU), JAX (TPU), and JAX (GPU).

<details><summary><b>PyTorch (GPU)</b></summary>

@@ -55,7 +57,7 @@

In-batch passages per query: 8x4x16 = 512

Number of queries per update: 8x4x4 = 128
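
The figures above can be read as a product of the parallelism and batching settings. The sketch below shows one plausible decomposition; the per-device batch size, gradient accumulation steps, and passages per query are assumptions, since the corresponding flags sit in the elided part of the command above.

```python
# Sketch of the effective batch figures above (assumed decomposition).
num_gpus = 4                 # 4x A6000
queries_per_device = 8       # assumed per-device train batch size
grad_accum_steps = 4         # assumed gradient accumulation steps
passages_per_query = 16      # assumed; e.g. 1 positive + 15 negatives per query

# With in-batch negatives shared across devices, each query is scored against
# every passage in the step:
in_batch_passages = queries_per_device * num_gpus * passages_per_query  # 8 x 4 x 16 = 512
queries_per_update = queries_per_device * num_gpus * grad_accum_steps   # 8 x 4 x 4 = 128
print(in_batch_passages, queries_per_update)  # 512 128
```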

The above training setting took about 70 hours on 4x A6000 GPUs.

Equivalent training took about 110 hours on 1x A100 GPU.

@@ -138,7 +140,7 @@

The output file contains one `<query_id> <passage_id> <score>` entry per line.
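
A minimal, toolkit-independent sketch for loading such a ranking file; the file name `run.txt` and the whitespace separator are assumptions.

```python
# Load a "<query_id> <passage_id> <score>" ranking file and sort each query's
# passages by descending score. "run.txt" is a hypothetical output path.
from collections import defaultdict

run = defaultdict(list)
with open("run.txt") as f:
    for line in f:
        qid, pid, score = line.split()
        run[qid].append((pid, float(score)))

for qid, hits in run.items():
    hits.sort(key=lambda x: x[1], reverse=True)
```
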
```bash
python -m tevatron.tevax.experimental.mp.train_lora \
--checkpoint_dir retriever-mistral-jax \
--train_file Tevatron/msmarco-passage-aug \
--model_name mistralai/Mistral-7B-v0.1 \
--model_type mistral \
--batch_size 128 \
@@ -157,6 +159,14 @@
--query_num_chunks 4
```

In-batch passages per query: 128x16 = 2048

Number of queries per update: 128
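
The same style of decomposition appears to apply to the TPU run; in the sketch below, the 128 queries per step come from `--batch_size` in the command above, while the 16 passages per query is an assumption.

```python
# Sketch of the TPU batch figures above (assumed decomposition).
queries_per_step = 128       # --batch_size in the command above
passages_per_query = 16      # assumed; e.g. 1 positive + 15 negatives per query

in_batch_passages = queries_per_step * passages_per_query  # 128 x 16 = 2048
queries_per_update = queries_per_step                      # no gradient accumulation assumed
print(in_batch_passages, queries_per_update)               # 2048 128
```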

The above training setting took about 42 hours on a v4-8 TPU VM.

Equivalent training took about 80 hours on 1x A100 GPU.

### Encoding

#### Query Encoding
@@ -222,9 +232,16 @@

If you find Tevatron helpful, please consider citing our [paper](https://arxiv.o
}
```


## Contacts
If you have a toolkit-specific question, feel free to open an issue.

You can also reach out to us for general comments/suggestions/questions through email.
- Luyu Gao luyug@cs.cmu.edu
- Xueguang Ma x93ma@uwaterloo.ca


## Acknowledgement

* We thank all the contributors of dependency libraries.
* We thank Google's [TPU research cloud](https://sites.research.google/trc/about/) for providing TPU resources.
