Update duoT5 training file generation instructions, fix silent bug #210

Merged: 2 commits, Aug 4, 2021
27 changes: 25 additions & 2 deletions docs/experiments-duot5-tpu.md
@@ -29,7 +29,30 @@ export DATA_DIR=data/msmarco_passage
mkdir ${DATA_DIR}
```

-We provide specific data prep instructions for evaluating on the dev set.
+We provide specific data prep instructions for the train and dev set.

### Train Set

First, download the MS MARCO train triples:
```
cd ${DATA_DIR}
wget https://storage.googleapis.com/duobert_git/triples.train.small.tar.gz
tar -xvf triples.train.small.tar.gz
rm triples.train.small.tar.gz
cd ../../
```

Then convert the train triples file to the duoT5 input format:
```
python pygaggle/data/create_msmarco_duot5_train.py --triples_train ${DATA_DIR}/triples.train.small.tsv --output_to_t5 ${DATA_DIR}/query_docs_triples.train.tsv
```

Next, copy the duoT5 input file to Google Storage. TPU training will read data directly from `gs`.
```
gsutil cp ${DATA_DIR}/query_docs_triples.train.tsv ${GS_FOLDER}/
```

This file is made available in our [bucket](https://console.cloud.google.com/storage/browser/castorini/duot5/data).

### Dev Set

@@ -156,7 +179,7 @@ git clone https://github.com/castorini/mesh.git
pip install --editable mesh
```

-## Rerank with monoT5
+## Rerank with duoT5

Let's first define the model type and checkpoint.

2 changes: 1 addition & 1 deletion docs/experiments-monot5-tpu.md
@@ -42,7 +42,7 @@ cd ../../

Then convert the train triples file to the monoT5 input format:
```
-python pygaggle/data/create_msmarco_t5_training_pairs.py --triples_train ${DATA_DIR}/triples.train.small.tsv --output_to_t5 ${DATA_DIR}/query_doc_pairs.train.tsv
+python pygaggle/data/create_msmarco_monot5_train.py --triples_train ${DATA_DIR}/triples.train.small.tsv --output_to_t5 ${DATA_DIR}/query_doc_pairs.train.tsv
```

Next, copy the monoT5 input file to Google Storage. TPU training will read data directly from `gs`.

2 changes: 1 addition & 1 deletion pygaggle/data/convert_duot5_output_to_msmarco_run.py
@@ -42,7 +42,7 @@ def load_run(path):
print('Sorting candidate docs by rank...')
sorted_run = collections.OrderedDict()
for query_id, doc_titles_ranks in tqdm(run.items()):
-sorted(doc_titles_ranks, key=lambda x: x[1])
+doc_titles_ranks.sort(key=lambda x: x[1])

Member: awesome!

Member Author: Can't believe we let this pass for so long hahaha

doc_titles = [doc_titles for doc_titles, _ in doc_titles_ranks]
sorted_run[query_id] = doc_titles
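
The one-line change in this hunk matters because the built-in `sorted()` returns a new list and leaves its argument untouched, so calling it without using the return value has no effect, while `list.sort()` reorders the list in place. A minimal sketch of the difference, using made-up document titles and ranks (the sample data is hypothetical, not taken from MS MARCO):

```python
# Hypothetical (doc_title, rank) pairs, deliberately out of rank order.
doc_titles_ranks = [("doc_c", 3), ("doc_a", 1), ("doc_b", 2)]

# Old behaviour: sorted() builds a new list; the result is discarded here,
# so doc_titles_ranks keeps its original order.
sorted(doc_titles_ranks, key=lambda x: x[1])
before_fix = [title for title, _ in doc_titles_ranks]

# New behaviour: list.sort() reorders the existing list in place.
doc_titles_ranks.sort(key=lambda x: x[1])
after_fix = [title for title, _ in doc_titles_ranks]

print(before_fix)
print(after_fix)
```

The old call never raised an error, and its effect would stay invisible whenever a run file was already written in rank order, which is what makes this kind of bug silent.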

2 changes: 1 addition & 1 deletion pygaggle/data/create_msmarco_duot5_input.py
@@ -64,7 +64,7 @@ def load_run(path, top_k=50):
print('Sorting candidate docs by rank...')
sorted_run = collections.OrderedDict()
for query_id, doc_titles_ranks in tqdm(run.items()):
-sorted(doc_titles_ranks, key=lambda x: x[1])
+doc_titles_ranks.sort(key=lambda x: x[1])
doc_titles = [doc_titles for doc_titles, _ in doc_titles_ranks][:top_k]
sorted_run[query_id] = doc_titles
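
In this script the sort feeds a `[:top_k]` slice, so an unsorted list can change which candidates survive, not only their order. The sketch below mimics that step with a hypothetical helper `top_k_titles` and made-up data; it is an illustration under those assumptions, not the code in `create_msmarco_duot5_input.py`.

```python
def top_k_titles(doc_titles_ranks, top_k, sort_in_place):
    """Return the first top_k titles, optionally sorting by rank first.

    Hypothetical helper for illustration: sort_in_place=False mimics the
    old no-op sorted() call, sort_in_place=True mimics the fixed code.
    """
    pairs = list(doc_titles_ranks)  # copy so the caller's list is untouched
    if sort_in_place:
        pairs.sort(key=lambda x: x[1])
    else:
        sorted(pairs, key=lambda x: x[1])  # result discarded, no effect
    return [title for title, _ in pairs][:top_k]


# Made-up candidates listed out of rank order.
candidates = [("doc_d", 4), ("doc_a", 1), ("doc_c", 3), ("doc_b", 2)]

old = top_k_titles(candidates, top_k=2, sort_in_place=False)
new = top_k_titles(candidates, top_k=2, sort_in_place=True)
print(old, new)
```

With the no-op sort, the slice keeps whichever candidates happen to come first in the file; with the in-place sort, it keeps the best-ranked ones.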

21 changes: 21 additions & 0 deletions pygaggle/data/create_msmarco_duot5_train.py
@@ -0,0 +1,21 @@
"""
This script creates duoT5 input files for training.
Each line in the duoT5 input file follows the format:
    f'Query: {query} Document0: {document0} Document1: {document1} Relevant:\t{label}\n'
"""
import argparse
from tqdm import tqdm

parser = argparse.ArgumentParser()
parser.add_argument("--triples_train", type=str, required=True,
                    help="tsv file <query>, <positive_document>, <negative_document>")
parser.add_argument("--output_to_t5", type=str, required=True,
                    help="t5 train input file")
args = parser.parse_args()

with open(args.output_to_t5, 'w') as fout_t5:
    for line_num, line in enumerate(tqdm(open(args.triples_train))):
        query, positive_document, negative_document = line.strip().split('\t')
        fout_t5.write(f'Query: {query} Document0: {positive_document} Document1: {negative_document} Relevant:\ttrue\n')
        fout_t5.write(f'Query: {query} Document0: {negative_document} Document1: {positive_document} Relevant:\tfalse\n')
print('Done!')
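
To make the output format concrete, the sketch below applies the same two f-strings to a single made-up triple. The helper name `triple_to_duot5_lines` and the sample text are assumptions for illustration; the script itself writes straight to the output file.

```python
def triple_to_duot5_lines(query, positive_document, negative_document):
    """Build the two duoT5 training lines for one (query, pos, neg) triple."""
    return [
        # Relevant document first: the target is 'true'.
        f'Query: {query} Document0: {positive_document} '
        f'Document1: {negative_document} Relevant:\ttrue\n',
        # Same pair with the order swapped: the target is 'false'.
        f'Query: {query} Document0: {negative_document} '
        f'Document1: {positive_document} Relevant:\tfalse\n',
    ]


# Made-up triple in the tab-separated layout the script expects.
line = 'what is a tpu\tA TPU is an accelerator.\tBananas are yellow.\n'
query, pos, neg = line.strip().split('\t')

lines = triple_to_duot5_lines(query, pos, neg)
for out in lines:
    print(out, end='')
```

Each triple yields two training examples, one per document order, so the output file has twice as many lines as the triples file.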

2 changes: 1 addition & 1 deletion pygaggle/data/create_msmarco_monot5_input.py
@@ -62,7 +62,7 @@ def load_run(path):
print('Sorting candidate docs by rank...')
sorted_run = collections.OrderedDict()
for query_id, doc_titles_ranks in tqdm(run.items()):
-sorted(doc_titles_ranks, key=lambda x: x[1])
+doc_titles_ranks.sort(key=lambda x: x[1])
doc_titles = [doc_titles for doc_titles, _ in doc_titles_ranks]
sorted_run[query_id] = doc_titles

2 changes: 1 addition & 1 deletion pygaggle/data/create_robust04_monot5_input.py
@@ -93,7 +93,7 @@ def load_run(path):
# Sort candidate docs by rank.
sorted_run = collections.OrderedDict()
for query_id, doc_titles_ranks in run.items():
-sorted(doc_titles_ranks, key=lambda x: x[1])
+doc_titles_ranks.sort(key=lambda x: x[1])
doc_titles = [doc_titles for doc_titles, _ in doc_titles_ranks]
sorted_run[query_id] = doc_titles

2 changes: 1 addition & 1 deletion pygaggle/data/msmarco.py
@@ -56,7 +56,7 @@ def load_run(cls, path: str):
run[qid].append((doc_title, int(rank)))
sorted_run = OrderedDict()
for qid, doc_titles_ranks in run.items():
-sorted(doc_titles_ranks, key=lambda x: x[1])
+doc_titles_ranks.sort(key=lambda x: x[1])
doc_titles = [doc_titles for doc_titles, _ in doc_titles_ranks]
sorted_run[qid] = doc_titles
return sorted_run
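
A small regression check along the following lines would likely have caught the no-op sort. It is only a sketch: `load_run_from_lines` is a hypothetical stand-in that reads from an iterable of strings rather than a file path, and the tab-separated `qid`, `doc_title`, `rank` layout is an assumption based on the fields visible in this hunk.

```python
from collections import OrderedDict


def load_run_from_lines(lines):
    """Group (doc_title, rank) pairs by query and order them by rank.

    Hypothetical stand-in for load_run; assumes each line is
    'qid<TAB>doc_title<TAB>rank'.
    """
    run = OrderedDict()
    for line in lines:
        qid, doc_title, rank = line.rstrip('\n').split('\t')
        run.setdefault(qid, []).append((doc_title, int(rank)))

    sorted_run = OrderedDict()
    for qid, doc_titles_ranks in run.items():
        doc_titles_ranks.sort(key=lambda x: x[1])  # in-place sort, as in the fix
        sorted_run[qid] = [doc_title for doc_title, _ in doc_titles_ranks]
    return sorted_run


# Made-up run with ranks deliberately shuffled within a query.
shuffled = ['q1\tdoc_b\t2\n', 'q2\tdoc_x\t1\n', 'q1\tdoc_a\t1\n', 'q1\tdoc_c\t3\n']
result = load_run_from_lines(shuffled)
assert result['q1'] == ['doc_a', 'doc_b', 'doc_c']
assert result['q2'] == ['doc_x']
```

Feeding the loader input that is not already in rank order is the important part; a fixture that is pre-sorted passes with or without the fix.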

2 changes: 1 addition & 1 deletion pygaggle/data/trec_covid.py
@@ -52,7 +52,7 @@ def load_run(cls, path: str):
run[qid].append((doc_title, int(rank)))
sorted_run = OrderedDict()
for qid, doc_titles_ranks in run.items():
-sorted(doc_titles_ranks, key=lambda x: x[1])
+doc_titles_ranks.sort(key=lambda x: x[1])
doc_titles = [doc_titles for doc_titles, _ in doc_titles_ranks]
sorted_run[qid] = doc_titles
return sorted_run