PyGaggle provides a gaggle of deep neural architectures for text ranking and question answering. It was designed for tight integration with Pyserini, but can be easily adapted for other sources as well.
Currently, this repo contains implementations of the rerankers for CovidQA on CORD-19, as described in "Rapidly Bootstrapping a Question Answering Dataset for COVID-19".
- Install via PyPI: `pip install pygaggle`. Requires Python 3.6+.
- Install PyTorch 1.4+.
- Download the index: `sh scripts/update-index.sh`.
- Make sure you have an installation of Java 11+: `javac --version`.
- Install Anserini.
- Clone the repo with `git clone git@github.com:castorini/pygaggle.git`.
- Make sure you have an installation of Python 3.6+. All `python` commands below refer to this.
- For pip, do `pip install -r requirements.txt`.
- If you prefer Anaconda, use `conda env create -f environment.yml && conda activate pygaggle`.
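With everything installed, reranking a handful of passages looks roughly like this. This is a minimal sketch assuming the `Query`/`Text`/`MonoT5` interfaces exposed by recent PyGaggle releases; class names and model defaults may differ in the version you have installed, and the query and passages are made up:

```python
# Minimal reranking sketch (assumes the pygaggle.rerank API of recent releases).
from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import MonoT5

reranker = MonoT5()  # by default, a T5 reranker fine-tuned on MS MARCO

query = Query('What are the initial symptoms of COVID-19?')
passages = [
    ('d1', 'Common early symptoms include fever, dry cough, and fatigue.'),
    ('d2', 'CORD-19 is a corpus of scholarly articles about COVID-19.'),
]
texts = [Text(body, {'docid': docid}, 0) for docid, body in passages]

# Higher score = more relevant to the query.
for result in sorted(reranker.rerank(query, texts), key=lambda t: t.score, reverse=True):
    print(f"{result.metadata['docid']}\t{result.score:.4f}")
```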
For a full list of mostly self-explanatory environment variables, see this file.
BM25 uses the CPU. If you don't have a GPU for the transformer models, pass `--device cpu` (PyTorch device string format) to the script, e.g., `python -um pygaggle.run.evaluate_kaggle_highlighter --method transformer --model-name bert-base-cased --device cpu`.

Note: Run the following evaluations at the root of this repo.
BM25:

```
python -um pygaggle.run.evaluate_kaggle_highlighter --method bm25
```

BERT:

```
python -um pygaggle.run.evaluate_kaggle_highlighter --method transformer --model-name bert-base-cased
```

SciBERT:

```
python -um pygaggle.run.evaluate_kaggle_highlighter --method transformer --model-name allenai/scibert_scivocab_cased
```

BioBERT:

```
python -um pygaggle.run.evaluate_kaggle_highlighter --method transformer --model-name biobert
```

T5 (fine-tuned on MS MARCO):

```
python -um pygaggle.run.evaluate_kaggle_highlighter --method t5
```
BioBERT (fine-tuned on SQuAD v1.1):

- `mkdir biobert-squad && cd biobert-squad`
- Download the weights, vocab, and config from the BioBERT repository to `biobert-squad`.
- Untar the model and rename some files in `biobert-squad` (the loop strips the training-step suffix from the checkpoint filenames; see the sketch at the end of this section):

```
tar -xvzf BERT-pubmed-1000000-SQuAD.tar.gz
mv bert_config.json config.json
for filename in model.ckpt*; do
    mv $filename $(python -c "import re; print(re.sub(r'ckpt-\\d+', 'ckpt', '$filename'))");
done
```
- Evaluate the model:

```
cd ..  # go to the root of this repo
python -um pygaggle.run.evaluate_kaggle_highlighter --method qa_transformer --model-name <folder path>
```
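The rename loop above (and the identical one in the next section) drops the training-step suffix that TensorFlow appends to checkpoint filenames, since Hugging Face transformers, which PyGaggle builds on, looks for TensorFlow weights named `model.ckpt.*`. A quick illustration of the regex, with made-up filenames:

```python
import re

# The substitution from the loop: drop the step suffix after 'ckpt'.
# Filenames here are illustrative; actual step numbers depend on the checkpoint.
for filename in ['model.ckpt-1000000.index', 'model.ckpt-1000000.meta']:
    print(re.sub(r'ckpt-\d+', 'ckpt', filename))
# Output:
# model.ckpt.index
# model.ckpt.meta
```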
BioBERT (fine-tuned on MS MARCO):

- Download the weights, vocab, and config from our Google Storage bucket. This requires an installation of gsutil.

```
mkdir biobert-marco && cd biobert-marco
gsutil cp "gs://neuralresearcher_data/doc2query/experiments/exp374/model.ckpt-100000*" .
gsutil cp gs://neuralresearcher_data/biobert_models/biobert_v1.1_pubmed/bert_config.json config.json
gsutil cp gs://neuralresearcher_data/biobert_models/biobert_v1.1_pubmed/vocab.txt .
```
- Rename the files:

```
for filename in model.ckpt*; do
    mv $filename $(python -c "import re; print(re.sub(r'ckpt-\\d+', 'ckpt', '$filename'))");
done
```
- Evaluate the model:

```
cd ..  # go to the root of this repo
python -um pygaggle.run.evaluate_kaggle_highlighter --method seq_class_transformer --model-name <folder path>
```
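Optionally, as a sanity check (a sketch, not part of the official instructions): after the rename, the checkpoint directory should load with Hugging Face transformers, which PyGaggle builds on. The `from_tf=True` loading path and the `biobert-marco` folder name are assumptions here, and loading TensorFlow checkpoints requires TensorFlow to be installed:

```python
# Sanity-check sketch: confirm the converted checkpoint directory is loadable.
# Requires TensorFlow installed, since from_tf=True reads the model.ckpt.* files.
from transformers import BertForSequenceClassification, BertTokenizer

model_path = 'biobert-marco'  # the folder populated and renamed above
tokenizer = BertTokenizer.from_pretrained(model_path)  # reads vocab.txt
model = BertForSequenceClassification.from_pretrained(model_path, from_tf=True)  # reads config.json + model.ckpt.*
print(sum(p.numel() for p in model.parameters()), 'parameters loaded')
```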