This repository contains code for evaluating NLG models as described in the following paper:
Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons
Akash Kumar Mohankumar, Mitesh M. Khapra. Association for Computational Linguistics (ACL), 2022
From Source:
```bash
git clone https://github.com/akashkm99/duelnlg.git
cd duelnlg
pip install -e .
```
To use automatic metrics, you may also need to download nlgeval data:
```bash
python ./scripts/download/nlg-eval --setup
```
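If you want to verify that the nlgeval data is in place, you can instantiate the metric wrapper from the `nlg-eval` package directly. This is only a sanity check, not part of the pipeline; the flags below skip the heavier skip-thoughts/GloVe based metrics, and you can drop them if your metric configs need those.

```python
# Quick sanity check that nlg-eval and its downloaded data are usable.
# Skipping the skip-thoughts and GloVe based metrics keeps the check fast.
from nlgeval import NLGEval

nlgeval = NLGEval(no_skipthoughts=True, no_glove=True)
scores = nlgeval.compute_individual_metrics(
    ref=["the cat sat on the mat"],
    hyp="a cat is sitting on the mat",
)
print(scores)  # BLEU, METEOR, ROUGE_L, CIDEr, ...
```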
Here, we describe the steps to replicate the experiments mentioned in the paper.
To download and preprocess the WMT 2016 datasets:
```bash
bash scripts/preprocess/wmt16.sh
```
All the processed data will be stored as `.pkl` files at `data/wmt16/processed/`.
For the WMT 2015 datasets:

```bash
bash scripts/preprocess/wmt15.sh
```
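If you want to sanity-check the preprocessing output, the processed files are ordinary Python pickles. The sketch below just lists them and prints their top-level structure; the exact file names and internal layout depend on what the preprocessing scripts produce.

```python
import glob
import pickle

# Inspect whatever the preprocessing scripts wrote; the internal structure of
# the pickles is not documented here, so we only print types and top-level keys.
for path in sorted(glob.glob("data/wmt16/processed/*.pkl")):
    with open(path, "rb") as f:
        obj = pickle.load(f)
    summary = list(obj.keys())[:5] if isinstance(obj, dict) else type(obj).__name__
    print(path, "->", summary)
```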
To perform experiments with model-free dueling bandit algorithms, use the `duelnlg/duelpy/experiments/experiments.py` script. It has the following arguments:

- `--feedback-config`: A JSON config that specifies the list of datasets and their parameters. Use `configs/feedback/wmt_all.json` to run on all 7 WMT datasets.
- `--algorithm-config`: A config file that specifies the dueling bandit algorithms and their parameters. Use `configs/algorithm/rmed.json` to run the RMED algorithm, and refer to `configs/algorithm/default_all.json` for the default parameters of all algorithms.
- `--output-dir`: Directory to save the results. (Default: `./results/bandits`)
- `--num-runs`: The number of times each algorithm is run with a different random seed. (Default: 200)
- `--random-seed`: The base random seed to use. (Default: 42)
For example, to run all the dueling bandit algorithms (except IF and PL, which are quite slow) on the WMT 2016 tur->eng dataset with 50 runs, use:
```bash
python duelnlg/duelpy/experiments/experiments.py \
    --feedback-config ./configs/feedback/wmt16_tur_eng.json \
    --algorithm-config ./configs/algorithm/default_all_no_if_pl.json \
    --num-runs 50
```
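To sweep over several configurations in one go, you can wrap the script in a small driver. The sketch below uses only the command-line flags documented above; the particular config files listed are just examples.

```python
import subprocess

# Example sweep: run the experiments script once per algorithm config.
algorithm_configs = [
    "./configs/algorithm/rmed.json",
    "./configs/algorithm/default_all_no_if_pl.json",
]

for algo_config in algorithm_configs:
    subprocess.run(
        [
            "python", "duelnlg/duelpy/experiments/experiments.py",
            "--feedback-config", "./configs/feedback/wmt16_tur_eng.json",
            "--algorithm-config", algo_config,
            "--num-runs", "50",
        ],
        check=True,
    )
```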
To use direct evaluation metrics, we need to tune a few hyperparameters (e.g. thresholds for the preference probabilities) on a validation set. For training any end-to-end metric for pairwise prediction, we would also require a training set.
To create the train and validation datasets for WMT, we use data from WMT 2013 and 2014:
```bash
bash scripts/prepare_train_val/wmt.sh
```
To run the Bleurt model, you need to download the model checkpoint:
```bash
bash scripts/download/bleurt_ckpt.sh
```
To run automatic metrics and save the predictions, use the `duelnlg/direct_eval/evaluation.py` script. It has the following arguments:

- `--metrics-config`: A JSON config that specifies the list of automatic metrics and their parameters. Use `configs/metrics/bleurt.json` to use Bleurt, and refer to `configs/metrics/all.json` to run all metrics.
- `--val-path` and `--test-path`: CSV files with the validation (for tuning) and test datasets. E.g. for WMT 2016, these are `./data/wmt13_14/processed/val.csv` and `data/wmt16/processed/wmt16-human-judgements.csv` respectively.
- `--processed-dir`: Directory with the processed `.pkl` files. E.g. for WMT 2016, it's `data/wmt16/processed`.
- `--ensemble`: Whether to perform multiple model forward passes with dropout for uncertainty estimation. Applicable only for Bleurt. (Default: False)
- `--multiref`: Whether the dataset has multiple reference texts. (Default: True)
For example, to run the Bleurt metric on WMT 2016 datasets, use the following:
```bash
python duelnlg/direct_eval/evaluation.py \
    --metrics ./configs/metrics/bleurt.json \
    --val-path ./data/wmt13_14/processed/val.csv \
    --test-path ./data/wmt16/processed/wmt16-human-judgements.csv \
    --output-results ./results/metrics/bleurt.csv \
    --processed-dir ./data/wmt16/processed
```
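The metric predictions are written to the CSV passed via `--output-results`. The exact column layout depends on the metrics config, so the snippet below simply prints whatever is in the file.

```python
import pandas as pd

# Inspect the saved metric predictions without assuming a particular schema.
results = pd.read_csv("./results/metrics/bleurt.csv")
print(results.columns.tolist())
print(results.head())
```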
- Use GPUs to speed up the evaluation. If GPUs are not being used, check your TensorFlow version and CUDA compatibility (install a TensorFlow (>2.0) version that supports your CUDA version).
- To accelerate your evaluation with Google Cloud TPUs, refer to `configs/metrics/bleurt_tpu.json`. You just need to provide information on your storage bucket and TPU.
To compute the uncertainty in the Bleurt scores (required for the Uncertainty-aware Selection and UCB Elimination algorithms), use the following:
```bash
python duelnlg/direct_eval/evaluation.py \
    --metrics ./configs/metrics/bleurt_ensemble.json \
    --val-path ./data/wmt13_14/processed/val_1k.csv \
    --test-path ./data/wmt16/processed/wmt16-human-judgements.csv \
    --output-results ./results/metrics/bleurt_ensemble.csv \
    --processed-dir ./data/wmt16/processed \
    --ensemble
```
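Conceptually, the `--ensemble` flag corresponds to MC-dropout style uncertainty estimation: Bleurt is run several times with dropout active, and the spread of the sampled scores provides the uncertainty signal (the algorithms use BALD-style quantities derived from such samples). A toy illustration of the idea, not the repository's code:

```python
import numpy as np

# Toy example: each row holds K Bleurt scores for one candidate, obtained from
# K stochastic forward passes with dropout enabled.
ensemble_scores = np.array([
    [0.61, 0.58, 0.64, 0.60],   # candidate 1: low spread -> low uncertainty
    [0.20, 0.45, 0.33, 0.27],   # candidate 2: high spread -> high uncertainty
])

point_estimate = ensemble_scores.mean(axis=1)
uncertainty = ensemble_scores.std(axis=1)
print(point_estimate, uncertainty)
```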
Once you've computed the automatic metric predictions, you can run model-based algorithms by simply adding the `--model-config` flag to the `duelnlg/duelpy/experiments/experiments.py` script.
For example, to perform Random Mixing with Bleurt using RMED on the WMT16 tur->eng dataset, use:
```bash
python duelnlg/duelpy/experiments/experiments.py \
    --model-config ./configs/models/random_mixing_bleurt.json \
    --feedback-config ./configs/feedback/wmt16_tur_eng.json \
    --algorithm-config ./configs/algorithm/rmed.json \
    --num-runs 200
```
For other model-based algorithms, you can use the following model configs:
| Algorithm | Config |
|---|---|
| Random Mixing | `./configs/models/random_mixing_bleurt.json` |
| Uncertainty-aware Selection (BALD) | `./configs/models/uncertainity_bleurt.json` |
| UCB Elimination | `./configs/models/ucb_elimination_bleurt.json` |
| Uncertainty + UCB Elimination | `./configs/models/uncertainity_ucb_elimination_bleurt.json` |
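To run all the model-based variants above in one pass, a small driver like the following works; the config paths are copied from the table and the remaining flags mirror the example command.

```python
import subprocess

# Run RMED with each model-based variant listed in the table above.
model_configs = [
    "./configs/models/random_mixing_bleurt.json",
    "./configs/models/uncertainity_bleurt.json",
    "./configs/models/ucb_elimination_bleurt.json",
    "./configs/models/uncertainity_ucb_elimination_bleurt.json",
]

for model_config in model_configs:
    subprocess.run(
        [
            "python", "duelnlg/duelpy/experiments/experiments.py",
            "--model-config", model_config,
            "--feedback-config", "./configs/feedback/wmt16_tur_eng.json",
            "--algorithm-config", "./configs/algorithm/rmed.json",
            "--num-runs", "200",
        ],
        check=True,
    )
```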