We have used BAGEL, SFHOT and SFRES datasets for our experiments. For each dataset, we converted the original dataset into an unified form as shown below (all texts tokenized, normal cased). The unified data form is in each dataset folder, and is named as data.pkl
. Note that there is another file final_p.pkl
in each dataset folder, which is our calculated score file. We take the max when combining the multi-reference results.
{
"doc_id": {
"src": "This is the source text.",
"sys_summ": "This is the system generated text.",
"ref_summs": [
"This is the first reference text.",
"This is the second reference text.",
"..."
],
"scores": {
"informativeness": 6.0,
"naturalness": 4.0,
"quality": 5.0
}
}
}
After calculating scores using automatic metrics, the scores
field for each document is updated, like the one below.
"scores": {
"auto_metric1": 0.9,
"auto_metric2": 0.7,
"informativeness": 6.0,
"naturalness": 4.0,
"quality": 5.0
}
Please run the following commands to download the PRISM model.
mkdir models
sh download.sh
Our trained BARTScore (on ParaBank2) can be downloaded here. Please also move it to the models
folder for subsequent experiments if you consider using it.
Run the following to see all the arguments that are supported by the score.py
script.
python score.py --help
To reproduce the result, run the following as an example.
python score.py --file BAGEL/data.pkl --device cuda:0 --output BAGEL/scores.pkl --bert_score --mover_score --rouge --bart_score --bart_score_cnn --bart_score_para --prism --prompt bart_para_ref