The is the repository documenting experiments for the ACL 2024 paper: RORA: Robust Free-Text Rationale Evaluation.
steps/ // callable scripts corresponds to each step of REV score calculation.
src/ // source code of models, trainers, data collations etc.
scripts/ // helper scripts to do examination, sanity check etc.
We need to prepare the dataset in the following format:
{
"qid": "7fa631340ce8c42aba53",
"term": "1980 United States presidential election",
"description": "49th quadrennial presidential election in the United States",
"question": "Were there greater landslides than 1980 United States presidential election?",
"answer": true,
"facts": [
"A landslide refers to a competitor beating their opponent by a wide margin.",
"Ronald Reagan defeated Jimmy carter in the 1980 United States presidential election by around 8 million votes.",
"Franklin D. Roosevelt won the 1936 United States presidential election over Alf Landon by more than 11 million votes.",
"In 1804 Thomas Jefferson received 162 (92%) of the electoral votes while Charles Cotesworth Pinckney received only 14 (8%)."
],
"decomposition": [
"By what votes margin did Ronald Reagan defeat Jimmy Carter in the 1980 US Presidential election?",
"By how many votes was Franklin D. Roosevelt leading Alf Landon in the 1936 US Presidential election?",
"How many more votes did Thomas Jefferson receive than Charles Cotesworth Pinckney in the 1804 United States presidential election?", "Are #2 and #3 greater individually than #1?"
],
"vacuous_rationale": "There were greater landslides than 1980 United States presidential election."
}
We use the same split here.
The scripts expect data in the huggingface datasets format.
To run the code, you need to configure the environment by running the following command:
pip install -r requirements.txt
In case pwd is not in the PYTHONPATH, you need to add the path:
export PYTHONPATH=$PYTHONPATH:$(pwd)
- Configure the environment
source runs/configure.sh \
--removal-model-type=${REMOVAL_MODEL_TYPE} \
--dataset-name=${DATASETNAME} \
--rationale-format=${RATIONALE_FORMAT} \
--num-ngrams=${NUM_NGRAMS} \
--min-freq=${MIN_FREQ} \
--max-tokens=${MAX_TOKENS} \
--threshold=${THRESHOLD} \
--irm-coefficient=${IRM_COEFFICIENT} \
--rev-model-type=${REV_MODEL_TYPE} \
--removal-epochs=${REMOVAL_EPOCHS} \
--removal-batch-size=${REMOVAL_BATCH_SIZE} \
--generation-epochs=${GENERATION_EPOCHS} \
--generation-batch-size=${GENERATION_BATCH_SIZE} \
--rev-epochs=${REV_EPOCHS} \
--rev-batch-size=${REV_BATCH_SIZE} \
--learning-rate=${REV_LEARNING_RATE}
By specifying the relevant parameters, the script will make sure all the relevant varaibles in the process got properly configured.
- Create raw data by appending model-generated rationales
make raw_dataset
- Create vocab files
Notice that we will use non-pretrained models to calculate attributions, so that we need to create our own vocab to avoid under-training.
make vocab_file
- Create removal dataset that is used to train removal_model and calculate attributions
make removal_dataset
- Train removal model
make removal_model
- Generation dataset creation
Now using the calculation we generate the dataset that can be used to train the generation model.
make generation_dataset
- Train generation model
make generation_model
- Create the REV dataset
Now we have the model to generate counterfactual rationales for data, we can create the REV dataset.
make rev_dataset
- Train REV model
make rev_model
- Notice that we need a baseline model to calculate REV, so we need to train a baseline model (first we prepare dataset).
make baseline_dataset
- Actually training the baseline model
make baseline_model
- From these two models we are able to generate the report file.
make report_file
Notice that due to the nature of Makefile
, the process will be executed in a pipeline manner, so that if you want to re-run the process, you need to clean the intermediate files.
make clean
And making the last step report_file
will run the whole process. Please check runs/run_make_{ecqa, strategyqa}.sh
for examples.
If you use this code, please cite the following paper:
@misc{jiang2024rora,
title={RORA: Robust Free-Text Rationale Evaluation},
author={Zhengping Jiang and Yining Lu and Hanjie Chen and Daniel Khashabi and Benjamin Van Durme and Anqi Liu},
year={2024},
eprint={2402.18678},
archivePrefix={arXiv},
primaryClass={id='cs.CL' full_name='Computation and Language' is_active=True alt_name='cmp-lg' in_archive='cs' is_general=False description='Covers natural language processing. Roughly includes material in ACM Subject Class I.2.7. Note that work on artificial languages (programming languages, logics, formal systems) that does not explicitly address natural-language issues broadly construed (natural-language processing, computational linguistics, speech, text retrieval, etc.) is not appropriate for this area.'}
}