This repository provides the code for reproducing the results from the TACL paper "Self-Rationalization in the Wild: A Large Scale Out-of-Distribution Evaluation on NLI-related tasks". In short, it contains the code for fine-tuning two language models, T5-Large and OLMo-7B (with LoRA), for self-rationalization, and for performing inference on 19 NLI-related out-of-distribution (OOD) datasets.
Free-text explanations are expressive and easy to understand, but many datasets lack annotated explanation data, making it challenging to train models for explainable predictions. To address this, we investigate how to use existing explanation datasets for self-rationalization and evaluate models' out-of-distribution (OOD) performance. We fine-tune T5-Large and OLMo-7B models and assess the impact of fine-tuning data quality, the number of fine-tuning samples, and few-shot selection methods. The models are evaluated on 19 diverse OOD datasets across three tasks: natural language inference (NLI), fact-checking, and hallucination detection in abstractive summarization. To evaluate the generated explanations, we conduct a human study on 13 selected models and examine how its results correlate with the Acceptability score (T5-11B) and three other LLM-based reference-free metrics. The human evaluation shows that the Acceptability score correlates most strongly with human judgments, demonstrating its effectiveness in evaluating free-text explanations. Our findings reveal that: 1) a few annotated examples effectively adapt models for OOD explanation generation; 2) compared to sample selection strategies, the fine-tuning data source has a larger impact on OOD performance; and 3) models with higher label prediction accuracy tend to produce better explanations, as reflected by higher Acceptability scores.
Contact person: Jing Yang
Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.
Follow the instructions below to recreate the environment used for all our experiments:
- Create and activate a conda environment
conda create -n oodeval python=3.10
conda activate oodeval
- Install the required packages
pip install -r requirements.txt
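As a quick sanity check of the environment (assuming torch and transformers are among the packages pinned in requirements.txt), you can run:

```python
# Minimal environment check; assumes torch and transformers were installed
# via requirements.txt.
import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```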
The e-SNLI dataset can be loaded directly from Hugging Face. To download the e-FEVER dataset, you need to ask the authors of the original paper for permission.
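For reference, e-SNLI can be inspected directly with the datasets library; the field names below are those of the Hugging Face esnli dataset and may differ from what our scripts use internally:

```python
from datasets import load_dataset

# e-SNLI from the Hugging Face hub: premise, hypothesis, label (0/1/2)
# and up to three human-written explanations per example.
esnli = load_dataset("esnli")
example = esnli["train"][0]
print(example["premise"], "|", example["hypothesis"], "|", example["label"])
print(example["explanation_1"])
```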
Before running the following commands, you need to enter the code directory: cd code.
The first step of our pipeline is to select different numbers of instances for model fine-tuning. To perform the instance selection, you can run the following command:
python sample_selection.py --source_dataset esnli --sample_selection random --split 0
You can replace the --sample_selection option with the other choices: fastvotek, ambiguous, accept-fastvotek, and accept-ambiguous. The --split option selects one of the five subsets (0, 1, 2, 3, 4) of the corresponding source dataset. For the ambiguous selection, you need to pass the original T5 model path (with --model_path t5-large) for estimating the probability of labels.
We have also provided the resulting selected samples in the folder ../samples.
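If you want to generate selections for every strategy and subset in one go, a convenience loop around the command above could look like the sketch below (it assumes accept-ambiguous also needs --model_path, which you should verify in sample_selection.py):

```python
import subprocess

# Convenience wrapper around sample_selection.py using the options documented above.
strategies = ["random", "fastvotek", "ambiguous", "accept-fastvotek", "accept-ambiguous"]
for strategy in strategies:
    for split in range(5):  # the five subsets 0-4
        cmd = ["python", "sample_selection.py",
               "--source_dataset", "esnli",
               "--sample_selection", strategy,
               "--split", str(split)]
        if "ambiguous" in strategy:
            # label-probability estimation needs the base T5 model
            # (assumed to also apply to accept-ambiguous)
            cmd += ["--model_path", "t5-large"]
        subprocess.run(cmd, check=True)
```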
With the selected samples, you can run the following command to fine-tune the T5-Large model:
python training.py --project_name your_wandb_project_name --source_dataset_name esnli --n_shots 8 --sample_selection random --data_sub 0 --save_model_path ../model/ --do_eval
Setting --do_eval uses the validation set to select the best checkpoint of the model.
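Once training finishes, the checkpoint saved under ../model/ should be loadable with the standard transformers API. The snippet below is only a sketch: the checkpoint sub-directory and the prompt format are assumptions, and the authoritative input serialization is defined in training.py.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Hypothetical checkpoint path; adjust it to the directory written by training.py.
model_path = "../model/esnli_random_8shots_sub0"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

# The prompt format below is an assumption about how inputs are serialized.
prompt = ("explain nli premise: A man is playing a guitar on stage. "
          "hypothesis: A person is performing music.")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```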
The following command is for fine-tuning OLMo-7B with LoRA.
python training_olmo.py --project_name your_wandb_project_name --n_shots 8 --sample_selection random --data_sub 0
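For orientation, a LoRA setup for OLMo-7B with the peft library typically looks like the sketch below; the base model id, target modules, and hyperparameters here are illustrative assumptions, and the values actually used are in training_olmo.py.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Illustrative values only; see training_olmo.py for the configuration used in the paper.
base_model = "allenai/OLMo-7B-hf"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```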
You can find the code to load all the listed OOD datasets in load_target_dataset.py. Some datasets need to be downloaded manually from their original sources.
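Several of the OOD datasets are hosted on the Hugging Face hub; for example, SICK (used as --target_dataset_name sick in the commands below) can be loaded as shown here. Depending on your datasets version, script-based datasets may require trust_remote_code=True.

```python
from datasets import load_dataset

# SICK is one of the 19 OOD target datasets and is available on the Hugging Face hub.
sick = load_dataset("sick", trust_remote_code=True)
example = sick["test"][0]
print(example["sentence_A"], "|", example["sentence_B"], "|", example["label"])
```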
The following command is for inference with the fine-tuned T5-Large models.
python Inference.py --project_name your_wandb_project_name --source_dataset_name esnli --target_dataset_name sick --sample_selection random --n_shots 8 --test_bsz 64 --data_sub 0
The following command is for inference with the fine-tuned OLMo models.
python inference-olmo.py --project_name your_wandb_project_name --source_dataset_name esnli --target_dataset_name sick --sample_selection random --n_shots 8 --test_bsz 64 --data_sub 0
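If you want to post-process the generated outputs yourself (e.g., to compute label accuracy), a common pattern is to split each generation into a predicted label and an explanation. The "label | explanation" separator below is only an assumption; the inference scripts define the actual output format.

```python
# Illustrative post-processing; the "label | explanation" format is an assumption.
def parse_generation(text):
    label, _, explanation = text.partition("|")
    return label.strip().lower(), explanation.strip()

generations = ["entailment | A person playing a guitar on stage is performing music."]
gold_labels = ["entailment"]

predictions = [parse_generation(g)[0] for g in generations]
accuracy = sum(p == g for p, g in zip(predictions, gold_labels)) / len(gold_labels)
print(f"label accuracy: {accuracy:.3f}")
```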
You can also find the fine-tuned models and the generated OOD results in this Google Drive folder.
Run the following command to evaluate the generated explanations with the Acceptability score.
python acceptability_evaluation.py --source_dataset_name esnli --model_type olmo --target_dataset_name sick --sample_selection random --n_shots 8 --data_sub 0
For human evaluation, we follow example projects from the Potato Annotation Tool. The code for our project can be found here.
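For the correlation analysis between human judgments and the automatic metrics, a minimal sketch of a rank-correlation computation with scipy is shown below; the score arrays are placeholders, not data from the paper.

```python
from scipy.stats import spearmanr

# Placeholder scores; replace with per-example human ratings and metric scores
# (e.g., the Acceptability score) to run a correlation analysis.
human_scores = [0.2, 0.8, 0.5, 0.9, 0.4]
metric_scores = [0.3, 0.7, 0.4, 0.95, 0.35]

rho, p_value = spearmanr(human_scores, metric_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```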
In progress
This repository is published for the sole purpose of giving additional background details on the respective publication.