Code release and supplementary materials for:
"ABFS: Natural Robustness Testing for LLM-based NLP Software"
Slightly perturbed text can mislead ChatGPT o1-preview into changing its predicted label for a piece of financial news from "POSITIVE" (with a confidence of 95%) to "NEGATIVE" (with a confidence of 70%).
There are three datasets used in our experiments:
The code is organized into the following modules:
- datasets: define the dataset object used for carrying out tests
- goal_functions: determine if the testing method generates successful test cases
- search_methods: explore the space of transformations and try to locate a successful perturbation
- transformations: transform the input text, e.g. synonym replacement
- constraints: determine whether or not a given transformation is valid
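To make the division of labor between these modules concrete, here is a minimal, self-contained sketch of how a transformation, a constraint, a goal function, and a search method can fit together. It does not use this repository's classes or the TextAttack API: every name in it (`SYNONYMS`, `transform`, `satisfies_constraint`, `goal_reached`, `search`, `toy_model`) is hypothetical, the synonym table is made up, and the search is a plain breadth-first exploration rather than ABFS.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical synonym table, for illustration only.
SYNONYMS: Dict[str, List[str]] = {
    "rose": ["climbed", "increased"],
    "profit": ["earnings"],
}


def transform(words: List[str]) -> Iterable[List[str]]:
    """Transformation: yield every single-word synonym replacement."""
    for i, word in enumerate(words):
        for synonym in SYNONYMS.get(word, []):
            yield words[:i] + [synonym] + words[i + 1:]


def satisfies_constraint(original: List[str], candidate: List[str],
                         max_changed: int = 2) -> bool:
    """Constraint: same length, and at most `max_changed` words differ."""
    if len(original) != len(candidate):
        return False
    changed = sum(a != b for a, b in zip(original, candidate))
    return changed <= max_changed


def goal_reached(model: Callable[[str], str], original_label: str,
                 candidate: List[str]) -> bool:
    """Goal function (untargeted): the predicted label has changed."""
    return model(" ".join(candidate)) != original_label


def search(model: Callable[[str], str], text: str,
           max_changed: int = 2) -> Optional[str]:
    """Search method: breadth-first exploration of valid perturbations."""
    original = text.split()
    original_label = model(text)
    frontier = [original]
    seen = {tuple(original)}
    while frontier:
        current = frontier.pop(0)
        for candidate in transform(current):
            key = tuple(candidate)
            if key in seen or not satisfies_constraint(original, candidate, max_changed):
                continue
            seen.add(key)
            if goal_reached(model, original_label, candidate):
                return " ".join(candidate)  # successful test case
            frontier.append(candidate)
    return None  # no successful perturbation within the constraint


def toy_model(text: str) -> str:
    """Stand-in for the software under test: a brittle keyword classifier."""
    return "NEGATIVE" if "earnings" in text.split() else "POSITIVE"


if __name__ == "__main__":
    print(search(toy_model, "profit rose sharply"))
```

In the actual code base the model call would go to an LLM and the search would be the adaptive best-first method; the sketch only shows which responsibility belongs to which module.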
The most important files in this project are as follows:
- goal_functions/classification/untargeted_llm_classification.py: quantify the goal of testing LLM-based NLP software in text classification task
- search_methods/best_first_word_swap_wir.py: search test cases based on adaptive best-first search
- inference.py: drive threat LLMs to do inference and process outputs
- abfs_fi_llama270b.py: an example of testing Llama-2-70b-chat on the Financial Phrasebank dataset via ABFS
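As an illustration of the "process outputs" step, the sketch below extracts a label and a confidence from a free-text LLM reply. It is not taken from inference.py: the function name `parse_prediction`, the label set, and the assumed reply format (a label word plus an optional percentage) are all assumptions made for this example.

```python
import re
from typing import Optional, Tuple

# Assumed label set for a sentiment task; adjust to the dataset in use.
LABELS = ("POSITIVE", "NEGATIVE", "NEUTRAL")

_LABEL_RE = re.compile(r"\b(" + "|".join(LABELS) + r")\b", re.IGNORECASE)
_PERCENT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def parse_prediction(response: str) -> Optional[Tuple[str, Optional[float]]]:
    """Return (label, confidence in [0, 1] or None), or None if no label is found."""
    label_match = _LABEL_RE.search(response)
    if label_match is None:
        return None  # the reply contains no recognizable label
    label = label_match.group(1).upper()

    percent_match = _PERCENT_RE.search(response)
    confidence = float(percent_match.group(1)) / 100 if percent_match else None
    return label, confidence


if __name__ == "__main__":
    print(parse_prediction("Label: NEGATIVE, confidence: 70%"))
```

Returning `None` for unparseable replies lets the caller decide whether to retry the query or skip the sample, instead of silently assigning a default label.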
The code was tested with:
- bert-score>=0.3.5
- autocorrect==2.6.1
- accelerate==0.25.0
- datasets==2.15.0
- nltk==3.8.1
- openai==1.3.7
- sentencepiece==0.1.99
- tokenizers==0.15.0
- torch==2.1.1
- tqdm==4.66.1
- transformers==4.38.0
- Pillow==10.3.0
- transformers_stream_generator==0.0.5
- matplotlib==3.8.3
- tiktoken==0.6.0
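To check a local environment against the exact pins above, a small script along the following lines can help. It is a sketch, not part of this repository: the helper name `find_mismatches` is hypothetical, only a few of the pinned packages are listed, and it compares version strings literally, so a build such as a CUDA-suffixed torch wheel would be reported as a mismatch.

```python
from importlib import metadata
from typing import Callable, Dict, Optional

# A subset of the exact pins listed above; extend as needed.
PINNED: Dict[str, str] = {
    "torch": "2.1.1",
    "transformers": "4.38.0",
    "openai": "1.3.7",
    "datasets": "2.15.0",
}


def find_mismatches(
    pins: Dict[str, str],
    get_version: Callable[[str], str] = metadata.version,
) -> Dict[str, Optional[str]]:
    """Map each package whose installed version differs from its pin
    to the installed version (None if the package is not installed)."""
    mismatches: Dict[str, Optional[str]] = {}
    for package, expected in pins.items():
        try:
            installed: Optional[str] = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[package] = installed
    return mismatches


if __name__ == "__main__":
    for package, installed in find_mismatches(PINNED).items():
        print(f"{package}: expected {PINNED[package]}, found {installed}")
```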
Follow these steps to run the attack from the library:
- Fork this repository.
- Run the following command to install it.
  $ pip install -e . ".[dev]"
- Run the following command to test Llama-2-70b-chat on the Financial Phrasebank dataset via ABFS.
  $ python abfs_fi_llama270b.py
Take a look at the Models directory on Hugging Face to run the test across any threat model.
This code and model are available for non-commercial scientific research purposes as defined in the LICENSE file. By downloading and using the code and model you agree to the terms in the LICENSE.
This code is based on the TextAttack framework.