GitHub - INK-USC/Lifelong-ICL: Code for paper "Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack"

Stress-Testing Long-Context Language Models
with Lifelong ICL and Task Haystack

📃 [Paper] • 🏠 [Website] • 🚀 [Quick Start]

🌱 Lifelong ICL is a new problem setting that challenges long-context LMs to learn a sequence of language tasks through in-context learning.
🌾 Task Haystack is an evaluation suite for assessing and diagnosing how long-context LMs utilize contexts in Lifelong ICL.
- ✅ Pass Rate measures how often the performance of Lifelong ICL is not significantly worse than that of Single-task ICL.
- 🔎 To pass the test, the model needs to locate and make use of the relevant ICL demonstrations (the "needle") in the Lifelong ICL prompt (the "task haystack").
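
To make the Pass Rate definition concrete, here is a minimal sketch. It assumes each test case has several accuracy scores (for example, from different random seeds) under both Lifelong ICL and Single-task ICL, and it uses a one-sided exact permutation test as the significance test. The function names, the 0.05 threshold, and the choice of test are illustrative assumptions and may differ from the procedure used in the paper.

```python
from itertools import combinations
from statistics import mean


def significantly_worse(lifelong, single, alpha=0.05):
    """One-sided exact permutation test (an assumed choice of test).

    Returns True if the Lifelong ICL scores are significantly lower
    than the Single-task ICL scores at level `alpha`.
    """
    observed = mean(single) - mean(lifelong)
    pooled = list(single) + list(lifelong)
    indices = range(len(pooled))
    total = 0
    extreme = 0
    # Enumerate every way of relabeling which scores count as "single".
    for chosen in combinations(indices, len(single)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        if mean(group_a) - mean(group_b) >= observed:
            extreme += 1
    return extreme / total < alpha


def pass_rate(cases, alpha=0.05):
    """cases: list of (lifelong_scores, single_scores) pairs, one per test case."""
    passed = sum(
        not significantly_worse(lifelong, single, alpha)
        for lifelong, single in cases
    )
    return passed / len(cases)
```

A case passes when the Lifelong ICL scores are not significantly lower than the baseline scores, so a model that matches its Single-task ICL performance everywhere would reach a pass rate of 1.0.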

Result Summary

We benchmark 11 open-weight models and 3 closed models using Lifelong ICL. Here is the summary of our results:

State-of-the-art closed models such as GPT-4o still struggle in this setting, failing 15% of the cases on average. The open-weight models we evaluate lag further behind by a large margin.
Llama3.1-70B is the best open-weight model we tested.
Check out the full result table.

Diagnosing Models with Task Haystack

Task Haystack inherits the controllability aspect of the original needle-in-a-haystack test, making it easy to create clear visualizations for diagnosing model vulnerabilities. Here we provide an example of Mistral-7B (32k):

While long-context LMs excel at retrieving and pasting information within the context (Left), their ability to utilize the context with deep understanding remains limited (Middle, Right).

Further Analysis

With Task Haystack, we can further:

Group the results by task, by permutation, or by depth in the context.

Investigate context utilization with controlled experiments.
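
As an illustration of grouping results by task, permutation, or depth, the sketch below aggregates per-case pass/fail records along one field. The record layout (`task`, `permutation`, `depth`, `passed`) and the function name are hypothetical; the notebooks in this repository may store results differently.

```python
from collections import defaultdict


def pass_rate_by(records, key):
    """Aggregate pass/fail records into a pass rate per value of `key`.

    `records` is a list of dicts with hypothetical fields
    "task", "permutation", "depth", and a boolean "passed".
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        passes[record[key]] += int(record["passed"])
    return {value: passes[value] / totals[value] for value in totals}


# Toy records for illustration only; these are not real results.
records = [
    {"task": "task_a", "permutation": 0, "depth": 0, "passed": True},
    {"task": "task_a", "permutation": 1, "depth": 1, "passed": False},
    {"task": "task_b", "permutation": 0, "depth": 1, "passed": True},
    {"task": "task_b", "permutation": 1, "depth": 0, "passed": True},
]

by_task = pass_rate_by(records, "task")
by_depth = pass_rate_by(records, "depth")
```

The same function covers all three groupings, since only the `key` argument changes.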

Quick Start

Configure Environment

## Create a conda env
conda create -n llicl python=3.9
conda activate llicl
pip install pandas matplotlib scikit-learn retrying

## vLLM
pip install vllm==0.5.0

## (Optional) HF datasets
pip install datasets==2.18.0
pip install -U huggingface_hub

## (Optional) OpenAI API
pip install openai==1.25.1

Data Preparation (Optional)

Our repository already includes preprocessed data files in data.

If you would like to run the preprocessing yourself:

cd preprocessing/tasks
bash run.sh

Set Up a Model in vLLM (Optional)

We mainly use vLLM as the inference framework. Check model/vllm_config.py for the models that we've already integrated.

If you would like to set up a new model

We use the following code to set up the Mistral-7B (32k) model. To set up a new model, add a similar branch with its configuration.

if model_name == "mistral-7b": # model name
    model_config = {
        "model": "mistralai/Mistral-7B-Instruct-v0.2", # huggingface identifier
        "gpu_memory_utilization": 0.7, # other configurations
    }
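
As a rough sketch of how a new entry might sit next to the existing one, the snippet below wraps the same branch in a helper function. The helper name `get_model_config`, the model name `my-new-model`, and the identifier `your-org/your-model` are placeholders, and passing the dictionary straight to vLLM is an assumption about how the repository consumes it. `max_model_len` is a vLLM engine argument that caps the context length.

```python
def get_model_config(model_name):
    """Return keyword arguments for the inference engine (hypothetical helper)."""
    if model_name == "mistral-7b":
        return {
            "model": "mistralai/Mistral-7B-Instruct-v0.2",  # huggingface identifier
            "gpu_memory_utilization": 0.7,
        }
    if model_name == "my-new-model":  # placeholder name, not a real entry
        return {
            "model": "your-org/your-model",  # placeholder huggingface identifier
            "gpu_memory_utilization": 0.7,
            "max_model_len": 32768,  # cap on the context length
        }
    raise ValueError(f"Unknown model name: {model_name}")


# With vLLM installed, the dictionary could be unpacked into the engine:
#   from vllm import LLM
#   llm = LLM(**get_model_config("mistral-7b"))
```

Raising an error for unknown names makes a typo in `MODEL_NAME` fail early instead of silently loading a default.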

Run Task Haystack

Set the model configuration in scripts/evaluate/run_baseline.sh and scripts/evaluate/run_recall.sh:

MODEL_NAME="mistral-7b" # your model name

Then start the Single-task ICL (baseline) and Lifelong ICL (recall) experiments:

bash run_baseline.sh
bash run_recall.sh

Visualize Results

Visualize results of Task Haystack by using playground/analysis_niath.ipynb (example)
Generate detailed diagnostic reports by using playground/analysis_diagnose.ipynb (example)
Modify the path and model name accordingly:

# set your baseline and recall results directory
home_dir = "" # your project path
model = "" # your model name

Run Controlled Experiments (Optional)

To run the controlled experiments, configure your model in model/vllm_configs.py and use the following scripts in scripts/controlled_experiments:

run_repeat.sh: Repeat setting - repeat the in-context learning demonstrations of one task multiple times
run_paraphrase.sh: Paraphrase setting - employ paraphrased instructions when testing
run_irrelevant.sh: Random setting - prepend irrelevant text to in-context learning demonstrations
run_replay.sh: Replay setting - replay in-context learning demonstrations before testing
run_remove.sh: Remove setting - exclude the test task from task stream
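
The settings above can be pictured as different ways of arranging the context that precedes the test question. The following is a conceptual sketch only: the function `build_stream`, its parameters, and the filler string are hypothetical, and the scripts in this repository may construct prompts differently.

```python
def build_stream(tasks, test_task, setting="recall", n_repeat=3,
                 filler="(irrelevant text)"):
    """Return the ordered context segments shown before the test question.

    Each element stands for the ICL demonstrations of one task,
    or for a block of filler text.
    """
    if setting == "recall":      # standard Lifelong ICL: the full task stream
        return list(tasks)
    if setting == "repeat":      # one task's demonstrations, repeated
        return [test_task] * n_repeat
    if setting == "irrelevant":  # filler text placed before the demonstrations
        return [filler, test_task]
    if setting == "replay":      # demonstrations shown again right before testing
        return list(tasks) + [test_task]
    if setting == "remove":      # test task excluded from the stream
        return [task for task in tasks if task != test_task]
    raise ValueError(f"Unknown setting: {setting}")


tasks = ["task_1", "task_2", "task_3"]
replay_stream = build_stream(tasks, "task_2", setting="replay")
remove_stream = build_stream(tasks, "task_2", setting="remove")
```

The Paraphrase setting is not modeled here because it changes the wording of the test instruction and leaves the order of the stream as it is.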

