
Stress-Testing Long-Context Language Models
with Lifelong ICL and Task Haystack

📃 [Paper] • 🏠 [Website] • 🚀 [Quick Start]


  • 🌱 Lifelong ICL is a new problem setting that challenges long-context LMs to learn a sequence of language tasks through in-context learning.
  • 🌾 Task Haystack is an evaluation suite for assessing and diagnosing how long-context LMs utilize contexts in Lifelong ICL.
    • Pass Rate measures how often the performance of Lifelong ICL is not significantly worse than that of Single-task ICL.
    • 🔎 To pass the test, the model needs to locate and make use of the relevant ICL demonstrations (the "needle") in the Lifelong ICL prompt (the "task haystack").
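A minimal sketch of how such a Pass Rate could be computed, assuming per-task lists of accuracies from repeated Single-task ICL (baseline) and Lifelong ICL (recall) runs. The one-sided t-test here is an illustrative choice, not necessarily the exact significance test used in the paper.

# Illustrative Pass Rate computation (not the repository's implementation).
# baseline_scores / recall_scores: dict mapping task name -> list of per-run accuracies.
from scipy import stats

def pass_rate(baseline_scores, recall_scores, alpha=0.05):
    passes = 0
    for task, baseline in baseline_scores.items():
        recall = recall_scores[task]
        # One-sided test: is the Lifelong ICL score significantly *lower*
        # than the Single-task ICL score for this task?
        _, p_value = stats.ttest_ind(recall, baseline, alternative="less")
        if p_value >= alpha:  # not significantly worse -> counts as a pass
            passes += 1
    return passes / len(baseline_scores)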

Result Summary

We benchmark 11 open-weight models and 3 closed models using Lifelong ICL. Here is a summary of our results:


  • State-of-the-art closed models such as GPT-4o still struggle in this setting, failing in 15% of the cases on average. The open models we evaluate lag further behind by a large margin.
  • Llama3.1-70B is the best open-weight model we tested.
  • Check out the full result table.

Diagnosing Models with Task Haystack

Task Haystack inherits the controllability aspect of the original needle-in-a-haystack test, making it easy to create clear visualizations for diagnosing model vulnerabilities. Here we provide an example using Mistral-7B (32k):


  • While long-context LMs excel at retrieving and pasting information within the context (Left), their ability to utilize the context with deep understanding remains limited (Middle, Right).

Further Analysis

With Task Haystack, we can further:

  • Group the results by task, by permutation, or by depth in the context.
  • Investigate context utilization with controlled experiments.




Quick Start

Configure Environment

## Create a conda env
conda create -n llicl python=3.9
conda activate llicl
pip install pandas matplotlib scikit-learn retrying

## vLLM
pip install vllm==0.5.0

## (Optional) HF datasets
pip install datasets==2.18.0
pip install -U huggingface_hub

## (Optional) OpenAI API
pip install openai==1.25.1
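
Optionally, you can sanity-check the installation. The snippet below is just an illustrative check, not part of the repository; skip the optional packages if you did not install them.

# Confirm that the pinned packages import correctly.
import vllm, datasets, openai
print(vllm.__version__, datasets.__version__, openai.__version__)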

Data Preparation (Optional)

Our repository already includes preprocessed data files in data.

If you would like to run the preprocessing yourself
cd preprocessing/tasks
bash run.sh

Setup a Model in vLLM (Optional)

We mainly use vLLM as the inference framework. Check model/vllm_configs.py for the models that we've already integrated.

If you would like to set up a new model

We use the following code to set up the Mistral-7B (32k) model. To add a new model, add its configuration in the same way.

if model_name == "mistral-7b": # model name
    model_config = {
        "model": "mistralai/Mistral-7B-Instruct-v0.2", # huggingface identifier
        "gpu_memory_utilization": 0.7, # other configurations
    }
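
For reference, a config entry like the one above corresponds directly to keyword arguments of vLLM's LLM class. The snippet below is a minimal illustration of loading and querying the model outside of the provided scripts; the prompt and sampling parameters are arbitrary examples, not the repository's defaults.

# Minimal vLLM usage sketch: the config keys are passed through as LLM kwargs.
from vllm import LLM, SamplingParams

model_config = {
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "gpu_memory_utilization": 0.7,
}
llm = LLM(**model_config)
outputs = llm.generate(
    ["Review: A delightful film. Sentiment:"],
    SamplingParams(temperature=0.0, max_tokens=8),
)
print(outputs[0].outputs[0].text)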

Run Task Haystack

Set the model name in scripts/evaluate/run_baseline.sh and scripts/evaluate/run_recall.sh:

MODEL_NAME="mistral-7b" # your model name

Then start the Single-task ICL (baseline) and Lifelong ICL (recall) experiments:

bash run_baseline.sh
bash run_recall.sh

Visualize Results

  • Visualize results of Task Haystack by using playground/analysis_niath.ipynb (example)
  • Generate detailed diagnostic reports by using playground/analysis_diagnose.ipynb (example)
  • Modify the path and model name accordingly
# set your baseline and recall results directory
home_dir = "" # your project path
model = "" # your model name

Run Controlled Experiments (Optional)

To run the controlled experiments, configure your model in model/vllm_configs.py and use the following scripts in scripts/controlled_experiments (a conceptual sketch of the settings follows the list):

  • run_repeat.sh: Repeat setting - repeat the in-context learning demonstrations of one task multiple times
  • run_paraphrase.sh: Paraphrase setting - employ paraphrased instructions when testing
  • run_irrelevant.sh: Random setting - prepend irrelevant text to in-context learning demonstrations
  • run_replay.sh: Replay setting - replay in-context learning demonstrations before testing
  • run_remove.sh: Remove setting - exclude the test task from the task stream
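
The sketch below illustrates what each setting varies when constructing the prompt, assuming each task's demonstrations are formatted as a single string. All names here are hypothetical; the actual implementations live in the scripts above.

# Conceptual illustration of the five controlled settings (not the repo's code).
def build_prompt(task_demos, test_task, setting,
                 irrelevant_text="", paraphrased_instruction="", n_repeats=4):
    # task_demos: dict mapping task name -> formatted ICL demonstrations (str)
    if setting == "repeat":          # repeat one task's demonstrations multiple times
        blocks = [task_demos[test_task]] * n_repeats
    elif setting == "irrelevant":    # prepend irrelevant text to the test task's demonstrations
        blocks = [irrelevant_text, task_demos[test_task]]
    elif setting == "replay":        # replay the test task's demonstrations before testing
        blocks = list(task_demos.values()) + [task_demos[test_task]]
    elif setting == "remove":        # exclude the test task from the task stream
        blocks = [demo for name, demo in task_demos.items() if name != test_task]
    elif setting == "paraphrase":    # append a paraphrased test-time instruction
        blocks = list(task_demos.values()) + [paraphrased_instruction]
    else:                            # default Lifelong ICL (recall): the full task stream
        blocks = list(task_demos.values())
    return "\n\n".join(blocks)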
