📃 [Paper] • 🏠 [Website] • 🚀 [Quick Start]
- 🌱 Lifelong ICL is a new problem setting that challenges long-context LMs to learn a sequence of language tasks through in-context learning.
- 🌾 Task Haystack is an evaluation suite for assessing and diagnosing how long-context LMs utilize contexts in Lifelong ICL.
- ✅ Pass Rate measures how often the performance of Lifelong ICL is not significantly worse than that of Single-task ICL.
- 🔎 To pass the test, the model needs to locate and make use of the relevant ICL demonstrations (the "needle") in the Lifelong ICL prompt (the "task haystack").
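As a rough illustration of the Pass Rate idea above, the sketch below counts a case as passed when the Lifelong ICL scores are not significantly worse than the Single-task ICL scores. This README does not say which significance test is used, so the exact one-sided permutation test, the `alpha=0.05` threshold, and the function names `permutation_p_value` and `pass_rate` are all assumptions made for illustration, not the repository's implementation.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(lifelong, single):
    """Exact one-sided permutation test (an assumed stand-in test).

    Returns the fraction of relabelings whose mean gap (single - lifelong)
    is at least as large as the observed gap.
    """
    pooled = list(lifelong) + list(single)
    n = len(lifelong)
    m = len(pooled) - n
    total = sum(pooled)
    observed = mean(single) - mean(lifelong)
    hits = 0
    n_perms = 0
    for idx in combinations(range(len(pooled)), n):
        group_sum = sum(pooled[i] for i in idx)
        gap = (total - group_sum) / m - group_sum / n
        if gap >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
        n_perms += 1
    return hits / n_perms


def pass_rate(cases, alpha=0.05):
    """cases: list of (lifelong_scores, single_task_scores) pairs."""
    passed = sum(permutation_p_value(l, s) >= alpha for l, s in cases)
    return passed / len(cases)


# Made-up scores over five runs, purely for demonstration.
cases = [
    ([0.50, 0.52, 0.51, 0.49, 0.50], [0.80, 0.82, 0.81, 0.79, 0.80]),  # clear drop
    ([0.80, 0.79, 0.81, 0.80, 0.82], [0.80, 0.82, 0.81, 0.79, 0.80]),  # no drop
]
rate = pass_rate(cases)
```

With these made-up scores, the first case fails and the second passes, so half of the cases pass.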
We benchmark 11 open-weight models and 3 closed models using Lifelong ICL. Here is a summary of our results:
- State-of-the-art closed models such as GPT-4o still struggle in this setting, failing 15% of the cases on average. The open models we evaluate lag further behind by a large margin.
- Llama3.1-70B is the best open-weight model we tested.
- Check out the full result table.
Task Haystack inherits the controllability aspect of the original needle-in-a-haystack test, making it easy to create clear visualizations for diagnosing model vulnerabilities. Here we provide an example of Mistral-7B (32k):
- While long-context LMs excel at retrieving and pasting information within the context (Left), their ability to utilize the context with deep understanding remains limited (Middle, Right).
With Task Haystack, we can further ...
## Create a conda env
conda create -n llicl python=3.9
conda activate llicl
pip install pandas matplotlib scikit-learn retrying
## vLLM
pip install vllm==0.5.0
## (Optional) HF datasets
pip install datasets==2.18.0
pip install -U huggingface_hub
## (Optional) OpenAI API
pip install openai==1.25.1
Our repository already includes preprocessed data files in data.
If you would like to run the preprocessing yourself:
cd preprocessing/tasks
bash run.sh
We mainly use vLLM as the inference framework. Check model/vllm_config.py for the models that we've already integrated.
If you would like to set up a new model
We use the following code to set up the Mistral-7B (32k) model. To set up a new model, add its configuration in the same way.
if model_name == "mistral-7b":  # model name
    model_config = {
        "model": "mistralai/Mistral-7B-Instruct-v0.2",  # huggingface identifier
        "gpu_memory_utilization": 0.7,  # other configurations
    }
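If several models need to be registered, the same information can be kept in a lookup table instead of a chain of `if` statements. The sketch below is one possible arrangement, not the repository's code: `MODEL_CONFIGS`, `get_model_config`, and the `my-new-model` entry with its `your-org/your-model` identifier are hypothetical placeholders, and only the `mistral-7b` entry comes from this README.

```python
# Hypothetical registry; only the "mistral-7b" entry is taken from the README.
MODEL_CONFIGS = {
    "mistral-7b": {
        "model": "mistralai/Mistral-7B-Instruct-v0.2",  # huggingface identifier
        "gpu_memory_utilization": 0.7,
    },
    # Placeholder entry: replace the name and identifier with your own model.
    "my-new-model": {
        "model": "your-org/your-model",
        "gpu_memory_utilization": 0.7,
    },
}


def get_model_config(model_name):
    """Return a copy of the config for model_name, or raise a clear error."""
    try:
        return dict(MODEL_CONFIGS[model_name])
    except KeyError:
        known = ", ".join(sorted(MODEL_CONFIGS))
        raise ValueError(
            f"Unknown model {model_name!r}; known models: {known}"
        ) from None


config = get_model_config("mistral-7b")
# With vLLM installed, these keys can be passed as keyword arguments:
#   from vllm import LLM
#   llm = LLM(**config)
```

Returning a copy keeps later per-run tweaks from silently changing the shared registry.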
Set the model configuration in scripts/evaluate/run_baseline.sh and scripts/evaluate/run_recall.sh:
MODEL_NAME="mistral-7b" # your model name
Then start the Single-task ICL (baseline) and Lifelong ICL (recall) experiments:
bash run_baseline.sh
bash run_recall.sh
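To make the difference between the two runs concrete, the sketch below builds a Single-task ICL prompt (demonstrations of one task only) and a Lifelong ICL prompt (demonstrations of a whole task stream, followed by a test on one of the tasks). The `Input:`/`Output:` template, the task dictionaries, and the function names are hypothetical and chosen only for illustration; the scripts in this repository may format prompts differently.

```python
def format_demos(instruction, examples):
    """Render one task's instruction and its demonstrations as text."""
    lines = [instruction]
    lines += [f"Input: {x}\nOutput: {y}" for x, y in examples]
    return "\n".join(lines)


def single_task_prompt(task, test_input):
    """Baseline: the context holds demonstrations of the test task only."""
    demos = format_demos(task["instruction"], task["examples"])
    return f"{demos}\nInput: {test_input}\nOutput:"


def lifelong_prompt(task_stream, test_task_name, test_input):
    """Recall: the context holds every task in the stream (the haystack);
    the test task's demonstrations are the needle the model must use."""
    context = "\n\n".join(
        format_demos(t["instruction"], t["examples"]) for t in task_stream
    )
    test_task = next(t for t in task_stream if t["name"] == test_task_name)
    return f"{context}\n\n{test_task['instruction']}\nInput: {test_input}\nOutput:"


# Toy task stream with made-up tasks and examples.
stream = [
    {"name": "sentiment", "instruction": "Classify the sentiment.",
     "examples": [("great movie", "positive"), ("dull plot", "negative")]},
    {"name": "topic", "instruction": "Classify the topic.",
     "examples": [("the team won the final", "sports")]},
]
baseline = single_task_prompt(stream[0], "a lovely film")
recall = lifelong_prompt(stream, "sentiment", "a lovely film")
```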
- Visualize results of Task Haystack by using playground/analysis_niath.ipynb (example)
- Generate detailed diagnostic reports by using playground/analysis_diagnose.ipynb (example)
- Modify the path and model name accordingly
# set your baseline and recall results directory
home_dir = "" # your project path
model = "" # your model name
To run the controlled experiments, configure your model in model/vllm_configs.py and use the following scripts in scripts/controlled_experiments:
- run_repeat.sh: Repeat setting - repeat in-context learning demonstrations of one task multiple times
- run_paraphrase.sh: Paraphrase setting - employ paraphrased instructions when testing
- run_irrelevant.sh: Random setting - prepend irrelevant text to in-context learning demonstrations
- run_replay.sh: Replay setting - replay in-context learning demonstrations before testing
- run_remove.sh: Remove setting - exclude the test task from task stream
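The settings listed above differ mainly in what is placed in the context before the test question. The sketch below shows one way to picture that, using short strings in place of real demonstrations. It is an illustration under assumptions, not the logic of the scripts themselves: `build_context`, its parameters, and the filler text are hypothetical.

```python
def build_context(tasks, test_task, setting, n_repeat=3, filler="(irrelevant text)"):
    """Return the ordered list of context blocks for one controlled setting.

    tasks: dict mapping task name -> that task's demonstration text,
           in task-stream order.
    """
    demo = tasks[test_task]
    if setting == "repeat":
        # Demonstrations of a single task, repeated several times.
        return [demo] * n_repeat
    if setting == "irrelevant":
        # Irrelevant text placed before the demonstrations.
        return [filler, demo]
    if setting == "replay":
        # Full task stream, then the test task's demonstrations once more.
        return list(tasks.values()) + [demo]
    if setting == "remove":
        # Full task stream without the test task.
        return [d for name, d in tasks.items() if name != test_task]
    if setting == "paraphrase":
        # Context unchanged; only the test-time instruction is paraphrased.
        return list(tasks.values())
    raise ValueError(f"Unknown setting: {setting!r}")


# Toy stream with placeholder demonstration text.
toy_tasks = {"A": "demos for A", "B": "demos for B", "C": "demos for C"}
replay_context = build_context(toy_tasks, "B", "replay")
remove_context = build_context(toy_tasks, "B", "remove")
```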