neulab/data-agora

🏛️ Agora 🏛️


⚑ A repository for generating synthetic data with LLMs & evaluating LLMs' data generation capabilities πŸš€ ⚑

Latest News 🔥

  • [2024/12] We release Agora and AgoraBench!
    • AgoraBench covers 9 settings, measuring data generation capabilities across 3 domains and 3 data generation methods.
    • Agora is an easily customizable framework for data generation with LLMs.
    • Check out our dataset, checkpoints, leaderboard, and code!

What does Agora mean?

In ancient Athens, the Agora was a public space where citizens would gather to debate, share news, learn from each other, and listen to famous philosophers.

We draw an analogy between data generators and teachers: in AgoraBench, different generators teach student models using synthetic data!

🔧 Installation

Installation with pip:

pip install data-agora

Project Structure 📁

Root Directory

.
├── agora_scripts/          # Scripts for converting and handling data formats
│   ├── prompts/            # Various prompt templates
│   └── run.py              # Main execution script
├── assets/                 # Project images and visual assets
├── libs/                   # Core libraries
│   └── data-agora/         # Main data processing library
│       ├── data_agora/     # Core data agora implementation
│       │   ├── core/       # Core functionality (LLMs, parsers, validators)
├── train/                  # Training related code (based on llama-recipes)
└── LICENSE

data-agora Library (libs/data-agora/)

  • Core implementation for data processing and handling
  • Includes LLM integrations (OpenAI, vLLM, etc.)
  • Parsers and validators for data processing
  • Serving capabilities for deployment

Agora Scripts (agora_scripts/)

  • Tools for data format conversion
  • Collection of prompt templates for different use cases
  • Main execution script for running the pipeline

Training (train/)

  • Based on Meta's llama-recipes repository
  • Contains training configurations and utilities

Usage Guide 🚀

Our library supports two use cases:

  1. Testing an LM's Data Generation Capability with AgoraBench: Using the pre-built pipeline, you can easily measure the data generation capabilities of different LLMs.
  2. Custom Usage: You can customize the pipeline for your own tasks to generate large amounts of synthetic data.

Testing an LM's Data Generation Capability with AgoraBench

Step 1: Generate Data with Pre-built Pipeline

You could simply run the following script:

cd "./agora_scripts"

python3 run.py --method "instance_generation" --domain "math" --model_name "gpt-4o-mini-2024-07-18" --max_tokens 4096 --temperature 1.0 --num_instances 10000 --num_threads 4 --api_key ""
  • method should be one of "instance_generation", "response_generation", or "quality_enhancement".

  • domain should be one of "math", "general", or "code".

  • model_name should exactly match the name you use to call the model through the OpenAI API, LiteLLM, or vLLM.

  • The resulting dataset should look as follows:

[
   {
      "config": "",
      "instruction": "",
      "response": ""
   },
   [...]
]
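Before uploading, it can help to sanity-check the generated file. The sketch below assumes the output is a JSON list of objects with the three keys shown above; the helper name `check_records` and the sample records are hypothetical and not part of Agora.

```python
import json

REQUIRED_KEYS = {"config", "instruction", "response"}


def check_records(records):
    """Return indices of records that lack a required key or have an empty instruction/response."""
    bad = []
    for i, record in enumerate(records):
        missing = REQUIRED_KEYS - record.keys()
        empty = (
            not str(record.get("instruction", "")).strip()
            or not str(record.get("response", "")).strip()
        )
        if missing or empty:
            bad.append(i)
    return bad


# Hypothetical sample mirroring the format above (not real Agora output).
sample = json.loads("""
[
  {"config": "math", "instruction": "What is 2 + 3?", "response": "5"},
  {"config": "math", "instruction": "", "response": "42"},
  {"instruction": "Missing config", "response": "x"}
]
""")

print(check_records(sample))  # [1, 2]
```

In practice you would replace `sample` with the contents of the file written by run.py, loaded with `json.load`.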

Step 2: Upload the Dataset to Hugging Face

You could use the following function:

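from datasets import Dataset, DatasetDict

def upload_to_huggingface(data, dataset_name, hf_key):
    # data: list of dicts with "config", "instruction", and "response" keys
    dataset = Dataset.from_list(data)
    dataset_dict = DatasetDict({"train": dataset})
    # Push as a private dataset; hf_key is a Hugging Face access token with write permission
    dataset_dict.push_to_hub(dataset_name, token=hf_key, private=True)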

Step 3: Train Student Models with Synthetic Data

The following code is adapted from Meta's llama-recipes.

First, install the required packages:

cd ./llama-recipes
pip3 install -r requirements.txt
pip3 install -e .
pip3 install wandb
wandb login
huggingface-cli login

Then, launch the following code.

gpu=4
lr=1e-5
num_epochs=""        # number of training epochs
checkpoint_dir=""
hf_cache_dir=""
hf_dataset_name=""

torchrun --nnodes 1 --nproc_per_node $gpu \
        src/llama_recipes/finetuning.py \
        --model_name meta-llama/Meta-Llama-3.1-8B \
        --dist_checkpoint_root_folder "${checkpoint_dir}" \
        --dist_checkpoint_folder "${hf_dataset_name}" \
        --hf_cache_dir "${hf_cache_dir}" \
        --dataset "$hf_dataset_name" \
        --run_validation True \
        --context_length 4096 \
        --gradient_accumulation_steps 8 \
        --batching_strategy "packing" \
        --use_fast_kernels \
        --enable_fsdp \
        --pure_bf16 \
        --low_cpu_fsdp \
        --batch_size_training 2 \
        --num_epochs $num_epochs \
        --lr $lr \
        --weight_decay 0.01 \
        --use_wandb
  • You have to fill in:

    • checkpoint_dir (where the checkpoint is saved)
    • hf_cache_dir (where the Hugging Face cache is saved)
    • hf_dataset_name (the dataset you uploaded to Hugging Face in Step 2)
    • num_epochs (the number of training epochs, passed to --num_epochs)
  • To upload the checkpoint to Hugging Face, you can refer to this code.
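The flags in the command above jointly determine how much data goes into each optimizer step. The arithmetic below is a rough sketch under two assumptions that the README does not state explicitly: training is data-parallel (each GPU processes its own micro-batch), and the "packing" strategy fills every sequence to the full context length.

```python
# Values taken from the torchrun command above.
num_gpus = 4                      # --nproc_per_node
batch_size_per_gpu = 2            # --batch_size_training
gradient_accumulation_steps = 8   # --gradient_accumulation_steps
context_length = 4096             # --context_length

# Assumption: data-parallel training, so every GPU handles its own micro-batch.
sequences_per_step = num_gpus * batch_size_per_gpu * gradient_accumulation_steps

# Assumption: with packing, every sequence is filled up to context_length tokens.
tokens_per_step = sequences_per_step * context_length

print(sequences_per_step)  # 64
print(tokens_per_step)     # 262144
```

If you change the number of GPUs, adjusting the accumulation steps so that the product stays the same keeps the effective batch size comparable across runs.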

Step 4: Evaluate Student Models and Measure Performance Gap Recovered (PGR)

For evaluating the trained student models, we used the following libraries:

  • AlpacaEval 2.0 (Instruction-following): link
  • Arena-Hard (Instruction-following): link
  • MBPP (Code): link
  • Human-Eval (Code): link

For GSM8K (Math) and MATH (Math), we implemented our custom code: TO BE ADDED
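This README names PGR without defining it. The sketch below assumes PGR is the percentage of the gap between a base model and a reference model that the student trained on synthetic data closes; please check the paper for the exact definition and the models used. The function name, argument names, and scores are illustrative only and are not taken from the Agora code base.

```python
def performance_gap_recovered(student_score, base_score, reference_score):
    """Percentage of the base-to-reference gap closed by the trained student.

    Assumed definition: (student - base) / (reference - base) * 100.
    """
    gap = reference_score - base_score
    if gap == 0:
        raise ValueError("reference_score and base_score must differ")
    return (student_score - base_score) / gap * 100.0


# Made-up benchmark scores, for illustration only.
print(performance_gap_recovered(student_score=60.0, base_score=50.0, reference_score=70.0))  # 50.0
```

Under this definition a negative value means that training on the synthetic data left the student below the base model.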

Custom Usage

For custom usage with different pipelines, parsing mechanisms, and validation logic, Agora supports convenient customization through abstract classes.

Important Keywords: First define the following dictionary:

placeholder_formats = {
    "demonstration_input_placeholder": "<input@>",
    "demonstration_output_placeholder": "<output@>",
    "test_input_placeholder": "<input>",
    "test_output_placeholder": "<output>",
    "test_input_trigger": "INPUT:",
    "test_output_trigger": "OUTPUT:",
    "stop_phrase": "[END]",
    "input_theme": "<input_theme>",
}

These will be used in the following classes.

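  • demonstration_input_placeholder and demonstration_output_placeholder mark where the in-context demonstrations are inserted in the prompt template.
  • test_input_placeholder and test_output_placeholder mark the positions of the test input and test output in the prompt template.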

Prompt Loader: A class that prepares the meta-prompt passed to the data generator.

from typing import Dict, List, Optional

class CustomPromptLoader(InstanceGenerationPromptLoader):
    def __init__(self, prompt_template: str, seed_data: List[Dict], num_fewshot: int, placeholder_formats: Optional[Dict[str, str]] = None, num_sample_from_seed_data: Optional[int] = None, [...]):
        super().__init__(prompt_template, seed_data, num_fewshot, placeholder_formats, num_sample_from_seed_data)
        [...]

    def prepare(self) -> PromptResult:
        # Build the meta-prompt and any metadata to store alongside it
        [...]
        return PromptResult(prompt=prompt, metadata=metadata)

Parser: A class that separates the instruction and response from the data generator's output.

class InstanceGenerationParser(Parser):
    """Parser for instance generation scenario"""

    def parse(self, prompt, teacher_model_output, placeholder_formats: Dict[str, str]) -> Dict[str, str]:

        instruction = (
            teacher_model_output.split(placeholder_formats["test_input_trigger"])[-1]
            .split(placeholder_formats["test_output_trigger"])[0]
            .strip()
        )
        response = (
            teacher_model_output.split(placeholder_formats["test_output_trigger"])[-1]
            .split(placeholder_formats["stop_phrase"])[0]
            .strip()
        )

        return {"instruction": instruction, "response": response}
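To see what this split-based logic does, the standalone sketch below reuses the same string operations in a plain function (named `parse` here for illustration, outside any Agora class) and runs it on a hand-written sample output.

```python
placeholder_formats = {
    "test_input_trigger": "INPUT:",
    "test_output_trigger": "OUTPUT:",
    "stop_phrase": "[END]",
}


def parse(teacher_model_output, placeholder_formats):
    # Text between the input trigger and the output trigger is the instruction.
    instruction = (
        teacher_model_output.split(placeholder_formats["test_input_trigger"])[-1]
        .split(placeholder_formats["test_output_trigger"])[0]
        .strip()
    )
    # Text between the output trigger and the stop phrase is the response.
    response = (
        teacher_model_output.split(placeholder_formats["test_output_trigger"])[-1]
        .split(placeholder_formats["stop_phrase"])[0]
        .strip()
    )
    return {"instruction": instruction, "response": response}


sample_output = "INPUT:\nWhat is 2 + 3?\nOUTPUT:\n2 + 3 = 5\n[END]"
print(parse(sample_output, placeholder_formats))
# {'instruction': 'What is 2 + 3?', 'response': '2 + 3 = 5'}

# If the triggers are missing, split() returns the text unchanged,
# so both fields end up holding the entire output.
malformed = "Sorry, I cannot help with that."
print(parse(malformed, placeholder_formats))
```

The second call shows why a validation step matters: the parser does not fail on malformed output, it simply returns the whole text as both instruction and response.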

Validator: A class that determines if the output is valid or not.

class CustomValidator(Validator):
    def validate(self, instruction: str, response: str, [...]):
        [...]
        # Return True to keep the instance, False to discard it
        if [...]:
            return True
        else:
            return False
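As one possible way to fill in the elided condition, the sketch below rejects empty fields, leaked trigger or stop strings, and instances whose instruction and response are identical (which can happen when a split-based parser receives output without triggers). The class name `NonEmptyValidator` and its parameters are hypothetical; it is written as a standalone class so it runs without Agora installed, and in real use it would subclass Validator.

```python
class NonEmptyValidator:
    """Stand-in exposing the same validate(instruction, response) shape as above."""

    def __init__(self, min_chars=1, forbidden=("INPUT:", "OUTPUT:", "[END]")):
        self.min_chars = min_chars
        self.forbidden = forbidden

    def validate(self, instruction: str, response: str) -> bool:
        for text in (instruction, response):
            # Reject fields that are empty or too short after trimming whitespace.
            if len(text.strip()) < self.min_chars:
                return False
            # Reject fields that still contain trigger or stop strings.
            if any(marker in text for marker in self.forbidden):
                return False
        # Reject instances where parsing returned the same text for both fields.
        return instruction.strip() != response.strip()


validator = NonEmptyValidator()
print(validator.validate("What is 2 + 3?", "5"))      # True
print(validator.validate("", "5"))                    # False
print(validator.validate("Q", "A OUTPUT: leftover"))  # False
```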

All together

Then, you can write a script that uses the custom classes to generate data.

import json

# Also import Agora, OpenAILLM, and your custom classes here.

# MODIFY THE PLACEHOLDER FORMATS BASED ON YOUR PROMPT TEMPLATE
# Demonstration-related placeholders are only used for instance generation
# The input_theme placeholder is an example of a custom placeholder

placeholder_formats = {
    "demonstration_input_placeholder": "<input@>",
    "demonstration_output_placeholder": "<output@>",
    "test_input_placeholder": "<input>",
    "test_output_placeholder": "<output>",
    "test_input_trigger": "INPUT:",
    "test_output_trigger": "OUTPUT:",
    "stop_phrase": "[END]",
    "input_theme": "<input_theme>",
}


with open("", "r") as f:
    seed_data = json.load(f)

with open("", "r") as f:
    prompt_template = f.read()

llm = OpenAILLM(model_name="gpt-4o-mini-2024-07-18", api_key="")

prompt_loader = CustomPromptLoader(prompt_template=prompt_template, seed_data=seed_data, num_fewshot=3, placeholder_formats=placeholder_formats, num_sample_from_seed_data=2)
parser = CustomParser()
validator = CustomValidator()


sampling_params = {
    "max_tokens": 4096,    # same values as in the run.py command from Step 1
    "temperature": 1.0,
    "top_p": 0.9,
    "stop": placeholder_formats["stop_phrase"]
}

agora = Agora(
    llm=llm,
    placeholder_formats=placeholder_formats,
    prompt_loader=prompt_loader,
    parser=parser,
    validator=validator,
    sampling_params=sampling_params
)

# Use cache_file to resume from previous results. The Agora class automatically creates a cache file (for example, "final_result.jsonl").
result = agora.run(num_instances=10000, num_threads=16, output_file="./results/final_result.json")
print(result[0])
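Conceptually, the script above wires a generator, a parser, and a validator into a generate, parse, validate loop. The sketch below illustrates that flow with a stub in place of a real LLM. It is not Agora's implementation, which may handle threading, caching, and retries differently, and every name in it (`make_fake_llm`, `parse`, `run_pipeline`) is hypothetical.

```python
def make_fake_llm():
    """Return a stub generator whose every third call yields malformed output."""
    calls = 0

    def generate(prompt):
        nonlocal calls
        calls += 1
        if calls % 3 == 0:
            return "no triggers here"
        return f"INPUT:\nWhat is {calls} + {calls}?\nOUTPUT:\n{2 * calls}\n[END]"

    return generate


def parse(output):
    # Return None when the triggers are missing instead of guessing.
    if "INPUT:" not in output or "OUTPUT:" not in output:
        return None
    instruction = output.split("INPUT:")[-1].split("OUTPUT:")[0].strip()
    response = output.split("OUTPUT:")[-1].split("[END]")[0].strip()
    return {"instruction": instruction, "response": response}


def run_pipeline(generate, prompt, num_instances, max_attempts=100):
    """Collect valid instances until num_instances is reached or attempts run out."""
    results = []
    attempts = 0
    while len(results) < num_instances and attempts < max_attempts:
        attempts += 1
        parsed = parse(generate(prompt))
        # Minimal validation: both fields must be non-empty.
        if parsed and parsed["instruction"] and parsed["response"]:
            results.append(parsed)
    return results


results = run_pipeline(make_fake_llm(), prompt="(meta-prompt)", num_instances=4)
print(len(results))  # 4
print(results[0])    # {'instruction': 'What is 1 + 1?', 'response': '2'}
```

The stub's third output is discarded, so the loop needs five generation calls to collect four valid instances.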

Citation

If you find our work useful, please consider citing our paper!

@misc{kim2024evaluating,
      title={Evaluating Language Models as Synthetic Data Generators}, 
      author={Seungone Kim and Juyoung Suk and Xiang Yue and Vijay Viswanathan and Seongyun Lee and Yizhong Wang and Kiril Gashteovski and Carolin Lawrence and Sean Welleck and Graham Neubig},
      year={2024},
      eprint={2412.03679},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.03679}, 
}