This repo contains the source code for our EMNLP'23 paper Generating Data for Symbolic Language with Large Language Models.
It is another successor to our previous works, i.e., ZeroGen, ProGen and SunGen, which employ LLMs as data generators. Some code is adapted from CEIL.
TL;DR:
- Annotating symbolic languages (e.g., SQL, Bash, Python, TOP, QDMR, etc.) manually is expensive and time-consuming.
- We use LLMs to generate data for symbolic languages, with an informative prompt to steer generation and an agreement-based verifier to improve data correctness.
- We show that the generated data can be used to train a much smaller task model that performs well compared with the data generator.
git clone --recurse-submodules git@github.com:HKUNLP/SymGen.git
conda create -n symgen python=3.10
conda activate symgen
pip install -r requirements.txt
Note: This repo was originally built on the Codex model, which is not currently available, so you may need to request access here. The implemented filtering processes are mostly based on log-probabilities (logprob). ChatGPT and GPT-4 do not provide this information, so if you want to use them as data generators, you should modify the filtering process.
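To illustrate what a logprob-based filter relies on, here is a minimal sketch that keeps the candidate with the highest length-normalized log-probability. It is not the repo's implementation: the function names and the `text` / `token_logprobs` fields are hypothetical, and the actual filters in src/filters/ may score candidates differently.

```python
from typing import Dict, List


def mean_logprob(token_logprobs: List[float]) -> float:
    """Length-normalized log-probability of one generation."""
    if not token_logprobs:
        # An empty generation can never be preferred over a real one.
        return float("-inf")
    return sum(token_logprobs) / len(token_logprobs)


def pick_most_confident(candidates: List[Dict]) -> Dict:
    """Return the candidate with the highest mean token log-probability."""
    return max(candidates, key=lambda c: mean_logprob(c["token_logprobs"]))


# Toy candidates; the field names are assumptions made for this sketch.
candidates = [
    {"text": "SELECT name FROM singer", "token_logprobs": [-0.1, -0.2, -0.1, -0.2]},
    {"text": "SELECT * FROM singer", "token_logprobs": [-0.9, -1.1, -0.7, -1.3]},
]
best = pick_most_confident(candidates)
print(best["text"])
```

A generator that returns no token log-probabilities cannot feed a scorer like this, which is why the filtering step needs to be changed for such models.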
The core code is under src. We provide several scripts in the scripts directory:
- run_data_gen-q.sh generates questions for a given task in a zero-shot or few-shot manner.
- run_few_shot_data_gen-a.sh generates answers (i.e., symbolic language) given a few in-context examples.
- run_full_shot_data_gen-a.sh generates answers with in-context examples retrieved from a large pool of available examples.
- run_zero_shot_prompting.sh, run_few_shot_prompting.sh and run_full_shot_prompting.sh are not related to data generation; they prompt the model directly on the evaluation set.
Run the corresponding script for a specific task, e.g.,
bash scripts/run_data_gen-q.sh
Some notes:
- We provide three types of filtering methods for obtaining a proper answer: base, exec (execution) and mv (majority vote). See config/filter_config/gen_a-*.yaml and src/filters/ for details.
- We provide both API models and Hugging Face local models as data generators; see config/model_config/. We use an API model in the paper, and the Hugging Face local models are not thoroughly tested.
- For SQL, we use the Spider dataset, so each instance has an attribute db_path pointing to the location of its database; you may need to change it based on where the Spider database is stored. We use test_suite evaluation, so you should provide a test_database path in configs/task_config/spider.yaml.
- For Prolog, make sure pyswip is installed, since it is used for Prolog execution through Python.
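The idea behind combining execution with majority vote can be sketched as follows: execute every candidate answer, group candidates by their execution result, and keep one from the largest group. This is a minimal sketch under our own assumptions, not the code in src/filters/; `majority_vote` and `run_sql` are hypothetical names, and an in-memory SQLite table stands in for a real database.

```python
import sqlite3
from collections import defaultdict
from typing import Callable, Hashable, List, Optional


def majority_vote(candidates: List[str],
                  execute: Callable[[str], Hashable]) -> Optional[str]:
    """Return a candidate from the largest group of agreeing execution results."""
    groups = defaultdict(list)
    for cand in candidates:
        try:
            result = execute(cand)
        except Exception:
            continue  # drop candidates that fail to execute
        groups[result].append(cand)
    if not groups:
        return None
    # max() returns the first maximal group, so ties go to the earliest result.
    winners = max(groups.values(), key=len)
    return winners[0]


# Toy database standing in for a real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 30), ("Bob", 25)])


def run_sql(query: str) -> tuple:
    """Execute a query and return its rows as a hashable tuple."""
    return tuple(conn.execute(query).fetchall())


candidates = [
    "SELECT count(*) FROM singer",
    "SELECT count(name) FROM singer",
    "SELECT max(age) FROM singer",
    "SELECT count(*) FROM singers",  # fails: no such table
]
chosen = majority_vote(candidates, run_sql)
print(chosen)
```

Two of the four candidates agree on the same result, one disagrees and one fails to execute, so a candidate from the agreeing pair is kept.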
Add +dataset_reader.ds_size=2 to run inference on only 2 data points for debugging.
We keep some data processing scripts for each symbolic language in its own directory under scripts/*. For example, to create the few-shot files for the Spider dataset:
python scripts/spider/split_few_shot.py
This creates several files in the data directory: e.g., 10_shot.json contains 10 examples from the train set as demonstrations, and dev_10_shot.json is dev.json with an additional ctxs field indicating the indices of the few-shot examples in 10_shot.json.
data
└── spider
├── dev.json # full dev file, json lines
├── dev_10_shot.json # dev file with 10-shots, for few-shot evaluation
├── 10_shot.json # few-shot file, contains 10 few-shot examples
├── train.json # original training file
└── 10_gen_q.json # for generating question
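The relation between the few-shot file and the dev file with the ctxs field can be sketched as below. This is only an illustration of the data layout described above, not the contents of scripts/spider/split_few_shot.py: the function name, the random sampling strategy, the seed and the example fields are all assumptions.

```python
import random
from typing import Dict, List, Tuple


def split_few_shot(train: List[Dict], dev: List[Dict], k: int,
                   seed: int = 0) -> Tuple[List[Dict], List[Dict]]:
    """Sample k demonstrations and attach their indices to every dev example."""
    rng = random.Random(seed)
    shots = rng.sample(train, k)   # would become the contents of {k}_shot.json
    ctxs = list(range(k))          # indices into the few-shot file
    dev_with_ctxs = [{**ex, "ctxs": ctxs} for ex in dev]
    return shots, dev_with_ctxs


# Toy data; real Spider examples have more fields (e.g., db_path).
train = [{"question": f"q{i}", "query": f"SELECT {i}"} for i in range(20)]
dev = [{"question": "how many singers?", "query": "SELECT count(*) FROM singer"}]

shots, dev_with_ctxs = split_few_shot(train, dev, k=3)
print(dev_with_ctxs[0]["ctxs"])
```

Storing indices rather than copies of the demonstrations keeps the dev file small and lets every dev example share the same few-shot file.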
We provide the initial data used in the paper in data/, so you can skip scripts/*/*.py.
If you want to generate data for a new task, you have to implement some task files, e.g.,
- src/dataset_readers/dataset_wrappers/spider.py: specifies how to construct the prompt for each example.
- src/filters/spider_filter.py: specifies the filters to apply after gathering generations. Make sure to include at least post_proc to extract fields (e.g., answer, question) from the raw generated text.
- (Optional) src/metrics/spider/evaluator.py: computes the task metric; it is only used in zero/few-shot prompting and when eval=true is set while running filter.py.
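As a rough idea of what a post_proc step for a new task might do, the sketch below cuts the raw generation at a stop marker and stores the cleaned answer. The signature, the `generated` and `answer` field names and the blank-line stop marker are assumptions made for illustration; check src/filters/spider_filter.py for the interface the repo actually expects.

```python
from typing import Dict


def post_proc(entry: Dict, stop: str = "\n\n") -> Dict:
    """Extract the answer field from the raw generated text.

    Assumes the model may keep writing after the answer (e.g., it starts
    a new example), so everything after the first stop marker is dropped.
    """
    raw = entry["generated"]
    answer = raw.split(stop, 1)[0].strip()
    return {**entry, "answer": answer}


entry = {
    "question": "How many singers are there?",
    "generated": " SELECT count(*) FROM singer\n\nQuestion: List all names.",
}
cleaned = post_proc(entry)
print(cleaned["answer"])
```

Later filters can then operate on the extracted field instead of the raw text.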
We provide generated data here.
If you find our work helpful, please cite us:
@article{ye2023generating,
title={Generating Data for Symbolic Language with Large Language Models},
author={Ye, Jiacheng and Li, Chengzu and Kong, Lingpeng and Yu, Tao},
journal={arXiv preprint arXiv:2305.13917},
year={2023}
}