This is the code base for the ACL'19 paper *Complex Question Decomposition for Semantic Parsing*.
Download the ComplexWebQuestions data and prepare the environment and libraries.
To run preprocessing, put the following files in the `DATA_PATH` directory (`DATA_PATH` is defined in the script):
- ComplexWebQuestions_train.json
- ComplexWebQuestions_dev.json
- ComplexWebQuestions_test.json (we need all the other information contained in these files)
- train.json, dev.json, test.json (we need the split sub-questions in these files; they are not included in the raw data, but we provide generated copies in the `complex_questions` directory, and you can also generate them yourself with the following steps)
- `cd WebAsKB`.
- Prepare a StanfordCoreNLP server on localhost, following the StanfordCoreNLP server instructions in the POS annotation step below.
- Change the `data_dir` setting in `WebAsKB/config.py`.
- Change the `EVALUATION_SET` setting in `WebAsKB/config.py` to `train`, `dev`, and `test` in turn, and run `python webaskb_run.py gen_golden_sup` once for each setting (three runs in total).
- After this, you will have `train.json`, `dev.json`, and `test.json` in `DATA_PATH`; you can sanity-check them as sketched below.
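To sanity-check the generated files, load one and inspect a record. A minimal sketch, assuming the files are JSON arrays of records (the path is a placeholder, and the exact field names inside each record depend on the WebAsKB output):

```python
import json
import os

DATA_PATH = "/path/to/data"  # placeholder: set to your DATA_PATH

# Load the generated split and confirm it parses and is non-empty.
with open(os.path.join(DATA_PATH, "train.json")) as f:
    records = json.load(f)

print(f"{len(records)} records loaded")
print(records[0])  # inspect one record to see the sub-question fields
```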
To run the POS annotation process, download and start a StanfordCoreNLP server on localhost:9003.

- Download https://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip, unzip it, and `cd` into the unzipped directory.
- Start the server with `java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9003 -timeout 15000`.
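Once the server is up, you can verify it from Python. A minimal sketch using the CoreNLP server's HTTP API via the `requests` library (the example sentence is arbitrary):

```python
import json
import requests

# Ask the CoreNLP server on localhost:9003 for POS annotations.
props = {"annotators": "tokenize,ssplit,pos", "outputFormat": "json"}
resp = requests.post(
    "http://localhost:9003/",
    params={"properties": json.dumps(props)},
    data="What movie did the composer of the score direct?".encode("utf-8"),
)
doc = resp.json()

# Print each token with its POS tag.
for sentence in doc["sentences"]:
    for token in sentence["tokens"]:
        print(token["word"], token["pos"])
```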
We provide a template script `scripts/run.sh`; to run it, you need to change at least the following directory settings:

- `DATA_PATH`: the data root directory.
- `RUN_T2T`: the root directory of the code base.

Now run `scripts/run.sh preprocess`; this command generates the data format for our model and annotates POS labels.
Prepare the GloVe pretrained embedding file `glove.6B.300d.txt` and put it in `DATA_PATH/embed/`.
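For reference, each line of `glove.6B.300d.txt` is a word followed by 300 space-separated floats. A minimal sketch of loading it (the path is a placeholder for your `DATA_PATH`):

```python
import numpy as np

def load_glove(path):
    """Load GloVe vectors into a {word: np.ndarray} dict."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

glove = load_glove("/path/to/data/embed/glove.6B.300d.txt")
print(len(glove), glove["question"].shape)  # expect 400000 words, (300,)
```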
`scripts/run.sh prepare` will shuffle the dataset and build the vocabulary file.
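The script handles this step, but conceptually shuffling and vocabulary building look like the following. A rough, hypothetical sketch of the idea, not the project's actual implementation (`max_size` and the special tokens are assumptions):

```python
import random
from collections import Counter

def build_vocab(sentences, max_size=50000):
    # Count tokens across the corpus and keep the most frequent ones.
    counts = Counter(tok for sent in sentences for tok in sent.split())
    vocab = ["<pad>", "<unk>"] + [w for w, _ in counts.most_common(max_size)]
    return {w: i for i, w in enumerate(vocab)}

examples = ["what movie did john direct", "who wrote the score"]
random.shuffle(examples)          # shuffle the dataset
word2id = build_vocab(examples)   # build the vocabulary
```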
To train our decomposition model, use `scripts/run.sh train`.
To train our semantic parsing model, use `scripts/run.sh train_lf`.
Run `scripts/run.sh test` to generate decomposed queries from an input file and print BLEU-4 and ROUGE-L scores against the references.
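If you want to score a single decomposition yourself, BLEU-4 can be computed with NLTK. A minimal sketch (the project's evaluation script may tokenize and aggregate differently):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "what movies did john direct".split()
hypothesis = "what movie did john direct".split()

# BLEU-4: uniform weights over 1- to 4-gram precisions, with smoothing.
score = sentence_bleu(
    [reference], hypothesis,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {score:.4f}")
```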
Run `scripts/run.sh test_lf` to generate logical forms from an input file and print the exact match (EM) score against the references.
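Exact match is simply the fraction of predicted logical forms identical to their references. A minimal sketch (the logical forms shown are made-up placeholders):

```python
def exact_match(predictions, references):
    """Fraction of predictions that exactly equal their reference."""
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

print(exact_match(["(and A B)", "(or C D)"], ["(and A B)", "(or D C)"]))  # 0.5
```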
If you use this code in your research, please cite our paper with the following BibTeX:
@inproceedings{Zhang2019HSP,
  title     = {Complex Question Decomposition for Semantic Parsing},
  author    = {Zhang, Haoyu and Cai, Jingjing and Xu, Jianjun and Wang, Ji},
  booktitle = {Conference of the Association for Computational Linguistics (ACL)},
  year      = {2019}
}