This repository contains preprocessing scripts for the ACE and ERE datasets, covering the name tagging, entity coreference, relation extraction, event extraction, and event coreference tasks. The scripts were verified on the ACE_2005 and Rich_ERE corpora.
Requirements: Python 3.6, jieba, NLTK
Preprocess data: remove XML tags from ACE/ERE articles and merge sentences.
python --source [source_path] --rsd [rsd_path] --data [ace or ere] --extension [ending_of_source_files]
[source_path]: the path to the input files (all .sgm files from the ACE source corpus)
[rsd_path]: output path
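The tag-removal part of this step can be sketched as follows. This is a minimal illustration and not the repository's actual script: the function name `strip_tags`, the regular expression, and the sample document are all assumptions made for the example. It deletes anything that looks like an XML/SGML tag and leaves every other character, including newlines, untouched, so character offsets in the output stay predictable.

```python
import re

# Hypothetical pattern: anything between "<" and the next ">" is treated as a tag.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(sgm_text):
    """Remove tag-like spans and keep all remaining characters unchanged."""
    return TAG_PATTERN.sub("", sgm_text)


# Made-up sample input, not taken from any corpus.
sample = "<DOC><DOCID>demo</DOCID><TEXT>\nHello world.\n</TEXT></DOC>"
print(repr(strip_tags(sample)))
```

Whitespace is deliberately not normalized here. If downstream annotations refer to character offsets in the raw text, collapsing or trimming whitespace at this point would shift them.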
Sentence segmentation and tokenization with offset retrieval
python --rsd [rsd_path] --ltf [ltf_path] --extension [ending_of_rsd_files]
[rsd_path]: the path for rsd files from step 1
[ltf_path]: output path
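To illustrate what "tokenization with offset retrieval" means, here is a small sketch using only the standard library. It is not the tokenizer used by the repository (which lists NLTK and jieba as dependencies). The function name `tokenize_with_offsets`, the regular expression, and the choice of inclusive end offsets are assumptions for this example.

```python
import re

# Hypothetical rule: a token is a run of word characters or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize_with_offsets(text):
    """Return (token, start, end) triples, with end as an inclusive character offset."""
    return [
        (match.group(), match.start(), match.end() - 1)
        for match in TOKEN_PATTERN.finditer(text)
    ]


tokens = tokenize_with_offsets("Hello world.")
for token, start, end in tokens:
    print(token, start, end)
```

Keeping the offsets alongside each token is what later allows character-based annotations to be matched to tokens.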
Convert ltf files to sentences of tokens in the BIO format used for name tagging tasks
python --ltf [ltf_path] --bio [bio_path]
[ltf_path]: the path for input files
[bio_path]: output path
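The conversion can be sketched roughly as below. The sketch assumes an LTF layout with `DOC`, `SEG`, and `TOKEN` elements carrying `start_char` and `end_char` attributes, and an output line layout of `token docid:start-end O`. Both are assumptions for illustration and may differ from what the repository's script reads and writes. The name `ltf_to_bio` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up LTF-like sample; the element and attribute names are assumptions.
LTF_SAMPLE = """<LCTL_TEXT>
  <DOC id="demo">
    <TEXT>
      <SEG id="segment-0" start_char="0" end_char="11">
        <TOKEN id="token-0-0" start_char="0" end_char="4">Hello</TOKEN>
        <TOKEN id="token-0-1" start_char="6" end_char="10">world</TOKEN>
        <TOKEN id="token-0-2" start_char="11" end_char="11">.</TOKEN>
      </SEG>
    </TEXT>
  </DOC>
</LCTL_TEXT>"""


def ltf_to_bio(ltf_xml):
    """Emit one line per token with an 'O' tag, and a blank line after each sentence."""
    root = ET.fromstring(ltf_xml)
    lines = []
    for doc in root.iter("DOC"):
        doc_id = doc.get("id")
        for seg in doc.iter("SEG"):
            for tok in seg.iter("TOKEN"):
                span = f"{doc_id}:{tok.get('start_char')}-{tok.get('end_char')}"
                lines.append(f"{tok.text} {span} O")
            lines.append("")  # blank line marks the sentence boundary
    return "\n".join(lines)


bio_text = ltf_to_bio(LTF_SAMPLE)
print(bio_text)
```

Every token starts with the tag `O` here; the real labels are filled in by the annotation step.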
Add annotations to bio files
python --bio [bio_path] --ann [ann_path] --ace [ace_path]
[bio_path]: the path for input files
[ann_path]: the path for all annotation files from ACE
[ace_path]: output path
Similarly, for the ERE corpus (the --ere argument is the output path):
python --bio [bio_path] --ann [ann_path] --ere [ere_path]
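The core idea of adding annotations to BIO files, for either corpus, is to align annotated character spans with token offsets. The following is a minimal sketch under stated assumptions: offsets are inclusive, mentions do not overlap, and a token is labeled only when it lies entirely inside a mention. The function `tag_tokens` and the sample data are hypothetical and not taken from the repository or from either corpus.

```python
def tag_tokens(tokens, mentions):
    """Assign a BIO tag to each token.

    tokens:   list of (text, start, end) with inclusive character offsets
    mentions: list of (start, end, entity_type) with inclusive character offsets
    """
    tags = []
    for _, tok_start, tok_end in tokens:
        tag = "O"
        for m_start, m_end, m_type in mentions:
            # The token must be fully covered by the mention span.
            if tok_start >= m_start and tok_end <= m_end:
                prefix = "B" if tok_start == m_start else "I"
                tag = f"{prefix}-{m_type}"
                break  # first matching mention wins; nesting is ignored
        tags.append(tag)
    return tags


# Made-up sentence: "Barack Obama visited Paris."
tokens = [
    ("Barack", 0, 5),
    ("Obama", 7, 11),
    ("visited", 13, 19),
    ("Paris", 21, 25),
    (".", 26, 26),
]
mentions = [(0, 11, "PER"), (21, 25, "GPE")]
tags = tag_tokens(tokens, mentions)
print(list(zip([t[0] for t in tokens], tags)))
```

Nested or overlapping mentions, which both corpora can contain, would need an explicit policy such as keeping the longest or the innermost span. This sketch simply takes the first match.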
[1] Lifu Huang, Taylor Cassidy, Xiaocheng Feng, Heng Ji, Clare R Voss, Jiawei Han, Avirup Sil. Liberal Event Extraction and Event Schema Induction. Proc. ACL 2016.
[2] Lifu Huang, Avirup Sil, Heng Ji, Radu Florian. Improving Slot Filling Performance with Attentive Neural Networks on Dependency Structures. Proc. EMNLP 2017.
[3] Lifu Huang, Heng Ji, Kyunghyun Cho, Clare R Voss. Zero-Shot Transfer Learning for Event Extraction. Proc. ACL 2018.