# NER with Wikipedia Distant Supervision Contextualized Embeddings

This repository contains the source code for the NER system presented in the following research publication:

Abbas Ghaddar and Philippe Langlais, *Contextualized Word Representations from Distant Supervision with and for NER*, in Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)
This code is based on the original BERT implementation.

## Requirements

- python 3.6
- tensorflow>=1.13
- pyhocon (for parsing the configurations)
- fasttext==0.8.3
## Setup

- Follow the instructions in `/data` to obtain the data, and change the `data_dir` path in the `experiments.config` file.
- Change the `raw_path` variables for the CoNLL and OntoNotes datasets in the `experiments.config` file to `path/to/conll-2003` and `path/to/conll-2012/v4/data` respectively. For the CoNLL dataset, please rename the `eng.train`, `eng.testa`, and `eng.testb` files to `conll.train.txt`, `conll.dev.txt`, and `conll.test.txt` respectively. Also, change `DATA_DIR` in `train_ner.sh` and `cache_emb.sh`.
- Run:

$ python preprocess.py {conll|ontonotes}
$ cd data
$ sh cache_emb.sh {conll|ontonotes}
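The exact layout of `experiments.config` is not reproduced here; since the project parses it with pyhocon, the edits above might look like the following HOCON sketch (all keys except `data_dir` and `raw_path` are hypothetical, and the values are placeholders for your local paths):

```hocon
# Sketch only -- match the key names to your copy of experiments.config.
data_dir = /path/to/processed/data

conll {
  raw_path = path/to/conll-2003
}

ontonotes {
  raw_path = path/to/conll-2012/v4/data
}
```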
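The CoNLL renaming step above can be done with three `mv` commands. A minimal sketch, run here against a scratch directory with stand-in files so it is self-contained; in practice, run the `mv` lines inside your local `conll-2003` directory:

```shell
# Demo of the renaming step on a scratch directory.
mkdir -p conll-2003-demo && cd conll-2003-demo
touch eng.train eng.testa eng.testb   # stand-ins for the real CoNLL-2003 files

# The actual renames expected by preprocess.py:
mv eng.train conll.train.txt
mv eng.testa conll.dev.txt
mv eng.testb conll.test.txt
```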
## Train and test

Once the data preprocessing is complete, you can train and test a model with:

$ cd data
$ sh train_ner.sh {conll|ontonotes}
## Citation

Please cite the following paper when using our code:
@inproceedings{ghaddar2019contextualized,
title={Contextualized Word Representations from Distant Supervision with and for NER},
author={Ghaddar, Abbas and Langlais, Philippe},
booktitle={Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)},
pages={101--108},
year={2019}
}