Code for the EMNLP 2022 paper "Improved Grammatical Error Correction by Ranking Elementary Edits", which provides a state-of-the-art approach to grammatical error correction.
- Install the requirements: pip install -r requirements.txt
- (optional) Install ERRANT for evaluation.
- Download W&I-LOCNESS data
mkdir -p data
cd data && wget https://www.cl.cam.ac.uk/research/nl/bea2019st/data/wi+locness_v2.1.bea19.tar.gz
tar -xzvf wi+locness_v2.1.bea19.tar.gz
cd ..
- (To reproduce finetuning and evaluation) Download the edits generated by the GECToR model:
cd data
mkdir -p bea_reranking && cd bea_reranking
wget https://www.dropbox.com/s/m5dot9rp0vwkcc8/gector_variants.tar.gz
tar -xzvf gector_variants.tar.gz
cd ../..
- (To reproduce finetuning and evaluation) Download model checkpoints.
| Checkpoint folder | Language | Best F0.5 | Model | Threshold | Basic model weight |
|---|---|---|---|---|---|
| pie_bea-gector | English | 56.05⭐ | roberta-base | 0.8 | 0.1 |
| pie_bea_ft2-gector | English | 57.51⭐ | roberta-large | 0.8 | 0.1 |
| clang_large_ft2-gector | English | 58.94⭐ | roberta-large | 0.8 | 0.1 |
| ru_200K_gpt | Russian | 53.44✔️ | sberbank-ai/ruRoberta-large | 0.7 | 0.1 |
| ru_200K_gpt_ft1 | Russian | 55.04✔️ | sberbank-ai/ruRoberta-large | 0.8 | 0.1 |

⭐ on the BEA-2019 development set; ✔️ on the RULEC-GEC test set.
To obtain the RULEC-GEC data, follow the instructions in the RULEC-GEC repository. The zip archive with the edits is available via the link; the password is the correction of the first error in its training data.
- English, GECToR: see our modification of GECToR repository.
- English, BERT-GEC: run beam search with a large beam size (e.g., 15) using their code, then postprocess the output with
python bertgec/output_to_json.py -i BERT_GEC_OUTPUT_FOLDER/test.nbest.tok -o OUTPUT.jsonl
python bertgec/process_bert_gec_outputs.py -i OUTPUT.jsonl -s INPUT_FILE -o OUTPUT.variants -t -3.0 -j
If the data is simply a list of tokenized sentences, append the -r option to the last command.
- Russian: uses a modification of a GPT-like model, TO APPEAR SOON.
You may use your own generator if it produces a file in the appropriate format (use the provided GECToR edits as a reference).
For each generated edit, our model returns the probability that the edit is correct and applies the edits whose probabilities are higher than the given threshold. We recommend a default threshold of 0.8 or 0.9, or tuning it on the development set.
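The thresholding step can be sketched as follows. This is an illustrative snippet, not the repository's actual API; in particular, the edit representation as a (start, end, replacement, probability) tuple over source tokens is our assumption:

```python
# Illustrative sketch (not the repository's real API): keep edits whose
# probability exceeds the threshold, drop overlapping spans, and apply
# the survivors right-to-left so earlier indices stay valid.

def apply_edits(tokens, scored_edits, threshold=0.8):
    kept = [e for e in scored_edits if e[3] > threshold]
    kept.sort(key=lambda e: (e[0], e[1]))
    selected, last_end = [], -1
    for start, end, repl, prob in kept:
        if start >= last_end:  # greedily skip edits overlapping a kept span
            selected.append((start, end, repl))
            last_end = end
    for start, end, repl in reversed(selected):
        tokens[start:end] = repl  # apply right-to-left
    return tokens

tokens = "He go to school yesterday .".split()
edits = [(1, 2, ["went"], 0.95),   # "go" -> "went", confident
         (3, 3, ["the"], 0.45)]    # low-probability insertion, filtered out
print(" ".join(apply_edits(tokens, edits, threshold=0.8)))
# -> He went to school yesterday .
```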
# Faster simultaneous decoding (see the paper)
python apply_model.py -c CHECKPOINT_FOLDER -C CHECKPOINT_NAME -v TEST_VARIANTS_PATH
-O OUTPUT_FOLDER --n_max 8 [-m MODEL_NAME; DEFAULT=roberta-base] [-T THRESHOLDS ...; DEFAULT=0.4 0.5 0.6 0.7 0.8 0.9] [-a BASIC_MODEL_WEIGHTS ...] [-r]
# Better stagewise decoding (see the paper)
python apply_staged_model.py -c CHECKPOINT_FOLDER -C CHECKPOINT_NAME -v TEST_VARIANTS_PATH
-O OUTPUT_FOLDER -s 8 [-m MODEL_NAME; DEFAULT=roberta-base] [-T THRESHOLDS ...; DEFAULT=0.7 0.8 0.9] [-a BASIC_MODEL_WEIGHTS ...] [-r]
Add the -r flag when the variants were obtained from unlabeled data and the correct answers are not known.
- For example, to make predictions on the development set using the checkpoints/pie_bea_ft2-gector/checkpoint_2.pt checkpoint with stagewise decoding and evaluate them for threshold=0.9, run
python apply_staged_model.py -c checkpoints/pie_bea_ft2-gector -C checkpoint_2.pt \
-i data/wi+locness/m2/ABCN.dev.gold.bea19.m2 -v data/bea_reranking/gector_variants/bea.dev.variants \
-O dump/reranking -s 8 -a 0.1
./scripts/evaluate.sh -i data/wi+locness/m2/ABCN.dev.gold.bea19.m2 -r dump/reranking/pie_bea_ft2-gector/0.9_staged.output
It should produce
=========== Span-Based Correction ============
TP FP FN Prec Rec F0.5
2250 903 5211 0.7136 0.3016 0.5605
==============================================
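The reported numbers follow the standard span-based GEC convention: F0.5 weights precision twice as heavily as recall. As a quick sanity check, the scores can be recomputed from the TP/FP/FN counts:

```python
# Derive precision, recall and F0.5 from span-level TP/FP/FN counts.
def span_scores(tp, fp, fn, beta=0.5):
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f = (1 + beta ** 2) * prec * rec / (beta ** 2 * prec + rec)
    return round(prec, 4), round(rec, 4), round(f, 4)

print(span_scores(2250, 903, 5211))  # -> (0.7136, 0.3016, 0.5605)
```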
The combined model output for threshold 0.8 is evaluated by
./scripts/evaluate.sh -i data/wi+locness/m2/ABCN.dev.gold.bea19.m2 -r dump/reranking/pie_bea_ft2-gector/0.8_alpha=0.10_1.00_staged.output
and produces
=========== Span-Based Correction ============
TP FP FN Prec Rec F0.5
2567 1147 4894 0.6912 0.3441 0.5751
==============================================
The larger checkpoints/clang_large_ft2-gector/checkpoint_2.pt checkpoint is used analogously:
python apply_staged_model.py -c checkpoints/clang_large_ft2-gector -C checkpoint_2.pt \
-i data/wi+locness/m2/ABCN.dev.gold.bea19.m2 -v data/bea_reranking/gector_variants/bea.dev.variants \
-O dump/reranking -m roberta-large -s 8 -a 0.1
./scripts/evaluate.sh -i data/wi+locness/m2/ABCN.dev.gold.bea19.m2 -r dump/reranking/clang_large_ft2-gector/0.8_alpha=0.10_1.00_staged.output
=========== Span-Based Correction ============
TP FP FN Prec Rec F0.5
2678 1136 4783 0.7021 0.3589 0.5894
==============================================
- To generate the outputs on the test set, run
python apply_staged_model.py -c checkpoints/clang_large_ft2-gector -C checkpoint_2.pt \
-v data/wi+locness/test/ABCN.test.bea19.orig -O dump/test_output -s 8 -a 0.1 -r
The *.output files for different threshold values are available in OUTPUT_FOLDER (dump/test_output in our case).
The only difference for Russian is that we use the M2Scorer for evaluation:
python apply_staged_model.py -c checkpoints/ru_200K_gpt_ft1 -C checkpoint_2.pt -O dump/reranking \
-v data/russian_reranking/gpt/test.variants -i data/russian/RULEC-GEC.test.M2 -m sberbank-ai/ruRoberta-large \
-s 5 -a 0.1
python scripts/m2scorer/scripts/m2scorer.py dump/reranking/ru_200K_gpt_ft1/0.7_alpha\=0.10_1.00_staged.output data/russian/RULEC-GEC.test.M2
Precision : 0.7367
Recall : 0.2733
F_0.5 : 0.5502
python train.py -t TRAIN_VARIANTS_PATH -T TEST_VARIANTS_PATH -M 768 --loss_by_class -e EPOCHS
-c CHECKPOINT_FOLDER [-L INITIAL_CHECKPOINT_PATH] [-E RECALL_ESTIMATE] [-m MODEL_NAME; DEFAULT=roberta-base] --save_all_checkpoints
--only_generated
- English, finetuning on W&I-LOCNESS train set using GECToR-generated edits:
python train.py -t data/bea_reranking/gector_variants/bea.train.variants -T \
data/bea_reranking/gector_variants/bea.dev.variants -M 768 --loss_by_class -e 3 \
-c checkpoints/pie_bea_ft2_rerun-gector -L checkpoints/pie_bea-gector/checkpoint_2.pt \
-E 0.4 --save_all_checkpoints --only_generated
- Russian, finetuning on RULEC-GEC data:
python train.py -t data/russian_reranking/gpt/train.variants -T data/russian_reranking/gpt/dev.variants \
-M 768 --loss_by_class -e 5 -c checkpoints/ru_200K_gpt_ft1 -L checkpoints/ru_200K_gpt/checkpoint_1.pt \
-E 0.4 --save_all_checkpoints -m sberbank-ai/ruRoberta-large --only_generated