Abstractive and extractive summarization models, mostly for Russian language. Building on top of AllenNLP
You can also checkout the MBART-based Russian summarization model on Huggingface: mbart_ru_sum_gazeta
Based on the following papers:
- SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents
- Get To The Point: Summarization with Pointer-Generator Networks
- Self-Attentive Model for Headline Generation
- Text Summarization with Pretrained Encoders
- Multilingual Denoising Pre-training for Neural Machine Translation
- Telegram: @YallenGusev
pip install -r requirements.txt
Script for training a model based on AllenNLP 'train' command.
Argument | Required | Description |
---|---|---|
-c | true | path to file with configuration |
-s | true | path to directory where model will be saved |
-t | true | path to train dataset |
-v | true | path to val dataset |
-r | false | recover from checkpoint |
Script for model evaluation. The test dataset should have the same format as the train dataset.
Argument | Required | Default | Description |
---|---|---|---|
-t | true | path to test dataset | |
-m | true | path to tar.gz archive with model | |
-p | true | name of Predictor | |
-c | false | 0 | CUDA device |
-L | true | Language ("ru" or "en") | |
-b | false | 32 | size of a batch with test examples to run simultaneously |
-M | false | path to meteor.jar for Meteor metric | |
-T | false | tokenize gold and predicted summaries before metrics calculation | |
-D | false | save temporary files with gold and predicted summaries |
Script for subword model training.
Argument | Default | Description |
---|---|---|
--train-path | path to train dataset | |
--model-path | path to directory where generated subword model will be saved | |
--model-type | bpe | type of subword model, see sentencepiece |
--vocab-size | 50000 | size of the resulting subword model vocabulary |
--config-path | path to file with configuration for DatasetReader (with parse_set) |
- First paper: Importance of Copying Mechanism for News Headline Generation
- Slides: Importance of Copying Mechanism for News Headline Generation
- Second paper: Advances of Transformer-Based Models for News Headline Generation
- RIA original dataset: https://github.com/RossiyaSegodnya/ria_news_dataset
- RIA train/val/test: https://www.dropbox.com/s/rermx1r8lx9u7nl/ria.tar.gz
- RIA dataset preprocessed for mBART: https://www.dropbox.com/s/iq2ih8sztygvz0m/ria_data_mbart_512_200.tar.gz
- Lenta original dataset: https://github.com/yutkin/Lenta.Ru-News-Dataset
- Lenta train/val/test: https://www.dropbox.com/s/v9i2nh12a4deuqj/lenta.tar.gz
- Lenta dataset preprocessed for mBART: https://www.dropbox.com/s/4oo8jazmw3izqvr/lenta_mbart_data_512_200.tar.gz
- Telegram train dataset with split: https://www.dropbox.com/s/ykqk49a8avlmnaf/ru_all_split.tar.gz
- Telegram test dataset with multiple references: https://github.com/dialogue-evaluation/Russian-News-Clustering-and-Headline-Generation/blob/main/data/headline_generation/headline_generation_answers.jsonl
Prediction script:
./predict.sh -t <path_to_test_dataset> -m ria_pgn_24kk.tar.gz -p subwords_summary -L ru
Model | R-1-f | R-2-f | R-L-f | BLEU |
---|---|---|---|---|
ria_copynet_10kk | 40.0 | 23.3 | 37.5 | - |
ria_pgn_24kk | 42.3 | 25.1 | 39.6 | - |
ria_mbart | 42.8 | 25.5 | 39.9 | - |
First Sentence | 24.1 | 10.6 | 16.7 | - |
Model | R-1-f | R-2-f | R-L-f | BLEU |
---|---|---|---|---|
ria_copynet_10kk | 25.6 | 12.3 | 23.0 | - |
ria_pgn_24kk | 26.4 | 12.3 | 24.0 | - |
ria_mbart | 30.3 | 14.5 | 27.1 | - |
First Sentence | 25.5 | 11.2 | 19.2 | - |
- CNN/DailyMail jsonl dataset: https://www.dropbox.com/s/35ezpg78rtukkgh/cnn_dm_jsonl.tar.gz
Prediction script:
./predict.sh -t <path_to_test_dataset> -m cnndm_pgn_25kk.tar.gz -p words_summary -L en -R
Model | R-1-f | R-2-f | R-L-f | METEOR | BLEU |
---|---|---|---|---|---|
cnndm_pgn_25kk | 38.5 | 16.5 | 33.4 | 17.6 | - |
- Paper: Dataset for Automatic Summarization of Russian News
- Gazeta dataset: https://github.com/IlyaGusev/gazeta
- Usage examples:
- gazeta_pgn_7kk
- gazeta_pgn_7kk_cov.tar.gz
- gazeta_pgn_25kk
- gazeta_pgn_words_13kk.tar.gz
- gazeta_summarunner_3kk
Prediction scripts:
./predict.sh -t <path_to_test_dataset> -m gazeta_pgn_7kk.tar.gz -p subwords_summary -L ru -T
./predict.sh -t <path_to_test_dataset> -m gazeta_summarunner_3kk.tar.gz -p subwords_summary_sentences -L ru -T
Model | R-1-f | R-2-f | R-L-f | METEOR | BLEU |
---|---|---|---|---|---|
gazeta_pgn_7kk | 29.4 | 12.7 | 24.6 | 21.2 | 9.0 |
gazeta_pgn_7kk_cov | 29.8 | 12.8 | 25.4 | 22.1 | 10.1 |
gazeta_pgn_25kk | 29.6 | 12.8 | 24.6 | 21.5 | 9.3 |
gazeta_pgn_words_13kk | 29.4 | 12.6 | 24.4 | 20.9 | 8.9 |
gazeta_summarunner_3kk | 31.6 | 13.7 | 27.1 | 26.0 | 11.5 |
gazeta_mbart | 32.6 | 14.6 | 28.2 | 25.7 | 12.4 |
gazeta_mbart_lower | 32.7 | 14.7 | 28.3 | 25.8 | 12.5 |
python demo/server.py --include-package summarus --model-dir <model_dir> --host <host> --port <port>
Headline generation (PGN):
@article{Gusev2019headlines,
author={Gusev, I.O.},
title={Importance of copying mechanism for news headline generation},
journal={Komp'juternaja Lingvistika i Intellektual'nye Tehnologii},
year={2019},
volume={2019-May},
number={18},
pages={229--236}
}
Headline generation (transformers):
@InProceedings{Bukhtiyarov2020headlines,
author={Bukhtiyarov, Alexey and Gusev, Ilya},
title="Advances of Transformer-Based Models for News Headline Generation",
booktitle="Artificial Intelligence and Natural Language",
year="2020",
publisher="Springer International Publishing",
address="Cham",
pages={54--61},
isbn="978-3-030-59082-6",
doi={10.1007/978-3-030-59082-6_4}
}
Summarization:
@InProceedings{Gusev2020gazeta,
author="Gusev, Ilya",
title="Dataset for Automatic Summarization of Russian News",
booktitle="Artificial Intelligence and Natural Language",
year="2020",
publisher="Springer International Publishing",
address="Cham",
pages="{122--134}",
isbn="978-3-030-59082-6",
doi={10.1007/978-3-030-59082-6_9}
}