This repository contains code to query our best models (served as TensorFlow Hub models), together with their predictions on various academic text-generation benchmarks, from our paper "Leveraging Pre-trained Checkpoints for Sequence Generation Tasks", published in TACL 2020.
Please cite our paper if you use our data or models.
@article{rothe_tacl20,
author = {Rothe, Sascha and Narayan, Shashi and Severyn, Aliaksei},
title = {Leveraging Pre-trained Checkpoints for Sequence Generation Tasks},
journal = {Transactions of the Association for Computational Linguistics},
volume = {8},
number = {},
pages = {264-280},
year = {2020}
}
Unsupervised pre-training of large neural models has recently revolutionized Natural Language Processing. We developed a Transformer-based sequence-to-sequence model that is compatible with publicly available pre-trained BERT, GPT-2 and RoBERTa checkpoints, and achieved new state-of-the-art results on Machine Translation, Text Summarization, Sentence Splitting, and Sentence Fusion. We believe that NLP researchers will find our dataset of model predictions a valuable resource for comparing pre-trained text generation models and deriving actionable insights.
The dataset consists of our sequence-to-sequence model predictions on academic datasets for text generation: Sentence Fusion (DiscoFuse), Sentence Splitting (WikiSplit), Summarization (XSum, CNN/DailyMail and Gigaword) and Machine Translation (WMT 2014 and 2016).
The dataset consists of JSON files, each containing a list of dictionaries of the form:
{
  "target": <string>,
  "prediction": <string>
}
Here, "prediction" is the model-generated text and "target" is the reference text.
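As an illustration of the file format described above, here is a minimal sketch of loading one predictions file and computing the fraction of predictions that exactly match their targets. The helper names `load_predictions` and `exact_match_rate`, the file name `toy_predictions.json`, and the toy records are all assumptions made for this example; they are not part of the released data or code, and exact match is only one simple way to compare predictions with references.

```python
import json
import os
import tempfile


def load_predictions(path):
    """Read one predictions file: a JSON list of {"target", "prediction"} dicts."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    # Fail early if a record does not have the two documented keys.
    for record in records:
        if "target" not in record or "prediction" not in record:
            raise ValueError(f"unexpected record keys: {sorted(record)}")
    return records


def exact_match_rate(records):
    """Fraction of records whose prediction equals the target (whitespace-stripped)."""
    if not records:
        return 0.0
    hits = sum(r["prediction"].strip() == r["target"].strip() for r in records)
    return hits / len(records)


# Toy data standing in for a real predictions file (contents are made up).
toy_records = [
    {"target": "A cat sat.", "prediction": "A cat sat."},
    {"target": "Dogs bark.", "prediction": "Dogs bark loudly."},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_predictions.json")  # hypothetical file name
    with open(toy_path, "w", encoding="utf-8") as f:
        json.dump(toy_records, f)
    records = load_predictions(toy_path)

print(len(records), exact_match_rate(records))
```

To use this on the released data, replace the toy file with the path to one of the downloaded JSON files.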
- MT (DE->EN): WMT 2014 and WMT 2016
- MT (EN->DE): WMT 2014 and WMT 2016
- Sentence Fusion: DiscoFuse
- Sentence Splitting: WikiSplit
- Summarization: Gigaword, CNN/DailyMail and XSum
Here is the code to query our best models served as TensorFlow Hub models.
# TF1 version
import tensorflow.compat.v1 as tf
import tensorflow_hub as hub
import tensorflow_text as tf_text
# Machine translation: German to English
text_generator = hub.Module(
    'https://tfhub.dev/google/bertseq2seq/bert24_de_en/1')
de_sents = ['Satz 1', 'Satz 2']
en_sents = text_generator(de_sents)
# Machine translation: English to German
text_generator = hub.Module(
    'https://tfhub.dev/google/bertseq2seq/bert24_en_de/1')
en_sents = ['Sentence 1', 'Sentence 2']
de_sents = text_generator(en_sents)
# Sentence fusion (DiscoFuse)
text_generator = hub.Module(
    'https://tfhub.dev/google/bertseq2seq/roberta24_discofuse/1')
input_texts = ['Sentence 1a Sentence 1b',
               'Sentence 2a Sentence 2b Sentence 2c']
output_sents = text_generator(input_texts)
# Sentence splitting (WikiSplit)
text_generator = hub.Module(
    'https://tfhub.dev/google/bertseq2seq/roberta24_wikisplit/1')
input_sentences = ['Long Sentence 1', 'Long Sentence 2']
output_texts = text_generator(input_sentences)
# Summarization (Gigaword)
text_generator = hub.Module(
    'https://tfhub.dev/google/bertseq2seq/roberta24_gigaword/1')
input_sents = ['This is the first sentence.', 'This is the second sentence.']
output_summaries = text_generator(input_sents)
# Summarization (CNN/DailyMail)
text_generator = hub.Module(
    'https://tfhub.dev/google/bertseq2seq/roberta24_cnndm/1')
input_documents = ['This is text from the first document.',
                   'This is text from the second document.']
output_summaries = text_generator(input_documents)
# Summarization (XSum, built from BBC articles)
text_generator = hub.Module(
    'https://tfhub.dev/google/bertseq2seq/roberta24_bbc/1')
input_documents = ['This is text from the first document.',
                   'This is text from the second document.']
output_summaries = text_generator(input_documents)
- SentencePiece Tokenizer: vocab file and model file.
- WordPiece Tokenizer: vocab file.
If you have a technical question regarding the dataset or publication, please create an issue in this repository. This is the fastest way to reach us.
If you would like to share feedback or report concerns, please email us at berts2s@google.com.