This is the source code for the following paper:
Nikita Login, Alexander Baranov and Pavel Braslavski. Jokingbird: Funny Headline Generation for News.
The research was done as part of a master's course at HSE University in 2020-2021.
Nikita Login, HSE University, Moscow, Russia
Alexander Baranov, HSE University, Moscow, Russia
Pavel Braslavski, Ural Federal University
Research supervisor: Pavel Braslavski, Ural Federal University
Academic supervisor: Anastasia Bonch-Osmolovskaya, HSE University, Moscow, Russia
In this study, we address the problem of generating funny headlines for news articles. Funny headlines are beneficial even for serious news stories – they attract and entertain the reader. Automatically generated funny headlines can serve as prompts for news editors. More generally, humor generation can be applied to other domains, e.g. conversational systems. Like previous approaches, our methods are based on lexical substitutions. We consider two techniques for generating substitute words: one based on BERT and another based on collocation strength and semantic distance. At the final stage, a humor classifier chooses the funniest variant from the generated pool. An in-house evaluation of 200 generated headlines showed that the BERT-based model produces the funniest and in most cases grammatically correct output.
As he moves campaign to battlegrounds, which Donald Trump duck will show up?
Wall Street dips before French election toast , but up for week.
UK leaders must let the Brexit vote sandwich stand
Our input and training data come from publicly available datasets:
Times front page news -
All the news -
Harvard news articles -
RedditJokes [1] -
Humicroedit [2] -
FunLines [3] -
The output of our best model (BERTHumEdit) on 1000 headlines from our input dataset is available here:
Clone this repository.
Download the following folders and files and put them into the repository folder:
Variables for the humor classifier (put them in "new_colbert_predict/colbert-trained") -
Collocation matrix (trained on joke corpus) -
Collocation matrix (trained on news headlines) -
Collocation matrix (trained on news body, slow) -
Word2Vec model -
BERT model (trained on Humicroedit/FunLines) -
BERT model (trained on joke corpus) -
Install dependencies:
pip install -r requirements.txt
Create a table (.CSV, .XLSX) with a column named "headline" that contains your headlines.
Run the script on your file:
python my_input_file.xlsx my_output_file.xlsx
To see the possible list of options, type:
python --help
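The input table described in the steps above can be prepared in many ways. The following is a minimal sketch using only the Python standard library to write a CSV file with a single "headline" column. The helper names `write_headline_table` and `read_headline_table` and the file name are hypothetical and are not part of this repository.

```python
import csv


def write_headline_table(headlines, path):
    """Write headlines to a CSV file with a single "headline" column."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["headline"])  # the column name the script expects
        for headline in headlines:
            writer.writerow([headline])


def read_headline_table(path):
    """Read the "headline" column back, e.g. to check the file before running."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row["headline"] for row in csv.DictReader(f)]


if __name__ == "__main__":
    # Hypothetical file name, chosen only for illustration.
    write_headline_table(
        ["UK leaders must let the Brexit vote stand"],
        "my_input_file.csv",
    )
```

The `csv` module quotes headlines that contain commas, so they stay in a single cell.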
Some of the key command line arguments:
--word_replacer str - Which algorithm to use for masked word replacement (BERTReplacer or DistReplacer)
--bert_model_path str - Path to the saved BERT model for BERTReplacer
--colloc_matrix_path str - Path to the saved collocation matrix
--keep_case - Do not lowercase the text before identifying collocations (default: False)
--keep_all - Consider all possible options that exceed the collocation strength and semantic distance thresholds in GensimCollocateReplacer (default: False)
--score - Score the output elements with the humor classifier and keep only the funniest variant of each sentence (default: False)
--top_k int - Number of top BERT predictions to sample and pass to the next level, instead of selecting only the most probable one (default: 3)
--colloc_thresh float - Collocation strength threshold for n-grams to be considered collocations (default: 3.0)
--dist_thresh float - Threshold of Word2Vec cosine distance (between 0 and 1) for DistReplacer (default: 0.4)
--colloc_metric str - Which collocation strength metric to use (default: the metric of the saved collocation matrix, which is PMI in the files provided above; supported: PMI, LL (Log-likelihood), Jaccard, Dice, TScore)
python --word_replacer "DistReplacer" --bert_model_path "./bert/" --colloc_matrix_path "./matrix" --keep_case False --top_k 5
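To make the two thresholds above more concrete, here is a small sketch of pointwise mutual information (PMI) and cosine distance on toy values. It is not the repository's implementation. The base-2 logarithm, the helper names, and the rule that a candidate must exceed both thresholds (our reading of the `--keep_all` description) are assumptions.

```python
import math


def pmi(pair_count, x_count, y_count, total):
    """PMI of a word pair from raw counts: log2(p(x, y) / (p(x) * p(y)))."""
    return math.log2(pair_count * total / (x_count * y_count))


def cosine_distance(u, v):
    """Cosine distance between two vectors: 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def passes_thresholds(pmi_value, distance, colloc_thresh=3.0, dist_thresh=0.4):
    """Assumed filter: keep a candidate only if it exceeds both thresholds."""
    return pmi_value > colloc_thresh and distance > dist_thresh


if __name__ == "__main__":
    # Toy counts: the pair occurs 8 times, each word 16 times, in 1024 bigrams.
    strength = pmi(pair_count=8, x_count=16, y_count=16, total=1024)
    # Toy 2-dimensional "embeddings" standing in for Word2Vec vectors.
    distance = cosine_distance([1.0, 0.0], [0.0, 1.0])
    print(strength, distance, passes_thresholds(strength, distance))
```

With real Word2Vec vectors the cosine distance can in principle exceed 1 when the vectors point in opposite directions; the README states a range between 0 and 1 for `--dist_thresh`.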
The workflow of the algorithm is illustrated below:
    Select and mask words to be replaced
    Collocation matrix
    Select words to be inserted as replacement for masked
            ||                          ||
            ||                          ||
       DistReplacer                BERTReplacer
     Collocation matrix             BERT model
       Word2Vec model                   ||
            ||                          ||
            ||                          ||
      Humor classifier            Humor classifier
            ||                          ||
            ||                          ||
    Select most funny variant   Select most funny variant
            ||                          ||
            ||                          ||
          Output                      Output
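The workflow above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the code in this repository. The replacer and the scorer are toy stand-ins for BERTReplacer/DistReplacer and the humor classifier, the masked positions are passed in by hand instead of being selected with a collocation matrix, and all function names are hypothetical.

```python
def generate_variants(headline, mask_positions, replacer, top_k=3):
    """Replace each masked token with up to top_k candidates from the replacer."""
    tokens = headline.split()
    variants = []
    for pos in mask_positions:
        for candidate in replacer(tokens, pos)[:top_k]:
            new_tokens = tokens[:pos] + [candidate] + tokens[pos + 1:]
            variants.append(" ".join(new_tokens))
    return variants


def funniest(variants, scorer):
    """Return the variant with the highest score from the scorer."""
    return max(variants, key=scorer)


def toy_replacer(tokens, pos):
    """Toy stand-in for a replacer: returns fixed candidates, best first."""
    return ["toast", "sandwich", "duck", "kazoo"]


def toy_scorer(sentence):
    """Toy stand-in for a humor classifier: prefers sentences containing 'duck'."""
    return 1.0 if "duck" in sentence.split() else 0.0


if __name__ == "__main__":
    headline = "UK leaders must let the Brexit vote stand"
    variants = generate_variants(headline, mask_positions=[6], replacer=toy_replacer)
    print(funniest(variants, toy_scorer))
```

Passing the replacer and the scorer as plain callables keeps the two branches of the diagram interchangeable: only the function that proposes candidates differs between them.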