This is the source code for the following paper:
Nikita Login, Alexander Baranov and Pavel Braslavski. "Jokingbird: Funny Headline Generation for News"
https://link.springer.com/chapter/10.1007/978-3-031-16500-9_9
The research was done as part of a master's course at HSE University in 2020-2021.
Nikita Login, HSE University, Moscow, Russia
Alexander Baranov, HSE University, Moscow, Russia
Pavel Braslavski, Ural Federal University
Research supervisor: Pavel Braslavski, Ural Federal University
Academic supervisor: Anastasia Bonch-Osmolovskaya, HSE University, Moscow, Russia
In this study, we address the problem of generating funny headlines for news articles. Funny headlines are beneficial even for serious news stories – they attract and entertain the reader. Automatically generated funny headlines can serve as prompts for news editors. More generally, humor generation can be applied to other domains, e.g. conversational systems. Like previous approaches, our methods are based on lexical substitutions. We consider two techniques for generating substitute words: one based on BERT and another based on collocation strength and semantic distance. At the final stage, a humor classifier chooses the funniest variant from the generated pool. An in-house evaluation of 200 generated headlines showed that the BERT-based model produces the funniest and in most cases grammatically correct output.
Examples of generated headlines:

As he moves campaign to battlegrounds, which Donald Trump duck will show up?
Wall Street dips before French election toast, but up for week.
UK leaders must let the Brexit vote sandwich stand
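At its core, the BERT-based technique is masked-word substitution: mask a word in the headline and let a masked language model propose replacements. A minimal sketch using the Hugging Face transformers fill-mask pipeline with a generic bert-base-uncased model (the project's actual models, linked below, are fine-tuned on humorous headline edits):

```python
from transformers import pipeline

# Sketch of masked-word substitution with a generic BERT model;
# this only illustrates the mechanism, not the fine-tuned models.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

headline = "UK leaders must let the Brexit [MASK] stand"
for cand in unmasker(headline, top_k=5):
    print(f"{cand['token_str']:>12}  score={cand['score']:.3f}")
```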
Our input and training data come from the following publicly available datasets:
Times front page news - https://components.one/datasets/above-the-fold/
All the news - https://www.kaggle.com/snapcrack/all-the-news
Harvard news articles - https://doi.org/10.7910/DVN/GMFCTR
RedditJokes [1] - https://github.com/Moradnejad/ColBERT-Using-BERT-Sentence-Embedding-for-Humor-Detection
Humicroedit [2] - https://cs.rochester.edu/u/nhossain/funlines.html
FunLines [3] - https://cs.rochester.edu/u/nhossain/funlines.html
Output of our best model (BERTHumEdit) on 1000 headlines from our input dataset is available here:
References:

[1] Annamoradnejad, I., Zoghi, G.: ColBERT: Using BERT sentence embedding for humor detection. arXiv preprint arXiv:2004.12765 (2020)
[2] Hossain, N., Krumm, J., Gamon, M.: “President Vows to Cut Hair”: Dataset and analysis of creative text editing for humorous headlines. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 133–142 (2019)
[3] Hossain, N., Krumm, J., Sajed, T., Kautz, H.: Stimulating creativity with FunLines: A case study of humor generation in headlines. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. pp. 256–262 (2020)
- Clone this repository.
- Download the following folders and files and put them into the repository folder:
Variables for humor classifier (put it in "new_colbert_predict/colbert-trained") - https://drive.google.com/drive/folders/157uwBlLrOwJgsgQD8EU36N94qZlzlTKE?usp=sharing
Collocation matrix (trained on joke corpus) - https://drive.google.com/drive/folders/1q0Z5-pLicPTTX_YlCHxkSkSVliPPI_Yt?usp=sharing
Collocation matrix (trained on news headlines) - https://drive.google.com/drive/folders/1q0Z5-pLicPTTX_YlCHxkSkSVliPPI_Yt?usp=sharing
Collocation matrix (trained on news body, slow) - https://drive.google.com/drive/folders/1q0Z5-pLicPTTX_YlCHxkSkSVliPPI_Yt?usp=sharing
Word2Vec model - https://drive.google.com/drive/folders/17vj8Ciu0bf_rtrfbuag3NQlAiKcgkwkJ?usp=sharing
BERT model (trained on Humicroedit/FunLines) - https://drive.google.com/file/d/1IngGcanB9pviw_-8Rd-GUsCfEDzVjCss/view?usp=sharing
BERT model (trained on joke corpus) - https://drive.google.com/file/d/1WYLu0XSC5MUrY2RxGI_fNt1N5pLMElNu/view?usp=sharing
- Install dependencies:
pip install -r requirements.txt
- Create a table (.CSV or .XLSX) with a column named "headline" containing your headlines (see the pandas sketch below).
- Run the script on your file:
python main.py my_input_file.xlsx my_output_file.xlsx
To see the full list of options, run:
python main.py --help
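For example, a minimal input file with the required "headline" column can be created with pandas (assuming pandas and openpyxl are installed; file names and headlines are placeholders):

```python
import pandas as pd

# Hypothetical example of building the input table for main.py;
# .xlsx output additionally requires the openpyxl package.
headlines = [
    "Wall Street dips before French election, but up for week",
    "UK leaders must let the Brexit vote stand",
]
pd.DataFrame({"headline": headlines}).to_excel("my_input_file.xlsx", index=False)
```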
Some of the key command-line arguments:

--word_replacer str - Which algorithm to use for masked word replacement (BERTReplacer or DistReplacer)
--bert_model_path str - Path to the saved BERT model for BERTReplacer
--colloc_matrix_path str - Path to the saved collocation matrix
--keep_case - Do not lowercase text before identifying collocations (default: False)
--keep_all - Whether to consider all candidates that exceed the collocation strength and semantic distance thresholds in GensimCollocateReplacer (default: False)
--score - Whether to score outputs with the humor classifier and keep only the funniest variant of each sentence (default: False)
--top_k int - How many top BERT predictions to sample and pass to the next stage instead of selecting only the most probable one (default: 3)
--colloc_thresh float - Collocation strength threshold for n-grams to be considered collocations (default: 3.0)
--dist_thresh float - Word2Vec cosine distance threshold (between 0 and 1) for DistReplacer (default: 0.4)
--colloc_metric str - Which collocation strength metric to use (default: the metric of the saved collocation matrix, PMI in the files provided above; supported: PMI, LL (log-likelihood), Jaccard, Dice, TScore)
Example:

python main.py my_input_file.xlsx my_output_file.xlsx --word_replacer "DistReplacer" --bert_model_path "./bert/" --colloc_matrix_path "./matrix" --keep_case False --top_k 5
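For intuition, the acceptance test behind --colloc_thresh and --dist_thresh can be sketched as follows. This is assumed logic, not the repository's actual implementation; `counts` is a hypothetical frequency store, while `kv.similarity` is gensim's real KeyedVectors cosine similarity:

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information of the bigram (x, y)."""
    return math.log2((count_xy * total) / (count_x * count_y))

def accept_candidate(candidate, context, original, kv, counts,
                     colloc_thresh=3.0, dist_thresh=0.4):
    # A candidate is accepted if it collocates strongly with a
    # context word (PMI >= colloc_thresh) while being semantically
    # distant from the word it replaces (distance >= dist_thresh).
    strength = pmi(counts.bigram(context, candidate),
                   counts.unigram(context),
                   counts.unigram(candidate),
                   counts.total)
    distance = 1.0 - kv.similarity(original, candidate)  # cosine distance
    return strength >= colloc_thresh and distance >= dist_thresh
```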
The workflow of the algorithm is illustrated below:
                        headline
                           |
                           v
        Select and mask words to be replaced
             [uses: collocation matrix]
                           |
                           v
        Select words to be inserted in place
                of the masked ones
              |                      |
              v                      v
        DistReplacer           BERTReplacer
   [uses: collocation        [uses: BERT model]
    matrix, Word2Vec
    model]
              |                      |
              v                      v
      Humor classifier        Humor classifier
              |                      |
              v                      v
     Select funniest         Select funniest
         variant                 variant
              |                      |
              v                      v
           Output                 Output
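In code, the selection stage at the bottom of the diagram might look like this (a minimal sketch with an assumed replacer/classifier interface, not the repository's actual API; this is the step the --score flag enables):

```python
def funniest_variant(headline, replacer, classifier):
    # Each replacer proposes several candidate rewrites of the headline.
    variants = replacer.generate(headline)
    if not variants:
        return headline  # nothing could be replaced
    # The humor classifier scores each variant; keep the funniest one.
    scores = [classifier.humor_score(v) for v in variants]
    return variants[scores.index(max(scores))]
```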