PPDB: The Paraphrase Database

Introduction

The paper presents a database of ranked English and Spanish paraphrases derived by:
- Extracting lexical, phrasal, and syntactic paraphrases from large bilingual parallel corpora.
- Computing distributional similarity scores for each paraphrase pair using the Google n-grams and the Annotated Gigaword corpus.

The basic idea is that if two English strings e₁ and e₂ translate to the same foreign string f (called the pivot), they can be assumed to have the same meaning.
Informally, the input to the system is a set of translation triplets of the form < f, e, φ >, where
- f is a foreign string
- e is an English string
- φ is a vector of feature functions
The system pivots over f to create paraphrase triplets < e₁, e₂, φ_p >, where φ_p is computed from the translation feature vectors φ₁ and φ₂.
For example, the conditional paraphrase probability p(e₂|e₁) can be computed by marginalizing over all shared foreign-language translations f:
- p(e₂|e₁) ≈ Sum over all f, p(e₂|f) p(f|e₁)
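The pivoting step can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function name `pivot_paraphrase_probs`, the nested-dictionary layout of the translation tables, and the toy probabilities are all assumptions made for the example. Each paraphrase score is the sum over shared pivots f of p(e₂|f) p(f|e₁).

```python
from collections import defaultdict


def pivot_paraphrase_probs(p_f_given_e, p_e_given_f):
    """Estimate p(e2|e1) by summing p(e2|f) * p(f|e1) over shared pivots f.

    p_f_given_e[e][f] holds p(f|e); p_e_given_f[f][e] holds p(e|f).
    Both table layouts are hypothetical, chosen for this sketch.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for e1, pivots in p_f_given_e.items():
        for f, p_f_e1 in pivots.items():
            for e2, p_e2_f in p_e_given_f.get(f, {}).items():
                if e2 != e1:  # a phrase is not its own paraphrase
                    scores[e1][e2] += p_e2_f * p_f_e1
    return {e1: dict(cands) for e1, cands in scores.items()}


# Toy translation tables with made-up probabilities.
p_f_given_e = {
    "thrown into jail": {"festgenommen": 0.6, "inhaftiert": 0.4},
    "arrested": {"festgenommen": 1.0},
}
p_e_given_f = {
    "festgenommen": {"thrown into jail": 0.25, "arrested": 0.75},
    "inhaftiert": {"thrown into jail": 0.5, "imprisoned": 0.5},
}

paraphrases = pivot_paraphrase_probs(p_f_given_e, p_e_given_f)
```

With these toy tables, "arrested" is reached from "thrown into jail" through the single pivot "festgenommen", so its score is 0.75 × 0.6.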

Phrase similarity is measured using distributional similarity.
These scores can be used to rerank the paraphrases obtained from bilingual text, or to obtain paraphrases that could not be extracted from bilingual text alone.
To describe a given phrase e₁, collect contextual features like:
- n-gram based features for words (to the left and right of the given phrase)
- Lexical, lemma-based, POS and named entity unigrams and bigrams
- Dependency link features
- Syntactic features
Aggregate all the features over all occurrences of e₁ to obtain its distributional signature s_e1.
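As a rough illustration of the aggregation step, the sketch below builds a signature from only the simplest feature type in the list, the words immediately to the left and right of each occurrence. The function name `signature`, the whitespace tokenization, the `L:`/`R:` feature prefixes, and the example sentences are assumptions for this example; the lemma, POS, named-entity, dependency, and syntactic features are not modeled.

```python
from collections import Counter


def signature(phrase, sentences, window=1):
    """Count left/right context words over every occurrence of `phrase`.

    Only n-gram context features are collected here; this is a simplified
    stand-in for the richer feature set described in the text.
    """
    target = phrase.split()
    n = len(target)
    sig = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == target:
                # Words up to `window` positions to the left of the phrase.
                for word in tokens[max(0, i - window):i]:
                    sig["L:" + word] += 1
                # Words up to `window` positions to the right of the phrase.
                for word in tokens[i + n:i + n + window]:
                    sig["R:" + word] += 1
    return sig


# Made-up corpus for illustration.
sentences = [
    "he was thrown into jail yesterday",
    "she was thrown into jail today",
]
sig = signature("thrown into jail", sentences)
```

Here the left-context feature `L:was` is counted twice because the phrase occurs after "was" in both sentences.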
Define the similarity between two phrases e₁ and e₂ as:
- sim(e₁, e₂) = dot(s_e1, s_e2) / (|s_e1| |s_e2|)
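The similarity above is the cosine of the angle between the two signature vectors. A minimal sketch, assuming signatures are stored as sparse dictionaries mapping feature names to counts (the function name `cosine_similarity` and the example feature names are hypothetical):

```python
import math


def cosine_similarity(s1, s2):
    """Cosine similarity between two sparse feature-count dictionaries."""
    # Features missing from s2 contribute zero to the dot product.
    dot = sum(value * s2.get(feature, 0.0) for feature, value in s1.items())
    norm1 = math.sqrt(sum(value * value for value in s1.values()))
    norm2 = math.sqrt(sum(value * value for value in s2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen here for empty signatures
    return dot / (norm1 * norm2)


# Made-up signatures for two phrases.
s_a = {"L:was": 2, "R:yesterday": 1}
s_b = {"L:was": 1, "R:today": 1}
similarity = cosine_similarity(s_a, s_b)
```

Returning 0.0 for an empty signature is a design choice of this sketch so that phrases with no observed context never look similar to anything.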
The paper describes two instances of the database:
- English paraphrases - 169.6 million paraphrases
- Spanish paraphrases - 161.6 million paraphrases

The paper analyses the precision-recall tradeoff for PPDB's coverage of PropBank predicates and predicate-argument tuples.
Human evaluation was performed over a sample of 1900 paraphrases to establish the correlation of PPDB scores with human judgement.

Segregation of data by domain or topic
Support for more languages
Improving paraphrase scores by using additional sources of information and better handling of paraphrase ambiguity.