@shagunsodhani
Created March 5, 2017 17:27
Summary of "PPDB: The Paraphrase Database" paper

PPDB: The Paraphrase Database

Introduction

  • The paper presents a database of ranked English and Spanish paraphrases derived by:
    • Extracting lexical, phrasal, and syntactic paraphrases from large bilingual parallel corpora.
    • Computing the similarity scores for the pair of paraphrases using Google ngrams and the Annotated Gigaword corpus.
  • Link to the paper

Extracting Paraphrases from Bilingual Text

  • The basic idea is that if two English strings e1 and e2 translate to the same foreign string f (also called pivot), they should have the same meaning.
  • Informally speaking, the input to the system is translation triplets of the form < f, e, φ >, where
    • f is a foreign string
    • e is an English string
    • φ is a vector of feature functions
  • The system can pivot over f to create paraphrase triplets < e1, e2, φp > where φp is computed using translation feature vectors φ1 and φ2
  • For example, conditional paraphrase probability p(e2|e1) can be computed by marginalizing over all shared foreign language translations f:
    • p(e2|e1) ≈ Sum over all f of p(e2|f) p(f|e1)
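
The pivoting step can be illustrated with a minimal sketch. It assumes the translation model is available as two nested dictionaries, `p_f_given_e` and `p_e_given_f`; these names, the function `pivot_paraphrase_probs`, and the toy probabilities are made up for illustration and are not taken from the paper. The code accumulates p(e2|f) · p(f|e1) over every foreign phrase f that both English phrases share.

```python
from collections import defaultdict


def pivot_paraphrase_probs(p_f_given_e, p_e_given_f):
    """Approximate p(e2|e1) as the sum over pivots f of p(e2|f) * p(f|e1).

    Both arguments are hypothetical nested dicts:
      p_f_given_e[e][f] = p(f|e)   and   p_e_given_f[f][e] = p(e|f)
    """
    scores = defaultdict(lambda: defaultdict(float))
    for e1, foreign in p_f_given_e.items():
        for f, p_f_e1 in foreign.items():
            # Every English phrase reachable through the pivot f is a candidate.
            for e2, p_e2_f in p_e_given_f.get(f, {}).items():
                if e2 != e1:  # a phrase is not its own paraphrase
                    scores[e1][e2] += p_e2_f * p_f_e1
    return {e1: dict(candidates) for e1, candidates in scores.items()}


# Toy translation tables with made-up probabilities (illustration only).
p_f_given_e = {
    "thrown into jail": {"festgenommen": 0.5, "inhaftiert": 0.5},
    "imprisoned": {"inhaftiert": 1.0},
}
p_e_given_f = {
    "festgenommen": {"thrown into jail": 0.25, "arrested": 0.75},
    "inhaftiert": {"thrown into jail": 0.5, "imprisoned": 0.5},
}

paraphrases = pivot_paraphrase_probs(p_f_given_e, p_e_given_f)
for e1, candidates in paraphrases.items():
    for e2, prob in sorted(candidates.items(), key=lambda kv: -kv[1]):
        print(f"p({e2!r} | {e1!r}) = {prob}")
```

With these toy tables, "thrown into jail" gets "arrested" (0.375) and "imprisoned" (0.25) as paraphrase candidates. The real system also combines the other feature functions in φ, which this sketch leaves out.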

Scoring Paraphrases Using Monolingual Distributional Similarity

  • Measure the similarity of phrases using distributional similarity, i.e. by comparing the contexts in which the phrases occur in monolingual text.
  • Can be used to rerank the paraphrases obtained from bilingual text or to obtain the paraphrases which could not be obtained from bilingual text alone.
  • To describe a given phrase e1, collect contextual features like:
    • n-gram based features for words (to the left and right of the given phrase)
    • Lexical, lemma-based, POS and named entity unigrams and bigrams
    • Dependency link features
    • Syntactic features
  • Aggregate the features over all occurrences of a phrase e to obtain its distributional signature se.
  • Define similarity between 2 phrases e1 and e2 as :
    • sim(e1, e2) = dot(se1, se2) / (|se1| |se2|)
  • The paper describes two instances of the database:
    • English paraphrases - 169.6 million paraphrases
    • Spanish paraphrases - 161.6 million paraphrases
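
The signature and cosine similarity described above can be sketched as follows. This is a simplified illustration, not the paper's pipeline: it uses only left and right neighbouring words as features (no lemma, POS, named entity, dependency, or syntactic features), and the helper names `signature` and `cosine` as well as the four-sentence corpus are assumptions made up for the example.

```python
import math
from collections import Counter


def signature(phrase, sentences, window=1):
    """Count the words within `window` tokens to the left and right of
    every occurrence of `phrase` (a simplified distributional signature)."""
    target = phrase.split()
    n = len(target)
    sig = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == target:
                for word in tokens[max(0, i - window):i]:
                    sig["L:" + word] += 1  # left-context feature
                for word in tokens[i + n:i + n + window]:
                    sig["R:" + word] += 1  # right-context feature
    return sig


def cosine(s1, s2):
    """sim(e1, e2) = dot(se1, se2) / (|se1| |se2|); 0.0 if either is empty."""
    dot = sum(value * s2[key] for key, value in s1.items() if key in s2)
    norm1 = math.sqrt(sum(value * value for value in s1.values()))
    norm2 = math.sqrt(sum(value * value for value in s2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Tiny made-up corpus (illustration only).
corpus = [
    "the big dog barked",
    "the large dog barked",
    "a big house",
    "a large cat",
]

sim = cosine(signature("big", corpus), signature("large", corpus))
print(sim)
```

On this toy corpus "big" and "large" share three of their four context features, so the similarity is 0.75. A phrase that never occurs in the corpus has an empty signature and scores 0.0 against everything.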

Analysis

  • The paper analyses the precision-recall tradeoff for PPDB's coverage of PropBank predicates and predicate-argument tuples.
  • Human evaluation was performed over a sample of 1900 paraphrases to establish the correlation of PPDB scores with human judgement.

Areas of Improvement

  • Segregation of data by domain or topic
  • Support for more languages
  • Improving paraphrase scores by using additional sources of information and handling paraphrase ambiguity better.
@sahilbadyal

Could you please explain the ALIGNMENT column in PPDB 2.0?

@serkanemreelci

I want to ask the same question as sahilbadyal. What is the ALIGNMENT column? I have not found any clue about it.

@shagunsodhani
Author

Hello :) I could not find the reference to the ALIGNMENT column. Could you point me to it? (Which section, for example?)

@serkanemreelci

serkanemreelci commented May 22, 2020

I think it is in the second version of the PPDB database, PPDB 2. Every line has the following structure:
LHS ||| PHRASE ||| PARAPHRASE ||| (FEATURE=VALUE )* ||| ALIGNMENT ||| ENTAILMENT

PPDB 2 is available at http://paraphrase.org/

@shagunsodhani
Author

Could you point me to the paper where it is mentioned? I can't find it on the website.

@serkanemreelci

I think that is the problem. I can't find it either. Yet there is a column with that name. It is also listed on the download page, but they don't explain it.

http://paraphrase.org/#/download

@shagunsodhani
Author

I would recommend contacting the authors.
