- The paper presents a database of ranked English and Spanish paraphrases derived by:
- Extracting lexical, phrasal, and syntactic paraphrases from large bilingual parallel corpora.
- Computing similarity scores for each paraphrase pair using Google n-grams and the Annotated Gigaword corpus.
- Link to the paper
- The basic idea is that if two English strings e1 and e2 translate to the same foreign string f (called the pivot), they are assumed to have the same meaning.
- Informally speaking, the input to the system is translation triplets of the form < f, e, φ >, where
- f is a foreign string
- e is an English string
- φ is a vector of feature functions
- The system can pivot over f to create paraphrase triplets < e1, e2, φp > where φp is computed using translation feature vectors φ1 and φ2
- For example, conditional paraphrase probability p(e2|e1) can be computed by marginalizing over all shared foreign language translations f:
- p(e2|e1) ≈ Sum over all f of p(e2|f) p(f|e1)
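The pivoting step can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function name `pivot_paraphrase_probs`, the nested-dictionary layout of the translation tables, and the toy probabilities are all assumptions made for the example.

```python
from collections import defaultdict

def pivot_paraphrase_probs(p_f_given_e, p_e_given_f):
    """Approximate p(e2|e1) as the sum over shared pivots f of p(e2|f) * p(f|e1).

    p_f_given_e: {english phrase e1: {foreign phrase f: p(f|e1)}}  (hypothetical layout)
    p_e_given_f: {foreign phrase f: {english phrase e2: p(e2|f)}}  (hypothetical layout)
    """
    scores = defaultdict(lambda: defaultdict(float))
    for e1, pivots in p_f_given_e.items():
        for f, p_f_e1 in pivots.items():
            for e2, p_e2_f in p_e_given_f.get(f, {}).items():
                if e2 != e1:  # a phrase is not listed as its own paraphrase
                    scores[e1][e2] += p_e2_f * p_f_e1
    return {e1: dict(candidates) for e1, candidates in scores.items()}

# Toy translation tables with invented probabilities.
p_f_given_e = {"thrown into jail": {"festgenommen": 0.5, "inhaftiert": 0.5}}
p_e_given_f = {
    "festgenommen": {"thrown into jail": 0.25, "arrested": 0.5, "imprisoned": 0.25},
    "inhaftiert": {"thrown into jail": 0.5, "imprisoned": 0.5},
}

probs = pivot_paraphrase_probs(p_f_given_e, p_e_given_f)
for e2, p in sorted(probs["thrown into jail"].items(), key=lambda kv: -kv[1]):
    print(f"p({e2} | thrown into jail) = {p}")
```

In this toy table "imprisoned" is reachable through both pivots, so it accumulates more probability mass than "arrested", which is reachable through only one.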
- Measure the similarity of phrases using distributional similarity.
- This can be used to rerank the paraphrases obtained from bilingual text, or to obtain paraphrases that could not be obtained from bilingual text alone.
- To describe a given phrase e1, collect contextual features like:
- n-gram based features for words (to the left and right of the given phrase)
- Lexical, lemma-based, POS and named entity unigrams and bigrams
- Dependency link features
- Syntactic features
- Aggregate all the features, over all the occurrences of e1, to obtain its distributional signature se1.
- Define the similarity between two phrases e1 and e2 as:
- sim(e1, e2) = dot(se1, se2) / (|se1| |se2|)
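The signature-and-cosine computation can be illustrated with a small Python sketch. It is a simplification under stated assumptions: only left and right context words are used as features (the paper's signatures also include lemma, POS, named-entity, dependency, and syntactic features), and the helper names `signature` and `cosine` and the toy sentences are hypothetical.

```python
import math
from collections import Counter

def signature(phrase, sentences, window=1):
    """Aggregate context-word counts over every occurrence of `phrase`.

    Simplification: only words up to `window` positions to the left and right
    of the phrase are counted, tagged with their side and distance.
    """
    target = phrase.split()
    n = len(target)
    sig = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] != target:
                continue
            for k in range(1, window + 1):
                if i - k >= 0:
                    sig[f"L{k}={tokens[i - k]}"] += 1
                right = i + n + k - 1  # index of the k-th token after the phrase
                if right < len(tokens):
                    sig[f"R{k}={tokens[right]}"] += 1
    return sig

def cosine(s1, s2):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(count * s2[feat] for feat, count in s1.items() if feat in s2)
    norm1 = math.sqrt(sum(c * c for c in s1.values()))
    norm2 = math.sqrt(sum(c * c for c in s2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# Toy corpus, invented for illustration.
corpus = [
    "he was imprisoned yesterday",
    "he was thrown into jail yesterday",
    "she was imprisoned today",
]
s_a = signature("imprisoned", corpus)
s_b = signature("thrown into jail", corpus)
print(round(cosine(s_a, s_b), 3))
```

Storing signatures as sparse counters keeps the dot product proportional to the number of features the two phrases share, which matters when the feature space is as large as an n-gram vocabulary.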
- Paper mentions two instances:
- English paraphrases - 169.6 Million paraphrases
- Spanish paraphrases - 161.6 Million paraphrases
- The paper analyses the precision-recall tradeoff for PPDB's coverage of PropBank predicates and predicate-argument tuples.
- Human evaluation was performed over a sample of 1900 paraphrases to establish the correlation of PPDB scores with human judgement.
- Segregation of data by domain or topic
- Support for more languages
- Improving paraphrase scores by using additional sources of information and better handling of paraphrase ambiguity.