Skip to content

Instantly share code, notes, and snippets.

@shagunsodhani
Created September 18, 2016 15:53
Show Gist options
  • Save shagunsodhani/efea5a42d17e0fcf18374df8e3e4b3e8 to your computer and use it in GitHub Desktop.
Save shagunsodhani/efea5a42d17e0fcf18374df8e3e4b3e8 to your computer and use it in GitHub Desktop.
Notes for GloVe paper

GloVe: Global Vectors for Word Representation

Introduction

  • Introduces a new global log-bilinear regression model which combines the benefits of both global matrix factorization and local context window methods.

Global Matrix Factorization Methods

  • Decompose large matrices into low-rank approximations.
  • eg - Latent Semantic Analysis (LSA)

Limitations

  • Poor performance on word analogy task
  • Frequent words contribute disproportionately high to the similarity measure.

Shallow, Local Context-Based Window Methods

  • Learn word representations using adjacent words.
  • eg - Continous bag-of-words (CBOW) model and skip-gram model.

Limitations

  • Since they do not operate directly on the global co-occurrence counts, they can not utilise the statistics of the corpus effectively.

GloVe Model

  • To capture the relationship between words i and j, word vector models should use ratios of co-occurene probabilites (with other words k) instead of using raw probabilites themselves.
  • In most general form:
    • F(wi, wj, wk~ ) = Pik/Pjk
  • We want F to encode information in the vector space (which have a linear structure), so we can restrict to the difference of wi and wj
    • F(wi - wj, wk~ ) = Pik/Pjk
  • Since right hand side is a scalar and left hand side is a vector, we take dot product of the arguments.
    • F( (wi - wj)T, wk~ ) = Pik/Pjk
  • F should be invariant to order of the word pair i and j.
    • F(wiTwk~) = Pik
  • Doing further simplifications and optimisations (refer paper), we get cost function,
    • J = Sum (over all i, j pairs in the vocabulary)[wiTwk + bi + bk - log(Xik)]2
    • f is a weighing function.
    • f(x) = min((x/xmax)α, 1)
    • Typical values, xmax = 100 and α = 3/4
    • b are the bias terms.

Complexity

  • Depends on a number of non-zero elements in the input matrix.
  • Upper bound by the square of vocabulary size
  • Since for shallow window-based approaches, complexity depends on |C| (size of the corpus), tighter bounds are needed.
  • By modelling number of co-occurrences of words as power law function of frequency rank, the complexity can be shown to be proportional to |C|0.8

Evaluation

Tasks

  • Word Analogies

    • a is to b as c is to ___?
    • Both semantic and syntactic pairs
    • Find closest d to wb - wc + wa (using cosine similarity)
  • Word Similarity

  • Named Entity Recognition

Datasets

  • Wikipedia Dumps - 2010 and 2014
  • Gigaword5
  • Combination of Gigaword5 and Wikipedia2014
  • CommonCrawl
  • 400,000 most frequent words considered from the corpus.

Hyperparameters

  • Size of context window.
  • Whether to distinguish left context from right context.
  • f - Word pairs that are d words apart contribute 1/d to the total count.
  • xmax = 100
  • α = 3/4
  • AdaGrad update

Models Compared With

  • Singular Value Decomposition
  • Continous Bag-Of-Words
  • Skip-Gram

Results

  • Glove outperforms all other models significantly.
  • Diminishing returns for vectors larger than 200 dimensions.
  • Small and asymmetric context windows (context window only to the left) works better for syntactic tasks.
  • Long and symmetric context windows (context window to both the sides) works better for semantic tasks.
  • Syntactic task benefited from larger corpus though semantic task performed better with Wikipedia instead of Gigaword5 probably due to the comprehensiveness of Wikipedia and slightly outdated nature of Gigaword5.
  • Word2vec’s performance decreases if the number of negative samples increases beyond about 10.
  • For the same corpus, vocabulary, and window size GloVe consistently achieves better results, faster.
@Liskutin
Copy link

I know its old post, but I have a question. If the model is basically trying to reconstruct the big co-occurrence matrix [well in the paper they said they transform that into probability ratios matrix, i guess], is the model reconstructing the matrix line by line or word by word?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment