
This repository contains an implementation of Word2Vec, as proposed by Mikolov et al. in "Efficient Estimation of Word Representations in Vector Space".


Implementing Word2Vec SkipGram

Visual Demo

The image below illustrates how the algorithm learns to map semantically similar words to nearby points in the embedding space.

Word_embed_plot

Introduction

This Python package uses PyTorch to implement the Word2Vec algorithm using skip-gram architecture.

The following resources were used to build this package; we suggest reading them either beforehand or while exploring the code.

  1. Word2Vec paper from Mikolov et al.
  2. Neural Information Processing Systems (NIPS) paper with improvements to Word2Vec, also from Mikolov et al.

Word2Vec

The Word2Vec algorithm finds much more efficient representations by learning a vector for each word. These vectors also contain semantic information about the words: words that show up in similar contexts, such as "code", "programming", or "python", will have vector representations that are close to each other.

In this implementation, we use the skip-gram architecture because it performs better than Continuous Bag-of-Words (CBOW). Here, we pass in a word and try to predict the words surrounding it in the text. By training on these (center, context) pairs, the network learns representations for words that show up in similar contexts.

Hopefully, the following diagram helps build the intuition:

skip_gram
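As a rough sketch of the idea, the pair-generation step could look like the snippet below. The helper names (`get_context`, `skipgram_pairs`) and the sampled window size are illustrative assumptions, not this package's actual API:

```python
import random


def get_context(words, idx, max_window=5):
    """Sample a context window around the word at position idx.

    Following the original paper, the window size is sampled uniformly
    from [1, max_window], so closer words are seen more often.
    """
    window = random.randint(1, max_window)
    start = max(0, idx - window)
    stop = idx + window
    return words[start:idx] + words[idx + 1:stop + 1]


def skipgram_pairs(words, max_window=5):
    """Yield (center, context) training pairs for skip-gram."""
    for idx, center in enumerate(words):
        for context in get_context(words, idx, max_window):
            yield center, context


# Toy usage with word tokens; in practice the tokens are integer ids.
tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = list(skipgram_pairs(tokens, max_window=2))
print(pairs[:5])
```

Sampling the window size uniformly, as the original paper does, gives nearby words more weight than distant ones.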

Data

We used a series of Wikipedia articles provided by Matt Mahoney; you can find a broader description by clicking here.

Model

Below is an approximate diagram of the general structure of the network:

skip_gram_arch
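Since the diagram is only approximate, here is a minimal PyTorch sketch of that structure, assuming a single embedding lookup followed by a fully connected layer with a log-softmax output; the class and layer names are placeholders, not the package's actual modules:

```python
import torch
from torch import nn


class SkipGram(nn.Module):
    """Illustrative skip-gram network: embedding -> linear -> log-softmax."""

    def __init__(self, vocab_size: int, embed_dim: int = 300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # embedding lookup table
        self.output = nn.Linear(embed_dim, vocab_size)    # scores over the vocabulary
        self.log_softmax = nn.LogSoftmax(dim=1)

    def forward(self, center_words: torch.Tensor) -> torch.Tensor:
        # center_words: tensor of word ids with shape (batch,)
        x = self.embed(center_words)      # (batch, embed_dim)
        scores = self.output(x)           # (batch, vocab_size)
        return self.log_softmax(scores)   # log-probabilities over context words


# Hypothetical usage: predict context-word distributions for three center words.
model = SkipGram(vocab_size=10_000, embed_dim=300)
log_probs = model(torch.tensor([1, 42, 7]))
print(log_probs.shape)  # torch.Size([3, 10000])
```

After training, only the embedding weights are kept; the output layer exists solely to provide a training signal.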

Results

In this section, we show some preliminary results. But first, let's talk a bit about how we can take advantage of the embeddings.

Cosine Similarity

We can encode a given word as a vector $\vec{a}$ using the embedding table, and then calculate its similarity with each word vector $\vec{b}$ in the embedding table with the following equation:

$$ \mathrm{similarity} = \cos(\theta) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}||\vec{b}|} $$
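As a sketch, assuming the trained embedding weights are available as a `(vocab_size, embed_dim)` tensor (for example, `model.embed.weight` from the model sketched above), the similarity of one word against the whole table could be computed like this:

```python
import torch


def cosine_similarity(embed_weight: torch.Tensor, word_vec: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between one word vector and every row of the embedding table."""
    dot = embed_weight @ word_vec                       # a . b for every b in the table
    norms = embed_weight.norm(dim=1) * word_vec.norm()  # |a| |b|
    return dot / norms


# Hypothetical usage with the model sketched above:
# weights = model.embed.weight.detach()         # (vocab_size, embed_dim)
# sims = cosine_similarity(weights, weights[42])
# top = sims.topk(6).indices                    # the word itself plus its 5 nearest neighbours
```

The most similar words are the ones with the highest scores; the word itself always ranks first with a similarity of 1.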

Random Examples

The image below shows some randomly selected words, followed by a set of words with which they share a similar context:

Random_results
