
This repository contains an implementation of Word2Vec, as proposed by Mikolov et al. in "Efficient Estimation of Word Representations in Vector Space".


Implementing Word2Vec SkipGram

Visual Demo

The image below illustrates how the algorithm learns to map semantically similar words to nearby points in the embedding space.

Word_embed_plot

Introduction

This Python package uses PyTorch to implement the Word2Vec algorithm using skip-gram architecture.

The following resources were used to build this package; we suggest reading them either beforehand or while exploring the code.

  1. Word2Vec paper from Mikolov et al.
  2. Neural Information Processing Systems (NIPS) paper with improvements to Word2Vec, also from Mikolov et al.

Word2Vec

The Word2Vec algorithm finds much more efficient representations by learning a vector for each word. These vectors also contain semantic information about the words: words that show up in similar contexts, such as "code", "programming", or "python", will have vector representations that are close to each other.

In this implementation, we use the skip-gram architecture because it performs better than Continuous Bag-of-Words (CBOW). Here, we pass in a word and try to predict the words surrounding it in the text. By training on these (center, context) pairs, the network learns representations for words that show up in similar contexts.

Hopefully, the following diagram helps build the intuition:

skip_gram
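As a rough sketch of the idea, the pair-generation step could look like the snippet below. The helper names (`get_context`, `skipgram_pairs`) and the sampled window size are illustrative assumptions, not this package's actual API:

```python
import random


def get_context(words, idx, max_window=5):
    """Sample a context window around the word at position idx.

    Following the original paper, the window size is sampled uniformly
    from [1, max_window], so closer words are seen more often.
    """
    window = random.randint(1, max_window)
    start = max(0, idx - window)
    stop = idx + window
    return words[start:idx] + words[idx + 1:stop + 1]


def skipgram_pairs(words, max_window=5):
    """Yield (center, context) training pairs for skip-gram."""
    for idx, center in enumerate(words):
        for context in get_context(words, idx, max_window):
            yield center, context


# Toy usage with word tokens; in practice the tokens are integer ids.
tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = list(skipgram_pairs(tokens, max_window=2))
print(pairs[:5])
```

Sampling the window size uniformly, as the original paper does, gives nearby words more weight than distant ones.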

Data

We used a series of Wikipedia articles provided by Matt Mahoney; you can find a broader description by clicking here.

Model

Below is an approximate diagram of the general structure of the network:

skip_gram_arch
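Since the diagram is only approximate, here is a minimal PyTorch sketch of that structure, assuming a single embedding lookup followed by a fully connected layer with a log-softmax output; the class and layer names are placeholders, not the package's actual modules:

```python
import torch
from torch import nn


class SkipGram(nn.Module):
    """Illustrative skip-gram network: embedding -> linear -> log-softmax."""

    def __init__(self, vocab_size: int, embed_dim: int = 300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # embedding lookup table
        self.output = nn.Linear(embed_dim, vocab_size)    # scores over the vocabulary
        self.log_softmax = nn.LogSoftmax(dim=1)

    def forward(self, center_words: torch.Tensor) -> torch.Tensor:
        # center_words: tensor of word ids with shape (batch,)
        x = self.embed(center_words)      # (batch, embed_dim)
        scores = self.output(x)           # (batch, vocab_size)
        return self.log_softmax(scores)   # log-probabilities over context words


# Hypothetical usage: predict context-word distributions for three center words.
model = SkipGram(vocab_size=10_000, embed_dim=300)
log_probs = model(torch.tensor([1, 42, 7]))
print(log_probs.shape)  # torch.Size([3, 10000])
```

After training, only the embedding weights are kept; the output layer exists solely to provide a training signal.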

Results

In this section, we show some preliminary results. But first, let's talk a bit about how we can take advantage of the embeddings.

Cosine Similarity

We can encode a given word as a vector $\vec{a}$ using the embedding table, and then calculate its similarity with each word vector $\vec{b}$ in the embedding table with the following equation:

$$ \mathrm{similarity} = \cos(\theta) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}||\vec{b}|} $$
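As a sketch, assuming the trained embedding weights are available as a `(vocab_size, embed_dim)` tensor (for example, `model.embed.weight` from the model sketched above), the similarity of one word against the whole table could be computed like this:

```python
import torch


def cosine_similarity(embed_weight: torch.Tensor, word_vec: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between one word vector and every row of the embedding table."""
    dot = embed_weight @ word_vec                       # a . b for every b in the table
    norms = embed_weight.norm(dim=1) * word_vec.norm()  # |a| |b|
    return dot / norms


# Hypothetical usage with the model sketched above:
# weights = model.embed.weight.detach()         # (vocab_size, embed_dim)
# sims = cosine_similarity(weights, weights[42])
# top = sims.topk(6).indices                    # the word itself plus its 5 nearest neighbours
```

The most similar words are the ones with the highest scores; the word itself always ranks first with a similarity of 1.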

Random Examples

The image below shows some randomly selected words, followed by a set of words with which they share a similar context:

Random_results
