Topics

Skip-gram is an algorithm from the word2vec family for learning fixed-size embeddings of the words in a vocabulary. The prior behind skip-gram comes from the distributional hypothesis, a concept from the 1950s. Simply put, it is the following:

If we take a random (center) word from a sentence and its surrounding (context) words on either side, their embeddings should lie close together in the vector space.

We know that the dot product is a simple way to measure the proximity of two vectors. Say we have a “black box” that gives us word embeddings. In the vanilla case, we use it to compute dot products between the center word and every word in our vocabulary V, giving us a vector of size V.
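
As a rough sketch of this scoring step (the embedding matrix `E`, vocabulary size, and dimensionality below are made-up placeholders standing in for the black box):

```python
import numpy as np

# Hypothetical setup: a vocabulary of V words, each mapped to a d-dimensional vector.
V, d = 10_000, 300
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))   # "black-box" embedding matrix, one row per word

center_idx = 0                # index of the center word in the vocab
center_vec = E[center_idx]    # its d-dimensional embedding

# Dot product of the center embedding with every embedding in the vocab:
scores = E @ center_vec       # shape (V,), one similarity score per vocab word
print(scores.shape)           # (10000,)
```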

Dot product values have no fixed range, so let’s apply softmax to turn this V-sized vector into a probability distribution with values between 0 and 1.
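
Concretely, writing v_c for the center word’s vector and u_w for the vector used to score vocabulary word w (the usual skip-gram notation, assumed here rather than defined above), the probability assigned to a word o is:

```latex
p(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
```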

Goal: the values corresponding to the context words should be higher than the others.
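
In the usual notation (again an assumption on top of the text above), with a corpus of T words and a window of m words on either side, this goal corresponds to minimizing the negative log-likelihood of the observed context words:

```latex
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p(w_{t+j} \mid w_t)
```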

If we use a neural network to model our “black box”:

  • Our model can use embedding or linear layers to give us word vectors
  • We can slide a window over our sentences and give (center, context) word pairs as input to the model. So if we choose to pick k words on either side of a center word as our context, then we will have 2k such pairs per center word
  • For each input (center, context) pair, e.g. (0, 4) indicating the 1st and 5th words in our vocab, we get our probability vector and pick out the 5th value, say 0.01. We want to maximize this value, i.e. minimize its negative. During implementation, we use cross-entropy loss (which minimizes the negative log of this probability) with 4 as the target class index
  • Use gradient descent and backprop to train on this loss and update the embeddings (see the sketch after this list)
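
A minimal PyTorch sketch of these four steps, assuming a toy corpus and made-up hyperparameters; this is the plain softmax variant, before any of the variations mentioned below:

```python
import torch
import torch.nn as nn

# Toy corpus and vocabulary (hypothetical, for illustration only).
tokens = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(tokens))
word2idx = {w: i for i, w in enumerate(vocab)}
V, d, window = len(vocab), 16, 2

# Sliding window: emit (center, context) index pairs.
pairs = []
for i in range(len(tokens)):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((word2idx[tokens[i]], word2idx[tokens[j]]))

class SkipGram(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)          # center-word embeddings
        self.out = nn.Linear(dim, vocab_size, bias=False)   # scores against every vocab word

    def forward(self, center_ids):
        return self.out(self.embed(center_ids))             # (batch, V) dot-product scores

model = SkipGram(V, d)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()          # softmax + negative log-likelihood

centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([o for _, o in pairs])   # target class index = context word

for epoch in range(100):
    optimizer.zero_grad()
    logits = model(centers)              # (num_pairs, V)
    loss = loss_fn(logits, contexts)     # maximize prob of the true context word
    loss.backward()                      # backprop
    optimizer.step()                     # gradient-descent update of the embeddings
```

After training, `model.embed.weight` holds the learned word vectors in this sketch.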

There are variations of the standard skip-gram algorithm: