Skip-gram is an algorithm from the word2vec family for learning fixed-size embeddings of the words in a vocabulary. Its prior comes from the distributional hypothesis of the 1950s. Simply put, it is the following:
If we take a random (center) word from a sentence and its surrounding (context) words on either side, their embeddings should lie close in the vector space.
We know that the dot product is a nice way to measure the proximity of two vectors. Say we have a “black box” that gives us word embeddings. In the vanilla case, we use it to compute dot products between the center word and all words in our vocabulary V, giving us a vector of size V.
Dot product values have no fixed range, so let’s apply softmax to turn this V-sized vector into a probability distribution, with values between 0 and 1 that sum to 1.
Goal: the values corresponding to the context words should be higher than the rest.
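Here is a minimal sketch of this scoring step, assuming a toy vocabulary of 10 words, 3-dimensional embeddings, and PyTorch; the random embedding matrix simply stands in for the “black box”.

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes: a vocabulary of 10 words, 3-dimensional embeddings.
V, D = 10, 3

# Stand-in for the "black box": a random embedding per vocabulary word.
embeddings = torch.randn(V, D)

center = embeddings[0]            # embedding of the center word (word index 0)
scores = embeddings @ center      # dot product with every word in the vocab -> shape (V,)
probs = F.softmax(scores, dim=0)  # squash into a probability distribution over the vocab

print(probs.sum())                # tensor(1.) -- the V values now sum to 1
print(probs[4])                   # probability assigned to the 5th word in the vocab
```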
If we use a neural network to model our “black box”:
- Our model can use embedding or linear layers to give us word vectors
- We can slide a window over our sentences and feed `(center, context)` word pairs as input to the model. So if we pick a fixed number of words on either side of a center word as our context, each of them gives us one such pair (see the pair-generation sketch after this list)
- For each input `(center, context)` pair, e.g. `(0, 4)` indicating the 1st and 5th word in our vocab, we get our output vector and pick out the 5th value, say `0.01`. We want to maximize this value, i.e. minimize its negative; during implementation, we use cross-entropy loss (the negative log of this probability) with `4` as the target class index (see the model sketch after this list)
- Use gradient descent and backprop to train over this loss and update the embeddings
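As referenced above, the sliding window turns a sentence into `(center, context)` pairs. The sketch below assumes the sentence has already been mapped to vocabulary indices and uses a hypothetical window of 2 words on either side.

```python
# Hypothetical sentence already mapped to vocabulary indices.
sentence = [3, 1, 7, 0, 4, 2]
window = 2  # number of context words taken on either side of the center word

pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))  # (center, context)

print(pairs[:4])  # [(3, 1), (3, 7), (1, 3), (1, 7)]
```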
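And a rough sketch of the model and one training step, again assuming PyTorch and toy sizes: an embedding layer produces the center-word vector, a linear layer computes its dot products against every vocabulary word, and cross-entropy loss with the context index as the target drives the gradient-descent update.

```python
import torch
import torch.nn as nn

V, D = 10, 3  # hypothetical vocabulary size and embedding dimension

embed = nn.Embedding(V, D)         # gives us the center-word vector
out = nn.Linear(D, V, bias=False)  # scores the center vector against every vocab word
loss_fn = nn.CrossEntropyLoss()    # softmax + negative log-likelihood in one step
optimizer = torch.optim.SGD(list(embed.parameters()) + list(out.parameters()), lr=0.05)

center = torch.tensor([0])   # the (center, context) pair (0, 4) from above
context = torch.tensor([4])

optimizer.zero_grad()
logits = out(embed(center))      # shape (1, V): dot-product scores against all words
loss = loss_fn(logits, context)  # cross-entropy with 4 as the target class index
loss.backward()                  # backprop...
optimizer.step()                 # ...and a gradient-descent update of the embeddings
```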
There are variations of standard skip-gram:
- skip-gram neural net variations
- skip-gram with negative sampling
- skip-gram with hierarchical softmax
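For reference, a rough sketch of the negative-sampling idea (not a full implementation): instead of a softmax over the whole vocabulary, each `(center, context)` pair is scored against a handful of randomly sampled “negative” words with a binary logistic loss. The sizes and the uniform negative sampler below are illustrative assumptions; word2vec itself samples negatives from a smoothed unigram distribution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, D, K = 10, 3, 5  # hypothetical vocab size, embedding dim, negatives per pair

center_emb = nn.Embedding(V, D)   # "input" vectors for center words
context_emb = nn.Embedding(V, D)  # "output" vectors for context words

center = torch.tensor([0])
context = torch.tensor([4])
negatives = torch.randint(0, V, (1, K))  # uniform here; word2vec uses unigram^0.75

c = center_emb(center)        # (1, D)
pos = context_emb(context)    # (1, D)
neg = context_emb(negatives)  # (1, K, D)

pos_score = (c * pos).sum(-1)                             # dot product with true context word
neg_score = torch.bmm(neg, c.unsqueeze(-1)).squeeze(-1)   # dot products with sampled negatives

# Binary logistic loss: push the positive score up and the negative scores down.
loss = -(F.logsigmoid(pos_score).mean() + F.logsigmoid(-neg_score).mean())
```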