The skip-gram model in word2vec is fundamentally based on a probabilistic view of word-context relationships. Let’s build this view from the ground up with a few basic assumptions:
- distributional hypothesis: Words that appear in similar contexts have similar meanings
- If we know a center word (like “loves”), we can predict its surrounding words (like “the”, “man”, “his”, and “son”)
Skip-gram treats this as a prediction problem: “Given a word, what is the likelihood of its surrounding words?”
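As a quick illustration, here is a minimal sketch (plain Python; the function name `skipgram_pairs` and the `window_size` parameter are just for this example) of how (center, context) training pairs fall out of a sentence:

```python
def skipgram_pairs(tokens, window_size=2):
    """Yield (center, context) pairs within a fixed window around each position."""
    pairs = []
    for i, center in enumerate(tokens):
        # Context = words at most `window_size` positions away, excluding the center itself.
        lo, hi = max(0, i - window_size), min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the man loves his son".split()
print(skipgram_pairs(sentence, window_size=2))
# ..., ('loves', 'the'), ('loves', 'man'), ('loves', 'his'), ('loves', 'son'), ...
```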
Given a center word and its context words (the words within a window of some size around the center word), the probability of the context given the center is expressed as:
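$$
P(\textrm{"the"}, \textrm{"man"}, \textrm{"his"}, \textrm{"son"} \mid \textrm{"loves"})
$$

(using the running example with center word “loves” and window size 2).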
Conditional Independence Assumption
The joint probability above is hard to model directly, so skip-gram assumes that, given the center word, each context word is generated independently of the others. This assumption reduces the complexity of the problem, making it tractable: the joint probability factorizes into a product of per-word conditional probabilities.
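$$
P(\textrm{"the"} \mid \textrm{"loves"}) \cdot P(\textrm{"man"} \mid \textrm{"loves"}) \cdot P(\textrm{"his"} \mid \textrm{"loves"}) \cdot P(\textrm{"son"} \mid \textrm{"loves"})
$$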
For a text sequence of length $T$ with $m$ as the context window size, we would like to maximize the probability of all word-context pairs in the corpus. Formally, the likelihood is:
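$$
\prod_{t=1}^{T} \prod_{\substack{-m \le j \le m,\\ j \ne 0}} P\!\left(w^{(t+j)} \mid w^{(t)}\right)
$$

where $w^{(t)}$ denotes the word at position $t$, and indices that fall outside the range $1, \dots, T$ are skipped.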
To make optimization easier, we take the log of the objective, turning the product into a sum. This gives us the log-likelihood objective function:
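$$
\sum_{t=1}^{T} \sum_{\substack{-m \le j \le m,\\ j \ne 0}} \log P\!\left(w^{(t+j)} \mid w^{(t)}\right)
$$

Maximizing this sum (equivalently, minimizing its negative) is the training objective.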
The authors chose to model the conditional probability in an interesting way:
Each word has two $d$-dimensional vector representations for calculating conditional probabilities. More concretely, for any word with index $i$ in the dictionary, denote by $\mathbf{v}_i$ and $\mathbf{u}_i$ its two vectors when used as a center word and as a context word, respectively. Why we need two separate vector representations rather than a single one is not immediately obvious.
With this, the conditional probability of generating any context word $w_o$ (with index $o$ in the dictionary) given the center word $w_c$ (with index $c$ in the dictionary) can be modeled by a softmax operation on vector dot products:
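$$
P(w_o \mid w_c) = \frac{\exp\!\left(\mathbf{u}_o^\top \mathbf{v}_c\right)}{\sum_{i \in \mathcal{V}} \exp\!\left(\mathbf{u}_i^\top \mathbf{v}_c\right)}
$$

where $\mathcal{V}$ is the set of dictionary indices (the vocabulary).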
This formulation ensures that the probability of selecting a particular context word increases when the vector similarity (the dot product $\mathbf{u}_o^\top \mathbf{v}_c$) is higher.
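As a quick sanity check, here is a minimal NumPy sketch of this softmax with toy embedding matrices (`V` for center-word vectors, `U` for context-word vectors; the sizes and names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10, 4                      # toy vocabulary and embedding size
V = rng.normal(size=(vocab_size, d))       # center-word vectors v_i
U = rng.normal(size=(vocab_size, d))       # context-word vectors u_i

def p_context_given_center(o, c):
    """P(w_o | w_c) via a softmax over dot products with every word in the vocabulary."""
    scores = U @ V[c]                      # u_i . v_c for all i
    scores -= scores.max()                 # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

print(p_context_given_center(o=3, c=7))    # a value in (0, 1); over all o the values sum to 1
```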
Practical optimizations
Computing the denominator (the normalization term) is expensive because it sums over every word in the dictionary. Some solutions are:
- skip-gram with hierarchical softmax: Use a binary tree structure over the vocabulary to reduce computation
- skip-gram with negative sampling: Use a small number of sampled negative words instead of normalizing over all words (see the sketch below)
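To make the second idea concrete, here is a minimal NumPy sketch of the negative-sampling loss for a single (center, context) pair; the function name `sgns_loss` and its inputs are assumptions of this example, not the reference implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_c, u_o, U_neg):
    """
    Skip-gram negative-sampling loss for one (center, context) pair.

    v_c   : vector of the center word
    u_o   : vector of the true context word (positive example)
    U_neg : matrix whose rows are vectors of k sampled "noise" words (negatives)
    """
    # Push the true pair's score up ...
    pos = -np.log(sigmoid(u_o @ v_c))
    # ... and the sampled noise pairs' scores down.
    neg = -np.log(sigmoid(-(U_neg @ v_c))).sum()
    return pos + neg
```

Each training step now touches only the true context word and the $k$ sampled noise words, rather than the whole vocabulary, which is what makes it cheap.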