
The skip-gram model in word2vec is fundamentally based on a probabilistic view of word-context relationships. Let's build this view from the ground up with a few basic assumptions:

  • Distributional hypothesis: words that appear in similar contexts have similar meanings
  • If we know a center word (like “loves”), we can predict its surrounding words (like “the”, “man”, “his”, and “son”)

Skip-gram treats this as a prediction problem: “Given a word, what is the likelihood of its surrounding words?”

Given the center word and the context words (the words within a window of size $m$ around the center word), the conditional probability for the example above is expressed as:

$$P(\textrm{"the"}, \textrm{"man"}, \textrm{"his"}, \textrm{"son"} \mid \textrm{"loves"})$$

Conditional Independence Assumption

If we assume the context words are generated independently of one another given the center word, this joint probability factorizes into a product of per-word conditionals:

$$P(\textrm{"the"}, \textrm{"man"}, \textrm{"his"}, \textrm{"son"} \mid \textrm{"loves"}) = P(\textrm{"the"} \mid \textrm{"loves"}) \cdot P(\textrm{"man"} \mid \textrm{"loves"}) \cdot P(\textrm{"his"} \mid \textrm{"loves"}) \cdot P(\textrm{"son"} \mid \textrm{"loves"})$$

This assumption reduces the complexity of the problem, making it tractable.
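To see why the factorization helps, here is a tiny sketch with made-up probability values (the numbers are purely illustrative, not from any trained model): the joint probability of all four context words collapses into a simple product of per-word conditionals.

```python
# Hypothetical per-word conditional probabilities P(context | "loves");
# the numbers are made up purely to illustrate the factorization.
p_given_loves = {"the": 0.12, "man": 0.05, "his": 0.08, "son": 0.03}

# Under the conditional independence assumption, the joint probability
# of all four context words is the product of the individual conditionals.
joint = 1.0
for p in p_given_loves.values():
    joint *= p

print(joint)  # 0.12 * 0.05 * 0.08 * 0.03 ≈ 1.44e-05
```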

For a text sequence of length $T$ and a context window of size $m$, we would like to maximize the probability of generating all context words given their center words across the corpus. Formally:

$$\prod_{t=1}^{T} \prod_{-m \leq j \leq m,\, j \neq 0} P(w^{(t+j)} \mid w^{(t)})$$

where $w^{(t)}$ denotes the word at position $t$ (context positions that fall outside the sequence are simply omitted).

To make optimization easier, we take the log of the objective, turning the product into a sum. This gives us the log-likelihood objective function:

$$\sum_{t=1}^{T} \sum_{-m \leq j \leq m,\, j \neq 0} \log P(w^{(t+j)} \mid w^{(t)})$$
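As a sketch of how this double sum could be assembled in code: the `log_prob` callable below is a stand-in for whatever model supplies $\log P(w^{(t+j)} \mid w^{(t)})$, and nothing here is taken from the original word2vec implementation.

```python
import math

def skipgram_log_likelihood(tokens, log_prob, window_size=2):
    """Sum log P(context | center) over all (center, context) pairs
    within `window_size` positions of each center word."""
    total = 0.0
    for t, center in enumerate(tokens):
        lo = max(0, t - window_size)
        hi = min(len(tokens), t + window_size + 1)
        for j in range(lo, hi):
            if j == t:
                continue  # skip the center word itself
            total += log_prob(center, tokens[j])
    return total

# Toy usage with a dummy uniform model over a 5-word vocabulary.
tokens = "the man loves his son".split()
uniform_log_prob = lambda center, context: math.log(1 / 5)
print(skipgram_log_likelihood(tokens, uniform_log_prob, window_size=2))
```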

The authors chose to model the conditional probability in an interesting way:

Each word has two $d$-dimensional vector representations for calculating conditional probabilities. More concretely, for any word with index $i$ in the dictionary, denote by $\mathbf{v}_i \in \mathbb{R}^d$ and $\mathbf{u}_i \in \mathbb{R}^d$ its two vectors when used as a center word and as a context word, respectively. Why we need two separate vector representations is a question that remains open for now.

With this, the conditional probability of generating any context word $w_o$ (with index $o$ in the dictionary) given the center word $w_c$ (with index $c$ in the dictionary) can be modeled by a softmax operation on vector dot products:

$$P(w_o \mid w_c) = \frac{\exp(\mathbf{u}_o^\top \mathbf{v}_c)}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)}$$

where $\mathcal{V}$ is the set of vocabulary indices.

This formulation ensures that the probability of selecting a particular context word increases as the vector similarity (the dot product $\mathbf{u}_o^\top \mathbf{v}_c$) grows.
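Here is a minimal NumPy sketch of this parameterization, with placeholder vocabulary size, embedding dimension, and random embeddings: `V` holds the center-word vectors $\mathbf{v}_i$, `U` holds the context-word vectors $\mathbf{u}_i$, and a softmax over dot products gives $P(w_o \mid w_c)$.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10, 4                 # placeholder sizes for illustration

V = rng.normal(size=(vocab_size, d))  # v_i: center-word vectors
U = rng.normal(size=(vocab_size, d))  # u_i: context-word vectors

def p_context_given_center(o, c):
    """Softmax over dot products: P(w_o | w_c)."""
    scores = U @ V[c]                 # u_i . v_c for every word i in the vocabulary
    scores -= scores.max()            # stabilize the exponentials
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

# Probabilities over the whole vocabulary sum to 1 for any center word.
print(p_context_given_center(o=3, c=7))
print(sum(p_context_given_center(o=i, c=7) for i in range(vocab_size)))  # ~1.0
```

Note that the denominator in the sketch already hints at the cost discussed next: every single probability query touches the full vocabulary.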

Practical Optimizations

Computing the denominator (the normalization term) is expensive because it sums over the entire vocabulary. Some solutions are: