
The skip-gram model in word2vec is fundamentally based on a probabilistic view of word-context relationships. Let's build this view from the ground up with a few basic assumptions:

  • Distributional hypothesis: words that appear in similar contexts have similar meanings
  • If we know a center word (like “loves”), we can predict its surrounding words (like “the”, “man”, “his”, and “son”)

Skip-gram treats this as a prediction problem: “Given a word, what is the likelihood of its surrounding words?”

Given the center word and the context words (the words within a window of size $m$ around the center word), the conditional probability for the example above is expressed as:

$$P(\textrm{"the"}, \textrm{"man"}, \textrm{"his"}, \textrm{"son"} \mid \textrm{"loves"})$$

Conditional Independence Assumption

If we assume the context words are generated independently of one another given the center word, this joint probability factorizes into a product of per-word conditionals:

$$P(\textrm{"the"}, \textrm{"man"}, \textrm{"his"}, \textrm{"son"} \mid \textrm{"loves"}) = P(\textrm{"the"} \mid \textrm{"loves"}) \cdot P(\textrm{"man"} \mid \textrm{"loves"}) \cdot P(\textrm{"his"} \mid \textrm{"loves"}) \cdot P(\textrm{"son"} \mid \textrm{"loves"})$$

This assumption reduces the complexity of the problem, making it tractable.
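To see why the factorization helps, here is a tiny sketch with made-up probability values (the numbers are purely illustrative, not from any trained model): the joint probability of all four context words collapses into a simple product of per-word conditionals.

```python
# Hypothetical per-word conditional probabilities P(context | "loves");
# the numbers are made up purely to illustrate the factorization.
p_given_loves = {"the": 0.12, "man": 0.05, "his": 0.08, "son": 0.03}

# Under the conditional independence assumption, the joint probability
# of all four context words is the product of the individual conditionals.
joint = 1.0
for p in p_given_loves.values():
    joint *= p

print(joint)  # 0.12 * 0.05 * 0.08 * 0.03 ≈ 1.44e-05
```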

For a text sequence of length $T$ and a context window of size $m$, we would like to maximize the probability of generating all context words given their center words across the corpus. Formally:

$$\prod_{t=1}^{T} \prod_{-m \leq j \leq m,\, j \neq 0} P(w^{(t+j)} \mid w^{(t)})$$

where $w^{(t)}$ denotes the word at position $t$ (context positions that fall outside the sequence are simply omitted).

To make optimization easier, we take the log of the objective, turning the product into a sum. This gives us the log-likelihood objective function:

$$\sum_{t=1}^{T} \sum_{-m \leq j \leq m,\, j \neq 0} \log P(w^{(t+j)} \mid w^{(t)})$$
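As a sketch of how this double sum could be assembled in code: the `log_prob` callable below is a stand-in for whatever model supplies $\log P(w^{(t+j)} \mid w^{(t)})$, and nothing here is taken from the original word2vec implementation.

```python
import math

def skipgram_log_likelihood(tokens, log_prob, window_size=2):
    """Sum log P(context | center) over all (center, context) pairs
    within `window_size` positions of each center word."""
    total = 0.0
    for t, center in enumerate(tokens):
        lo = max(0, t - window_size)
        hi = min(len(tokens), t + window_size + 1)
        for j in range(lo, hi):
            if j == t:
                continue  # skip the center word itself
            total += log_prob(center, tokens[j])
    return total

# Toy usage with a dummy uniform model over a 5-word vocabulary.
tokens = "the man loves his son".split()
uniform_log_prob = lambda center, context: math.log(1 / 5)
print(skipgram_log_likelihood(tokens, uniform_log_prob, window_size=2))
```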

The authors chose to model the conditional probability in an interesting way:

Each word has two $d$-dimensional vector representations for calculating conditional probabilities. More concretely, for any word with index $i$ in the dictionary, denote by $\mathbf{v}_i \in \mathbb{R}^d$ and $\mathbf{u}_i \in \mathbb{R}^d$ its two vectors when used as a center word and as a context word, respectively. Why we need two separate vector representations is a question that remains open for now.

With this, the conditional probability of generating any context word $w_o$ (with index $o$ in the dictionary) given the center word $w_c$ (with index $c$ in the dictionary) can be modeled by a softmax operation on vector dot products:

$$P(w_o \mid w_c) = \frac{\exp(\mathbf{u}_o^\top \mathbf{v}_c)}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)}$$

where $\mathcal{V}$ is the set of vocabulary indices.

This formulation ensures that the probability of selecting a particular context word increases as the vector similarity (the dot product $\mathbf{u}_o^\top \mathbf{v}_c$) grows.
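Here is a minimal NumPy sketch of this parameterization, with placeholder vocabulary size, embedding dimension, and random embeddings: `V` holds the center-word vectors $\mathbf{v}_i$, `U` holds the context-word vectors $\mathbf{u}_i$, and a softmax over dot products gives $P(w_o \mid w_c)$.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10, 4                 # placeholder sizes for illustration

V = rng.normal(size=(vocab_size, d))  # v_i: center-word vectors
U = rng.normal(size=(vocab_size, d))  # u_i: context-word vectors

def p_context_given_center(o, c):
    """Softmax over dot products: P(w_o | w_c)."""
    scores = U @ V[c]                 # u_i . v_c for every word i in the vocabulary
    scores -= scores.max()            # stabilize the exponentials
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

# Probabilities over the whole vocabulary sum to 1 for any center word.
print(p_context_given_center(o=3, c=7))
print(sum(p_context_given_center(o=i, c=7) for i in range(vocab_size)))  # ~1.0
```

Note that the denominator in the sketch already hints at the cost discussed next: every single probability query touches the full vocabulary.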

Practical Optimizations

Computing the denominator (the normalization term) is expensive because it sums over the entire vocabulary. Some solutions are: