In word2vec, we aim to learn good word embeddings by using local contextual information in sentences. The prior here is: words that appear close to each other in sentences have similar meanings, and should also be close to each other in the vector space compared to other words. Let's take the sentence "the cat sat on the mat". Here, for a window size of 1, let's choose our center word to be `sat`, so the contexts are {`cat`, `on`}. Let's look at the design choices and training process in the skip-gram algorithm:
- Context Window: For window size $m$, consider $2m$ context words ($m$ on each side of the center word). In practice word2vec uses a dynamic window size, sampling the effective window uniformly between 1 and $m$ for each center word (see the pair-generation sketch after the note below)
- Input Pairs: Process `(center, context)` pairs, e.g. `(sat, cat)`, `(sat, on)`
- Negative Sampling:
  - Treat non-context words as negative samples
  - Simplification: other context words (e.g. `on` while processing the pair `(sat, cat)`) also work as negatives, since we will see `(sat, on)` again later during training and update the weights for that pair at that step
- Similarity Metric: Use the dot product to measure vector similarity
- Dual Embedding Layers:
  - Input matrix $W$ (center words) and output matrix $W'$ (context/negative words), both of shape $|V| \times d$ for vocabulary size $|V|$ and embedding dimension $d$
  - Post-training: often use $W$, or the average of $W$ and $W'$, as the final embeddings
Note
See the discussions on why no non-linearity is used in word2vec and why we need two weight matrices in word2vec
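To make the windowing concrete, here is a minimal Python sketch of generating `(center, context)` pairs with a dynamic window; the function name `generate_pairs`, the `max_window` parameter, and whitespace tokenization are illustrative assumptions rather than the reference word2vec implementation.

```python
import random

def generate_pairs(tokens, max_window=2, seed=0):
    """Generate (center, context) pairs; the effective window for each
    center word is sampled uniformly from 1..max_window (dynamic window)."""
    rng = random.Random(seed)
    pairs = []
    for i, center in enumerate(tokens):
        w = rng.randint(1, max_window)  # dynamic window size for this position
        for j in range(max(0, i - w), min(len(tokens), i + w + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
print(generate_pairs(tokens, max_window=1))
# includes ('sat', 'cat') and ('sat', 'on')
```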
- Training Process (see the forward-pass sketch after this list):
  - Inputs: `(center, context)` pairs, with words represented as one-hot vectors $x \in \mathbb{R}^{|V|}$
  - Center word embedding: $v_c = W^\top x$ (i.e. the row of $W$ for the center word)
  - All vocab embeddings: use the output matrix $W'$
  - Embedding dot products: $z = W' v_c \in \mathbb{R}^{|V|}$
  - Softmax: $\hat{y} = \mathrm{softmax}(z)$
  - Loss: cross-entropy between $\hat{y}$ and the one-hot vector of the true context word
  - Goal: maximize the probability of the true context words, effectively learning to assign higher probabilities to likely context words
- Skip-gram with negative sampling (SGNS):
  - Choose $k$ negative samples instead of scoring all $|V|$ words
  - Replaces the full softmax with sigmoid (binary) classification of the true pair against the $k$ negatives; see the SGNS sketch at the end of this section
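As a sketch of the training-process bullets above, a PyTorch forward pass with the full softmax could look like the following; the class name `SkipGram`, the attributes `w_in`/`w_out`, and the toy indices are assumptions for illustration. Note that `nn.Embedding` looks up a row directly, which is equivalent to the one-hot multiplication $W^\top x$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGram(nn.Module):
    """Two embedding matrices, both of shape (vocab_size, dim); no non-linearity."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.w_in = nn.Embedding(vocab_size, dim)   # W:  center-word embeddings
        self.w_out = nn.Embedding(vocab_size, dim)  # W': context-word embeddings

    def forward(self, center_ids):
        v_c = self.w_in(center_ids)           # (batch, dim), same as W^T x for one-hot x
        scores = v_c @ self.w_out.weight.t()  # dot products with every vocab word, (batch, vocab)
        return scores                         # logits; softmax is applied inside the loss

vocab_size, dim = 10, 8
model = SkipGram(vocab_size, dim)
center = torch.tensor([3])   # toy index standing in for "sat"
context = torch.tensor([1])  # toy index standing in for "cat"
loss = F.cross_entropy(model(center), context)  # softmax + cross-entropy over the vocabulary
loss.backward()
```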
The model is simple, composed of just two embedding (linear) matrices. Data points are shuffled, batched, and fed to the model for training, and each word gets a chance to act as both a center and a context word (except at text boundaries).
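Below is a minimal SGNS sketch with a tiny shuffled, batched training loop. It assumes uniform negative sampling for simplicity (word2vec actually samples negatives from a smoothed unigram distribution); `sgns_loss`, `num_neg`, and the toy index pairs are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim, num_neg = 10, 8, 5
w_in = nn.Embedding(vocab_size, dim)   # W:  center-word matrix
w_out = nn.Embedding(vocab_size, dim)  # W': context/negative-word matrix

def sgns_loss(center_ids, context_ids):
    """Score the true (center, context) pair and num_neg sampled negatives
    with a sigmoid instead of a softmax over the whole vocabulary."""
    v_c = w_in(center_ids)                                                 # (batch, dim)
    u_pos = w_out(context_ids)                                             # (batch, dim)
    neg_ids = torch.randint(0, vocab_size, (center_ids.size(0), num_neg))  # uniform negatives (simplification)
    u_neg = w_out(neg_ids)                                                 # (batch, num_neg, dim)
    pos = (v_c * u_pos).sum(-1)                                            # dot products, (batch,)
    neg = torch.bmm(u_neg, v_c.unsqueeze(-1)).squeeze(-1)                  # (batch, num_neg)
    # maximize log sigma(pos) for the true pair and log sigma(-neg) for negatives
    return -(F.logsigmoid(pos) + F.logsigmoid(-neg).sum(-1)).mean()

# toy (center, context) index pairs, shuffled and batched each epoch
pairs = torch.tensor([[3, 1], [3, 4], [1, 3], [4, 3]])
optimizer = torch.optim.SGD(list(w_in.parameters()) + list(w_out.parameters()), lr=0.05)
for epoch in range(10):
    for batch in pairs[torch.randperm(len(pairs))].split(2):
        loss = sgns_loss(batch[:, 0], batch[:, 1])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# after training: use W, or the average of W and W', as the final embeddings
final_embeddings = ((w_in.weight + w_out.weight) / 2).detach()
```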