In word2vec, we aim to learn good word embeddings by using local contextual information in sentences. The prior here is: words that appear close to each other in sentences have similar meanings, and should also be close to each other in the vector space compared to other words. Let's take the sentence "the cat sat on the mat". Here, for a window size of 1, let's choose our center word to be `sat`, so the contexts are {`cat`, `on`}. Let's look at the design choices and training process in the skip-gram algorithm:
- Context Window: For window size $m$, consider $2m$ context words ($m$ on each side of the center word). In practice word2vec uses a dynamic window size, sampling the effective window uniformly between 1 and $m$ for each center word (see the pair-generation sketch after the note below)
- Input Pairs: Process `(center, context)` pairs, e.g. `(sat, cat)`, `(sat, on)`
- Negative Sampling:
  - Treat non-context words as negative samples
  - Simplification: other context words (e.g. `on` while processing the pair `(sat, cat)`) also work as negatives, since we will see `(sat, on)` again later during training and update the weights for that pair at that step
- Similarity Metric: Use the dot product to measure vector similarity
- Dual Embedding Layers:
  - Input matrix $W$ (center words) and output matrix $W'$ (context/negative words), both of shape $|V| \times d$ for vocabulary size $|V|$ and embedding dimension $d$
  - Post-training: often use $W$, or the average of $W$ and $W'$, as the final embeddings
Note
See the discussions on why no non-linearity is used in word2vec and why we need two weight matrices in word2vec
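To make the windowing concrete, here is a minimal Python sketch of generating `(center, context)` pairs with a dynamic window; the function name `generate_pairs`, the `max_window` parameter, and whitespace tokenization are illustrative assumptions rather than the reference word2vec implementation.

```python
import random

def generate_pairs(tokens, max_window=2, seed=0):
    """Generate (center, context) pairs; the effective window for each
    center word is sampled uniformly from 1..max_window (dynamic window)."""
    rng = random.Random(seed)
    pairs = []
    for i, center in enumerate(tokens):
        w = rng.randint(1, max_window)  # dynamic window size for this position
        for j in range(max(0, i - w), min(len(tokens), i + w + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
print(generate_pairs(tokens, max_window=1))
# includes ('sat', 'cat') and ('sat', 'on')
```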
- Training Process (see the forward-pass sketch after this list):
  - Inputs: `(center, context)` pairs, with words represented as one-hot vectors $x \in \mathbb{R}^{|V|}$
  - Center word embedding: $v_c = W^\top x$ (i.e. the row of $W$ for the center word)
  - All vocab embeddings: use the output matrix $W'$
  - Embedding dot products: $z = W' v_c \in \mathbb{R}^{|V|}$
  - Softmax: $\hat{y} = \mathrm{softmax}(z)$
  - Loss: cross-entropy between $\hat{y}$ and the one-hot vector of the true context word
  - Goal: maximize the probability of the true context words, effectively learning to assign higher probabilities to likely context words
- Skip-gram with negative sampling (SGNS):
  - Choose $k$ negative samples instead of scoring all $|V|$ words
  - Replaces the full softmax with sigmoid (binary) classification of the true pair against the $k$ negatives; see the SGNS sketch at the end of this section
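As a sketch of the training-process bullets above, a PyTorch forward pass with the full softmax could look like the following; the class name `SkipGram`, the attributes `w_in`/`w_out`, and the toy indices are assumptions for illustration. Note that `nn.Embedding` looks up a row directly, which is equivalent to the one-hot multiplication $W^\top x$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGram(nn.Module):
    """Two embedding matrices, both of shape (vocab_size, dim); no non-linearity."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.w_in = nn.Embedding(vocab_size, dim)   # W:  center-word embeddings
        self.w_out = nn.Embedding(vocab_size, dim)  # W': context-word embeddings

    def forward(self, center_ids):
        v_c = self.w_in(center_ids)           # (batch, dim), same as W^T x for one-hot x
        scores = v_c @ self.w_out.weight.t()  # dot products with every vocab word, (batch, vocab)
        return scores                         # logits; softmax is applied inside the loss

vocab_size, dim = 10, 8
model = SkipGram(vocab_size, dim)
center = torch.tensor([3])   # toy index standing in for "sat"
context = torch.tensor([1])  # toy index standing in for "cat"
loss = F.cross_entropy(model(center), context)  # softmax + cross-entropy over the vocabulary
loss.backward()
```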
The model is simple, composed of just two embedding (linear) matrices. Data points are shuffled, batched, and fed to the model for training, and each word gets a chance to act as both a center and a context word (except at text boundaries).
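Below is a minimal SGNS sketch with a tiny shuffled, batched training loop. It assumes uniform negative sampling for simplicity (word2vec actually samples negatives from a smoothed unigram distribution); `sgns_loss`, `num_neg`, and the toy index pairs are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim, num_neg = 10, 8, 5
w_in = nn.Embedding(vocab_size, dim)   # W:  center-word matrix
w_out = nn.Embedding(vocab_size, dim)  # W': context/negative-word matrix

def sgns_loss(center_ids, context_ids):
    """Score the true (center, context) pair and num_neg sampled negatives
    with a sigmoid instead of a softmax over the whole vocabulary."""
    v_c = w_in(center_ids)                                                 # (batch, dim)
    u_pos = w_out(context_ids)                                             # (batch, dim)
    neg_ids = torch.randint(0, vocab_size, (center_ids.size(0), num_neg))  # uniform negatives (simplification)
    u_neg = w_out(neg_ids)                                                 # (batch, num_neg, dim)
    pos = (v_c * u_pos).sum(-1)                                            # dot products, (batch,)
    neg = torch.bmm(u_neg, v_c.unsqueeze(-1)).squeeze(-1)                  # (batch, num_neg)
    # maximize log sigma(pos) for the true pair and log sigma(-neg) for negatives
    return -(F.logsigmoid(pos) + F.logsigmoid(-neg).sum(-1)).mean()

# toy (center, context) index pairs, shuffled and batched each epoch
pairs = torch.tensor([[3, 1], [3, 4], [1, 3], [4, 3]])
optimizer = torch.optim.SGD(list(w_in.parameters()) + list(w_out.parameters()), lr=0.05)
for epoch in range(10):
    for batch in pairs[torch.randperm(len(pairs))].split(2):
        loss = sgns_loss(batch[:, 0], batch[:, 1])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# after training: use W, or the average of W and W', as the final embeddings
final_embeddings = ((w_in.weight + w_out.weight) / 2).detach()
```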