
In word2vec, we aim to learn good word embeddings from local contextual information in sentences. The prior here is: words that appear close to each other in sentences have related meanings, and should therefore lie closer to each other in the vector space than to other words. Let's take a sentence: "the cat sat on the mat".

Here, for window size $m = 1$, let's choose our center word to be sat, so the contexts are {cat, on}. Let's look at the design choices and training process in the skip-gram algorithm:

  1. Context Window: For window size $m$, consider $2m$ context words ($m$ on each side of the center word). In practice, word2vec uses a dynamic window size
  2. Input Pairs: Process pairs like (center, context). E.g., (sat, cat), (sat, on)
  3. Negative Sampling:
    • Treat non-context words as negative samples
    • Simplification: other context words can serve as negatives (e.g., on while processing the pair (sat, cat)), since the pair (sat, on) will appear later as a positive example and its weights will be updated during that optimization step (see the pair-generation sketch after this list)
  4. Similarity Measure: Use the dot product to score how similar two vectors are
  5. Dual Embedding Layers:
    • Input matrix $W$ (center words) and output matrix $W'$ (context/negative words), both of size $|V| \times d$
    • Post-training: often use $W$ alone, or the average of $W$ and $W'$, as the final embeddings
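
Below is a minimal sketch of how (center, context) pairs and negatives could be generated under these choices. The function name `generate_pairs`, the toy sentence, and the uniform negative sampling are assumptions for illustration; word2vec itself draws negatives from a smoothed unigram distribution.

```python
import random

def generate_pairs(tokens, max_window=2, num_negatives=2):
    """Yield (center, context, negatives) triples from a token list.

    A dynamic window is used: for each center word the effective window
    size is drawn uniformly from 1..max_window, so nearby words are
    sampled as context more often than distant ones.
    """
    vocab = sorted(set(tokens))
    for i, center in enumerate(tokens):
        window = random.randint(1, max_window)              # dynamic window size
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            context = tokens[j]
            # Simplified negative sampling: any word other than the current
            # context counts as a negative (word2vec samples from a smoothed
            # unigram distribution instead of a uniform one).
            negatives = random.choices(
                [w for w in vocab if w != context], k=num_negatives)
            yield center, context, negatives

for triple in generate_pairs("the cat sat on the mat".split(), max_window=1):
    print(triple)   # e.g. ('sat', 'cat', ['mat', 'the']), ('sat', 'on', ...)
```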

Note

  1. Training Process:
    • Inputs: (center, context) pairs, with each word represented as a one-hot vector $x \in \mathbb{R}^{|V|}$
    • Center word embedding: $v_c = W^\top x$ (a row lookup in the input matrix $W$)
    • All vocab embeddings: use the output matrix $W'$
    • Embedding dot products: $z = W' v_c$, giving a score for every word in the vocabulary
    • Softmax: $\hat{y} = \mathrm{softmax}(z)$
    • Loss: cross-entropy between $\hat{y}$ and the one-hot context word (a numpy sketch of this forward pass follows this note)
  2. Goal: Maximize the probability of the true context words given the center word; the model thereby learns to assign higher probability to words that are likely to appear in the context
  3. Skip-gram with negative sampling (SGNS):
    • Choose $k$ negative samples instead of scoring all words in the vocabulary
    • Involves additional modifications to the basic skip-gram model: the softmax over the full vocabulary is replaced with binary logistic losses over the positive pair and the sampled negatives
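
Here is a small numpy sketch of the forward pass in the note above (one-hot lookup, dot products against every vocabulary embedding, softmax, cross-entropy). The toy sizes, the matrix names `W_in`/`W_out`, and the word indices are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 8                                   # toy vocabulary size and embedding dimension
W_in = rng.normal(scale=0.1, size=(V, d))     # input matrix: center-word embeddings
W_out = rng.normal(scale=0.1, size=(V, d))    # output matrix: context-word embeddings

def skipgram_loss(center_idx, context_idx):
    """Cross-entropy loss of the basic (full-softmax) skip-gram model
    for a single (center, context) pair."""
    v_c = W_in[center_idx]                    # center embedding: row lookup, i.e. W_in^T x
    scores = W_out @ v_c                      # dot product with every vocab embedding
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                      # softmax over the vocabulary
    return -np.log(probs[context_idx])        # cross-entropy vs. the one-hot context word

# e.g. center word at index 2, true context word at index 1
print(skipgram_loss(2, 1))
```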

The model is simple, composed of just two embedding (linear) matrices. Data points are shuffled, batched, and fed to the model for training, and each word gets a chance to serve as both a center and a context word (except at text boundaries).
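
To make the two-matrix setup concrete, here is a minimal PyTorch sketch of such a model trained with the SGNS objective on one shuffled mini-batch. The class name, hyperparameters, and random index tensors are illustrative assumptions, not the reference word2vec implementation.

```python
import torch
import torch.nn as nn

class SkipGramSGNS(nn.Module):
    """Two embedding matrices: one for center words, one for context/negative words."""

    def __init__(self, vocab_size, dim):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, dim)     # input matrix W (center words)
        self.out_embed = nn.Embedding(vocab_size, dim)    # output matrix W' (context/negatives)

    def forward(self, center, context, negatives):
        # center: (B,), context: (B,), negatives: (B, k) integer indices
        v_c = self.in_embed(center)                       # (B, d)
        u_o = self.out_embed(context)                     # (B, d)
        u_n = self.out_embed(negatives)                   # (B, k, d)
        pos_score = (v_c * u_o).sum(dim=-1)               # dot product with true context
        neg_score = torch.bmm(u_n, v_c.unsqueeze(-1)).squeeze(-1)   # (B, k)
        # SGNS objective: push positive scores up, negative scores down
        return -(nn.functional.logsigmoid(pos_score).mean()
                 + nn.functional.logsigmoid(-neg_score).mean())

model = SkipGramSGNS(vocab_size=10_000, dim=100)
opt = torch.optim.SGD(model.parameters(), lr=0.05)

# one shuffled mini-batch of (center, context, negatives) indices
center = torch.randint(0, 10_000, (32,))
context = torch.randint(0, 10_000, (32,))
negatives = torch.randint(0, 10_000, (32, 5))

loss = model(center, context, negatives)
opt.zero_grad()
loss.backward()
opt.step()
```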