Topics

BPE can be viewed as a compression algorithm that minimizes the number of tokens needed to represent a corpus while preserving information. Let’s formalize the process:

  1. Corpus Representation:

    • Corpus , where each word is a sequence of tokens (initially characters)
    • Word frequency: is the count of word
  2. Pair Frequency:

    • For each word , count pairs
    • Pair frequency:
  3. Merge Operation:

    • Select pair with maximum
    • Replace with new token in all words
    • Update vocabulary:
  4. Objective:

    • Minimize sequence length: , where is the number of tokens after merges
    • Constraint: Vocabulary size
  5. Encoding:

    • For input word , apply merge rules in order:
    • Map tokens to IDs:
  6. Decoding:

    • Reverse mapping:
    • Join tokens, removing </w>:

This is a greedy algorithm, as it always merges the most frequent pair, which is not guaranteed to be globally optimal but is effective in practice.