Topics
BPE can be viewed as a compression algorithm that minimizes the number of tokens needed to represent a corpus while preserving information. Let’s formalize the process:
-
Corpus Representation:
- Corpus , where each word is a sequence of tokens (initially characters)
- Word frequency: is the count of word
-
Pair Frequency:
- For each word , count pairs
- Pair frequency:
-
Merge Operation:
- Select pair with maximum
- Replace with new token in all words
- Update vocabulary:
-
Objective:
- Minimize sequence length: , where is the number of tokens after merges
- Constraint: Vocabulary size
-
Encoding:
- For input word , apply merge rules in order:
- Map tokens to IDs:
-
Decoding:
- Reverse mapping:
- Join tokens, removing
</w>
:
This is a greedy algorithm, as it always merges the most frequent pair, which is not guaranteed to be globally optimal but is effective in practice.