BPE as a compression algorithm

Topics

byte pair encoding

data compression

BPE can be viewed as a compression algorithm: it greedily reduces the number of tokens needed to represent a corpus without losing any information. Let’s formalize the process:

Corpus Representation:
- Corpus $C = \{w_{1}, w_{2}, \dots, w_{n}\}$, where each word $w_{i}$ is a sequence of tokens (initially characters)
- Word frequency: $f (w_{i})$ is the count of word $w_{i}$
Pair Frequency:
- For each word $w_{i} = t_{1} t_{2} \dots t_{k}$ , count pairs $(t_{j}, t_{j + 1})$
- Pair frequency: $f(p) = \sum_{w_{i}} f(w_{i}) \cdot \mathrm{count}(p, w_{i})$, where $\mathrm{count}(p, w_{i})$ is the number of times $p$ occurs in $w_{i}$
Merge Operation:
- Select pair $p = (t_{a}, t_{b})$ with maximum $f (p)$
- Replace $t_{a} t_{b}$ with new token $t_{ab}$ in all words
- Update vocabulary: $V \leftarrow V \cup \{t_{ab}\}$
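
The corpus representation, pair counting, and merge step above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (`get_pair_counts`, `merge_pair`, `train_bpe`, `total_length`), the toy word frequencies, and the `max_vocab_size` parameter are all assumptions made for the example. It also assumes an end-of-word marker `</w>` is appended to every word, that ties between equally frequent pairs go to the pair seen first, and that occurrences are merged left to right without overlap.

```python
from collections import Counter


def get_pair_counts(corpus):
    """Count adjacent token pairs, weighting each occurrence by f(w)."""
    pairs = Counter()
    for tokens, freq in corpus.items():
        for a, b in zip(tokens, tokens[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(corpus, pair):
    """Replace each left-to-right occurrence of `pair` with one joined token."""
    a, b = pair
    merged = {}
    for tokens, freq in corpus.items():
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def total_length(corpus):
    """Objective value: sum over words of f(w) * len(w)."""
    return sum(freq * len(tokens) for tokens, freq in corpus.items())


def train_bpe(word_freqs, max_vocab_size):
    """Greedy merge loop; stops once the vocabulary reaches max_vocab_size."""
    # Each word starts as a tuple of characters plus the assumed </w> marker.
    corpus = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    vocab = {tok for tokens in corpus for tok in tokens}
    merges = []
    while len(vocab) < max_vocab_size:
        pairs = get_pair_counts(corpus)
        if not pairs:
            break  # every word is already a single token
        best = max(pairs, key=pairs.get)  # ties: first pair encountered
        corpus = merge_pair(corpus, best)
        vocab.add(best[0] + best[1])
        merges.append(best)
    return merges, vocab, corpus


# Hypothetical toy corpus: word -> frequency.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab, corpus = train_bpe(word_freqs, max_vocab_size=14)
print(merges)
print(total_length(corpus))
```

Storing the corpus as a mapping from token tuples to frequencies means each distinct word is processed once per merge, regardless of how often it appears in the text.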
Objective:
- Minimize sequence length: $\sum_{w_{i}} f(w_{i}) \cdot \mathrm{len}(w_{i})$, where $\mathrm{len}(w_{i})$ is the number of tokens in $w_{i}$ after merges
- Constraint: Vocabulary size $|V| \leq V_{\max}$
Encoding:
- For input word $w$ , apply merge rules in order: $w \to t_{1} t_{2} \dots t_{m}$
- Map tokens to IDs: $t_{i} \to \mathrm{vocab}[t_{i}]$
Decoding:
- Reverse mapping: $\mathrm{ID}_{i} \to t_{i}$
- Join tokens and remove the end-of-word marker </w>: $t_{1} t_{2} \dots t_{m} \to \text{string}$
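
Encoding and decoding can be sketched as follows. The merge list and vocabulary here are written out by hand purely for illustration, and the names `encode` and `decode` are assumptions. The sketch handles a single word and has no fallback for characters missing from the vocabulary, which a real tokenizer would need.

```python
def encode(word, merges, vocab):
    """Split a word into characters, apply the merges in order, map to IDs."""
    tokens = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return [vocab[t] for t in tokens]  # raises KeyError on unknown tokens


def decode(ids, vocab):
    """Map IDs back to tokens, join them, and drop the </w> marker."""
    id_to_token = {i: t for t, i in vocab.items()}
    return "".join(id_to_token[i] for i in ids).replace("</w>", "")


# Hypothetical merge rules and vocabulary, hand-written for this example.
merges = [("e", "s"), ("es", "t"), ("est", "</w>")]
symbols = list("lowernstid") + ["</w>"] + [a + b for a, b in merges]
vocab = {tok: i for i, tok in enumerate(symbols)}

ids = encode("lowest", merges, vocab)
print(ids)
print(decode(ids, vocab))
```

Because the merges are replayed in the order they were learned, a word that never appeared in training, such as "lowest" here, is still split into known subword tokens.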

BPE is a greedy algorithm: at every step it merges the pair that is currently most frequent. This does not guarantee the globally shortest encoding for a given vocabulary size, but it is effective in practice.
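
A short calculation shows why the most frequent pair is the natural greedy choice. Here $L$ is shorthand, introduced only for this note, for the total sequence length, and $f(p)$ is taken to count every occurrence of $p$ weighted by word frequency. Assuming the occurrences of $p$ do not overlap within a word, one merge changes the objective by exactly $f(p)$:

$$
L = \sum_{w_{i}} f(w_{i}) \cdot \mathrm{len}(w_{i}), \qquad L_{\text{after}} = L - f(p)
$$

Each merged occurrence replaces two tokens with one, so choosing the pair with maximum $f(p)$ gives the largest immediate reduction in $L$. When occurrences overlap, as with the pair $(a, a)$ in the sequence $a\,a\,a$, fewer occurrences are merged than were counted, so the reduction is at most $f(p)$.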

Altamash Khan
