Byte-level BPE is the tokenizer variant used by GPT-2. Unlike standard byte pair encoding, it operates on bytes (256 base tokens, as in plain byte tokenization) rather than Unicode characters. This approach handles any text without needing an [UNK] token, since every string can be written as a valid byte sequence.
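To see why no [UNK] token is ever needed, here is a minimal standard-library sketch: any string, including accents and emoji, decomposes into byte values in the range 0–255, all of which are already in the base vocabulary.

```python
# Every string maps to a sequence of byte values 0-255,
# so the 256 base tokens cover any input with no unknowns.
text = "héllo 🙂"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)  # [104, 195, 169, 108, 108, 111, 32, 240, 159, 153, 130]
assert all(0 <= b < 256 for b in byte_ids)
assert bytes(byte_ids).decode("utf-8") == text  # lossless round trip
```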
Key differences from standard BPE:
- Base vocab is bytes (256 tokens), not Unicode chars
- Handles non-ASCII characters via byte merges (e.g., "é" is encoded as two bytes that can later be merged back into a single token)
- Treats whitespace as an explicit symbol (e.g., "Ġhello" represents " hello"); see the sketch below for where "Ġ" comes from
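The "Ġ" symbol is produced by GPT-2's byte-to-unicode mapping, which gives every byte value a printable character so the BPE merges can run on ordinary strings. The sketch below follows the logic of the published GPT-2 encoder code; treat it as an illustration of the trick rather than a drop-in replacement.

```python
def bytes_to_unicode():
    """Map every byte value (0-255) to a printable Unicode character.

    Printable bytes keep their own character; control and whitespace bytes
    are shifted into unused code points (the space byte 0x20 becomes 'Ġ').
    """
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_vals = printable[:]
    chars = printable[:]
    n = 0
    for b in range(256):
        if b not in byte_vals:
            byte_vals.append(b)
            chars.append(256 + n)  # shift into an unused code point
            n += 1
    return {b: chr(c) for b, c in zip(byte_vals, chars)}

mapping = bytes_to_unicode()
visible = lambda s: "".join(mapping[b] for b in s.encode("utf-8"))
print(visible(" hello"))  # Ġhello  (space byte -> 'Ġ')
print(visible("é"))       # Ã©      (two UTF-8 bytes, shown as two symbols)
```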
Advantages:
- No out-of-vocab issues
- Preserves exact whitespace/punctuation
- Handles arbitrary text, including rare Unicode (see the round-trip check below)
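These properties are easy to verify end to end. The snippet below is a small check using the gpt2 encoding from the third-party tiktoken package (an assumption; any GPT-2-compatible tokenizer would do): whitespace, accents, and non-Latin scripts all survive an encode/decode round trip.

```python
# Round-trip check with GPT-2's actual vocabulary.
# Assumes `tiktoken` is installed (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for text in ["  spaces   preserved  ", "naïve café", "日本語のテキスト", "🙂🚀"]:
    ids = enc.encode(text)
    assert enc.decode(ids) == text  # exact reconstruction, no [UNK] needed
    print(len(ids), "tokens for", repr(text))
```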
Warning
Byte-level BPE results in a slightly larger base vocabulary (256 bytes instead of ~100 common characters), but this is still manageable. GPT-2's final vocabulary was roughly 50,000 subword tokens (50,257, to be exact).
Used by: GPT-2/GPT-3, RoBERTa, GPT-Neo/GPT-J, CLIP's text encoder, etc.