
Byte-level BPE is the tokenizer variant used by GPT-2. Unlike standard byte pair encoding, it operates on bytes (a base alphabet of 256 tokens, as in byte tokenization) rather than Unicode characters. This lets it handle any text without an [UNK] token, since every string can be represented as a valid byte sequence.
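A quick way to see why no [UNK] token is needed: any string, in any script, serializes to UTF-8 bytes, and each byte is already one of the 256 base tokens. A minimal Python sketch (plain standard library, no tokenizer involved):

```python
# Every string round-trips losslessly through its UTF-8 bytes,
# so a 256-entry byte vocabulary covers all possible input text.
texts = ["hello world", "naïve café", "日本語", "🙂 emoji"]

for text in texts:
    byte_ids = list(text.encode("utf-8"))       # base "token" ids, all in 0..255
    restored = bytes(byte_ids).decode("utf-8")  # lossless round trip
    assert restored == text
    print(f"{text!r} -> {len(byte_ids)} bytes, ids {byte_ids[:6]}...")
```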

A few key differences from standard BPE:

  • Base vocab is bytes (256 tokens), not Unicode chars
  • Handles non-ASCII chars via byte merging (e.g. “é” becomes two bytes)
  • Treats whitespace as an explicit character (e.g. "Ġhello" represents " hello"; see the sketch after this list)
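A small illustration of both points, using Python's built-in UTF-8 encoding; the shift-by-256 rule shown for the space byte mirrors the byte-to-Unicode mapping published with GPT-2's tokenizer:

```python
# "é" is a single character but two UTF-8 bytes, so byte-level BPE sees two
# base tokens before any merges apply.
print(list("é".encode("utf-8")))   # [195, 169]

# GPT-2 maps each byte to a printable Unicode character so merge tables stay
# readable; bytes that aren't printable are shifted up by 256. The space byte
# (0x20) lands on U+0120, which is why a leading space shows up as "Ġ".
print(chr(0x20 + 256))             # 'Ġ'
print(chr(0x20 + 256) + "hello")   # 'Ġhello', i.e. how " hello" appears in the vocab
```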

Advantages:

  • No out-of-vocabulary issues
  • Preserves exact whitespace/punctuation
  • Handles arbitrary text, including rare Unicode (see the round-trip sketch after this list)
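These properties are easy to check with an off-the-shelf GPT-2 tokenizer. A sketch, assuming the Hugging Face transformers library is installed and the pretrained gpt2 tokenizer files can be downloaded:

```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")

# Irregular whitespace, a tab, and rare Unicode symbols: nothing falls back
# to an unknown token, and decoding reproduces the input byte-for-byte.
text = "  spaces,   tabs\tand rare chars: ∮𝔘☃"
ids = tok.encode(text)
restored = tok.decode(ids, clean_up_tokenization_spaces=False)

assert restored == text            # exact whitespace/punctuation preserved
print(tok.convert_ids_to_tokens(ids))
```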

Warning

Byte-level BPE starts from a slightly larger base vocab (256 byte tokens instead of the ~100 characters a character-level tokenizer typically needs), which is still easily manageable. GPT-2’s final vocabulary is 50,257 tokens: the 256 byte tokens, 50,000 learned merges, and the <|endoftext|> special token.

Usage: GPT-2/GPT-3, RoBERTa, GPT-Neo/GPT-J, CLIP’s text encoder, among others.