Represents text as a sequence of raw bytes in [0, 255]. Each character is encoded as 1-4 bytes in UTF-8.
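A quick illustrative sketch of the variable-width encoding (the characters here are chosen arbitrarily):
for ch in ["a", "é", "こ", "🙂"]:
    print(ch, list(ch.encode("utf-8")))
# a  [97]                   -> 1 byte (ASCII)
# é  [195, 169]             -> 2 bytes
# こ [227, 129, 147]        -> 3 bytes
# 🙂 [240, 159, 153, 130]   -> 4 bytes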
Advantages:
- Handles all Unicode text without special rules
- No out-of-vocabulary tokens (any text is a byte sequence)
- Simple implementation (see the sketch below)
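A minimal sketch of a byte tokenizer, assuming no special tokens and a fixed vocabulary of the 256 byte values (the class name ByteTokenizer is just illustrative):
class ByteTokenizer:
    """Maps text to a sequence of byte values in [0, 255] and back."""
    vocab_size = 256  # one token id per possible byte value

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        return bytes(ids).decode("utf-8")

tok = ByteTokenizer()
assert tok.decode(tok.encode("any text 🙂")) == "any text 🙂"  # lossless round trip, no OOV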
Disadvantages:
- Long sequences: 1 byte per character for ASCII but 3-4x the character count for non-Latin scripts (same problem as character tokenization, and worse for non-ASCII text)
- Harder for the model to learn meaningful patterns (like characters, individual bytes carry no semantic content)
- Compression ratio = 1 (each token covers exactly one byte, so there is no compression over the raw encoding)
text = "こんにちは"  # Japanese for "hello" (5 characters)
tokens = list(bytearray(text, "utf-8"))  # [227, 129, ..., 175]
print(len(tokens))  # 15 tokens, 3 bytes per character
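Continuing the snippet above (reusing text and tokens), the compression ratio in bytes per token works out to exactly 1, which is the disadvantage noted earlier:
num_bytes = len(text.encode("utf-8"))  # 15 bytes
print(num_bytes / len(tokens))         # 1.0 -> one byte per token, no compression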
Used as the base alphabet for more efficient subword schemes, such as byte-level BPE.
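A rough sketch of how BPE builds on the byte base: start from the 256 byte values and repeatedly merge the most frequent adjacent pair into a new token. Only a single merge step is shown here; the example string is arbitrary and real implementations differ in details.
from collections import Counter

def most_frequent_pair(ids: list[int]) -> tuple[int, int]:
    """Count adjacent token pairs and return the most common one."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

ids = list("the cat in the hat".encode("utf-8"))
pair = most_frequent_pair(ids)  # e.g. (116, 104), the bytes of "th"
new_id = 256                    # first id after the 256 byte values

# replace every occurrence of the pair with the new token id
merged, i = [], 0
while i < len(ids):
    if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
        merged.append(new_id)
        i += 2
    else:
        merged.append(ids[i])
        i += 1
print(len(ids), "->", len(merged))  # sequence shrinks as pairs merge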