Represents text as a sequence of raw bytes in [0, 255]. Each character is encoded as 1-4 bytes in UTF-8.
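A quick illustrative sketch of the variable-width encoding (the characters here are chosen arbitrarily):
for ch in ["a", "é", "こ", "🙂"]:
    print(ch, list(ch.encode("utf-8")))
# a  [97]                   -> 1 byte (ASCII)
# é  [195, 169]             -> 2 bytes
# こ [227, 129, 147]        -> 3 bytes
# 🙂 [240, 159, 153, 130]   -> 4 bytes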
Advantages:
- Handles all Unicode text without special rules
- No out-of-vocabulary tokens (any text is a byte sequence)
- Simple implementation (see the sketch below)
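A minimal sketch of a byte tokenizer, assuming no special tokens and a fixed vocabulary of the 256 byte values (the class name ByteTokenizer is just illustrative):
class ByteTokenizer:
    """Maps text to a sequence of byte values in [0, 255] and back."""
    vocab_size = 256  # one token id per possible byte value

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        return bytes(ids).decode("utf-8")

tok = ByteTokenizer()
assert tok.decode(tok.encode("any text 🙂")) == "any text 🙂"  # lossless round trip, no OOV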
Disadvantages:
- Long sequences: 1 byte per character for ASCII but 3-4x the character count for non-Latin scripts (same problem as character tokenization, and worse for non-ASCII text)
- Harder for the model to learn meaningful patterns (like characters, individual bytes carry no semantic content)
- Compression ratio = 1 (each token covers exactly one byte, so there is no compression over the raw encoding)
text = "こんにちは"  # Japanese for "hello" (5 characters)
tokens = list(bytearray(text, "utf-8"))  # [227, 129, ..., 175]
print(len(tokens))  # 15 tokens, 3 bytes per character
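Continuing the snippet above (reusing text and tokens), the compression ratio in bytes per token works out to exactly 1, which is the disadvantage noted earlier:
num_bytes = len(text.encode("utf-8"))  # 15 bytes
print(num_bytes / len(tokens))         # 1.0 -> one byte per token, no compression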
Used as the base alphabet for more efficient subword schemes, such as byte-level BPE.
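A rough sketch of how BPE builds on the byte base: start from the 256 byte values and repeatedly merge the most frequent adjacent pair into a new token. Only a single merge step is shown here; the example string is arbitrary and real implementations differ in details.
from collections import Counter

def most_frequent_pair(ids: list[int]) -> tuple[int, int]:
    """Count adjacent token pairs and return the most common one."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

ids = list("the cat in the hat".encode("utf-8"))
pair = most_frequent_pair(ids)  # e.g. (116, 104), the bytes of "th"
new_id = 256                    # first id after the 256 byte values

# replace every occurrence of the pair with the new token id
merged, i = [], 0
while i < len(ids):
    if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
        merged.append(new_id)
        i += 2
    else:
        merged.append(ids[i])
        i += 1
print(len(ids), "->", len(merged))  # sequence shrinks as pairs merge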