Topics

Represents text as a sequence of raw bytes, each a value in [0, 255]. Each character is encoded as 1-4 bytes in UTF-8.
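A quick way to see the 1-4 byte range is to encode a few characters and count the bytes. This is only an illustrative sketch: the sample characters and the helper name `utf8_length` are assumptions made for the example, not part of the notes.

```python
# Sample characters picked for illustration: ASCII letter, accented Latin
# letter, Japanese kana, emoji.
samples = ["a", "é", "こ", "🙂"]


def utf8_length(ch: str) -> int:
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    # Print the character, its byte values, and the byte count.
    print(repr(ch), list(ch.encode("utf-8")), utf8_length(ch))
```

ASCII characters take a single byte, while most kana and CJK characters take three and emoji typically take four.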

Advantages:

  • Handles all Unicode text without special rules
  • No out-of-vocabulary tokens (any text is a byte sequence)
  • Simple implementation
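The points above can be checked with a minimal round-trip sketch. The function names `encode` and `decode` are hypothetical helpers for this example; they just wrap Python's built-in UTF-8 codec.

```python
def encode(text: str) -> list[int]:
    """Turn text into byte-level tokens (integers in [0, 255])."""
    return list(text.encode("utf-8"))


def decode(tokens: list[int]) -> str:
    """Turn byte-level tokens back into text."""
    return bytes(tokens).decode("utf-8")


text = "naïve café 🙂 こんにちは"  # mixed scripts, no special handling needed
tokens = encode(text)

# Every token fits in the fixed 256-entry vocabulary, so nothing is out of vocabulary.
assert all(0 <= t <= 255 for t in tokens)
# Encoding is lossless: decoding returns the original text.
assert decode(tokens) == text
```

One caveat: not every byte sequence is valid UTF-8, so decoding tokens produced by a model (rather than by `encode`) may need `errors="replace"` or similar handling.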

Disadvantages:

  • Long sequences (1-4x the character count, e.g. 3x for Japanese text) (same problem as character tokenization)
  • Harder for model to learn meaningful patterns (like chars, bytes on their own have no semantic context)
  • Compression ratio = 1 (one byte per token, i.e. no compression)

Example:

    text = "こんにちは"  # Japanese "hello"
    tokens = list(bytearray(text, "utf-8"))  # [227, 129, ..., 175]
    print(len(tokens))  # 15
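To make the sequence-length and compression figures concrete, here is a small sketch that measures both. It assumes "compression" means UTF-8 bytes per token, which is one common reading; the helper names are hypothetical.

```python
def byte_tokenize(text: str) -> list[int]:
    """Byte-level tokenization: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def bytes_per_token(text: str, tokens: list[int]) -> float:
    """Assumed definition of compression: UTF-8 bytes divided by token count."""
    return len(text.encode("utf-8")) / len(tokens)


def tokens_per_char(text: str, tokens: list[int]) -> float:
    """How much longer the token sequence is than the character sequence."""
    return len(tokens) / len(text)


for text in ["hello", "こんにちは"]:
    tokens = byte_tokenize(text)
    print(text, bytes_per_token(text, tokens), tokens_per_char(text, tokens))
```

Bytes per token is always 1.0 for this tokenizer, while tokens per character depends on the script: 1.0 for the ASCII string and 3.0 for the Japanese one.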

Used as the base for more efficient subword tokenization schemes, such as BPE, which start from the 256 byte values and merge frequent adjacent pairs into larger tokens.
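As a rough illustration of how BPE can build on byte tokens, the sketch below performs a single merge step: it finds the most frequent adjacent pair and replaces it with a new token id. This is a simplified sketch, not a full BPE trainer; the function names and the choice of 256 as the first new id are assumptions for the example, and the input is assumed to contain at least two tokens.

```python
from collections import Counter


def most_frequent_pair(ids: list[int]) -> tuple[int, int]:
    """Return the adjacent token pair that occurs most often."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]


def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2  # skip both tokens of the merged pair
        else:
            out.append(ids[i])
            i += 1
    return out


ids = list("こんにちは".encode("utf-8"))  # 15 byte tokens
pair = most_frequent_pair(ids)
merged = merge(ids, pair, 256)  # 256 is the first id outside the byte range
print(pair, len(ids), len(merged))
```

Here the pair (227, 129) appears in four of the five characters, so one merge shortens the sequence from 15 tokens to 11. Repeating the step with ids 257, 258, and so on is what grows the vocabulary.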