Topics
Character-based tokenization converts text to tokens at the character level: each Unicode character becomes a separate token. It is simple, but comes with tradeoffs.
Note
The unit of compression rate is “bytes/token”, i.e., the average number of bytes per token. With character tokenization, a single Unicode character can occupy multiple bytes, so the compression rate is >= 1. Example: the character “–” (en dash) takes up 3 bytes, so the compression rate is 3/1 = 3.
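A minimal sketch of how the compression rate could be computed; the compression_rate helper below is illustrative, not part of the original notes.

def compression_rate(text: str, tokens: list[int]) -> float:
    # Average number of UTF-8 bytes per token (>= 1 for character tokenization).
    return len(text.encode("utf-8")) / len(tokens)

# The en dash "–" (U+2013) is 3 bytes in UTF-8 but a single character, hence one token.
text = "–"
tokens = [ord(c) for c in text]        # character-level tokenization
print(compression_rate(text, tokens))  # 3.0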
Core properties
- Treats text as sequence of Unicode code points
- Vocabulary size ~150k (all Unicode chars)
- No OOV tokens possible
- Sequence length equals text length in chars
While simple and universal (handling all languages and scripts with no OOV tokens), character tokenization produces long sequences and a large vocabulary. Many Unicode characters are quite rare, which is an inefficient use of the vocabulary. The long sequences are computationally expensive for transformer models (attention is quadratic in sequence length), and the model is forced to learn word composition from scratch.
class CharacterTokenizer:
    # Maps each Unicode character to its code point (encode) and back (decode).
    def encode(self, text: str) -> list[int]:
        return [ord(c) for c in text]

    def decode(self, tokens: list[int]) -> str:
        return ''.join(chr(t) for t in tokens)
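A quick round trip with the class above (the example string is arbitrary; note the token count equals the character count, and non-ASCII text needs no OOV handling):

tok = CharacterTokenizer()
ids = tok.encode("héllo 🌍")           # one token per Unicode character
print(ids)                             # [104, 233, 108, 108, 111, 32, 127757]
print(tok.decode(ids) == "héllo 🌍")   # True: encoding is lossless
print(len(ids))                        # 7 tokens for 7 characters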
Compared to other methods
- Byte tokenization: similar, but operates on UTF-8 bytes (fixed vocabulary of 256) instead of code points, so sequences are even longer (see the sketch below)
- Word tokenization: shorter sequences, but a large vocabulary and OOV problems
- Subword tokenization (BPE or WordPiece): a balance between vocabulary size and sequence length
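An illustrative comparison of sequence lengths for character- vs byte-level tokenization (the example string is arbitrary):

text = "héllo 🌍"

char_tokens = [ord(c) for c in text]      # Unicode code points, ~150k possible values
byte_tokens = list(text.encode("utf-8"))  # UTF-8 bytes, fixed vocabulary of 256

print(len(char_tokens))  # 7  -> shorter sequence, much larger vocabulary
print(len(byte_tokens))  # 11 -> longer sequence, tiny vocabulary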
Tip
Modern systems such as GPT use byte-level BPE instead, combining the benefits of the character/byte-level and subword approaches.
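If the tiktoken library is installed, a small sketch of byte-level BPE in practice; the exact token ids and count depend on the chosen encoding, so treat the printed outputs as indicative only:

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")  # byte-level BPE with a ~50k-token vocabulary
text = "héllo 🌍"
ids = enc.encode(text)

# Common text usually needs fewer tokens than characters or bytes,
# while any byte sequence can still be encoded (no OOV).
print(ids, len(ids))
print(enc.decode(ids) == text)  # True: BPE encoding is lossless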