
Word-based tokenization is a simple form of tokenization where we break text into units (“tokens”) by splitting on whitespace and punctuation, so that each token corresponds roughly to a “word” or a standalone symbol. It was the standard approach in early NLP pipelines.
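For example, the sentence “Hello, world!” is split into the four tokens ["Hello", ",", "world", "!"].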

Pros:

  • Intuitive: Words are natural semantic units that humans understand
  • Preserves meaning: Each token represents a complete semantic unit
  • Simplicity: Implementation is straightforward (e.g., splitting on spaces and punctuation)
  • Interpretability: Tokens directly correspond to words in the vocabulary
  • Shorter sequences: Produces far fewer tokens per text than character-level tokenization

Cons:

  • Large vocabulary size: Requires storing many tokens, increasing model size
  • Out-of-vocabulary (OOV) issues: Cannot handle unseen words without special handling, typically mapping them to an <unk> token
  • Inefficient for rare words: Dedicates vocabulary slots to infrequent terms
  • Language-specific: Different languages require different tokenization rules
  • Morphological variations: Different forms of the same word (run/running/runs) need separate tokens
  • Compound words: Struggles with languages that combine words (like German)
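
A minimal implementation along these lines is sketched below. It builds its vocabulary from a list of training texts, maps unseen words to an <|unk|> token when encoding, and reverses the id-to-token mapping when decoding.
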
import re
 
class WordTokenizer:
    def __init__(self, texts, add_special_tokens=True):
        # Build vocabulary from a list of texts
        tokens = []
        for text in texts:
            tokens.extend(self._tokenize(text))
        vocab = sorted(set(tokens))
        if add_special_tokens:
            vocab.extend(["<|endoftext|>", "<|unk|>"])
        self.str_to_int = {token: idx for idx, token in enumerate(vocab)}
        self.int_to_str = {idx: token for token, idx in self.str_to_int.items()}
        self.unk_token = "<|unk|>"
 
    def _tokenize(self, text):
        # Split on whitespace, punctuation, and double dashes,
        # keeping the separators as tokens via the capturing group
        tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        return [t.strip() for t in tokens if t.strip()]
 
    def encode(self, text):
        # Map each token to its id, falling back to the <|unk|> id for OOV words
        tokens = self._tokenize(text)
        return [self.str_to_int.get(token, self.str_to_int[self.unk_token]) for token in tokens]
 
    def decode(self, ids):
        tokens = [self.int_to_str.get(idx, self.unk_token) for idx in ids]
        text = " ".join(tokens)
        # Remove spaces before punctuation for readability
        text = re.sub(r'\s+([,.:;?_!"()\'])', r'\1', text)
        return text
 
    def vocab_size(self):
        return len(self.str_to_int)
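
A quick usage sketch; the training texts and the test sentence here are invented for illustration, and "unicorn" is deliberately out of vocabulary:

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Tokenization splits text into words.",
]
tokenizer = WordTokenizer(texts)

ids = tokenizer.encode("The lazy unicorn jumps.")
print(ids)                    # "unicorn" falls back to the <|unk|> id
print(tokenizer.decode(ids))  # -> "The lazy <|unk|> jumps."
print(tokenizer.vocab_size())

Note how the OOV word is lost irreversibly: once "unicorn" is encoded as <|unk|>, decoding cannot recover it, which is exactly the limitation that motivates subword tokenization.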