Unigram tokenization is a subword tokenization algorithm used in models like ALBERT, T5, mBART, and XLNet, often implemented via SentencePiece. Unlike additive methods such as BPE or WordPiece that build a vocabulary by merging tokens, Unigram is subtractive: it starts with a large candidate vocabulary and iteratively removes tokens until it reaches a target size, aiming to keep the tokens that minimize the cost of representing the corpus.
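As a concrete starting point, here is a minimal sketch of training and using a Unigram model with the SentencePiece Python package; the corpus file name, vocabulary size, and model prefix are illustrative placeholders, not values from this note.

```python
# Minimal sketch: training a Unigram model with the sentencepiece package.
# "corpus.txt" (one sentence per line), the vocab size, and the model prefix
# are placeholders chosen for illustration.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="unigram",   # writes unigram.model and unigram.vocab
    vocab_size=8000,
    model_type="unigram",     # the subtractive algorithm described below
)

sp = spm.SentencePieceProcessor(model_file="unigram.model")
print(sp.encode("Unigram keeps the most useful subwords.", out_type=str))
```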
Training Algorithm
The training is iterative, gradually reducing the vocabulary size over a series of steps.
- Initialization: Start with a large vocabulary. This typically includes all base characters (ensuring full coverage) plus common substrings, or even the result of a BPE run with a high vocabulary size. Frequencies of these tokens, $\text{freq}(t)$, are counted across the training corpus
- Iteration: Repeat until the target vocabulary size is reached:
- Calculate Loss: Using the current vocabulary $V$, estimate token probabilities from frequencies: $P(t) = \text{freq}(t) / \sum_{t' \in V} \text{freq}(t')$. Tokenize the training corpus optimally using the current vocabulary and probabilities (explained below). Calculate the total negative log likelihood of the corpus under this unigram language model assumption. This is our current “loss”
- Calculate Removal Scores: For each token $t$ (excluding base characters), calculate how much the total loss would increase if $t$ were removed
- Pruning: Remove a small percentage (e.g., 10%) of tokens whose removal causes the smallest loss increase. These are the least valuable tokens for minimizing the loss
After pruning, probabilities and scores ($-\log P(t)$) for the remaining tokens are recalculated using the new, smaller vocabulary. The corpus is re-tokenized, total loss re-evaluated, and the iteration repeats.
Tokenization (Inference)
Once training is complete and the final vocabulary is obtained, tokenization of new text works as follows:
Unigram assumes tokens are independent within a sequence, so the probability of a token sequence is $P(t_1, \dots, t_n) = \prod_{i=1}^{n} P(t_i)$. Each token probability $P(t_i)$ comes from the final trained model frequencies. To find the best segmentation for a word or text span, Unigram seeks the sequence of tokens from $V$ that maximizes this product probability. This is equivalent to minimizing the sum of negative log probabilities (scores): $\sum_{i=1}^{n} -\log P(t_i)$.
The Viterbi algorithm, a dynamic programming technique, efficiently finds the single optimal segmentation with the lowest total score. It essentially builds a graph of possible token splits and finds the lowest-cost path through the text.
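Below is a small, self-contained sketch of Viterbi segmentation under a unigram model (illustrative code, not the SentencePiece implementation); the demo at the bottom uses the frequency table from the worked example in the next section.

```python
# Sketch of Viterbi segmentation: given token scores (-log P(token)),
# find the lowest-cost split of a word via dynamic programming.
import math

def viterbi_segment(word, scores):
    n = len(word)
    # best[i] = (lowest total score for word[:i], start of the token ending at i)
    best = [(0.0, 0)] + [(math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in scores and best[start][0] + scores[piece] < best[end][0]:
                best[end] = (best[start][0] + scores[piece], start)
    # Walk back through the stored split points to recover the tokens
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1], best[n][0]

# Toy frequencies from the example below:
freqs = {"h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "p": 17, "pu": 17,
         "n": 16, "un": 16, "b": 4, "bu": 4, "s": 5, "hug": 15, "gs": 5, "ugs": 5}
total = sum(freqs.values())
scores = {t: -math.log(f / total) for t, f in freqs.items()}
print(viterbi_segment("hug", scores))   # (['hug'], ~2.64)
print(viterbi_segment("hugs", scores))  # (['h', 'ugs'], ~6.38); ties with ['hug', 's'] and ['hu', 'gs']
```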
Example Calculation
This example is taken from the Hugging Face course. Consider a small corpus of (word, frequency) pairs:
("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
and an initial vocabulary:
h: 15, u: 36, g: 20, hu: 15, ug: 20, p: 17, pu: 17, n: 16, un: 16, b: 4, bu: 4, s: 5, hug: 15, gs: 5, ugs: 5
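The list above can be reproduced with one plausible construction (an assumption that happens to match the listed frequencies, not something quoted from the course): take every strict substring of every corpus word as a candidate token, then count each candidate's occurrences across the corpus, weighted by word frequency.

```python
# Rebuilding the initial candidate vocabulary under the assumption described above.
from collections import Counter

corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

candidates = set()
for word in corpus:
    for i in range(len(word)):
        for j in range(i + 1, len(word) + 1):
            if word[i:j] != word:            # strict substrings only
                candidates.add(word[i:j])

freqs = Counter()
for word, freq in corpus.items():
    for token in candidates:
        freqs[token] += word.count(token) * freq

print(dict(freqs))           # h: 15, u: 36, g: 20, hu: 15, ug: 20, ...
print(sum(freqs.values()))   # 210
```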
Total Frequency Sum $= 210$. Token probabilities are derived from corpus frequencies, $P(t) = \text{freq}(t)/210$, and scores are $-\log P(t)$.
E.g., for “ug”: $P(\text{ug}) = 20/210 \approx 0.095$, so Score(“ug”) $= -\log(20/210) \approx 2.35$. Similarly:
- Score(“hug”) $= -\log(15/210) \approx 2.64$
- Score(“hu”) $= -\log(15/210) \approx 2.64$
- Score(“g”) $= -\log(20/210) \approx 2.35$
- Score(“pu”) $= -\log(17/210) \approx 2.51$
- Score(“ug”) $= -\log(20/210) \approx 2.35$
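These numbers are easy to check in code (natural log assumed); the sketch below simply recomputes the score table from the toy frequencies listed above.

```python
# Checking the example scores (natural log assumed).
import math

freqs = {"h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "p": 17, "pu": 17,
         "n": 16, "un": 16, "b": 4, "bu": 4, "s": 5, "hug": 15, "gs": 5, "ugs": 5}
total = sum(freqs.values())                       # 210
scores = {t: -math.log(f / total) for t, f in freqs.items()}

for token in ["hug", "hu", "g", "pu", "ug", "p", "u"]:
    print(f"Score({token!r}) = {scores[token]:.2f}")
# Score('hug') = 2.64, Score('hu') = 2.64, Score('g') = 2.35,
# Score('pu') = 2.51, Score('ug') = 2.35, Score('p') = 2.51, Score('u') = 1.76
```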
Tokenizing “hug” (a corpus word) with the initial vocabulary (this is the compute-intensive part, since enumerating every possible split grows combinatorially, which is why the Viterbi algorithm is used):
- Option 1: `["hug"]`. Score: Score(“hug”) ≈ 2.64
- Option 2: `["hu", "g"]`. Score: Score(“hu”) + Score(“g”) ≈ 2.64 + 2.35 ≈ 4.99
- Best (lowest score): `["hug"]`
Tokenizing “pug” with the initial vocabulary:
- Option 1: `["p", "u", "g"]`. Score: Score(“p”) + Score(“u”) + Score(“g”) ≈ 2.51 + 1.76 + 2.35 ≈ 6.63 (with Score(“p”) $= -\log(17/210) \approx 2.51$ and Score(“u”) $= -\log(36/210) \approx 1.76$)
- Option 2: `["p", "ug"]`. Score: Score(“p”) + Score(“ug”) ≈ 2.51 + 2.35 ≈ 4.87
- Option 3: `["pu", "g"]`. Score: Score(“pu”) + Score(“g”) ≈ 2.51 + 2.35 ≈ 4.87
- Best (lowest score): `["p", "ug"]` or `["pu", "g"]` (tie, pick the first one)
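For words this short you can simply enumerate every split and confirm the tables above; the brute-force sketch below does that with the same toy frequencies (real tokenizers rely on Viterbi instead, since the number of splits grows exponentially with word length).

```python
# Brute-force check of the "hug" and "pug" tables: enumerate every split of a
# word into in-vocabulary tokens and print them from best (lowest) score up.
import math

freqs = {"h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "p": 17, "pu": 17,
         "n": 16, "un": 16, "b": 4, "bu": 4, "s": 5, "hug": 15, "gs": 5, "ugs": 5}
total = sum(freqs.values())
scores = {t: -math.log(f / total) for t, f in freqs.items()}

def all_splits(word):
    """Yield every segmentation of `word` into tokens present in `scores`."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in scores:
            for rest in all_splits(word[end:]):
                yield [piece] + rest

for word in ("hug", "pug"):
    for tokens in sorted(all_splits(word), key=lambda ts: sum(scores[t] for t in ts)):
        print(word, tokens, round(sum(scores[t] for t in tokens), 2))
```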
Corpus Loss Calculation: Compute the optimal segmentation score for each word in the corpus using the current vocabulary. The total corpus loss is the sum of these best scores over all corpus words, each weighted by the word's frequency: $\text{Loss} = \sum_{w} \text{freq}(w) \cdot \text{Score}_{\text{best}}(w)$
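A sketch of that computation for the toy corpus, reusing `viterbi_segment()` and the `scores` table from the sketches above; for this example the total comes out to about 169.8.

```python
# Total corpus loss = sum over words of freq(word) * best segmentation score.
# Reuses viterbi_segment() and scores from the earlier sketches.
def corpus_loss(corpus, scores):
    loss = 0.0
    for word, freq in corpus.items():
        _, best_score = viterbi_segment(word, scores)
        loss += freq * best_score
    return loss

corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
print(round(corpus_loss(corpus, scores), 1))   # ~169.8 for the toy example
```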
Loss Increase Calculation for removing a token, say “hug”:
- Assume “hug” is removed from the vocabulary
- Re-tokenize only the parts of the corpus affected (those using “hug”: “hug” and “hugs”). E.g., “hug” must now be `["hu", "g"]` (or the tied `["h", "ug"]`), with score ≈ 2.64 + 2.35 ≈ 4.99 under the current scores
- Recalculate the total corpus loss using the new tokenizations and the recalculated scores for the smaller vocabulary
- Loss Increase = New Total Loss - Old Total Loss
Comparing loss increases helps identify tokens whose removal hurts the corpus representation least. These are the tokens pruned in that iteration
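Continuing the earlier sketches, the loss increase for a candidate token can be estimated by dropping it from the score table and re-running the segmentation. Note the simplification here: the remaining scores are kept fixed rather than renormalized for the smaller vocabulary, which is enough to rank candidates.

```python
# Loss increase for removing one token: drop it from the score table, re-segment,
# and compare totals. Reuses corpus_loss(), viterbi_segment(), corpus, and scores
# from the sketches above; other tokens' scores are kept fixed for simplicity.
def loss_increase(token, corpus, scores):
    reduced = {t: s for t, s in scores.items() if t != token}
    return corpus_loss(corpus, reduced) - corpus_loss(corpus, scores)

print(round(loss_increase("hug", corpus, scores), 1))  # ~23.5: "hug" is worth keeping
print(round(loss_increase("pu", corpus, scores), 1))   # ~0.0: "pu" can go cheaply
```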
Post-Pruning Steps
After removing the tokens with the smallest loss increase (say, 10% of them) in an iteration:
- The vocabulary shrinks to a smaller set $V'$
- Probabilities and scores for all remaining tokens are recalculated. The denominator (total frequency sum) decreases, so probabilities increase slightly, and scores decrease slightly
- Words in the corpus are re-evaluated using the Viterbi algorithm with the updated scores and vocabulary $V'$. Optimal segmentations might change even for words that didn’t use the removed tokens
- The total corpus loss is recalculated based on the new segmentations and scores. This becomes the baseline for the next pruning iteration
- The process repeats: calculate removal scores for tokens in $V'$, prune those whose removal increases the loss least, and update the vocabulary, scores, segmentations, and loss
This cycle continues until the target vocabulary size is achieved.
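Putting the pieces together, here is a compact, illustrative sketch of the whole pruning loop (not the SentencePiece implementation). It reuses `viterbi_segment()`, `corpus_loss()`, `corpus`, and `freqs` from the sketches above, always keeps single characters, recomputes scores from the surviving frequencies after every round, and evaluates removals with the current scores held fixed.

```python
# End-to-end sketch of the subtractive loop: score, measure each token's removal
# cost, prune the cheapest ones, recompute, and repeat until the target size.
# Reuses viterbi_segment(), corpus_loss(), corpus, and freqs defined earlier.
import math

def train_unigram(corpus, freqs, target_size, prune_fraction=0.1):
    freqs = dict(freqs)
    while len(freqs) > target_size:
        total = sum(freqs.values())
        scores = {t: -math.log(f / total) for t, f in freqs.items()}
        base = corpus_loss(corpus, scores)

        # Loss increase if each multi-character token were removed (characters stay)
        increases = {}
        for token in [t for t in freqs if len(t) > 1]:
            reduced = {t: s for t, s in scores.items() if t != token}
            increases[token] = corpus_loss(corpus, reduced) - base
        if not increases:          # only base characters left; cannot prune further
            break

        # Prune the tokens whose removal hurts the least
        n_remove = max(1, int(len(increases) * prune_fraction))
        n_remove = min(n_remove, len(freqs) - target_size)
        for token, _ in sorted(increases.items(), key=lambda kv: kv[1])[:n_remove]:
            del freqs[token]
    return freqs

print(sorted(train_unigram(corpus, freqs, target_size=10)))
```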