
WordPiece was developed at Google, used in their machine translation systems, and later popularized by BERT. It's very similar to BPE in that it builds a vocab of subwords by merging characters into bigger units.

The core idea is that instead of always merging the most frequent pair, WordPiece chooses the pair whose merge gives the largest improvement in the likelihood of the corpus under a language model. In practice this is approximated by a scoring function:

score(a, b) = freq(ab) / (freq(a) × freq(b))

where the subword units a and b are the merge candidates, freq(ab) is the frequency of the merged pair, and freq(a) and freq(b) are the individual frequencies.

The formula is closely related to pointwise mutual information (PMI): it tends to merge pairs that co-occur more often than you'd expect by chance given their individual frequencies.
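
As a small illustration of this scoring rule, here's a sketch with made-up counts (the numbers and unit names below are purely hypothetical):

def wordpiece_score(pair_freq, first_freq, second_freq):
    # score(a, b) = freq(ab) / (freq(a) * freq(b))
    return pair_freq / (first_freq * second_freq)

# Hypothetical corpus statistics: "##u" and "##g" nearly always occur together,
# while "t" and "##h" are individually very frequent but only sometimes adjacent.
unit_freq = {"##u": 40, "##g": 38, "t": 500, "##h": 450}
pair_freq = {("##u", "##g"): 36, ("t", "##h"): 60}

for (a, b), f in pair_freq.items():
    print((a, b), wordpiece_score(f, unit_freq[a], unit_freq[b]))
# ('##u', '##g') scores ~0.0237 while ('t', '##h') scores ~0.00027, even though
# the latter pair is more frequent in absolute terms, so WordPiece would merge ##u + ##g first.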

Algorithm (Training)

  1. Initialize Vocabulary: The vocab consists of the unique characters in the training data; each character is initially its own token (standard WordPiece works on characters, unlike byte-level BPE)
  2. Count Pairs: Count occurrences of all adjacent pairs of tokens
  3. Score Pairs: Apply the scoring function to every pair and find the pair of tokens with the highest score
  4. Merge Pair: Create a new token by merging the highest-scoring pair and add it to the vocabulary
  5. Replace Pairs: Replace all occurrences of the highest-scoring pair with the new merged token
  6. Repeat: Repeat steps 2-5 for a set number of merges, or until the desired vocabulary size is reached

Most of the steps are the same as in BPE; only the scoring function is different here. Additionally, any subword added to the vocab is prefixed with ## if it's a "continuation", i.e. it doesn't start a word. Example: for "hug", we initially have ['h', '##u', '##g'] in the vocab; if we then end up merging ##u and ##g, our updated vocab will be ['h', '##u', '##g', '##ug'], and so on. This is nice because it distinguishes a subword at the start of a word from the same subword inside a word: predict (word-initial) and ##predict (continuation) are different subwords.
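
To make the training loop concrete, here's a rough sketch over a toy word-frequency table, combining steps 1-6 with the ## convention (the function name wordpiece_train and the toy corpus are my own, not from any library):

from collections import defaultdict

def wordpiece_train(word_freqs, vocab_size):
    # Step 1: split each word into characters; non-initial characters get the "##" prefix.
    splits = {w: [w[0]] + ["##" + c for c in w[1:]] for w in word_freqs}
    vocab = {tok for pieces in splits.values() for tok in pieces}

    while len(vocab) < vocab_size:
        # Steps 2-3: count individual units and adjacent pairs, then score each pair.
        pair_freq, unit_freq = defaultdict(int), defaultdict(int)
        for word, pieces in splits.items():
            f = word_freqs[word]
            for tok in pieces:
                unit_freq[tok] += f
            for a, b in zip(pieces, pieces[1:]):
                pair_freq[(a, b)] += f
        if not pair_freq:
            break  # every word is already a single token
        # Step 4: merge the highest-scoring pair, dropping the second unit's "##".
        a, b = max(pair_freq, key=lambda p: pair_freq[p] / (unit_freq[p[0]] * unit_freq[p[1]]))
        merged = a + (b[2:] if b.startswith("##") else b)
        vocab.add(merged)
        # Step 5: replace every occurrence of that pair with the merged token.
        for word, pieces in splits.items():
            out, i = [], 0
            while i < len(pieces):
                if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == (a, b):
                    out.append(merged)
                    i += 2
                else:
                    out.append(pieces[i])
                    i += 1
            splits[word] = out
    return vocab

# Toy corpus (word -> count); the words and counts are made up.
print(sorted(wordpiece_train({"hug": 10, "hugs": 5, "bug": 4}, vocab_size=10)))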

Algorithm (Encoding)

During tokenization/encoding, it's greedy longest-match: for a given word, find the longest prefix of characters that is in the vocabulary, emit that token (with ## if it's not at the start of the word), then repeat for the remainder. If a word cannot be fully matched, i.e. its pieces aren't all in the vocab, the whole word becomes [UNK]. This can be seen in the following example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer.tokenize("word")   # ['word']
tokenizer.tokenize("word∫")  # ['[UNK]']

Observe above that when the input is word∫, the first piece that matches is word. The leftover subword is ∫, which isn't in the vocab:

print(tokenizer.vocab.get("∫")) # None

As a result of this, we do not get: ['word', '[UNK]']. Instead, we get ['[UNK]'].

Example:

Let’s say we want to tokenize: unpredictably, and we have the following vocab: ["un", "predict", "able", "##ably", "##pre", "##dict", "##ic", "##bly", "##un", "[UNK]"].

  • Our answer list: tokens = []
  • First we find the longest prefix that's in the vocab: un matches; tokens = ['un']
  • Next, we try to find the longest prefix of ##predictably
  • Note that we have predict, but not ##predict, in our vocab (they're different)
  • ##pre is what we match; tokens = ['un', '##pre']
  • Next, we match ##dict; tokens = ['un', '##pre', '##dict']
  • Finally, we match ##ably; tokens = ['un', '##pre', '##dict', '##ably']

We can verify this:

tokenizer.tokenize("unpredictably")
# ['un', '##pre', '##dict', '##ably']
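
For completeness, here's a minimal sketch of the greedy longest-match encoding itself, run against the toy vocab from the example (the function name wordpiece_encode is hypothetical; the real BERT tokenizer additionally handles whitespace/punctuation splitting and caps the number of characters per word):

def wordpiece_encode(word, vocab, unk_token="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        # Find the longest substring starting at `start` that is in the vocab,
        # adding the "##" prefix for non-initial pieces.
        end, match = len(word), None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # Some remainder couldn't be matched, so the whole word becomes [UNK].
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens

vocab = {"un", "predict", "able", "##ably", "##pre", "##dict", "##ic", "##bly", "##un", "[UNK]"}
print(wordpiece_encode("unpredictably", vocab))  # ['un', '##pre', '##dict', '##ably']
print(wordpiece_encode("unable", vocab))         # ['[UNK]'] -- "##able" isn't in this toy vocab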