Below is a greatly simplified version of BPE (for understanding purposes only). We start from a pre-tokenized corpus (the words are already split) and omit the encoding/decoding steps.
# Example: Learn BPE merges on a tiny corpus
from collections import Counter

corpus = ["low", "lower", "newest", "widest"]

# Step 1: Start with character tokens for each word
words = [list(w) + ["</w>"] for w in corpus]  # add an </w> end-of-word marker for clarity

# Function to get pair frequencies
def get_pair_freq(words):
    pairs = Counter()
    for tokens in words:
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            pairs[pair] += 1
    return pairs

# Learn BPE merges
vocab = set(ch for tokens in words for ch in tokens)
print("Initial vocab:", vocab)

for merge_round in range(5):
    pairs = get_pair_freq(words)
    if not pairs:
        break
    most_freq_pair, freq = pairs.most_common(1)[0]
    if freq < 2:
        break
    new_token = most_freq_pair[0] + most_freq_pair[1]
    print(f"Merge {merge_round+1}: merge {most_freq_pair} -> '{new_token}'")
    # Merge the most frequent pair in all words
    new_words = []
    for tokens in words:
        merged = []
        i = 0
        while i < len(tokens) - 1:
            if (tokens[i], tokens[i + 1]) == most_freq_pair:
                merged.append(new_token)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        # Append the last token if it wasn't part of a merge
        if i < len(tokens):
            merged.append(tokens[i])
        new_words.append(merged)
    words = new_words
    vocab.add(new_token)

print("Final vocab:", vocab)
print("Transformed words:", words)
Running this code shows the iterative merges (for instance, merging ('l', 'o') -> 'lo', ('est', '</w>') -> 'est</w>', etc.):
Initial vocab: {'w', 'r', 'n', 's', 't', '</w>', 'o', 'i', 'e', 'd', 'l'}
Merge 1: merge ('l', 'o') -> 'lo'
Merge 2: merge ('lo', 'w') -> 'low'
Merge 3: merge ('e', 's') -> 'es'
Merge 4: merge ('es', 't') -> 'est'
Merge 5: merge ('est', '</w>') -> 'est</w>'
Final vocab: {'w', 'r', 'n', 's', 't', 'est</w>', '</w>', 'est', 'o', 'i', 'e', 'lo', 'low', 'es', 'd', 'l'}
Transformed words: [['low', '</w>'], ['low', 'e', 'r', '</w>'], ['n', 'e', 'w', 'est</w>'], ['w', 'i', 'd', 'est</w>']]
After training, tokenizing a new word means applying these merges. In practice, one stores the merge table and, to encode a new word, splits it into characters and applies the learned merges in the order they were learned (each merge has a rank, or priority) until no further merges apply. This is how most BPE implementations encode text; greedy longest-match against the vocabulary (the strategy used by WordPiece-style tokenizers) usually yields similar segmentations, but it is not in general equivalent to replaying the merges.
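As a rough illustration, here is a minimal sketch of such an encoder. It assumes the five merges learned above have been stored as an ordered list; the names merges and encode_word are illustrative and do not appear in the training code above.

# Minimal sketch: encode a new word by replaying the learned merges in order
merges = [("l", "o"), ("lo", "w"), ("e", "s"), ("es", "t"), ("est", "</w>")]

def encode_word(word, merges):
    tokens = list(word) + ["</w>"]
    # Apply each learned merge, in training order, wherever it occurs in the word
    for left, right in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens[i:i + 2] = [left + right]
            else:
                i += 1
    return tokens

print(encode_word("lowest", merges))  # ['low', 'est</w>']
print(encode_word("wider", merges))   # ['w', 'i', 'd', 'e', 'r', '</w>']

Because the merges are replayed in training order, an unseen word like "lowest" reuses the 'low' and 'est</w>' tokens learned from the corpus, while a word whose pairs were never merged, like "wider", simply falls back to individual characters.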