Disclaimer: This Jupyter Notebook contains content generated with the assistance of AI. While every effort has been made to review and validate the outputs, users should independently verify critical information before relying on it. The SELENE notebook repository is constantly evolving. We recommend downloading or pulling the latest version of this notebook from Github.
Subword Tokenization (WordPiece)¶
The WordPiece algorithm is a widely used method for subword tokenization, particularly in natural language processing (NLP) tasks. Originally introduced for speech recognition, WordPiece gained prominence with its use in pre-trained language models such as BERT. The algorithm addresses key challenges in text representation, such as handling out-of-vocabulary (OOV) words and efficiently representing rare words, by breaking text into subword units instead of relying solely on full words or individual characters.
At its core, WordPiece seeks to build a vocabulary of subword units by balancing frequency and efficiency. The algorithm begins with an initial vocabulary consisting of individual characters and iteratively merges pairs of tokens that maximize the likelihood of the training data. This likelihood is computed based on the frequency of token pairs and their impact on the overall representation. The result is a compact vocabulary that includes common words as single units and decomposes rare or complex words into smaller, meaningful subword components. For example, a word like "unbelievable" might be tokenized into ["un", "##believable"], where the "##" prefix indicates that the subword is part of a larger word.
One of the key strengths of WordPiece is its ability to balance generalization and specificity. By using subwords, the algorithm can effectively handle new words that were not seen during training by combining known subword units. This property reduces the OOV problem and enables models to better understand rare words or morphologically rich languages. Additionally, subword tokenization allows for smaller vocabulary sizes compared to full-word tokenization, which reduces memory requirements and computational costs.
WordPiece has become a cornerstone in modern NLP pipelines due to its efficiency and adaptability. It enables pre-trained language models to achieve state-of-the-art performance across various tasks, including text classification, machine translation, and question answering. Its design, which merges data-driven insights with linguistic intuition, is closely related to other subword tokenization algorithms such as Byte Pair Encoding (BPE) and SentencePiece. As a result, WordPiece continues to be an essential component in advancing NLP technologies.
In this notebook, we will take a closer look at the WordPiece algorithm and implement a basic WordPiece tokenizer from scratch in a step-by-step and illustrative manner.
Note: The two subword tokenization algorithms WordPiece and Byte-Pair Encoding are very similar. Since BPE is arguably slightly simpler, and we occasionally refer to BPE in this notebook — to highlight the fundamental difference between WordPiece and BPE — we recommend first reading up on BPE. However, this is not required to understand how WordPiece works, only to appreciate how it compares to BPE.
Setting up the Notebook¶
Make Required Imports¶
This notebook requires the import of different Python packages but also additional Python modules that are part of the repository. If a package is missing, use your preferred package manager (e.g., conda or pip) to install it. If the code cell below runs without any errors, all required packages and modules have been successfully imported.
import re, regex, collections, json
from src.text.preprocessing.tokenizing import MyWordPieceTokenizer
from src.utils.data.files import *
Download Required Data¶
Some code examples in this notebook use data that first needs to be downloaded by running the code cell below. If this code cell throws an error, please check in the configuration file config.yaml whether the URL for downloading datasets is up to date and matches the one on Github. If not, simply download or pull the latest version from Github.
treasure_island_book, _ = download_dataset("text/corpora/books/treasure-island.txt")
File 'data/datasets/text/corpora/books/treasure-island.txt' already exists (use 'overwrite=True' to overwrite it).
WordPiece from Scratch¶
The fundamental idea behind WordPiece is quite straightforward to understand and easy to implement, and we will go through the algorithm step by step in the following. Practical implementations of WordPiece are more sophisticated as they consider additional refinements or aim for greater efficiency. Here, the focus is on understanding how WordPiece works for tokenization. To this end, we use a very artificial example document to better illustrate the inner workings of the algorithm. This example document is defined in the code cell below.
doc = 'low low low low low lower lower newest newest newest newest newest newest widest widest widest longer'
The algorithm also needs some special character to mark the continuation of a word; its purpose will become clear once we go through the algorithm. Again, to keep it simple, we simply use the underscore character _ for this. Note that this means that our input document is not allowed to contain underscore characters. This works perfectly fine for our example document here.
CTOKEN = '_'
Core Steps¶
We first go through the core steps of WordPiece before combining them into the final learning algorithm.
Pretokenize Text¶
WordPiece assumes that the corpus has been pretokenized into an initial list of tokens. The most basic approach is to pretokenize a text based on whitespace characters — in practice, slightly more sophisticated methods are used (discussed later). The code cell below contains the method pretokenize() that uses the built-in Python method split() to convert a text string into a list of tokens by splitting the string with respect to whitespace characters.
def pretokenize(doc):
return doc.split()
To show an example, we can apply this method to our example document.
print(pretokenize(doc))
['low', 'low', 'low', 'low', 'low', 'lower', 'lower', 'newest', 'newest', 'newest', 'newest', 'newest', 'newest', 'widest', 'widest', 'widest', 'longer']
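As a side note (a small illustrative addition), naive whitespace splitting keeps punctuation attached to the preceding word — a limitation we will come back to in the discussion at the end of this notebook.
# Punctuation sticks to the preceding word when splitting only on whitespace
print(pretokenize("Hello, world! This is a test."))
['Hello,', 'world!', 'This', 'is', 'a', 'test.']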
Initialize Corpus State and Vocabulary¶
The corpus state represents a data structure, here a dictionary, that maps each unique token after pretokenization to its number of occurrences in the training corpus. Since WordPiece is a bottom-up approach, we also need to split each token (typically a word) into its characters, and the CTOKEN is added to each character except the first one. Let's first create a utility method that performs this step for an initial word/token.
def generate_sequence(word, ctoken=CTOKEN):
return ' '.join([c if i == 0 else f"{ctoken}{c}" for i, c in enumerate(word)])
generate_sequence('fastest')
'f _a _s _t _e _s _t'
Appreciate the use of the special token to mark the continuation of a word (here the underscore character). Any token starting with that special character does not represent the beginning of a word.
Now we can go through all our documents to initialize the corpus state, i.e., the set of unique words/tokens across the whole corpus converted to their sequences. To this end, for each document we perform the following steps:
- Perform simple pretokenization using the pretokenize() method.
- Split each token into its characters using the utility method generate_sequence().
- Add the generated sequence to the corpus state; to avoid repeated entries of the same sequence, we also keep track of the number of times a token appears in the corpus — we therefore implement the corpus state as a dictionary with the sequences as the keys and the number of occurrences as the values.
During this process, we also initialize the vocabulary as the set of all unique characters. Later, the learning algorithm will add strings longer than a single character to the vocabulary. Note, however, that the vocabulary is not crucial for the WordPiece algorithm itself.
The method initialize_corpus_state() in the code cell below implements both steps and returns the initial corpus state and vocabulary.
def initialize_corpus_state(docs: list):
# Initial vocabulary as an empty set
vocabulary = set()
# Initialize dictionary representing the corpus state
corpus_state = collections.defaultdict(int)
# Loop over all documents
for doc in docs:
# For each word in the document, generate the sequence and update add it to the corpus state
for word in pretokenize(doc):
# Update vocabulary
for idx, char in enumerate(word):
if idx == 0:
vocabulary.add(char)
else:
vocabulary.add(f"{CTOKEN}{char}")
# Update sequence count
corpus_state[generate_sequence(word)] += 1
return dict(corpus_state), vocabulary
We can now run the method over our toy corpus which is just a single document.
corpus_state, vocabulary = initialize_corpus_state([doc])
Let's first have a look at the corpus state; we use the dumps() method of the json library for a more user-friendly output.
print(json.dumps(corpus_state, indent=2))
{
"l _o _w": 5,
"l _o _w _e _r": 2,
"n _e _w _e _s _t": 6,
"w _i _d _e _s _t": 3,
"l _o _n _g _e _r": 1
}
When looking at the result, you should recognize all five unique words/tokens that appear in our toy document, only converted to their initial sequences. The number reflects how often that word/token appeared in the document. For example, the key-value pair "w _i _d _e _s _t": 3 indicates that the word "widest" appeared three times in the document.
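As a quick sanity check (an illustrative addition, not part of the algorithm), the counts in the corpus state should add up to the total number of pretokenized tokens in the document:
# The sequence counts sum up to the number of pretokenized tokens (17 for our toy document)
assert sum(corpus_state.values()) == len(pretokenize(doc))
print(sum(corpus_state.values()))
17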
We can also look at the initial vocabulary, which is simply the set of all unique characters in the document, where each character that does not start a word carries the special CTOKEN prefix.
print(vocabulary)
{'_o', '_e', '_r', '_w', '_i', '_g', 'l', 'w', '_t', '_s', '_d', 'n', '_n'}
If we did nothing else — that is, did not actually perform any training — the WordPiece tokenizer would behave like a character tokenizer. We will actually try this later.
Token Merging Step¶
The key idea of WordPiece is to iteratively merge the most frequent token pairs in the corpus state to build a vocabulary that efficiently represents the structure of the input text. The goal is to identify and represent commonly occurring patterns — such as subwords, roots, prefixes, or suffixes — with single tokens. This reduces the number of tokens needed to encode the text while maintaining flexibility for representing rare or unseen words through combinations of smaller subword units.
By iteratively merging the most frequent adjacent token pairs, WordPiece strikes a balance between character-level and word-level tokenization. Character-level tokenization ensures coverage for any text but can result in extremely long sequences that are computationally expensive to process. On the other hand, word-level tokenization creates inefficiencies when encountering rare words, as each unique word would need a separate token. WordPiece bridges this gap by creating a subword-level tokenization scheme: frequent word components are merged into single tokens, while rare or out-of-vocabulary words can still be represented as sequences of smaller subword units.
Calculate Number of Token Pairs¶
WordPiece and BPE follow a very similar bottom-up approach by first splitting words into character tokens and then iteratively finding the next best pair of tokens to be merged into a new, larger token. While there are minor differences in the general setup, both algorithms fundamentally differ only in the criterion they use to find the next token pair to merge. BPE simply picks the token pair with the most occurrences in the corpus state. For example, assuming the initial corpus state from above, BPE calculates the number of occurrences of "l _o", "_o _w", "_w _e", "_e _r", "n _e", and so on, and then merges the token pair with the highest number of occurrences (randomly breaking ties).
In contrast to BPE, WordPiece does not choose the most frequent token pair, but the one that maximizes the likelihood of the training data once added to the vocabulary. The likelihood of a token pair $(t_1, t_2)$ is calculated as follows:
$$\large \frac{P(t_1, t_2)}{P(t_1) P(t_2)} $$where $P(t_1, t_2)$ is the probability of seeing the token pair $(t_1, t_2)$ given the corpus, $P(t_1)$ is the probability of seeing token $t_1$ in the corpus, and $P(t_2)$ is the probability of seeing token $t_2$ in the corpus. In other words, we need to calculate the likelihoods for all token pairs in our current corpus state and pick the one with the highest likelihood. While we could calculate all three probabilities to get the likelihood, let's see how we can simplify this formula. First, we can apply the formula for conditional probabilities $P(B|A) = \frac{P(A,B)}{P(A)}$, i.e., $P(A,B) = P(B|A)P(A)$, for two random events $A$ and $B$. With $A$ and $B$ being our two tokens $t_1$ and $t_2$, we can write:
$$\large \frac{P(t_1, t_2)}{P(t_1) P(t_2)} = \frac{P(t_2|t_1)P(t_1)}{P(t_1) P(t_2)} = \frac{P(t_2|t_1)}{P(t_2)} $$In terms of n-gram language models, $P(t_2)$ is the unigram probability of token $t_2$, and $P(t_2|t_1)$ is the bigram probability of the token sequence/pair $(t_1,t_2)$. We can therefore calculate both probabilities as follows:
$$\large P(t_2) = \frac{count(t_2)}{N}\ , \quad P(t_2|t_1) = \frac{count(t_1,t_2)}{count(t_1)} $$where $N$ is the total number of tokens in the current corpus state, $count(t_1,t_2)$ is the total number of occurrences of the token pair $(t_1,t_2)$, $count(t_1)$ is the total number of occurrences of token $t_1$, and $count(t_2)$ is the total number of occurrences of token $t_2$. Plugging these terms into our initial formula, we get:
$$\large \frac{P(t_1, t_2)}{P(t_1) P(t_2)} = \frac{P(t_2|t_1)}{P(t_2)} = \frac{\frac{count(t_1,t_2)}{count(t_1)}}{\frac{count(t_2)}{N}} = \frac{N\cdot count(t_1,t_2)}{count(t_1)\cdot count(t_2)} $$Thus, compared to BPE, we not only need to calculate the number of occurrences of each token pair $(t_1, t_2)$, but also the number of occurrences of the two component tokens $t_1$ and $t_2$. As a last little improvement, consider that we are not really interested in the exact likelihood values, but only want to find the token pair with the largest likelihood. Notice that the number of tokens $N$ does not depend on the current token pair, and therefore does not affect the final ranking of token pairs. This allows us to simply ignore $N$ in the calculation. Of course, this means that we no longer compute the exact likelihood; it is only proportional to the final formula:
$$\large \frac{P(t_1, t_2)}{P(t_1) P(t_2)} \propto \frac{count(t_1,t_2)}{count(t_1)\cdot count(t_2)} $$The method find_best_token_pair() in the code cell below implements this calculation by iterating over all token sequences in the corpus state and, for each sequence, iterating over all adjacent token pairs to sum up the total number of occurrences of each token pair and each individual token. The method then calculates the score for each token pair, where the score is proportional to the pair's likelihood and calculated by the formula given above. The method finally returns the token pair with the highest score (any ties are broken randomly). The method also returns a dictionary with all token pairs and their respective scores; this is only to have a look at the scores and is not needed for the algorithm itself.
def find_best_token_pair(corpus_state):
# Initialize dictionaries to keep track of all counts
token_counts = collections.defaultdict(int)
token_pair_counts = collections.defaultdict(int)
# Iterate over all token sequences in the current corpus state
for word, freq in corpus_state.items():
sequence = word.split()
# Special case: the sequence already a single token
if len(sequence) == 1:
token_counts[sequence[0]] += freq
continue
# Iterate over all token pair and update the count
for i in range(len(sequence)-1):
pair = (sequence[i], sequence[i+1])
token_counts[f"{sequence[i]}"] += freq
token_pair_counts[pair] += freq
# Don't forget the last token that was not captured by the previous loop
token_counts[sequence[-1]] += freq
# Calculate the score of all token pairs using the formula approximating the likelihood
token_pair_scores = {
' '.join(pair): count / (token_counts[pair[0]] * token_counts[pair[1]])
for pair, count in token_pair_counts.items()
}
# Return the most frequent pair (if their are ties, we just randomly break them) + all token pair counts
return max(token_pair_scores.keys(), key=(lambda key: token_pair_scores[key])), token_pair_scores
Using the method find_best_token_pair(), we can identify the first token pair to merge for our initial corpus state. In general, there might be multiple token pairs that share the same highest score. In this situation, we pick any token pair from this subset randomly.
# We re-initialize the corpus state and the vocabulary to ensure a consistent output of this code cell
corpus_state, vocabulary = initialize_corpus_state([doc])
top_token_pair, token_pair_scores = find_best_token_pair(corpus_state)
Let's first look at the scores for all token pairs. Again, this is just for illustrative purposes and not required for the WordPiece algorithm itself.
print(json.dumps(token_pair_scores, indent=2))
{
"l _o": 0.125,
"_o _w": 0.0673076923076923,
"_w _e": 0.03418803418803419,
"_e _r": 0.05555555555555555,
"n _e": 0.05555555555555555,
"_e _w": 0.02564102564102564,
"_e _s": 0.05555555555555555,
"_s _t": 0.1111111111111111,
"w _i": 0.3333333333333333,
"_i _d": 0.3333333333333333,
"_d _e": 0.05555555555555555,
"_o _n": 0.125,
"_n _g": 1.0,
"_g _e": 0.05555555555555555
}
We can see that the token pair with the highest score is "_n _g", so this is the pair that will be merged next. Let's verify that find_best_token_pair() has indeed returned this token pair.
print(top_token_pair)
_n _g
In the following, let's assume the best-scoring token pair that was returned is "_n _g".
Perform Merge¶
With the next best token pair found, we can actually perform the merging step. This step includes three core substeps:
- Merge the token pair into a new token (e.g., "_n _g" becomes "_ng")
- Add this new token (here, "_ng") to the vocabulary
- Update the corpus state by replacing all occurrences of the token pair with the new token — for our example, replace all occurrences of "_n _g" in the corpus state with "_ng".
The first step is very straightforward. To merge a token pair, we remove the CTOKEN character from the second token — the second token will always start with this special character — and concatenate it with the first token. The first token may or may not have the CTOKEN at the beginning. For convenience, the method create_new_token() wraps up this first step.
def create_new_token(token_pair):
t1, t2 = token_pair.split()
return ''.join([t1, re.sub(CTOKEN, "", t2)])
print(create_new_token("n _g"))
print(create_new_token("_n _g"))
ng
_ng
The method perform_merge() implements these three substeps, using the method create_new_token() for Step 1. Updating the corpus state is only a bit more complex. Since the corpus state is implemented as a dictionary, and replacing all occurrences of the token pair with the merged token means changing the keys of this dictionary, we cannot simply iterate over this dictionary and directly change the keys. We therefore break this update step up into two loops:
- In the first loop, we find all keys (i.e., the sequences in the corpus state) that contain the token pair (at least once) and therefore need to be updated. We identify these keys using a Regular Expression that matches the token pair in a key. For example, given the key "l _o _n _g _e _r", the Regular Expression looking for "_n _g" would match; it would not match for "n _e _w _e _s _t". We keep track of all matching keys using the dictionary matches, whose keys are the old keys in the corpus state and whose values are the new keys for the corpus state. We use the same Regular Expression to generate the new keys; for example, "l _o _n _g _e _r" becomes "l _o _ng _e _r".
- In the second loop, we iterate over all matches to add the new keys to the corpus state and give them the values of their respective old keys (those values, representing the counts, do not change). We do this using the built-in pop() method, which automatically removes the entry for the old key from the corpus state.
Finally, we return the merge (i.e., the tuple of the token pair and the new token — we will need this later), the updated corpus state, and the updated vocabulary.
def perform_merge(token_pair, corpus_state, vocabulary):
# Create new token by merging token pair
new_token = create_new_token(token_pair)
# Create merge as tuple of token pair and new token
merge = (token_pair, new_token)
# Add new token to vocabulary
vocabulary.add(new_token)
# Define search pattern
pattern = re.compile(r"(?<!\S)" + re.escape(token_pair) + r"(?!\S)")
# Loop through corpus state and record which keys/sequences need to be updated
matches = {}
for sequence, count in corpus_state.items():
for match in pattern.finditer(sequence):
matches[sequence] = pattern.sub(new_token, sequence)
# Perform the update of keys/sequences
for old, new in matches.items():
corpus_state[new] = corpus_state.pop(old)
# Return the updated corpus state and vocabulary
return merge, corpus_state, vocabulary
For testing, we can run the method perform_merge() over our initial corpus state and vocabulary. We re-initialize both the corpus state and vocabulary to ensure the output of the code cell is always the same.
# We re-initialize the corpus state and the vocabulary to ensure a consistent output of this code cell
corpus_state, vocabulary = initialize_corpus_state([doc])
merge, corpus_state, vocabulary = perform_merge('_n _g', corpus_state, vocabulary)
Let's have a look at all three return values of this method.
print(f"Merge: {merge}")
print()
print("Updated corpus state:")
print(json.dumps(corpus_state, indent=2))
print()
print("Updated vocabulary")
print(vocabulary)
Merge: ('_n _g', '_ng')
Updated corpus state:
{
"l _o _w": 5,
"l _o _w _e _r": 2,
"n _e _w _e _s _t": 6,
"w _i _d _e _s _t": 3,
"l _o _ng _e _r": 1
}
Updated vocabulary
{'_o', '_e', '_r', '_w', '_i', '_g', 'l', 'w', '_t', '_ng', '_s', '_d', 'n', '_n'}
In particular, notice the change in the corpus state with "l _o _n _g _e _r" becoming "l _o _ng _e _r". Also, "_ng" has been added to the vocabulary.
WordPiece vs. BPE — Intuition¶
Since WordPiece and BPE fundamentally differ only with respect to how the next token pair for merging is determined, it is a good time to try to get some intuition behind that difference. Recall that the score for a token pair $(t_1, t_2)$ in WordPiece is computed as:
$$\large \frac{count(t_1,t_2)}{count(t_1)\cdot count(t_2)} $$In contrast, BPE only uses $count(t_1,t_2)$! This means that WordPiece may favor a token pair even if it is not the most frequent one. More specifically, WordPiece favors token pairs where the two tokens, if they appear in a word, very often appear together. For example, given our toy dataset, the token pair ("_n", "_g") appears only once in the corpus state, while, e.g., ("_e", "_s") appears nine times (and would therefore be chosen by BPE). However, both tokens "_n" and "_g" happen to appear only once overall. In simple terms, this means that when we see the token "_n", we are very likely to see it together with the token "_g", and vice versa. In other words, the higher the score of a token pair, the more frequently the two tokens co-occur compared to what would be expected if both tokens were independent.
Or, just by dissecting the formula above, the score is high if the numerator (the count of the token pair) is large and the denominator (the product of the two token counts) is small. Of course, the two inequalities $count(t_1,t_2) \leq count(t_1)$ and $count(t_1,t_2) \leq count(t_2)$ always hold. Thus, the maximum score for a token pair is $1$. Therefore, for the denominator to be as small as possible, the two tokens of a pair should appear in as few other token pairs as possible (or on their own). The more often one or both tokens appear "outside" that pair, the lower the score.
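To make this concrete with a small hand-computed check (an illustrative addition; the counts are read off the initial corpus state above): "_e" occurs 18 times, "_s" occurs 9 times, and the pair ("_e", "_s") occurs 9 times, while "_n", "_g", and the pair ("_n", "_g") each occur exactly once.
# Hand-computed scores for two token pairs from the initial corpus state
count_e, count_s, count_es = 18, 9, 9   # counts of "_e", "_s", and the pair ("_e", "_s")
count_n, count_g, count_ng = 1, 1, 1    # counts of "_n", "_g", and the pair ("_n", "_g")
print(count_es / (count_e * count_s))   # matches the score of "_e _s" above
print(count_ng / (count_n * count_g))   # matches the score of "_n _g" above
0.05555555555555555
1.0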
WordPiece Learning Algorithm¶
With these three core steps, that is:
- Initialization of the corpus state and vocabulary
- Finding the best token pair (for the current corpus state), and
- Performing the merging step and updating the corpus state
We now have everything in place to implement the WordPiece learning algorithm by plugging together the methods implementing those three steps; see the method wordpiece_learn() in the code cell below. Although the implementation of this method is rather straightforward, three small details are worth mentioning:
- One of the aforementioned advantages of WordPiece is that we can specify the maximum size of the resulting vocabulary. Since each merging step adds a new token to the vocabulary, we can restrict the size of the vocabulary by limiting the number of merging steps. However, since our initial vocabulary is not empty but the set of all unique characters, the number of merging steps — that is, the number of iterations num_iter in the code — derives from the difference between the specified maximum vocabulary size max_vocab_size and the size of the initial vocabulary.
- In principle, particularly if the corpus is not very large, the algorithm might attempt more merges than are possible. This happens when all words in the corpus state have been merged back into their original form. For example, after a certain number of iterations, say, "n _e _w _e _s _t" will have been merged into "newest". If this is true for all words in the corpus state, the corpus state no longer contains any pair of tokens. In this case, the method find_best_token_pair() will throw an error and we exit the loop, since the learning algorithm has finished.
- In each iteration, we keep track of the most recent merge by adding it to a list of all previous merges. This list of merges is in fact the most important return value of the learning algorithm, as it is used for tokenizing text. Note that the order matters, as we want to merge tokens in a new text in the same order as we merged them during the learning phase.
def wordpiece_learn(corpus, max_vocab_size=10000):
# Initialize corpus state and vocabulary
corpus_state, vocabulary = initialize_corpus_state(corpus)
# Initialize the list of merges
merges = []
# Calculate the number of merging steps to ensure the maximum size of the vocabulary
num_iter = max_vocab_size - len(vocabulary)
for _ in range(num_iter):
# Find the most frequent pair; if this fails, no more merging was possible and we can stop
try:
top_token_pair, _ = find_best_token_pair(corpus_state)
except:
break
# Update corpus state and the vocabulary
merge, corpus_state, vocabulary = perform_merge(top_token_pair, corpus_state, vocabulary)
# Add newly merged token to vocabulary
merges.append(merge)
# Return list of merges, the corpus state, and the vocabulary
return merges, corpus_state, vocabulary
Let's train a WordPiece tokenizer over our toy document.
Your turn: Try different values for max_vocab_size and inspect the results. Of course, the larger this value, the larger the final vocabulary and the list of merges, but the corpus state also shows larger tokens. For example, with a large value, say max_vocab_size=1000, the learning algorithm tries to make more merging steps than are actually possible. You can tell by looking at the corpus state, where all words have been merged into a single token.
merges, corpus_state, vocabulary = wordpiece_learn([doc], max_vocab_size=100)
print(f"Final corpus state:\n{json.dumps(corpus_state, indent=2)}\n")
print(f"Final vocabulary (size: {len(vocabulary)}):\n{vocabulary}\n ")
print(f"Final list of merges:\n{merges}")
Final corpus state:
{
"low": 5,
"longer": 1,
"lower": 2,
"widest": 3,
"newest": 6
}
Final vocabulary (size: 30):
{'_r', '_g', 'newest', 'ne', 'widest', '_s', 'wide', 'new', '_ng', '_i', '_w', 'wi', 'long', 'w', '_er', 'lo', 'longe', 'n', '_o', '_e', 'low', 'longer', '_t', 'newe', 'wid', '_n', 'l', '_d', 'lower', '_st'}
Final list of merges:
[('_n _g', '_ng'), ('w _i', 'wi'), ('wi _d', 'wid'), ('l _o', 'lo'), ('lo _ng', 'long'), ('_s _t', '_st'), ('lo _w', 'low'), ('long _e', 'longe'), ('longe _r', 'longer'), ('n _e', 'ne'), ('ne _w', 'new'), ('wid _e', 'wide'), ('_e _r', '_er'), ('new _e', 'newe'), ('low _er', 'lower'), ('wide _st', 'widest'), ('newe _st', 'newest')]
WordPiece Tokenization Algorithm¶
Once we have applied the WordPiece learning algorithm over a corpus, using this learned model to actually tokenize an arbitrary text is very straightforward. In fact, strictly speaking, we only need the list of merges that was returned from the learning algorithm (see above). In some sense, the tokenization algorithm mimics the learning algorithm. In particular, we first perform pretokenization and treat each initial token separately. The method tokenize_word() tokenizes a single (initial) token based on the learned list of merges as follows:
- First, the method splits the word into character tokens and adds the CTOKEN using the generate_sequence() method. For example, the word "newer" becomes "n _e _w _e _r".
- Then, the method iterates over the list of merges and checks if a merge can be applied, and does so if a match is found. For example, the merge ("_e _r", "_er") will find a match in "n _e _w _e _r" and convert it to "n _e _w _er". To find the matches and perform the merges, we use the same Regular Expression we have already seen in the learning algorithm.
The rest of the implementation is merely for printing the intermediate results for illustrative purposes.
def tokenize_word(word, merges, verbose=False):
sequence = generate_sequence(word)
if verbose == True:
print(sequence)
for p, m in merges:
if p not in sequence:
continue
p = re.compile(r'(?<!\S)' + re.escape(p) + r'(?!\S)')
sequence = p.sub(m, sequence)
if verbose == True:
print(sequence)
return sequence.split(' ')
We can now run tokenize_word() over a word that was not in the training document. Of course, the exact output will depend on the value for max_vocab_size you chose to train the WordPiece tokenizer. For example, with max_vocab_size=0, the word will be split into its individual characters since no merges are performed at all. In other words, with max_vocab_size=0, the WordPiece tokenizer becomes a character tokenizer. In contrast, if the value for max_vocab_size is very large, the WordPiece tokenizer is more likely to behave like a word tokenizer.
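This claim is easy to verify (a small illustrative addition; the variable name char_merges is chosen here so that the merges learned above remain untouched): with max_vocab_size=0, no merging steps are performed, and the tokenizer falls back to character tokens.
# With max_vocab_size=0, the list of merges stays empty and WordPiece acts as a character tokenizer
char_merges, _, _ = wordpiece_learn([doc], max_vocab_size=0)
print(char_merges)
print(tokenize_word('newer', char_merges))
[]
['n', '_e', '_w', '_e', '_r']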
Your turn: Run the method wordpiece_learn() implementing the WordPiece learning algorithm with different values for max_vocab_size and see how the output of the code cell below changes.
tokens = tokenize_word('newer', merges, verbose=True)
print(tokens)
n _e _w _e _r
ne _w _e _r
new _e _r
new _er
new _er
['new', '_er']
To tokenize a complete document — again, mimicking the learning algorithm — we first need to pretokenize the document, and then run the method tokenize_word() over each initial token. The method tokenize() implements these basic steps.
def tokenize(doc, merges, verbose=False):
pretokens = pretokenize(doc)
tokens = []
for pt in pretokens:
tokens.extend(tokenize_word(pt, merges, verbose=verbose))
return tokens
The code cell below defines another example document to test the behavior of method tokenize(). As before, the exact output will depend on the value of max_vocab_size when training the tokenizer and the document doc itself. Feel free to modify the document by adding new words or tweaking existing ones. You are also encouraged to run the code cell with different versions of the tokenizer (i.e., trained using different values for max_vocab_size).
doc2 = "newer longest knew ingest belong newest"
example_token_list = tokenize(doc2, merges, verbose=True)
n _e _w _e _r
ne _w _e _r
new _e _r
new _er
new _er
l _o _n _g _e _s _t
l _o _ng _e _s _t
lo _ng _e _s _t
long _e _s _t
long _e _st
longe _st
k _n _e _w
k _n _e _w
i _n _g _e _s _t
i _ng _e _s _t
i _ng _e _st
b _e _l _o _n _g
b _e _l _o _ng
b _e _l _o _ng
n _e _w _e _s _t
n _e _w _e _st
ne _w _e _st
new _e _st
newe _st
newest
print(example_token_list)
['new', '_er', 'longe', '_st', 'k', '_n', '_e', '_w', 'i', '_ng', '_e', '_st', 'b', '_e', '_l', '_o', '_ng', 'newest']
Detokenize¶
If we would use the WordPiece tokenizer only to tokenize a text to serve as input for a machine learning model, we could stop here. However, text generation tasks such as machine translation, question answering, chatbots, etc. not only take tokenized text as input but also generate text in the form of tokens from the learned vocabulary. This means we need a way to convert a list of tokens back into a proper text. At least in its basic form, this can be done by performing two simple steps:
- Concatenate all tokens into a single string, and
- Remove every whitespace character together with the CTOKEN character(s) that immediately follow it.
The method detokenize() implements these two trivial steps; let's test it on the example token list from above.
def detokenize(tokens: list):
doc = ' '.join(tokens)
return re.sub(f" {CTOKEN}", "", doc).strip()
print(detokenize(example_token_list))
newer longest knew ingest belong newest
This step of detokenizing a list of tokens to a string actually shows why we need the special CTOKEN character. Without it, we could not tell — at least not easily and reliably — which tokens should be merged to form a word in the output string.
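As a final small check (an illustrative addition), tokenizing a document and then detokenizing the result should reproduce the original text — at least as long as the original text separates words by single spaces, as our example documents do:
# Round trip: tokenizing and detokenizing recovers the original documents
assert detokenize(tokenize(doc, merges)) == doc
assert detokenize(tokenize(doc2, merges)) == doc2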
Discussion & Limitations¶
Representation of tokens: In our implementation, the tokens in the corpus state are represented by their actual character strings. The advantage is that it is much easier to follow how the algorithm works. Practical implementations, however, commonly represent the tokens as unique ids, and the vocabulary maintains a mapping between the ids and their respective tokens. For example, instead of representing an entry in the corpus state like
{
...
"wid _e _st": 3,
...
}
the alternative representation using ids could look like
{
...
"230 4 108": 3,
...
}
where $230$ maps to "wid", $4$ to "_e", and $108$ to "_st". Thus, every time a token pair gets merged into a new token, a new id gets created for that token. This approach has a couple of advantages. Firstly, memory management gets easier since integer values have a fixed size in bytes, whereas string tokens vary in size during learning as tokens get merged. Secondly, token ids are the "natural" input for most machine learning algorithms, incl. neural networks. It is therefore more efficient if the tokenizer directly outputs a list of ids. If needed, ids can always be decoded using the mapping between ids and string tokens maintained in the vocabulary.
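As a hypothetical sketch of this idea (the names token2id and id2token are assumptions for illustration, not part of this notebook's code), such a mapping can be built on top of the vocabulary and merges we trained above:
# Hypothetical sketch: map string tokens to integer ids and back (the ids depend on the trained vocabulary)
token2id = {token: idx for idx, token in enumerate(sorted(vocabulary))}
id2token = {idx: token for token, idx in token2id.items()}

# Encode a small document into ids and decode it back into text
ids = [token2id[t] for t in tokenize('newer lower', merges)]
print(ids)
print(detokenize([id2token[i] for i in ids]))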
Special characters: For our example implementation, we used the underscore _ as a special character to mark the continuation of a word. We saw that this is needed to reconstruct a list of tokens into a proper text. We already mentioned that the choice of _ was simply to ease the presentation, but we had to make the assumption that the training data does not contain underscores. In real-world text corpora, of course, underscores might very well occur. Therefore, practical WordPiece tokenizers favor characters or character sequences that are extremely unlikely to appear in a text document. For example, the common BERT tokenizer uses ## to mark the continuation of a word; whether a single character or a sequence of characters is used for this purpose does not affect the algorithm.
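For reference — and purely optional, since it is not used anywhere else in this notebook — the pre-trained BERT tokenizer from the Hugging Face transformers package (not among this notebook's requirements) uses exactly this ## convention; the exact splits depend on BERT's learned vocabulary, so the output may differ from what our from-scratch tokenizer would produce.
# Optional comparison (assumes the `transformers` package is installed; not required elsewhere in this notebook)
from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Continuation pieces are prefixed with "##"; the splits depend on BERT's learned vocabulary
print(bert_tokenizer.tokenize("unbelievable"))
print(bert_tokenizer.tokenize("tokenization"))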
Smart(er) pretokenization: WordPiece — as well as most other subword tokenization algorithms — requires a pretokenization into an initial list of tokens to initialize the corpus state. For English and many other languages, doing this by breaking up a text with respect to whitespace characters is a quick and simple approach — and it works, as we have seen throughout the notebook. However, it is very common in English that there is no whitespace between tokens of different categories. For example, there is no whitespace before punctuation marks, and no whitespace before/after a closing/opening parenthesis or quote character. Simple whitespace pretokenization therefore yields initial tokens that do not "belong together". For example, an initial corpus state might look like this:
{
...
"w _i _d _e _s _t _.": 30,
"w _i _d _e _s _t _?": 16,
"w _i _d _e _s _t _!": 9,
"w _i _d _e _s _t _,": 3,
"w _i _d _e _s _t _:": 5,
"w _i _d _e _s _t _;": 10,
...
}
While, in principle, WordPiece still works, it might lead to a suboptimal allocation of limited vocabulary slots. For example, with this corpus state, both "_st?" and "_st!" (and maybe others) might make it into the vocabulary although, from the perspective of the underlying word, both tokens represent the same suffix — wasting vocabulary slots and model capacity. To avoid this, practical WordPiece implementations use some smarter pretokenization to prevent the learning algorithm from merging across character categories. For example, with a pretokenizer that splits words from punctuation marks, our corpus state from above might look as follows:
{
...
"w _i _d _e _s _t": 73,
".": 10085,
"?": 3120,
"!": 5467,
",": 8985,
":": 1403,
";": 2050,
...
}
To give a concrete example, GPT-2 used the following Regular Expression to pretokenize input texts (note: the expression has been slightly adapted to fit its use in this notebook, and later GPT models use a more refined expression; here, it is only used to show an alternative to naive whitespace pretokenization).
gpt2pattern = regex.compile(r"""'s|'t|'re|'ve|'m|'ll|'d|\p{L}+|\p{N}+|[^\s\p{L}\p{N}]+""")
print(regex.findall(gpt2pattern, "Hello've world123 how's are you!!!? "))
['Hello', "'ve", 'world', '123', 'how', "'s", 'are', 'you', '!!!?']
From the output of the previous code cell, you can already see how the Regular Expression works; in simple terms, it splits an input text into tokens that are one of the following (a short example follows the list):
- from a predefined set of clitics ("'s", "'t", "'re", "'ve", "'m", "'ll", "'d"),
- a sequence of letters of arbitrary length,
- a sequence of digits of arbitrary length, or
- a sequence of anything but letters, digits, and whitespace characters of arbitrary length.
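For instance, applied to a word followed by punctuation — the situation shown in the hypothetical corpus state above — the pattern cleanly separates the word from the punctuation marks:
# Punctuation is split off into its own tokens instead of sticking to the preceding word
print(regex.findall(gpt2pattern, "the widest road, the widest river!"))
['the', 'widest', 'road', ',', 'the', 'widest', 'river', '!']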
It is obvious that this Regular Expression can be modified to potentially improve the pretokenization step further. However, none of these additions change the fundamental WordPiece learning and tokenization algorithms covered in this notebook. In fact, you can easily add many of these improvements yourself to the basic algorithm covered here.
Example Application¶
So far, we have run our WordPiece tokenizer implementation only over a very simple and "artificial" example document to better understand all the steps of the algorithm. Now let's use a larger document to see how our tokenizer performs. While in practice huge corpora are used to train a subword tokenizer such as WordPiece, here we limit ourselves to a single book to keep the training time in check. The result will still give us very interesting insights.
Revised WordPiece Tokenizer Implementation¶
For this application use case, we provide the class MyWordPieceTokenizer in the module src.text.preprocessing.tokenizing (imported at the top of this notebook). This class contains exactly the methods we used so far to train and use our tokenizer. However, incorporating all methods into this class allows for cleaner code and much easier usage — now that we understand how WordPiece works. This means we can train our WordPiece tokenizer with a single line of code. The code for the class contains only two minor changes:
- Instead of the underscore character to mark the continuation of a word — which kept the illustrations of how the algorithm works cleaner — we now use ## for this purpose (like the BERT tokenizer based on WordPiece).
- For pretokenization, it supports naive whitespace tokenization as well as the approach used by GPT-2 (see above).
Let's first do this for our example document before using the real-world document.
my_tokenizer_example = MyWordPieceTokenizer(pretokenize=MyWordPieceTokenizer.PRE_TOKENIZE__SPLIT).fit([doc], max_vocab_size=100, verbose=True)
Initilize corpus and vocabulary... Perform 87 iterations...
20%|█████████████████████▎ | 17/87 [00:00<00:00, 4181.76it/s]
The progress bar will stop before 100% if the value for max_vocab_size is large enough so that the loop will stop before the expected number of iterations has been reached. Recall, this happens if all possible token pairs have been merged, and therefore no further merge is possible. And this will happen very quickly with very small documents like our toy document.
As this is our small toy document, we can still look at the final corpus state, vocabulary and the list of merges.
print(f"Final corpus state:\n{json.dumps(my_tokenizer_example._corpus_state, indent=2)}\n")
print(f"Final vocabulary (size: {len(my_tokenizer_example._vocabulary)}):\n{my_tokenizer_example._vocabulary}\n ")
print(f"Final list of merges:\n{my_tokenizer_example._merges}")
Final corpus state:
{
"low": 5,
"longer": 1,
"lower": 2,
"widest": 3,
"newest": 6
}
Final vocabulary (size: 30):
{'##g', 'newest', '##st', 'ne', 'widest', 'wide', 'new', 'wi', 'long', 'w', 'lo', '##ng', 'longe', '##o', 'n', '##w', '##n', 'low', 'longer', '##r', '##t', 'newe', '##er', '##i', '##e', 'wid', '##d', 'l', '##s', 'lower'}
Final list of merges:
[('##n ##g', '##ng'), ('w ##i', 'wi'), ('wi ##d', 'wid'), ('l ##o', 'lo'), ('lo ##ng', 'long'), ('##s ##t', '##st'), ('lo ##w', 'low'), ('long ##e', 'longe'), ('longe ##r', 'longer'), ('n ##e', 'ne'), ('ne ##w', 'new'), ('wid ##e', 'wide'), ('##e ##r', '##er'), ('new ##e', 'newe'), ('low ##er', 'lower'), ('wide ##st', 'widest'), ('newe ##st', 'newest')]
Of course, when using the same value for max_vocab_size, the result should be exactly the same as seen before — apart from the different special characters.
Training using Real-World Data¶
For the training we will use content from Project Gutenberg. Project Gutenberg is a digital library that provides free access to a vast collection of public domain books and literary works. Founded by Michael S. Hart in 1971, it is one of the oldest digital libraries in existence, aiming to democratize access to literature and knowledge. The project offers over 60,000 eBooks, including classic novels, historical documents, and reference works, in a variety of formats such as plain text, HTML, and ePub, making them accessible across different devices. The initiative relies heavily on volunteers to digitize, proofread, and maintain its collection, ensuring that these works are preserved and made universally available. Since it focuses on texts that are no longer under copyright protection, Project Gutenberg plays a key role in keeping timeless literary and cultural works accessible to the public for free, fostering education and literacy worldwide.
The book of choice is Treasure Island by Robert Louis Stevenson. It is a classic adventure novel that tells the story of young Jim Hawkins and his journey to uncover buried pirate treasure. The tale begins when Jim discovers a mysterious map among the belongings of a deceased sailor at his family’s inn. The map leads to a hidden treasure on a distant island, and Jim joins an expedition to retrieve it, led by the noble Dr. Livesey and the eccentric Squire Trelawney. As the voyage unfolds, Jim realizes that not all the crew members can be trusted, particularly the cunning and charismatic Long John Silver, a one-legged cook with his own secret agenda. The story is packed with thrilling battles, daring escapes, and moments of betrayal and bravery, as Jim and his allies face off against mutinous pirates. Treasure Island is a timeless tale of adventure, exploration, and the moral complexities of greed and loyalty.
Let's first read the file into the variable book.
Your turn: You can download other/more materials from the Project Gutenberg website to expand the overall training corpus. Further down below, when you look at some example sentences that have been tokenized using our trained WordPiece tokenizer, you will notice some limitations when the training dataset is not large and diverse enough.
with open(treasure_island_book, "r") as file:
book = file.read().replace('\n', '').strip()
print(f"Number of characters: {len(book)}")
Number of characters: 375911
Using our own WordPiece tokenizer implementation, we can now run the learning algorithm using Treasure Island. Feel free to modify the pretokenization approach and maximum vocabulary size. Keep in mind that our implementation is not optimized for performance and this is not a tiny toy document. As such, the learning will take a couple of minutes.
my_tokenizer_book = MyWordPieceTokenizer(pretokenize=MyWordPieceTokenizer.PRE_TOKENIZE__GPT2).fit([book], max_vocab_size=20000, verbose=True)
Initilize corpus and vocabulary... Perform 19839 iterations...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 19839/19839 [15:31<00:00, 21.29it/s]
Let's tokenize a couple of example sentences.
print(my_tokenizer_book.tokenize("There is still a lot of treasure buried on the island."))
print(my_tokenizer_book.tokenize("I've checked, my last shipment was delayed."))
print(my_tokenizer_book.tokenize("The captain and the lieutenant had a discussion."))
print(my_tokenizer_book.tokenize("The team members are Alice, John, Jim, and Bob."))
print(my_tokenizer_book.tokenize("I've checked, but I will check again."))
['Th', '##e', '##r', '##e', 'is', 'still', 'a', 'lot', 'of', 'treasur', '##e', 'buried', 'on', 'th', '##e', 'island', '.']
['I', "'", '##ve', 'check', '##e', '##d', ',', 'my', 'last', 'shipm', '##e', '##nt', 'was', 'd', '##e', '##lay', '##e', '##d', '.']
['Th', '##e', 'captain', 'and', 'th', '##e', 'li', '##e', '##ut', '##e', '##nant', 'had', 'a', 'discussion', '.']
['Th', '##e', 't', '##e', '##am', 'm', '##emb', '##e', '##rs', 'a', '##r', '##e', 'Al', '##ic', '##e', ',', 'John', ',', 'Jim', ',', 'and', 'B', '##ob', '.']
['I', "'", '##ve', 'check', '##e', '##d', ',', 'but', 'I', 'will', 'check', 'again', '.']
Assuming max_vocab_size=20000, we see that the token lists contain "full" words we would expect given that our training corpus is the book Treasure Island. For example, "John" and "Jim" are the names of characters in the book and therefore appear frequently. In contrast, the names "Alice" and "Bob" never appear in the book and are as such Out-of-Vocabulary (OOV) tokens, which are split into known subword tokens. The same is true for the two ranks "captain" (appears often in the book) and "lieutenant" (never appears in the book).
Summary¶
WordPiece is a subword tokenization algorithm widely used in natural language processing (NLP) to handle the challenges of out-of-vocabulary (OOV) words and create efficient representations of text. Unlike word-level tokenization, which struggles with rare or novel words, and character-level tokenization, which loses semantic context, WordPiece breaks words into smaller, meaningful subword units. It begins with an initial vocabulary of characters and iteratively merges the token pairs that maximize the likelihood of the training corpus until a specified vocabulary size is reached. This approach ensures that both frequent and rare words are effectively represented, making WordPiece a cornerstone of modern NLP systems like BERT and its variants.
WordPiece is crucial for enabling language models to generalize across diverse text inputs. By tokenizing words into known subunits, it allows models to process rare or unseen words without losing their semantic meaning. For example, a word like "unhappiness" might be split into "un", "happy", and "ness", capturing its structure and meaning. WordPiece is especially effective in multilingual and morphologically rich languages, where it can create shared subword representations across languages. Its applications include machine translation, sentiment analysis, question answering, and more, where robust tokenization is essential for high model performance. Its main advantages are:
- OOV handling: WordPiece effectively tokenizes rare or novel words into smaller, meaningful components, preventing issues with OOV words.
- Compact vocabulary: The algorithm creates a vocabulary that balances granularity and efficiency, reducing memory and computational requirements.
- Multilingual support: Shared subwords across languages make WordPiece well-suited for multilingual models like mBERT.
- Improved Generalization: By focusing on statistically significant subword units, WordPiece supports better model performance across domains.
However, WordPiece also has some limitations or potential problems, mainly:
- Computational overhead: The iterative merging process during training can be computationally expensive for large corpora.
- Over-fragmentation: Small vocabulary sizes can lead to overly fragmented tokenization, potentially obscuring word-level semantic meaning.
- Dependence on corpus: WordPiece relies heavily on the training corpus, making the vocabulary less adaptable to new domains or specialized datasets.
- Interpretability: Subword tokenization can result in less human-interpretable token sequences, particularly in text generation tasks.
WordPiece strikes a balance between character- and word-level tokenization, offering an efficient and effective solution for modern NLP challenges. Despite its limitations, its ability to handle OOV words, create compact vocabularies, and support multilingual processing has made it indispensable in the development of state-of-the-art NLP models. However, careful tuning of vocabulary size and corpus selection is essential to maximize its benefits while mitigating its drawbacks.