Disclaimer: This Jupyter Notebook contains content generated with the assistance of AI. While every effort has been made to review and validate the outputs, users should independently verify critical information before relying on it. The SELENE notebook repository is constantly evolving. We recommend downloading or pulling the latest version of this notebook from Github.
Byte-Pair Encoding Tokenization¶
Byte-Pair Encoding (BPE) is a widely used method for subword tokenization in natural language processing (NLP). Subword tokenization refers to breaking text into smaller units, such as parts of words, instead of splitting only by full words or single characters. This technique is particularly useful in tasks like machine translation, text generation, and speech recognition, where handling large and diverse vocabularies efficiently is crucial. BPE achieves this by striking a balance between using full words and individual characters, creating tokens that are subword units.
The motivation for using BPE lies in the challenges posed by natural language. Words in any language vary widely in their frequency of usage — some appear often (like "the" or "and"), while others are rare (like "antidisestablishmentarianism"). Storing and processing all these words as separate tokens would make the vocabulary size enormous and computationally expensive. On the other hand, splitting text into individual characters would lose the meaningful structure of words, making it harder for models to learn semantic relationships. BPE provides an elegant solution by building a vocabulary of common subword units that can effectively represent both frequent and rare words.
BPE works by initially treating every character in a text as a token and then iteratively merging the most frequent pairs of tokens into new subword units. For instance, if "th" and "e" frequently appear together, they are merged into "the". This process continues until a predefined vocabulary size is reached. The resulting subword tokens capture common patterns in the language, such as prefixes, suffixes, or word stems, which helps models understand linguistic structures better while keeping the vocabulary manageable.
One of the key advantages of BPE is its ability to handle out-of-vocabulary words effectively. In traditional tokenization methods, a completely new word might be treated as unknown or ignored. With BPE, even unseen words can be broken into subword units that the model has already learned, allowing it to make better predictions. This capability makes BPE particularly important for working with morphologically rich languages or domains with specialized vocabularies, such as medical or legal texts.
In this notebook, we will take a closer look at BPE and implement a basic BPE tokenizer from scratch in a step-by-step and illustrative manner.
Setting up the Notebook¶
Make Required Imports¶
This notebook requires the import of different Python packages as well as additional Python modules that are part of the repository. If a package is missing, use your preferred package manager (e.g., conda or pip) to install it. If the code cell below runs without any errors, all required packages and modules have been successfully imported.
import re, regex, collections, json
from src.text.preprocessing.tokenizing import MyBpeTokenizer
from src.utils.data.files import *
Download Required Data¶
Some code examples in this notebook use data that first needs to be downloaded by running the code cell below. If this code cell throws an error, please check in the configuration file config.yaml whether the URL for downloading datasets is up to date and matches the one on Github. If not, simply download or pull the latest version from Github.
treasure_island_book, _ = download_dataset("text/corpora/books/treasure-island.txt")
File 'data/datasets/text/corpora/books/treasure-island.txt' already exists (use 'overwrite=True' to overwrite it).
BPE from Scratch¶
The fundamental idea behind BPE is quite straightforward to understand and easy to implement, and we will go through the algorithm step by step in the following. Practical implementations of BPE are more sophisticated, as they include additional refinements or aim for greater efficiency. Here, the focus is on understanding BPE for tokenization. To this end, we use a very artificial example document to better illustrate the inner workings of the algorithm. This example document is defined in the code cell below.
doc = 'low low low low low lower lower newest newest newest newest newest newest widest widest widest longer'
The algorithm also needs a special character to mark the end of a word; its purpose will become clear once we go through the algorithm. Again, to keep things simple, we use the underscore character _ for this. Note that this means our input document is not allowed to contain underscore characters. This works perfectly fine for our example document here.
TOKEN_EOS = '_'
Core Steps¶
We first go through the core steps of BPE, before combining them to the final learning algorithm.
Pretokenize Text¶
BPE assumes that the corpus has been pretokenized into an initial list of tokens. The most basic approach is to pretokenize a text based on whitespace characters — in practice, slightly more sophisticated methods are used (discussed later). The code cell below contains the method pretokenize() that uses the built-in Python method split() to convert a text string into a list of tokens by splitting the string on whitespace characters.
def pretokenize(doc):
    return doc.split()
To show an example, we can apply this method to our example document.
print(pretokenize(doc))
['low', 'low', 'low', 'low', 'low', 'lower', 'lower', 'newest', 'newest', 'newest', 'newest', 'newest', 'newest', 'widest', 'widest', 'widest', 'longer']
Initialize Corpus State and Vocabulary¶
The corpus state is a data structure, here a dictionary, that maps each unique token after pretokenization to its number of occurrences in the training corpus. Since BPE is a bottom-up approach, we also need to split each token (typically a word) into its characters, and the special TOKEN_EOS is appended. Let's first create a utility method that performs this step for an initial word/token.
def generate_sequence(word, eos=TOKEN_EOS):
    return ' '.join((list(word) + [eos]))
generate_sequence('fastest')
'f a s t e s t _'
Now we can go through all our documents to initialize the corpus state, i.e., the set of unique words/tokens across the whole corpus converted to their sequences. For this, we perform the following steps for each document:
- Perform simple pretokenization using the pretokenize() method.
- Split each token into its characters using the utility method generate_sequence().
- Add the generated sequence to the corpus state; to avoid repeated entries of the same sequence, we also keep track of the number of times a token appears in the corpus — we therefore implement the corpus state as a dictionary with the sequences as the keys and the numbers of occurrences as the values.
During this process, we also initialize the vocabulary as the set of all unique characters. Later, the learning algorithm will add strings longer than a single character to the vocabulary. Note, however, that the vocabulary itself is not crucial for the BPE algorithm.
The method initialize_corpus_state() in the code cell below implements both steps and returns the initial corpus state and vocabulary.
def initialize_corpus_state(docs: list):
    # Initial vocabulary as an empty set
    vocabulary = set()
    # Initialize dictionary representing the corpus state
    corpus_state = collections.defaultdict(int)
    # Loop over all documents
    for doc in docs:
        # Add all characters in the current document to the vocabulary
        vocabulary.update(set(doc))
        # For each word in the document, generate the sequence and add it to the corpus state
        for word in pretokenize(doc):
            corpus_state[generate_sequence(word)] += 1
    # Remove whitespace character from final vocabulary
    vocabulary.discard(" ")
    # Add EOS token
    vocabulary.add(TOKEN_EOS)
    return dict(corpus_state), vocabulary
We can now run the method over our toy corpus, which consists of just a single document.
corpus_state, vocabulary = initialize_corpus_state([doc])
Let's first have a look at the corpus state; we use the dumps() method of the json library for a more user-friendly output.
print(json.dumps(corpus_state, indent=2))
{
"l o w _": 5,
"l o w e r _": 2,
"n e w e s t _": 6,
"w i d e s t _": 3,
"l o n g e r _": 1
}
When looking at the result, you should recognize all five unique words/tokens that appear in our toy document, only converted to their initial sequences. The number reflects how often that word/token appeared in the document. For example, the key-value pair "w i d e s t _": 3 indicates that the word "widest" appeared three times in the document.
We can also look at the initial vocabulary which is simply the set of all unique characters in the document, together with the special EOS token.
print(vocabulary)
{'t', 'e', '_', 'r', 'o', 'l', 's', 'i', 'g', 'w', 'n', 'd'}
If we did nothing else — that is, if we did not actually perform any training — the BPE tokenizer would behave like a character tokenizer. We will actually try this later.
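As a quick preview (an illustrative sketch added here for clarity), splitting the initial character sequence of a word without applying any merges already gives exactly this character-level tokenization.

# With no merges learned, the initial sequence is already the final tokenization,
# i.e., individual characters plus the EOS token
print(generate_sequence('hello').split())   # -> ['h', 'e', 'l', 'l', 'o', '_']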
Token Merging Step¶
The key idea of BPE is to iteratively merge the most frequent token pairs in the corpus state to build a vocabulary that efficiently represents the structure of the input text. The goal is to identify and represent commonly occurring patterns — such as subwords, roots, prefixes, or suffixes — with single tokens. This reduces the number of tokens needed to encode the text while maintaining flexibility for representing rare or unseen words through combinations of smaller subword units.
By iteratively merging the most frequent adjacent token pairs, BPE strikes a balance between character-level and word-level tokenization. Character-level tokenization ensures coverage for any text but can result in extremely long sequences that are computationally expensive to process. On the other hand, word-level tokenization creates inefficiencies when encountering rare words, as each unique word would need a separate token. BPE bridges this gap by creating a subword-level tokenization scheme: frequent word components are merged into single tokens, while rare or out-of-vocabulary words can still be represented as sequences of smaller subword units.
Calculate Number of Token Pairs¶
To know which token pair to merge, we first need to find the most frequent pair (or pairs) of tokens in the current corpus state. For example, assuming the initial corpus state from above, we need to calculate the number of occurrences of "l o", "o w", "w _", "w e", "e r", and so on, to know which pair to merge next. The method find_most_frequent_token_pair() in the code cell below implements this calculation by iterating over all words in the corpus state, and for each word, iterating over all token pairs to sum up each pair's total number of occurrences. The method then returns the most frequent token pair as well as a dictionary with all token pairs and their respective counts. The latter is only there to have a look at the counts; this dictionary is not needed for the algorithm itself.
def find_most_frequent_token_pair(corpus_state):
    token_pair_counts = collections.defaultdict(int)
    for word, freq in corpus_state.items():
        sequence = word.split()
        for i in range(len(sequence)-1):
            token_pair_counts[f"{sequence[i]} {sequence[i+1]}"] += freq
    # Return the most frequent pair (ties are broken arbitrarily: max() returns the first such pair it encounters) + all token pair counts
    return max(token_pair_counts.keys(), key=(lambda key: token_pair_counts[key])), token_pair_counts
Using the method find_most_frequent_token_pair(), we can identify the first token pair to merge for our initial corpus state. In general, there might be multiple token pairs with the same highest number of occurrences. In this situation, the implementation above simply picks one of these pairs (whichever max() returns first).
# We re-initialize the corpus state and the vocabulary to ensure a consistent output of this code cell
corpus_state, vocabulary = initialize_corpus_state([doc])
top_token_pair, token_pair_counts = find_most_frequent_token_pair(corpus_state)
Let's first look at the counts for all token pairs. Again, this is just for illustrative purposes and not required for the BPE algorithm itself.
print(json.dumps(token_pair_counts, indent=2))
{
"l o": 8,
"o w": 7,
"w _": 5,
"w e": 8,
"e r": 3,
"r _": 3,
"n e": 6,
"e w": 6,
"e s": 9,
"s t": 9,
"t _": 9,
"w i": 3,
"i d": 3,
"d e": 3,
"o n": 1,
"n g": 1,
"g e": 1
}
We can see that in fact three token pairs have the highest number of occurrences of 9: "e s", "s t", and "t _". Any of those three is a suitable candidate to get merged next. So let's see which pair the method find_most_frequent_token_pair() has returned.
print(top_token_pair)
e s
In the following, let's assume the most frequent token pair that was returned is "e s".
Perform Merge¶
With the next most frequent token pair found, we can actually perform the merging step. This step includes three core substeps:
- Merge the token pair into a new token (e.g., "e s" becomes "es")
- Add this new token (here, "es") to the vocabulary
- Update the corpus state by replacing all occurrences of the token pair with the new token — for our example, replace all occurrences of "e s" in the corpus state with "es".
The method perform_merge() implements these three substeps. Merging the token pair into a new token and adding this new token to the vocabulary is straightforward. Updating the corpus state is only a bit more complex. Since the corpus state is implemented as a dictionary, and replacing all occurrences of the token pair with the merged token means changing the keys of this dictionary, we cannot simply iterate over the dictionary and directly change its keys. We therefore break this update step up into two loops:
- In the first loop, we find all keys (i.e., the words in the corpus state) that contain the token pair (at least once) and therefore need to be updated. We identify these keys using a Regular Expression that matches the token pair in a key. For example, given the key "n e w e s t _", the Regular Expression looking for "e s" would match; it would not match for "l o n g e r _". We keep track of all matching keys using the dictionary matches, whose keys are the old keys in the corpus state and whose values are the new keys. We use the same Regular Expression to generate the new keys; for example, "n e w e s t _" becomes "n e w es t _".
- In the second loop, we iterate over all matches to add the new keys to the corpus state and give them the values of their respective old keys (those values, representing the counts, do not change). We do this using the built-in pop() method, which automatically removes the entry for the old key from the corpus state.
Finally we return the merge (i.e., the tuple of the token pair and the new token — we need this later), the updated corpus state, and the updated vocabulary.
def perform_merge(token_pair, corpus_state, vocabulary):
    # Create new token by merging token pair
    new_token = re.sub(' ', '', token_pair)
    # Create merge as tuple of token pair and new token
    merge = (token_pair, new_token)
    # Add new token to vocabulary
    vocabulary.add(new_token)
    # Define search pattern
    pattern = re.compile(r"(?<!\S)" + re.escape(token_pair) + r"(?!\S)")
    # Loop through corpus state and record which keys/sequences need to be updated
    matches = {}
    for sequence in corpus_state:
        if pattern.search(sequence):
            matches[sequence] = pattern.sub(new_token, sequence)
    # Perform the update of keys/sequences
    for old, new in matches.items():
        corpus_state[new] = corpus_state.pop(old)
    # Return the merge, the updated corpus state, and the updated vocabulary
    return merge, corpus_state, vocabulary
For testing, we can run the method perform_merge() over our initial corpus state and vocabulary. We only re-initialize both the corpus state and the vocabulary to ensure the output of the code cell is always the same.
# We re-initialize the corpus state and the vocabulary to ensure a consistent output of this code cell
corpus_state, vocabulary = initialize_corpus_state([doc])
merge, corpus_state, vocabulary = perform_merge('e s', corpus_state, vocabulary)
Let's have a look at all three return values of this method.
print(f"Merge: {merge}")
print()
print("Updated corpus state:")
print(json.dumps(corpus_state, indent=2))
print()
print("Updated vocabulary")
print(vocabulary)
Merge: ('e s', 'es')
Updated corpus state:
{
"l o w _": 5,
"l o w e r _": 2,
"l o n g e r _": 1,
"n e w es t _": 6,
"w i d es t _": 3
}
Updated vocabulary
{'t', 'e', '_', 'r', 'es', 'o', 'l', 's', 'i', 'g', 'w', 'n', 'd'}
In particular, notice the changes in the corpus state: "n e w e s t _" became "n e w es t _", and "w i d e s t _" became "w i d es t _". Also, "es" has been added to the vocabulary.
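To get a feeling for the iterative nature of the algorithm before we wrap it in a loop, we can simply repeat the two core steps on the updated corpus state. The code cell below is an illustrative sketch added for clarity; consistent with the final list of merges shown further below, the next pair to be merged here should be "es t", yielding "est".

# Repeat the two core steps once more on the updated corpus state (illustrative sketch)
next_token_pair, _ = find_most_frequent_token_pair(corpus_state)
print(f"Next pair to merge: {next_token_pair}")
merge, corpus_state, vocabulary = perform_merge(next_token_pair, corpus_state, vocabulary)
print(json.dumps(corpus_state, indent=2))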
BPE Learning Algorithm¶
With these three core steps, that is:
- Initialization of the corpus state and vocabulary
- Finding the most frequent token pair (for the current corpus state), and
- Performing the merging step and updating the corpus state
We now have everything in place to implement the BPE learning algorithm by plugging together the methods implementing those three steps; see the method bpe_learn() in the code cell below. Although the implementation of this method is rather straightforward, three small details are worth mentioning:
- One of the aforementioned advantages of BPE is that we can specify the maximum size of the resulting vocabulary. Since each merging step adds a new token to the vocabulary, we can restrict the size of the vocabulary by limiting the number of merging steps. However, since our initial vocabulary is not empty but the set of all unique characters, the number of merging steps — that is, the number of iterations num_iter in the code — derives from the difference between the specified maximum vocabulary size max_vocab_size and the size of the initial vocabulary.
- In principle, particularly if the corpus is not very large, the algorithm might be asked to perform more merges than are possible. This happens when all words in the corpus state have been merged back to their original form (plus the EOS token). For example, after a certain number of iterations, "n e w e s t _" will have been merged to "newest_". If this is true for all words in the corpus state, the corpus state no longer contains any pair of tokens. In this case, the method find_most_frequent_token_pair() will throw an error and we exit the loop, since the learning algorithm has finished.
- In each iteration, we keep track of the most recent merge by adding it to a list of all previous merges. This list of merges is in fact the most important return value of the learning algorithm, as it is used for tokenizing text. Note that the order matters, as we want to merge tokens in a new text in the same order as we merged them during the learning phase.
def bpe_learn(corpus, max_vocab_size=10000):
    # Initialize corpus state and vocabulary
    corpus_state, vocabulary = initialize_corpus_state(corpus)
    # Initialize the list of merges
    merges = []
    # Calculate the number of merging steps to ensure the maximum size of the vocabulary
    num_iter = max_vocab_size - len(vocabulary)
    for _ in range(num_iter):
        # Find the most frequent pair; if this fails, no more merging was possible and we can stop
        try:
            top_token_pair, _ = find_most_frequent_token_pair(corpus_state)
        except ValueError:
            break
        # Update corpus state and the vocabulary
        merge, corpus_state, vocabulary = perform_merge(top_token_pair, corpus_state, vocabulary)
        # Record the merge in the list of merges (the order matters)
        merges.append(merge)
    # Return list of merges, the corpus state, and the vocabulary
    return merges, corpus_state, vocabulary
Well, let's train a BPE tokenizer over our toy document.
Your turn: Try different values for max_vocab_size and inspect the results. Of course, the larger this value, the larger the final vocabulary and the list of merges, but the corpus state also shows larger tokens. For example, with a large value, say max_vocab_size=1000, the learning algorithm tries to make more merging steps than are actually possible. You can tell by looking at the corpus state, where all words have been merged into a single token.
merges, corpus_state, vocabulary = bpe_learn([doc], max_vocab_size=21)
print(f"Final corpus state:\n{json.dumps(corpus_state, indent=2)}\n")
print(f"Final vocabulary (size: {len(vocabulary)}):\n{vocabulary}\n ")
print(f"Final list of merges:\n{merges}")
Final corpus state:
{
"w i d est_": 3,
"lo n g e r _": 1,
"low e r _": 2,
"newest_": 6,
"low_": 5
}
Final vocabulary (size: 21):
{'t', 'ne', 'new', 'w', 'newest_', 'e', 'r', 'es', 'low', 'est', 'low_', 'i', 'est_', 'lo', '_', 'o', 'l', 'n', 'd', 's', 'g'}
Final list of merges:
[('e s', 'es'), ('es t', 'est'), ('est _', 'est_'), ('l o', 'lo'), ('lo w', 'low'), ('n e', 'ne'), ('ne w', 'new'), ('new est_', 'newest_'), ('low _', 'low_')]
BPE Tokenization Algorithm¶
Once we have applied the BPE learning algorithm over a corpus, using this learned model for actually tokenizing an arbitrary text is very straightforward. In fact, strictly speaking, we only need the list of merges that was returned from the learning algorithm (see above). In some sense, the tokenization algorithm mimics the learning algorithm: we first perform whitespace pretokenization and treat each initial token separately. The method tokenize_word() tokenizes a single (initial) token based on the learned list of merges as follows:
- First, the method splits the word into character tokens and adds the EOS token using the generate_sequence() method. For example, the word "newer" becomes "n e w e r _".
- Then, the method iterates over the list of merges, checks if a merge can be applied, and applies it if a match is found. For example, the merge ("e w", "ew") will find a match in "n e w e r _" and convert it to "n ew e r _". To find the matches and perform the merges, we use the same Regular Expression we have already seen in the learning algorithm.
The rest of the implementation is merely for printing the intermediate results for illustrative purposes.
def tokenize_word(word, merges, verbose=False):
    # Start with the character-level sequence of the word (incl. the EOS token)
    sequence = generate_sequence(word)
    if verbose:
        print(sequence)
    # Apply all learned merges in the order in which they were learned
    for token_pair, new_token in merges:
        if token_pair not in sequence:
            continue
        pattern = re.compile(r'(?<!\S)' + re.escape(token_pair) + r'(?!\S)')
        sequence = pattern.sub(new_token, sequence)
        if verbose:
            print(sequence)
    return sequence.split(' ')
Let's run tokenize_word() over a word that was not in the training document. Of course, the exact output will depend on the value of max_vocab_size you chose to train the BPE tokenizer. For example, with max_vocab_size=0, the word will be split into its individual characters since no merges were learned. In other words, with max_vocab_size=0, the BPE tokenizer becomes a character tokenizer. In contrast, if the value for max_vocab_size is very large, the BPE tokenizer is more likely to behave like a word tokenizer.
Your turn: Run the method bpe_learn() implementing the BPE learning algorithm with different values for max_vocab_size and see how the output of the code cell below changes.
tokens = tokenize_word('newer', merges, verbose=True)
print(tokens)
n e w e r _
ne w e r _
new e r _
['new', 'e', 'r', '_']
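To confirm the character-tokenizer behavior mentioned above, here is a small additional sketch (not part of the original walkthrough): with max_vocab_size=0, no merges are learned, so tokenize_word() simply returns the individual characters.

# With max_vocab_size=0, no merging iterations are performed and the list of merges stays empty
merges_empty, _, _ = bpe_learn([doc], max_vocab_size=0)
print(merges_empty)                           # -> []
print(tokenize_word('newer', merges_empty))   # -> ['n', 'e', 'w', 'e', 'r', '_']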
To tokenize a complete document — again, mimicking the learning algorithm — we first need to pretokenize the document and then run the method tokenize_word() over each initial token. The method tokenize() implements these basic steps.
def tokenize(doc, merges, verbose=False):
    pretokens = pretokenize(doc)
    tokens = []
    for pt in pretokens:
        tokens.extend(tokenize_word(pt, merges, verbose=verbose))
    return tokens
The code cell below defines another example document to test the behavior of the method tokenize(). As before, the exact output will depend on the value of max_vocab_size used when training the tokenizer and on the example document doc2 itself. Feel free to modify the document by adding new words or tweaking existing ones. You are also encouraged to run the code cell with different versions of the tokenizer (i.e., trained using different values for max_vocab_size).
doc2 = "newer longest knew ingest belong newest"
print(tokenize(doc2, merges, verbose=True))
n e w e r _
ne w e r _
new e r _
l o n g e s t _
l o n g es t _
l o n g est _
l o n g est_
lo n g est_
k n e w _
k ne w _
k new _
i n g e s t _
i n g es t _
i n g est _
i n g est_
b e l o n g _
b e lo n g _
n e w e s t _
n e w es t _
n e w est _
n e w est_
ne w est_
new est_
newest_
['new', 'e', 'r', '_', 'lo', 'n', 'g', 'est_', 'k', 'new', '_', 'i', 'n', 'g', 'est_', 'b', 'e', 'lo', 'n', 'g', '_', 'newest_']
Detokenize¶
If we were to use the BPE tokenizer only to tokenize a text to serve as input for a machine learning model, we could stop here. However, text generation tasks such as machine translation, question answering, chatbots, etc. not only take tokenized text as input but also generate text in the form of tokens from the learned vocabulary. This means we need a way to convert a list of tokens back into a proper text. This, at least in its basic form, can be done by performing two simple steps:
- Concatenate all tokens into a single string, and
- Replace the special EOS character with a whitespace character.
The method detokenize() implements these two trivial steps; let's test it on an example token list.
def detokenize(tokens: list):
    doc = ''.join(tokens)
    return re.sub(TOKEN_EOS, " ", doc).strip()
print(detokenize(['new', 'er_', 'long', 'est_', 'k', 'new', '_', 'i', 'n', 'g', 'est_', 'b', 'e', 'long', '_', 'newest_']))
newer longest knew ingest belong newest
If you have followed the notebook closely so far, you will notice that this is the actual reason why we need the special EOS character. Without it, it would be impossible in the detokenization step to say which tokens need to be merged, or more specifically, which tokens should not be merged.
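To make this concrete, here is a tiny illustration (added for clarity, not part of the algorithm itself): without the marker, the concatenation of tokens gives no clue where one word ends and the next begins, while the EOS token lets detokenize() restore the whitespace.

# Without the EOS marker, word boundaries are lost after concatenation
print(''.join(['new', 'er', 'long', 'est']))        # -> newerlongest (one word or two?)
# With the EOS marker, detokenize() can restore the whitespace
print(detokenize(['new', 'er_', 'long', 'est_']))   # -> newer longest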
Discussion & Limitations¶
Representation of tokens: In our implementation, the tokens in the corpus state are represented as their actual strings of characters. The advantage is that it is much easier to understand the inner workings of the algorithm. Practical applications, however, commonly represent the tokens as unique ids, and the vocabulary maintains a mapping between the ids and their respective tokens. For example, instead of representing an entry in the corpus state like
{
...
"w id est_": 3,
...
}
the alternative representation using ids could look like
{
...
"23 324 108": 3,
...
}
where $23$ maps to "w", $324$ to "id", and $108$ to "est_". Thus, every time a token pair gets merged into a new token, a new id gets created for that token. This approach has a couple of advantages. Firstly, memory management becomes easier, since integer values have a fixed size in bytes, whereas string tokens grow in size during learning as tokens get merged. Secondly, token ids are the "natural" input for most machine learning algorithms, including neural networks. It is therefore more efficient if the tokenizer directly outputs a list of ids. If needed, ids can always be decoded using the mapping between ids and string tokens maintained in the vocabulary.
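The sketch below illustrates this idea; the mapping and the concrete ids are made up for illustration and are not part of our implementation.

# Hypothetical id-based vocabulary (ids chosen arbitrarily for illustration)
token_to_id = {'w': 23, 'id': 324, 'est_': 108}
id_to_token = {i: t for t, i in token_to_id.items()}
# Encode a sequence of string tokens into ids and decode it back
tokens_example = ['w', 'id', 'est_']
ids_example = [token_to_id[t] for t in tokens_example]
print(ids_example)                             # -> [23, 324, 108]
print([id_to_token[i] for i in ids_example])   # -> ['w', 'id', 'est_']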
Special characters: For our example implementation we used the underscore _ as a special character to mark the end of a word. We saw that this was needed to reconstruct a list of tokens into a proper text. We already mentioned that the choice of _ was simply to ease the presentation, but we had to make the assumption that the training data does not contain underscores. In real-world text corpora, of course, underscores might very well occur. Therefore, practical BPE tokenizers favor characters that are very unlikely to appear in a text document. For example, GPT uses Ġ as a special character for its tokenizer. The character Ġ (uppercase "G" with a dot above; Unicode U+0120) is a letter in the Latin alphabet used in specific languages and contexts such as Maltese and in some transcription systems. The special character is also not required to mark the end of a word. Using it to mark the start of a word (as, again, GPT does) has the same effect. Lastly, a special placeholder does not have to be a single character but can consist of multiple characters. For example, the BERT tokenizer uses the prefix "##" to mark subword tokens that continue a word rather than start one.
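As a small, purely illustrative aside, we can verify in Python that Ġ is just an ordinary Unicode character at code point U+0120.

# Ġ is the Unicode code point U+0120 (decimal 288)
print('\u0120', hex(ord('\u0120')))   # -> Ġ 0x120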
Smart(er) pretokenization: BPE requires a pretokenization into an initial list of tokens to initialize the corpus state. For English and many other languages, breaking up a text on whitespace characters is a quick and simple approach — and it works, as we have seen throughout the notebook. However, it is very common in English that there is no whitespace between tokens of different categories. For example, there is no whitespace before punctuation marks, and no whitespace before/after a closing/opening parenthesis or quote character. Simple whitespace pretokenization therefore yields initial tokens that combine parts that do not "belong together". For example, an initial corpus state might look like this:
{
...
"w i d e s t . _": 30,
"w i d e s t ? _": 16,
"w i d e s t ! _": 9,
"w i d e s t , _": 3,
"w i d e s t : _": 5,
"w i d e s t ; _": 10,
...
}
While, in principle, BPE still works, this might lead to a suboptimal allocation of the limited vocabulary slots. For example, with this corpus state, both "st?" and "st!" (and maybe others) might make it into the vocabulary, although from the perspective of the word both tokens are the same. To avoid this, practical BPE implementations use smarter pretokenization to prevent the learning algorithm from merging across character categories. For example, with a pretokenizer that splits words from punctuation marks, our corpus state from above might look as follows:
{
...
"w i d e s t _": 73,
". _": 10085,
"? _": 3120,
"! _": 5467,
", _": 8985,
": _": 1403,
"; _": 2050,
...
}
To give a concrete example, GPT2 used the following Regular Expression to pretokenize input texts (note: the expression has been slightly adapted to fit its use in this notebook, and later GPT versions use a more refined expression; here, it is only used to show an alternative to naive whitespace pretokenization).
gpt2pattern = regex.compile(r"""'s|'t|'re|'ve|'m|'ll|'d|\p{L}+|\p{N}+|[^\s\p{L}\p{N}]+""")
print(regex.findall(gpt2pattern, "Hello've world123 how's are you!!!? "))
['Hello', "'ve", 'world', '123', 'how', "'s", 'are', 'you', '!!!?']
From the output of the previous code cell, you can already see how the Regular Expression works; in simple terms, it splits an input text into tokens that are:
- from a predefined set of clitics ("'s", "'t", "'re", "'ve", "'m", "'ll", "'d")
- a sequence of letters of arbitrary length
- a sequence of digits of arbitrary length
- a sequence of anything but letters, digits, and whitespace characters, of arbitrary length
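For comparison (purely illustrative), naive whitespace pretokenization of the same input keeps punctuation and clitics glued to the words.

# Naive whitespace pretokenization leaves punctuation and clitics attached to the words
print("Hello've world123 how's are you!!!? ".split())
# -> ["Hello've", 'world123', "how's", 'are', 'you!!!?']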
As you can see, there are many tweaks and improvements that practical BPE implementations include to yield better results. However, none of these additions change the fundamental BPE learning and tokenization algorithm covered in this notebook. In fact, you could likely add many of these improvements yourself to the basic algorithm covered here.
Example Application¶
So far, we have run our BPE tokenizer implementation only over a very simple and "artificial" example document to better understand all the steps of the algorithm. Now let's use a larger document to see how our tokenizer performs. While in practice huge corpora are used to train a subword tokenizer such as BPE, here we limit ourselves to a single book to keep the training time in check. The result will still give us some very interesting insights.
Revised BPE Tokenizer Implementation¶
For this application use case, we provide the class MyBpeTokenizer in the file src/text/preprocessing/tokenizing/tokenizer.py. This class contains exactly the methods we used so far to train and use our tokenizer. However, incorporating all methods into this class allows for cleaner code and much easier usage — now that we understand how BPE works. This means we can train our BPE tokenizer with a single line of code. The code for the class contains only two minor changes:
- By default, like GPT2, it uses the character Ġ (instead of the underscore) as the special character. We also move this special character from the end of a word to the start of the word. This has no effect on the algorithm; we only do this so our output mimics the one from pretrained BPE tokenizer models.
- For pretokenization, it supports naive whitespace tokenization as well as the approach used by GPT2 (see above).
Let's first do this for our example document before using the real-world document.
my_tokenizer_example = MyBpeTokenizer(pretokenize=MyBpeTokenizer.PRE_TOKENIZE__SPLIT).fit([doc], max_vocab_size=100, verbose=True)
Initilize corpus and vocabulary... Perform 88 iterations...
20%|██████████████████████▎ | 18/88 [00:00<00:00, 5307.38it/s]
The progress bar will stop before 100% if the value for max_vocab_size is large enough so that the loop will stop before the expected number of iterations has been reached. Recall, this happens if all possible token pairs have been merged, and therefore no further merge is possible. And this will happen very quickly with very small documents like our toy document.
As this is our small toy document, we can still look at the final corpus state, vocabulary and the list of merges.
print(f"Final corpus state:\n{json.dumps(my_tokenizer_example._corpus_state, indent=2)}\n")
print(f"Final vocabulary (size: {len(my_tokenizer_example._vocabulary)}):\n{my_tokenizer_example._vocabulary}\n ")
print(f"Final list of merges:\n{my_tokenizer_example._merges}")
Final corpus state:
{
"\u0120low": 5,
"\u0120newest": 6,
"\u0120widest": 3,
"\u0120lower": 2,
"\u0120longer": 1
}
Final vocabulary (size: 30):
{'t', 'Ġlo', 'Ġlow', 'Ġne', 'Ġwid', 'Ġ', 'w', 'Ġlonger', 'Ġl', 'Ġw', 'Ġlong', 'e', 'r', 'es', 'est', 'i', 'Ġn', 'Ġlon', 'o', 'l', 'Ġnewest', 'Ġnew', 'n', 'd', 'er', 'Ġlower', 's', 'Ġwi', 'g', 'Ġwidest'}
Final list of merges:
[('e s', 'es'), ('es t', 'est'), ('Ġ l', 'Ġl'), ('Ġl o', 'Ġlo'), ('Ġlo w', 'Ġlow'), ('Ġ n', 'Ġn'), ('Ġn e', 'Ġne'), ('Ġne w', 'Ġnew'), ('Ġnew est', 'Ġnewest'), ('Ġ w', 'Ġw'), ('e r', 'er'), ('Ġw i', 'Ġwi'), ('Ġwi d', 'Ġwid'), ('Ġwid est', 'Ġwidest'), ('Ġlow er', 'Ġlower'), ('Ġlo n', 'Ġlon'), ('Ġlon g', 'Ġlong'), ('Ġlong er', 'Ġlonger')]
Of course, when using the same value for max_vocab_size, the result should be exactly the same as seen before — apart from the different special characters.
Training using Real-World Data¶
For the training we will use content from Project Gutenberg. Project Gutenberg is a digital library that offers free access to thousands of public domain eBooks, including classic literature, historical texts, and reference works. Established in 1971 by Michael S. Hart, it is one of the oldest digital libraries in existence, aiming to democratize access to knowledge by making literature widely available in electronic formats. The collection primarily focuses on works for which copyright has expired, ensuring they are freely and legally accessible to readers worldwide. Books are available in various formats, including plain text, ePub, and Kindle, and the project relies on volunteers to digitize and proofread content.
The book of choice is Treasure Island by Robert Louis Stevenson. It is an adventurous tale of a young boy named Jim Hawkins who discovers a pirate's treasure map and embarks on a perilous journey to find the hidden fortune. He joins a ship's crew led by Captain Smollett, but the voyage is fraught with danger as the ship’s cook, Long John Silver, turns out to be a cunning pirate with plans to seize the treasure for himself. The story unfolds with thrilling mutinies, battles, and betrayals, as Jim outsmarts the pirates and helps secure the treasure. Ultimately, Jim and the loyal crew members return home safely, leaving much of the treasure behind, having learned valuable lessons about bravery, trust, and greed.
Let's first read the file into the variable book.
Your turn: You can download other/more materials from the Project Gutenberg website to expand the overall training corpus. Further down below, when you look at some example sentences that have been tokenized using our trained BPE tokenizer, you will notice some limitations when the training dataset is not large and diverse enough.
with open(treasure_island_book, "r") as file:
    book = file.read().replace('\n', '').strip()
print(f"Number of characters: {len(book)}")
Number of characters: 375911
Using our own BPE tokenizer implementation, we can now run the learning algorithm using Treasure Island. Feel free to modify the pretokenization approach and maximum vocabulary size. Keep in mind that our implementation is not optimized for performance and this is not a tiny toy document. As such, the learning will take a couple of minutes.
my_tokenizer_book = MyBpeTokenizer(pretokenize=MyBpeTokenizer.PRE_TOKENIZE__GPT2).fit([book], max_vocab_size=10000, verbose=True)
Initilize corpus and vocabulary... Perform 9911 iterations...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 9911/9911 [04:16<00:00, 38.63it/s]
Let's tokenize a couple of example sentences.
print(my_tokenizer_book.tokenize("There is still a lot of treasure buried on the island."))
print(my_tokenizer_book.tokenize("I've checked, my last shipment was delayed."))
print(my_tokenizer_book.tokenize("The captain and the lieutenant had a discussion."))
print(my_tokenizer_book.tokenize("The team members are Alice, John, Jim, and Bob."))
print(my_tokenizer_book.tokenize("I've checked, but I will check again."))
['ĠThere', 'Ġis', 'Ġstill', 'Ġa', 'Ġlot', 'Ġof', 'Ġtreasure', 'Ġburied', 'Ġon', 'Ġthe', 'Ġisland', 'Ġ.']
['ĠI', 'Ġ', "'", 've', 'Ġche', 'cked', 'Ġ,', 'Ġmy', 'Ġlast', 'Ġship', 'ment', 'Ġwas', 'Ġdelayed', 'Ġ.']
['ĠThe', 'Ġcaptain', 'Ġand', 'Ġthe', 'Ġlieuten', 'ant', 'Ġhad', 'Ġa', 'Ġdiscu', 'ss', 'ion', 'Ġ.']
['ĠThe', 'Ġte', 'am', 'Ġme', 'mb', 'ers', 'Ġare', 'ĠA', 'li', 'ce', 'Ġ,', 'ĠJohn', 'Ġ,', 'ĠJim', 'Ġ,', 'Ġand', 'ĠB', 'o', 'b', 'Ġ.']
['ĠI', 'Ġ', "'", 've', 'Ġche', 'cked', 'Ġ,', 'Ġbut', 'ĠI', 'Ġwill', 'Ġcheck', 'Ġagain', 'Ġ.']
Unsurprisingly — assuming max_vocab_size=10000 for the following discussion — the tokenization results reflect the nature of our training document. Words that appear frequently in Treasure Island, such as "treasure" and "island", but also other common words like prepositions, are unlikely to be broken up. We can also see this in the names: "John" and "Jim" are names of characters in the book and therefore appear frequently. In contrast, "Alice" and "Bob" never appear in the book and are therefore Out-of-Vocabulary (OOV) tokens, which get split into known subword tokens. The same is true for the two ranks "captain" (which appears often in the book) and "lieutenant" (which never appears in the book). Another interesting observation is that "check" and "checked" get tokenized very differently. While this might seem a bit unintuitive at first, it shows that the order of merges matters. For example, if we assume that the merge ("k ed", "ked") is checked very early and there is no merge like ("ec ked", "ecked"), the word "checked" has no chance to get merged back to its full string.
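We can check this directly; the following sketch relies on the internal _vocabulary attribute used earlier and assumes the tokenizer was trained with the settings above.

# Frequent words from the book should appear as single vocabulary entries, rare/unseen ones should not
for word in ['Ġtreasure', 'Ġcaptain', 'Ġlieutenant', 'ĠAlice']:
    print(word, '->', word in my_tokenizer_book._vocabulary)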
Summary¶
Byte Pair Encoding (BPE) is an algorithm widely used for subword tokenization in natural language processing tasks. It starts with a base vocabulary consisting of individual characters and iteratively merges the most frequent adjacent pairs of tokens to form new, larger tokens. This process continues until the vocabulary reaches a predefined size or no more frequent pairs can be found. BPE generates a hierarchical vocabulary of subwords that represent commonly occurring patterns, making it efficient for tasks that involve processing text data with varying word frequencies.
One of the major advantages of BPE is its ability to balance coverage and efficiency. Unlike word-level tokenization, which struggles with rare or unseen words, BPE can decompose any word into a sequence of subword units, ensuring no out-of-vocabulary (OOV) issues. It also avoids the inefficiencies of character-level tokenization, which would require longer sequences to represent text. By focusing on frequent subword patterns, BPE creates compact token sequences that preserve meaning while being computationally manageable. Its main advantages are:
Compact vocabulary: BPE creates a smaller, more manageable vocabulary by focusing on the most frequent subword patterns. This reduces memory requirements and speeds up training compared to models with word-level vocabularies.
No OOV problems: BPE ensures that any word, including rare or novel ones, can be represented using subword units. This is particularly important in languages with rich morphology or compound word formation.
Efficiency in training and inference: The use of common subword units reduces sequence length compared to character-level tokenization, improving computational efficiency.
On the flip side, BPE also has some characteristics that can be considered disadvantages, mainly:
Hard-coded vocabulary size: BPE requires a predefined vocabulary size, which might not always align with the complexity or diversity of the input text. Too small a vocabulary may lead to inefficient tokenization, while too large a vocabulary could introduce redundancy.
Insensitive to context: BPE merges subword pairs based solely on their frequency without considering contextual meaning, which may lead to suboptimal token splits in cases where the same subword pair has different meanings in different contexts.
Static rules: Once trained, the BPE vocabulary is fixed and cannot adapt to new data or unseen linguistic patterns without retraining. This can limit its flexibility compared to dynamic tokenization methods.
Compared to other methods, such as SentencePiece or WordPiece, BPE is simpler and faster to implement. However, techniques like WordPiece can incorporate probabilistic models that better handle context, making them more effective in certain scenarios. SentencePiece, on the other hand, can operate on raw text directly without requiring explicit tokenization into characters, offering additional preprocessing flexibility. In summary, BPE is a powerful and efficient method for subword tokenization that excels in handling OOV issues and maintaining a manageable vocabulary. However, its static and frequency-based approach can be a limitation in scenarios where context or dynamic adaptability is crucial.