Disclaimer: This Jupyter Notebook contains content generated with the assistance of AI. While every effort has been made to review and validate the outputs, users should independently verify critical information before relying on it. The SELENE notebook repository is constantly evolving. We recommend downloading or pulling the latest version of this notebook from Github.

Token Indexing with Vocabularies¶

Token indexing using vocabularies in the context of machine learning for text is the process of mapping words, subwords, or characters in a text to unique numerical identifiers based on a predefined vocabulary. A vocabulary is a set of known tokens that a model can recognize, with each token assigned a corresponding index. For example, if a vocabulary consists of {hello: 1, world: 2, machine: 3, learning: 4}, then the phrase "hello world" would be represented as [1, 2]. This transformation allows text data to be processed by machine learning models, which operate on numerical inputs rather than raw text.
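As a minimal sketch of this lookup (the toy dictionary vocab below is purely illustrative and not used elsewhere in this notebook), the mapping boils down to a plain Python dictionary:

# A tiny, hypothetical vocabulary mapping tokens to indices
vocab = {"hello": 1, "world": 2, "machine": 3, "learning": 4}

# "hello world" becomes the index sequence [1, 2]
indices = [vocab[token] for token in "hello world".split()]
print(indices)  # [1, 2]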

Token indexing is essential because it provides a structured way to represent text, enabling machine learning models to process and analyze it efficiently. Without this step, words would have no standardized numerical representation, making it difficult for models to recognize patterns or relationships in language. Additionally, token indexing helps manage computational resources effectively by ensuring that text input is transformed into compact numerical sequences that can be fed into neural networks. Advanced tokenization methods, such as subword tokenization (e.g., Byte Pair Encoding or WordPiece), further enhance this process by breaking down rare or unknown words into smaller, more frequently occurring components.

One of the most important aspects of token indexing is its connection to word embeddings. Once text is converted into token indices, these indices serve as inputs to an embedding layer in a neural network. The embedding layer maps each token index to a dense, high-dimensional vector that captures semantic and syntactic relationships between words. For instance, words with similar meanings, such as "king" and "queen", would have embeddings that are closer in vector space. This transformation is crucial for deep learning models, as it allows them to understand language beyond simple numerical representations.
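As an illustrative sketch only (PyTorch is not imported in this notebook, so the snippet below assumes torch is available and uses made-up sizes), an embedding layer is essentially a lookup table from token indices to dense vectors:

import torch
import torch.nn as nn

# Hypothetical sizes: a vocabulary of 10 tokens, embedded into 5-dimensional vectors
embedding = nn.Embedding(num_embeddings=10, embedding_dim=5)

# The indexed phrase [1, 2] (e.g., "hello world") becomes a 2x5 matrix of dense vectors
vectors = embedding(torch.tensor([1, 2]))
print(vectors.shape)  # torch.Size([2, 5])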

By linking token indexing to word embeddings, machine learning models gain the ability to generalize and make sense of complex linguistic patterns. Pretrained embeddings, such as Word2Vec, GloVe, or contextual embeddings from Transformer models like BERT and GPT, enable models to leverage large amounts of text data to learn meaningful word associations. Ultimately, token indexing serves as the foundation for powerful NLP applications, including text classification, sentiment analysis, machine translation, and chatbots, making it a fundamental component of modern natural language processing.

Setting up the Notebook¶

Make Required Imports¶

This notebook requires importing several Python packages as well as additional Python modules that are part of the repository. If a package is missing, use your preferred package manager (e.g., conda or pip) to install it. If the code cell below runs without any errors, all required packages and modules have been successfully imported.

In [1]:
from src.utils.libimports.textidx import *
from src.utils.data.files import *
from src.text.vectorizing.vocab import *

Download Required Data¶

Some code examples in this notebook use data that first needs to be downloaded by running the code cell below. If this code cell throws any error, please check in the configuration file config.yaml whether the URL for downloading datasets is up to date and matches the one on Github. If not, simply download or pull the latest version from Github.

In [2]:
sentences_pos, _ = download_dataset("text/classification/sentence-polarity/sentence-polarity.pos")
sentences_neg, _ = download_dataset("text/classification/sentence-polarity/sentence-polarity.neg")
File 'data/datasets/text/classification/sentence-polarity/sentence-polarity.pos' already exists (use 'overwrite=True' to overwrite it).
File 'data/datasets/text/classification/sentence-polarity/sentence-polarity.neg' already exists (use 'overwrite=True' to overwrite it).

This notebook also generates new data and saves it as files into a specified output folder. You can change the default output folder in the code cell below to customize your file management.

In [3]:
output_folder = create_folder("data/generated/")

Example Dataset¶

We start with a very small example dataset and later look at an (arguably still small) real-world dataset. This allows us to clearly see and understand all the involved steps and to inspect the intermediate results after each step. For example, given the very small dataset size, we can actually print the complete vocabulary.

Create Simple Classification Dataset¶

The array dataset_news below represents a very simple classification dataset for training a model to predict whether a sentence from a news article should be labeled "politics" or "sports". Since we don't actually train a model here but rather follow each step in detail, this dataset contains only 7 samples.

In [4]:
dataset_news = [
    ("The mayor was elected for this term and the next term.", "politics"),
    ("A mayor's goal for the next term is to win.", "politics"),
    ("The goal for this term was to win the vote.", "politics"),
    ("This term's goals are next term's goals.", "politics"),
    ("The goal of any team player is the win.", "sports"),
    ("A win for the team is a win for each player.", "sports"),
    ("Players vote other players for another term.", "sports")
]

inputs_news = [ tup[0] for tup in dataset_news ]
targets_news = [ tup[1] for tup in dataset_news ]

Prepare Class Labels¶

Assuming that $C$ denotes the number of classes, most algorithms for training classification models in existing libraries — including neural network libraries such as PyTorch — assume that all classes are labeled from $0$ to $C-1$. Right now, our classes are labeled with the strings "sports" and "politics", so we need to convert those labels to adhere to the required format. Since we only have 2 classes in our example, we can do this "manually" by creating two dictionaries that map each string class label to a valid integer class label:

In [5]:
label2index_news = { "politics": 0, "sports": 1 }
index2label_news = { 0: 'politics', 1: 'sports' }

While this works just fine, we have to be careful to avoid mistakes, particularly when the number of classes increases. For one, we have to ensure that the mapping is consistent in both directions. For example, if "politics" maps to $0$, we have to make sure that $0$ maps back to "politics"; this must hold for all $C$ classes. We also have to make sure that we do not map two different class labels to the same index — although this would throw an error when trying to create the dictionary to map from indices back to the class labels since an index can only map to one label.
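If we do create these two dictionaries by hand, a quick round-trip check, sketched below, can catch inconsistent mappings early:

# Sanity check: both dictionaries must have the same size, and every label must
# map back to itself when going through both dictionaries
assert len(label2index_news) == len(index2label_news)
for label, index in label2index_news.items():
    assert index2label_news[index] == label, f"Inconsistent mapping for '{label}'"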

To avoid these kinds of mistakes, we therefore do not create these two mappings manually but automatically. For this, let's first identify the unique set of labels for a given dataset, which is just two class labels for our very simple news dataset. We already have all labels for each data sample as a list. To get all unique labels, we can simply convert this list to a set.

In [6]:
labels_news = set(targets_news)

print(labels_news)
{'sports', 'politics'}

To create a unique index from $0$ to $C-1$ for each class label, we can utilize the built-in enumerate() function to make life easy. It adds a counter to an iterable (e.g., a set, list, tuple, or dictionary) and returns it as an enumerate object, which can be directly converted into a list or iterated through in a loop. It is commonly used in for loops to simultaneously access both the index and the value of each item in the iterable. This improves code readability and eliminates the need to manually maintain a counter when iterating over collections.
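For illustration, this is what enumerate() yields for our set of labels (since sets are unordered, the exact order is not guaranteed):

print(list(enumerate(labels_news)))
# e.g., [(0, 'sports'), (1, 'politics')]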

The code cell below also uses the nice concept of dictionary comprehension in Python. Dictionary comprehensions provide a concise and elegant way to create dictionaries by combining loops and conditional logic in a single line of code. This approach is more readable and efficient compared to using explicit loops for dictionary creation.

In [7]:
label2index_news = { label:index for index, label in enumerate(set(targets_news)) }

print(label2index_news)
{'sports': 0, 'politics': 1}

Notice that this automated approach ensures that we do not accidentally map two different class labels to the same index. The only minor limitation is that we cannot specify which label gets mapped to which index, and vice versa. However, in practice, this basically never matters; it is only important that the indices are between $0$ and $C-1$, which is easily achieved. Of course, if the exact mapping does matter, this can still easily be achieved, for example, by manually creating a list of all class labels in the desired order.
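As a small sketch, sorting the labels first yields a deterministic (here: alphabetical) assignment; the variable label2index_fixed below is only for illustration and not used later:

# Sorting the labels makes the label-to-index assignment deterministic
label2index_fixed = { label: index for index, label in enumerate(sorted(labels_news)) }
print(label2index_fixed)  # {'politics': 0, 'sports': 1}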

We can now map from our string class labels to integer class labels, but we still lack the opposite direction that allows us to map back from the indices to the original class labels. To ensure that both mappings are consistent, we can create the mapping from indices to strings by "reversing" the label2index_news dictionary. Again, using dictionary comprehension, this only takes a single line of code:

In [8]:
index2label_news = { v:k for k,v in label2index_news.items() }

print(index2label_news)
{0: 'sports', 1: 'politics'}

With these two dictionaries, we can now map back and forth between the original string labels and the index labels that serve as input for a classification algorithm. Thus, to convert our list of string labels for all data samples to a vector of label indices, we can use a list comprehension together with the dictionary label2index_news:

In [9]:
target_vector_news = [ label2index_news[label] for label in targets_news ]

print(f"{targets_news} ==> {target_vector_news}")
['politics', 'politics', 'politics', 'politics', 'sports', 'sports', 'sports'] ==> [1, 1, 1, 1, 0, 0, 0]

Once a model is trained with these integer class labels, it will also output integer class labels as its predictions. If we want to convert these integers back to the original string class labels, we can use the dictionary index2label_news. For example, if we assume that the model returns 1 as the predicted class label for some input text, we can simply get the string label using:

In [10]:
print(index2label_news[1])
politics

With respect to the class labels, we are done. The more interesting step is to convert our sentences.

Create Vocabulary & Mappings¶

Similar to the class labels, the overall goal is to map each unique word/token to a unique index, i.e., an integer identifier. And also similar to the labels, given a vocabulary size of $V$, these unique indices must be in the range from $0$ to $V-1$. However, compared to the list of class labels, which we could directly convert to a set, the inputs require some additional steps; more specifically:

  • Create vocabulary: Like for the class labels, the vocabulary is simply the set of unique words — strictly speaking, unique tokens, as the vocabulary may contain punctuation marks, numbers, or other non-standard words. However, in practice, we also often want to count the number of occurrences of each token so that we can potentially limit the vocabulary to the most common tokens.

  • Create mappings: Once we have the vocabulary, we again need to map each token to a unique index, and vice versa. Of course, we again have to make sure that the mappings are consistent. We therefore create both mappings the same way we did for the class labels.

  • Additional considerations: Converting between tokens and indices may involve various application-dependent considerations. The most common one is that we need to decide how to map unseen tokens to a meaningful index. Not only do we typically limit the size of the vocabulary, but even if we did not, a new input text may always contain tokens not present in the vocabulary.

Well, let's do this...

Create Vocabulary¶

Compute Token Frequencies¶

As already mentioned, we often want to filter out (very) rare tokens from the vocabulary. We therefore need to first compute the number of occurrences of all tokens/words in our corpus. Using the Counter class from Python's collections library makes this very easy. This is a specialized dictionary subclass designed to count the occurrences of elements in an iterable, such as a list or string. It stores elements as dictionary keys and their counts as values, making it easy to tally frequencies.

Important: In this step, we also need to decide how we want to preprocess or filter the input text. For example, in the code cell below, we perform case-folding (i.e., converting all words to lowercase). In contrast, we do not perform stemming or lemmatization, and we keep all stopwords, punctuation marks, etc. Which preprocessing steps are performed strongly depends on the exact application context and other assumptions.

In [11]:
# Auxiliary method to preprocess a text string
def preprocess(text):
    return [token.text.lower() for token in nlp(text) ]
    

# Create counter (a specialized dictionary)
token_counter_news = Counter()

for text in inputs_news:
    for token in preprocess(text):
        token_counter_news[token] += 1
        
print(token_counter_news)        
Counter({'the': 8, 'term': 7, '.': 7, 'for': 6, 'win': 5, 'this': 3, 'next': 3, 'a': 3, "'s": 3, 'goal': 3, 'is': 3, 'mayor': 2, 'was': 2, 'to': 2, 'vote': 2, 'goals': 2, 'team': 2, 'player': 2, 'players': 2, 'elected': 1, 'and': 1, 'are': 1, 'of': 1, 'any': 1, 'each': 1, 'other': 1, 'another': 1})
Sort Tokens by Frequency¶

By default, the Counter class does not guarantee that the entries are sorted with respect to their counts. So let's create a sorted list of tuples, where each tuple contains a token and its number of occurrences. As usual, Python makes such things exceedingly easy: a single line of code using the built-in sorted() function suffices. It returns a new sorted list from the elements of any iterable, such as lists, tuples, or strings, without modifying the original iterable. The syntax is sorted(iterable, key=None, reverse=False), where key is an optional parameter specifying a function to determine the sorting criteria, and reverse is a Boolean indicating whether to sort in descending order (default is False for ascending order).

In [12]:
# Sort by word frequency
token_counter_news_sorted = sorted(token_counter_news.items(), key=lambda x: x[1], reverse=True)

print("Number of tokens: {}".format(len(token_counter_news_sorted)))
print(token_counter_news_sorted)
Number of tokens: 27
[('the', 8), ('term', 7), ('.', 7), ('for', 6), ('win', 5), ('this', 3), ('next', 3), ('a', 3), ("'s", 3), ('goal', 3), ('is', 3), ('mayor', 2), ('was', 2), ('to', 2), ('vote', 2), ('goals', 2), ('team', 2), ('player', 2), ('players', 2), ('elected', 1), ('and', 1), ('are', 1), ('of', 1), ('any', 1), ('each', 1), ('other', 1), ('another', 1)]
Limit Vocabulary¶

Since we now have a list where the tokens are sorted with respect to their number of occurrences in descending order, we can easily keep only the k most frequent tokens. Since we only have 27 tokens in our "full" vocabulary, let's consider the top-25 most frequent tokens; see the code cell below. In practice, a "full" vocabulary can easily contain several hundred thousand tokens, with only the top-20k to top-50k tokens being considered.

In [13]:
TOP_TOKENS_NEWS = 25

token_counter_news_sorted_filtered = token_counter_news_sorted[:TOP_TOKENS_NEWS]

print("Number of tokens: {}".format(len(token_counter_news_sorted_filtered)))
print(token_counter_news_sorted_filtered)
Number of tokens: 25
[('the', 8), ('term', 7), ('.', 7), ('for', 6), ('win', 5), ('this', 3), ('next', 3), ('a', 3), ("'s", 3), ('goal', 3), ('is', 3), ('mayor', 2), ('was', 2), ('to', 2), ('vote', 2), ('goals', 2), ('team', 2), ('player', 2), ('players', 2), ('elected', 1), ('and', 1), ('are', 1), ('of', 1), ('any', 1), ('each', 1)]
Create Final Vocabulary¶

To get the final vocabulary, i.e., only the list or set of tokens we want to consider, we can use another list comprehension to extract all the tokens from the (token, count)-pairs:

In [14]:
tokens_news = [ tup[0] for tup in token_counter_news_sorted_filtered ]

print(tokens_news)
['the', 'term', '.', 'for', 'win', 'this', 'next', 'a', "'s", 'goal', 'is', 'mayor', 'was', 'to', 'vote', 'goals', 'team', 'player', 'players', 'elected', 'and', 'are', 'of', 'any', 'each']

Create Mappings¶

Special Tokens¶

Many neural network models working with text require or benefit from special tokens beyond the tokens in the vocabulary derived from a training dataset. Special tokens such as SOS (start of sequence), EOS (end of sequence), PAD (padding), UNK (unknown), SEP (separator), and CLS (classification) play critical roles in preparing text data for neural networks, particularly in models for tasks like text generation, machine translation, and language understanding.

  • SOS and EOS mark the start and end of a sequence, helping models learn the boundaries of input or output text. These tokens guide sequence models in identifying where text begins and ends, which is crucial for tasks such as translation and text generation.
  • PAD tokens are used to standardize the lengths of sequences by padding shorter sequences to match the length of the longest one in a batch, facilitating efficient batch processing.
  • UNK represents out-of-vocabulary words not present in the training set, allowing models to handle unseen words during inference.

In transformer-based models like BERT, SEP separates segments of text within a single input sequence, enabling the model to distinguish between different sentences or clauses. CLS is a special token placed at the beginning of the input and is used to aggregate the representation of the entire sequence for classification tasks. These special tokens help structure input text, improve processing efficiency, and enhance the ability to capture semantic information for various NLP tasks.
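To make these roles a bit more concrete, the small sketch below (purely illustrative and not used later in this notebook) shows typical placements of such tokens around a tokenized sentence, using the angle-bracket naming we adopt below:

# Purely illustrative: typical placements of special tokens around a token list
tokens = ["the", "mayor", "was", "elected"]

# Sequence-to-sequence style: mark the start and end of the sequence
seq2seq_input = ["<SOS>"] + tokens + ["<EOS>"]

# BERT-style classification input: CLS in front, SEP at the end of a segment
bert_style_input = ["<CLS>"] + tokens + ["<SEP>"]

print(seq2seq_input)
print(bert_style_input)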

These are the most common special tokens, but others exist; which special tokens are indeed needed depends on the exact model and task. For our example here, let's consider all six special tokens we have just described by creating a list containing them:

In [15]:
TOKEN_PAD, TOKEN_UNK, TOKEN_SOS, TOKEN_EOS, TOKEN_SEP, TOKEN_CLS = "<PAD>", "<UNK>", "<SOS>", "<EOS>", "<SEP>", "<CLS>"

SPECIAL_TOKENS = [TOKEN_PAD, TOKEN_UNK, TOKEN_SOS, TOKEN_EOS, TOKEN_SEP, TOKEN_CLS]

Note that the exact strings representing the special tokens do not matter. It is only important to ensure that none of the special tokens is likely to already exist in our initial vocabulary derived from the training data. For this reason, we add angled brackets to each token, since tokens such as "pad" or "sos" may already occur in the dataset. Still, instead of <PAD>, we could also use [PAD], <PADDING>, ###PAD###, and so on. In short, there is nothing intrinsically special about the strings for the special tokens. However, our choices above adhere to best practices used in many implementations and educational materials.

Side note: While not mandatory, we use <PAD> as the first special token in the list above. This ensures, in a moment, that <PAD> will be mapped to the index $0$. This index is commonly assumed to be the index for padding tokens. However, it is generally not mandatory, since one can explicitly specify which index indicates padding.
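For example, in PyTorch (not imported in this notebook, so the snippet below is only a sketch with hypothetical sizes), the padding index can be passed explicitly to the embedding layer; the vector at that index is then kept at zero and excluded from gradient updates:

import torch.nn as nn

# Sketch only: an embedding layer that treats index 0 as the padding index
vocab_size, embedding_dim = 100, 8   # hypothetical sizes
embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)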

Token2Index & Index2Token Mappings¶

With our initial vocabulary and the list of special tokens, we can create a mapping from tokens to unique indices the same way as we did for the class labels, using a dictionary comprehension. Note that the dictionary comprehension iterates over the concatenation of the list of special tokens and the list of vocabulary tokens. We chose this order to ensure that the special tokens get the lowest indices, particularly so that <PAD> will get the index $0$. Again, this is not mandatory but a very common best practice.

In [16]:
token2index_news = { token:index for index, token in enumerate(SPECIAL_TOKENS + tokens_news) }

print(token2index_news)
{'<PAD>': 0, '<UNK>': 1, '<SOS>': 2, '<EOS>': 3, '<SEP>': 4, '<CLS>': 5, 'the': 6, 'term': 7, '.': 8, 'for': 9, 'win': 10, 'this': 11, 'next': 12, 'a': 13, "'s": 14, 'goal': 15, 'is': 16, 'mayor': 17, 'was': 18, 'to': 19, 'vote': 20, 'goals': 21, 'team': 22, 'player': 23, 'players': 24, 'elected': 25, 'and': 26, 'are': 27, 'of': 28, 'any': 29, 'each': 30}

With this first dictionary mapping from tokens to indices, we can create the dictionary mapping the indices to tokens by reversing token2index_news using a dictionary comprehension. This ensures that both mappings are consistent using a single line of code.

In [17]:
index2token_news = { v:k for k,v in token2index_news.items() }

print(index2token_news)
{0: '<PAD>', 1: '<UNK>', 2: '<SOS>', 3: '<EOS>', 4: '<SEP>', 5: '<CLS>', 6: 'the', 7: 'term', 8: '.', 9: 'for', 10: 'win', 11: 'this', 12: 'next', 13: 'a', 14: "'s", 15: 'goal', 16: 'is', 17: 'mayor', 18: 'was', 19: 'to', 20: 'vote', 21: 'goals', 22: 'team', 23: 'player', 24: 'players', 25: 'elected', 26: 'and', 27: 'are', 28: 'of', 29: 'any', 30: 'each'}
Additional Considerations: Handling Unknown Words¶

With the dictionary token2index_news, we can now map from tokens to their respective indices. However, this only works if the token is either a known special token or a token from the initial vocabulary. Otherwise, the token is not a valid key in our dictionary, and trying to access it would result in an error. We therefore need to handle this case meaningfully. This is where the special token <UNK> comes into play. Any time we encounter a token that is not in our vocabulary, we treat it as "unknown" and map it to the index of <UNK>. We consider this the default index for unknown tokens, and we have to explicitly specify it for later use:

In [18]:
default_index = token2index_news[TOKEN_UNK]

print(default_index)
1
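With this default index at hand, a safe lookup for a single token can already be done using the dictionary's get() method, which falls back to the default index for unknown tokens:

print(token2index_news.get("mayor", default_index))      # 17 (known token)
print(token2index_news.get("president", default_index))  # 1  (unknown token, mapped to <UNK>)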

Working with the Vocabulary & Mappings¶

With the vocabulary and both mappings created, we can now use them for preparing our text data to serve as input for neural networks. For example, let's see which indices have been assigned to our special tokens. This task comes down to a simple lookup in the dictionary token2index_news.

In [19]:
for special_token in SPECIAL_TOKENS:
    token_index = token2index_news[special_token]
    print("The index of {} in the vocabulary is: {}".format(special_token, token_index))
The index of <PAD> in the vocabulary is: 0
The index of <UNK> in the vocabulary is: 1
The index of <SOS> in the vocabulary is: 2
The index of <EOS> in the vocabulary is: 3
The index of <SEP> in the vocabulary is: 4
The index of <CLS> in the vocabulary is: 5

As you can see, the indices reflect the order of our special tokens in the list SPECIAL_TOKENS. And as mentioned above, having <PAD> mapped to $0$ is often convenient in practice. Of course, we can also map back from indices to the tokens. Let's do this for the first 10 indices.

In [20]:
for idx in range(10):
    print("Token at index {}: {}".format(idx, index2token_news[idx]))
Token at index 0: <PAD>
Token at index 1: <UNK>
Token at index 2: <SOS>
Token at index 3: <EOS>
Token at index 4: <SEP>
Token at index 5: <CLS>
Token at index 6: the
Token at index 7: term
Token at index 8: .
Token at index 9: for

In practice, we don't need to do this for each individual token or index. It's much more convenient to do the encoding for a complete list of tokens. Here, we have to address the case that a token might not have been in our vocabulary and is therefore not a key in the dictionary token2index_news. Recall that we want to map to the index of the special token <UNK> (unknown) in this case. For convenience and easy re-use, let's implement this using an auxiliary method encode():

In [21]:
def encode(tokens: list[str]):
    return [ token2index_news[t] if t in token2index_news else default_index for t in tokens ]

Let's first apply this method to a list of tokens where all tokens are in our vocabulary.

In [22]:
encode(['the', 'mayor', 'was', 'elected'])
Out[22]:
[6, 17, 18, 25]

The code cell below shows an example of what happens if the input list contains a previously unseen token (here: president). As required, unseen tokens get replaced by the index representing our special token <UNK>, which is 1 in our case here.

In [23]:
encode(['the', 'president', 'was', 'elected'])
Out[23]:
[6, 1, 18, 25]

Similarly, we also want a method that decodes a list of indices back to their respective tokens; the method decode() below accomplishes this. Notice that this method also takes a default_token as an input parameter in case of an unknown index. When decoding the output of a neural network, this case should never happen, as the network will never return an unknown index. Still, users can call this method manually with arbitrary indices, and we should handle this case gracefully without throwing an error.

In [24]:
def decode(indices: list[int], default_token="<???>"):
    return [ index2token_news[i] if i in index2token_news else default_token for i in indices ]

The code cell below shows the basic use of this method; here, all indices are mapped to existing tokens.

In [25]:
decode([6, 17, 18, 25, 3])
Out[25]:
['the', 'mayor', 'was', 'elected', '<EOS>']

Of course, if the list of indices contains an index that is not known, it maps to the specified default_token.

In [26]:
decode([6, 99, 18, 25, 3])
Out[26]:
['the', '<???>', 'was', 'elected', '<EOS>']

Vectorize Corpus¶

Finally, we can now vectorize our text documents (i.e., our news article sentences) by simply applying the encode() method to each sentence in our dataset.

In [27]:
input_vectors_news = [ encode(preprocess(text)) for text in inputs_news ]

input_vectors_news
Out[27]:
[[6, 17, 18, 25, 9, 11, 7, 26, 6, 12, 7, 8],
 [13, 17, 14, 15, 9, 6, 12, 7, 16, 19, 10, 8],
 [6, 15, 9, 11, 7, 18, 19, 10, 6, 20, 8],
 [11, 7, 14, 21, 27, 12, 7, 14, 21, 8],
 [6, 15, 28, 29, 22, 23, 16, 6, 10, 8],
 [13, 10, 9, 6, 22, 16, 13, 10, 9, 30, 23, 8],
 [24, 20, 1, 24, 9, 1, 7, 8]]

Important: While each sentence is now a list of indices (i.e., integer values) and thus, strictly speaking, a vector, the indices are still just labels and carry no semantic meaning. As such, it would not make sense, for example, to compute the vector similarity between those vectors (even when assuming both vectors have the same length). However, this is the default representation for text input. Word semantics are added using a word embedding layer, covered in a later notebook.
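As a brief sketch of how such variable-length index vectors are typically prepared for batching (the padded list below is only illustrative and not used later), shorter vectors can be right-padded with the <PAD> index:

# Pad all vectors to the length of the longest one using the <PAD> index (0)
pad_index = token2index_news[TOKEN_PAD]
max_len = max(len(vector) for vector in input_vectors_news)

input_vectors_news_padded = [
    vector + [pad_index] * (max_len - len(vector)) for vector in input_vectors_news
]

print(input_vectors_news_padded[-1])  # last sentence, padded from 8 to 12 indices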

Practical Application¶

Class Implementation¶

These basic steps of (a) creating the vocabulary, (b) creating the mappings between tokens and indices, and (c) handling any additional considerations (e.g., the default index) are very common and almost always exactly the same. It is therefore very useful to consolidate all these steps into their own class for easy re-use. We did so by implementing a class Vocabulary, which you can find in the file src/text/vectorizing/vocab.py. At its core, this class implements all the methods we have seen so far. So let's use our example dataset of news article sentences to see how we can use this class.

In [28]:
vocabulary_news = Vocabulary(tokens_news, SPECIAL_TOKENS)

vocabulary_news.set_default_index(vocabulary_news[TOKEN_UNK])

As this class implements the encode() method we have seen before, we can use it to encode a list of input tokens.

In [29]:
print(vocabulary_news.encode(['the', 'president', 'was', 'elected']))
[ 6  1 18 25]

Conversely, we can use the class method decode() to map a list of indices back to their respective tokens.

In [30]:
print(vocabulary_news.decode([6, 17, 18, 25, 3]))
['the', 'mayor', 'was', 'elected', '<EOS>']

If you think about it, our initial set of string class labels is also just a vocabulary; for our news dataset, it contains only the two tokens "sports" and "politics". As such, we can use the Vocabulary class not only for the tokens but also for the class labels. With this, we can directly benefit from the encode() and decode() methods provided by the class without requiring additional code for handling the class labels:

In [31]:
vocabulary_news_targets = Vocabulary(labels_news)

print(vocabulary_news_targets.encode(["sports", "sports", "politics"]))

print(vocabulary_news_targets.decode([1, 0, 1]))
[0 0 1]
['politics', 'sports', 'politics']

Save Vectorized Dataset & Vocabularies¶

In practice, we often deal with very large datasets. This means that creating the vocabulary and vectorizing the corpus can take a significant amount of time — note that this also includes any potentially time-consuming preprocessing. It is therefore common to treat this as an individual step and save the vectorized dataset to be used for training later.

Save Inputs & Targets¶

In the code cell below, we loop through all sentences in our toy corpus, vectorize each sentence, and save the resulting sequence together with the class label directly to a file as a new line. This has the advantage that there is no need to keep the whole vectorized dataset in memory.

Side note: In the code cells below, we use a naming scheme that reflects the number of tokens in the vocabulary (excluding the special tokens). Such naming schemes can be useful when the same raw input data gets converted into different datasets using different preprocessing steps or vocabulary settings.

In [32]:
output_file = open(f"{output_folder}toy-news-dataset-vectors-{TOP_TOKENS_NEWS}.txt", "w")

for idx, text in enumerate(inputs_news):
    # Get label
    label = vocabulary_news_targets.encode([targets_news[idx]])[0]
    # Get sentence and vectorize it using the vocabulary
    vector = vocabulary_news.encode(preprocess(text))
    # Write sequence and label to file (separate sequence and label using a tab)
    output_file.write(f"{' '.join([str(idx) for idx in vector])}\t{label}\n")
        
output_file.flush()
output_file.close()
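For later use, the saved file can be read back line by line; the sketch below is not part of the original workflow and simply reverses the tab-separated format written above:

# Sketch: read the vectorized dataset back from the tab-separated file written above
with open(f"{output_folder}toy-news-dataset-vectors-{TOP_TOKENS_NEWS}.txt") as in_file:
    for line in in_file:
        vector_str, label_str = line.rstrip("\n").split("\t")
        vector = [int(idx) for idx in vector_str.split()]
        label = int(label_str)
        print(vector, label)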
Save Vocabularies¶

We also need to save the two vocabularies to preserve the mappings between the tokens/labels and their indices. Without them, we would only have a dataset of integer sequences without knowing which tokens or class labels those integers represent. We could still train a model — after all, this is why we vectorize the dataset to begin with — however, we could then not predict the class labels for new sentences, since we would not know how to vectorize those new sentences. For this, we can simply use the pickle library for Python. This library is used for serializing and deserializing Python objects, meaning it converts Python objects into a byte stream (serialization) and reconstructs them back into their original form (deserialization). This is useful for saving objects to a file, sending data over a network, or storing complex data structures for later use. The pickle module supports a wide range of Python data types, including lists, dictionaries, and even custom objects.

The dump() method in the pickle library takes two main arguments: the object to be serialized and the file where the serialized data should be stored. The syntax is pickle.dump(obj, file, protocol=None), where protocol specifies the serialization format (defaulting to the latest protocol available).

In [33]:
with open(f"{output_folder}toy-news-dataset-{TOP_TOKENS_NEWS}.vocab", 'wb') as out_file:
    pickle.dump(vocabulary_news, out_file)

with open(f"{output_folder}/toy-news-dataset-targets-{TOP_TOKENS_NEWS}.vocab", 'wb') as out_file:
    pickle.dump(vocabulary_news_targets, out_file)

Of course, once saved, we can load both vocabularies using the corresponding load() method. The syntax is pickle.load(file), where file is the binary file containing the serialized data. This method reads the stored byte stream and converts it back into the original Python object.

In [34]:
with open(f"{output_folder}toy-news-dataset-{TOP_TOKENS_NEWS}.vocab", "rb") as in_file:
    vocabulary_news = pickle.load(in_file)

with open(f"{output_folder}/toy-news-dataset-targets-{TOP_TOKENS_NEWS}.vocab", "rb") as in_file:
    vocabulary_news_targets = pickle.load(in_file)

Real-World Dataset¶

The Sentence Polarity Dataset is a popular benchmark for binary text classification tasks, particularly for sentiment analysis. It consists of a total of 10,662 positive and negative sentence samples extracted from movie reviews. The dataset contains two classes: one for positive sentiment and another for negative sentiment, with no neutral examples. Each sentence is labeled according to its sentiment polarity — either positive or negative. This dataset is commonly used for training and evaluating machine learning models to understand and classify the sentiment expressed in short text snippets. It serves as a foundation for various NLP applications, including sentiment analysis tools, recommendation systems, and opinion mining. The binary classification task aims to correctly predict whether a given sentence conveys a positive or negative sentiment based solely on its textual content.

The sentences are split across two files according to their sentiment. As such, the sentiment labels can be derived from the file names, more specifically, from their extensions .pos and .neg. We have downloaded both files at the beginning of the notebook and have both file names stored in the variables sentences_pos and sentences_neg.

Read Files & Compute Word Frequencies¶

The first step is again to go through the whole corpus and count the number of occurrences of each token. For really large corpora, this can actually take quite some time. 10k sentences are basically nothing these days, but the purpose of this notebook is not to focus on large-scale data, as the steps would be exactly the same.

In [35]:
token_counter_polarity = Counter()

targets_polarity = []

with tqdm(total=10662) as pbar:
    # Loop over all file names
    for file_name in [sentences_pos, sentences_neg]:
        # Get sentiment label from file name extensions
        label = file_name.split(".")[-1]
        # Loop over each sentence (1 sentence per line)
        with open(file_name) as file:
            for line in file:
                # Update token counts
                for token in preprocess(line):
                    token_counter_polarity[token] += 1            
                # Add label to targets list
                targets_polarity.append(label)
                # Update progress bar
                pbar.update(1)

# Identify set of unique class labels
labels_polarity = set(targets_polarity)
100%|████████████████████████████████████| 10662/10662 [00:53<00:00, 199.75it/s]

Prepare Class Labels¶

When preprocessing the sentences, we also create a list containing the class label of each sentence. We can use this list of labels as input for the Vocabulary class to get our mapping from class labels to class indices, and vice versa.

In [36]:
vocabulary_polarity_targets = Vocabulary(labels_polarity)

print(vocabulary_polarity_targets.token2index)
print(vocabulary_polarity_targets.index2token)
{'neg': 0, 'pos': 1}
{0: 'neg', 1: 'pos'}

Create Vocabulary¶

To create our vocabulary object, we perform exactly the same steps as above. The only difference is that our "full" vocabulary is now larger (although, with less than 20k tokens, still rather small). We therefore limit the vocabulary here to the 10,000 most frequent tokens. We also combine the steps of sorting and filtering the tokens in the same code cell; we already saw what the individual steps do for our simple news article dataset.

In [37]:
# Sort by token frequency
token_counter_polarity_sorted = sorted(token_counter_polarity.items(), key=lambda x: x[1], reverse=True)

# Limit number of tokens to the top-10000 most frequent tokens
TOP_TOKENS_POLARITY = 10000
token_counter_polarity_sorted_filtered = token_counter_polarity_sorted[:TOP_TOKENS_POLARITY]

# Extract final list of tokens
tokens_polarity = [ tup[0] for tup in token_counter_polarity_sorted_filtered ]

With tokens_polarity containing all the tokens we want to capture in our vocabulary, we create a Vocabulary instance using this list. To keep it simple and consistent, we also include the same list of special tokens as before. We also should not forget to set the default index so that unknown tokens are handled appropriately when encoding an input text.

In [38]:
vocabulary_polarity = Vocabulary(tokens_polarity, special_tokens=SPECIAL_TOKENS)

vocabulary_polarity.set_default_index(vocabulary_polarity[TOKEN_UNK])

For illustration, we can use this vocabulary to encode an example sentence to its corresponding list of token indices.

In [39]:
print(vocabulary_polarity.encode(["the", "movie", "was", "not", "that", "good", ",", "but", "i", "left", "the", "cinema", "entertained", "."]))
[   8   26  106   34   19   62    9   21   50  490    8  257 2554    6]

Save Dataset & Vocabularies¶

Lastly, for later use, we can again save all the data to files.

Vectorize and Save Dataset¶

Like before, we use the vocabulary to vectorize all sentences — i.e., convert each sentence to its corresponding list of token indices — and save all vectorized sentences, together with the transformed class labels, to a file.

In [40]:
output_file = open(f"{output_folder}polarity-dataset-vectors-{TOP_TOKENS_POLARITY}.txt", "w")

with tqdm(total=10662) as pbar:
    for file_name in [sentences_pos, sentences_neg]:
        # Get class label from file name    
        label_name = file_name.split(".")[-1]
        # Iterate over all sentences, vectorize and save them
        with open(file_name) as file:
            for line in file:
                label = vocabulary_polarity_targets.encode([label_name])[0]
                vector = vocabulary_polarity.encode(preprocess(line))
                output_file.write(f"{' '.join([str(idx) for idx in vector])}\t{label}\n")
                pbar.update(1)
            
output_file.flush()
output_file.close()            
100%|████████████████████████████████████| 10662/10662 [00:50<00:00, 213.08it/s]

Save Vocabularies¶

We need to remember both the mappings between the tokens and their indices and between the class labels and their indices. The easiest way to do this is once again to simply save both Vocabulary instances using the pickle library.

In [41]:
with open(f"{output_folder}polarity-dataset-{TOP_TOKENS_POLARITY}.vocab", "wb") as out_file:
    pickle.dump(vocabulary_polarity, out_file)

with open(f"{output_folder}polarity-dataset-targets-{TOP_TOKENS_POLARITY}.vocab", 'wb') as out_file:
    pickle.dump(vocabulary_polarity_targets, out_file)

Summary¶

In machine learning, particularly in natural language processing (NLP), converting text into sequences of token IDs based on a predefined vocabulary is essential for training and deploying models. Since machine learning algorithms work with numerical data, raw text must be transformed into a numerical representation that models can process effectively. Tokenization, followed by mapping tokens to unique numerical IDs, enables this conversion while preserving the structure and meaning of the text.

A vocabulary serves as a reference that assigns a unique integer ID to each token (word, subword, or character) found in the dataset. This process helps standardize input data and ensures consistency in how text is represented across different models and tasks. Using token IDs instead of raw words reduces memory consumption and computational complexity, allowing models to handle large-scale text data more efficiently. Additionally, assigning token IDs enables the use of embedding layers, where words with similar meanings can be mapped to nearby points in a continuous vector space, improving model performance.

The importance of converting text into token sequences extends to various NLP applications, such as text classification, machine translation, and sentiment analysis. Pretrained language models like BERT and GPT use subword tokenization techniques (e.g., WordPiece or Byte Pair Encoding) to handle rare and out-of-vocabulary words, ensuring that even unknown words are decomposed into meaningful subunits. This enhances model generalization and allows it to process diverse linguistic patterns more effectively.

Moreover, tokenization and vocabulary-based encoding play a critical role in sequence-based models like recurrent neural networks (RNNs) and transformers. These models rely on structured numerical input to learn contextual relationships between words. The use of token IDs also enables batching, padding, and attention mechanisms, which are crucial for efficient training and inference. Without proper text-to-token conversion, NLP models would struggle to learn meaningful representations, leading to suboptimal performance.

In summary, converting text into token sequences using a vocabulary is a foundational step in NLP and machine learning. It bridges the gap between human language and numerical computation, facilitating effective model training and deployment. By standardizing text representation, improving efficiency, and enabling better generalization, this process ensures that machine learning models can process and understand language in a structured and meaningful way.

In [ ]: