Disclaimer: This Jupyter Notebook contains content generated with the assistance of AI. While every effort has been made to review and validate the outputs, users should independently verify critical information before relying on it. The SELENE notebook repository is constantly evolving. We recommend downloading or pulling the latest version of this notebook from Github.

Working with Batches for Sequence Tasks¶

When working with sequential data in neural networks, such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs), one major problem arises when processing data in batches: sequences often have varying lengths. Unlike fixed-size inputs like images, text and time series data can have different lengths, making it difficult to create uniform batches for training. Since neural networks expect inputs of the same shape, handling sequences of different sizes requires special preprocessing techniques, such as padding or truncation.

Another challenge is maintaining the temporal dependencies in the data while batching multiple sequences together. RNNs, for example, process sequences step by step, maintaining a hidden state that carries information from previous time steps. When batching sequences, the model must ensure that dependencies between time steps are preserved, which becomes complicated if sequences in a batch have different lengths. Padding shorter sequences with zeros (or a special padding token in NLP) helps align batch sizes, but this can introduce redundant information that affects learning if not handled properly.

Additionally, in CNNs designed for sequence processing (such as 1D convolutional networks for text or time series), the issue of varying sequence lengths can complicate convolution operations. Since convolutional filters slide over fixed-size windows of input data, sequences of different lengths might require resizing, padding, or adaptive pooling strategies to ensure compatibility within a batch. If not handled properly, this can lead to loss of meaningful information or inefficient training due to excessive padding.

Lastly, batching sequential data can introduce inefficiencies in computation and memory usage. When padding is used to make all sequences in a batch the same length, the model must process padded elements even though they contain no useful information. This can lead to unnecessary computations and increased memory usage, slowing down training. Advanced techniques like masking (to ignore padded elements) and dynamic batching (grouping sequences of similar lengths together) help mitigate this issue, but they add complexity to model implementation.

Setting up the Notebook¶

Make Required Imports¶

This notebook requires the import of different Python packages as well as additional Python modules that are part of the repository. If a package is missing, use your preferred package manager (e.g., conda or pip) to install it. If the code cell below runs without any errors, all required packages and modules have been successfully imported.

In [1]:
from src.utils.libimports.varseq import *
from src.utils.sampling.batchsampler import EqualLengthsBatchSampler

Create Example Batches¶

Throughout this notebook we make use of two example datasets to illustrate how to handle sequences of variable length for different sequence tasks.

Text Classification¶

We first consider a many-to-one sequence task where the input for the neural network is a sequence (or batch of sequences!) and the output is a single value such as a class label. A very common example is text classification. In the code cell below, we create a simple batch where each entry is a tuple containing the input sequence and the class label. Each sequence represents a sentence, where the values are the token indices after converting all words in the sentence into their unique indices based on the vocabulary. For further processing, we also split the batch into the list of sequences and the list/array of class labels (i.e., the targets).

In [2]:
data_classification = [
    ([ 6, 17, 18, 25,  9, 11,  7, 26,  6, 12,  7,  8], 0),
    ([13, 17, 14, 15,  9,  6, 12,  7, 16, 19, 10,  8], 0),
    ([ 6, 15,  9, 11,  7, 18, 19, 10,  6, 20,  8], 0),
    ([11,  7, 14, 21, 27, 12,  7, 14, 21,  8], 0),
    ([ 6, 15, 28, 29, 22, 23, 16,  6, 10,  8], 1),
    ([13, 10,  9,  6, 22, 16, 13, 10,  9, 30, 23,  8], 1),
    ([24, 20,  1, 24,  9,  1,  7,  8], 1),
    ([12, 13,  4, 15, 18,  2,  4, 10,  8], 1)
]
# Extract all sequences and convert each sequence to a tensor of long values
sequences = [ torch.LongTensor(sample[0]) for sample in data_classification ]
# Extract targets (i.e., class labels for a binary classification task)
targets = torch.LongTensor([ sample[1] for sample in data_classification ])

Sequence-to-Sequence¶

For a second example dataset, we consider a many-to-many (or sequence-to-sequence) sequence task where both the inputs and the targets are sequences. A very common example is machine translation, where the input sequences are the sentences in the source language and the target sequences are the translated sentences in the target language. The code cell below creates a mock sequence-to-sequence (seq2seq) dataset for machine translation. Again, the sequence values may represent token indices based on the vocabulary; the index values carry no meaning here, as only the lengths of the input and target sequences matter. The code also extracts both the list of input sequences and the list of target sequences for further processing.

In [3]:
data_seq2seq = [
    ([1, 2, 3], [1, 2, 3, 4]),
    ([1, 2, 3], [1, 2, 3, 4]),
    ([1, 2, 3], [1, 2, 3, 4]),
    ([1, 2, 3], [1, 2, 3, 4]),
    ([1, 2, 3], [1, 2, 3, 4]),
    ([1, 2, 3], [1, 2, 3, 4]),
    ([1, 2, 3], [1, 2, 3, 4]),
    ([1, 2, 3, 4], [1, 2, 3, 4]),
    ([1, 2, 3, 4], [1, 2, 3, 4]),
    ([1, 2, 3, 4], [1, 2, 3, 4]),
    ([1, 2, 3, 4], [1, 2]),
    ([1, 2, 3, 4], [1, 2]),
    ([1, 2, 3, 4], [1, 2]),
    ([1, 2, 3, 4], [1, 2]),
    ([1, 2, 3, 4], [1, 2]),
    ([1, 2, 3, 4], [1, 2]),
    ([1, 2, 3], [1, 2, 3, 4, 5]),
    ([1, 2, 3], [1, 2, 3, 4, 5]),
    ([1, 2, 3], [1, 2, 3, 4, 5]),
    ([1, 2, 3], [1, 2, 3, 4, 5]),
    ([1, 2, 3], [1, 2, 3, 4, 5]),
    ([1, 2, 3], [1, 2, 3, 4, 5]),
    ([1, 2, 3], [1, 2, 3, 4, 5]),
    ([1, 2, 3], [1, 2, 3, 4, 5]),
    ([1, 2, 3], [1, 2, 3, 4, 5]),
]
# Extract input and target sequences
input_sequences  = [ tup[0] for tup in data_seq2seq ]
target_sequences = [ tup[1] for tup in data_seq2seq ]

Approach 1: Padding & Truncating¶

The minimum requirement for an input batch — i.e., a list of sequences grouped together — is that all sequences in the same batch have the same length. In the following, we use the length of a batch to mean the length of the sequences in that batch and the size of a batch to mean the number of sequences in that batch. We therefore want to create a batch of size 8 containing all sentences of the example classification dataset. Right now, most of those 8 sequences have different lengths, so we need to fix this.

Basic Padding¶

With the requirement that each batch may only contain sequences of the same length, we can utilize the PyTorch method pad_sequence. This method first finds the longest sequence(s) and then pads all shorter sequences using a specified value. The result, of course, is that all sequences have the same length. As such, the result is no longer a list of 8 1d tensors but a 2d tensor with a shape of (batch_size, max_seq_len), where batch_size is 8 (the number of sequences) and max_seq_len is 12 (the length of the longest sequence(s)).

In [4]:
sequences_padded = pad_sequence(sequences, batch_first=True, padding_value=0)

print(sequences_padded)
print(sequences_padded.shape)
tensor([[ 6, 17, 18, 25,  9, 11,  7, 26,  6, 12,  7,  8],
        [13, 17, 14, 15,  9,  6, 12,  7, 16, 19, 10,  8],
        [ 6, 15,  9, 11,  7, 18, 19, 10,  6, 20,  8,  0],
        [11,  7, 14, 21, 27, 12,  7, 14, 21,  8,  0,  0],
        [ 6, 15, 28, 29, 22, 23, 16,  6, 10,  8,  0,  0],
        [13, 10,  9,  6, 22, 16, 13, 10,  9, 30, 23,  8],
        [24, 20,  1, 24,  9,  1,  7,  8,  0,  0,  0,  0],
        [12, 13,  4, 15, 18,  2,  4, 10,  8,  0,  0,  0]])
torch.Size([8, 12])

Important: The padding_value cannot be chosen arbitrarily. Using the correct value for padding sequences as input for neural networks is crucial because improper padding can introduce unintended biases and negatively impact model performance. Padding is commonly used when dealing with variable-length sequences in tasks like natural language processing (NLP) and time-series analysis, ensuring that all input sequences have the same length for batch processing. However, if the padding value is not chosen correctly, the model may misinterpret it as meaningful data rather than a placeholder. For example, using a common word (e.g., "the" in NLP) as a padding token instead of a distinct padding value can lead to incorrect learning patterns.

Additionally, incorrect padding values can affect loss calculation and attention mechanisms in models like recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and transformers. Many neural networks use masking to ignore padding values during computations, ensuring that padded elements do not contribute to learning. If an improper padding value is used without proper masking, the model might allocate attention to irrelevant padded positions, reducing overall accuracy. Choosing a distinct padding token (e.g., zero for numerical sequences or a special <PAD> token for text) and correctly implementing masking techniques help the model focus on meaningful data while avoiding unnecessary computational overhead.

In our example use case here, we assume that the sequences of word/token indices were created using a vocabulary containing the special token <PAD> associated with the index $0$. We therefore have to pick $0$ as the padding value now. However, note that there is nothing special about the value $0$. If the index of <PAD> in the vocabulary had been $1455$, then we would have needed to use padding_value=1455 in the method pad_sequence(). It has become a best practice to use $0$ for the padding token, which is why we set up the vocabulary the way we do here.
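To make the role of the padding value and masking a bit more concrete, the short sketch below shows one common way a boolean padding mask could be derived from the padded batch created above, assuming (as we do here) that $0$ is the index of the <PAD> token. Masking itself is covered in a separate notebook, so this is only an illustration.

# A minimal sketch: derive a boolean mask from the padded batch, assuming 0 is
# the index of the <PAD> token (True marks real tokens, False marks padding)
padding_mask = (sequences_padded != 0)

print(padding_mask)
print(padding_mask.sum(dim=1))  # number of real (non-padded) tokens per sequence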

By default, the method pad_sequence() pads to the right. However, with PyTorch 2.5 or higher, you can also specify that sequences are padded to the left.

In [5]:
sequences_padded_left = pad_sequence(sequences, batch_first=True, padding_value=0, padding_side='left')

print(sequences_padded_left)
print(sequences_padded_left.shape)

# We only run this line since we want to assume right-padding for subsequent code cells
sequences_padded = pad_sequence(sequences, batch_first=True, padding_value=0, padding_side="right") # default side
tensor([[ 6, 17, 18, 25,  9, 11,  7, 26,  6, 12,  7,  8],
        [13, 17, 14, 15,  9,  6, 12,  7, 16, 19, 10,  8],
        [ 0,  6, 15,  9, 11,  7, 18, 19, 10,  6, 20,  8],
        [ 0,  0, 11,  7, 14, 21, 27, 12,  7, 14, 21,  8],
        [ 0,  0,  6, 15, 28, 29, 22, 23, 16,  6, 10,  8],
        [13, 10,  9,  6, 22, 16, 13, 10,  9, 30, 23,  8],
        [ 0,  0,  0,  0, 24, 20,  1, 24,  9,  1,  7,  8],
        [ 0,  0,  0, 12, 13,  4, 15, 18,  2,  4, 10,  8]])
torch.Size([8, 12])

The shape of the batch will again be (batch_size, max_seq_len) as we still only pad up to the length of the longest sequence.

The difference between left and right padding in neural networks refers to where padding tokens are inserted relative to the original sequence. Left padding (pre-padding) adds padding tokens to the beginning of the sequence, while right padding (post-padding) adds them to the end. For example, given a sequence [1, 2, 3], left padding to length 5 would result in [0, 0, 1, 2, 3], while right padding would give [1, 2, 3, 0, 0].

Right padding is more commonly used in neural network training, particularly with architectures like recurrent neural networks (RNNs), LSTMs, and transformers, because it allows the model to process the sequence in its natural order without interruption from padding tokens. Left padding may sometimes be used to align recent data points at the same position, such as in certain time-series tasks. Regardless of the padding method, models typically need masking mechanisms to ensure that padding tokens are ignored during training to avoid negatively impacting model performance.
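As a tiny sketch of this example, the two calls below use torch.nn.functional.pad (covered in more detail further down) to pad the sequence [1, 2, 3] to length 5 on the left and on the right, respectively.

# Pad the example sequence [1, 2, 3] to length 5 on either side
seq = torch.LongTensor([1, 2, 3])

print(torch.nn.functional.pad(seq, (2, 0), value=0))   # left padding:  tensor([0, 0, 1, 2, 3])
print(torch.nn.functional.pad(seq, (0, 2), value=0))   # right padding: tensor([1, 2, 3, 0, 0])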

Additional Steps for Convolutional Neural Networks (CNNs)¶

Apart from ensuring that all sequences in our batch have the same length, many network architectures such as Convolutional Neural Networks (CNNs) impose an additional requirement that all input sequences across all batches must have the same length. In the case of CNNs, this is to ensure that subsequent layers (after the Convolution Layer(s) and MaxPooling/AveragePooling layer(s)) receive inputs with the expected size.

Important: PyTorch also supports adaptive max pooling and adaptive average pooling, which allow specifying a fixed output size. Thus, the output size of the pooling layer does not depend on the input size. However, particularly for text, where the length of sequences can vary significantly, ensuring the same output size means that inputs are treated quite differently. To keep things straightforward here, we ignore adaptive pooling.
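Just to illustrate the claim that the output size does not depend on the input length, here is a small sketch; the batch size, channel count, and output size below are arbitrary values chosen for this illustration only.

# Adaptive pooling maps inputs of any length to a fixed output size
adaptive_pool = torch.nn.AdaptiveMaxPool1d(output_size=4)

for seq_len in (8, 12, 20):
    x = torch.randn(1, 16, seq_len)      # (batch_size, channels, seq_len)
    print(adaptive_pool(x).shape)        # always torch.Size([1, 16, 4])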

When we need to ensure that every batch has the same shape with respect to the length of its sequences, there are two cases to consider:

  • If the sequences are too long, we need to truncate them

  • If the sequences are too short, we need to (further) pad them

Let's have a look at how we can accomplish this.

Truncate to Required Length¶

Truncating or shortening the sequences in our batch tensor is very straightforward since we can simply use normal array/tensor indexing. We only need to ensure that we truncate the correct dimension -- after all, we want to shorten the sequences, not reduce the number of sequences in the batch. To give an example, the code cell below shortens all sequences to a fixed length of 5. Of course, if the specified value for FIXED_LENGTH is larger than the number of items in the longest sequence, the code cell below has no effect on the batch.

Your turn: Try different values for FIXED_LENGTH to see how the output changes and if it matches your expectations.

In [6]:
FIXED_LENGTH = 5

sequences_padded_truncated = sequences_padded[:,:FIXED_LENGTH]

print(sequences_padded_truncated)
tensor([[ 6, 17, 18, 25,  9],
        [13, 17, 14, 15,  9],
        [ 6, 15,  9, 11,  7],
        [11,  7, 14, 21, 27],
        [ 6, 15, 28, 29, 22],
        [13, 10,  9,  6, 22],
        [24, 20,  1, 24,  9],
        [12, 13,  4, 15, 18]])

Pad to Required Length¶

For handling a batch containing sequences that are too short, we can utilize the pad() method of PyTorch to make our lives easier. Since pad() expects a tensor as input, we first need to call pad_sequence() to get from a list of 1d tensors to a 2d tensor.

The method pad() is very flexible, allowing the padding of tensors with respect to all dimensions. Since our tensor is of shape (batch_size, seq_len), we can pad both the batch_size dimension and the seq_len dimension. Here, of course, we are only interested in padding our sequences and not the number of sequences in our batch. Like above, we can use right padding as well as left padding. This means we have to be careful to use pad() correctly to get the expected output. Since our batch has 2 dimensions, we need to specify a 4-tuple to tell the method how to pad our batch tensor. Assuming that pad_size is the difference between the desired length and the current length of all sequences, we can do

  • Left padding: (pad_size, 0, 0, 0)
  • Right padding: (0, pad_size, 0, 0)

The last two values are always $0$ as they refer to the batch_size dimension, which we do not want to change. Run the code cell below to perform either left or right padding.

In [7]:
TARGET_LENGTH = 15

pad_size = TARGET_LENGTH - sequences_padded.shape[1]

#sequences_max_padded = torch.nn.functional.pad(sequences_padded, (pad_size, 0, 0, 0), mode="constant", value=0) # left padding
sequences_max_padded = torch.nn.functional.pad(sequences_padded, (0, pad_size, 0, 0), mode="constant", value=0) # right padding

print (sequences_max_padded)
tensor([[ 6, 17, 18, 25,  9, 11,  7, 26,  6, 12,  7,  8,  0,  0,  0],
        [13, 17, 14, 15,  9,  6, 12,  7, 16, 19, 10,  8,  0,  0,  0],
        [ 6, 15,  9, 11,  7, 18, 19, 10,  6, 20,  8,  0,  0,  0,  0],
        [11,  7, 14, 21, 27, 12,  7, 14, 21,  8,  0,  0,  0,  0,  0],
        [ 6, 15, 28, 29, 22, 23, 16,  6, 10,  8,  0,  0,  0,  0,  0],
        [13, 10,  9,  6, 22, 16, 13, 10,  9, 30, 23,  8,  0,  0,  0],
        [24, 20,  1, 24,  9,  1,  7,  8,  0,  0,  0,  0,  0,  0,  0],
        [12, 13,  4, 15, 18,  2,  4, 10,  8,  0,  0,  0,  0,  0,  0]])

Again, we need to use value=0 as this is our index of the <PAD> token in the vocabulary.

Complete Auxiliary Method¶

In practice, we don't know ahead of time whether we need to truncate or pad a batch. It's therefore convenient to have a method that truncates or pads the sequences of a batch depending on how the length of the longest sequence(s) compares to the required length. The method create_fixed_length_batch() is a simple example implementation; it merely combines the ideas from the code cells above into a single method. Keep in mind that the way sequences of variable length should be handled in practice often depends on the nature of the data, the exact task, and the network architecture that is used. So the implementation of create_fixed_length_batch() only illustrates some basic ideas. For example, the method assumes that the batch_size dimension is the first dimension; notice the batch_first=True parameter in the code below.

In [8]:
def create_fixed_length_batch(sequences, target_length, padding_value=0, padding_side="right"):
    
    # Pad sequences w.r.t. longest sequences
    sequences_padded = pad_sequence(sequences, batch_first=True, padding_value=padding_value, padding_side=padding_side)

    # Get the current sequence length
    max_seq_len = sequences_padded.shape[1]
    
    if max_seq_len > target_length:
        # Truncate sequences if too long
        return sequences_padded[:,:target_length]
    else:
        # Pad sequences if too short
        if padding_side == "right":
            pad_tuple = (0, target_length-max_seq_len, 0, 0)
        else:
            pad_tuple = (target_length-max_seq_len, 0, 0, 0)
        return torch.nn.functional.pad(sequences_padded, pad_tuple, mode="constant", value=padding_value)

First, let's run the method over our batch of 8 sequences assuming that we need to enforce a length of 5. Since the sequences in our padded tensor are longer than that, we need to shorten all sequences.

In [9]:
create_fixed_length_batch(sequences, 5, padding_side="right")
#create_fixed_length_batch(sequences, 5, padding_side="left")
Out[9]:
tensor([[ 6, 17, 18, 25,  9],
        [13, 17, 14, 15,  9],
        [ 6, 15,  9, 11,  7],
        [11,  7, 14, 21, 27],
        [ 6, 15, 28, 29, 22],
        [13, 10,  9,  6, 22],
        [24, 20,  1, 24,  9],
        [12, 13,  4, 15, 18]])

If we use create_fixed_length_batch() to enforce a sequence length of 15, we have to pad all sequences since all are shorter than 15.

In [10]:
create_fixed_length_batch(sequences, 15, padding_side="right")
#create_fixed_length_batch(sequences, 15, padding_side="left")
Out[10]:
tensor([[ 6, 17, 18, 25,  9, 11,  7, 26,  6, 12,  7,  8,  0,  0,  0],
        [13, 17, 14, 15,  9,  6, 12,  7, 16, 19, 10,  8,  0,  0,  0],
        [ 6, 15,  9, 11,  7, 18, 19, 10,  6, 20,  8,  0,  0,  0,  0],
        [11,  7, 14, 21, 27, 12,  7, 14, 21,  8,  0,  0,  0,  0,  0],
        [ 6, 15, 28, 29, 22, 23, 16,  6, 10,  8,  0,  0,  0,  0,  0],
        [13, 10,  9,  6, 22, 16, 13, 10,  9, 30, 23,  8,  0,  0,  0],
        [24, 20,  1, 24,  9,  1,  7,  8,  0,  0,  0,  0,  0,  0,  0],
        [12, 13,  4, 15, 18,  2,  4, 10,  8,  0,  0,  0,  0,  0,  0]])

With this method, we can now "resize" any batch of sequences to the required length. Again, this length is determined by the overall network architecture, mainly the layers (often linear layers) following the convolution and pooling layers.

Additional Steps for Recurrent Neural Networks (RNNs)¶

Let's briefly have a look again at our batch with the minimum required padding:

In [11]:
print(sequences_padded)
tensor([[ 6, 17, 18, 25,  9, 11,  7, 26,  6, 12,  7,  8],
        [13, 17, 14, 15,  9,  6, 12,  7, 16, 19, 10,  8],
        [ 6, 15,  9, 11,  7, 18, 19, 10,  6, 20,  8,  0],
        [11,  7, 14, 21, 27, 12,  7, 14, 21,  8,  0,  0],
        [ 6, 15, 28, 29, 22, 23, 16,  6, 10,  8,  0,  0],
        [13, 10,  9,  6, 22, 16, 13, 10,  9, 30, 23,  8],
        [24, 20,  1, 24,  9,  1,  7,  8,  0,  0,  0,  0],
        [12, 13,  4, 15, 18,  2,  4, 10,  8,  0,  0,  0]])

Now, this representation can be used as input for a Recurrent Neural Network (RNN). The RNN will process all sequences in the batch in parallel, one time step at a time. Of course, this is possible since all sequences have the same length. However, there are two issues to consider for padded sequences:

  • The padding token <PAD> does not really have any meaning. While we can generally assume that the RNN will "learn" that <PAD> does not mean anything, it might still affect the results, particularly if an original sequence is very short and a lot of padding was required to reach the final length.

  • Even if we assume that <PAD> won't negatively affect the results, processing the padding still requires computational steps. In principle, an RNN could stop processing a sequence when it reaches the first padding index.

To this end, PyTorch introduces the notion of packing. A PackedSequence is a Python object that provides an internal representation of a padded batch together with the true lengths of all sequences (i.e., total length minus the number of padding indices). Such a PackedSequence object can then be used as input for an RNN layer to tell it when to stop processing each sequence. This is all done under the hood, transparent to the user.

To create a PackedSequence object, we can use the method pack_padded_sequence(). Note that without preprocessing the batch, the method pack_padded_sequence() will change the order of the sequences in the batch. This is a problem since the order of the sequences would then no longer match the order of our target labels.

To solve this, the best way is to "manually" sort all sequences in a batch from longest to shortest and to rearrange the order of the target labels accordingly. If the sequences in the batch are sorted from longest to shortest, the method pack_padded_sequence() will not change their order, and the sequences and target labels stay aligned.

The method sort_batch() below accomplishes this. Note how inputs and targets get re-ordered the same way by using the same list of indices. We also need to return the list of lengths since this information is needed by the method pack_padded_sequence().

In [12]:
def sort_batch(inputs, targets, lengths):
    # Sort sequences w.r.t. their lengths from longest to shortest
    lengths_sorted, sorted_idx = lengths.sort(descending=True)
    # Return re-ordered inputs and targets, as well as the lengths (sorted from longest to shortest)
    return inputs[sorted_idx], targets[sorted_idx], lengths_sorted

Let's apply this method on our batch with the minimum padding.

In [13]:
# Extract the lengths for all sequences in the batch
lengths = torch.LongTensor([ len(seq) for seq in sequences ])

# Sort inputs and targets in parallel to ensure they remain aligned
sequences_padded_sorted, targets_sorted, lengths_sorted = sort_batch(sequences_padded, targets, lengths)

print(targets_sorted)
print(lengths_sorted)
print(sequences_padded_sorted)
tensor([0, 0, 1, 0, 0, 1, 1, 1])
tensor([12, 12, 12, 11, 10, 10,  9,  8])
tensor([[ 6, 17, 18, 25,  9, 11,  7, 26,  6, 12,  7,  8],
        [13, 17, 14, 15,  9,  6, 12,  7, 16, 19, 10,  8],
        [13, 10,  9,  6, 22, 16, 13, 10,  9, 30, 23,  8],
        [ 6, 15,  9, 11,  7, 18, 19, 10,  6, 20,  8,  0],
        [11,  7, 14, 21, 27, 12,  7, 14, 21,  8,  0,  0],
        [ 6, 15, 28, 29, 22, 23, 16,  6, 10,  8,  0,  0],
        [12, 13,  4, 15, 18,  2,  4, 10,  8,  0,  0,  0],
        [24, 20,  1, 24,  9,  1,  7,  8,  0,  0,  0,  0]])

As you can clearly see, all sequences in the batch are now sorted from longest to shortest.

Now we have everything to create a PackedSequence object using the method pack_padded_sequence().

In [14]:
sequences_packed = torch.nn.utils.rnn.pack_padded_sequence(sequences_padded_sorted, lengths_sorted, batch_first=True)

print(sequences_packed)
PackedSequence(data=tensor([ 6, 13, 13,  6, 11,  6, 12, 24, 17, 17, 10, 15,  7, 15, 13, 20, 18, 14,
         9,  9, 14, 28,  4,  1, 25, 15,  6, 11, 21, 29, 15, 24,  9,  9, 22,  7,
        27, 22, 18,  9, 11,  6, 16, 18, 12, 23,  2,  1,  7, 12, 13, 19,  7, 16,
         4,  7, 26,  7, 10, 10, 14,  6, 10,  8,  6, 16,  9,  6, 21, 10,  8, 12,
        19, 30, 20,  8,  8,  7, 10, 23,  8,  8,  8,  8]), batch_sizes=tensor([8, 8, 8, 8, 8, 8, 8, 8, 7, 6, 4, 3]), sorted_indices=None, unsorted_indices=None)

The output from the code cell above is arguably not obvious to interpret. But again, a PackedSequence implements an internal representation to speed up processing when using RNNs.

Not very surprisingly, the method for reversing this operation, i.e., unpacking a PackedSequence, is called pad_packed_sequence(). For example, we can unpack the PackedSequence we just created as shown in the code cell below:

In [15]:
sequences_unpacked, lengths_unpacked = torch.nn.utils.rnn.pad_packed_sequence(sequences_packed, batch_first=True)

print(sequences_unpacked)
tensor([[ 6, 17, 18, 25,  9, 11,  7, 26,  6, 12,  7,  8],
        [13, 17, 14, 15,  9,  6, 12,  7, 16, 19, 10,  8],
        [13, 10,  9,  6, 22, 16, 13, 10,  9, 30, 23,  8],
        [ 6, 15,  9, 11,  7, 18, 19, 10,  6, 20,  8,  0],
        [11,  7, 14, 21, 27, 12,  7, 14, 21,  8,  0,  0],
        [ 6, 15, 28, 29, 22, 23, 16,  6, 10,  8,  0,  0],
        [12, 13,  4, 15, 18,  2,  4, 10,  8,  0,  0,  0],
        [24, 20,  1, 24,  9,  1,  7,  8,  0,  0,  0,  0]])

In practice, of course, we would first push a PackedSequence through an RNN layer and then unpack the results.
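To give a rough idea of how these pieces fit together, the sketch below runs our sorted batch through the typical embed, pack, RNN, unpack pipeline. The vocabulary size, embedding dimension, and hidden dimension are arbitrary values chosen for this illustration only and are not part of the example dataset.

# A minimal sketch of the typical pipeline: embed -> pack -> RNN -> unpack.
# vocab_size, embed_dim, and hidden_dim are arbitrary illustrative values.
vocab_size, embed_dim, hidden_dim = 31, 16, 32

embedding = torch.nn.Embedding(vocab_size, embed_dim, padding_idx=0)
rnn = torch.nn.GRU(embed_dim, hidden_dim, batch_first=True)

# Embed the sorted, padded batch: (batch_size, max_seq_len, embed_dim)
embedded = embedding(sequences_padded_sorted)

# Pack using the true sequence lengths so the RNN can skip the padded positions
packed = torch.nn.utils.rnn.pack_padded_sequence(embedded, lengths_sorted, batch_first=True)

# The RNN only processes the non-padded time steps
packed_output, hidden = rnn(packed)

# Unpack the outputs back into a padded tensor: (batch_size, max_seq_len, hidden_dim)
output, output_lengths = torch.nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=True)

print(output.shape)   # torch.Size([8, 12, 32])
print(hidden.shape)   # torch.Size([1, 8, 32])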


Approach 2: Enforcing Equal-Length Batches with a Batch Sampler¶

When using RNNs, compared to CNNs, only the sequences within the same batch have to be of the same length; two different batches may contain sequences of different lengths. We saw how we can solve this using padding and (optionally) packing. However, both steps add some overhead in terms of code to write — with the risk of making mistakes — and steps to execute. They also do not explicitly address the issue that a batch may contain both very short and very long sequences, as inputs are often randomized. While padding and packing still work, having a mix of sequences with very different lengths means that the short sequences need to be padded a lot. Packing ensures that the padding tokens do not affect the training, but we still create unnecessarily large tensors.

So here is the question: Why not put only sequences of the same length into the same batch to begin with? If all sequences in a batch have the same length, then there is no longer any need for padding and packing. The data utilities provided by PyTorch make this surprisingly easy. In more detail, we can implement our own BatchSampler to create only batches that contain sequences of the same length.

Important: Throughout the rest of the notebook, we assume that our example dataset containing the 8 sentences is the complete dataset and not just a single batch!

Create Dataset Class¶

We first create a simple Dataset. The Dataset class in the PyTorch library is an abstract base class that provides a standard way to represent datasets. It allows users to define custom datasets by implementing two key methods: __len__() to return the number of samples and __getitem__() to retrieve a specific data sample and its corresponding label. PyTorch also offers ready-to-use datasets through subclasses like torchvision.datasets. This class is useful because it provides flexibility in data handling and preparation. It enables efficient loading and preprocessing of data on-the-fly, making it suitable for large datasets that cannot fit entirely in memory. When combined with the DataLoader class (see below), it supports batch loading, shuffling, and parallel data processing, which are essential for efficient training of neural networks.

Our new BaseDataset class only stores our inputs and targets and needs to implement the __len__() and __getitem__() methods. Notice that targets may be None. A common example is when working with datasets for training language models, where the targets can be derived directly from the inputs on the fly; thus, there is no need to store the targets explicitly. Since our class extends the abstract class Dataset, we can later use an instance to create a DataLoader. Without going into too much detail, this approach not only allows for cleaner code but also supports parallel data loading on multiple CPU workers and optimized data transfer between the CPU and GPU, which is critical when processing very large amounts of data. It is therefore the recommended best practice.

In [16]:
class BaseDataset(Dataset):

    def __init__(self, inputs, targets=None):
        self.inputs = inputs
        self.targets = targets

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, index):
        if self.targets is None:
            return np.asarray(self.inputs[index])
        else:
            return np.asarray(self.inputs[index]), np.asarray(self.targets[index])

Our BaseDataset class is trivial since it already receives lists as input. In practice, a custom Dataset class may also implement file handling or other preprocessing steps.

Classification Dataset Example¶

We first look at our example corpus for text classification. This means only our inputs are sequences while our targets are class labels. Later, we also show the case where both inputs and targets are sequences using a mock corpus for machine translation. Our BaseDataset class is simple and flexible enough to handle both cases.

Create Dataset¶

With our BaseDataset class implemented, we can create an instance using our example classification dataset. Note that the class gets the list of initial sequences, i.e., without any padding!

In [17]:
dataset = BaseDataset(sequences, targets)

If all sequences had the same length, we could already create a DataLoader and use it as shown below. However, since our sequences have different lengths, the code cell below would throw an error. This is because the DataLoader returns tensors, which need to be "full" multidimensional arrays. You can uncomment the code below to convince yourself of the error.

In [18]:
#loader = DataLoader(dataset, batch_size=5)

#for X_batch, y_batch in loader:
#    print(X_batch)
#    print(y_batch)

Create Batch Sampler¶

To solve this problem, we first create our custom Sampler. The Sampler class in PyTorch is an abstract base class that defines how indices are selected from a dataset for data loading. It serves as the foundation for implementing custom sampling strategies by requiring subclasses to implement the __iter__() method, which yields dataset indices, and the __len__() method to specify the number of samples. This class is useful because it provides flexibility in controlling data loading behavior beyond simple sequential or random sampling. It allows for tailored sampling strategies such as stratified sampling, weighted sampling for imbalanced datasets, or dynamic data selection during training. By integrating with PyTorch's DataLoader, Sampler ensures efficient and customizable data retrieval, optimizing model training performance in various applications.

For this and other notebooks, we provide the class EqualLengthsBatchSampler, which analyzes the input sequences and organizes all sequences into groups of the same length. Each batch is then sampled from a single group, ensuring that all sequences in the batch have the same length. Feel free to have a look at the EqualLengthsBatchSampler class in the file src/utils/sampling/batchsampler.py to see how this organization is done; it's pretty straightforward.
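To convey the basic idea, the hypothetical SimpleEqualLengthsBatchSampler below sketches a simplified version that groups sample indices only by input sequence length; the provided EqualLengthsBatchSampler also takes the targets into account, so this is not its actual implementation.

# A simplified sketch of the grouping idea behind an equal-lengths batch sampler
# (illustration only; not the provided EqualLengthsBatchSampler implementation)
import random
from torch.utils.data import Sampler

class SimpleEqualLengthsBatchSampler(Sampler):

    def __init__(self, batch_size, sequences):
        self.batch_size = batch_size
        # Group the indices of all samples by their sequence length
        self.groups = {}
        for idx, seq in enumerate(sequences):
            self.groups.setdefault(len(seq), []).append(idx)

    def __len__(self):
        # Total number of batches across all length groups
        return sum(-(-len(idxs) // self.batch_size) for idxs in self.groups.values())

    def __iter__(self):
        batches = []
        for idxs in self.groups.values():
            random.shuffle(idxs)
            # Split each group into batches of at most batch_size indices
            for i in range(0, len(idxs), self.batch_size):
                batches.append(idxs[i:i + self.batch_size])
        random.shuffle(batches)
        yield from batches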

In [19]:
BATCH_SIZE = 5

sampler = EqualLengthsBatchSampler(BATCH_SIZE, sequences, targets)

Create Data Loader¶

Now, we are ready to finally create a DataLoader using the instance of our custom sampler as an input parameter. The DataLoader class in PyTorch is a key utility that facilitates the efficient loading of data during model training and evaluation. It wraps around datasets to provide iterable batches of data, handling shuffling, batching, and parallel loading via multiple worker threads. This abstraction allows developers to focus on model implementation without worrying about the complexities of data loading.

A DataLoader is particularly useful for training on large datasets by splitting them into smaller batches, which reduces memory usage and speeds up computation through parallelization. Additionally, features like shuffling ensure that models generalize better by mitigating the risk of learning patterns specific to data order. It supports custom datasets, making it highly flexible for diverse machine learning tasks. Since our custom sampler does the organization, we have to tell the data loader not to shuffle the samples with shuffle=False.

In [20]:
loader = DataLoader(dataset, batch_sampler=sampler, shuffle=False, drop_last=False)

Now let's use the data loader as we would in a training loop.

In [21]:
for batch_nr, (X_batch, y_batch) in enumerate(loader):
    print("========= Batch {} =========".format(batch_nr+1))
    print(X_batch)
========= Batch 1 =========
tensor([[ 6, 15,  9, 11,  7, 18, 19, 10,  6, 20,  8]])
========= Batch 2 =========
tensor([[12, 13,  4, 15, 18,  2,  4, 10,  8]])
========= Batch 3 =========
tensor([[24, 20,  1, 24,  9,  1,  7,  8]])
========= Batch 4 =========
tensor([[ 6, 15, 28, 29, 22, 23, 16,  6, 10,  8],
        [11,  7, 14, 21, 27, 12,  7, 14, 21,  8]])
========= Batch 5 =========
tensor([[13, 17, 14, 15,  9,  6, 12,  7, 16, 19, 10,  8],
        [13, 10,  9,  6, 22, 16, 13, 10,  9, 30, 23,  8],
        [ 6, 17, 18, 25,  9, 11,  7, 26,  6, 12,  7,  8]])

As you can see, all batches contain only sequences of the same length, so no padding and packing is required. But also note that we naturally cannot guarantee that each batch is of size 5 as specified. If there are not enough sequences of the same length, the respective batch won't be full. However, this is not an issue in practice, where we typically deal with datasets of hundreds of thousands of sequences or more.

Sequence-to-Sequence (Seq2Seq) Dataset Example¶

Our initial corpus was an example data set for text classification, so only the inputs are sequences (the targets are the class labels). However, for sequence-to-sequence (seq2seq) tasks such as machine translation, both inputs and targets are sequences. To avoid padding and packing — and potentially masking which we do not cover in this notebook — we have to ensure that a batch only contains sequence pairs of the same length.

To clarify, this does not mean that the input sequences and target sequences need to have the same length, but only that all the input sequences have the same length and all the target sequences have the same length. For example, we can have a batch where all input sequences are of length 10, and all target sequences are of length 14.

Create Dataset, Batch Sampler & Data Loader¶

We can now create a data loader like above. The only small difference is that the targets are now also sequences — compared to simple class labels as in the classification dataset. However, this does not affect the code itself. The handling of dataset samples with different lengths is completely handled by the EqualLengthsBatchSampler.

In [22]:
# Create Dataset
dataset_seq2seq = BaseDataset(input_sequences, target_sequences)
# Create BatchSampler
sampler_seq2seq = EqualLengthsBatchSampler(BATCH_SIZE, input_sequences, target_sequences)
# Create DataLoader
loader_seq2seq = DataLoader(dataset_seq2seq, batch_sampler=sampler_seq2seq, shuffle=False, drop_last=False)

Again, we can loop through all batches using the created data loader as we would within a training loop.

In [23]:
for idx, (batch_inputs, batch_targets) in enumerate(loader_seq2seq):
    print("========= BATCH {} =========".format(idx))
    print(batch_inputs)
    print(batch_targets)    
========= BATCH 0 =========
tensor([[1, 2, 3],
        [1, 2, 3],
        [1, 2, 3],
        [1, 2, 3]])
tensor([[1, 2, 3, 4],
        [1, 2, 3, 4],
        [1, 2, 3, 4],
        [1, 2, 3, 4]])
========= BATCH 1 =========
tensor([[1, 2, 3],
        [1, 2, 3],
        [1, 2, 3]])
tensor([[1, 2, 3, 4],
        [1, 2, 3, 4],
        [1, 2, 3, 4]])
========= BATCH 2 =========
tensor([[1, 2, 3, 4],
        [1, 2, 3, 4],
        [1, 2, 3, 4]])
tensor([[1, 2],
        [1, 2],
        [1, 2]])
========= BATCH 3 =========
tensor([[1, 2, 3, 4],
        [1, 2, 3, 4],
        [1, 2, 3, 4]])
tensor([[1, 2],
        [1, 2],
        [1, 2]])
========= BATCH 4 =========
tensor([[1, 2, 3, 4],
        [1, 2, 3, 4],
        [1, 2, 3, 4]])
tensor([[1, 2, 3, 4],
        [1, 2, 3, 4],
        [1, 2, 3, 4]])
========= BATCH 5 =========
tensor([[1, 2, 3],
        [1, 2, 3],
        [1, 2, 3],
        [1, 2, 3],
        [1, 2, 3]])
tensor([[1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5]])
========= BATCH 6 =========
tensor([[1, 2, 3],
        [1, 2, 3],
        [1, 2, 3],
        [1, 2, 3]])
tensor([[1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5],
        [1, 2, 3, 4, 5]])

With the default value of BATCH_SIZE = 5, notice (a) that no batch contains more than 5 input-target pairs and (b) that, within a batch, all input-target samples have the same combination of lengths. Of course, a batch might have fewer than 5 samples in case there are fewer than 5 samples with a specific combination of input and target lengths. But again, for large real-world datasets, the likelihood of this happening is very small.

Since we shuffle the batches in EqualLengthsBatchSampler, re-running the code cell above will naturally result in different outputs. Feel free to change the value of BATCH_SIZE to see how it affects the results and whether you can explain them.


Summary¶

When working with text, properly handling sequences of different lengths is crucial. While we could avoid any issues by using only batches of size 1, this would sacrifice too much performance in practice when training our models over large datasets. Training neural networks using batches instead of sample by sample is more efficient because it leverages vectorized operations and hardware acceleration, particularly on GPUs. In batch training, multiple samples are processed simultaneously, which allows matrix multiplications and other operations to be optimized for parallel computation. This significantly speeds up training compared to handling one sample at a time.

Additionally, batch training helps stabilize gradient estimates during optimization. Calculating gradients over multiple samples provides a better approximation of the true gradient compared to noisy, individual updates. This balance between computational efficiency and gradient stability makes batch processing both faster and more effective for training deep learning models. So since we aim for a larger batch size, we need to ensure that each batch contains sequences of the same length.

In this notebook, we looked into the most common best practices to solve this, particularly when working with CNNs and RNNs. To ensure sequences in a batch have the same length during neural network training, the common strategies we have covered include:

  • Padding: Sequences are extended to match the length of the longest sequence in the batch by adding a special token (e.g., zero) to the shorter sequences. This maintains a consistent input size but requires masking during computation to ignore padded values.

  • Truncation: Sequences longer than a predefined maximum length are cut off to ensure uniform length, which helps reduce computation time but may lead to information loss.

  • Bucketing: Sequences are grouped into batches of similar or even identical lengths, minimizing the need for excessive padding while maintaining efficient processing. Our custom EqualLengthsBatchSampler is an example of this approach.

These strategies improve training efficiency and ensure compatibility with models requiring fixed-size inputs.

A closely related topic to padding is masking. Masking refers to a technique where the model is explicitly told to ignore padded (and potentially other) values during computations, so these masked elements do not contribute to learning or affect the output. This is important because masked values are not meaningful and should not impact the model's performance. Masking usually involves creating a mask (a binary matrix) that indicates which positions in the sequence are padded (or should otherwise be ignored) and which are actual data points. Neural networks can then use this mask to skip over those values when computing gradients, loss, or attention. Although it addresses issues closely related to padding, masking is a distinct concept and is therefore covered in a separate notebook.

In [ ]: