Disclaimer: This Jupyter Notebook contains content generated with the assistance of AI. While every effort has been made to review and validate the outputs, users should independently verify critical information before relying on it. The SELENE notebook repository is constantly evolving. We recommend downloading or pulling the latest version of this notebook from Github.
The Softmax Function¶
The softmax function is a fundamental building block in modern neural networks, particularly in models designed for multi-class classification. Given a vector of real-valued scores (often called logits) softmax transforms them into a probability distribution over classes. Each output value lies in the range $[0, 1]$, and all outputs sum to $1$, making them directly interpretable as class probabilities. Intuitively, softmax amplifies differences between scores: larger logits receive disproportionately higher probabilities, while smaller ones are suppressed. This simple transformation allows neural networks to move from arbitrary numerical outputs to meaningful, probabilistic predictions.
Because of this property, softmax is most commonly used as the final layer in classification networks. Whether in image recognition, natural language processing, or speech understanding, the combination of a linear layer followed by softmax forms the standard interface between a model’s internal representations and its final predictions. Importantly, while softmax itself contains no trainable parameters, it plays a crucial role in shaping the gradients that flow backward through the network during training. Its interaction with the loss function — most often cross-entropy — determines how strongly each class influences weight updates in earlier layers.
To train a neural network end-to-end, we must understand not only how softmax works in the forward pass, but also how to compute its derivative during backpropagation. Unlike elementwise activation functions such as ReLU or sigmoid, softmax couples all output dimensions: changing a single input logit affects every output probability. This means its derivative is naturally expressed as a Jacobian matrix, and understanding its structure is key to implementing efficient and numerically stable backward passes. While deep learning frameworks hide these details, the underlying mathematics governs how learning actually happens.
This notebook explores the softmax function in depth, carefully deriving both the forward computation and the gradients required for backpropagation. By working through these details explicitly, the softmax function becomes more than just a black-box "last layer" and instead a concrete, understandable operation within a computational graph. Developing a solid understanding of foundational components like softmax is essential for truly understanding how neural networks work. While frameworks such as PyTorch and TensorFlow make it easy to build and train models, they abstract away the mechanics of forward and backward passes. By engaging directly with the mathematics and implementations of core layers, we gain deeper insight into model behavior, debugging, optimization, and the design of new architectures — skills that go far beyond simply using existing tools.
Setting up the Notebook¶
Make Required Imports¶
This notebook requires the import of different Python packages but also additional Python modules that are part of the repository. If a package is missing, use your preferred package manager (e.g., conda or pip) to install it. If the code cell below runs without any errors, all required packages and modules have been successfully imported.
import numpy as np
Preliminaries¶
- This notebook assumes a basic understanding of calculus and the chain rule, including the general concept of backpropagation to train neural networks.
Basic Idea¶
The softmax function is an activation function that transforms a vector of real-valued scores (often called logits) into a probability distribution over multiple classes. Each output lies between $0$ and $1$, and all outputs sum to $1$, making them directly interpretable as class probabilities. Here is a simple illustration of a logits vector with 3 values transformed into a probability distribution using the softmax function:

$$\underbrace{\begin{bmatrix} -1.2 & 1.6 & 0.9 \end{bmatrix}}_{\text{logits } \mathbf{x}} \;\;\xrightarrow{\;\text{softmax}\;}\;\; \underbrace{\begin{bmatrix} 0.04 & 0.64 & 0.32 \end{bmatrix}}_{\text{probabilities } \boldsymbol{\sigma}(\mathbf{x})}$$
Notice that all output values are in the interval $[0, 1]$ and sum up to $1$. This allows us to interpret the softmax output as a probability distribution. The main purpose of the softmax function or softmax layer in a neural network is to serve as the final layer for multi-class classification tasks, where the model must choose among more than two mutually exclusive classes.
Side note: The softmax function can be understood as a direct generalization of the sigmoid function. The sigmoid maps a single scalar input to a probability and is therefore suitable for binary classification. In fact, if softmax is applied to a vector of two logits, it reduces exactly to the sigmoid function applied to the difference between those logits. Conceptually, sigmoid answers the question "What is the probability of Class 1 versus Class 0?", while softmax extends this idea to "What is the probability of each class among many?". This close relationship highlights softmax as the natural multi-class counterpart of sigmoid, grounding its widespread use in classification-focused neural network architectures.
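To make this relationship concrete, here is a minimal NumPy check. The two logits z0 and z1 are made-up values chosen purely for illustration: the softmax probability of Class 1 equals the sigmoid applied to the difference of the two logits.

# Two arbitrary logits for Class 0 and Class 1 (illustrative values)
z0, z1 = 0.4, 2.1

# Softmax probability of Class 1
p_softmax = np.exp(z1) / (np.exp(z0) + np.exp(z1))

# Sigmoid applied to the difference of the logits
p_sigmoid = 1.0 / (1.0 + np.exp(-(z1 - z0)))

print(p_softmax, p_sigmoid)   # both are approximately 0.8455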
Softmax (individual layer)¶
We first consider the softmax as its own individual layer, without any particular assumptions about previous and subsequent layers. However, the softmax function is generally unsuitable as an activation function for arbitrary hidden layers because it enforces a global normalization and competition across its inputs, turning them into a probability distribution that sums to one. This coupling means that increasing the activation of one neuron necessarily decreases the activations of others, which is an undesirable constraint for intermediate representations where neurons are meant to learn independent or complementary features. Hidden layers benefit from activations like ReLU or GELU that allow each unit to respond independently and preserve expressive capacity, whereas softmax compresses information into relative proportions and can discard useful magnitude information needed by subsequent layers.
In contrast, softmax is well matched to the final layer of a classification model, where the goal is precisely to model a categorical probability distribution over mutually exclusive classes. At this stage, competition between outputs is meaningful, and the normalization makes the outputs directly interpretable as probabilities that can be compared against labels using losses such as cross-entropy. Using softmax earlier in the network would not only restrict representational power but also complicate gradient flow due to the dense Jacobian coupling all units, while providing no clear modeling benefit. As a result, softmax is almost exclusively used as the activation of the final layer, immediately before computing the loss. We therefore later consider the softmax and cross-entropy loss as a combined layer.
Forward Pass¶
The forward pass in a neural network is the process of taking an input (or a batch of inputs) and propagating it through the network layer by layer to produce an output. This continues until the final layer produces the network’s prediction, which can then be compared to the ground-truth target using a loss function; the computed loss serves as the starting point for the backward pass and backpropagation.
Considering the softmax function as a layer in the network, it is defined as a function $\sigma: \mathbb{R}^{D}\rightarrow \mathbb{R}^{D}$ that maps a real-valued vector of size $D$ to another real-valued vector of size $D$. If we denote the input vector as $\mathbf{x}\in \mathbb{R}^D$, the $i$-th element of the output vector $\boldsymbol{\sigma}(\mathbf{x})$ is defined as:

$$\boldsymbol{\sigma}(\mathbf{x})_i = \frac{e^{x_i}}{\sum_{k=1}^{D} e^{x_k}}$$
In simple terms, the softmax function takes in a vector with arbitrary values and normalizes them so that each value gets mapped into the interval $[0, 1]$, which allows for interpreting $\boldsymbol{\sigma}(\mathbf{x})$ as a probability distribution. Although different such mapping strategies are possible, using the exponential function $e^x$ in the softmax has several important advantages:
- Strictly positive outputs: The exponential is always positive, ensuring all softmax outputs are non-negative and can be interpreted as probabilities after normalization.
- Order preservation: $e^x$ is strictly monotonic, so larger logits lead to larger probabilities, preserving the ranking of scores.
- Smooth and differentiable: The exponential is smooth everywhere, which makes softmax differentiable and well-suited for gradient-based optimization.
- Amplifies differences: Exponentiation accentuates differences between logits, making the model more confident when one class score is clearly larger than the others.
- Convenient gradients: Using $e^x$ leads to a clean, well-behaved Jacobian for softmax, simplifying analysis and making backpropagation numerically stable.
Using NumPy, which makes it easy to apply operations such as $e^x$ to each element of an array and provides methods to sum up all elements in an array, implementing the softmax function in Python is very easy. In fact, it only requires a single line of code, as shown in the code cell below — again, we assume that x is a $1$-dimensional array containing unnormalized real values.
def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)
As a quick example, we can compute the softmax outputs for the example vector from the beginning.
softmax([-1.2, 1.6, 0.9])
array([0.039046 , 0.64209771, 0.31885629])
Just note that the output values will have a much higher precision than shown in the example.
Summing up, the softmax function converts a vector of real-valued logits into a normalized probability distribution by exponentiating each entry and dividing by the sum of all exponentials, making its forward implementation conceptually simple and efficient. Beyond this simplicity, softmax offers important advantages during backpropagation: its smooth and fully differentiable form leads to a structured Jacobian where gradients can be computed without explicitly forming large matrices. In practice, this results in concise and numerically stable gradient expressions — especially when combined with the cross-entropy loss — allowing efficient computation of gradients while preserving clear probabilistic interpretations of the model's outputs. So let's see now how the backward pass works.
Backward Pass¶
The backward pass in a neural network is the phase where gradients of the loss function with respect to all learnable parameters are computed, starting from the output layer and moving backward through the network. Using the chain rule, each layer receives the gradient of the loss with respect to its outputs and transforms it into gradients with respect to its inputs and parameters (such as weights and biases). These gradients quantify how a small change in each parameter would affect the loss and are then used by an optimization algorithm, such as gradient descent, to update the parameters and improve the model's performance.
In the following, we assume that the softmax receives an input $\mathbf{x}$, typically the output (i.e., the logits) of a previous linear layer. The softmax then generates its output $\boldsymbol{\sigma}$ — in the following, we write $\boldsymbol{\sigma}$ instead of $\boldsymbol{\sigma}(\mathbf{x})$, and $\sigma_i$ instead of $\boldsymbol{\sigma}(\mathbf{x})_i$, to simplify the expressions. For the backward pass we assume that the upstream gradient $\frac{\partial \mathcal{L}}{\partial \boldsymbol{\sigma}}$ has already been computed. Since the loss $\mathcal{L}$ is a scalar and $\boldsymbol{\sigma}$ is a matrix of shape $1\times D$, the upstream gradient $\frac{\partial \mathcal{L}}{\partial \boldsymbol{\sigma}}$ will also have a shape of $1\times D$, with each element of $\frac{\partial \mathcal{L}}{\partial \boldsymbol{\sigma}}$ being the derivative of $\mathcal{L}$ with respect to one element $\sigma_i$ of $\boldsymbol{\sigma}$. In other words, $\frac{\partial \mathcal{L}}{\partial \boldsymbol{\sigma}}$ is a Jacobian matrix (or just Jacobian), i.e., the matrix of all first-order partial derivatives of a function with multiple inputs and multiple outputs; since the loss has a single output, this Jacobian has just one row.
The goal of the backward pass is now to compute the downstream gradient $\frac{\partial\mathcal{L}}{\partial \mathbf{x}}$ to be passed to the previous layer (with respect to the forward pass) to continue the backpropagation. By using the chain rule, we can compute the downstream gradient as follows:

$$\frac{\partial\mathcal{L}}{\partial \mathbf{x}} = \frac{\partial\mathcal{L}}{\partial \boldsymbol{\sigma}} \cdot \frac{\partial \boldsymbol{\sigma}}{\partial \mathbf{x}}$$
Since we got $\frac{\partial \mathcal{L}}{\partial \boldsymbol{\sigma}}$ as the upstream gradient from the subsequent layer, we now have to compute the gradient, i.e., the Jacobian, $\frac{\partial \boldsymbol{\sigma}}{\partial \mathbf{x}}$. This matrix of all first-order partial derivatives of the softmax function, i.e., of all outputs $\sigma_i$ with respect to all inputs $x_j$, has the following form:

$$\frac{\partial \boldsymbol{\sigma}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial \sigma_1}{\partial x_1} & \frac{\partial \sigma_1}{\partial x_2} & \cdots & \frac{\partial \sigma_1}{\partial x_D} \\ \frac{\partial \sigma_2}{\partial x_1} & \frac{\partial \sigma_2}{\partial x_2} & \cdots & \frac{\partial \sigma_2}{\partial x_D} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial \sigma_D}{\partial x_1} & \frac{\partial \sigma_D}{\partial x_2} & \cdots & \frac{\partial \sigma_D}{\partial x_D} \end{bmatrix}$$
Thus, we now need to find the partial derivatives $\frac{\partial \sigma_i}{\partial x_j}$. By plugging in the formula for the softmax output $\sigma_i$, we get the following expression:

$$\frac{\partial \sigma_i}{\partial x_j} = \frac{\partial}{\partial x_j} \left( \frac{e^{x_i}}{\sum_{k=1}^{D} e^{x_k}} \right)$$
Since both the numerator and the denominator depend on $x_j$, we have to apply the quotient rule. Recall that the quotient rule in calculus finds the derivative of a function, say, $f(x)$, that is the fraction of two other functions, say, $g(x)$ and $h(x)$:

$$f(x) = \frac{g(x)}{h(x)}$$
Given this format, the quotient rule states that the derivative $\frac{\partial f(x)}{\partial x}$ is:

$$\frac{\partial f(x)}{\partial x} = \frac{\frac{\partial g(x)}{\partial x}\, h(x) - g(x)\, \frac{\partial h(x)}{\partial x}}{h(x)^2}$$
Mapping this format to our expression for the softmax, we have:

$$\sigma_i = \frac{g_i}{h_i} \qquad \text{with} \qquad g_i = e^{x_i}, \qquad h_i = \sum_{k=1}^{D} e^{x_k}$$
Again, we write $g_i$ and $h_i$ instead of $g_i(\mathbf{x})$ and $h_i(\mathbf{x})$ to simplify the expressions.
According to the quotient rule, we now have to compute the two derivatives $\frac{\partial g_i}{\partial x_j}$ and $\frac{\partial h_i}{\partial x_j}$. Let's first look at $\frac{\partial h_i}{\partial x_j}$. Notice that the derivative is always $e^{x_j}$ for any $x_j$, i.e.:

$$\frac{\partial h_i}{\partial x_j} = \frac{\partial}{\partial x_j} \sum_{k=1}^{D} e^{x_k} = e^{x_j}$$
If this is not obvious, you can easily convince yourself by expanding the sum to $e^{x_1} + e^{x_2} + \dots + e^{x_j} + \dots + e^{x_{D-1}} + e^{x_D}$ so that we get:

$$\frac{\partial h_i}{\partial x_j} = \frac{\partial}{\partial x_j} \left( e^{x_1} + e^{x_2} + \dots + e^{x_j} + \dots + e^{x_{D-1}} + e^{x_D} \right) = 0 + 0 + \dots + e^{x_j} + \dots + 0 + 0 = e^{x_j}$$

since all terms $e^{x_k}$ with $k\neq j$ are constants with respect to $x_j$ and the derivative of the exponential function $e^x$ is just the function itself.
The derivative $\frac{\partial g_i}{\partial x_j}$ is a bit more interesting as it depends on whether $i=j$ or not. We therefore need to consider both cases individually.
Case 1 $(i=j)$: If $i=j$, the partial derivative is $\frac{\partial g_i}{\partial x_j} = e^{x_i}$. Plugging this result, together with $\frac{\partial h_i}{\partial x_j} = e^{x_j}$, into the quotient rule, we get the following:

$$\frac{\partial \sigma_i}{\partial x_j} = \frac{e^{x_i} \sum - \, e^{x_i} e^{x_j}}{\left( \sum \right)^2}$$
Note that $\sum$ is simply a placeholder for $\sum_{k=1}^D e^{x_k}$, again to ease readability. We can now rewrite the right-hand side to get an expression containing the softmax outputs $\sigma_i$ and $\sigma_j$:

$$\frac{\partial \sigma_i}{\partial x_j} = \frac{e^{x_i} \left( \sum - \, e^{x_j} \right)}{\left( \sum \right)^2} = \frac{e^{x_i}}{\sum} \cdot \frac{\sum - \, e^{x_j}}{\sum} = \sigma_i \left( 1 - \sigma_j \right)$$
Note that it is a common observation that when a function involves the exponential $e^x$, its derivative can often be expressed directly in terms of the output of the function itself, and this stems from the unique nature of the exponential function. The key property of $e^x$ is that it is its own derivative, which means that differentiation does not introduce a fundamentally new functional form but instead preserves the original structure. As a result, many derivatives involving exponentials (incl. the softmax and sigmoid) can be written compactly using the function's outputs, leading to simpler expressions and more efficient gradient computations during backpropagation.
Case 2 $(i\neq j)$: If $i\neq j$, the partial derivative is $\frac{\partial g_i}{\partial x_j} = 0$ since $e^{x_i}$ is a constant with respect to $x_j$. Using this result within the quotient rule, we get:

$$\frac{\partial \sigma_i}{\partial x_j} = \frac{0 \cdot \sum - \, e^{x_i} e^{x_j}}{\left( \sum \right)^2} = - \frac{e^{x_i} e^{x_j}}{\left( \sum \right)^2}$$
Like for Case 1, we can rearrange the right-hand side a little bit to get an expression containing the softmax outputs $\sigma_i$ and $\sigma_j$:

$$\frac{\partial \sigma_i}{\partial x_j} = - \frac{e^{x_i}}{\sum} \cdot \frac{e^{x_j}}{\sum} = - \sigma_i \, \sigma_j$$
Thus, for the partial derivative $\frac{\partial \sigma_i}{\partial x_j}$ we get:

$$\frac{\partial \sigma_i}{\partial x_j} = \begin{cases} \sigma_i \left( 1 - \sigma_j \right) & \text{if } i = j \\ - \sigma_i \, \sigma_j & \text{if } i \neq j \end{cases}$$
This expression is often written using the Kronecker delta $\delta_{ij}$, which is a simple function that returns $1$ if $i$ and $j$ are equal, and $0$ otherwise:

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$
Using the Kronecker delta, we get the following definition for the partial derivative $\frac{\partial \sigma_i}{\partial x_j}$:

$$\frac{\partial \sigma_i}{\partial x_j} = \sigma_i \left( \delta_{ij} - \sigma_j \right)$$
Now that we know how to compute all partial derivatives $\frac{\partial \sigma_i}{\partial x_j}$, we can also write down the complete Jacobian $\frac{\partial \boldsymbol{\sigma}}{\partial \mathbf{x}}$:

$$\frac{\partial \boldsymbol{\sigma}}{\partial \mathbf{x}} = \begin{bmatrix} \sigma_1 (1 - \sigma_1) & -\sigma_1 \sigma_2 & \cdots & -\sigma_1 \sigma_D \\ -\sigma_2 \sigma_1 & \sigma_2 (1 - \sigma_2) & \cdots & -\sigma_2 \sigma_D \\ \vdots & \vdots & \ddots & \vdots \\ -\sigma_D \sigma_1 & -\sigma_D \sigma_2 & \cdots & \sigma_D (1 - \sigma_D) \end{bmatrix}$$
Lastly, we can compute the final downstream gradient $\frac{\partial\mathcal{L}}{\partial \mathbf{x}}$ as:

$$\left( \frac{\partial\mathcal{L}}{\partial \mathbf{x}} \right)^{\top} = \left( \frac{\partial \boldsymbol{\sigma}}{\partial \mathbf{x}} \right)^{\top} \left( \frac{\partial\mathcal{L}}{\partial \boldsymbol{\sigma}} \right)^{\top} = \begin{bmatrix} \sum_{i=1}^{D} \frac{\partial\mathcal{L}}{\partial \sigma_i} \, \sigma_i \left( \delta_{i1} - \sigma_1 \right) \\ \sum_{i=1}^{D} \frac{\partial\mathcal{L}}{\partial \sigma_i} \, \sigma_i \left( \delta_{i2} - \sigma_2 \right) \\ \vdots \\ \sum_{i=1}^{D} \frac{\partial\mathcal{L}}{\partial \sigma_i} \, \sigma_i \left( \delta_{iD} - \sigma_D \right) \end{bmatrix}$$
Note that we show the transpose of $\frac{\partial\mathcal{L}}{\mathbf{\partial x}}$ above. Without the transpose, the shape of $ \frac{\partial\mathcal{L}}{\mathbf{\partial x}}$ is $1\times D$ as we already knew it must be from the beginning.
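As a small sketch of this computation in NumPy (reusing the softmax() function defined earlier; the upstream gradient values here are made up purely for illustration), the Jacobian can be built as a diagonal matrix of $\boldsymbol{\sigma}$ minus the outer product of $\boldsymbol{\sigma}$ with itself, and then multiplied with the upstream gradient:

x = np.array([-1.2, 1.6, 0.9])
sigma = softmax(x)                       # softmax() as defined above
upstream = np.array([0.1, -0.3, 0.2])    # made-up upstream gradient dL/dsigma

# Jacobian entries: J[i, j] = sigma_i * (delta_ij - sigma_j)
jacobian = np.diag(sigma) - np.outer(sigma, sigma)

# Chain rule: dL/dx = dL/dsigma @ dsigma/dx
downstream = upstream @ jacobian
print(downstream)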
We now have everything to perform the backward pass through the softmax layer. However, notice that the Jacobian $\frac{\partial \boldsymbol{\sigma}}{\mathbf{\partial x}}$ is a $D\times D$ matrix for a single input vector $\mathbf{x}$. This is because each output $\sigma_i$ depends on each input $x_j$. In practice, we typically deal with batched input where we combine multiple input vectors into an input matrix $\mathbf{X}$ of size $N\times D$, where $N$ is the number of input vectors. Each vector $\mathbf{x}\in \mathbf{X}$ will yield its own Jacobian $\frac{\partial \boldsymbol{\sigma}}{\mathbf{\partial x}}$ during the backward pass. This means that we have to compute a total of $ND^2$ partial derivatives for the whole batch.
It turns out, in many cases we can do better than that. For that, check out the next section.
Softmax + Cross-Entropy¶
The softmax function and the cross-entropy loss are often used together because they form a mathematically and conceptually well-aligned pair for multi-class classification. Softmax converts raw model outputs (logits) into a normalized probability distribution, making the model's predictions interpretable as class probabilities that sum to one. Cross-entropy then measures how well this predicted probability distribution matches the true target distribution, which is typically one-hot encoded. In this sense, softmax defines what the model predicts (a probability distribution), while cross-entropy defines how good that prediction is relative to the ground truth.
Beyond this conceptual fit, their combination leads to a particularly simple and numerically stable gradient during the backward pass. When cross-entropy is applied directly to the softmax outputs, the derivative of the loss with respect to the logits simplifies to the difference between the predicted probabilities and the target labels. This avoids explicitly computing the full Jacobian of the softmax and results in efficient, stable backpropagation even for large numbers of classes. As a result, the softmax–cross-entropy pairing is not only intuitive but also computationally efficient, which explains why it has become the standard choice in modern neural network classifiers. Let's see how this works.
Forward Pass¶
We already know that we can compute the softmax output for a $D$-dimensional vector of logits using the following formula:

$$\sigma_i = \frac{e^{x_i}}{\sum_{k=1}^{D} e^{x_k}}$$
The cross-entropy loss $\mathcal{L}_{CE}$ for a given $D$-dimensional vector $\mathbf{y}$ containing the ground-truth labels is defined as:

$$\mathcal{L}_{CE} = - \sum_{i=1}^{D} y_i \log \left( \sigma_i \right)$$
where $y_{i}$ is either $0$ or $1$ depending on whether the $i$-th class is the true class ($1$) or a wrong class ($0$). In this form, the cross-entropy loss applies to multi-class classification tasks where exactly one class is correct. This naturally implies that $\mathbf{y}$ is a one-hot vector containing only a single $1$ at the position reflecting the correct class label. Note that $D$ here reflects the number of classes, which must match the size of the output of the last linear layer before the softmax.
Intuitively, the cross-entropy loss measures how "surprised" the model is by the correct answer. If the model assigns a high probability to the true class, the surprise is low and the loss is small; if it assigns a low probability, the surprise is high and the loss becomes large. In this sense, cross-entropy directly rewards confident and correct predictions while strongly penalizing confident but wrong ones. Rather than caring about all predicted probabilities equally, it focuses on the probability assigned to the true label, encouraging the model to shift probability mass toward the correct class and to produce well-calibrated probability distributions over classes.
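To make this concrete with the (rounded) softmax outputs from the earlier example, the loss is simply the negative log-probability assigned to the true class; note that this small sketch ignores any averaging over a batch, which the implementation further below additionally applies.

probs = np.array([0.039, 0.642, 0.319])   # rounded softmax outputs from the earlier example

# True class has a high predicted probability -> low loss ("low surprise")
print(-np.log(probs[1]))    # ~0.44

# True class has a low predicted probability -> high loss ("high surprise")
print(-np.log(probs[0]))    # ~3.24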
Backward Pass¶
Now that we have a concrete loss function, we can compute the upstream gradient $\frac{\partial\mathcal{L}_{CE}}{\partial \boldsymbol{\sigma}}$ for the backward pass through the softmax function. Using basic calculus rules, we can compute the partial derivative $\frac{\partial \mathcal{L}_{CE} }{\partial \sigma_i}$ in the upstream gradient as:

$$\frac{\partial \mathcal{L}_{CE}}{\partial \sigma_i} = \frac{\partial}{\partial \sigma_i} \left( - \sum_{k=1}^{D} y_k \log \left( \sigma_k \right) \right) = - \frac{y_i}{\sigma_i}$$
With all values available, we can now compute the gradient $\frac{\partial\mathcal{L}}{\partial \mathbf{x}}$ by plugging in the values for $\frac{\partial\mathcal{L}}{\partial \boldsymbol{\sigma}}$ and $\frac{\partial \boldsymbol{\sigma}}{\partial \mathbf{x}}$:

$$\frac{\partial\mathcal{L}}{\partial x_j} = \sum_{i=1}^{D} \frac{\partial \mathcal{L}_{CE}}{\partial \sigma_i} \, \frac{\partial \sigma_i}{\partial x_j} = - \frac{y_j}{\sigma_j} \, \sigma_j \left( 1 - \sigma_j \right) + \sum_{i \neq j} \left( - \frac{y_i}{\sigma_i} \right) \left( - \sigma_i \, \sigma_j \right)$$
While this looks rather overwhelming at first glance, we can perform several simplification steps. Firstly, in each term the $\sigma_i$ in the denominator of $\frac{y_i}{\sigma_i}$ cancels against the $\sigma_i$ coming from the Jacobian entry. We can then expand the factor $-y_j(1-\sigma_j)$ to get $-y_j + \sigma_j y_j$, and lastly reorder all terms to spot the relevant pattern:

$$\frac{\partial\mathcal{L}}{\partial x_j} = - y_j \left( 1 - \sigma_j \right) + \sum_{i \neq j} y_i \, \sigma_j = - y_j + \sigma_j \, y_j + \sum_{i \neq j} \sigma_j \, y_i = - y_j + \sigma_j \sum_{k=1}^{D} y_k$$
The last transformation step using the $\sum$ notation is mainly to see that all entries share this sum. Moreover, since $\mathbf{y}$ is a one-hot vector with only one entry $y_k = 1$, we know that $\sum_{k=1}^D y_k = 1$. Thus, at last, we get the very elegant solution to compute the downstream gradient $\frac{\partial\mathcal{L}}{\partial \mathbf{x}}$:

$$\frac{\partial\mathcal{L}}{\partial \mathbf{x}} = \boldsymbol{\sigma} - \mathbf{y}$$
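As a quick numerical sanity check (a small sketch using the running example and the upstream gradient $-y_i/\sigma_i$ derived above), the full-Jacobian route and the simplified expression give the same result:

x = np.array([-1.2, 1.6, 0.9])
y = np.array([0.0, 1.0, 0.0])              # one-hot ground truth (class 2 is correct)

sigma = np.exp(x) / np.sum(np.exp(x))

# Route 1: upstream gradient dL/dsigma = -y/sigma multiplied with the full Jacobian
upstream = -y / sigma
jacobian = np.diag(sigma) - np.outer(sigma, sigma)
grad_full = upstream @ jacobian

# Route 2: the simplified expression sigma - y
grad_simple = sigma - y

print(np.allclose(grad_full, grad_simple))   # True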
Basic Implementation¶
Since the softmax and cross-entropy are commonly used together and provide such an elegant solution for the downstream gradient, frameworks such as PyTorch or TensorFlow provide built-in classes combining both operations. To illustrate this, the code cell below contains a NumPy-only Python class that combines the softmax and the cross-entropy loss, supporting batched inputs. The forward() method returns the loss for the batch, and the backward() method returns the downstream gradient to be passed to the previous layer (with respect to the forward pass) to continue backpropagation. Compared to the definition we have seen so far, this implementation adds two practical extensions:
Stability trick: The stability trick in the softmax computation consists of subtracting the maximum input value from all logits before applying the exponential, which does not change the final output probabilities but greatly improves numerical stability (see the short demonstration after this list). Since the exponential function grows very rapidly, large positive logits can cause overflow, while very negative logits can lead to underflow and loss of precision. By shifting the logits so that the largest value becomes $0$, all exponentials are guaranteed to be at most $1$, keeping the computation in a safe numerical range. This works because softmax is invariant to adding or subtracting the same constant from all inputs, making the trick a simple yet essential step for reliable and stable training.
Loss and gradient averaging: Dividing the loss and gradients by the batch size ensures that their scale is independent of how many samples are processed at once, making training behavior consistent across different batch sizes. When the loss is defined as the mean over the batch rather than the sum, each sample contributes equally regardless of batch size, and the magnitude of the gradients remains stable as the batch size changes. This helps keep learning rates meaningful and comparable: doubling the batch size does not automatically double the gradient magnitudes or require retuning the optimizer. In practice, averaging over the batch leads to more predictable optimization dynamics and simplifies both theoretical reasoning and practical implementation of gradient-based learning.
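The following short demonstration illustrates the stability trick from the first point; the function names softmax_naive and softmax_stable as well as the deliberately large logits are made up purely for this sketch. The naive implementation overflows, while the shifted version returns the correct probabilities.

def softmax_naive(x):
    return np.exp(x) / np.sum(np.exp(x))

def softmax_stable(x):
    shifted = x - np.max(x)               # largest logit becomes 0
    return np.exp(shifted) / np.sum(np.exp(shifted))

x_large = np.array([1000.0, 1001.0, 1002.0])

print(softmax_naive(x_large))    # [nan nan nan] due to overflow in np.exp
print(softmax_stable(x_large))   # [0.09003057 0.24472847 0.66524096]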
class SoftmaxCrossEntropy:

    def __init__(self):
        self.probs = None
        self.labels = None

    def forward(self, logits, labels):
        """
        Computes the loss.
        logits: (batch_size, num_classes) - Raw scores from the previous layer
        labels: (batch_size, num_classes) - One-hot encoded ground truth
        """
        self.labels = labels
        # Numerical stability trick: subtract max logit from all logits
        # This prevents e^x from exploding to infinity.
        # Softmax is computed along the class dimension (last axis).
        exps = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
        self.probs = exps / np.sum(exps, axis=-1, keepdims=True)
        # Avoid log(0) by adding a tiny epsilon
        epsilon = 1e-12
        batch_size = logits.shape[0]
        # Compute Cross-Entropy loss (averaged over the batch)
        loss = -np.sum(labels * np.log(self.probs + epsilon)) / batch_size
        return loss

    def backward(self):
        """
        Computes the gradient of the loss with respect to the logits.
        Returns: (batch_size, num_classes)
        """
        batch_size = self.labels.shape[0]
        # The simplified gradient: (p - y) / batch_size
        grad = (self.probs - self.labels) / batch_size
        return grad
For a quick test, let's consider our small logits vector $[-1.2\ \ 1.6\ \ 0.9]^\top$ from our initial example. Since this implies that our classification task has $3$ classes, we can compute the loss under the assumption that the 1st class (labels=[1,0,0]), the 2nd class (labels=[0,1,0]), or the 3rd class (labels=[0,0,1]) is the correct class; see the following code cell computing all three possible losses.
sce = SoftmaxCrossEntropy()

for labels in [ [1,0,0], [0,1,0], [0,0,1] ]:
    loss = sce.forward(np.asarray([-1.2, 1.6, 0.9]), np.asarray(labels))
    print(f"Loss for labels {labels}: {loss:.3f}")
Loss for labels [1, 0, 0]: 1.081
Loss for labels [0, 1, 0]: 0.148
Loss for labels [0, 0, 1]: 0.381
Unsurprisingly, the loss for [0,1,0] is the smallest, since here the true class already has the largest logit value.
Apart from just computing the loss using the forward() method, we can also compute the downstream gradients for all three labels using the backward() method; see the code cell below. Note that we always have to call the forward() method first since we need the softmax outputs (self.probs) and the ground truth (self.labels) to compute the gradients.
sce = SoftmaxCrossEntropy()

for labels in [ [1,0,0], [0,1,0], [0,0,1] ]:
    loss = sce.forward(np.asarray([-1.2, 1.6, 0.9]), np.asarray(labels))
    grad = sce.backward()
    print(f"Gradients for labels {labels}: {grad}")
Gradients for labels [1, 0, 0]: [-0.320318    0.21403257  0.10628543]
Gradients for labels [0, 1, 0]: [ 0.01301533 -0.11930076  0.10628543]
Gradients for labels [0, 0, 1]: [ 0.01301533  0.21403257 -0.2270479 ]
Of course, compared to the losses, the gradients are less easy to interpret just by briefly looking at them.
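One way to sanity-check these numbers is to recompute $\boldsymbol{\sigma} - \mathbf{y}$ directly and apply the same scaling as the implementation: forward() divides by logits.shape[0], which is 3 for our 1-D demo vector. The following small sketch reuses the values from above and reproduces the gradients printed by backward():

x = np.array([-1.2, 1.6, 0.9])
sigma = np.exp(x) / np.sum(np.exp(x))

for labels in [ [1,0,0], [0,1,0], [0,0,1] ]:
    y = np.asarray(labels, dtype=float)
    print(f"(sigma - y) / 3 for labels {labels}: {(sigma - y) / 3}")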
Summary¶
This notebook provided a detailed and systematic treatment of the softmax function, starting from its role in transforming raw logits into a normalized probability distribution over classes. The forward pass was derived carefully, emphasizing important practical considerations such as numerical stability through subtracting the maximum logit. By grounding the discussion in both intuition and mathematics, the notebook established a clear understanding of why softmax is the standard choice for multi-class classification tasks and how it turns arbitrary model outputs into interpretable probabilities.
A significant focus of the notebook was the backward pass of the softmax function when treated as an individual layer. The full Jacobian of softmax was derived and analyzed, highlighting its characteristic structure with non-zero off-diagonal terms that capture the coupling between class probabilities. The notebook then examined the softmax function in combination with the cross-entropy loss, demonstrating how this pairing leads to a dramatic simplification of the gradient. By deriving the combined backward pass step by step, it became clear why the gradient with respect to the logits reduces to the difference between the predicted probabilities and the target distribution. This result not only provides strong intuition for how learning proceeds in classification models, but also explains why this combination is both computationally efficient and numerically stable in practice.
Beyond the specific formulas, the notebook emphasized broader lessons about neural network design and implementation. Understanding the softmax and cross-entropy at this level reveals recurring patterns such as gradient simplifications, invariances, and stability tricks that appear throughout deep learning. These insights help demystify backpropagation and make it easier to reason about more complex architectures and loss functions.
Finally, the notebook motivated why learning these fundamentals remains highly valuable even though libraries like PyTorch and TensorFlow provide highly optimized implementations out of the box. A deep understanding of what happens "under the hood" enables practitioners to debug training issues, reason about numerical behavior, and implement custom layers or losses with confidence. In this sense, mastering the softmax function and its gradients serves not only as a practical skill, but also as a conceptual foundation for understanding and extending modern deep learning frameworks.