Disclaimer: This Jupyter Notebook contains content generated with the assistance of AI. While every effort has been made to review and validate the outputs, users should independently verify critical information before relying on it. The SELENE notebook repository is constantly evolving. We recommend downloading or pulling the latest version of this notebook from Github.

Bias-Variance Decomposition¶

The bias-variance decomposition is a fundamental concept in machine learning that helps explain the sources of error in a predictive model. When building a model, the goal is not only to perform well on the training data but also to generalize well to unseen data. The total prediction error can be decomposed into three components: bias, variance, and irreducible error. This decomposition provides a structured way to understand how model complexity and data characteristics affect performance.

Bias refers to the error introduced by approximating a complex real-world problem with a simplified model. A model with high bias makes strong assumptions about the data and often fails to capture important patterns, resulting in systematic errors or underfitting. In contrast, variance measures the model's sensitivity to fluctuations in the training data. High variance implies the model is overfitting — capturing noise rather than the true signal — which leads to poor generalization on new data. The irreducible error represents the inherent noise in the data that no model can predict.

Understanding the bias-variance decomposition is crucial for properly training, evaluating, and improving machine learning models. It helps practitioners identify whether a model's poor performance is due to underfitting, overfitting, or data limitations. This insight guides decisions about model selection, feature engineering, regularization, and data collection. For example, if high variance is detected, one might use a simpler model, apply regularization, or gather more training data. If high bias is the issue, a more flexible model or better features may be needed. Ultimately, the bias-variance decomposition provides a theoretical foundation for one of the most important trade-offs in machine learning: the bias-variance tradeoff. Striking the right balance is key to developing models that perform well not just on training data, but on unseen data as well — ensuring reliability and robustness in real-world applications.


Quick Refresher: Random Variables, Expectation & Variance¶

Random Variables¶

In probability and statistics, a random variable is a numerical description of the outcome of a random phenomenon or experiment. It's a way to assign a specific number to each possible outcome of an uncertain event. More formally, a random variable is a function that maps the outcomes of a random experiment to real numbers. The set of all possible outcomes is called the sample space.

Example 1: Assume you flip a fair coin 3 times and record whether the coin shows heads (H) or tails (T). The set of all possible outcomes (i.e., the sample space) is {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}. A random variable X could be defined as the "number of heads." So, X(HHH)=3, X(HHT)=2, X(TTT)=0, and so on for each of the eight outcomes. Since X takes on a countable number of distinct values, X is a discrete random variable.

Example 2: Assume you are sitting at a bus stop and observe the arrival of buses. A random variable X could be defined as the time between the arrivals of two consecutive buses. Since the set of all possible times is uncountable, X is a continuous random variable.
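To make these two examples a bit more tangible, here is a small sketch in Python. The enumeration of the coin outcomes follows Example 1 exactly; for Example 2, the exponential distribution used for the waiting times is merely an illustrative assumption, since the text does not prescribe any particular distribution.

```python
from itertools import product

import numpy as np

# Example 1: the discrete random variable X = "number of heads" in 3 coin flips.
sample_space = ["".join(flips) for flips in product("HT", repeat=3)]
X = {outcome: outcome.count("H") for outcome in sample_space}
print(X)  # {'HHH': 3, 'HHT': 2, ..., 'TTT': 0}

# Example 2: a continuous random variable, e.g., the time between two bus arrivals.
# We model the waiting time as exponentially distributed (an arbitrary assumption).
rng = np.random.default_rng(0)
waiting_times = rng.exponential(scale=10.0, size=5)  # scale = assumed mean wait in minutes
print(waiting_times)  # real-valued; uncountably many possible values
```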

For many machine learning tasks, especially regression and classification, the output of the model (the predicted value or probability) is considered a realization of a random variable. Recall that a regression model has the following form:

$$\large Y = f(X) + \epsilon $$

with $X$ and $Y$ being the input and output space, respectively; $f(X)$ is the true (and unknown!) function mapping the inputs to the corresponding outputs, and $\epsilon$ is the error term. This error term represents all the unobserved factors that influence $Y$ but are not included in the model. These unobserved factors are inherently random. As such, the error term $\epsilon$ is a random variable, and therefore $Y$ is a random variable as well. In simple terms, the predictions made by the model are uncertain due to that randomness.
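As a concrete, purely illustrative example of this data-generating process, the following sketch assumes a made-up true function $f(x) = \sin(2\pi x)$ and Gaussian noise; neither choice comes from the text above, they simply make the role of $\epsilon$ visible.

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    # Hypothetical "true" function; in practice f is unknown.
    return np.sin(2 * np.pi * x)

sigma = 0.3                                # assumed standard deviation of the noise
x = rng.uniform(0, 1, size=20)             # inputs
eps = rng.normal(0, sigma, size=x.shape)   # unobserved random noise
y = f(x) + eps                             # observed outputs: realizations of a random variable
print(np.column_stack((x, y))[:5])
```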

Expected Value¶

The expected value (also called the expectation or mean) of a random variable X, commonly written as $\mathbb{E}[X]$, is the weighted average of all possible values X can take, where each value is weighted by its probability. It represents the long-term average value of the random variable if the experiment is repeated many times. In the case of a discrete random variable $X$, the expected value $\mathbb{E}[X]$ is defined as

$$ \begin{align} \large \mathbb{E}[X] &\large = {\sum_{x}x\cdot P(X=x)} \end{align} $$

for all possible outcomes $x$; $P(X=x)$ denotes the probability of the outcome $x$.

Let's illustrate this idea using our coin example from before. Since we assume a fair coin, each of the $8$ outcomes (HHH, HHT, HTH, etc.) is equally likely. We also know that there are four possible values for the number of heads (H): $0$, $1$, $2$, and $3$. We can therefore compute the probability for all outcomes $x$ as:

  • $P(X=0) = 1/8\ $, since there is 1 outcome resulting in 0 heads (TTT)
  • $P(X=1) = 3/8\ $, since there are 3 outcomes resulting in 1 head (TTH, THT, HTT)
  • $P(X=2) = 3/8\ $, since there are 3 outcomes resulting in 2 heads (HHT, HTH, THH)
  • $P(X=3) = 1/8\ $, since there is 1 outcome resulting in 3 heads (HHH)

Of course, all probabilities sum up to $1$.

$$ \begin{align} \large \mathbb{E}(X) &\large = {\sum_{x}x\cdot P(X=x)}\\ &\large = \left[0\cdot P(X=0) \right] + \left[1\cdot P(X=1) \right] + \left[2\cdot P(X=2) \right] + \left[3\cdot P(X=3) \right]\\[1.0em] &\large = \left[0\cdot 1/8 \right] + \left[1\cdot 3/8 \right] + \left[2\cdot 3/8 \right] + \left[3\cdot 1/8 \right]\\[1.0em] &\large = 0 + 3/8 + 6/8 + 3/8 = 12/8 = \mathbf{1.5} \end{align} $$

The expected value of $X$, the number of heads in three coin flips, is $\mathbb{E}[X]=1.5$.
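We can double-check this result with a few lines of Python: once exactly from the probabilities above, and once as a long-run average over many simulated repetitions of the experiment.

```python
import numpy as np

values = np.array([0, 1, 2, 3])
probs = np.array([1/8, 3/8, 3/8, 1/8])
print("exact:", values @ probs)                  # expected value: 1.5

rng = np.random.default_rng(0)
flips = rng.integers(0, 2, size=(100_000, 3))    # 1 = heads, 0 = tails
print("simulated:", flips.sum(axis=1).mean())    # long-run average, close to 1.5
```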

For the purpose of this notebook, we need to cover some of the basic characteristics of the expected value $\mathbb{E}[X]$:

  • Constants: The expected value of a constant $c$ is also a constant:
$$\large \mathbb{E}[c] = c $$

This also means that $\mathbb{E}[\mathbb{E}[X]] = \mathbb{E}[X]$ since the expected value of a random variable $X$ is just a constant; recall that we just calculated $\mathbb{E}[X]=1.5$ for our coin example. Furthermore, if we have a random variable $X$ and a constant $c$, the following relationship holds: $$\large \mathbb{E}[cX] = c\mathbb{E}[X] $$

  • Linearity of expectations: Given two random variables $X$ and $Y$, the expected value of their sum is equal to the sum of their expected values:
$$\large \mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y] $$
  • Product of expectations: Given two independent random variables $X$ and $Y$, the expected value of their product is equal to the product of their expected values:
$$\large \mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y] $$

Again, this only holds true if both random variables $X$ and $Y$ are independent.

Although we skip the proofs here, they follow directly from the definition of the expected value.
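Instead of a proof, here is a quick numerical sanity check of these three rules using two independent random variables; the particular distributions (a normal and a uniform) are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X = rng.normal(loc=2.0, scale=1.0, size=n)      # E[X] = 2
Y = rng.uniform(0.0, 1.0, size=n)               # E[Y] = 0.5, independent of X
c = 3.0

print(np.mean(c * X), c * np.mean(X))           # E[cX]  ~= c * E[X]
print(np.mean(X + Y), np.mean(X) + np.mean(Y))  # E[X+Y] ~= E[X] + E[Y]
print(np.mean(X * Y), np.mean(X) * np.mean(Y))  # E[XY]  ~= E[X] * E[Y] (needs independence)
```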

Variance¶

The variance of a random variable is a measure of how much the values of the variable differ from the expected value. In simple terms, the variance tells how "spread out" the values are. It is calculated as the expected value of the squared difference between the random variable $X$ and its expected value $\mathbb{E}[X]$:

$$\large Var(X) = \mathbb{E}\left[ \left(X - \mathbb{E}[X] \right)^2 \right] $$

We can rewrite the formula for the variance by (a) expanding the quadratic term, (b) using the linearity of expectations to move the expected value to the resulting individual terms, and (c) utilizing the fact that $\mathbb{E}[X]$ is a constant. More formally:

$$\begin{align} \large Var(X)\ & \large = \mathbb{E}\left[ \left(X - \mathbb{E}[X] \right)^2 \right]\\[1em] & \large = \mathbb{E}\left[ X^2 - 2X\mathbb{E}[X] + \mathbb{E}[X]^2 \right]\\[1em] & \large = \mathbb{E}\left[ X^2\right] - 2\mathbb{E}\left[X\mathbb{E}[X]\right] + \mathbb{E}\left[\mathbb{E}[X]^2 \right]\\[1em] & \large = \mathbb{E}\left[ X^2\right] - 2\mathbb{E}[X]\mathbb{E}[X] + \mathbb{E}\left[\mathbb{E}[X]^2 \right]\\[1em] & \large = \mathbb{E}\left[ X^2\right] - 2\mathbb{E}[X]^2 + \mathbb{E}[X]^2\\[1em] & \large = \mathbb{E}\left[ X^2\right] - \mathbb{E}[X]^2 \\[1em] \end{align} $$

Lastly, we can rewrite this formula such that $\mathbb{E}[X^2]$ is on one side:

$$\large \mathbb{E}[X^2] = \mathbb{E}[X]^2 + Var(X) $$

We will later use this equation for the decomposition of the model error; notice how it allows us to replace the expected value of a squared random variable with its squared expected value plus its variance.
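As a small sanity check, the snippet below computes the variance of our coin-flip variable $X$ both directly from its definition and via $\mathbb{E}[X^2] - \mathbb{E}[X]^2$.

```python
import numpy as np

values = np.array([0, 1, 2, 3])
probs = np.array([1/8, 3/8, 3/8, 1/8])

E_X = values @ probs                        # E[X] = 1.5
E_X2 = (values**2) @ probs                  # E[X^2] = 3.0
var_direct = ((values - E_X)**2) @ probs    # E[(X - E[X])^2]

print(var_direct, E_X2 - E_X**2)            # both equal 0.75
```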


Bias-Variance Decomposition (for regression problems)¶

Recall that for regression problems, we are trying to predict a dependent variable $Y$ from independent features $X$, assuming that there is a true but unknown function $f(X)$ that defines the relationship between $X$ and $Y$. Individual observations $y\in Y$ will deviate from the true function $f(X)$ by some random error $\epsilon$ due to random noise or any unobserved factors that influence $Y$. Regression models typically assume that the error term is normally distributed with an expected value $\mathbb{E}[\epsilon] = 0$ and variance $Var(\epsilon) = \sigma^2$.

Our goal is to estimate the true and unknown function $f(x)$. For this, we obtain a training dataset $D$ containing $n$ samples:

$$\large D = \left\{(x_1, y_1), (x_2, y_2), \cdots , (x_n, y_n) \right\} $$

where $x_i\in X$ and $y_i\in Y$. Note that $x_i$ may refer to a single input feature but also to a feature vector comprising several feature values. However, we use $x_i$ (instead of $\mathbf{x}_i$) to ease the presentation, and the distinction between individual features and feature vectors does not matter for our discussion.

It is assumed that $D$ consists of a sample of independent and identically distributed (i.i.d.) pairs. Independent means that each sample is not influenced by or correlated with any other sample. In other words, the features and labels of one sample do not tell us anything about another sample. Identically distributed means that all samples come from the same underlying probability distribution — that is, the process that generated the data is consistent across all samples.

Using any learning algorithm — for example, (Polynomial) Linear Regression — we can train a model described by the hypothesis $h_D(X)$ that minimizes the mean squared error (MSE) over all training samples in $D$. We can also say that the hypothesis $h_D(X)$ is the model. In short, $h_D(X)$ is our model's hypothesis of the true, unknown function $f(X)$. We use the subscript $D$ to indicate that $h_D(X)$ was trained on a specific dataset $D$; we will see later why this becomes important. The MSE is used for regression models because it measures the average squared difference between the predicted values and the actual values, giving a clear sense of how far off the predictions are:

$$\large MSE = \frac{1}{n} \sum_{(x_i, y_i)\in D} (y_i - h_D(x_i))^{2} $$

Squaring the errors ensures that larger mistakes are penalized more heavily than smaller ones, which helps the model focus on reducing big errors. MSE is also mathematically convenient because it is differentiable, making it easier to optimize using methods like gradient descent.
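To make this concrete, here is a minimal sketch of training a model on a single dataset $D$ and computing its MSE. Polynomial regression via `np.polyfit` is just one possible learning algorithm, and the true function and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Training dataset D = {(x_i, y_i)} drawn from an assumed data-generating process.
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.shape)

# Polynomial regression as the learning algorithm; h_D is the fitted hypothesis.
h_D = np.poly1d(np.polyfit(x, y, deg=3))

mse = np.mean((y - h_D(x))**2)              # average squared training error
print(mse)
```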

We typically assess a model's performance based on how well it performs on unseen data, i.e., data samples that are not in $D$ and therefore have not been seen during training. Let's assume a new data sample $(x^\prime, y^\prime)$ with $y^\prime = f(x^\prime) + \epsilon$. Similar to the MSE, we can use the squared error (SE) to measure how well the model performs on this new data sample:

$$\large SE = (y^\prime - h_D(x^\prime))^{2} $$

In the introductory notebook for bias and variance, we already saw that the prediction $h_D(x^\prime)$ typically depends on the training dataset $D$ — that is, if $D$ contained a different set of samples $(x_i, y_i)$, the model would likely be different and therefore make a different prediction. So let's assume that we have randomly drawn $N$ different datasets $(D_1, D_2, \dots , D_N)$, and have trained a model using each of those datasets, giving us $N$ models

$$\large \left[ h_{D_1}(x), h_{D_2}(x), \dots , h_{D_N}(x) \right] $$

Since each model depends on the randomly drawn set of data samples, $h_D(x)$ is a random variable. We can therefore calculate the expected value $\mathbb{E}[h_D(x)]$ representing the average prediction value of the collection of models over many training datasets $(D_1, D_2, \dots , D_N)$. For each model, we can now calculate the squared error (SE):

$$\large \left[ (y^\prime - h_{D_1}(x^\prime))^{2}, (y^\prime - h_{D_2}(x^\prime))^{2}, \dots , (y^\prime - h_{D_N}(x^\prime))^{2} \right] $$

Note that the squared error we observe is also a random variable. Apart from $h_D(x)$ already being a random variable, $y^\prime$ contains the error term $\epsilon$ which itself is a random variable. Thus, to get an overall idea of how well our learning algorithm is performing on unseen data, we can compute the expected value of all squared errors, which is called the expected squared error or just the model error:

$$\large Error = \mathbb{E}\left[(y^\prime - h_D(x^\prime))^{2}\right] $$
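The following sketch turns this idea into a small Monte Carlo experiment: it draws $N$ training datasets, fits one model per dataset, and averages the squared errors at a single new point $x^\prime$. The true function, noise level, sample size, and model class are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * np.pi * x)         # assumed true function
sigma, n, N, degree = 0.3, 30, 500, 3       # assumed noise, sample size, #datasets, model degree
x_new = 0.7                                 # the new input x'

squared_errors = []
for _ in range(N):
    # Draw a fresh training set D_i and fit h_{D_i}.
    x = rng.uniform(0, 1, size=n)
    y = f(x) + rng.normal(0, sigma, size=n)
    h_D = np.poly1d(np.polyfit(x, y, deg=degree))

    # Evaluate on a new noisy observation y' = f(x') + eps.
    y_new = f(x_new) + rng.normal(0, sigma)
    squared_errors.append((y_new - h_D(x_new))**2)

print(np.mean(squared_errors))              # Monte Carlo estimate of E[(y' - h_D(x'))^2]
```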

Bias & Variance¶

In the introductory notebook, we have also already introduced the notion of model bias and model variance. Recall that the bias describes how much the expected value of the model deviates from the true value of the function $f(x)$ we are trying to estimate. A low bias tells us that our model does a good job of approximating our function $f(x)$, and vice versa for a high bias. With $\mathbb{E}[h_D(x)]$ being the expected value of the model, the bias is therefore defined as:

$$\large Bias(h_D(x)) = \mathbb{E}[h_D(x)] - f(x) $$

In line with the definition of the variance of a random variable, the variance of a model is the expected value of the squared differences between a particular model $h_D(x)$ — particular with respect to the dataset $D$ — and the expected value $\mathbb{E}[h_D(x)]$ of the model:

$$\large Variance(h_D(x)) = \mathbb{E}\left[ \left( h_D(x) - \mathbb{E}[h_D(x)] \right)^{2} \right] $$

The model variance captures how much the model fits vary across different datasets, therefore measuring the average "consistency" of the model. A learning algorithm with high variance indicates that the models vary a lot across datasets, while low variance indicates that models are quite similar across datasets.
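Both quantities can be estimated empirically by reusing the ensemble-of-models idea from above: train on many datasets, collect the predictions $h_{D_i}(x^\prime)$ at a fixed point, and compute their mean and spread. As before, all concrete modeling choices below (true function, noise level, and the deliberately too-simple degree-1 model) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * np.pi * x)         # assumed true function
sigma, n, N, degree = 0.3, 30, 500, 1       # a degree-1 model will show noticeable bias
x_new = 0.7

predictions = []
for _ in range(N):
    x = rng.uniform(0, 1, size=n)
    y = f(x) + rng.normal(0, sigma, size=n)
    h_D = np.poly1d(np.polyfit(x, y, deg=degree))
    predictions.append(h_D(x_new))

predictions = np.array(predictions)
bias = predictions.mean() - f(x_new)        # E[h_D(x')] - f(x')
variance = predictions.var()                # E[(h_D(x') - E[h_D(x')])^2]
print(bias, variance)
```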

Both bias and variance introduce errors to the model, with bias being a systematic error and variance being a random error. We also now know that the total error of a model is $\mathbb{E}\left[(y^\prime - h_D(x^\prime))^{2}\right]$. So the remaining question is now how the bias and the variance actually contribute to the model error.

Decomposition of Model Error¶

The goal of the decomposition is to see how the bias, the variance, and — as we will see — the irreducible error contribute to the total model error. Our starting point is therefore the formula for the model error. The only slight change we make to the formula is to write $x$ instead of $x^\prime$ and $y$ instead of $y^\prime$, just to ease the presentation a bit. Thus, we start with the following equation for the model error:

$$\large Error = \mathbb{E}\left[(y - h_D(x))^{2}\right] $$

In the first step, we simply replace $y$ with its definition $f(x) + \epsilon$ and reorder the resulting three terms a bit:

$$\begin{align} \large Error\ &\large = \mathbb{E}\left[(y - h_D(x))^{2}\right]\\[1em] &\large = \mathbb{E}\left[(f(x) + \epsilon - h_D(x))^{2}\right]\\[1em] &\large = \mathbb{E}\left[(f(x) - h_D(x) + \epsilon)^{2}\right] \end{align} $$

Now we expand the quadratic equation. However, we do treat $f(x) - h_D(x)$ as a single term for the expansion:

$$\begin{align} \large Error\ &\large = \mathbb{E}\left[({\color{blue}f(x) - h_D(x)} + {\color{red}\epsilon})^{2}\right]\\[1em] &\large = \mathbb{E}\left[({\color{blue}f(x) - h_D(x)})^2 + 2({\color{blue}f(x) - h_D(x)}){\color{red}\epsilon} + {\color{red}\epsilon}^2 \right] \\[1em] &\large = \mathbb{E}\left[({\color{blue}f(x) - h_D(x)})^2\right] + 2\mathbb{E}\left[ ({\color{blue}f(x) - h_D(x)}){\color{red}\epsilon} \right] + \mathbb{E}\left[ {\color{red}\epsilon}^2 \right] \\[1em] \end{align} $$

Notice that we also moved the constant factor of $2$ outside the expected value. We can now look at all three terms individually. Let's start with the third term $\mathbb{E}\left[ \epsilon^2 \right]$ by applying the formula $\mathbb{E}\left[X^2\right] = Var(X) + \mathbb{E}\left[ X \right]^2$ (see above) to get:

$$\begin{align} \large \mathbb{E}\left[ \epsilon^2 \right]\ &\large = Var(\epsilon) + \mathbb{E}\left[ \epsilon \right]^2 \\[1em] &\large = Var(\epsilon) \\[1em] &\large = \sigma^2 \\[1em] \end{align} $$

since $\mathbb{E}\left[ \epsilon \right] = 0$ by definition of our model.

For the second term $2\mathbb{E}\left[ (f(x) - h_D(x))\epsilon \right]$, we utilize the fact that the random error $\epsilon$ is independent of the training data $D$ and of $x$, and therefore independent of $f(x) - h_D(x)$. This allows us to transform the expected value of the product into a product of expected values:

$$\large 2\mathbb{E}\left[ (f(x) - h_D(x))\epsilon \right] = 2 \mathbb{E}\left[ f(x) - h_D(x)\right] \mathbb{E}\left[ \epsilon \right] $$

However, by definition, $\mathbb{E}[\epsilon] = 0$, which means that this term simply equates to $0$. Thus, so far, our equation for the model error looks as follows:

$$\begin{align} \large Error\ &\large = \mathbb{E}\left[(f(x) - h_D(x))^2\right] + \sigma^2 \\[1em] \end{align} $$

This means that we are now left with the first term $\mathbb{E}\left[(f(x) - h_D(x))^2\right]$. This one will take some more consideration and a bit of mathematical trickery. Recall that our goal is to write the model error in such a way that we can "see" the bias and the variance. Also notice that both the formula for the bias and the formula for the variance contain the expected value $\mathbb{E}[h_D(x)]$, which is so far missing in our equation for the model error. So, let's "sneak in" $\mathbb{E}[h_D(x)]$ into the equation:

$$\large \mathbb{E}\left[(f(x) - h_D(x))^2\right] = \mathbb{E}\left[(f(x) - \mathbb{E}[h_D(x)] + \mathbb{E}[h_D(x)] - h_D(x))^2\right] $$

Of course, by subtracting and adding $\mathbb{E}[h_D(x)]$ to the inner term, we do not change the overall expected value. Next, we expand the quadratic term by treating $f(x) - \mathbb{E}[h_D(x)]$ and $\mathbb{E}[h_D(x)] - h_D(x)$ as individual terms, followed by moving the calculation of the expected values to the resulting terms after the expansion:

$$\begin{align} \large \mathbb{E}\left[(f(x) - h_D(x))^2\right]\ &\large = \mathbb{E}\left[({\color{blue}f(x) - \mathbb{E}[h_D(x)]} + {\color{red}\mathbb{E}[h_D(x)] - h_D(x)})^2\right]\\[1em] &\large = \mathbb{E}\left[({\color{blue}f(x) - \mathbb{E}[h_D(x)]})^2 + 2({\color{blue}f(x) - \mathbb{E}[h_D(x)]})({\color{red}\mathbb{E}[h_D(x)] - h_D(x)}) + ({\color{red}\mathbb{E}[h_D(x)] - h_D(x)})^2\right]\\[1em] &\large = \mathbb{E}\left[({\color{blue}f(x) - \mathbb{E}[h_D(x)]})^2\right] + 2\mathbb{E}\left[ ({\color{blue}f(x) - \mathbb{E}[h_D(x)]})({\color{red}\mathbb{E}[h_D(x)] - h_D(x)})\right] + \mathbb{E}\left[({\color{red}\mathbb{E}[h_D(x)] - h_D(x)})^2\right]\\[1em] \end{align} $$

Again, we have three terms we can address individually. First, notice that the last term $\mathbb{E}\left[(\mathbb{E}[h_D(x)] - h_D(x))^2\right]$ is already the definition for the variance; thus we can write:

$$\begin{align} \large \mathbb{E}\left[(f(x) - h_D(x))^2\right]\ &\large = \mathbb{E}\left[(f(x) - \mathbb{E}[h_D(x)])^2\right]\ +\ 2\mathbb{E}\left[ (f(x) - \mathbb{E}[h_D(x)])(\mathbb{E}[h_D(x)] - h_D(x))\right] + Var\left(h_D(x)\right)\\[1em] \end{align} $$

Next, we look at the second term $\mathbb{E}\left[ (f(x) - \mathbb{E}[h_D(x)])(\mathbb{E}[h_D(x)] - h_D(x))\right]$ — we can ignore the constant factor of $2$ here — and show that it equates to $0$. To do this, we first multiply both terms in parentheses together:

$$\begin{align} \large \mathbb{E}\left[ (f(x) - \mathbb{E}[h_D(x)])(\mathbb{E}[h_D(x)] - h_D(x))\right]\ &\large = \mathbb{E}\left[f(x)\mathbb{E}[h_D(x)] - f(x)h_D(x) - \mathbb{E}[h_D(x)]^2 + \mathbb{E}[h_D(x)]h_D(x) \right] \\[1em] &\large = \mathbb{E}\left[f(x)\mathbb{E}[h_D(x)]\right] - \mathbb{E}\left[f(x)h_D(x)\right] - \mathbb{E}\left[\mathbb{E}[h_D(x)]^2\right] + \mathbb{E}\left[\mathbb{E}[h_D(x)]h_D(x) \right] \\[1em] \end{align} $$

For the next simplifications, we have to remember that $f(x)$ is not a random variable, but a fixed, deterministic function of $x$. Therefore, in the context of calculating expected values, we can treat $f(x)$ as a constant. This means that $\mathbb{E}[f(x)] = f(x)$ and we can move $f(x)$ out of the expected value. Furthermore, we apply the rule that $\mathbb{E}[\mathbb{E}[X]] = \mathbb{E}[X]$. Applying these relationships, we can rewrite the equation as follows:

$$\begin{align} \large \mathbb{E}\left[ (f(x) - \mathbb{E}[h_D(x)])(\mathbb{E}[h_D(x)] - h_D(x))\right]\ &\large = \mathbb{E}\left[f(x)\mathbb{E}[h_D(x)]\right] - \mathbb{E}\left[f(x)h_D(x)\right] - \mathbb{E}\left[\mathbb{E}[h_D(x)]^2\right] + \mathbb{E}\left[\mathbb{E}[h_D(x)]h_D(x) \right] \\[1em] &\large = f(x)\mathbb{E}\left[\mathbb{E}[h_D(x)]\right] - f(x)\mathbb{E}\left[h_D(x)\right] - \mathbb{E}\left[\mathbb{E}[h_D(x)]^2\right] + \mathbb{E}[h_D(x)]\mathbb{E}[h_D(x)] \\[1em] &\large = f(x)\mathbb{E}[h_D(x)] - f(x)\mathbb{E}\left[h_D(x)\right] - \mathbb{E}[h_D(x)]^2 + \mathbb{E}[h_D(x)]^2 \\[1em] &\large = 0 \\[1em] \end{align} $$

Let's take one more pause to look at the current model error. With all the transformations and simplifications we have done so far, we can now rewrite our model error as follows:

$$\begin{align} \large Error\ &\large = \mathbb{E}\left[(f(x) - \mathbb{E}[h_D(x)])^2\right] + Var\left(h_D(x)\right) + \sigma^2 \\[1em] \end{align} $$

The last term we need to address is $\mathbb{E}\left[(f(x) - \mathbb{E}[h_D(x)])^2\right]$. As done before, we first expand the quadratic term and move the expected value to the resulting individual terms:

$$\begin{align} \large \mathbb{E}\left[(f(x) - \mathbb{E}[h_D(x)])^2\right] \ &\large = \mathbb{E}\left[ f(x)^2 - 2f(x)\mathbb{E}[h_D(x)] + \mathbb{E}[h_D(x)]^2\right] \\[1em] &\large = \mathbb{E}\left[ f(x)^2 \right] - 2\mathbb{E}\left[f(x)\mathbb{E}[h_D(x)]\right] + \mathbb{E}\left[\mathbb{E}[h_D(x)]^2\right] \\[1em] \end{align} $$

Again, the important detail here is that $f(x)$ is not a random variable. This means we can once more apply the two rules $\mathbb{E}[c] = c$ and $\mathbb{E}[\mathbb{E}[X]] = \mathbb{E}[X]$. First, since $f(x)$ is not a random variable, neither is $f(x)^2$, and $\mathbb{E}\left[ f(x)^2 \right]$ simplifies to $f(x)^2$. By using these two rules, we can also simplify the other two terms accordingly, giving us:

$$\begin{align} \large \mathbb{E}\left[(f(x) - \mathbb{E}[h_D(x)])^2\right] \ &\large = \mathbb{E}\left[ f(x)^2 \right] - 2\mathbb{E}\left[f(x)\mathbb{E}[h_D(x)]\right] + \mathbb{E}\left[\mathbb{E}[h_D(x)]^2\right] \\[1em] &\large = f(x)^2 - 2f(x)\mathbb{E}[h_D(x)] + \mathbb{E}[h_D(x)]^2 \\[1em] \end{align} $$

We can immediately see that the right-hand side of the equation is the expanded version of a quadratic term. In fact, this quadratic term turns out to be exactly the squared bias we are looking for. In other words, we can write:

$$\begin{align} \large \mathbb{E}\left[(f(x) - \mathbb{E}[h_D(x)])^2\right] \ &\large = f(x)^2 - 2f(x)\mathbb{E}[h_D(x)] + \mathbb{E}[h_D(x)]^2 \\[1em] &\large = \left( f(x) - \mathbb{E}[h_D(x)]\right)^2 \\[1em] &\large = Bias(h_D(x))^2 \\[1em] \end{align} $$

Thus, finally, the model error can now be written as:

$$\begin{align} \large Error\ &\large = Bias(h_D(x))^2 + Var\left(h_D(x)\right) + \sigma^2 \\[1em] \end{align} $$

where $\sigma^2$ is the so-called irreducible error of our model. This irreducible error refers to the portion of the total error that cannot be explained or reduced by any model, regardless of how complex or accurate it is. This error comes from inherent noise in the data — random fluctuations, measurement errors, missing variables, or factors not captured in the features used for prediction. It is essentially the natural variability in the real-world process being modeled. The irreducible error corresponds to the variance of the noise term $\epsilon$ in the data-generating process $Y = f(X) + \epsilon$, where $\epsilon$ has mean zero and non-zero variance. Even if the model perfectly captures $f(X)$, it cannot predict the random noise $\epsilon$, and thus the irreducible error sets a lower bound on how accurate any prediction can be.
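To close the loop, the sketch below verifies the decomposition numerically at a single test point: it estimates the total expected squared error directly and compares it to the sum $Bias^2 + Variance + \sigma^2$ computed from the same simulations. As before, the true function, the noise level, and the polynomial model are assumptions chosen for illustration, and the two numbers will only match approximately due to Monte Carlo noise.

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(2 * np.pi * x)         # assumed true function
sigma, n, N, degree = 0.3, 30, 2000, 3      # assumed noise, sample size, #datasets, model degree
x_new = 0.7

predictions, squared_errors = [], []
for _ in range(N):
    # Fresh training set D_i and fitted hypothesis h_{D_i}.
    x = rng.uniform(0, 1, size=n)
    y = f(x) + rng.normal(0, sigma, size=n)
    h_D = np.poly1d(np.polyfit(x, y, deg=degree))

    predictions.append(h_D(x_new))
    y_new = f(x_new) + rng.normal(0, sigma)
    squared_errors.append((y_new - h_D(x_new))**2)

predictions = np.array(predictions)
bias_sq = (predictions.mean() - f(x_new))**2    # squared bias at x'
variance = predictions.var()                    # model variance at x'

print("total error        :", np.mean(squared_errors))
print("bias^2 + var + s^2 :", bias_sq + variance + sigma**2)   # approximately equal
```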


Summary¶

The bias-variance decomposition is a fundamental framework in machine learning for understanding the sources of error in predictive models. It breaks down a model’s total expected error into three components: bias, variance, and irreducible error. This decomposition helps us understand how different aspects of a model’s behavior contribute to its overall accuracy, and it provides a useful guide for diagnosing and improving model performance.

Bias refers to the error introduced by approximating a complex real-world process with a simplified model. A high-bias model makes strong assumptions about the data and often underfits — it cannot capture the underlying patterns well, leading to systematically inaccurate predictions. Variance, on the other hand, measures how sensitive the model is to fluctuations in the training data. A high-variance model reacts too strongly to noise or small changes in the training set, often resulting in overfitting and poor generalization to new data.

The irreducible error accounts for the inherent noise in the data that no model can eliminate. This component sets a lower bound on the achievable error regardless of the model used. Together, these three components explain why a model may perform poorly: it might be too simple (high bias), too complex (high variance), or simply limited by the data itself (irreducible error).

Understanding the bias-variance tradeoff is crucial for model selection and tuning. Increasing model complexity typically reduces bias but increases variance, which in turn is a major cause of overfitting. Simplifying the model typically decreases its variance but increases its bias, potentially resulting in underfitting. The key challenge is to find the right balance between bias and variance to minimize total prediction error. This insight drives many decisions in machine learning, such as choosing the right model architecture, setting regularization parameters, or deciding how much training data is needed.

In practice, the bias-variance decomposition not only clarifies why a model performs the way it does but also helps guide strategies for improvement. For example, if a model suffers from high bias, one might use a more flexible model or include more relevant features. If variance is the problem, techniques like regularization, ensembling, or increasing the training set size can help. By explicitly recognizing and balancing these components, practitioners can make more informed and effective modeling decisions.
