# 4.4. Model Selection, Underfitting, and Overfitting¶

As machine learning scientists, our goal is to discover *patterns*. But
how can we be sure that we have truly discovered a *general* pattern and
not simply memorized our data? For example, imagine that we wanted to
hunt for patterns among genetic markers linking patients to their
dementia status, where the labels are drawn from the set
\(\{\text{dementia}, \text{mild cognitive impairment}, \text{healthy}\}\).
Because each person’s genes identify them uniquely (ignoring identical
siblings), it is possible to memorize the entire dataset.

We do not want our model to say *“That’s Bob! I remember him! He has
dementia!”* The reason why is simple. When we deploy the model in the
future, we will encounter patients that the model has never seen before.
Our predictions will only be useful if our model has truly discovered a
*general* pattern.

To recapitulate more formally, our goal is to discover patterns that
capture regularities in the underlying population from which our
training set was drawn. If we are successful in this endeavor, then we
could successfully assess risk even for individuals that we have never
encountered before. This problem—how to discover patterns that
*generalize*—is the fundamental problem of machine learning.

The danger is that when we train models, we access just a small sample of data. The largest public image datasets contain roughly one million images. More often, we must learn from only thousands or tens of thousands of data examples. In a large hospital system, we might access hundreds of thousands of medical records. When working with finite samples, we run the risk that we might discover apparent associations that turn out not to hold up when we collect more data.

The phenomenon of fitting our training data more closely than we fit the
underlying distribution is called *overfitting*, and the techniques used
to combat overfitting are called *regularization*. In the previous
sections, you might have observed this effect while experimenting with
the Fashion-MNIST dataset. If you altered the model structure or the
hyperparameters during the experiment, you might have noticed that with
enough neurons, layers, and training epochs, the model can eventually
reach perfect accuracy on the training set, even as the accuracy on test
data deteriorates.

## 4.4.1. Training Error and Generalization Error¶

In order to discuss this phenomenon more formally, we need to
differentiate between training error and generalization error. The
*training error* is the error of our model as calculated on the training
dataset, while *generalization error* is the expectation of our model’s
error were we to apply it to an infinite stream of additional data
examples drawn from the same underlying data distribution as our
original sample.

Problematically, we can never calculate the generalization error
exactly. That is because the stream of infinite data is an imaginary
object. In practice, we must *estimate* the generalization error by
applying our model to an independent test set constituted of a random
selection of data examples that were withheld from our training set.

The following three thought experiments will help illustrate this situation better. Consider a college student trying to prepare for his final exam. A diligent student will strive to practice well and test his abilities using exams from previous years. Nonetheless, doing well on past exams is no guarantee that he will excel when it matters. For instance, the student might try to prepare by rote learning the answers to the exam questions. This requires the student to memorize many things. She might even remember the answers for past exams perfectly. Another student might prepare by trying to understand the reasons for giving certain answers. In most cases, the latter student will do much better.

Likewise, consider a model that simply uses a lookup table to answer
questions. If the set of allowable inputs is discrete and reasonably
small, then perhaps after viewing *many* training examples, this
approach would perform well. Still this model has no ability to do
better than random guessing when faced with examples that it has never
seen before. In reality the input spaces are far too large to memorize
the answers corresponding to every conceivable input. For example,
consider the black and white \(28\times28\) images. If each pixel
can take one among \(256\) grayscale values, then there are
\(256^{784}\) possible images. That means that there are far more
low-resolution grayscale thumbnail-sized images than there are atoms in
the universe. Even if we could encounter such data, we could never
afford to store the lookup table.

Last, consider the problem of trying to classify the outcomes of coin
tosses (class 0: heads, class 1: tails) based on some contextual
features that might be available. Suppose that the coin is fair. No
matter what algorithm we come up with, the generalization error will
always be \(\frac{1}{2}\). However, for most algorithms, we should
expect our training error to be considerably lower, depending on the
luck of the draw, even if we did not have any features! Consider the
dataset {0, 1, 1, 1, 0, 1}. Our feature-less algorithm would have to
fall back on always predicting the *majority class*, which appears from
our limited sample to be *1*. In this case, the model that always
predicts class 1 will incur an error of \(\frac{1}{3}\),
considerably better than our generalization error. As we increase the
amount of data, the probability that the fraction of heads will deviate
significantly from \(\frac{1}{2}\) diminishes, and our training
error would come to match the generalization error.

### 4.4.1.1. Statistical Learning Theory¶

Since generalization is the fundamental problem in machine learning, you might not be surprised to learn that many mathematicians and theorists have dedicated their lives to developing formal theories to describe this phenomenon. In their eponymous theorem, Glivenko and Cantelli derived the rate at which the training error converges to the generalization error. In a series of seminal papers, Vapnik and Chervonenkis extended this theory to more general classes of functions. This work laid the foundations of statistical learning theory.

In the standard supervised learning setting, which we have addressed up
until now and will stick with throughout most of this book, we assume
that both the training data and the test data are drawn *independently*
from *identical* distributions. This is commonly called the *i.i.d.
assumption*, which means that the process that samples our data has no
memory. In other words, the second example drawn and the third drawn are
no more correlated than the second and the two-millionth sample drawn.

Being a good machine learning scientist requires thinking critically, and already you should be poking holes in this assumption, coming up with common cases where the assumption fails. What if we train a mortality risk predictor on data collected from patients at UCSF Medical Center, and apply it on patients at Massachusetts General Hospital? These distributions are simply not identical. Moreover, draws might be correlated in time. What if we are classifying the topics of Tweets? The news cycle would create temporal dependencies in the topics being discussed, violating any assumptions of independence.

Sometimes we can get away with minor violations of the i.i.d. assumption and our models will continue to work remarkably well. After all, nearly every real-world application involves at least some minor violation of the i.i.d. assumption, and yet we have many useful tools for various applications such as face recognition, speech recognition, and language translation.

Other violations are sure to cause trouble. Imagine, for example, if we try to train a face recognition system by training it exclusively on university students and then want to deploy it as a tool for monitoring geriatrics in a nursing home population. This is unlikely to work well since college students tend to look considerably different from the elderly.

In subsequent chapters, we will discuss problems arising from violations of the i.i.d. assumption. For now, even taking the i.i.d. assumption for granted, understanding generalization is a formidable problem. Moreover, elucidating the precise theoretical foundations that might explain why deep neural networks generalize as well as they do continues to vex the greatest minds in learning theory.

When we train our models, we attempt to search for a function that fits
the training data as well as possible. If the function is so flexible
that it can catch on to spurious patterns just as easily as to true
associations, then it might perform *too well* without producing a model
that generalizes well to unseen data. This is precisely what we want to
avoid or at least control. Many of the techniques in deep learning are
heuristics and tricks aimed at guarding against overfitting.

### 4.4.1.2. Model Complexity¶

When we have simple models and abundant data, we expect the
generalization error to resemble the training error. When we work with
more complex models and fewer examples, we expect the training error to
go down but the generalization gap to grow. What precisely constitutes
model complexity is a complex matter. Many factors govern whether a
model will generalize well. For example a model with more parameters
might be considered more complex. A model whose parameters can take a
wider range of values might be more complex. Often with neural networks,
we think of a model that takes more training iterations as more complex,
and one subject to *early stopping* (fewer training iterations) as less
complex.

It can be difficult to compare the complexity among members of
substantially different model classes (say, decision trees vs. neural
networks). For now, a simple rule of thumb is quite useful: a model that
can readily explain arbitrary facts is what statisticians view as
complex, whereas one that has only a limited expressive power but still
manages to explain the data well is probably closer to the truth. In
philosophy, this is closely related to Popper’s criterion of
falsifiability of a scientific theory: a theory is good if it fits data
and if there are specific tests that can be used to disprove it. This is
important since all statistical estimation is *post hoc*, i.e., we
estimate after we observe the facts, hence vulnerable to the associated
fallacy. For now, we will put the philosophy aside and stick to more
tangible issues.

In this section, to give you some intuition, we will focus on a few factors that tend to influence the generalizability of a model class:

The number of tunable parameters. When the number of tunable parameters, sometimes called the

*degrees of freedom*, is large, models tend to be more susceptible to overfitting.The values taken by the parameters. When weights can take a wider range of values, models can be more susceptible to overfitting.

The number of training examples. It is trivially easy to overfit a dataset containing only one or two examples even if your model is simple. But overfitting a dataset with millions of examples requires an extremely flexible model.

## 4.4.2. Model Selection¶

In machine learning, we usually select our final model after evaluating
several candidate models. This process is called *model selection*.
Sometimes the models subject to comparison are fundamentally different
in nature (say, decision trees vs. linear models). At other times, we
are comparing members of the same class of models that have been trained
with different hyperparameter settings.

With MLPs, for example, we may wish to compare models with different numbers of hidden layers, different numbers of hidden units, and various choices of the activation functions applied to each hidden layer. In order to determine the best among our candidate models, we will typically employ a validation dataset.

### 4.4.2.1. Validation Dataset¶

In principle we should not touch our test set until after we have chosen all our hyperparameters. Were we to use the test data in the model selection process, there is a risk that we might overfit the test data. Then we would be in serious trouble. If we overfit our training data, there is always the evaluation on test data to keep us honest. But if we overfit the test data, how would we ever know?

Thus, we should never rely on the test data for model selection. And yet we cannot rely solely on the training data for model selection either because we cannot estimate the generalization error on the very data that we use to train the model.

In practical applications, the picture gets muddier. While ideally we would only touch the test data once, to assess the very best model or to compare a small number of models to each other, real-world test data is seldom discarded after just one use. We can seldom afford a new test set for each round of experiments.

The common practice to address this problem is to split our data three
ways, incorporating a *validation dataset* (or *validation set*) in
addition to the training and test datasets. The result is a murky
practice where the boundaries between validation and test data are
worryingly ambiguous. Unless explicitly stated otherwise, in the
experiments in this book we are really working with what should rightly
be called training data and validation data, with no true test sets.
Therefore, the accuracy reported in each experiment of the book is
really the validation accuracy and not a true test set accuracy.

### 4.4.2.2. \(K\)-Fold Cross-Validation¶

When training data is scarce, we might not even be able to afford to
hold out enough data to constitute a proper validation set. One popular
solution to this problem is to employ \(K\)*-fold
cross-validation*. Here, the original training data is split into
\(K\) non-overlapping subsets. Then model training and validation
are executed \(K\) times, each time training on \(K-1\) subsets
and validating on a different subset (the one not used for training in
that round). Finally, the training and validation errors are estimated
by averaging over the results from the \(K\) experiments.

## 4.4.3. Underfitting or Overfitting?¶

When we compare the training and validation errors, we want to be
mindful of two common situations. First, we want to watch out for cases
when our training error and validation error are both substantial but
there is a little gap between them. If the model is unable to reduce the
training error, that could mean that our model is too simple (i.e.,
insufficiently expressive) to capture the pattern that we are trying to
model. Moreover, since the *generalization gap* between our training and
validation errors is small, we have reason to believe that we could get
away with a more complex model. This phenomenon is known as
*underfitting*.

On the other hand, as we discussed above, we want to watch out for the
cases when our training error is significantly lower than our validation
error, indicating severe *overfitting*. Note that overfitting is not
always a bad thing. With deep learning especially, it is well known that
the best predictive models often perform far better on training data
than on holdout data. Ultimately, we usually care more about the
validation error than about the gap between the training and validation
errors.

Whether we overfit or underfit can depend both on the complexity of our model and the size of the available training datasets, two topics that we discuss below.

### 4.4.3.1. Model Complexity¶

To illustrate some classical intuition about overfitting and model complexity, we give an example using polynomials. Given training data consisting of a single feature \(x\) and a corresponding real-valued label \(y\), we try to find the polynomial of degree \(d\)

to estimate the labels \(y\). This is just a linear regression problem where our features are given by the powers of \(x\), the model’s weights are given by \(w_i\), and the bias is given by \(w_0\) since \(x^0 = 1\) for all \(x\). Since this is just a linear regression problem, we can use the squared error as our loss function.

A higher-order polynomial function is more complex than a lower-order polynomial function, since the higher-order polynomial has more parameters and the model function’s selection range is wider. Fixing the training dataset, higher-order polynomial functions should always achieve lower (at worst, equal) training error relative to lower degree polynomials. In fact, whenever the data examples each have a distinct value of \(x\), a polynomial function with degree equal to the number of data examples can fit the training set perfectly. We visualize the relationship between polynomial degree and underfitting vs. overfitting in Fig. 4.4.1.

### 4.4.3.2. Dataset Size¶

The other big consideration to bear in mind is the dataset size. Fixing our model, the fewer samples we have in the training dataset, the more likely (and more severely) we are to encounter overfitting. As we increase the amount of training data, the generalization error typically decreases. Moreover, in general, more data never hurt. For a fixed task and data distribution, there is typically a relationship between model complexity and dataset size. Given more data, we might profitably attempt to fit a more complex model. Absent sufficient data, simpler models may be more difficult to beat. For many tasks, deep learning only outperforms linear models when many thousands of training examples are available. In part, the current success of deep learning owes to the current abundance of massive datasets due to Internet companies, cheap storage, connected devices, and the broad digitization of the economy.

## 4.4.4. Polynomial Regression¶

We can now explore these concepts interactively by fitting polynomials to data.

```
from d2l import mxnet as d2l
from mxnet import gluon, np, npx
from mxnet.gluon import nn
import math
npx.set_np()
```

```
from d2l import torch as d2l
import torch
from torch import nn
import numpy as np
import math
```

```
from d2l import tensorflow as d2l
import tensorflow as tf
import numpy as np
import math
```

### 4.4.4.1. Generating the Dataset¶

First we need data. Given \(x\), we will use the following cubic polynomial to generate the labels on training and test data:

The noise term \(\epsilon\) obeys a normal distribution with a mean
of 0 and a standard deviation of 0.1. For optimization, we typically
want to avoid very large values of gradients or losses. This is why the
*features* are rescaled from \(x^i\) to \(\frac{x^i}{i!}\). It
allows us to avoid very large values for large exponents \(i\). We
will synthesize 100 samples each for the training set and test set.

```
max_degree = 20 # Maximum degree of the polynomial
n_train, n_test = 100, 100 # Training and test dataset sizes
true_w = np.zeros(max_degree) # Allocate lots of empty space
true_w[0:4] = np.array([5, 1.2, -3.4, 5.6])
features = np.random.normal(size=(n_train + n_test, 1))
np.random.shuffle(features)
poly_features = np.power(features, np.arange(max_degree).reshape(1, -1))
for i in range(max_degree):
poly_features[:, i] /= math.gamma(i + 1) # `gamma(n)` = (n-1)!
# Shape of `labels`: (`n_train` + `n_test`,)
labels = np.dot(poly_features, true_w)
labels += np.random.normal(scale=0.1, size=labels.shape)
```

Again, monomials stored in `poly_features`

are rescaled by the gamma
function, where \(\Gamma(n)=(n-1)!\). Take a look at the first 2
samples from the generated dataset. The value 1 is technically a
feature, namely the constant feature corresponding to the bias.

```
features[:2], poly_features[:2, :], labels[:2]
```

```
(array([[-0.03716067],
[-1.1468065 ]]),
array([[ 1.0000000e+00, -3.7160669e-02, 6.9045764e-04, -8.5526226e-06,
7.9455290e-08, -5.9052235e-10, 3.6573678e-12, -1.9415747e-14,
9.0187767e-17, -3.7238198e-19, 1.3837962e-21, -4.6747992e-24,
1.4476556e-26, -4.1381425e-29, 1.0984010e-31, -2.7211542e-34,
6.3199942e-37, -1.3815009e-39, 2.8516424e-42, -5.6051939e-45],
[ 1.0000000e+00, -1.1468065e+00, 6.5758252e-01, -2.5137332e-01,
7.2069131e-02, -1.6529869e-02, 3.1594271e-03, -5.1760738e-04,
7.4199430e-05, -9.4547095e-06, 1.0842722e-06, -1.1304095e-07,
1.0803007e-08, -9.5299690e-10, 7.8064499e-11, -5.9683248e-12,
4.2778208e-13, -2.8857840e-14, 1.8385756e-15, -1.1097316e-16]]),
array([ 5.1432443 , -0.06415121]))
```

```
# Convert from NumPy ndarrays to tensors
true_w, features, poly_features, labels = [torch.tensor(x, dtype=
torch.float32) for x in [true_w, features, poly_features, labels]]
features[:2], poly_features[:2, :], labels[:2]
```

```
(tensor([[1.6111],
[0.2082]]),
tensor([[1.0000e+00, 1.6111e+00, 1.2979e+00, 6.9701e-01, 2.8074e-01, 9.0462e-02,
2.4291e-02, 5.5909e-03, 1.1259e-03, 2.0156e-04, 3.2474e-05, 4.7563e-06,
6.3859e-07, 7.9142e-08, 9.1077e-09, 9.7825e-10, 9.8505e-11, 9.3356e-12,
8.3560e-13, 7.0856e-14],
[1.0000e+00, 2.0817e-01, 2.1668e-02, 1.5036e-03, 7.8252e-05, 3.2580e-06,
1.1304e-07, 3.3617e-09, 8.7477e-11, 2.0234e-12, 4.2122e-14, 7.9715e-16,
1.3829e-17, 2.2145e-19, 3.2928e-21, 4.5699e-23, 5.9458e-25, 7.2809e-27,
8.4206e-29, 9.2260e-31]]),
tensor([6.4902, 5.2426]))
```

```
# Convert from NumPy ndarrays to tensors
true_w, features, poly_features, labels = [tf.constant(x, dtype=
tf.float32) for x in [true_w, features, poly_features, labels]]
features[:2], poly_features[:2, :], labels[:2]
```

```
(<tf.Tensor: shape=(2, 1), dtype=float32, numpy=
array([[0.4959065],
[0.4676605]], dtype=float32)>,
<tf.Tensor: shape=(2, 20), dtype=float32, numpy=
array([[1.0000000e+00, 4.9590650e-01, 1.2296163e-01, 2.0325825e-02,
2.5199272e-03, 2.4992967e-04, 2.0656957e-05, 1.4634170e-06,
9.0714757e-08, 4.9984488e-09, 2.4787630e-10, 1.1174861e-11,
4.6180724e-13, 1.7616400e-14, 6.2400626e-16, 2.0629918e-17,
6.3940689e-19, 1.8652120e-20, 5.1387263e-22, 1.3412253e-23],
[1.0000000e+00, 4.6766049e-01, 1.0935316e-01, 1.7046716e-02,
1.9930189e-03, 1.8641124e-04, 1.4529528e-05, 9.7069801e-07,
5.6744636e-08, 2.9485803e-09, 1.3789345e-10, 5.8624832e-12,
2.2847097e-13, 8.2189885e-15, 2.7454972e-16, 8.5597361e-18,
2.5019065e-19, 6.8826048e-21, 1.7881789e-22, 4.4013717e-24]],
dtype=float32)>,
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([5.3745303, 5.319692 ], dtype=float32)>)
```

### 4.4.4.2. Training and Testing the Model¶

Let us first implement a function to evaluate the loss on a given dataset.

```
def evaluate_loss(net, data_iter, loss): #@save
"""Evaluate the loss of a model on the given dataset."""
metric = d2l.Accumulator(2) # Sum of losses, no. of examples
for X, y in data_iter:
l = loss(net(X), y)
metric.add(d2l.reduce_sum(l), l.size)
return metric[0] / metric[1]
```

```
def evaluate_loss(net, data_iter, loss): #@save
"""Evaluate the loss of a model on the given dataset."""
metric = d2l.Accumulator(2) # Sum of losses, no. of examples
for X, y in data_iter:
out = net(X)
y = y.reshape(out.shape)
l = loss(out, y)
metric.add(d2l.reduce_sum(l), d2l.size(l))
return metric[0] / metric[1]
```

```
def evaluate_loss(net, data_iter, loss): #@save
"""Evaluate the loss of a model on the given dataset."""
metric = d2l.Accumulator(2) # Sum of losses, no. of examples
for X, y in data_iter:
l = loss(net(X), y)
metric.add(tf.reduce_sum(l), tf.size(l).numpy())
return metric[0] / metric[1]
```

Now define the training function.

```
def train(train_features, test_features, train_labels, test_labels,
num_epochs=400):
loss = gluon.loss.L2Loss()
net = nn.Sequential()
# Switch off the bias since we already catered for it in the polynomial
# features
net.add(nn.Dense(1, use_bias=False))
net.initialize()
batch_size = min(10, train_labels.shape[0])
train_iter = d2l.load_array((train_features, train_labels), batch_size)
test_iter = d2l.load_array((test_features, test_labels), batch_size,
is_train=False)
trainer = gluon.Trainer(net.collect_params(), 'sgd',
{'learning_rate': 0.01})
animator = d2l.Animator(xlabel='epoch', ylabel='loss', yscale='log',
xlim=[1, num_epochs], ylim=[1e-3, 1e2],
legend=['train', 'test'])
for epoch in range(num_epochs):
d2l.train_epoch_ch3(net, train_iter, loss, trainer)
if epoch == 0 or (epoch + 1) % 20 == 0:
animator.add(epoch + 1, (evaluate_loss(net, train_iter, loss),
evaluate_loss(net, test_iter, loss)))
print('weight:', net[0].weight.data().asnumpy())
```

```
def train(train_features, test_features, train_labels, test_labels,
num_epochs=400):
loss = nn.MSELoss()
input_shape = train_features.shape[-1]
# Switch off the bias since we already catered for it in the polynomial
# features
net = nn.Sequential(nn.Linear(input_shape, 1, bias=False))
batch_size = min(10, train_labels.shape[0])
train_iter = d2l.load_array((train_features, train_labels.reshape(-1,1)),
batch_size)
test_iter = d2l.load_array((test_features, test_labels.reshape(-1,1)),
batch_size, is_train=False)
trainer = torch.optim.SGD(net.parameters(), lr=0.01)
animator = d2l.Animator(xlabel='epoch', ylabel='loss', yscale='log',
xlim=[1, num_epochs], ylim=[1e-3, 1e2],
legend=['train', 'test'])
for epoch in range(num_epochs):
d2l.train_epoch_ch3(net, train_iter, loss, trainer)
if epoch == 0 or (epoch + 1) % 20 == 0:
animator.add(epoch + 1, (evaluate_loss(net, train_iter, loss),
evaluate_loss(net, test_iter, loss)))
print('weight:', net[0].weight.data.numpy())
```

```
def train(train_features, test_features, train_labels, test_labels,
num_epochs=400):
loss = tf.losses.MeanSquaredError()
input_shape = train_features.shape[-1]
# Switch off the bias since we already catered for it in the polynomial
# features
net = tf.keras.Sequential()
net.add(tf.keras.layers.Dense(1, use_bias=False))
batch_size = min(10, train_labels.shape[0])
train_iter = d2l.load_array((train_features, train_labels), batch_size)
test_iter = d2l.load_array((test_features, test_labels), batch_size,
is_train=False)
trainer = tf.keras.optimizers.SGD(learning_rate=.01)
animator = d2l.Animator(xlabel='epoch', ylabel='loss', yscale='log',
xlim=[1, num_epochs], ylim=[1e-3, 1e2],
legend=['train', 'test'])
for epoch in range(num_epochs):
d2l.train_epoch_ch3(net, train_iter, loss, trainer)
if epoch == 0 or (epoch + 1) % 20 == 0:
animator.add(epoch + 1, (evaluate_loss(net, train_iter, loss),
evaluate_loss(net, test_iter, loss)))
print('weight:', net.get_weights()[0].T)
```

### 4.4.4.3. Third-Order Polynomial Function Fitting (Normal)¶

We will begin by first using a third-order polynomial function, which is the same order as that of the data generation function. The results show that this model’s training and test losses can be both effectively reduced. The learned model parameters are also close to the true values \(w = [5, 1.2, -3.4, 5.6]\).

```
# Pick the first four dimensions, i.e., 1, x, x^2/2!, x^3/3! from the
# polynomial features
train(poly_features[:n_train, :4], poly_features[n_train:, :4],
labels[:n_train], labels[n_train:])
```

```
weight: [[ 5.019086 1.2221123 -3.4236915 5.5718546]]
```

```
# Pick the first four dimensions, i.e., 1, x, x^2/2!, x^3/3! from the
# polynomial features
train(poly_features[:n_train, :4], poly_features[n_train:, :4],
labels[:n_train], labels[n_train:])
```

```
weight: [[ 5.01016 1.2114989 -3.4048996 5.5852485]]
```

```
# Pick the first four dimensions, i.e., 1, x, x^2/2!, x^3/3! from the
# polynomial features
train(poly_features[:n_train, :4], poly_features[n_train:, :4],
labels[:n_train], labels[n_train:])
```

```
weight: [[ 4.990935 1.2467119 -3.3708434 5.5036616]]
```

### 4.4.4.4. Linear Function Fitting (Underfitting)¶

Let us take another look at linear function fitting. After the decline in early epochs, it becomes difficult to further decrease this model’s training loss. After the last epoch iteration has been completed, the training loss is still high. When used to fit nonlinear patterns (like the third-order polynomial function here) linear models are liable to underfit.

```
# Pick the first two dimensions, i.e., 1, x, from the polynomial features
train(poly_features[:n_train, :2], poly_features[n_train:, :2],
labels[:n_train], labels[n_train:])
```

```
weight: [[2.6998408 4.2314367]]
```

```
# Pick the first two dimensions, i.e., 1, x, from the polynomial features
train(poly_features[:n_train, :2], poly_features[n_train:, :2],
labels[:n_train], labels[n_train:])
```

```
weight: [[3.4219165 3.4815433]]
```

```
# Pick the first two dimensions, i.e., 1, x, from the polynomial features
train(poly_features[:n_train, :2], poly_features[n_train:, :2],
labels[:n_train], labels[n_train:])
```

```
weight: [[3.3597612 3.740147 ]]
```

### 4.4.4.5. Higher-Order Polynomial Function Fitting (Overfitting)¶

Now let us try to train the model using a polynomial of too high degree. Here, there are insufficient data to learn that the higher-degree coefficients should have values close to zero. As a result, our overly-complex model is so susceptible that it is being influenced by noise in the training data. Though the training loss can be effectively reduced, the test loss is still much higher. It shows that the complex model overfits the data.

```
# Pick all the dimensions from the polynomial features
train(poly_features[:n_train, :], poly_features[n_train:, :],
labels[:n_train], labels[n_train:], num_epochs=1500)
```

```
weight: [[ 4.9921837 1.3060793 -3.353223 5.116618 -0.11134586 1.3031868
0.12675649 0.16656096 0.05128633 -0.02274772 0.00806038 -0.05167779
-0.02426315 -0.01502201 -0.04941354 0.06389865 -0.04761846 -0.04380166
-0.05188227 0.05655775]]
```

```
# Pick all the dimensions from the polynomial features
train(poly_features[:n_train, :], poly_features[n_train:, :],
labels[:n_train], labels[n_train:], num_epochs=1500)
```

```
weight: [[ 4.99062634e+00 1.26753187e+00 -3.27904248e+00 5.23687983e+00
-4.33208913e-01 1.21556771e+00 4.95712869e-02 3.04271013e-01
1.78179458e-01 2.04654500e-01 6.97572678e-02 1.47447363e-01
1.76797602e-02 1.82150811e-01 -3.93367140e-03 -1.05195366e-01
1.93066925e-01 3.37832025e-03 2.50623245e-02 -1.97445840e-01]]
```

```
# Pick all the dimensions from the polynomial features
train(poly_features[:n_train, :], poly_features[n_train:, :],
labels[:n_train], labels[n_train:], num_epochs=1500)
```

```
weight: [[ 4.967567 1.2645578 -3.2570417 5.281634 -0.2793624 0.95744616
-0.14461283 -0.17835243 0.2666632 0.02711521 -0.37940085 0.34650052
-0.16412416 -0.06393074 -0.14317942 0.3542745 0.50140303 0.19751173
0.4851883 -0.25642896]]
```

In the subsequent sections, we will continue to discuss overfitting problems and methods for dealing with them, such as weight decay and dropout.

## 4.4.5. Summary¶

Since the generalization error cannot be estimated based on the training error, simply minimizing the training error will not necessarily mean a reduction in the generalization error. Machine learning models need to be careful to safeguard against overfitting so as to minimize the generalization error.

A validation set can be used for model selection, provided that it is not used too liberally.

Underfitting means that a model is not able to reduce the training error. When training error is much lower than validation error, there is overfitting.

We should choose an appropriately complex model and avoid using insufficient training samples.

## 4.4.6. Exercises¶

Can you solve the polynomial regression problem exactly? Hint: use linear algebra.

Consider model selection for polynomials:

Plot the training loss vs. model complexity (degree of the polynomial). What do you observe? What degree of polynomial do you need to reduce the training loss to 0?

Plot the test loss in this case.

Generate the same plot as a function of the amount of data.

What happens if you drop the normalization (\(1/i!\)) of the polynomial features \(x^i\)? Can you fix this in some other way?

Can you ever expect to see zero generalization error?