# 4.6. Dropout¶

In Section 4.5, we introduced the classical approach to regularizing statistical models by penalizing the \(L_2\) norm of the weights. In probabilistic terms, we could justify this technique by arguing that we have assumed a prior belief that weights take values from a Gaussian distribution with mean zero. More intuitively, we might argue that we encouraged the model to spread out its weights among many features rather than depending too much on a small number of potentially spurious associations.

## 4.6.1. Overfitting Revisited¶

Faced with more features than examples, linear models tend to overfit. But given more examples than features, we can generally count on linear models not to overfit. Unfortunately, the reliability with which linear models generalize comes at a cost. Naively applied, linear models do not take into account interactions among features. For every feature, a linear model must assign either a positive or a negative weight, ignoring context.

In traditional texts, this fundamental tension between generalizability
and flexibility is described as the *bias-variance tradeoff*. Linear
models have high bias: they can only represent a small class of
functions. However, these models have low variance: they give similar
results across different random samples of the data.

Deep neural networks inhabit the opposite end of the bias-variance spectrum. Unlike linear models, neural networks are not confined to looking at each feature individually. They can learn interactions among groups of features. For example, they might infer that “Nigeria” and “Western Union” appearing together in an email indicates spam but that separately they do not.

Even when we have far more examples than features, deep neural networks are capable of overfitting. In 2017, a group of researchers demonstrated the extreme flexibility of neural networks by training deep nets on randomly-labeled images. Despite the absence of any true pattern linking the inputs to the outputs, they found that the neural network optimized by stochastic gradient descent could label every image in the training set perfectly. Consider what this means. If the labels are assigned uniformly at random and there are 10 classes, then no classifier can do better than 10% accuracy on holdout data. The generalization gap here is a whopping 90%. If our models are so expressive that they can overfit this badly, then when should we expect them not to overfit?

The mathematical foundations for the puzzling generalization properties of deep networks remain open research questions, and we encourage the theoretically-oriented reader to dig deeper into the topic. For now, we turn to the investigation of practical tools that tend to empirically improve the generalization of deep nets.

## 4.6.2. Robustness through Perturbations¶

Let us think briefly about what we expect from a good predictive model. We want it to peform well on unseen data. Classical generalization theory suggests that to close the gap between train and test performance, we should aim for a simple model. Simplicity can come in the form of a small number of dimensions. We explored this when discussing the monomial basis functions of linear models in Section 4.4. Additionally, as we saw when discussing weight decay (\(L_2\) regularization) in Section 4.5, the (inverse) norm of the parameters also represents a useful measure of simplicity. Another useful notion of simplicity is smoothness, i.e., that the function should not be sensitive to small changes to its inputs. For instance, when we classify images, we would expect that adding some random noise to the pixels should be mostly harmless.

In 1995, Christopher Bishop formalized this idea when he proved that training with input noise is equivalent to Tikhonov regularization [Bishop, 1995]. This work drew a clear mathematical connection between the requirement that a function be smooth (and thus simple), and the requirement that it be resilient to perturbations in the input.

Then, in 2014, Srivastava et al. [Srivastava et al., 2014] developed a clever idea for how to apply Bishop’s idea to the internal layers of a network, too. Namely, they proposed to inject noise into each layer of the network before calculating the subsequent layer during training. They realized that when training a deep network with many layers, injecting noise enforces smoothness just on the input-output mapping.

Their idea, called *dropout*, involves injecting noise while computing
each internal layer during forward propagation, and it has become a
standard technique for training neural networks. The method is called
*dropout* because we literally *drop out* some neurons during training.
Throughout training, on each iteration, standard dropout consists of
zeroing out some fraction of the nodes in each layer before calculating
the subsequent layer.

To be clear, we are imposing our own narrative with the link to Bishop.
The original paper on dropout offers intuition through a surprising
analogy to sexual reproduction. The authors argue that neural network
overfitting is characterized by a state in which each layer relies on a
specifc pattern of activations in the previous layer, calling this
condition *co-adaptation*. Dropout, they claim, breaks up co-adaptation
just as sexual reproduction is argued to break up co-adapted genes.

The key challenge then is how to inject this noise. One idea is to
inject the noise in an *unbiased* manner so that the expected value of
each layer—while fixing the others—equals to the value it would have
taken absent noise.

In Bishop’s work, he added Gaussian noise to the inputs to a linear model. At each training iteration, he added noise sampled from a distribution with mean zero \(\epsilon \sim \mathcal{N}(0,\sigma^2)\) to the input \(\mathbf{x}\), yielding a perturbed point \(\mathbf{x}' = \mathbf{x} + \epsilon\). In expectation, \(E[\mathbf{x}'] = \mathbf{x}\).

In standard dropout regularization, one debiases each layer by
normalizing by the fraction of nodes that were retained (not dropped
out). In other words, with *dropout probability* \(p\), each
intermediate activation \(h\) is replaced by a random variable
\(h'\) as follows:

By design, the expectation remains unchanged, i.e., \(E[h'] = h\).

## 4.6.3. Dropout in Practice¶

Recall the MLP with a hidden layer and 5 hidden units in Fig. 4.1.1. When we apply dropout to a hidden layer, zeroing out each hidden unit with probability \(p\), the result can be viewed as a network containing only a subset of the original neurons. In Fig. 4.6.1, \(h_2\) and \(h_5\) are removed. Consequently, the calculation of the outputs no longer depends on \(h_2\) or \(h_5\) and their respective gradient also vanishes when performing backpropagation. In this way, the calculation of the output layer cannot be overly dependent on any one element of \(h_1, \ldots, h_5\).

Typically, we disable dropout at test time. Given a trained model and a
new example, we do not drop out any nodes and thus do not need to
normalize. However, there are some exceptions: some researchers use
dropout at test time as a heuristic for estimating the *uncertainty* of
neural network predictions: if the predictions agree across many
different dropout masks, then we might say that the network is more
confident.

## 4.6.4. Implementation from Scratch¶

To implement the dropout function for a single layer, we must draw as many samples from a Bernoulli (binary) random variable as our layer has dimensions, where the random variable takes value \(1\) (keep) with probability \(1-p\) and \(0\) (drop) with probability \(p\). One easy way to implement this is to first draw samples from the uniform distribution \(U[0, 1]\). Then we can keep those nodes for which the corresponding sample is greater than \(p\), dropping the rest.

In the following code, we implement a `dropout_layer`

function that
drops out the elements in the tensor input `X`

with probability
`dropout`

, rescaling the remainder as described above: dividing the
survivors by `1.0-dropout`

.

```
from d2l import mxnet as d2l
from mxnet import autograd, gluon, init, np, npx
from mxnet.gluon import nn
npx.set_np()
def dropout_layer(X, dropout):
assert 0 <= dropout <= 1
# In this case, all elements are dropped out
if dropout == 1:
return np.zeros_like(X)
# In this case, all elements are kept
if dropout == 0:
return X
mask = np.random.uniform(0, 1, X.shape) > dropout
return mask.astype(np.float32) * X / (1.0 - dropout)
```

```
from d2l import torch as d2l
import torch
from torch import nn
def dropout_layer(X, dropout):
assert 0 <= dropout <= 1
# In this case, all elements are dropped out
if dropout == 1:
return torch.zeros_like(X)
# In this case, all elements are kept
if dropout == 0:
return X
mask = (torch.Tensor(X.shape).uniform_(0, 1) > dropout).float()
return mask * X / (1.0 - dropout)
```

```
from d2l import tensorflow as d2l
import tensorflow as tf
def dropout_layer(X, dropout):
assert 0 <= dropout <= 1
# In this case, all elements are dropped out
if dropout == 1:
return tf.zeros_like(X)
# In this case, all elements are kept
if dropout == 0:
return X
mask = tf.random.uniform(
shape=tf.shape(X), minval=0, maxval=1) < 1 - dropout
return tf.cast(mask, dtype=tf.float32) * X / (1.0 - dropout)
```

We can test out the `dropout_layer`

function on a few examples. In the
following lines of code, we pass our input `X`

through the dropout
operation, with probabilities 0, 0.5, and 1, respectively.

```
X = np.arange(16).reshape(2, 8)
print(dropout_layer(X, 0))
print(dropout_layer(X, 0.5))
print(dropout_layer(X, 1))
```

```
[[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 8. 9. 10. 11. 12. 13. 14. 15.]]
[[ 0. 2. 4. 6. 8. 10. 12. 14.]
[ 0. 18. 20. 0. 0. 0. 28. 0.]]
[[0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0.]]
```

```
X= torch.arange(16, dtype = torch.float32).reshape((2, 8))
print(X)
print(dropout_layer(X, 0.))
print(dropout_layer(X, 0.5))
print(dropout_layer(X, 1.))
```

```
tensor([[ 0., 1., 2., 3., 4., 5., 6., 7.],
[ 8., 9., 10., 11., 12., 13., 14., 15.]])
tensor([[ 0., 1., 2., 3., 4., 5., 6., 7.],
[ 8., 9., 10., 11., 12., 13., 14., 15.]])
tensor([[ 0., 0., 4., 6., 0., 10., 0., 14.],
[16., 0., 20., 22., 24., 26., 28., 30.]])
tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.]])
```

```
X = tf.reshape(tf.range(16, dtype=tf.float32), (2, 8))
print(X)
print(dropout_layer(X, 0.))
print(dropout_layer(X, 0.5))
print(dropout_layer(X, 1.))
```

```
tf.Tensor(
[[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 8. 9. 10. 11. 12. 13. 14. 15.]], shape=(2, 8), dtype=float32)
tf.Tensor(
[[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 8. 9. 10. 11. 12. 13. 14. 15.]], shape=(2, 8), dtype=float32)
tf.Tensor(
[[ 0. 2. 4. 6. 0. 10. 0. 14.]
[16. 0. 20. 0. 24. 0. 28. 0.]], shape=(2, 8), dtype=float32)
tf.Tensor(
[[0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0.]], shape=(2, 8), dtype=float32)
```

### 4.6.4.1. Defining Model Parameters¶

Again, we work with the Fashion-MNIST dataset introduced in Section 3.5. We define an MLP with two hidden layers containing 256 units each.

```
num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256
W1 = np.random.normal(scale=0.01, size=(num_inputs, num_hiddens1))
b1 = np.zeros(num_hiddens1)
W2 = np.random.normal(scale=0.01, size=(num_hiddens1, num_hiddens2))
b2 = np.zeros(num_hiddens2)
W3 = np.random.normal(scale=0.01, size=(num_hiddens2, num_outputs))
b3 = np.zeros(num_outputs)
params = [W1, b1, W2, b2, W3, b3]
for param in params:
param.attach_grad()
```

```
num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256
```

```
num_outputs, num_hiddens1, num_hiddens2 = 10, 256, 256
```

### 4.6.4.2. Defining the Model¶

The model below applies dropout to the output of each hidden layer (following the activation function). We can set dropout probabilities for each layer separately. A common trend is to set a lower dropout probability closer to the input layer. Below we set it to 0.2 and 0.5 for the first and second hidden layers, respectively. We ensure that dropout is only active during training.

```
dropout1, dropout2 = 0.2, 0.5
def net(X):
X = X.reshape(-1, num_inputs)
H1 = npx.relu(np.dot(X, W1) + b1)
# Use dropout only when training the model
if autograd.is_training():
# Add a dropout layer after the first fully connected layer
H1 = dropout_layer(H1, dropout1)
H2 = npx.relu(np.dot(H1, W2) + b2)
if autograd.is_training():
# Add a dropout layer after the second fully connected layer
H2 = dropout_layer(H2, dropout2)
return np.dot(H2, W3) + b3
```

```
dropout1, dropout2 = 0.2, 0.5
class Net(nn.Module):
def __init__(self, num_inputs, num_outputs, num_hiddens1, num_hiddens2,
is_training = True):
super(Net, self).__init__()
self.num_inputs = num_inputs
self.training = is_training
self.lin1 = nn.Linear(num_inputs, num_hiddens1)
self.lin2 = nn.Linear(num_hiddens1, num_hiddens2)
self.lin3 = nn.Linear(num_hiddens2, num_outputs)
self.relu = nn.ReLU()
def forward(self, X):
H1 = self.relu(self.lin1(X.reshape((-1, self.num_inputs))))
# Use dropout only when training the model
if self.training == True:
# Add a dropout layer after the first fully connected layer
H1 = dropout_layer(H1, dropout1)
H2 = self.relu(self.lin2(H1))
if self.training == True:
# Add a dropout layer after the second fully connected layer
H2 = dropout_layer(H2, dropout2)
out = self.lin3(H2)
return out
net = Net(num_inputs, num_outputs, num_hiddens1, num_hiddens2)
```

```
dropout1, dropout2 = 0.2, 0.5
class Net(tf.keras.Model):
def __init__(self, num_outputs, num_hiddens1, num_hiddens2):
super().__init__()
self.input_layer = tf.keras.layers.Flatten()
self.hidden1 = tf.keras.layers.Dense(num_hiddens1, activation='relu')
self.hidden2 = tf.keras.layers.Dense(num_hiddens2, activation='relu')
self.output_layer = tf.keras.layers.Dense(num_outputs)
def call(self, inputs, training=None):
x = self.input_layer(inputs)
x = self.hidden1(x)
if training:
x = dropout_layer(x, dropout1)
x = self.hidden2(x)
if training:
x = dropout_layer(x, dropout2)
x = self.output_layer(x)
return x
net = Net(num_outputs, num_hiddens1, num_hiddens2)
```

### 4.6.4.3. Training and Testing¶

This is similar to the training and testing of MLPs described previously.

```
num_epochs, lr, batch_size = 10, 0.5, 256
loss = gluon.loss.SoftmaxCrossEntropyLoss()
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs,
lambda batch_size: d2l.sgd(params, lr, batch_size))
```

```
num_epochs, lr, batch_size = 10, 0.5, 256
loss = nn.CrossEntropyLoss()
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
trainer = torch.optim.SGD(net.parameters(), lr=lr)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
```

```
num_epochs, lr, batch_size = 10, 0.5, 256
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
trainer = tf.keras.optimizers.SGD(learning_rate=lr)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
```

## 4.6.5. Concise Implementation¶

With high-level APIs, all we need to do is add a `Dropout`

layer after
each fully-connected layer, passing in the dropout probability as the
only argument to its constructor. During training, the `Dropout`

layer
will randomly drop out outputs of the previous layer (or equivalently,
the inputs to the subsequent layer) according to the specified dropout
probability. When not in training mode, the `Dropout`

layer simply
passes the data through during testing.

```
net = nn.Sequential()
net.add(nn.Dense(256, activation="relu"),
# Add a dropout layer after the first fully connected layer
nn.Dropout(dropout1),
nn.Dense(256, activation="relu"),
# Add a dropout layer after the second fully connected layer
nn.Dropout(dropout2),
nn.Dense(10))
net.initialize(init.Normal(sigma=0.01))
```

```
net = nn.Sequential(nn.Flatten(),
nn.Linear(784, 256),
nn.ReLU(),
# Add a dropout layer after the first fully connected layer
nn.Dropout(dropout1),
nn.Linear(256, 256),
nn.ReLU(),
# Add a dropout layer after the second fully connected layer
nn.Dropout(dropout2),
nn.Linear(256, 10))
def init_weights(m):
if type(m) == nn.Linear:
torch.nn.init.normal_(m.weight, std=0.01)
net.apply(init_weights)
```

```
Sequential(
(0): Flatten()
(1): Linear(in_features=784, out_features=256, bias=True)
(2): ReLU()
(3): Dropout(p=0.2, inplace=False)
(4): Linear(in_features=256, out_features=256, bias=True)
(5): ReLU()
(6): Dropout(p=0.5, inplace=False)
(7): Linear(in_features=256, out_features=10, bias=True)
)
```

```
net = tf.keras.models.Sequential([
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(256, activation=tf.nn.relu),
# Add a dropout layer after the first fully connected layer
tf.keras.layers.Dropout(dropout1),
tf.keras.layers.Dense(256, activation=tf.nn.relu),
# Add a dropout layer after the second fully connected layer
tf.keras.layers.Dropout(dropout2),
tf.keras.layers.Dense(10),
])
```

Next, we train and test the model.

```
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
```

```
trainer = torch.optim.SGD(net.parameters(), lr=lr)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
```

```
trainer = tf.keras.optimizers.SGD(learning_rate=lr)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
```

## 4.6.6. Summary¶

Beyond controlling the number of dimensions and the size of the weight vector, dropout is yet another tool to avoid overfitting. Often they are used jointly.

Dropout replaces an activation \(h\) with a random variable with expected value \(h\).

Dropout is only used during training.

## 4.6.7. Exercises¶

What happens if you change the dropout probabilities for the first and second layers? In particular, what happens if you switch the ones for both layers? Design an experiment to answer these questions, describe your results quantitatively, and summarize the qualitative takeaways.

Increase the number of epochs and compare the results obtained when using dropout with those when not using it.

What is the variance of the activations in each hidden layer when dropout is and is not applied? Draw a plot to show how this quantity evolves over time for both models.

Why is dropout not typically used at test time?

Using the model in this section as an example, compare the effects of using dropout and weight decay. What happens when dropout and weight decay are used at the same time? Are the results additive? Are there diminished returns (or worse)? Do they cancel each other out?

What happens if we apply dropout to the individual weights of the weight matrix rather than the activations?

Invent another technique for injecting random noise at each layer that is different from the standard dropout technique. Can you develop a method that outperforms dropout on the Fashion-MNIST dataset (for a fixed architecture)?