# 5.1. Multilayer Perceptrons

In Section 4.1, we introduced softmax regression, implementing the algorithm from scratch (Section 4.4) and using high-level APIs (Section 4.5). This allowed us to train classifiers capable of recognizing 10 categories of clothing from low-resolution images. Along the way, we learned how to wrangle data, coerce our outputs into a valid probability distribution, apply an appropriate loss function, and minimize it with respect to our model’s parameters. Now that we have mastered these mechanics in the context of simple linear models, we can launch our exploration of deep neural networks, the comparatively rich class of models with which this book is primarily concerned.

```
%matplotlib inline
import torch
from d2l import torch as d2l
```

```
%matplotlib inline
from mxnet import autograd, np, npx
from d2l import mxnet as d2l
npx.set_np()
```

```
%matplotlib inline
import jax
from jax import grad
from jax import numpy as jnp
from jax import vmap
from d2l import jax as d2l
```

```
%matplotlib inline
import tensorflow as tf
from d2l import tensorflow as d2l
```

## 5.1.2. Activation Functions

An activation function decides whether a neuron should be activated or not, based on the weighted sum of its inputs plus a bias. Activation functions are differentiable operators that transform input signals to outputs, and most of them add nonlinearity. Because activation functions are fundamental to deep learning, let’s briefly survey some common ones.

### 5.1.2.1. ReLU Function

The most popular choice, due to both simplicity of implementation and
its good performance on a variety of predictive tasks, is the *rectified
linear unit* (*ReLU*) (Nair and Hinton, 2010). ReLU provides a very
simple nonlinear transformation. Given an element \(x\), the
function is defined as the maximum of that element and \(0\):

\[\operatorname{ReLU}(x) = \max(x, 0).\]

Informally, the ReLU function retains only positive elements and discards all negative elements by setting the corresponding activations to 0. To gain some intuition, we can plot the function. As you can see, the activation function is piecewise linear.

```
x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = torch.relu(x)
d2l.plot(x.detach(), y.detach(), 'x', 'relu(x)', figsize=(5, 2.5))
```

```
x = np.arange(-8.0, 8.0, 0.1)
x.attach_grad()
with autograd.record():
    y = npx.relu(x)
d2l.plot(x, y, 'x', 'relu(x)', figsize=(5, 2.5))
```

```
x = jnp.arange(-8.0, 8.0, 0.1)
y = jax.nn.relu(x)
d2l.plot(x, y, 'x', 'relu(x)', figsize=(5, 2.5))
```

```
x = tf.Variable(tf.range(-8.0, 8.0, 0.1), dtype=tf.float32)
y = tf.nn.relu(x)
d2l.plot(x.numpy(), y.numpy(), 'x', 'relu(x)', figsize=(5, 2.5))
```

When the input is negative, the derivative of the ReLU function is 0,
and when the input is positive, the derivative is 1. Note that the ReLU
function is not differentiable when the input is exactly 0. In that
case, we default to the left-hand-side derivative and say that the
derivative is 0 when the input is 0. We can get away with this because
the input may never actually be zero (mathematicians would say that the
function is nondifferentiable only on a set of measure zero). There is
an old adage that if subtle boundary conditions matter, we are probably
doing (*real*) mathematics, not engineering. That conventional wisdom
may apply here, or at least the fact that we are not performing
constrained optimization does (Mangasarian, 1965, Rockafellar, 1970).
We plot the derivative of the ReLU function below.

```
y.backward(torch.ones_like(x), retain_graph=True)
d2l.plot(x.detach(), x.grad, 'x', 'grad of relu', figsize=(5, 2.5))
```

```
y.backward()
d2l.plot(x, x.grad, 'x', 'grad of relu', figsize=(5, 2.5))
```

```
grad_relu = vmap(grad(jax.nn.relu))
d2l.plot(x, grad_relu(x), 'x', 'grad of relu', figsize=(5, 2.5))
```

```
with tf.GradientTape() as t:
    y = tf.nn.relu(x)
d2l.plot(x.numpy(), t.gradient(y, x).numpy(), 'x', 'grad of relu',
         figsize=(5, 2.5))
```
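The convention for the derivative at 0 described above can be written out explicitly. The following is a minimal sketch in plain Python rather than in any of the frameworks used in this section; the helper names `relu` and `relu_grad` are hypothetical and chosen only for illustration.

```python
def relu(x):
    """Return max(x, 0) for a scalar x."""
    return max(x, 0.0)


def relu_grad(x):
    """Derivative of ReLU for a scalar x.

    At x == 0 the function is not differentiable; following the text,
    we pick the left-hand-side derivative, which is 0.
    """
    return 1.0 if x > 0 else 0.0


xs = [-2.0, -0.5, 0.0, 0.5, 2.0]
relu_values = [relu(x) for x in xs]
relu_grads = [relu_grad(x) for x in xs]
print(relu_values)
print(relu_grads)
```

Only the single point `x == 0` requires a choice; everywhere else the derivative is unambiguous.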

ReLU is popular because its derivatives are particularly well behaved: either they vanish or they just let the argument through. This makes optimization better behaved and mitigates the well-documented problem of vanishing gradients that plagued previous versions of neural networks (more on this later).

Note that there are many variants of the ReLU function, including the
*parametrized ReLU* (*pReLU*) function (He *et al.*, 2015).
This variation adds a linear term to ReLU, so some information still
gets through, even when the argument is negative:

\[\operatorname{pReLU}(x) = \max(0, x) + \alpha \min(0, x).\]
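To make the effect of the linear term concrete, here is a small sketch in plain Python that evaluates pReLU as \(\max(0, x) + \alpha \min(0, x)\). The function name `prelu` and the fixed slope `alpha = 0.25` are assumptions made for illustration; in the cited work the slope is a learned parameter.

```python
def prelu(x, alpha):
    """pReLU for a scalar x: max(0, x) + alpha * min(0, x)."""
    return max(0.0, x) + alpha * min(0.0, x)


alpha = 0.25  # illustrative value only, not taken from the text
for x in [-2.0, -1.0, 0.0, 1.0, 3.0]:
    print(x, prelu(x, alpha))
```

Setting `alpha` to 0 recovers the plain ReLU, while any nonzero `alpha` lets a scaled copy of a negative input pass through.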

### 5.1.2.2. Sigmoid Function

The *sigmoid function* transforms inputs whose values lie in the
domain \(\mathbb{R}\) to outputs that lie on the interval (0, 1).
For that reason, the sigmoid is often called a *squashing function*: it
squashes any input in the range (-inf, inf) to some value in the range
(0, 1):

\[\operatorname{sigmoid}(x) = \frac{1}{1 + \exp(-x)}.\]

In the earliest neural networks, scientists were interested in modeling
biological neurons that either *fire* or *do not fire*. Thus the
pioneers of this field, going all the way back to McCulloch and Pitts,
the inventors of the artificial neuron, focused on thresholding units
(McCulloch and Pitts, 1943). A thresholding activation takes value 0
when its input is below some threshold and value 1 when the input
exceeds the threshold.

When attention shifted to gradient-based learning, the sigmoid function
was a natural choice because it is a smooth, differentiable
approximation to a thresholding unit. Sigmoids are still widely used as
activation functions on the output units when we want to interpret the
outputs as probabilities for binary classification problems: you can
think of the sigmoid as a special case of the softmax. However, the
sigmoid has largely been replaced by the simpler and more easily
trainable ReLU for most uses in hidden layers. Much of this has to do
with the fact that the sigmoid poses challenges for optimization
(LeCun *et al.*, 1998) since its gradient vanishes for large
positive *and* negative arguments. This can lead to plateaus that are
difficult to escape from. Nonetheless sigmoids are important. In later
chapters (e.g., Section 10.1) on recurrent neural networks, we
will describe architectures that leverage sigmoid units to control the
flow of information across time.
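The remark that the sigmoid is a special case of the softmax can be checked numerically: a two-class softmax over the logits \((x, 0)\) assigns the first class the probability \(\operatorname{sigmoid}(x)\). The sketch below uses plain Python, and the helper names `sigmoid` and `softmax` are hypothetical stand-ins rather than any framework's API.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)) for a scalar x."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Softmax over a list of scalars, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


for x in [-3.0, 0.0, 1.5]:
    p_softmax = softmax([x, 0.0])[0]  # probability of the first class
    p_sigmoid = sigmoid(x)
    print(x, p_softmax, p_sigmoid)
```

The two printed probabilities agree up to floating-point rounding, because \(e^x / (e^x + 1) = 1 / (1 + e^{-x})\).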

Below, we plot the sigmoid function. Note that when the input is close to 0, the sigmoid function approaches a linear transformation.

```
y = torch.sigmoid(x)
d2l.plot(x.detach(), y.detach(), 'x', 'sigmoid(x)', figsize=(5, 2.5))
```

```
with autograd.record():
    y = npx.sigmoid(x)
d2l.plot(x, y, 'x', 'sigmoid(x)', figsize=(5, 2.5))
```

```
y = jax.nn.sigmoid(x)
d2l.plot(x, y, 'x', 'sigmoid(x)', figsize=(5, 2.5))
```

```
y = tf.nn.sigmoid(x)
d2l.plot(x.numpy(), y.numpy(), 'x', 'sigmoid(x)', figsize=(5, 2.5))
```

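The derivative of the sigmoid function is given by the following equation:

\[\frac{d}{dx} \operatorname{sigmoid}(x) = \frac{\exp(-x)}{(1 + \exp(-x))^2} = \operatorname{sigmoid}(x)\left(1-\operatorname{sigmoid}(x)\right).\]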

The derivative of the sigmoid function is plotted below. Note that when the input is 0, the derivative of the sigmoid function reaches a maximum of 0.25. As the input diverges from 0 in either direction, the derivative approaches 0.

```
# Clear out previous gradients
x.grad.data.zero_()
y.backward(torch.ones_like(x), retain_graph=True)
d2l.plot(x.detach(), x.grad, 'x', 'grad of sigmoid', figsize=(5, 2.5))
```

```
y.backward()
d2l.plot(x, x.grad, 'x', 'grad of sigmoid', figsize=(5, 2.5))
```

```
grad_sigmoid = vmap(grad(jax.nn.sigmoid))
d2l.plot(x, grad_sigmoid(x), 'x', 'grad of sigmoid', figsize=(5, 2.5))
```

```
with tf.GradientTape() as t:
    y = tf.nn.sigmoid(x)
d2l.plot(x.numpy(), t.gradient(y, x).numpy(), 'x', 'grad of sigmoid',
         figsize=(5, 2.5))
```
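As a sanity check on the plots above, the closed form \(\operatorname{sigmoid}(x)(1 - \operatorname{sigmoid}(x))\) can be compared against a central finite difference. This is a minimal sketch in plain Python; the helper names and the step size `h` are assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)) for a scalar x."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Closed-form derivative: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def finite_difference(f, x, h=1e-5):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in [-8.0, -2.0, 0.0, 2.0, 8.0]:
    print(x, sigmoid_grad(x), finite_difference(sigmoid, x))
```

The derivative peaks at 0.25 when the input is 0 and is already very small at inputs of magnitude 8, which matches the behavior described above.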

### 5.1.2.3. Tanh Function

Like the sigmoid function, the tanh (hyperbolic tangent) function also squashes its inputs, transforming them into elements on the interval between \(-1\) and \(1\):

\[\operatorname{tanh}(x) = \frac{1 - \exp(-2x)}{1 + \exp(-2x)}.\]

We plot the tanh function below. Note that as input nears 0, the tanh function approaches a linear transformation. Although the shape of the function is similar to that of the sigmoid function, the tanh function exhibits point symmetry about the origin of the coordinate system (Kalman and Kwasny, 1992).

```
y = torch.tanh(x)
d2l.plot(x.detach(), y.detach(), 'x', 'tanh(x)', figsize=(5, 2.5))
```

```
with autograd.record():
    y = np.tanh(x)
d2l.plot(x, y, 'x', 'tanh(x)', figsize=(5, 2.5))
```

```
y = jax.nn.tanh(x)
d2l.plot(x, y, 'x', 'tanh(x)', figsize=(5, 2.5))
```

```
y = tf.nn.tanh(x)
d2l.plot(x.numpy(), y.numpy(), 'x', 'tanh(x)', figsize=(5, 2.5))
```

The derivative of the tanh function is:

\[\frac{d}{dx} \operatorname{tanh}(x) = 1 - \operatorname{tanh}^2(x).\]

It is plotted below. As the input nears 0, the derivative of the tanh function approaches a maximum of 1. And as we saw with the sigmoid function, as input moves away from 0 in either direction, the derivative of the tanh function approaches 0.

```
# Clear out previous gradients
x.grad.data.zero_()
y.backward(torch.ones_like(x), retain_graph=True)
d2l.plot(x.detach(), x.grad, 'x', 'grad of tanh', figsize=(5, 2.5))
```

```
y.backward()
d2l.plot(x, x.grad, 'x', 'grad of tanh', figsize=(5, 2.5))
```

```
grad_tanh = vmap(grad(jax.nn.tanh))
d2l.plot(x, grad_tanh(x), 'x', 'grad of tanh', figsize=(5, 2.5))
```

```
with tf.GradientTape() as t:
    y = tf.nn.tanh(x)
d2l.plot(x.numpy(), t.gradient(y, x).numpy(), 'x', 'grad of tanh',
         figsize=(5, 2.5))
```
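The same kind of check works for tanh, whose derivative is \(1 - \operatorname{tanh}^2(x)\). The sketch below uses `math.tanh` from the Python standard library; `tanh_grad` and `finite_difference` are hypothetical helper names, and the step size is an arbitrary illustrative choice.

```python
import math


def tanh_grad(x):
    """Closed-form derivative of tanh: 1 - tanh(x)**2."""
    return 1.0 - math.tanh(x) ** 2


def finite_difference(f, x, h=1e-5):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print(x, tanh_grad(x), finite_difference(math.tanh, x))
```

The derivative equals 1 at the origin, four times the sigmoid's maximum of 0.25, and decays toward 0 as the input moves away from 0 in either direction.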

## 5.1.3. Summary and Discussion

We now know how to incorporate nonlinearities to build expressive multilayer neural network architectures. As a side note, your knowledge already puts you in command of a similar toolkit to a practitioner circa 1990. In some ways, you have an advantage over anyone working back then, because you can leverage powerful open-source deep learning frameworks to build models rapidly, using only a few lines of code. Previously, training these networks required researchers to code up layers and derivatives explicitly in C, Fortran, or even Lisp (in the case of LeNet).

A secondary benefit is that ReLU is significantly more amenable to
optimization than the sigmoid or the tanh function. One could argue that
this was one of the key innovations that helped the resurgence of deep
learning over the past decade. Note, though, that research in activation
functions has not stopped. For instance, the GELU (Gaussian error linear
unit) activation function \(x \Phi(x)\) by
Hendrycks and Gimpel (2016) (\(\Phi(x)\) is the standard
Gaussian cumulative distribution function) and the Swish activation
function \(\sigma(x) = x \operatorname{sigmoid}(\beta x)\) as
proposed in Ramachandran *et al.* (2017) can yield better
accuracy in many cases.
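For readers who want to experiment with these two functions, the following sketch evaluates them in plain Python. It assumes the identity \(\Phi(x) = \tfrac{1}{2}\left(1 + \operatorname{erf}(x / \sqrt{2})\right)\) for the standard Gaussian cumulative distribution function; the function names and the default `beta = 1.0` are illustrative choices, not taken from the cited papers.

```python
import math


def std_normal_cdf(x):
    """Standard Gaussian CDF, Phi(x), expressed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu(x):
    """GELU activation: x * Phi(x)."""
    return x * std_normal_cdf(x)


def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    return x / (1.0 + math.exp(-beta * x))


for x in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(x, gelu(x), swish(x))
```

Both functions behave like the identity for large positive inputs and approach 0 for large negative inputs, but unlike ReLU they are smooth and take small negative values for moderately negative inputs.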

## 5.1.4. Exercises

1. Show that adding layers to a *linear* deep network, i.e., a network without nonlinearity \(\sigma\), can never increase the expressive power of the network. Give an example where it actively reduces it.
2. Compute the derivative of the pReLU activation function.
3. Compute the derivative of the Swish activation function \(x \operatorname{sigmoid}(\beta x)\).
4. Show that an MLP using only ReLU (or pReLU) constructs a continuous piecewise linear function.
5. Sigmoid and tanh are very similar.
    1. Show that \(\operatorname{tanh}(x) + 1 = 2 \operatorname{sigmoid}(2x)\).
    2. Prove that the function classes parametrized by both nonlinearities are identical. Hint: affine layers have bias terms, too.
6. Assume that we have a nonlinearity that applies to one minibatch at a time, such as the batch normalization (Ioffe and Szegedy, 2015). What kinds of problems do you expect this to cause?
7. Provide an example where the gradients vanish for the sigmoid activation function.