# 4.1. Multilayer Perceptrons

In the previous chapter, we introduced softmax regression
(Section 3.4), implementing the algorithm from scratch
(Section 3.6) and in Gluon (Section 3.7), and training classifiers
to recognize 10 categories of clothing from low-resolution images.
Along the way, we learned how to wrangle data, coerce our outputs
into a valid probability distribution (via `softmax`), apply an
appropriate loss function, and minimize it with respect to our
model’s parameters. Now that we have mastered these mechanics in the
context of simple linear models, we can launch our exploration of
deep neural networks, the comparatively rich class of models with
which this book is primarily concerned.

## 4.1.2. Activation Functions

Because they are so fundamental to deep learning, let us briefly survey some common activation functions.

### 4.1.2.1. ReLU Function

As stated above, the most popular choice, due to both its simplicity of implementation and its performance on a variety of predictive tasks, is the rectified linear unit (ReLU). ReLUs provide a very simple nonlinear transformation. Given the element \(z\), the function is defined as the maximum of that element and 0:

\[\mathrm{ReLU}(z) = \max(z, 0).\]

Informally, the ReLU function retains only positive elements and
discards all negative elements (setting the corresponding activations to
0). To gain some intuition, we can plot the function. Because it is used
so commonly, `ndarray` supports the `relu` function as a native
operator. As you can see, the activation function is piecewise linear.

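```
from d2l import mxnet as d2l  # book helper package; import path may vary by d2l version
from mxnet import autograd, np, npx

npx.set_np()

x = np.arange(-8.0, 8.0, 0.1)
x.attach_grad()  # allocate storage for the gradient with respect to x
with autograd.record():
    y = npx.relu(x)
d2l.set_figsize((4, 2.5))
d2l.plot(x, y, 'x', 'relu(x)')
```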

When the input is negative, the derivative of the ReLU function is 0,
and when the input is positive, the derivative of the ReLU function is
1. Note that the ReLU function is not differentiable when the input
takes value precisely equal to 0. In these cases, we default to the
left-hand-side (LHS) derivative and say that the derivative is 0 when
the input is 0. We can get away with this because the input may never
actually be zero. There is an old adage that if subtle boundary
conditions matter, we are probably doing (*real*) mathematics, not
engineering. That conventional wisdom may apply here. We plot the
derivative of the ReLU function below.

```
y.backward()
d2l.plot(x, x.grad, 'x', 'grad of relu')
```
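The convention for the derivative at 0 can also be written out by hand. The following is a minimal sketch in plain Python, not the MXNet implementation; the helper names `relu` and `relu_grad` are illustrative and do not belong to any library.

```python
def relu(z):
    """Return max(z, 0) for a scalar z."""
    return max(z, 0.0)


def relu_grad(z):
    """Derivative of ReLU, using the left-hand derivative (0) at z == 0."""
    return 1.0 if z > 0 else 0.0


# Evaluate the function and its derivative on either side of 0 and at 0.
for z in (-2.0, 0.0, 3.0):
    print(z, relu(z), relu_grad(z))
```

Note that the strict inequality in `relu_grad` is what encodes the choice of the left-hand derivative at 0.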

Note that there are many variants of the ReLU function, including the parameterized ReLU (pReLU) of He et al., 2015. This variation adds a linear term to the ReLU, so some information still gets through, even when the argument is negative.
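The pReLU is commonly written as \(\mathrm{pReLU}(x) = \max(0, x) + \alpha \min(0, x)\), where \(\alpha\) is a learned slope for negative inputs. The sketch below, in plain Python, shows the effect of the linear term; the function name `prelu` and the default value of `alpha` are assumptions made for illustration only.

```python
def prelu(x, alpha=0.25):
    """Parameterized ReLU: max(0, x) + alpha * min(0, x).

    In practice alpha is a learnable parameter; 0.25 is only an
    illustrative value chosen for this sketch.
    """
    return max(0.0, x) + alpha * min(0.0, x)


print(prelu(2.0))   # positive inputs pass through unchanged
print(prelu(-2.0))  # negative inputs are scaled by alpha
```

Setting `alpha=0.0` recovers the ordinary ReLU.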

The reason for using the ReLU is that its derivatives are particularly
well behaved: either they vanish or they just let the argument through.
This makes optimization better behaved and mitigates the well-documented
problem of *vanishing gradients* that plagued previous versions of
neural networks (more on this later).
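To get a rough feel for why this matters, recall that backpropagation multiplies one activation derivative per layer. The toy calculation below is a deliberately simplified sketch: it ignores the weight matrices entirely and only compares the largest possible sigmoid derivative (0.25, as discussed in the sigmoid section) with the ReLU derivative on its active side (1). The variable names are illustrative.

```python
# Largest value the sigmoid derivative can take, versus the ReLU derivative
# for a positive input. Weights are ignored in this toy calculation.
sigmoid_max_grad = 0.25
relu_active_grad = 1.0

for depth in (1, 5, 10, 20):
    sigmoid_factor = sigmoid_max_grad ** depth  # shrinks geometrically
    relu_factor = relu_active_grad ** depth     # stays at 1
    print(depth, sigmoid_factor, relu_factor)
```

Even in this best case for the sigmoid, the accumulated factor shrinks geometrically with depth, whereas the factor along an active ReLU path does not.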

### 4.1.2.2. Sigmoid Function

The sigmoid function transforms its inputs, which take values in the
domain \(\mathbb{R}\), to outputs that lie in the interval \((0, 1)\).
For that reason, the sigmoid is often called a *squashing* function: it
*squashes* any input in the range \((-\infty, \infty)\) to some value in
the range \((0, 1)\):

\[\mathrm{sigmoid}(x) = \frac{1}{1 + \exp(-x)}.\]

In the earliest neural networks, scientists were interested in modeling
biological neurons which either *fire* or *do not fire*. Thus the
pioneers of this field, going all the way back to McCulloch and Pitts,
the inventors of the artificial neuron, focused on thresholding units. A
thresholding activation takes value \(0\) when its input is below
some threshold and value \(1\) when the input exceeds the threshold.

When attention shifted to gradient-based learning, the sigmoid function was a natural choice because it is a smooth, differentiable approximation to a thresholding unit. Sigmoids are still widely used as activation functions on the output units, when we want to interpret the outputs as probabilities for binary classification problems (you can think of the sigmoid as a special case of the softmax). However, the sigmoid has mostly been replaced by the simpler and more easily trainable ReLU for most uses in hidden layers. In the “Recurrent Neural Network” chapter (Section 8.4), we will describe architectures that leverage sigmoid units to control the flow of information across time.
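The remark that the sigmoid is a special case of the softmax can be checked numerically: a softmax over the two logits \((x, 0)\) assigns the first class probability \(\exp(x) / (\exp(x) + 1) = 1 / (1 + \exp(-x))\). The sketch below uses plain Python; the helper functions `sigmoid` and `softmax` are written here for illustration and are not the library operators.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    # Subtract the maximum logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# The first softmax probability over (x, 0) should match sigmoid(x).
for x in (-3.0, 0.0, 1.5):
    p = softmax([x, 0.0])[0]
    print(x, p, sigmoid(x))
```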

Below, we plot the sigmoid function. Note that when the input is close to 0, the sigmoid function approaches a linear transformation.

```
with autograd.record():
    y = npx.sigmoid(x)
d2l.plot(x, y, 'x', 'sigmoid(x)')
```

The derivative of the sigmoid function is given by the following equation:

\[\frac{d}{dx} \mathrm{sigmoid}(x) = \frac{\exp(-x)}{(1 + \exp(-x))^2} = \mathrm{sigmoid}(x)\left(1 - \mathrm{sigmoid}(x)\right).\]

The derivative of the sigmoid function is plotted below. Note that when the input is 0, the derivative of the sigmoid function reaches a maximum of 0.25. As the input diverges from 0 in either direction, the derivative approaches 0.

```
y.backward()
d2l.plot(x, x.grad, 'x', 'grad of sigmoid')
```
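As a sanity check that does not rely on automatic differentiation, the closed form \(\mathrm{sigmoid}(x)(1 - \mathrm{sigmoid}(x))\) can be compared against a central finite difference. This is a minimal sketch in plain Python; the helper names and the step size `h` are choices made for this illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Closed-form derivative: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def finite_difference(f, x, h=1e-5):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


# The two columns should agree closely; the derivative peaks at x = 0.
for x in (-4.0, 0.0, 4.0):
    print(x, sigmoid_grad(x), finite_difference(sigmoid, x))
```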

### 4.1.2.3. Tanh Function

Like the sigmoid function, the tanh (hyperbolic tangent) function also squashes its inputs, transforming them into elements on the interval between -1 and 1:

\[\mathrm{tanh}(x) = \frac{1 - \exp(-2x)}{1 + \exp(-2x)}.\]

We plot the tanh function below. Note that as the input nears 0, the tanh function approaches a linear transformation. Although the shape of the function is similar to the sigmoid function, the tanh function exhibits point symmetry about the origin of the coordinate system.

```
with autograd.record():
    y = np.tanh(x)
d2l.plot(x, y, 'x', 'tanh(x)')
```

The derivative of the tanh function is:

\[\frac{d}{dx} \mathrm{tanh}(x) = 1 - \mathrm{tanh}^2(x).\]

The derivative of the tanh function is plotted below. As the input nears 0, the derivative of the tanh function approaches a maximum of 1. And as we saw with the sigmoid function, as the input moves away from 0 in either direction, the derivative of the tanh function approaches 0.

```
y.backward()
d2l.plot(x, x.grad, 'x', 'grad of tanh')
```
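The same kind of check works for tanh, whose derivative has the closed form \(1 - \mathrm{tanh}^2(x)\). The sketch below, in plain Python, compares that expression with a central finite difference; the helper names and step size are assumptions of this illustration.

```python
import math


def tanh_grad(x):
    """Closed-form derivative: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def finite_difference(f, x, h=1e-5):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


# The derivative is largest at x = 0 and decays toward 0 on both sides.
for x in (-3.0, 0.0, 3.0):
    print(x, tanh_grad(x), finite_difference(math.tanh, x))
```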

In summary, we now know how to incorporate nonlinearities to build expressive multilayer neural network architectures. As a side note, your knowledge already puts you in command of a toolkit similar to that of a practitioner circa 1990. In some ways, you have an advantage over anyone working in the 1990s, because you can leverage powerful open-source deep learning frameworks to build models rapidly, using only a few lines of code. Previously, getting these nets training required researchers to code up thousands of lines of C and Fortran.

## 4.1.3. Summary

The multilayer perceptron adds one or multiple fully-connected hidden layers between the output and input layers and transforms the output of the hidden layer via an activation function.

Commonly-used activation functions include the ReLU function, the sigmoid function, and the tanh function.

## 4.1.4. Exercises

Compute the derivatives of the tanh and the pReLU activation functions.

Show that a multilayer perceptron using only ReLU (or pReLU) constructs a continuous piecewise linear function.

Show that \(\mathrm{tanh}(x) + 1 = 2 \mathrm{sigmoid}(2x)\).

Assume we have a multilayer perceptron *without* nonlinearities between the layers. In particular, assume that we have \(d\) input dimensions, \(d\) output dimensions and that one of the layers has only \(d/2\) dimensions. Show that this network is less expressive (powerful) than a single layer perceptron.

Assume that we have a nonlinearity that applies to one minibatch at a time. What kinds of problems do you expect this to cause?