# 5.6. Convolutional Neural Networks (LeNet)¶

In our first encounter with image data we applied a Multilayer Perceptron to pictures of clothing in the Fashion-MNIST data set. Both the height and width of each image were 28 pixels. We expanded the pixels in the image line by line to get a vector of length 784, and then used them as inputs to the fully connected layer. However, this classification method has certain limitations:

1. The adjacent pixels in the same column of an image may be far apart in this vector. The patterns they create may be difficult for the model to recognize. In fact, the vectorial representation ignores position entirely - we could have permuted all $$28 \times 28$$ pixels at random and obtained the same results.
2. For large input images, using a fully connected layer can easily cause the model to become too large, as we discussed previously.

As discussed in the previous sections, the convolutional layer attempts to solve both problems. On the one hand, the convolutional layer retains the input shape, so that the correlation of image pixels in the directions of both height and width can be recognized effectively. On the other hand, the convolutional layer repeatedly calculates the same kernel and the input of different positions through the sliding window, thereby avoiding excessively large parameter sizes.

A convolutional neural network is a network with convolutional layers. In this section, we will introduce an early convolutional neural network used to recognize handwritten digits in images - LeNet5. Convolutional networks were invented by Yann LeCun and coworkers at AT&T Bell Labs in the early 90s. LeNet showed that it was possible to use gradient descent to train the convolutional neural network for handwritten digit recognition. It achieved outstanding results at the time (only matched by Support Vector Machines at the time).

## 5.6.1. LeNet¶

LeNet is divided into two parts: a block of convolutional layers and one of fully connected ones. Below, we will introduce these two modules separately. Before going into details, let’s briefly review the model in pictures. To illustrate the issue of channels and the specific layers we will use a rather description (later we will see how to convey the same information more concisely).

The basic units in the convolutional block are a convolutional layer and a subsequent average pooling layer (note that max-pooling works better, but it had not been invented in the 90s yet). The convolutional layer is used to recognize the spatial patterns in the image, such as lines and the parts of objects, and the subsequent average pooling layer is used to reduce the dimensionality. The convolutional layer block is composed of repeated stacks of these two basic units. In the convolutional layer block, each convolutional layer uses a $$5\times 5$$ window and a sigmoid activation function for the output (note that ReLu works better, but it had not been invented in the 90s yet). The number of output channels for the first convolutional layer is 6, and the number of output channels for the second convolutional layer is increased to 16. This is because the height and width of the input of the second convolutional layer is smaller than that of the first convolutional layer. Therefore, increasing the number of output channels makes the parameter sizes of the two convolutional layers similar. The window shape for the two average pooling layers of the convolutional layer block is $$2\times 2$$ and the stride is 2. Because the pooling window has the same shape as the stride, the areas covered by the pooling window sliding on each input do not overlap. In other words, the pooling layer performs downsampling.

The output shape of the convolutional layer block is (batch size, channel, height, width). When the output of the convolutional layer block is passed into the fully connected layer block, the fully connected layer block flattens each example in the mini-batch. That is to say, the input shape of the fully connected layer will become two dimensional: the first dimension is the example in the mini-batch, the second dimension is the vector representation after each example is flattened, and the vector length is the product of channel, height, and width. The fully connected layer block has three fully connected layers. They have 120, 84, and 10 outputs, respectively. Here, 10 is the number of output classes.

Next, we implement the LeNet model through the Sequential class.

In [1]:

import d2l
import mxnet as mx
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import loss as gloss, nn
import time

net = nn.Sequential()
nn.AvgPool2D(pool_size=2, strides=2),
nn.Conv2D(channels=16, kernel_size=5, activation='sigmoid'),
nn.AvgPool2D(pool_size=2, strides=2),
# Dense will transform the input of the shape (batch size, channel, height, width) into
# the input of the shape (batch size, channel *height * width) automatically by default.
nn.Dense(120, activation='sigmoid'),
nn.Dense(84, activation='sigmoid'),
nn.Dense(10))


We took the liberty of replacing the Gaussian activation in the last layer by a regular dense network since this is rather much more convenient to train. Other than that the network matches the historical definition of LeNet5. Next, we feed a single-channel example of size $$28 \times 28$$ into the network and perform a forward computation layer by layer to see the output shape of each layer.

In [2]:

X = nd.random.uniform(shape=(1, 1, 28, 28))
net.initialize()
for layer in net:
X = layer(X)
print(layer.name, 'output shape:\t', X.shape)

conv0 output shape:      (1, 6, 28, 28)
pool0 output shape:      (1, 6, 14, 14)
conv1 output shape:      (1, 16, 10, 10)
pool1 output shape:      (1, 16, 5, 5)
dense0 output shape:     (1, 120)
dense1 output shape:     (1, 84)
dense2 output shape:     (1, 10)


We can see that the height and width of the input in the convolutional layer block is reduced, layer by layer. The convolutional layer uses a kernel with a height and width of 5 to reduce the height and width by 4, while the pooling layer halves the height and width, but the number of channels increases from 1 to 16. The fully connected layer reduces the number of outputs layer by layer, until the number of image classes becomes 10.

## 5.6.2. Data Acquisition and Training¶

Now, we will experiment with the LeNet model. We still use Fashion-MNIST as the training data set since the problem is rather more difficult than OCR (even in the 1990s the error rates were in the 1% range).

In [3]:

batch_size = 256


Since convolutional networks are significantly more expensive to compute than multilayer perceptrons we recommend using GPUs to speed up training. Time to introduce a convenience function that allows us to detect whether we have a GPU: it works by trying to allocate an NDArray on gpu(0), and use gpu(0) if this is successful. Otherwise, we catch the resulting exception and we stick with the CPU.

In [4]:

def try_gpu():  # This function has been saved in the d2l package for future use.
try:
ctx = mx.gpu()
_ = nd.zeros((1,), ctx=ctx)
except mx.base.MXNetError:
ctx = mx.cpu()
return ctx

ctx = try_gpu()
ctx

Out[4]:

gpu(0)


Accordingly, we slightly modify the evaluate_accuracy function described when implementing the SoftMax from scratch. Since the data arrives in the CPU when loading we need to copy it to the GPU before any computation can occur. This is accomplished via the as_in_context function described in the GPU Computing section. Note that we accumulate the errors on the same device as where the data eventually lives (in acc). This avoids intermediate copy operations that would destroy performance.

In [5]:

# This function has been saved in the d2l package for future use. The function will be gradually improved.
# Its complete implementation will be discussed in the "Image Augmentation" section.
def evaluate_accuracy(data_iter, net, ctx):
acc_sum, n = nd.array([0], ctx=ctx), 0
for X, y in data_iter:
# If ctx is the GPU, copy the data to the GPU.
X, y = X.as_in_context(ctx), y.as_in_context(ctx).astype('float32')
acc_sum += (net(X).argmax(axis=1) == y).sum()
n += y.size
return acc_sum.asscalar() / n


Just like the data loader we need to update the training function to deal with GPUs. Unlike train_ch3 <../chapter_deep-learning-basics/softmax-regression-scratch.md>__ we now move data prior to computation.

In [6]:

# This function has been saved in the d2l package for future use.
def train_ch5(net, train_iter, test_iter, batch_size, trainer, ctx,
num_epochs):
print('training on', ctx)
loss = gloss.SoftmaxCrossEntropyLoss()
for epoch in range(num_epochs):
train_l_sum, train_acc_sum, n, start = 0.0, 0.0, 0, time.time()
for X, y in train_iter:
X, y = X.as_in_context(ctx), y.as_in_context(ctx)
y_hat = net(X)
l = loss(y_hat, y).sum()
l.backward()
trainer.step(batch_size)
y = y.astype('float32')
train_l_sum += l.asscalar()
train_acc_sum += (y_hat.argmax(axis=1) == y).sum().asscalar()
n += y.size
test_acc = evaluate_accuracy(test_iter, net, ctx)
print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, '
'time %.1f sec'
% (epoch + 1, train_l_sum / n, train_acc_sum / n, test_acc,
time.time() - start))


We initialize the model parameters on the device indicated by ctx, this time using Xavier. The loss function and the training algorithm still use the cross-entropy loss function and mini-batch stochastic gradient descent.

In [7]:

lr, num_epochs = 0.9, 5
net.initialize(force_reinit=True, ctx=ctx, init=init.Xavier())
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
train_ch5(net, train_iter, test_iter, batch_size, trainer, ctx, num_epochs)

training on gpu(0)
epoch 1, loss 2.3196, train acc 0.100, test acc 0.100, time 2.4 sec
epoch 2, loss 2.2979, train acc 0.115, test acc 0.275, time 2.0 sec
epoch 3, loss 1.2638, train acc 0.496, test acc 0.682, time 1.9 sec
epoch 4, loss 0.8286, train acc 0.674, test acc 0.713, time 1.9 sec
epoch 5, loss 0.6941, train acc 0.727, test acc 0.757, time 1.9 sec


## 5.6.3. Summary¶

• A convolutional neural network (in short, ConvNet) is a network using convolutional layers.
• In a ConvNet we alternate between convolutions, nonlinearities and often also pooling operations.
• Ultimately the resolution is reduced prior to emitting an output via one (or more) dense layers.
• LeNet was the first successful deployment of such a network.

## 5.6.4. Problems¶

1. Replace the average pooling with max pooling. What happens?
2. Try to construct a more complex network based on LeNet to improve its accuracy.
• Adjust the convolution window size.
• Adjust the number of output channels.
• Adjust the activation function (ReLu?).
• Adjust the number of convolution layers.
• Adjust the number of fully connected layers.
• Adjust the learning rates and other training details (initialization, epochs, etc.)
3. Try out the improved network on the original MNIST dataset.
4. Display the activations of the first and second layer of LeNet for different inputs (e.g. sweaters, coats).

## 5.6.5. References¶

[1] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.