Dive into Deep Learning
Table Of Contents
Dive into Deep Learning
Table Of Contents

3.2. Linear regression implementation from scratch

After getting some background on linear regression, we are now ready for a hands-on implementation. While a powerful deep learning framework minimizes repetitive work, relying on it too much to make things easy can make it hard to properly understand how deep learning works. This matters in particular if we want to change things later, e.g. define our own layers, loss functions, etc. Because of this, we start by describing how to implement linear regression training using only NDArray and autograd.

Before we begin, let’s import the package or module required for this section’s experiment; matplotlib will be used for plotting and will be set to embed in the GUI.

In [1]:
%matplotlib inline
from IPython import display
from matplotlib import pyplot as plt
from mxnet import autograd, nd
import random

3.2.1. Generating Data Sets

By constructing a simple artificial training data set, we can visually compare the differences between the parameters we have learned and the actual model parameters. Set the number of examples in the training data set as 1000 and the number of inputs (feature number) as 2. Using the randomly generated batch example feature \(\mathbf{X}\in \mathbb{R}^{1000 \times 2}\), we use the actual weight \(\mathbf{w} = [2, -3.4]^\top\) and bias \(b = 4.2\) of the linear regression model, as well as a random noise item \(\epsilon\) to generate the tag

\[\mathbf{y}= \mathbf{X} \mathbf{w} + b + \mathbf\epsilon\]

The noise term \(\epsilon\) (or rather each coordinate of it) obeys a normal distribution with a mean of 0 and a standard deviation of 0.01. To get a better idea, let us generate the dataset.

In [2]:
num_inputs = 2
num_examples = 1000
true_w = nd.array([2, -3.4])
true_b = 4.2
features = nd.random.normal(scale=1, shape=(num_examples, num_inputs))
labels = nd.dot(features, true_w) + true_b
labels += nd.random.normal(scale=0.01, shape=labels.shape)

Note that each row in features consists of a 2-dimensional data point and that each row in labels consists of a 1-dimensional target value (a scalar).

In [3]:
features[0], labels[0]
 [2.2122064 0.7740038]
 <NDArray 2 @cpu(0)>,
 <NDArray 1 @cpu(0)>)

By generating a scatter plot using the second features[:, 1] and labels, we can clearly observe the linear correlation between the two.

In [4]:
def use_svg_display():
    # Displayed in vector graphics.

def set_figsize(figsize=(3.5, 2.5)):
    # Set the size of the graph to be plotted.
    plt.rcParams['figure.figsize'] = figsize

plt.figure(figsize=(10, 6))
plt.scatter(features[:, 1].asnumpy(), labels.asnumpy(), 1);

The plotting function plt as well as the use_svg_display and set_figsize functions are defined in the d2l package. We will call d2l.plt directly for future plotting. To print the vector diagram and set its size, we only need to call d2l.set_figsize() before plotting, because plt is a global variable in the d2l package.

3.2.2. Reading Data

We need to iterate over the entire data set and continuously examine mini-batches of data examples when training the model. Here we define a function. Its purpose is to return the features and tags of random batch_size (batch size) examples every time it’s called. One might wonder why we are not reading one observation at a time but rather construct an iterator which returns a few observations at a time. This has mostly to do with efficiency when optimizing. Recall that when we processed one dimension at a time the algorithm was quite slow. The same thing happens when processing single observations vs. an entire ‘batch’ of them, which can be represented as a matrix rather than just a vector. In particular, GPUs are much faster when it comes to dealing with matrices, up to an order of magnitude. This is one of the reasons why deep learning usually operates on mini-batches rather than singletons.

In [5]:
# This function has been saved in the d2l package for future use.
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # The examples are read at random, in no particular order.
    for i in range(0, num_examples, batch_size):
        j = nd.array(indices[i: min(i + batch_size, num_examples)])
        yield features.take(j), labels.take(j)
        # The “take” function will then return the corresponding element based on the indices.

Let’s read and print the first small batch of data examples. The shape of the features in each batch corresponds to the batch size and the number of input dimensions. Likewise, we obtain as many labels as requested by the batch size.

In [6]:
batch_size = 10

for X, y in data_iter(batch_size, features, labels):
    print(X, y)

[[-0.16185044  0.72164685]
 [-0.05296669  0.27128416]
 [ 0.98146594  0.26833233]
 [ 0.9509561   0.06745205]
 [-0.15137078 -2.3573012 ]
 [-0.46069986  0.8082894 ]
 [ 0.6538188   1.1236147 ]
 [ 0.27814385  0.09634511]
 [-0.26314208 -2.9627717 ]
 [ 0.3082519   0.66169816]]
<NDArray 10x2 @cpu(0)>
[ 1.4243342   3.165769    5.2414656   5.8650117  11.910825    0.51967335
  1.6822191   4.4285975  13.766514    2.5606186 ]
<NDArray 10 @cpu(0)>

Clearly, if we run the iterator again, we obtain a different minibatch until all the data has been exhausted (try this). Note that the iterator described above is a bit inefficient (it requires that we load all data in memory and that we perform a lot of random memory access). The built-in iterators are more efficient and they can deal with data stored on file (or being fed via a data stream).

3.2.3. Initialize Model Parameters

Weights are initialized to normal random numbers using a mean of 0 and a standard deviation of 0.01, with the bias \(b\) set to zero.

In [7]:
w = nd.random.normal(scale=0.01, shape=(num_inputs, 1))
b = nd.zeros(shape=(1,))

In the succeeding cells, we’re going to update these parameters to better fit our data. This will involve taking the gradient (a multi-dimensional derivative) of some loss function with respect to the parameters. We’ll update each parameter in the direction that reduces the loss. In order for autograd to know that it needs to set up the appropriate data structures, track changes, etc., we need to attach gradients explicitly.

In [8]:

3.2.4. Define the Model

Next we’ll want to define our model. In this case, we’ll be working with linear models, the simplest possible useful neural network. To calculate the output of the linear model, we simply multiply a given input with the model’s weights \(w\), and add the offset \(b\).

In [9]:
def linreg(X, w, b):  # This function has been saved in the d2l package for future use.
    return nd.dot(X, w) + b

3.2.5. Define the Loss Function

We will use the squared loss function described in the previous section to define the linear regression loss. In the implementation, we need to transform the true value y into the predicted value’s shape y_hat. The result returned by the following function will also be the same as the y_hat shape.

In [10]:
def squared_loss(y_hat, y):  # This function has been saved in the d2l package for future use.
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

3.2.6. Define the Optimization Algorithm

Linear regression actually has a closed-form solution. However, most interesting models that we’ll care about cannot be solved analytically. So we’ll solve this problem by stochastic gradient descent sgd. At each step, we’ll estimate the gradient of the loss with respect to our weights, using one batch randomly drawn from our dataset. Then, we’ll update our parameters a small amount in the direction that reduces the loss. Here, the gradient calculated by the automatic differentiation module is the gradient sum of a batch of examples. We divide it by the batch size to obtain the average. The size of the step is determined by the learning rate lr.

In [11]:
def sgd(params, lr, batch_size):  # This function has been saved in the d2l package for future use.
    for param in params:
        param[:] = param - lr * param.grad / batch_size

3.2.7. Training

In training, we will iterate over the data to improve the model parameters. In each iteration, the mini-batch stochastic gradient is calculated by first calling the inverse function backward depending on the currently read mini-batch data examples (feature X and label y), and then calling the optimization algorithm sgd to iterate the model parameters. Since we previously set the batch size batch_size to 10, the loss shape l for each small batch is (10, 1).

  • Initialize parameters \((\mathbf{w}, b)\)
  • Repeat until done
    • Compute gradient \(\mathbf{g} \leftarrow \partial_{(\mathbf{w},b)} \frac{1}{\mathcal{B}} \sum_{i \in \mathcal{B}} l(\mathbf{x}^i, y^i, \mathbf{w}, b)\)
    • Update parameters \((\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}\)

Since nobody wants to compute gradients explicitly (this is tedious and error prone) we use automatic differentiation to compute \(g\). See section “Automatic Gradient” for more details. Since the loss l is not a scalar variable, running l.backward() will add together the elements in l to obtain the new variable, and then calculate the variable model parameters’ gradient.

In an epoch (a pass through the data), we will iterate through the data_iter function once and use it for all the examples in the training data set (assuming the number of examples is divisible by the batch size). The number of epochs num_epochs and the learning rate lr are both hyper-parameters and are set to 3 and 0.03, respectively. Unfortunately in practice, the majority of the hyper-parameters will require some adjustment by trial and error. For instance, the model might actually become more accurate by training longer (but this increases computational cost). Likewise, we might want to change the learning rate on the fly. We will discuss this later in the chapter on “Optimization Algorithms”.

In [12]:
lr = 0.03               # learning rate
num_epochs = 3          # number of iterations
net = linreg            # our fancy linear model
loss = squared_loss     # 0.5 (y-y')^2

for epoch in range(num_epochs):
    # Assuming the number of examples can be divided by the batch size, all the examples in
    # the training data set are used once in one epoch iteration.
    # The features and tags of mini-batch examples are given by X and y respectively.
    for X, y in data_iter(batch_size, features, labels):
        with autograd.record():
            l = loss(net(X, w, b), y)  # minibatch loss in X and y
        l.backward()                   # compute gradient on l with respect to [w,b]
        sgd([w, b], lr, batch_size)    # update parameters [w,b] using their gradient
    train_l = loss(net(features, w, b), labels)
    print('epoch %d, loss %f' % (epoch + 1, train_l.mean().asnumpy()))
epoch 1, loss 0.040839
epoch 2, loss 0.000152
epoch 3, loss 0.000050

To evaluate the trained model, we can compare the actual parameters used with the parameters we have learned after the training has been completed. They are very close to each other.

In [13]:
print('Error in estimating w', true_w - w.reshape(true_w.shape))
print('Error in estimating b', true_b - b)
Error in estimating w
[ 0.00041497 -0.00038981]
<NDArray 2 @cpu(0)>
Error in estimating b
<NDArray 1 @cpu(0)>

Note that we should not take it for granted that we are able to reover the parameters accurately. This only happens for a special category problems: strongly convex optimization problems with ‘enough’ data to ensure that the noisy samples allow us to recover the underlying dependency correctly. In most cases this is not the case. In fact, the parameters of a deep network are rarely the same (or even close) between two different runs, unless everything is kept identically, including the order in which the data is traversed. Nonetheless this can lead to very good solutions, mostly due to the fact that quite often there are many sets of parameters that work well.

3.2.8. Summary

We saw how a deep network can be implemented and optimized from scratch, using just NDArray and autograd without any need for defining layers, fancy optimizers, etc. This only scratches the surface of what is possible. In the following sections, we will describe additional deep learning models based on what we have just learned and you will learn how to implement them using more concisely.

3.2.9. Problems

  1. What would happen if we were to initialize the weights \(\mathbf{w} = 0\). Would the algorithm still work?
  2. Assume that you’re Georg Simon Ohm trying to come up with a model between voltage and current. Can you use autograd to learn the parameters of your model.
  3. Can you use Planck’s Law to determine the temperature of an object using spectral energy density.
  4. What are the problems you might encounter if you wanted to extend autograd to second derivatives? How would you fix them?
  5. Why is the reshape function needed in the squared_loss function?
  6. Experiment using different learning rates to find out how fast the loss function value drops.
  7. If the number of examples cannot be divided by the batch size, what happens to the data_iter function’s behavior?

3.2.10. Discuss on our Forum