3.2. Linear Regression Implementation from Scratch
Open the notebook in Colab
Open the notebook in Colab
Open the notebook in Colab

Now that you understand the key ideas behind linear regression, we can begin to work through a hands-on implementation in code. In this section, we will implement the entire method from scratch, including the data pipeline, the model, the loss function, and the minibatch stochastic gradient descent optimizer. While modern deep learning frameworks can automate nearly all of this work, implementing things from scratch is the only way to make sure that you really know what you are doing. Moreover, when it comes time to customize models, defining our own layers or loss functions, understanding how things work under the hood will prove handy. In this section, we will rely only on tensors and auto differentiation. Afterwards, we will introduce a more concise implementation, taking advantage of bells and whistles of deep learning frameworks.

%matplotlib inline
from d2l import mxnet as d2l
from mxnet import autograd, np, npx
import random
npx.set_np()
%matplotlib inline
from d2l import torch as d2l
import torch
import random
%matplotlib inline
from d2l import tensorflow as d2l
import tensorflow as tf
import random

3.2.1. Generating the Dataset

To keep things simple, we will construct an artificial dataset according to a linear model with additive noise. Our task will be to recover this model’s parameters using the finite set of examples contained in our dataset. We will keep the data low-dimensional so we can visualize it easily. In the following code snippet, we generate a dataset containing 1000 examples, each consisting of 2 features sampled from a standard normal distribution. Thus our synthetic dataset will be a matrix \(\mathbf{X}\in \mathbb{R}^{1000 \times 2}\).

The true parameters generating our dataset will be \(\mathbf{w} = [2, -3.4]^\top\) and \(b = 4.2\), and our synthetic labels will be assigned according to the following linear model with the noise term \(\epsilon\):

(3.2.1)\[\mathbf{y}= \mathbf{X} \mathbf{w} + b + \mathbf\epsilon.\]

You could think of \(\epsilon\) as capturing potential measurement errors on the features and labels. We will assume that the standard assumptions hold and thus that \(\epsilon\) obeys a normal distribution with mean of 0. To make our problem easy, we will set its standard deviation to 0.01. The following code generates our synthetic dataset.

def synthetic_data(w, b, num_examples):  #@save
    """Generate y = Xw + b + noise."""
    X = np.random.normal(0, 1, (num_examples, len(w)))
    y = np.dot(X, w) + b
    y += np.random.normal(0, 0.01, y.shape)
    return X, y.reshape((-1, 1))

true_w = np.array([2, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)
def synthetic_data(w, b, num_examples):  #@save
    """Generate y = Xw + b + noise."""
    X = torch.normal(0, 1, (num_examples, len(w)))
    y = torch.matmul(X, w) + b
    y += torch.normal(0, 0.01, y.shape)
    return X, y.reshape((-1, 1))

true_w = torch.tensor([2, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)
def synthetic_data(w, b, num_examples):  #@save
    """Generate y = Xw + b + noise."""
    X = tf.zeros((num_examples, w.shape[0]))
    X += tf.random.normal(shape=X.shape)
    y = tf.matmul(X, tf.reshape(w, (-1, 1))) + b
    y += tf.random.normal(shape=y.shape, stddev=0.01)
    y = tf.reshape(y, (-1, 1))
    return X, y

true_w = tf.constant([2, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)

Note that each row in features consists of a 2-dimensional data example and that each row in labels consists of a 1-dimensional label value (a scalar).

print('features:', features[0],'\nlabel:', labels[0])
features: [2.2122064 1.1630787]
label: [4.662078]
print('features:', features[0],'\nlabel:', labels[0])
features: tensor([-1.7227,  0.6595])
label: tensor([-1.4825])
print('features:', features[0],'\nlabel:', labels[0])
features: tf.Tensor([-0.2209929  0.65862  ], shape=(2,), dtype=float32)
label: tf.Tensor([1.5084338], shape=(1,), dtype=float32)

By generating a scatter plot using the second feature features[:, 1] and labels, we can clearly observe the linear correlation between the two.

d2l.set_figsize()
# The semicolon is for displaying the plot only
d2l.plt.scatter(d2l.numpy(features[:, 1]), d2l.numpy(labels), 1);
../_images/output_linear-regression-scratch_58de05_39_0.svg
d2l.set_figsize()
# The semicolon is for displaying the plot only
d2l.plt.scatter(d2l.numpy(features[:, 1]), d2l.numpy(labels), 1);
../_images/output_linear-regression-scratch_58de05_42_0.svg
d2l.set_figsize()
# The semicolon is for displaying the plot only
d2l.plt.scatter(d2l.numpy(features[:, 1]), d2l.numpy(labels), 1);
../_images/output_linear-regression-scratch_58de05_45_0.svg

3.2.2. Reading the Dataset

Recall that training models consists of making multiple passes over the dataset, grabbing one minibatch of examples at a time, and using them to update our model. Since this process is so fundamental to training machine learning algorithms, it is worth defining a utility function to shuffle the dataset and access it in minibatches.

In the following code, we define the data_iter function to demonstrate one possible implementation of this functionality. The function takes a batch size, a matrix of features, and a vector of labels, yielding minibatches of the size batch_size. Each minibatch consists of a tuple of features and labels.

def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    # The examples are read at random, in no particular order
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        batch_indices = np.array(
            indices[i: min(i + batch_size, num_examples)])
        yield features[batch_indices], labels[batch_indices]
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    # The examples are read at random, in no particular order
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        batch_indices = torch.tensor(
            indices[i: min(i + batch_size, num_examples)])
        yield features[batch_indices], labels[batch_indices]
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    # The examples are read at random, in no particular order
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        j = tf.constant(indices[i: min(i + batch_size, num_examples)])
        yield tf.gather(features, j), tf.gather(labels, j)

In general, note that we want to use reasonably sized minibatches to take advantage of the GPU hardware, which excels at parallelizing operations. Because each example can be fed through our models in parallel and the gradient of the loss function for each example can also be taken in parallel, GPUs allow us to process hundreds of examples in scarcely more time than it might take to process just a single example.

To build some intuition, let us read and print the first small batch of data examples. The shape of the features in each minibatch tells us both the minibatch size and the number of input features. Likewise, our minibatch of labels will have a shape given by batch_size.

batch_size = 10

for X, y in data_iter(batch_size, features, labels):
    print(X, '\n', y)
    break
[[-0.2125029   1.1763296 ]
 [-0.11836779  0.19793853]
 [-0.18006676  0.25114498]
 [ 1.5227884  -0.47990882]
 [ 1.6174138  -0.5286131 ]
 [-0.0538918  -1.022586  ]
 [ 1.4159873   0.03571886]
 [-1.0218245   2.1935298 ]
 [-0.6015945   0.29670078]
 [ 1.0442022   0.47742996]]
 [[-0.23889384]
 [ 3.29689   ]
 [ 2.98238   ]
 [ 8.883703  ]
 [ 9.224307  ]
 [ 7.574081  ]
 [ 6.905659  ]
 [-5.3005147 ]
 [ 1.9725974 ]
 [ 4.670243  ]]
batch_size = 10

for X, y in data_iter(batch_size, features, labels):
    print(X, '\n', y)
    break
tensor([[ 1.6899,  0.0246],
        [ 0.6226,  1.2036],
        [ 0.0212,  0.6328],
        [-0.4103, -0.5637],
        [-1.1218,  0.4869],
        [ 0.8225, -0.8655],
        [ 1.1777, -0.6210],
        [-0.6031, -0.2346],
        [-0.0958,  1.1411],
        [-0.3367,  1.2743]])
 tensor([[ 7.4870],
        [ 1.3574],
        [ 2.0872],
        [ 5.2840],
        [ 0.2862],
        [ 8.7966],
        [ 8.6365],
        [ 3.7991],
        [ 0.1464],
        [-0.8119]])
batch_size = 10

for X, y in data_iter(batch_size, features, labels):
    print(X, '\n', y)
    break
tf.Tensor(
[[-0.10883021  1.2566772 ]
 [ 1.1622273  -0.11815778]
 [ 0.76029664 -0.91417277]
 [-0.6312261  -0.6870954 ]
 [-0.08980141 -0.52814686]
 [-0.16461211 -0.61581516]
 [-0.33140317  2.4456334 ]
 [ 1.9638819  -1.2954173 ]
 [ 1.6808692   0.21056521]
 [-0.5796793   0.5305954 ]], shape=(10, 2), dtype=float32)
 tf.Tensor(
[[-0.28596663]
 [ 6.9213047 ]
 [ 8.833311  ]
 [ 5.273091  ]
 [ 5.834231  ]
 [ 5.97464   ]
 [-4.782687  ]
 [12.534847  ]
 [ 6.8439426 ]
 [ 1.2351022 ]], shape=(10, 1), dtype=float32)

As we run the iteration, we obtain distinct minibatches successively until the entire dataset has been exhausted (try this). While the iteration implemented above is good for didactic purposes, it is inefficient in ways that might get us in trouble on real problems. For example, it requires that we load all the data in memory and that we perform lots of random memory access. The built-in iterators implemented in a deep learning framework are considerably more efficient and they can deal with both data stored in files and data fed via data streams.

3.2.3. Initializing Model Parameters

Before we can begin optimizing our model’s parameters by minibatch stochastic gradient descent, we need to have some parameters in the first place. In the following code, we initialize weights by sampling random numbers from a normal distribution with mean 0 and a standard deviation of 0.01, and setting the bias to 0.

w = np.random.normal(0, 0.01, (2, 1))
b = np.zeros(1)
w.attach_grad()
b.attach_grad()
w = torch.normal(0, 0.01, size=(2,1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)
w = tf.Variable(tf.random.normal(shape=(2, 1), mean=0, stddev=0.01),
                trainable=True)
b = tf.Variable(tf.zeros(1), trainable=True)

After initializing our parameters, our next task is to update them until they fit our data sufficiently well. Each update requires taking the gradient of our loss function with respect to the parameters. Given this gradient, we can update each parameter in the direction that may reduce the loss.

Since nobody wants to compute gradients explicitly (this is tedious and error prone), we use automatic differentiation, as introduced in Section 2.5, to compute the gradient.

3.2.4. Defining the Model

Next, we must define our model, relating its inputs and parameters to its outputs. Recall that to calculate the output of the linear model, we simply take the matrix-vector dot product of the input features \(\mathbf{X}\) and the model weights \(\mathbf{w}\), and add the offset \(b\) to each example. Note that below \(\mathbf{Xw}\) is a vector and \(b\) is a scalar. Recall the broadcasting mechanism as described in Section 2.1.3. When we add a vector and a scalar, the scalar is added to each component of the vector.

def linreg(X, w, b):  #@save
    """The linear regression model."""
    return np.dot(X, w) + b
def linreg(X, w, b):  #@save
    """The linear regression model."""
    return torch.matmul(X, w) + b
def linreg(X, w, b):  #@save
    """The linear regression model."""
    return tf.matmul(X, w) + b

3.2.5. Defining the Loss Function

Since updating our model requires taking the gradient of our loss function, we ought to define the loss function first. Here we will use the squared loss function as described in Section 3.1. In the implementation, we need to transform the true value y into the predicted value’s shape y_hat. The result returned by the following function will also have the same shape as y_hat.

def squared_loss(y_hat, y):  #@save
    """Squared loss."""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2
def squared_loss(y_hat, y):  #@save
    """Squared loss."""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2
def squared_loss(y_hat, y):  #@save
    """Squared loss."""
    return (y_hat - tf.reshape(y, y_hat.shape)) ** 2 / 2

3.2.6. Defining the Optimization Algorithm

As we discussed in Section 3.1, linear regression has a closed-form solution. However, this is not a book about linear regression: it is a book about deep learning. Since none of the other models that this book introduces can be solved analytically, we will take this opportunity to introduce your first working example of minibatch stochastic gradient descent.

At each step, using one minibatch randomly drawn from our dataset, we will estimate the gradient of the loss with respect to our parameters. Next, we will update our parameters in the direction that may reduce the loss. The following code applies the minibatch stochastic gradient descent update, given a set of parameters, a learning rate, and a batch size. The size of the update step is determined by the learning rate lr. Because our loss is calculated as a sum over the minibatch of examples, we normalize our step size by the batch size (batch_size), so that the magnitude of a typical step size does not depend heavily on our choice of the batch size.

def sgd(params, lr, batch_size):  #@save
    """Minibatch stochastic gradient descent."""
    for param in params:
        param[:] = param - lr * param.grad / batch_size
def sgd(params, lr, batch_size):  #@save
    """Minibatch stochastic gradient descent."""
    for param in params:
        param.data.sub_(lr*param.grad/batch_size)
        param.grad.data.zero_()
def sgd(params, grads, lr, batch_size):  #@save
    """Minibatch stochastic gradient descent."""
    for param, grad in zip(params, grads):
        param.assign_sub(lr*grad/batch_size)

3.2.7. Training

Now that we have all of the parts in place, we are ready to implement the main training loop. It is crucial that you understand this code because you will see nearly identical training loops over and over again throughout your career in deep learning.

In each iteration, we will grab a minibatch of training examples, and pass them through our model to obtain a set of predictions. After calculating the loss, we initiate the backwards pass through the network, storing the gradients with respect to each parameter. Finally, we will call the optimization algorithm sgd to update the model parameters.

In summary, we will execute the following loop:

  • Initialize parameters \((\mathbf{w}, b)\)

  • Repeat until done

    • Compute gradient \(\mathbf{g} \leftarrow \partial_{(\mathbf{w},b)} \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)\)

    • Update parameters \((\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}\)

In each epoch, we will iterate through the entire dataset (using the data_iter function) once passing through every example in the training dataset (assuming that the number of examples is divisible by the batch size). The number of epochs num_epochs and the learning rate lr are both hyperparameters, which we set here to 3 and 0.03, respectively. Unfortunately, setting hyperparameters is tricky and requires some adjustment by trial and error. We elide these details for now but revise them later in Section 11.

lr = 0.03
num_epochs = 3
net = linreg
loss = squared_loss

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        with autograd.record():
            l = loss(net(X, w, b), y)  # Minibatch loss in `X` and `y`
        # Because `l` has a shape (`batch_size`, 1) and is not a scalar
        # variable, the elements in `l` are added together to obtain a new
        # variable, on which gradients with respect to [`w`, `b`] are computed
        l.backward()
        sgd([w, b], lr, batch_size)  # Update parameters using their gradient
    train_l = loss(net(features, w, b), labels)
    print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')
epoch 1, loss 0.024958
epoch 2, loss 0.000088
epoch 3, loss 0.000051
lr = 0.03
num_epochs = 3
net = linreg
loss = squared_loss

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y)  # Minibatch loss in `X` and `y`
        # Compute gradient on `l` with respect to [`w`, `b`]
        l.sum().backward()
        sgd([w, b], lr, batch_size)  # Update parameters using their gradient
    with torch.no_grad():
        train_l = loss(net(features, w, b), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')
epoch 1, loss 0.035664
epoch 2, loss 0.000134
epoch 3, loss 0.000052
lr = 0.03
num_epochs = 3
net = linreg
loss = squared_loss

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        with tf.GradientTape() as g:
            l = loss(net(X, w, b), y)  # Minibatch loss in `X` and `y`
        # Compute gradient on l with respect to [`w`, `b`]
        dw, db = g.gradient(l, [w, b])
        # Update parameters using their gradient
        sgd([w, b], [dw, db], lr, batch_size)
    train_l = loss(net(features, w, b), labels)
    print(f'epoch {epoch + 1}, loss {float(tf.reduce_mean(train_l)):f}')
epoch 1, loss 0.057356
epoch 2, loss 0.000292
epoch 3, loss 0.000053

In this case, because we synthesized the dataset ourselves, we know precisely what the true parameters are. Thus, we can evaluate our success in training by comparing the true parameters with those that we learned through our training loop. Indeed they turn out to be very close to each other.

print(f'error in estimating w: {true_w - w.reshape(true_w.shape)}')
print(f'error in estimating b: {true_b - b}')
error in estimating w: [0.00041795 0.00029826]
error in estimating b: [0.00057745]
print(f'error in estimating w: {true_w - w.reshape(true_w.shape)}')
print(f'error in estimating b: {true_b - b}')
error in estimating w: tensor([1.8430e-04, 1.7405e-05], grad_fn=<SubBackward0>)
error in estimating b: tensor([-0.0003], grad_fn=<RsubBackward1>)
print(f'error in estimating w: {true_w - tf.reshape(w, true_w.shape)}')
print(f'error in estimating b: {true_b - b}')
error in estimating w: [ 0.0003022  -0.00141883]
error in estimating b: [0.00101137]

Note that we should not take it for granted that we are able to recover the parameters perfectly. However, in machine learning, we are typically less concerned with recovering true underlying parameters, and more concerned with parameters that lead to highly accurate prediction. Fortunately, even on difficult optimization problems, stochastic gradient descent can often find remarkably good solutions, owing partly to the fact that, for deep networks, there exist many configurations of the parameters that lead to highly accurate prediction.

3.2.8. Summary

  • We saw how a deep network can be implemented and optimized from scratch, using just tensors and auto differentiation, without any need for defining layers or fancy optimizers.

  • This section only scratches the surface of what is possible. In the following sections, we will describe additional models based on the concepts that we have just introduced and learn how to implement them more concisely.

3.2.9. Exercises

  1. What would happen if we were to initialize the weights to zero. Would the algorithm still work?

  2. Assume that you are Georg Simon Ohm trying to come up with a model between voltage and current. Can you use auto differentiation to learn the parameters of your model?

  3. Can you use Planck’s Law to determine the temperature of an object using spectral energy density?

  4. What are the problems you might encounter if you wanted to compute the second derivatives? How would you fix them?

  5. Why is the reshape function needed in the squared_loss function?

  6. Experiment using different learning rates to find out how fast the loss function value drops.

  7. If the number of examples cannot be divided by the batch size, what happens to the data_iter function’s behavior?