.. _sec_softmax_scratch:
Implementation of Softmax Regression from Scratch
=================================================
Just as we implemented linear regression from scratch, we believe that
multiclass logistic (softmax) regression is similarly fundamental and
you ought to know the gory details of how to implement it yourself. As
with linear regression, after doing things by hand we will breeze
through an implementation in Gluon for comparison. To begin, let’s
import the familiar packages.
.. code:: python
import d2l
from mxnet import autograd, np, npx, gluon
from IPython import display
npx.set_np()
We will work with the Fashion-MNIST dataset, just introduced in
:numref:`sec_fashion_mnist`, setting up an iterator with batch size
:math:`256`.
.. code:: python
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
Initializing Model Parameters
-----------------------------
As in our linear regression example, each example here will be
represented by a fixed-length vector. Each example in the raw data is a
:math:`28 \times 28` image. In this section, we will flatten each image,
treating them as :math:`784` 1D vectors. In the future, we will talk
about more sophisticated strategies for exploiting the spatial structure
in images, but for now we treat each pixel location as just another
feature.
Recall that in softmax regression, we have as many outputs as there are
categories. Because our dataset has :math:`10` categories, our network
will have an output dimension of :math:`10`. Consequently, our weights
will constitute a :math:`784 \times 10` matrix and the biases will
constitute a :math:`1 \times 10` vector. As with linear regression, we
will initialize our weights :math:`W` with Gaussian noise and our biases
to take the initial value :math:`0`.
.. code:: python
num_inputs = 784
num_outputs = 10
W = np.random.normal(0, 0.01, (num_inputs, num_outputs))
b = np.zeros(num_outputs)
Recall that we need to *attach gradients* to the model parameters. More
literally, we are allocating memory for future gradients to be stored
and notifiying MXNet that we will want to calculate gradients with
respect to these parameters in the future.
.. code:: python
W.attach_grad()
b.attach_grad()
The Softmax
-----------
Before implementing the softmax regression model, let’s briefly review
how operators such as ``sum`` work along specific dimensions in an
``ndarray``. Given a matrix ``X`` we can sum over all elements (default)
or only over elements in the same axis, *i.e.*, the column (``axis=0``)
or the same row (``axis=1``). Note that if ``X`` is an array with shape
``(2, 3)`` and we sum over the columns (``X.sum(axis=0``), the result
will be a (1D) vector with shape ``(3,)``. If we want to keep the number
of axes in the original array (resulting in a 2D array with shape
``(1, 3)``), rather than collapsing out the dimension that we summed
over we can specify ``keepdims=True`` when invoking ``sum``.
.. code:: python
X = np.array([[1, 2, 3], [4, 5, 6]])
print(X.sum(axis=0, keepdims=True), '\n', X.sum(axis=1, keepdims=True))
.. parsed-literal::
:class: output
[[5. 7. 9.]]
[[ 6.]
[15.]]
We are now ready to implement the softmax function. Recall that softmax
consists of two steps: First, we exponentiate each term (using ``exp``).
Then, we sum over each row (we have one row per example in the batch) to
get the normalization constants for each example. Finally, we divide
each row by its normalization constant, ensuring that the result sums to
:math:`1`. Before looking at the code, let’s recall what this looks
expressed as an equation:
.. math::
\mathrm{softmax}(\mathbf{X})_{ij} = \frac{\exp(X_{ij})}{\sum_k \exp(X_{ik})}.
The denominator, or normalization constant, is also sometimes called the
partition function (and its logarithm is called the log-partition
function). The origins of that name are in `statistical
physics `__
where a related equation models the distribution over an ensemble of
particles).
.. code:: python
def softmax(X):
X_exp = np.exp(X)
partition = X_exp.sum(axis=1, keepdims=True)
return X_exp / partition # The broadcast mechanism is applied here
As you can see, for any random input, we turn each element into a
non-negative number. Moreover, each row sums up to 1, as is required for
a probability. Note that while this looks correct mathematically, we
were a bit sloppy in our implementation because failed to take
precautions against numerical overflow or underflow due to large (or
very small) elements of the matrix, as we did in
:numref:`sec_naive_bayes`.
.. code:: python
X = np.random.normal(size=(2, 5))
X_prob = softmax(X)
X_prob, X_prob.sum(axis=1)
.. parsed-literal::
:class: output
(array([[0.22376052, 0.06659239, 0.06583703, 0.29964197, 0.3441681 ],
[0.63209665, 0.03179282, 0.194987 , 0.09209415, 0.04902935]]),
array([1. , 0.99999994]))
The Model
---------
Now that we have defined the softmax operation, we can implement the
softmax regression model. The below code defines the forward pass
through the network. Note that we flatten each original image in the
batch into a vector with length ``num_inputs`` with the ``reshape``
function before passing the data through our model.
.. code:: python
def net(X):
return softmax(np.dot(X.reshape(-1, num_inputs), W) + b)
The Loss Function
-----------------
Next, we need to implement the cross-entropy loss function, introduced
in :numref:`sec_softmax`. This may be the most common loss function in
all of deep learning because, at the moment, classification problems far
outnumber regression problems.
Recall that cross-entropy takes the negative log likelihood of the
predicted probability assigned to the true label
:math:`-\log P(y \mid x)`. Rather than iterating over the predictions
with a Python ``for`` loop (which tends to be inefficient), we can use
the ``pick`` function which allows us to easily select the appropriate
terms from the matrix of softmax entries. Below, we illustrate the
``pick`` function on a toy example, with :math:`3` categories and
:math:`2` examples.
.. code:: python
y_hat = np.array([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
y_hat[[0, 1], [0, 2]]
.. parsed-literal::
:class: output
array([0.1, 0.5])
Now we can implement the cross-entropy loss function efficiently with
just one line of code.
.. code:: python
def cross_entropy(y_hat, y):
return - np.log(y_hat[range(len(y_hat)), y])
Classification Accuracy
-----------------------
Given the predicted probability distribution ``y_hat``, we typically
choose the class with highest predicted probability whenever we must
output a *hard* prediction. Indeed, many applications require that we
make a choice. Gmail must categorize an email into Primary, Social,
Updates, or Forums. It might estimate probabilities internally, but at
the end of the day it has to choose one among the categories.
When predictions are consistent with the actual category ``y``, they are
correct. The classification accuracy is the fraction of all predictions
that are correct. Although it can be difficult optimize accuracy
directly (it is not differentiable), it is often the performance metric
that we care most about, and we will nearly always report it when
training classifiers.
To compute accuracy we do the following: First, we execute
``y_hat.argmax(axis=1)`` to gather the predicted classes (given by the
indices for the largest entires each row). The result has the same shape
as the variable ``y``. Now we just need to check how frequently the two
match. Since the equality operator ``==`` is datatype-sensitive (e.g.,
an ``int`` and a ``float32`` are never equal), we also need to convert
both to the same type (we pick ``float32``). The result is an
``ndarray`` containing entries of 0 (false) and 1 (true). Taking the
mean yields the desired result.
.. code:: python
# Saved in the d2l package for later use
def accuracy(y_hat, y):
if y_hat.shape[1] > 1:
return float((y_hat.argmax(axis=1) == y.astype('float32')).sum())
else:
return float((y_hat.astype('int32') == y.astype('int32')).sum())
We will continue to use the variables ``y_hat`` and ``y`` defined in the
``pick`` function, as the predicted probability distribution and label,
respectively. We can see that the first example’s prediction category is
:math:`2` (the largest element of the row is 0.6 with an index of
:math:`2`), which is inconsistent with the actual label, :math:`0`. The
second example’s prediction category is :math:`2` (the largest element
of the row is :math:`0.5` with an index of :math:`2`), which is
consistent with the actual label, :math:`2`. Therefore, the
classification accuracy rate for these two examples is :math:`0.5`.
.. code:: python
y = np.array([0, 2])
accuracy(y_hat, y) / len(y)
.. parsed-literal::
:class: output
0.5
Similarly, we can evaluate the accuracy for model ``net`` on the dataset
(accessed via ``data_iter``).
.. code:: python
# Saved in the d2l package for later use
def evaluate_accuracy(net, data_iter):
metric = Accumulator(2) # num_corrected_examples, num_examples
for X, y in data_iter:
metric.add(accuracy(net(X), y), y.size)
return metric[0] / metric[1]
Here ``Accumulator`` is a utility class to accumulated sum over multiple
numbers.
.. code:: python
# Saved in the d2l package for later use
class Accumulator(object):
"""Sum a list of numbers over time"""
def __init__(self, n):
self.data = [0.0] * n
def add(self, *args):
self.data = [a+b for a, b in zip(self.data, args)]
def reset(self):
self.data = [0] * len(self.data)
def __getitem__(self, i):
return self.data[i]
Because we initialized the ``net`` model with random weights, the
accuracy of this model should be close to random guessing, i.e.,
:math:`0.1` for :math:`10` classes.
.. code:: python
evaluate_accuracy(net, test_iter)
.. parsed-literal::
:class: output
0.0811
Model Training
--------------
The training loop for softmax regression should look strikingly familiar
if you read through our implementation of linear regression in
:numref:`sec_linear_scratch`. Here we refactor the implementation to
make it reusable. First, we define a function to train for one data
epoch. Note that ``updater`` is general function to update the model
parameters, which accepts the batch size as an argument. It can be
either a wrapper of ``d2l.sgd`` or a Gluon trainer.
.. code:: python
# Saved in the d2l package for later use
def train_epoch_ch3(net, train_iter, loss, updater):
metric = Accumulator(3) # train_loss_sum, train_acc_sum, num_examples
if isinstance(updater, gluon.Trainer):
updater = updater.step
for X, y in train_iter:
# Compute gradients and update parameters
with autograd.record():
y_hat = net(X)
l = loss(y_hat, y)
l.backward()
updater(X.shape[0])
metric.add(float(l.sum()), accuracy(y_hat, y), y.size)
# Return training loss and training accuracy
return metric[0]/metric[2], metric[1]/metric[2]
Before showing the implementation of the training function, we define a
utility class that draw data in animation. Again, it aims to simplify
the codes in later chapters.
.. code:: python
# Saved in the d2l package for later use
class Animator(object):
def __init__(self, xlabel=None, ylabel=None, legend=[], xlim=None,
ylim=None, xscale='linear', yscale='linear', fmts=None,
nrows=1, ncols=1, figsize=(3.5, 2.5)):
"""Incrementally plot multiple lines."""
d2l.use_svg_display()
self.fig, self.axes = d2l.plt.subplots(nrows, ncols, figsize=figsize)
if nrows * ncols == 1:
self.axes = [self.axes, ]
# Use a lambda to capture arguments
self.config_axes = lambda: d2l.set_axes(
self.axes[0], xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
self.X, self.Y, self.fmts = None, None, fmts
def add(self, x, y):
"""Add multiple data points into the figure."""
if not hasattr(y, "__len__"):
y = [y]
n = len(y)
if not hasattr(x, "__len__"):
x = [x] * n
if not self.X:
self.X = [[] for _ in range(n)]
if not self.Y:
self.Y = [[] for _ in range(n)]
if not self.fmts:
self.fmts = ['-'] * n
for i, (a, b) in enumerate(zip(x, y)):
if a is not None and b is not None:
self.X[i].append(a)
self.Y[i].append(b)
self.axes[0].cla()
for x, y, fmt in zip(self.X, self.Y, self.fmts):
self.axes[0].plot(x, y, fmt)
self.config_axes()
display.display(self.fig)
display.clear_output(wait=True)
The training function then runs multiple epochs and visualize the
training progress.
.. code:: python
# Saved in the d2l package for later use
def train_ch3(net, train_iter, test_iter, loss, num_epochs, updater):
animator = Animator(xlabel='epoch', xlim=[1, num_epochs],
ylim=[0.3, 0.9],
legend=['train loss', 'train acc', 'test acc'])
for epoch in range(num_epochs):
train_metrics = train_epoch_ch3(net, train_iter, loss, updater)
test_acc = evaluate_accuracy(net, test_iter)
animator.add(epoch+1, train_metrics+(test_acc,))
Again, we use the minibatch stochastic gradient descent to optimize the
loss function of the model. Note that the number of epochs
(``num_epochs``), and learning rate (``lr``) are both adjustable
hyper-parameters. By changing their values, we may be able to increase
the classification accuracy of the model. In practice we will want to
split our data three ways into training, validation, and test data,
using the validation data to choose the best values of our
hyperparameters.
.. code:: python
num_epochs, lr = 10, 0.1
def updater(batch_size):
return d2l.sgd([W, b], lr, batch_size)
train_ch3(net, train_iter, test_iter, cross_entropy, num_epochs, updater)
.. figure:: output_softmax-regression-scratch_34a5f2_37_0.svg
Prediction
----------
Now that training is complete, our model is ready to classify some
images. Given a series of images, we will compare their actual labels
(first line of text output) and the model predictions (second line of
text output).
.. code:: python
# Saved in the d2l package for later use
def predict_ch3(net, test_iter, n=6):
for X, y in test_iter:
break
trues = d2l.get_fashion_mnist_labels(y)
preds = d2l.get_fashion_mnist_labels(net(X).argmax(axis=1))
titles = [true+'\n' + pred for true, pred in zip(trues, preds)]
d2l.show_images(X[0:n].reshape(n, 28, 28), 1, n, titles=titles[0:n])
predict_ch3(net, test_iter)
.. figure:: output_softmax-regression-scratch_34a5f2_39_0.svg
Summary
-------
With softmax regression, we can train models for multi-category
classification. The training loop is very similar to that in linear
regression: retrieve and read data, define models and loss functions,
then train models using optimization algorithms. As you will soon find
out, most common deep learning models have similar training procedures.
Exercises
---------
1. In this section, we directly implemented the softmax function based
on the mathematical definition of the softmax operation. What
problems might this cause (hint: try to calculate the size of
:math:`\exp(50)`)?
2. The function ``cross_entropy`` in this section is implemented
according to the definition of the cross-entropy loss function. What
could be the problem with this implementation (hint: consider the
domain of the logarithm)?
3. What solutions you can think of to fix the two problems above?
4. Is it always a good idea to return the most likely label. E.g. would
you do this for medical diagnosis?
5. Assume that we want to use softmax regression to predict the next
word based on some features. What are some problems that might arise
from a large vocabulary?
`Discussions `__
-------------------------------------------------
|image0|
.. |image0| image:: ../img/qr_softmax-regression-scratch.svg