.. _sec_minibatch_sgd:
Minibatch Stochastic Gradient Descent
=====================================
So far we encountered two extremes in the approach to gradient based
learning: :numref:`sec_gd` uses the full dataset to compute gradients
and to update parameters, one pass at a time. Conversely
:numref:`sec_sgd` processes one observation at a time to make
progress. Each of them has its own drawbacks. Gradient Descent is not
particularly *data efficient* whenever data is very similar. Stochastic
Gradient Descent is not particularly *computationally efficient* since
CPUs and GPUs cannot exploit the full power of vectorization. This
suggests that there might be a happy medium, and in fact, that’s what
we’ve been using so far in the examples we discussed.
Vectorization and Caches
------------------------
At the heart of the decision to use minibatches is computational
efficiency. This is most easily understood when considering
parallelization to multiple GPUs and multiple servers. In this case we
need to send at least one image to each GPU. With 8 GPUs per server and
16 servers we already arrive at a minibatch size of 128.
Things are a bit more subtle when it comes to single GPUs or even CPUs.
These devices have multiple types of memory, often multiple type of
compute units and different bandwidth constraints between them. For
instance, a CPU has a small number of registers and then L1, L2 and in
some cases even L3 cache (which is shared between the different
processor cores). These caches are of increasing size and latency (and
at the same time they’re of decreasing bandwidth). Suffice it to say,
the processor is capable of performing many more operations than what
the main memory interface is able to provide.
- A 2GHz CPU with 16 cores and AVX-512 vectorization can process up to
:math:`2 \cdot 10^9 \cdot 16 \cdot 32 = 10^{12}` bytes per second.
The capability of GPUs easily exceeds this number by a factor of 100.
On the other hand, a midrange server processor might not have much
more than 100 GB/s bandwidth, i.e., less than one tenth of what would
be required to keep the processor fed. To make matters worse, not all
memory access is created equal: first, memory interfaces are
typically 64 bit wide or wider (e.g., on GPUs up to 384 bit), hence
reading a single byte incurs the cost of a much wider access.
- There is significant overhead for the first access whereas sequential
access is relatively cheap (this is often called a burst read). There
are many more things to keep in mind, such as caching when we have
multiple sockets, chiplets and other structures. A detailed
discussion of this is beyond the scope of this section. See e.g.,
this `Wikipedia
article `__ for a more
in-depth discussion.
The way to alleviate these constraints is to use a hierarchy of CPU
caches which are actually fast enough to supply the processor with data.
This is *the* driving force behind batching in deep learning. To keep
matters simple, consider matrix-matrix multiplication, say
:math:`\mathbf{A} = \mathbf{B}\mathbf{C}`. We have a number of options
for calculating :math:`\mathbf{A}`. For instance we could try the
following:
1. We could compute
:math:`\mathbf{A}_{ij} = \mathbf{B}_{i,:} \mathbf{C}_{:,j}^\top`,
i.e., we could compute it element-wise by means of dot products.
2. We could compute
:math:`\mathbf{A}_{:,j} = \mathbf{B} \mathbf{C}_{:,j}^\top`, i.e., we
could compute it one column at a time. Likewise we could compute
:math:`\mathbf{A}` one row :math:`\mathbf{A}_{i,:}` at a time.
3. We could simply compute :math:`\mathbf{A} = \mathbf{B} \mathbf{C}`.
4. We could break :math:`\mathbf{B}` and :math:`\mathbf{C}` into smaller
block matrices and compute :math:`\mathbf{A}` one block at a time.
If we follow the first option, we will need to copy one row and one
column vector into the CPU each time we want to compute an element
:math:`\mathbf{A}_{ij}`. Even worse, due to the fact that matrix
elements are aligned sequentially we are thus required to access many
disjoint locations for one of the two vectors as we read them from
memory. The second option is much more favorable. In it, we are able to
keep the column vector :math:`\mathbf{C}_{:,j}` in the CPU cache while
we keep on traversing through :math:`B`. This halves the memory
bandwidth requirement with correspondingly faster access. Of course,
option 3 is most desirable. Unfortunately, most matrices might not
entirely fit into cache (this is what we’re discussing after all).
However, option 4 offers a practically useful alternative: we can move
blocks of the matrix into cache and multiply them locally. Optimized
libraries take care of this for us. Let’s have a look at how efficient
these operations are in practice.
Beyond computational efficiency, the overhead introduced by Python and
by the deep learning framework itself is considerable. Recall that each
time we execute a command the Python interpreter sends a command to the
MXNet engine which needs to insert it into the compute graph and deal
with it during scheduling. Such overhead can be quite detrimental. In
short, it is highly advisable to use vectorization (and matrices)
whenever possible.
.. code:: python
%matplotlib inline
import d2l
from mxnet import autograd, gluon, init, np, npx
from mxnet.gluon import nn
npx.set_np()
timer = d2l.Timer()
A = np.zeros((256, 256))
B = np.random.normal(0, 1, (256, 256))
C = np.random.normal(0, 1, (256, 256))
Element-wise assignment simply iterates over all rows and columns of
:math:`\mathbf{B}` and :math:`\mathbf{C}` respectively to assign the
value to :math:`\mathbf{A}`.
.. code:: python
# Compute A = B C one element at a time
timer.start()
for i in range(256):
for j in range(256):
A[i, j] = np.dot(B[i, :], C[:, j])
A.wait_to_read()
timer.stop()
.. parsed-literal::
:class: output
58.11607789993286
A faster strategy is to perform column-wise assignment.
.. code:: python
# Compute A = B C one column at a time
timer.start()
for j in range(256):
A[:, j] = np.dot(B, C[:, j])
A.wait_to_read()
timer.stop()
.. parsed-literal::
:class: output
0.1943495273590088
Last, the most effective manner is to perform the entire operation in
one block. Let’s see what the respective speed of the operations is.
.. code:: python
# Compute A = B C in one go
timer.start()
A = np.dot(B, C)
A.wait_to_read()
timer.stop()
# Multiply and add count as separate operations (fused in practice)
gigaflops = [2/i for i in timer.times]
print("Performance in Gigaflops: element {:.3f}, \
column {:.3f}, full {:.3f}".format(*gigaflops))
.. parsed-literal::
:class: output
Performance in Gigaflops: element 0.034, column 10.291, full 3070.501
.. _sec_minibatches:
Minibatches
-----------
In the past we took it for granted that we would read *minibatches* of
data rather than single observations to update parameters. We now give a
brief justification for it. Processing single observations requires us
to perform many single matrix-vector (or even vector-vector)
multiplications, which is quite expensive and which incurs a significant
overhead on behalf of the underlying deep learning framework. This
applies both to evaluating a network when applied to data (often
referred to as inference) and when computing gradients to update
parameters. That is, this applies whenever we perform
:math:`\mathbf{w} \leftarrow \mathbf{w} - \eta_t \mathbf{g}_t` where
.. math:: \mathbf{g}_t = \partial_{\mathbf{w}} f(\mathbf{x}_{t}, \mathbf{w})
We can increase the *computational* efficiency of this operation by
applying it to a minibatch of observations at a time. That is, we
replace the gradient :math:`\mathbf{g}_t` over a single observation by
one over a small batch
.. math:: \mathbf{g}_t = \partial_{\mathbf{w}} \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} f(\mathbf{x}_{i}, \mathbf{w})
Let’s see what this does to the statistical properties of
:math:`\mathbf{g}_t`: since both :math:`\mathbf{x}_t` and also all
elements of the minibatch :math:`\mathcal{B}_t` are drawn uniformly at
random from the training set, the expectation of the gradient remains
unchanged. The variance, on the other hand, is reduced significantly.
Since the minibatch gradient is composed of :math:`b := |\mathcal{B}_t|`
independent gradients which are being averaged, its standard deviation
is reduced by a factor of :math:`b^{-\frac{1}{2}}`. This, by itself, is
a good thing, since it means that the updates are more reliably aligned
with the full gradient.
Naively this would indicate that choosing a large minibatch
:math:`\mathcal{B}_t` would be universally desirable. Alas, after some
point, the additional reduction in standard deviation is minimal when
compared to the linear increase in computational cost. In practice we
pick a minibatch that is large enough to offer good computational
efficiency while still fitting into the memory of a GPU. To illustrate
the savings let’s have a look at some code. In it we perform the same
matrix-matrix multiplication, but this time broken up into “minibatches”
of 64 columns at a time.
.. code:: python
timer.start()
for j in range(0, 256, 64):
A[:, j:j+64] = np.dot(B, C[:, j:j+64])
timer.stop()
print("Performance in Gigaflops: block {:.3f}".format(2/timer.times[3]))
.. parsed-literal::
:class: output
Performance in Gigaflops: block 534.272
As we can see, the computation on the minibatch is essentially as
efficient as on the full matrix. A word of caution is in order. In
:numref:`sec_batch_norm` we used a type of regularization that was
heavily dependent on the amount of variance in a minibatch. As we
increase the latter, the variance decreases and with it the benefit of
the noise-injection due to batch normalization. See e.g.,
:cite:`Ioffe.2017` for details on how to rescale and compute the
appropriate terms.
Reading the Dataset
-------------------
Let’s have a look at how minibatches are efficiently generated from
data. In the following we use a dataset developed by NASA to test the
wing `noise from different
aircraft `__
to compare these optimization algorithms. For convenience we only use
the first :math:`1,500` examples. The data is whitened for
preprocessing, i.e., we remove the mean and rescale the variance to
:math:`1` per coordinate.
.. code:: python
# Saved in the d2l package for later use
d2l.DATA_HUB['airfoil'] = (d2l.DATA_URL + 'airfoil_self_noise.dat',
'76e5be1548fd8222e5074cf0faae75edff8cf93f')
# Saved in the d2l package for later use
def get_data_ch11(batch_size=10, n=1500):
data = np.genfromtxt(d2l.download('airfoil'),
dtype=np.float32, delimiter='\t')
data = (data - data.mean(axis=0)) / data.std(axis=0)
data_iter = d2l.load_array(
(data[:n, :-1], data[:n, -1]), batch_size, is_train=True)
return data_iter, data.shape[1]-1
Implementation from Scratch
---------------------------
Recall the minibatch SGD implementation from
:numref:`sec_linear_scratch`. In the following we provide a slightly
more general implementation. For convenience it has the same call
signature as the other optimization algorithms introduced later in this
chapter. Specifically, we add the status input ``states`` and place the
hyperparameter in dictionary ``hyperparams``. In addition, we will
average the loss of each minibatch example in the training function, so
the gradient in the optimization algorithm does not need to be divided
by the batch size.
.. code:: python
def sgd(params, states, hyperparams):
for p in params:
p[:] -= hyperparams['lr'] * p.grad
Next, we implement a generic training function to facilitate the use of
the other optimization algorithms introduced later in this chapter. It
initializes a linear regression model and can be used to train the model
with minibatch SGD and other algorithms introduced subsequently.
.. code:: python
# Saved in the d2l package for later use
def train_ch11(trainer_fn, states, hyperparams, data_iter,
feature_dim, num_epochs=2):
# Initialization
w = np.random.normal(scale=0.01, size=(feature_dim, 1))
b = np.zeros(1)
w.attach_grad()
b.attach_grad()
net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
# Train
animator = d2l.Animator(xlabel='epoch', ylabel='loss',
xlim=[0, num_epochs], ylim=[0.22, 0.35])
n, timer = 0, d2l.Timer()
for _ in range(num_epochs):
for X, y in data_iter:
with autograd.record():
l = loss(net(X), y).mean()
l.backward()
trainer_fn([w, b], states, hyperparams)
n += X.shape[0]
if n % 200 == 0:
timer.stop()
animator.add(n/X.shape[0]/len(data_iter),
(d2l.evaluate_loss(net, data_iter, loss),))
timer.start()
print('loss: %.3f, %.3f sec/epoch' % (animator.Y[0][-1], timer.avg()))
return timer.cumsum(), animator.Y[0]
Let’s see how optimization proceeds for batch gradient descent. This can
be achieved by setting the minibatch size to 1500 (i.e., to the total
number of examples). As a result the model parameters are updated only
once per epoch. There is little progress. In fact, after 6 steps
progress stalls.
.. code:: python
def train_sgd(lr, batch_size, num_epochs=2):
data_iter, feature_dim = get_data_ch11(batch_size)
return train_ch11(
sgd, None, {'lr': lr}, data_iter, feature_dim, num_epochs)
gd_res = train_sgd(1, 1500, 10)
.. parsed-literal::
:class: output
loss: 0.254, 0.066 sec/epoch
.. figure:: output_minibatch-sgd_356976_17_1.svg
When the batch size equals 1, we use SGD for optimization. For
simplicity of implementation we picked a constant (albeit small)
learning rate. In SGD, the model parameters are updated whenever an
example is processed. In our case this amounts to 1500 updates per
epoch. As we can see, the decline in the value of the objective function
slows down after one epoch. Although both the procedures processed 1500
examples within one epoch, SGD consumes more time than gradient descent
in our experiment. This is because SGD updated the parameters more
frequently and since it is less efficient to process single observations
one at a time.
.. code:: python
sgd_res = train_sgd(0.005, 1)
.. parsed-literal::
:class: output
loss: 0.247, 0.359 sec/epoch
.. figure:: output_minibatch-sgd_356976_19_1.svg
Last, when the batch size equals 100, we use minibatch SGD for
optimization. The time required per epoch is longer than the time needed
for SGD and the time for batch gradient descent.
.. code:: python
mini1_res = train_sgd(.4, 100)
.. parsed-literal::
:class: output
loss: 0.245, 0.008 sec/epoch
.. figure:: output_minibatch-sgd_356976_21_1.svg
Reducing the batch size to 10, the time for each epoch increases because
the workload for each batch is less efficient to execute.
.. code:: python
mini2_res = train_sgd(.05, 10)
.. parsed-literal::
:class: output
loss: 0.244, 0.036 sec/epoch
.. figure:: output_minibatch-sgd_356976_23_1.svg
Finally, we compare the time versus loss for the preview four
experiments. As can be seen, despite SGD converges faster than GD in
terms of number of examples processed, it uses more time to reach the
same loss than GD because that computing gradient example by example is
not efficient. Minibatch SGD is able to trade-off the convergence speed
and computation efficiency. A minibatch size 10 is more efficient than
SGD; a minibatch size 100 even outperforms GD in terms of runtime.
.. code:: python
d2l.set_figsize([6, 3])
d2l.plot(*list(map(list, zip(gd_res, sgd_res, mini1_res, mini2_res))),
'time (sec)', 'loss', xlim=[1e-2, 10],
legend=['gd', 'sgd', 'batch size=100', 'batch size=10'])
d2l.plt.gca().set_xscale('log')
.. figure:: output_minibatch-sgd_356976_25_0.svg
Concise Implementation
----------------------
In Gluon, we can use the ``Trainer`` class to call optimization
algorithms. This is used to implement a generic training function. We
will use this throughout the current chapter.
.. code:: python
# Saved in the d2l package for later use
def train_gluon_ch11(tr_name, hyperparams, data_iter, num_epochs=2):
# Initialization
net = nn.Sequential()
net.add(nn.Dense(1))
net.initialize(init.Normal(sigma=0.01))
trainer = gluon.Trainer(net.collect_params(), tr_name, hyperparams)
loss = gluon.loss.L2Loss()
animator = d2l.Animator(xlabel='epoch', ylabel='loss',
xlim=[0, num_epochs], ylim=[0.22, 0.35])
n, timer = 0, d2l.Timer()
for _ in range(num_epochs):
for X, y in data_iter:
with autograd.record():
l = loss(net(X), y)
l.backward()
trainer.step(X.shape[0])
n += X.shape[0]
if n % 200 == 0:
timer.stop()
animator.add(n/X.shape[0]/len(data_iter),
(d2l.evaluate_loss(net, data_iter, loss),))
timer.start()
print('loss: %.3f, %.3f sec/epoch' % (animator.Y[0][-1], timer.avg()))
Using Gluon to repeat the last experiment shows identical behavior.
.. code:: python
data_iter, _ = get_data_ch11(10)
train_gluon_ch11('sgd', {'learning_rate': 0.05}, data_iter)
.. parsed-literal::
:class: output
loss: 0.244, 0.041 sec/epoch
.. figure:: output_minibatch-sgd_356976_29_1.svg
Summary
-------
- Vectorization makes code more efficient due to reduced overhead
arising from the deep learning framework and due to better memory
locality and caching on CPUs and GPUs.
- There is a trade-off between statistical efficiency arising from SGD
and computational efficiency arising from processing large batches of
data at a time.
- Minibatch stochastic gradient descent offers the best of both worlds:
computational and statistical efficiency.
- In minibatch SGD we process batches of data obtained by a random
permutation of the training data (i.e., each observation is processed
only once per epoch, albeit in random order).
- It is advisable to decay the learning rates during training.
- In general, minibatch SGD is faster than SGD and gradient descent for
convergence to a smaller risk, when measured in terms of clock time.
Exercises
---------
1. Modify the batch size and learning rate and observe the rate of
decline for the value of the objective function and the time consumed
in each epoch.
2. Read the MXNet documentation and use the ``Trainer`` class
``set_learning_rate`` function to reduce the learning rate of the
minibatch SGD to 1/10 of its previous value after each epoch.
3. Compare minibatch SGD with a variant that actually *samples with
replacement* from the training set. What happens?
4. An evil genie replicates your dataset without telling you (i.e., each
observation occurs twice and your dataset grows to twice its original
size, but nobody told you). How does the behavior of SGD, minibatch
SGD and that of gradient descent change?
`Discussions `__
-------------------------------------------------
|image0|
.. |image0| image:: ../img/qr_minibatch-sgd.svg