# 10.5. Mini-batch Stochastic Gradient Descent¶

In each iteration, the gradient descent uses the entire training data set to compute the gradient, so it is sometimes referred to as batch gradient descent. Stochastic gradient descent (SGD) only randomly select one example in each iteration to compute the gradient. Just like in the previous chapters, we can perform random uniform sampling for each iteration to form a mini-batch and then use this mini-batch to compute the gradient. Now, we are going to discuss mini-batch stochastic gradient descent.

Set objective function $$f(\boldsymbol{x}): \mathbb{R}^d \rightarrow \mathbb{R}$$. The time step before the start of iteration is set to 0. The independent variable of this time step is $$\boldsymbol{x}_0\in \mathbb{R}^d$$ and is usually obtained by random initialization. In each subsequent time step $$t>0$$, mini-batch SGD uses random uniform sampling to get a mini-batch $$\mathcal{B}_t$$ made of example indices from the training data set. We can use sampling with replacement or sampling without replacement to get a mini-batch example. The former method allows duplicate examples in the same mini-batch, the latter does not and is more commonly used. We can use either of the two methods

(10.5.1)$\boldsymbol{g}_t \leftarrow \nabla f_{\mathcal{B}_t}(\boldsymbol{x}_{t-1}) = \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t}\nabla f_i(\boldsymbol{x}_{t-1})$

to compute the gradient $$\boldsymbol{g}_t$$ of the objective function at $$\boldsymbol{x}_{t-1}$$ with mini-batch $$\mathcal{B}_t$$ at time step $$t$$. Here, $$|\mathcal{B}|$$ is the size of the batch, which is the number of examples in the mini-batch. This is a hyper-parameter. Just like the stochastic gradient, the mini-batch SGD $$\boldsymbol{g}_t$$ obtained by sampling with replacement is also the unbiased estimate of the gradient $$\nabla f(\boldsymbol{x}_{t-1})$$. Given the learning rate $$\eta_t$$ (positive), the iteration of the mini-batch SGD on the independent variable is as follows:

(10.5.2)$\boldsymbol{x}_t \leftarrow \boldsymbol{x}_{t-1} - \eta_t \boldsymbol{g}_t.$

The variance of the gradient based on random sampling cannot be reduced during the iterative process, so in practice, the learning rate of the (mini-batch) SGD can self-decay during the iteration, such as $$\eta_t=\eta t^\alpha$$ (usually $$\alpha=-1$$ or $$-0.5$$), $$\eta_t = \eta \alpha^t$$ (e.g $$\alpha=0.95$$), or learning rate decay once per iteration or after several iterations. As a result, the variance of the learning rate and the (mini-batch) SGD will decrease. Gradient descent always uses the true gradient of the objective function during the iteration, without the need to self-decay the learning rate.

The cost for computing each iteration is $$\mathcal{O}(|\mathcal{B}|)$$. When the batch size is 1, the algorithm is an SGD; when the batch size equals the example size of the training data, the algorithm is a gradient descent. When the batch size is small, fewer examples are used in each iteration, which will result in parallel processing and reduce the RAM usage efficiency. This makes it more time consuming to compute examples of the same size than using larger batches. When the batch size increases, each mini-batch gradient may contain more redundant information. To get a better solution, we need to compute more examples for a larger batch size, such as increasing the number of epochs.

In this chapter, we will use a data set developed by NASA to test the wing noise from different aircraft to compare these optimization algorithms[1]. We will use the first 1500 examples of the data set, 5 features, and a normalization method to preprocess the data.

%matplotlib inline
import d2l
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import nn
import numpy as np

# Save to the d2l package.
def get_data_ch10(batch_size=10, n=1500):
data = np.genfromtxt('../data/airfoil_self_noise.dat', delimiter='\t')
data = nd.array((data - data.mean(axis=0)) / data.std(axis=0))
data_iter = d2l.load_array((data[:n, :-1], data[:n, -1]),
batch_size, is_train=True)
return data_iter, data.shape[1]-1


## 10.5.2. Implementation from Scratch¶

We have already implemented the mini-batch SGD algorithm in the Section 3.2. We have made its input parameters more generic here, so that we can conveniently use the same input for the other optimization algorithms introduced later in this chapter. Specifically, we add the status input states and place the hyper-parameter in dictionary hyperparams. In addition, we will average the loss of each mini-batch example in the training function, so the gradient in the optimization algorithm does not need to be divided by the batch size.

def sgd(params, states, hyperparams):
for p in params:


Next, we are going to implement a generic training function to facilitate the use of the other optimization algorithms introduced later in this chapter. It initializes a linear regression model and can then be used to train the model with the mini-batch SGD and other algorithms introduced in subsequent sections.

# Save to the d2l package.
def train_ch10(trainer_fn, states, hyperparams, data_iter,
feature_dim, num_epochs=2):
# Initialization
w = nd.random.normal(scale=0.01, shape=(feature_dim, 1))
b = nd.zeros(1)
net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
# Train
animator = d2l.Animator(xlabel='epoch', ylabel='loss',
xlim=[0, num_epochs], ylim=[0.22, 0.35])
n, timer = 0, d2l.Timer()
for _ in range(num_epochs):
for X, y in data_iter:
l = loss(net(X), y).mean()
l.backward()
trainer_fn([w, b], states, hyperparams)
n += X.shape[0]
if n % 200 == 0:
timer.stop()
d2l.evaluate_loss(net, data_iter, loss))
timer.start()
print('loss: %.3f, %.3f sec/epoch'%(animator.Y[0][-1], timer.avg()))
return timer.cumsum(), animator.Y[0]


When the batch size equals 1500 (the total number of examples), we use gradient descent for optimization. The model parameters will be iterated only once for each epoch of the gradient descent. As we can see, the downward trend of the value of the objective function (training loss) flattened out after 6 iterations.

def train_sgd(lr, batch_size, num_epochs=2):
data_iter, feature_dim = get_data_ch10(batch_size)
return train_ch10(
sgd, None, {'lr': lr}, data_iter, feature_dim, num_epochs)

gd_res = train_sgd(1, 1500, 6)

loss: 0.246, 0.057 sec/epoch


When the batch size equals 1, we use SGD for optimization. In order to simplify the implementation, we did not self-decay the learning rate. Instead, we simply used a small constant for the learning rate in the (mini-batch) SGD experiment. In SGD, the independent variable (model parameter) is updated whenever an example is processed. Thus it is updated 1500 times in one epoch. As we can see, the decline in the value of the objective function slows down after one epoch.

Although both the procedures processed 1500 examples within one epoch, SGD consumes more time than gradient descent in our experiment. This is because SGD performed more iterations on the independent variable within one epoch, and it is harder for single-example gradient computation to use parallel computing effectively.

sgd_res = train_sgd(0.005, 1)

loss: 0.244, 0.264 sec/epoch


When the batch size equals 100, we use mini-batch SGD for optimization. The time required for one epoch is between the time needed for gradient descent and SGD to complete the same epoch.

mini1_res = train_sgd(.4, 100)

loss: 0.251, 0.006 sec/epoch


Reduce the batch size to 10, the time for each epoch increases because the workload for each batch is less efficient to execute.

mini2_res = train_sgd(.05, 10)

loss: 0.245, 0.026 sec/epoch


Finally, we compare the time versus loss for the preview four experiments. As can be seen, despite SGD converges faster than GD in terms of number of examples processed, it uses more time to reach the same loss than GD because that computing gradient example by example is not efficient. Mini-batch SGD is able to trade-off the convergence speed and computation efficiency. Here, a batch size 10 improves SGD, and a batch size 100 even outperforms GD.

d2l.set_figsize([6, 3])
d2l.plot(*list(map(list, zip(gd_res, sgd_res, mini1_res, mini2_res))),
'time (sec)', 'loss', xlim=[1e-2, 10],
legend=['gd', 'sgd', 'batch size=100', 'batch size=10'])
d2l.plt.gca().set_xscale('log')


## 10.5.3. Concise Implementation¶

In Gluon, we can use the Trainer class to call optimization algorithms. Next, we are going to implement a generic training function that uses the optimization name trainer name and hyperparameter trainer_hyperparameter to create the instance Trainer.

# Save to the d2l package.
def train_gluon_ch10(trainer_name, trainer_hyperparams,
data_iter, num_epochs=2):
# Initialization
net = nn.Sequential()
net.initialize(init.Normal(sigma=0.01))
trainer = gluon.Trainer(
net.collect_params(), trainer_name, trainer_hyperparams)
loss = gluon.loss.L2Loss()
animator = d2l.Animator(xlabel='epoch', ylabel='loss',
xlim=[0, num_epochs], ylim=[0.22, 0.35])
n, timer = 0, d2l.Timer()
for _ in range(num_epochs):
for X, y in data_iter:
l = loss(net(X), y)
l.backward()
trainer.step(X.shape[0])
n += X.shape[0]
if n % 200 == 0:
timer.stop()
d2l.evaluate_loss(net, data_iter, loss))
timer.start()
print('loss: %.3f, %.3f sec/epoch'%(animator.Y[0][-1], timer.avg()))
return timer.cumsum(), animator.Y[0]


Use Gluon to repeat the last experiment.

data_iter, _ = get_data_ch10(10)
train_gluon_ch10('sgd', {'learning_rate': 0.05}, data_iter);

loss: 0.243, 0.025 sec/epoch


## 10.5.4. Summary¶

• Mini-batch stochastic gradient uses random uniform sampling to get a mini-batch training example for gradient computation.
• In practice, learning rates of the (mini-batch) SGD can self-decay during iteration.
• In general, the time consumption per epoch for mini-batch stochastic gradient is between what takes for gradient descent and SGD to complete the same epoch.

## 10.5.5. Exercises¶

• Modify the batch size and learning rate and observe the rate of decline for the value of the objective function and the time consumed in each epoch.
• Read the MXNet documentation and use the Trainer class set_learning_rate function to reduce the learning rate of the mini-batch SGD to 1/10 of its previous value after each epoch.

## 10.5.6. Reference¶

[1] Aircraft wing noise data set. https://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise