12.4. Stochastic Gradient Descent
In earlier chapters we kept using stochastic gradient descent in our training procedure without, however, explaining why it works. To shed some light on it, we described the basic principles of gradient descent in Section 12.3. In this section, we discuss stochastic gradient descent in greater detail.
12.4.1. Stochastic Gradient Updates
In deep learning, the objective function is usually the average of the loss functions for each example in the training dataset. Given a training dataset of $n$ examples, we assume that $f_i(\mathbf{x})$ is the loss function with respect to the training example of index $i$, where $\mathbf{x}$ is the parameter vector. Then we arrive at the objective function

$$f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n f_i(\mathbf{x}).$$

The gradient of the objective function at $\mathbf{x}$ is computed as

$$\nabla f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x}).$$

If gradient descent is used, the computational cost for each independent variable iteration is $\mathcal{O}(n)$, which grows linearly with $n$. Therefore, when the training dataset is larger, the cost of gradient descent for each iteration will be higher.

Stochastic gradient descent (SGD) reduces the computational cost at each iteration. At each iteration of stochastic gradient descent, we uniformly sample an index $i \in \{1, \ldots, n\}$ for data examples at random, and compute the gradient $\nabla f_i(\mathbf{x})$ to update $\mathbf{x}$:

$$\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f_i(\mathbf{x}),$$

where $\eta$ is the learning rate. We can see that the computational cost for each iteration drops from $\mathcal{O}(n)$ for gradient descent to the constant $\mathcal{O}(1)$. Moreover, the stochastic gradient $\nabla f_i(\mathbf{x})$ is an unbiased estimate of the full gradient $\nabla f(\mathbf{x})$ because

$$\mathbb{E}_i \nabla f_i(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x}) = \nabla f(\mathbf{x}).$$

This means that, on average, the stochastic gradient is a good estimate of the gradient.
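As a quick numerical sanity check (purely illustrative; the tiny least-squares dataset below is made up), we can verify that the per-example gradients average exactly to the full gradient, while a single uniformly sampled index yields a noisy but unbiased estimate:

import torch

# Tiny synthetic least-squares problem: f_i(w) = 0.5 * (x_i * w - y_i)^2
x = torch.tensor([1.0, 2.0, 3.0, 4.0])
y = torch.tensor([2.0, 3.9, 6.1, 8.2])
w = torch.tensor(1.5)

per_example_grads = (x * w - y) * x      # Gradient of each f_i at w
full_grad = per_example_grads.mean()     # Gradient of the average loss
i = torch.randint(0, len(x), (1,)).item()  # Uniformly sampled index
# Averaging the stochastic gradient over a uniform index i recovers full_grad
print(f'full gradient of the average loss: {full_grad.item():.4f}')
print(f'stochastic gradient at sampled index {i}: {per_example_grads[i].item():.4f}')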
Now, we will compare it with gradient descent by adding random noise with a mean of 0 and a variance of 1 to the gradient, simulating stochastic gradient descent.
import torch
from d2l import torch as d2l

def f(x1, x2):  # Objective function
    return x1 ** 2 + 2 * x2 ** 2

def f_grad(x1, x2):  # Gradient of the objective function
    return 2 * x1, 4 * x2

def sgd(x1, x2, s1, s2, f_grad):
    g1, g2 = f_grad(x1, x2)
    # Simulate noisy gradient
    g1 += torch.normal(0.0, 1, (1,)).item()
    g2 += torch.normal(0.0, 1, (1,)).item()
    eta_t = eta * lr()
    return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0)

def constant_lr():  # Constant learning rate
    return 1

eta = 0.1
lr = constant_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))
epoch 50, x1: 0.225517, x2: -0.076646
As we can see, the trajectory of the variables in stochastic gradient descent is much more noisy than the one we observed in gradient descent in Section 12.3. This is due to the stochastic nature of the gradient. That is, even when we arrive near the minimum, we are still subject to the uncertainty injected by the instantaneous gradient via $\eta \nabla f_i(\mathbf{x})$, and the quality of the solution does not improve with additional steps. The only remedy is to reduce the learning rate dynamically as optimization progresses.

This is also the reason for adding a learning rate function lr into the sgd step function. In the example above any functionality for learning rate scheduling lies dormant, as we set the associated lr function to be constant.
12.4.2. Dynamic Learning Rate
Replacing $\eta$ with a time-dependent learning rate $\eta(t)$ adds to the complexity of controlling convergence of an optimization algorithm. In particular, we need to figure out how rapidly $\eta$ should decay. If it is too quick, we will stop optimizing prematurely. If we decrease it too slowly, we waste too much time on optimization. A few basic strategies for adjusting $\eta$ over time are:

$$\begin{aligned}
    \eta(t) & = \eta_i \text{ if } t_i \leq t \leq t_{i+1}  && \text{piecewise constant} \\
    \eta(t) & = \eta_0 \cdot e^{-\lambda t} && \text{exponential decay} \\
    \eta(t) & = \eta_0 \cdot (\beta t + 1)^{-\alpha} && \text{polynomial decay}
\end{aligned}$$

In the first, piecewise constant scenario we decrease the learning rate, e.g., whenever progress in optimization stalls. This is a common strategy for training deep networks. Alternatively we could decrease it much more aggressively by an exponential decay. Unfortunately this often leads to premature stopping before the algorithm has converged. A popular choice is polynomial decay with $\alpha = 0.5$, which is known to be well behaved in the convex case. Let's see what the exponential decay looks like in practice.
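One way to implement this is to replace the constant lr with a function that decays exponentially in a global step counter t. The following is a minimal sketch (the decay constant 0.1 is an illustrative choice), reusing f, f_grad, sgd, and eta from above:

import math

def exponential_lr():
    # Global step counter, defined outside this function and updated inside it
    global t
    t += 1
    return math.exp(-0.1 * t)

t = 1
lr = exponential_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=1000, f_grad=f_grad))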
epoch 1000, x1: -0.758829, x2: -0.115584
As expected, the variance in the parameters is significantly reduced. However, this comes at the expense of failing to converge to the optimal solution $\mathbf{x} = (0, 0)$: even after 1000 iteration steps we are still far away from it, since the learning rate decays too quickly. If instead we use a polynomial decay where the learning rate decays with the inverse square root of the number of steps, convergence improves after only 50 steps.
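Such an inverse-square-root schedule can be sketched analogously (again, the constant 0.1 inside the decay term is an illustrative choice):

def polynomial_lr():
    # Global step counter, defined outside this function and updated inside it
    global t
    t += 1
    return (1 + 0.1 * t) ** (-0.5)

t = 1
lr = polynomial_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))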
epoch 50, x1: 0.144834, x2: 0.041688
There exist many more choices for how to set the learning rate. For instance, we could start with a small rate, then rapidly ramp up, and then decrease it again, albeit more slowly (see the sketch below). We could even alternate between smaller and larger learning rates. There exists a large variety of such schedules. For now let's focus on learning rate schedules for which a comprehensive theoretical analysis is possible, i.e., on learning rates in a convex setting. For general nonconvex problems it is very difficult to obtain meaningful convergence guarantees, since in general minimizing nonlinear nonconvex problems is NP-hard. For a survey see, e.g., the excellent lecture notes of Tibshirani (2015).
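For instance, a ramp-up-then-decay schedule of the kind just mentioned can be expressed as a small function of the step counter; the constants below are arbitrary illustrative choices, not a recommendation:

def warmup_then_decay_lr(t, warmup_steps=50, eta_max=0.5):
    # Linearly increase the learning rate during warmup, then decay as 1/sqrt(t)
    if t < warmup_steps:
        return eta_max * t / warmup_steps
    return eta_max * (warmup_steps / t) ** 0.5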
12.4.3. Convergence Analysis for Convex Objectives
The following convergence analysis of stochastic gradient descent for convex objective functions is optional and primarily serves to convey more intuition about the problem. We limit ourselves to one of the simplest proofs (Nesterov and Vial, 2000). Significantly more advanced proof techniques exist, e.g., whenever the objective function is particularly well behaved.
Suppose that the objective function $f(\boldsymbol{\xi}, \mathbf{x})$ is convex in $\mathbf{x}$ for all $\boldsymbol{\xi}$. More specifically, consider the stochastic gradient descent update

$$\mathbf{x}_{t+1} = \mathbf{x}_{t} - \eta_t \partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x}_t),$$

where $f(\boldsymbol{\xi}_t, \mathbf{x}_t)$ is the objective function with respect to the training example $\boldsymbol{\xi}_t$ drawn from some distribution at step $t$ and $\mathbf{x}_t$ is the model parameter. Denote by

$$R(\mathbf{x}) = E_{\boldsymbol{\xi}}[f(\boldsymbol{\xi}, \mathbf{x})]$$

the expected risk and by $R^*$ its minimum with regard to $\mathbf{x}$. Last, let $\mathbf{x}^*$ be the minimizer (we assume that it exists within the domain where $\mathbf{x}$ is defined). In this case we can track the distance between the current parameter $\mathbf{x}_t$ at time $t$ and the risk minimizer $\mathbf{x}^*$ and see whether it improves over time:

$$\begin{aligned} \|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2 & = \|\mathbf{x}_{t} - \eta_t \partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x}_t) - \mathbf{x}^*\|^2 \\ & = \|\mathbf{x}_{t} - \mathbf{x}^*\|^2 + \eta_t^2 \|\partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x}_t)\|^2 - 2 \eta_t \left\langle \mathbf{x}_t - \mathbf{x}^*, \partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x}_t)\right\rangle. \end{aligned} \tag{12.4.8}$$

We assume that the $\ell_2$ norm of the stochastic gradient $\partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x}_t)$ is bounded by some constant $L$, hence we have that

$$\eta_t^2 \|\partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x}_t)\|^2 \leq \eta_t^2 L^2. \tag{12.4.9}$$

We are mostly interested in how the distance between $\mathbf{x}_t$ and $\mathbf{x}^*$ changes in expectation. In fact, for any specific sequence of steps the distance might well increase, depending on whichever $\boldsymbol{\xi}_t$ we encounter. Hence we need to bound the inner product. By convexity we have

$$f(\boldsymbol{\xi}_t, \mathbf{x}^*) \geq f(\boldsymbol{\xi}_t, \mathbf{x}_t) + \left\langle \mathbf{x}^* - \mathbf{x}_t, \partial_{\mathbf{x}} f(\boldsymbol{\xi}_t, \mathbf{x}_t) \right\rangle. \tag{12.4.10}$$

Plugging both inequalities (12.4.9) and (12.4.10) into (12.4.8) we obtain a bound on the distance between parameters at time $t+1$:

$$\|\mathbf{x}_{t} - \mathbf{x}^*\|^2 - \|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2 \geq 2 \eta_t (f(\boldsymbol{\xi}_t, \mathbf{x}_t) - f(\boldsymbol{\xi}_t, \mathbf{x}^*)) - \eta_t^2 L^2. \tag{12.4.11}$$

This means that we make progress as long as the difference between the current loss and the optimal loss outweighs $\eta_t L^2/2$. Since this difference is bound to converge to zero, it follows that the learning rate $\eta_t$ also needs to vanish.

Next we take expectations over (12.4.11). This yields

$$E\left[\|\mathbf{x}_{t} - \mathbf{x}^*\|^2\right] - E\left[\|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2\right] \geq 2 \eta_t [E[R(\mathbf{x}_t)] - R^*] - \eta_t^2 L^2.$$

The last step involves summing over the inequalities for $t \in \{1, \ldots, T\}$. Since the sum telescopes, and by dropping the lower term, we obtain

$$\|\mathbf{x}_1 - \mathbf{x}^*\|^2 \geq 2 \sum_{t=1}^T \eta_t [E[R(\mathbf{x}_t)] - R^*] - L^2 \sum_{t=1}^T \eta_t^2. \tag{12.4.13}$$

Note that we exploited that $\mathbf{x}_1$ is given and thus the expectation can be dropped. Last, define

$$\bar{\mathbf{x}} \stackrel{\mathrm{def}}{=} \frac{\sum_{t=1}^T \eta_t \mathbf{x}_t}{\sum_{t=1}^T \eta_t}.$$

Since

$$E\left[\frac{\sum_{t=1}^T \eta_t R(\mathbf{x}_t)}{\sum_{t=1}^T \eta_t}\right] = \frac{\sum_{t=1}^T \eta_t E[R(\mathbf{x}_t)]}{\sum_{t=1}^T \eta_t},$$

by Jensen's inequality (with weights $\alpha_t = \eta_t/\sum_{t=1}^T \eta_t$) and convexity of $R$ it follows that $E[R(\bar{\mathbf{x}})] \leq \frac{\sum_{t=1}^T \eta_t E[R(\mathbf{x}_t)]}{\sum_{t=1}^T \eta_t}$, thus

$$\sum_{t=1}^T \eta_t E[R(\mathbf{x}_t)] \geq \sum_{t=1}^T \eta_t E\left[R(\bar{\mathbf{x}})\right].$$

Plugging this into the inequality (12.4.13) yields the bound

$$E\left[R(\bar{\mathbf{x}})\right] - R^* \leq \frac{r^2 + L^2 \sum_{t=1}^T \eta_t^2}{2 \sum_{t=1}^T \eta_t},$$

where $r^2 \stackrel{\mathrm{def}}{=} \|\mathbf{x}_1 - \mathbf{x}^*\|^2$ bounds the distance between the initial choice of parameters and the minimizer. In short, the speed of convergence depends on how the norm of the stochastic gradient is bounded ($L$) and how far away from optimality the initial parameter value is ($r$). Note that the bound is in terms of $\bar{\mathbf{x}}$ rather than $\mathbf{x}_T$, since $\bar{\mathbf{x}}$ is a smoothed version of the optimization path. Whenever $r$, $L$, and $T$ are known we can pick the constant learning rate $\eta = r/(L \sqrt{T})$. This yields the upper bound $rL/\sqrt{T}$, that is, we converge with rate $\mathcal{O}(1/\sqrt{T})$ to the optimal solution.
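To get a feel for the $\mathcal{O}(1/\sqrt{T})$ rate, the following purely illustrative sketch runs noisy gradient descent on the toy quadratic from above with a constant learning rate $\eta = \eta_0/\sqrt{T}$ and reports the objective value at the averaged iterate $\bar{\mathbf{x}}$ for increasing $T$ (the initial point and $\eta_0 = 0.5$ are arbitrary choices):

import torch

def run_sgd(T, eta0=0.5, seed=0):
    # Noisy gradient descent on f(x1, x2) = x1^2 + 2*x2^2 with eta = eta0 / sqrt(T)
    torch.manual_seed(seed)
    x = torch.tensor([2.0, 2.0])
    x_sum = torch.zeros(2)
    eta = eta0 / T ** 0.5
    for _ in range(T):
        grad = x * torch.tensor([2.0, 4.0]) + torch.normal(0.0, 1.0, (2,))
        x = x - eta * grad
        x_sum += x
    x_bar = x_sum / T  # With a constant learning rate the weighted average is the plain average
    return (x_bar[0] ** 2 + 2 * x_bar[1] ** 2).item()  # Suboptimality f(x_bar) - f*, since f* = 0

for T in [100, 1000, 10000]:
    print(f'T={T:6d}, f(x_bar) = {run_sgd(T):.5f}')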
12.4.4. Stochastic Gradients and Finite Samples
So far we have played a bit fast and loose when it comes to talking about stochastic gradient descent. We posited that we draw instances $x_i$, typically with labels $y_i$, from some distribution $p(x, y)$ and that we use this to update the model parameters in some manner. In particular, for a finite sample size we simply argued that the discrete distribution $p(x, y) = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}(x) \delta_{y_i}(y)$ allows us to perform stochastic gradient descent over it.

However, this is not really what we did. In the toy examples in the current section we simply added noise to an otherwise non-stochastic gradient, i.e., we pretended to have pairs $(x_i, y_i)$. Moreover, in all previous discussions we did not sample indices independently but rather iterated over all instances exactly once per epoch. To see why this is preferable, consider the converse, namely sampling $n$ observations from the discrete distribution with replacement. The probability of choosing a particular element $i$ in one draw is $1/n$, so the probability of choosing it at least once is

$$P(\mathrm{choose~} i) = 1 - P(\mathrm{omit~} i) = 1 - (1 - 1/n)^n \approx 1 - e^{-1} \approx 0.63.$$

A similar reasoning shows that the probability of picking some sample (i.e., training example) exactly once is given by

$${n \choose 1} \frac{1}{n} \left(1 - \frac{1}{n}\right)^{n-1} \approx e^{-1} \approx 0.37.$$

Sampling with replacement leads to increased variance and decreased data efficiency relative to sampling without replacement. Hence, in practice we perform the latter (and this is the default choice throughout this book). Last, note that repeated passes through the training dataset traverse it in a different random order.
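A quick simulation (purely illustrative; the values of n and the number of trials are arbitrary) confirms these numbers by drawing n samples with replacement from n items and checking how often a particular item appears at least once or exactly once:

import random

n, trials = 100, 10000
at_least_once = exactly_once = 0
for _ in range(trials):
    draws = [random.randrange(n) for _ in range(n)]  # n draws with replacement
    count = draws.count(0)  # How often item 0 was picked
    at_least_once += count >= 1
    exactly_once += count == 1
print(f'P(at least once) ~ {at_least_once / trials:.3f}')  # about 1 - 1/e = 0.63
print(f'P(exactly once)  ~ {exactly_once / trials:.3f}')   # about 1/e = 0.37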
12.4.5. Summary
For convex problems we can prove that for a wide choice of learning rates stochastic gradient descent will converge to the optimal solution.
For deep learning this is generally not the case. However, the analysis of convex problems gives us useful insight into how to approach optimization, namely to reduce the learning rate progressively, albeit not too quickly.
Problems occur when the learning rate is too small or too large. In practice a suitable learning rate is often found only after multiple experiments.
When the training dataset contains more examples, each iteration of gradient descent becomes more expensive to compute, so stochastic gradient descent is preferable in these cases.
Optimality guarantees for stochastic gradient descent are in general not available in nonconvex cases since the number of local minima that require checking might well be exponential.
12.4.6. Exercises
1. Experiment with different learning rate schedules for stochastic gradient descent and with different numbers of iterations. In particular, plot the distance from the optimal solution $(0, 0)$ as a function of the number of iterations.
2. Prove that for the function $f(x_1, x_2) = x_1^2 + 2 x_2^2$ adding normal noise to the gradient is equivalent to minimizing a loss function $f(\mathbf{x}, \mathbf{w}) = (x_1 - w_1)^2 + 2 (x_2 - w_2)^2$ where $\mathbf{x}$ is drawn from a normal distribution.
3. Compare the convergence of stochastic gradient descent when you sample from the training dataset with replacement and when you sample without replacement.
4. How would you change the stochastic gradient descent solver if some gradient (or rather some coordinate associated with it) was consistently larger than all the other gradients?
5. Assume that $f(x) = x^2 (1 + \sin x)$. How many local minima does $f$ have? Can you change $f$ in such a way that to minimize it one needs to evaluate all the local minima?