12.7. Adagrad
Let’s begin by considering learning problems with features that occur infrequently.
12.7.1. Sparse Features and Learning Rates
Imagine that we are training a language model. To get good accuracy we typically want to decrease the learning rate as we keep on training, usually at a rate of $\mathcal{O}(t^{-\frac{1}{2}})$ or slower.
Parameters associated with infrequent features receive meaningful updates only when those features occur. Given a decreasing learning rate we might end up in a situation where the parameters for common features converge rather quickly to their optimal values, whereas the parameters for infrequent features remain far from optimal simply because we have not yet observed those features often enough. In other words, the learning rate either decreases too slowly for frequent features or too quickly for infrequent ones.
A possible hack to redress this issue would be to count the number of times we see a particular feature and to use this as a clock for adjusting learning rates. That is, rather than choosing a learning rate of the form $\eta = \frac{\eta_0}{\sqrt{t + c}}$ we could use $\eta_i = \frac{\eta_0}{\sqrt{s(i, t) + c}}$. Here $s(i, t)$ counts the number of nonzeros for feature $i$ that we have observed up to time $t$. This is quite easy to implement at no meaningful overhead. However, it fails whenever we do not quite have sparsity but rather just data where the gradients are often very small and only rarely large.
Adagrad by Duchi et al. (2011) addresses this by replacing the rather crude counter $s(i, t)$ with an aggregate of the squares of previously observed gradients. In particular, it uses $s(i, t+1) = s(i, t) + \left(\partial_i f(\mathbf{x})\right)^2$ as a means to adjust the learning rate per coordinate. This has two benefits: we no longer need to decide when a gradient counts as large, and the adjustment scales automatically with the magnitude of the gradients. Coordinates that routinely correspond to large gradients are scaled down significantly, whereas coordinates with small gradients receive a much more gentle treatment.
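To make the contrast concrete, here is a minimal sketch (added for illustration, not part of the original notebook): it tracks both schedules for a single hypothetical sparse feature, where the toy gradient stream and the constants eta0 and c are assumed values.

import math

eta0, c = 0.4, 1.0
counter, s = 0, 0.0  # feature counter s(i, t) vs. sum of squared gradients
for t, g in enumerate([0.0, 0.0, 0.5, 0.0, 0.3, 0.0]):  # mostly-zero gradient stream
    if g != 0:
        counter += 1  # counter-based clock: ticks only when the feature occurs
    s += g ** 2       # Adagrad-style accumulator: scales with gradient magnitude
    print(f't={t}: counter lr {eta0 / math.sqrt(counter + c):.3f}, '
          f'adagrad-style lr {eta0 / math.sqrt(s + c):.3f}')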
12.7.2. Preconditioning
Convex optimization problems are good for analyzing the characteristics
of algorithms. After all, for most nonconvex problems it is difficult to
derive meaningful theoretical guarantees, but intuition and insight
often carry over. Let's look at the problem of minimizing $f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{c}^\top \mathbf{x} + b$.
As we saw in Section 12.6, it is possible to rewrite this problem in terms of its eigendecomposition $\mathbf{Q} = \mathbf{U}^\top \boldsymbol{\Lambda} \mathbf{U}$ to arrive at a much simplified problem where each coordinate can be solved individually:

$$f(\mathbf{x}) = \bar{f}(\bar{\mathbf{x}}) = \frac{1}{2} \bar{\mathbf{x}}^\top \boldsymbol{\Lambda} \bar{\mathbf{x}} + \bar{\mathbf{c}}^\top \bar{\mathbf{x}} + b.$$

Here we used $\bar{\mathbf{x}} = \mathbf{U} \mathbf{x}$ and consequently $\bar{\mathbf{c}} = \mathbf{U} \mathbf{c}$. The modified problem has as its minimizer $\bar{\mathbf{x}} = -\boldsymbol{\Lambda}^{-1} \bar{\mathbf{c}}$ and minimum value $-\frac{1}{2} \bar{\mathbf{c}}^\top \boldsymbol{\Lambda}^{-1} \bar{\mathbf{c}} + b$. This is much easier to compute since $\boldsymbol{\Lambda}$ is a diagonal matrix containing the eigenvalues of $\mathbf{Q}$.
If we perturb $\mathbf{c}$ slightly we would hope to find only slight changes in the minimizer of $f$. Unfortunately this is not the case. While slight changes in $\mathbf{c}$ lead to equally slight changes in $\bar{\mathbf{c}}$, this is not true for the minimizer of $f$ (and of $\bar{f}$ respectively): whenever an eigenvalue $\boldsymbol{\Lambda}_i$ is large we will see only small changes in $\bar{x}_i$, whereas for small $\boldsymbol{\Lambda}_i$ changes in $\bar{x}_i$ can be dramatic. The ratio between the largest and the smallest eigenvalue is called the condition number $\kappa = \frac{\boldsymbol{\Lambda}_1}{\boldsymbol{\Lambda}_d}$ of the optimization problem.

If the condition number $\kappa$ is large, it is difficult to solve the optimization problem accurately, since we need to get a large dynamic range of values right. An obvious, albeit somewhat naive, fix would be to distort the space so that all eigenvalues equal $1$, i.e., to rescale from $\mathbf{x}$ to $\mathbf{z} := \boldsymbol{\Lambda}^{\frac{1}{2}} \mathbf{U} \mathbf{x}$, in which case $\mathbf{x}^\top \mathbf{Q} \mathbf{x}$ simplifies to $\|\mathbf{z}\|^2$. Alas, computing eigenvalues and eigenvectors is in general much more expensive than solving the original problem.
While computing eigenvalues exactly might be expensive, guessing them
and computing them even somewhat approximately may already be a lot
better than not doing anything at all. In particular, we could use the
diagonal entries of $\mathbf{Q}$ and rescale it accordingly. This is much cheaper than computing eigenvalues:

$$\tilde{\mathbf{Q}} = \mathrm{diag}^{-\frac{1}{2}}(\mathbf{Q}) \, \mathbf{Q} \, \mathrm{diag}^{-\frac{1}{2}}(\mathbf{Q}).$$

In this case we have $\tilde{\mathbf{Q}}_{ij} = \mathbf{Q}_{ij} / \sqrt{\mathbf{Q}_{ii} \mathbf{Q}_{jj}}$ and specifically $\tilde{\mathbf{Q}}_{ii} = 1$ for all $i$. In most cases this simplifies the condition number considerably. For instance, in the cases we discussed previously it would entirely eliminate the problem at hand, since the problem is axis aligned.
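As a quick numerical illustration (a sketch added here, with an arbitrarily chosen positive definite matrix Q), diagonal rescaling can shrink the condition number dramatically:

import torch

Q = torch.tensor([[100.0, 1.0],
                  [1.0, 0.1]])
d = torch.diag(Q).rsqrt()              # diag^{-1/2}(Q)
Q_tilde = d[:, None] * Q * d[None, :]  # diag^{-1/2}(Q) Q diag^{-1/2}(Q)

def condition_number(M):
    evals = torch.linalg.eigvalsh(M)
    return (evals.max() / evals.min()).item()

print(condition_number(Q), condition_number(Q_tilde))  # roughly 1111 vs. 1.9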
Unfortunately we face yet another problem: in deep learning we typically do not even have access to the second derivative of the objective function: for $\mathbf{x} \in \mathbb{R}^d$ the second derivative even on a minibatch may require $\mathcal{O}(d^2)$ space and work to compute, making it practically infeasible. The ingenious idea of Adagrad is to use a proxy for that elusive diagonal of the Hessian that is both relatively cheap to compute and effective: the magnitude of the gradient itself.

In order to see why this works, let's look at $\bar{f}(\bar{\mathbf{x}})$. We have that

$$\partial_{\bar{\mathbf{x}}} \bar{f}(\bar{\mathbf{x}}) = \boldsymbol{\Lambda} \bar{\mathbf{x}} + \bar{\mathbf{c}} = \boldsymbol{\Lambda} \left(\bar{\mathbf{x}} - \bar{\mathbf{x}}_0\right),$$

where $\bar{\mathbf{x}}_0$ is the minimizer of $\bar{f}$. Hence the magnitude of the gradient depends both on $\boldsymbol{\Lambda}$ and on the distance from optimality. Since Adagrad is a stochastic gradient descent algorithm, we will see gradients with nonzero variance even at optimality, so we can safely use the magnitude of the gradients as a cheap proxy for the scale of the Hessian. A thorough analysis is beyond the scope of this section; we refer the reader to Duchi et al. (2011) for details.
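As a quick sanity check (a sketch added for illustration; the curvatures 0.2 and 4 are the second derivatives of the quadratic f_2d used below), running the Adagrad update on an axis-aligned quadratic shows that the accumulated squared gradients order the coordinates by curvature:

import torch

Lambda = torch.tensor([0.2, 4.0])        # assumed eigenvalues of the quadratic
x = torch.tensor([1.0, 1.0])
s = torch.zeros(2)
for _ in range(10):
    g = Lambda * x                       # gradient of 0.5 * sum_i Lambda_i * x_i^2
    s += g ** 2                          # Adagrad accumulator
    x -= 0.4 / torch.sqrt(s + 1e-6) * g  # Adagrad step with eta = 0.4
print(torch.sqrt(s))                     # larger for the high-curvature coordinate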
12.7.3. The Algorithm
Let's formalize the discussion from above. We use the variable $\mathbf{s}_t$ to accumulate past gradient variance as follows:

$$\begin{aligned}
\mathbf{g}_t &= \partial_{\mathbf{w}} l(y_t, f(\mathbf{x}_t, \mathbf{w})), \\
\mathbf{s}_t &= \mathbf{s}_{t-1} + \mathbf{g}_t^2, \\
\mathbf{w}_t &= \mathbf{w}_{t-1} - \frac{\eta}{\sqrt{\mathbf{s}_t + \epsilon}} \cdot \mathbf{g}_t.
\end{aligned}$$

Here the operations are applied coordinate wise. That is, $\mathbf{v}^2$ has entries $v_i^2$. Likewise $\frac{1}{\sqrt{\mathbf{v}}}$ has entries $\frac{1}{\sqrt{v_i}}$ and $\mathbf{u} \cdot \mathbf{v}$ has entries $u_i v_i$. As before, $\eta$ is the learning rate and $\epsilon$ is an additive constant that ensures that we do not divide by $0$. Last, we initialize $\mathbf{s}_0 = \mathbf{0}$.
Just like in the case of momentum we need to keep track of an auxiliary
variable, in this case to allow for an individual learning rate per
coordinate. This does not increase the cost of Adagrad significantly
relative to SGD, simply since the main cost is typically to compute the loss $l(y_t, f(\mathbf{x}_t, \mathbf{w}))$ and its gradient.

Note that accumulating squared gradients in $\mathbf{s}_t$ means that $\mathbf{s}_t$ grows essentially at a linear rate (somewhat slower than linearly in practice, since the gradients initially diminish). This leads to an $\mathcal{O}(t^{-\frac{1}{2}})$ learning rate, albeit adjusted on a per-coordinate basis. For convex problems this is perfectly adequate. In deep learning, though, we might want to decrease the learning rate rather more slowly. This led to a number of Adagrad variants that we will discuss in subsequent sections.
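To see the $\mathcal{O}(t^{-\frac{1}{2}})$ decay concretely, here is a tiny sketch (added for illustration, assuming a constant gradient magnitude of $1$) that prints the effective per-coordinate step size $\eta / \sqrt{s_t + \epsilon}$:

import math

eta, eps, s = 0.4, 1e-6, 0.0
for t in range(1, 101):
    s += 1.0 ** 2                        # with |g| = 1, s_t grows linearly in t
    if t in (1, 10, 100):
        print(f't={t}: effective lr {eta / math.sqrt(s + eps):.4f}, '
              f'eta / sqrt(t) = {eta / math.sqrt(t):.4f}')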
For now, let's see how Adagrad behaves on a quadratic convex problem. We use the same function as before, $f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$, and implement Adagrad with the same learning rate as previously, i.e., $\eta = 0.4$. As we can see, the iterative trajectory of the independent variable is smoother. However, due to the cumulative effect of $\mathbf{s}_t$, the learning rate continuously decays, so the independent variable does not move as much during later stages of iteration.
%matplotlib inline
import math
import torch
from d2l import torch as d2l

def adagrad_2d(x1, x2, s1, s2):
    eps = 1e-6
    g1, g2 = 0.2 * x1, 4 * x2             # gradients of f_2d
    s1 += g1 ** 2                         # accumulate squared gradients
    s2 += g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1  # per-coordinate scaled update
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2

def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

eta = 0.4
d2l.show_trace_2d(f_2d, d2l.train_2d(adagrad_2d))
epoch 20, x1: -2.382563, x2: -0.158591
As we increase the learning rate to $2$ we see much better behavior. This already indicates that the decrease in the learning rate might be rather aggressive, even in the noise-free case, and we need to ensure that parameters converge appropriately.
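The cell that produces this behavior is not shown in this extract; in the original notebook it simply re-runs the example above with the larger learning rate, roughly as follows (reusing adagrad_2d and f_2d defined earlier):

eta = 2
d2l.show_trace_2d(f_2d, d2l.train_2d(adagrad_2d))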
12.7.4. Implementation from Scratch
Just like the momentum method, Adagrad needs to maintain a state variable of the same shape as the parameters.
def init_adagrad_states(feature_dim):
    s_w = torch.zeros((feature_dim, 1))
    s_b = torch.zeros(1)
    return (s_w, s_b)

def adagrad(params, states, hyperparams):
    eps = 1e-6
    for p, s in zip(params, states):
        with torch.no_grad():
            s[:] += torch.square(p.grad)  # accumulate squared gradients
            p[:] -= hyperparams['lr'] * p.grad / torch.sqrt(s + eps)  # scaled update
        p.grad.data.zero_()
Compared to the experiment in Section 12.5 we use a larger learning rate to train the model.
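The training invocation itself is missing from this extract; in the d2l notebooks it looks roughly like the following sketch, assuming the get_data_ch11 and train_ch11 helpers introduced in Section 12.5:

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(adagrad, init_adagrad_states(feature_dim),
               {'lr': 0.1}, data_iter, feature_dim)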
loss: 0.243, 0.162 sec/epoch
12.7.5. Concise Implementation
Using the framework's built-in implementation of the adagrad algorithm (the Trainer instance in Gluon, or torch.optim.Adagrad in PyTorch), we can invoke the Adagrad algorithm directly instead of implementing it from scratch.
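The corresponding cell is missing from this extract; in the PyTorch version it is roughly the following sketch, assuming the d2l.train_concise_ch11 helper from Section 12.5 and the data_iter defined above:

trainer = torch.optim.Adagrad
d2l.train_concise_ch11(trainer, {'lr': 0.1}, data_iter)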
loss: 0.242, 0.129 sec/epoch
12.7.6. Summary
- Adagrad decreases the learning rate dynamically on a per-coordinate basis.
- It uses the magnitude of the gradient as a means of adjusting how quickly progress is achieved: coordinates with large gradients are compensated with a smaller learning rate.
- Computing the exact second derivative is typically infeasible in deep learning problems due to memory and computational constraints. The gradient can be a useful proxy.
- If the optimization problem has a rather uneven structure, Adagrad can help mitigate the distortion.
- Adagrad is particularly effective for sparse features where the learning rate needs to decrease more slowly for infrequently occurring terms.
- On deep learning problems Adagrad can sometimes be too aggressive in reducing learning rates. We will discuss strategies for mitigating this in the context of Section 12.10.
12.7.7. Exercises
1. Prove that for an orthogonal matrix $\mathbf{U}$ and a vector $\mathbf{c}$ the following holds: $\|\mathbf{c} - \boldsymbol{\delta}\|_2 = \|\mathbf{U}\mathbf{c} - \mathbf{U}\boldsymbol{\delta}\|_2$. Why does this mean that the magnitude of perturbations does not change after an orthogonal change of variables?
2. Try out Adagrad for $f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$ and also for the objective function rotated by 45 degrees, i.e., $f(\mathbf{x}) = 0.1 (x_1 + x_2)^2 + 2 (x_1 - x_2)^2$. Does it behave differently?
3. Prove Gerschgorin's circle theorem, which states that the eigenvalues $\lambda_i$ of a matrix $\mathbf{M}$ satisfy $|\lambda_i - \mathbf{M}_{jj}| \leq \sum_{k \neq j} |\mathbf{M}_{jk}|$ for at least one choice of $j$.
4. What does Gerschgorin's theorem tell us about the eigenvalues of the diagonally preconditioned matrix $\mathrm{diag}^{-\frac{1}{2}}(\mathbf{M}) \, \mathbf{M} \, \mathrm{diag}^{-\frac{1}{2}}(\mathbf{M})$?
5. Try out Adagrad for a proper deep network, such as the one in Section 7.6, when applied to Fashion-MNIST.
6. How would you need to modify Adagrad to achieve a less aggressive decay in learning rate?