11.4. Stochastic Gradient Descent

In this section, we are going to introduce the basic principles of stochastic gradient descent.

%matplotlib inline
import d2l
import math
from mxnet import np, npx
npx.set_np()

11.4.1. Stochastic Gradient Updates

In deep learning, the objective function is usually the average of the loss functions over the examples in the training dataset. Given a training dataset with \(n\) examples, let \(f_i(\mathbf{x})\) denote the loss function for the example with index \(i\) and parameter vector \(\mathbf{x}\). Then the objective function is

(11.4.1)\[f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n f_i(\mathbf{x}).\]

The gradient of the objective function at \(\mathbf{x}\) is computed as

(11.4.2)\[\nabla f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x}).\]

If gradient descent is used, the computational cost of each independent variable iteration is \(\mathcal{O}(n)\), which grows linearly with \(n\). Therefore, when the training dataset is large, the cost of each gradient descent iteration is very high.

Stochastic gradient descent (SGD) reduces the computational cost at each iteration. At each iteration of stochastic gradient descent, we sample an index \(i\in\{1,\ldots, n\}\) of a data instance uniformly at random and compute the gradient \(\nabla f_i(\mathbf{x})\) to update \(\mathbf{x}\):

(11.4.3)\[\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f_i(\mathbf{x}).\]

Here, \(\eta\) is the learning rate. We can see that the computational cost of each iteration drops from the \(\mathcal{O}(n)\) of gradient descent to the constant \(\mathcal{O}(1)\). We should mention that the stochastic gradient \(\nabla f_i(\mathbf{x})\) is an unbiased estimate of the gradient \(\nabla f(\mathbf{x})\), since

(11.4.4)\[\mathbb{E}_i \nabla f_i(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x}) = \nabla f(\mathbf{x}).\]

This means that, on average, the stochastic gradient is a good estimate of the gradient.
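
As a quick numerical sanity check (not part of the original text), consider a toy loss \(f_i(x) = \frac{1}{2}(x - a_i)^2\) with per-example gradient \(x - a_i\): the average of many uniformly sampled per-example gradients should match the full gradient. The snippet below is a minimal sketch using plain NumPy, imported under a separate name so that it does not clash with MXNet's np above.

import numpy as onp  # plain NumPy, used only for this illustrative check

onp.random.seed(0)
a = onp.random.normal(size=1000)  # one data value a_i per example
x = 1.5                           # current parameter value

full_grad = onp.mean(x - a)                       # (1/n) * sum_i grad f_i(x)
idx = onp.random.randint(0, len(a), size=100000)  # indices drawn uniformly at random
sgd_grad = onp.mean(x - a[idx])                   # average of stochastic gradients

print(full_grad, sgd_grad)  # the two numbers agree up to sampling noise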

Now, we will compare SGD to gradient descent by adding random noise with a mean of 0 and a variance of 1 to the gradient to simulate SGD.

def f(x1, x2):
    return x1 ** 2 + 2 * x2 ** 2  # Objective

def gradf(x1, x2):
    return (2 * x1, 4 * x2)  # Gradient

def sgd(x1, x2, s1, s2):  # Simulate noisy gradient
    global lr  # Learning rate scheduler
    (g1, g2) = gradf(x1, x2)  # Compute gradient
    # Add noise with mean 0 and standard deviation 1 to each gradient coordinate
    (g1, g2) = (g1 + np.random.normal(0, 1), g2 + np.random.normal(0, 1))
    eta_t = eta * lr()  # Learning rate at time t
    return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0)  # Update variables

eta = 0.1
lr = (lambda: 1)  # Constant learning rate
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50))
epoch 1, x1 -4.231221, x2 -1.287400
epoch 2, x1 -3.499321, x2 -0.900833
epoch 3, x1 -2.998628, x2 -0.427026
...
epoch 49, x1 -0.403905, x2 0.065146
epoch 50, x1 -0.522513, x2 0.085780

[Figure: trajectory of SGD with a constant learning rate on the contours of f(x1, x2) = x1**2 + 2 * x2**2]

As we can see, the trajectory of the variables in SGD is much noisier than the one we observed for gradient descent in the previous section. This is due to the stochastic nature of the gradient. That is, even when we arrive near the minimum, we are still subject to the uncertainty injected by the instantaneous gradient via \(\eta \nabla f_i(\mathbf{x})\). Even after 50 steps the quality is still not very good. Even worse, it will not improve after additional steps (we encourage the reader to experiment with a larger number of steps to confirm this on their own). This leaves us with only one alternative: change the learning rate \(\eta\). However, if we pick it too small, we will not make any meaningful progress initially. On the other hand, if we pick it too large, we will not get a good solution, as seen above. The only way to resolve these conflicting goals is to reduce the learning rate dynamically as optimization progresses.

This is also the reason for hooking a learning rate function lr into the sgd step function (via the global variable). In the example above, any functionality for learning rate scheduling lies dormant, since we set the associated lr function to be constant, i.e., lr = (lambda: 1).

11.4.2. Dynamic Learning Rate

Replacing \(\eta\) with a time-dependent learning rate \(\eta(t)\) adds to the complexity of controlling convergence of an optimization algorithm. In particular, we need to figure out how rapidly \(\eta\) should decay. If it decays too quickly, we stop optimizing prematurely. If we decrease it too slowly, we waste too much time on optimization. There are a few basic strategies used to adjust \(\eta\) over time (we will discuss more advanced strategies in a later chapter):

(11.4.5)\[\begin{split}\begin{aligned} \eta(t) & = \eta_i \text{ if } t_i \leq t \leq t_{i+1} && \mathrm{piecewise~constant} \\ \eta(t) & = \eta_0 \cdot e^{-\lambda t} && \mathrm{exponential} \\ \eta(t) & = \eta_0 \cdot (\beta t + 1)^{-\alpha} && \mathrm{polynomial} \end{aligned}\end{split}\]

In the first scenario we decrease the learning rate, e.g., whenever progress in optimization has stalled. This is a common strategy for training deep networks. Alternatively we could decrease it much more aggressively by an exponential decay. Unfortunately this leads to premature stopping before the algorithm has converged. A popular choice is polynomial decay with \(\alpha = 0.5\). In the case of convex optimization there are a number of proofs which show that this rate is well behaved. Let us see what this looks like in practice.
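
For reference, the three schedules in (11.4.5) can be written as plain Python functions of the step counter t. The sketch below is illustrative rather than taken from the text: the constants eta0, lambd, alpha, beta and the breakpoints are arbitrary choices, and the experiments that follow instead drive the schedule through a global counter ctr.

import math

def piecewise_constant(t, boundaries=(10, 30), values=(1.0, 0.1, 0.01)):
    # Return values[k] while t lies in the k-th segment defined by boundaries
    for i, b in enumerate(boundaries):
        if t < b:
            return values[i]
    return values[-1]

def exponential_schedule(t, eta0=1.0, lambd=0.1):
    return eta0 * math.exp(-lambd * t)  # eta0 * exp(-lambda * t)

def polynomial_schedule(t, eta0=1.0, beta=0.1, alpha=0.5):
    return eta0 * (beta * t + 1) ** (-alpha)  # eta0 * (beta * t + 1)^(-alpha)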

def exponential():
    global ctr
    ctr += 1
    return math.exp(-0.1 * ctr)

ctr = 1
lr = exponential  # Set up learning rate
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=1000))
epoch 1, x1 -4.187498, x2 -1.441271
epoch 2, x1 -3.480476, x2 -0.959998
epoch 3, x1 -3.114053, x2 -0.776623
...
epoch 50, x1 -0.871246, x2 -0.018002
...
epoch 1000, x1 -0.862200, x2 -0.019736

[Figure: trajectory of SGD with an exponentially decaying learning rate; the iterates stall near (-0.86, -0.02), far from the optimum]

As expected, the variance in the parameters is significantly reduced. However, this comes at the expense of failing to converge to the optimal solution \(\mathbf{x} = (0, 0)\). Even after 1000 steps we are still very far away from it; indeed, the algorithm fails to converge at all, since the learning rate decays so quickly that the updates effectively stop. On the other hand, if we use a polynomial decay, where the learning rate decays with the inverse square root of the number of steps, convergence is good.

def polynomial():
    global ctr
    ctr += 1
    return (1 + 0.1 * ctr)**(-0.5)

ctr = 1
lr = polynomial  # Set up learning rate
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50))
epoch 1, x1 -4.117116, x2 -1.264984
epoch 2, x1 -3.393946, x2 -0.943845
epoch 3, x1 -2.625807, x2 -0.468526
...
epoch 49, x1 -0.036963, x2 0.054364
epoch 50, x1 -0.024847, x2 0.090820

[Figure: trajectory of SGD with polynomial learning rate decay; the iterates approach the optimum (0, 0)]

There exist many more choices for how to set the learning rate. For instance, we could start with a small rate, then rapidly ramp up, and then decrease it again, albeit more slowly. We could even alternate between smaller and larger learning rates. A large variety of such schedules exists. For now let us focus on learning rate schedules for which a comprehensive theoretical analysis is possible, i.e., on learning rates in a convex setting. For general nonconvex problems it is very difficult to obtain meaningful convergence guarantees, since in general minimizing nonlinear nonconvex problems is NP-hard. For a survey see, e.g., the excellent lecture notes of Tibshirani (2015).

11.4.3. Convergence Analysis for Convex Objectives

The following is optional and primarily serves to convey more intuition about the problem. We limit ourselves to one of the simplest proofs, as described by [Nesterov & Vial, 2000]. Significantly more advanced proof techniques exist, e.g., whenever the objective function is particularly well behaved. [Hazan et al., 2008] show that for strongly convex functions, i.e., for functions that can be bounded from below by \(\mathbf{x}^\top \mathbf{Q} \mathbf{x}\), it is possible to minimize them in a small number of steps while decreasing the learning rate like \(\eta(t) = \eta_0/(\beta t + 1)\). Unfortunately this case never really occurs in deep learning and we are left with a much more slowly decreasing rate in practice.

Consider the case where

(11.4.6)\[\mathbf{w}_{t+1} = \mathbf{w}_{t} - \eta_t \partial_\mathbf{w} l(\mathbf{x}_t, \mathbf{w}).\]

In particular, assume that \(\mathbf{x}_t\) is drawn from some distribution \(P(\mathbf{x})\) and that \(l(\mathbf{x}, \mathbf{w})\) is a convex function in \(\mathbf{w}\) for all \(\mathbf{x}\). Finally, denote by

(11.4.7)\[R(\mathbf{w}) = E_{\mathbf{x} \sim P}[l(\mathbf{x}, \mathbf{w})]\]

the expected risk and by \(R^*\) its minimum with regard to \(\mathbf{w}\). Moreover, let \(\mathbf{w}^*\) be the minimizer (we assume that it exists within the domain in which \(\mathbf{w}\) is defined). In this case we can track the distance between the current parameter \(\mathbf{w}_t\) and the risk minimizer \(\mathbf{w}^*\) and see whether it improves over time:

(11.4.8)\[\begin{split}\begin{aligned} \|\mathbf{w}_{t+1} - \mathbf{w}^*\|^2 & = \|\mathbf{w}_{t} - \eta_t \partial_\mathbf{w} l(\mathbf{x}_t, \mathbf{w}) - \mathbf{w}^*\|^2 \\ & = \|\mathbf{w}_{t} - \mathbf{w}^*\|^2 + \eta_t^2 \|\partial_\mathbf{w} l(\mathbf{x}_t, \mathbf{w})\|^2 - 2 \eta_t \left\langle \mathbf{w}_t - \mathbf{w}^*, \partial_\mathbf{w} l(\mathbf{x}_t, \mathbf{w})\right\rangle. \end{aligned}\end{split}\]

We assume that the norm of the gradient \(\partial_\mathbf{w} l(\mathbf{x}_t, \mathbf{w})\) is bounded from above by some constant \(L\) (a Lipschitz constant of \(l\) in \(\mathbf{w}\)); hence we have that

(11.4.9)\[\eta_t^2 \|\partial_\mathbf{w} l(\mathbf{x}_t, \mathbf{w})\|^2 \leq \eta_t^2 L^2.\]

We are mostly interested in how the distance between \(\mathbf{w}_t\) and \(\mathbf{w}^*\) changes in expectation. In fact, for any specific sequence of steps the distance might well increase, depending on whichever \(\mathbf{x}_t\) we encounter. Hence we need to bound the inner product. By convexity we have that

(11.4.10)\[l(\mathbf{x}_t, \mathbf{w}^*) \geq l(\mathbf{x}_t, \mathbf{w}_t) + \left\langle \mathbf{w}^* - \mathbf{w}_t, \partial_{\mathbf{w}} l(\mathbf{x}_t, \mathbf{w}_t) \right\rangle.\]

Rearranging (11.4.10) gives \(\left\langle \mathbf{w}_t - \mathbf{w}^*, \partial_{\mathbf{w}} l(\mathbf{x}_t, \mathbf{w}_t) \right\rangle \geq l(\mathbf{x}_t, \mathbf{w}_t) - l(\mathbf{x}_t, \mathbf{w}^*)\). Using this together with (11.4.9) in (11.4.8), we obtain a bound on the distance between parameters at time \(t+1\) as follows:

(11.4.11)\[\|\mathbf{w}_{t} - \mathbf{w}^*\|^2 - \|\mathbf{w}_{t+1} - \mathbf{w}^*\|^2 \geq 2 \eta_t (l(\mathbf{x}_t, \mathbf{w}_t) - l(\mathbf{x}_t, \mathbf{w}^*)) - \eta_t^2 L^2.\]

This means that we make progress as long as the difference between the current loss and the optimal loss outweighs \(\eta_t L^2/2\). Since this difference is bound to converge to zero, it follows that the learning rate \(\eta_t\) also needs to vanish.

Next we take expectations over this expression. This yields

(11.4.12)\[E_{\mathbf{w}_t}\left[\|\mathbf{w}_{t} - \mathbf{w}^*\|^2\right] - E_{\mathbf{w}_{t+1}\mid \mathbf{w}_t}\left[\|\mathbf{w}_{t+1} - \mathbf{w}^*\|^2\right] \geq 2 \eta_t [E[R[\mathbf{w}_t]] - R^*] - \eta_t^2 L^2.\]

The last step involves summing over the inequalities for \(t \in \{1, \ldots, T\}\). Since the sum telescopes, and by dropping the lower term, we obtain

(11.4.13)\[\|\mathbf{w}_{0} - \mathbf{w}^*\|^2 \geq 2 \sum_{t=1}^T \eta_t [E[R[\mathbf{w}_t]] - R^*] - L^2 \sum_{t=1}^T \eta_t^2.\]

Note that we exploited that \(\mathbf{w}_0\) is given and thus the expectation can be dropped. Finally, define

(11.4.14)\[\bar{\mathbf{w}} := \frac{\sum_{t=1}^T \eta_t \mathbf{w}_t}{\sum_{t=1}^T \eta_t}.\]

Then by convexity it follows that

(11.4.15)\[\sum_{t=1}^T \eta_t E[R[\mathbf{w}_t]] \geq \left(\sum_{t=1}^T \eta_t\right) \cdot E\left[R[\bar{\mathbf{w}}]\right].\]
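
To make the step explicit: \(R\) is convex in \(\mathbf{w}\) because every \(l(\mathbf{x}, \cdot)\) is convex and taking expectations preserves convexity, so Jensen's inequality gives

\[R(\bar{\mathbf{w}}) = R\left(\frac{\sum_{t=1}^T \eta_t \mathbf{w}_t}{\sum_{t=1}^T \eta_t}\right) \leq \frac{\sum_{t=1}^T \eta_t R(\mathbf{w}_t)}{\sum_{t=1}^T \eta_t},\]

and multiplying both sides by \(\sum_{t=1}^T \eta_t\) and taking expectations yields (11.4.15).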

Plugging this into the above inequality yields the bound

(11.4.16)\[E\left[R[\bar{\mathbf{w}}]\right] - R^* \leq \frac{r^2 + L^2 \sum_{t=1}^T \eta_t^2}{2 \sum_{t=1}^T \eta_t}.\]

Here \(r^2 := \|\mathbf{w}_0 - \mathbf{w}^*\|^2\) is a bound on the distance between the initial choice of parameters and the final outcome. In short, the speed of convergence depends on how rapidly the loss function can change (via the Lipschitz constant \(L\)) and how far from optimality the initial value is (via \(r\)). Note that the bound is in terms of \(\bar{\mathbf{w}}\) rather than \(\mathbf{w}_T\), since \(\bar{\mathbf{w}}\) is a smoothed version of the optimization path. Now let us analyze some choices for \(\eta_t\).

  • Known Time Horizon. Whenever \(r\), \(L\), and \(T\) are known we can pick \(\eta = r/(L\sqrt{T})\). This yields the upper bound \(r L (1 + 1/T)/(2\sqrt{T}) < rL/\sqrt{T}\) (see also the worked calculation after this list). That is, we converge with rate \(\mathcal{O}(1/\sqrt{T})\) to the optimal solution.

  • Unknown Time Horizon. Whenever we want to have a good solution for any time \(T\) we can pick \(\eta_t = \mathcal{O}(1/\sqrt{t})\). This costs us an extra logarithmic factor and leads to an upper bound of the form \(\mathcal{O}(\log T / \sqrt{T})\).
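
As a supplementary worked calculation (not from the original text), plug a constant learning rate \(\eta_t = \eta\) into (11.4.16):

\[E\left[R[\bar{\mathbf{w}}]\right] - R^* \leq \frac{r^2 + L^2 T \eta^2}{2 T \eta} = \frac{r^2}{2 T \eta} + \frac{L^2 \eta}{2}.\]

The right-hand side is minimized at \(\eta = r/(L\sqrt{T})\), where both terms equal \(rL/(2\sqrt{T})\), so the bound becomes \(rL/\sqrt{T}\), confirming the \(\mathcal{O}(1/\sqrt{T})\) rate for a known time horizon.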

Note that for strongly convex losses \(l(\mathbf{x}, \mathbf{w}') \geq l(\mathbf{x}, \mathbf{w}) + \langle \mathbf{w}'-\mathbf{w}, \partial_\mathbf{w} l(\mathbf{x}, \mathbf{w}) \rangle + \frac{\lambda}{2} \|\mathbf{w}-\mathbf{w}'\|^2\) we can design even more rapidly converging optimization schedules. In fact, an exponential decay in \(\eta\) leads to a bound of the form \(\mathcal{O}(\log T / T)\).

11.4.4. Stochastic Gradients and Finite Samples

So far we have played a bit fast and loose when it comes to talking about stochastic gradient descent. We posited that we draw instances \(x_i\), typically with labels \(y_i\) from some distribution \(p(x, y)\) and that we use this to update the weights \(w\) in some manner. In particular, for a finite sample size we simply argued that the discrete distribution \(p(x, y) = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}(x) \delta_{y_i}(y)\) allows us to perform SGD over it.

However, this is not really what we did. In the toy examples in the current section we simply added noise to an otherwise non-stochastic gradient, i.e., we pretended to have pairs \((x_i, y_i)\). It turns out that this is justified here (see the exercises for a detailed discussion). More troubling is that in all previous discussions we clearly did not do this. Instead we iterated over all instances exactly once. To see why this is preferable consider the converse, namely that we are sampling \(n\) observations from the discrete distribution with replacement. The probability of choosing a particular element \(i\) in one draw is \(n^{-1}\). Thus the probability of choosing it at least once is

(11.4.17)\[P(\mathrm{choose~} i) = 1 - P(\mathrm{omit~} i) = 1 - (1-n^{-1})^n \approx 1-e^{-1} \approx 0.63.\]

A similar reasoning shows that the probability of picking a given sample exactly once is \({n \choose 1} n^{-1} (1-n^{-1})^{n-1} = \frac{n}{n-1} (1-n^{-1})^{n} \approx e^{-1} \approx 0.37\). Sampling with replacement thus leads to increased variance and decreased data efficiency relative to sampling without replacement. Hence, in practice we perform the latter (and this is the default choice throughout this book). Note that repeated passes through the dataset traverse it in a different random order each time.
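
A small simulation (a sketch, not part of the original text) illustrates this 0.63/0.37 split when drawing \(n\) indices with replacement; plain NumPy is used here under a separate name.

import numpy as onp  # plain NumPy, used only for this illustrative check

onp.random.seed(0)
n = 10000
idx = onp.random.randint(0, n, size=n)  # draw n indices with replacement
seen = len(onp.unique(idx)) / n         # fraction of examples chosen at least once
print(seen, 1 - onp.exp(-1))            # both are close to 0.632
print(1 - seen, onp.exp(-1))            # fraction never chosen, close to 0.368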

11.4.5. Summary

  • For convex problems we can prove that for a wide choice of learning rates Stochastic Gradient Descent will converge to the optimal solution.

  • For deep learning this is generally not the case. However, the analysis of convex problems gives us useful insight into how to approach optimization, namely to reduce the learning rate progressively, albeit not too quickly.

  • Problems occur when the learning rate is too small or too large. In practice a suitable learning rate is often found only after multiple experiments.

  • When the training dataset contains many examples, computing each gradient descent iteration is expensive, so SGD is preferable in such cases.

  • Optimality guarantees for SGD are in general not available in nonconvex cases since the number of local minima that require checking might well be exponential.

11.4.6. Exercises

  1. Experiment with different learning rate schedules for SGD and with different numbers of iterations. In particular, plot the distance from the optimal solution \((0, 0)\) as a function of the number of iterations.

  2. Prove that for the function \(f(x_1, x_2) = x_1^2 + 2 x_2^2\) adding normal noise to the gradient is equivalent to minimizing a loss function \(l(\mathbf{x}, \mathbf{w}) = (x_1 - w_1)^2 + 2 (x_2 - w_2)^2\) where \(x\) is drawn from a normal distribution.

    • Derive mean and variance of the distribution for \(\mathbf{x}\).

    • Show that this property holds in general for objective functions \(f(\mathbf{x}) = \frac{1}{2} (\mathbf{x} - \mathbf{\mu})^\top Q (\mathbf{x} - \mathbf{\mu})\) for \(Q \succeq 0\).

  3. Compare convergence of SGD when you sample from \(\{(x_1, y_1), \ldots, (x_m, y_m)\}\) with replacement and when you sample without replacement.

  4. How would you change the SGD solver if some gradient (or rather some coordinate associated with it) was consistently larger than all other gradients?

  5. Assume that \(f(x) = x^2 (1 + \sin x)\). How many local minima does \(f\) have? Can you change \(f\) in such a way that to minimize it one needs to evaluate all local minima?
