12.5. Minibatch Stochastic Gradient Descent


So far we have encountered two extremes in gradient-based learning: Section 12.3 uses the full dataset to compute gradients and update parameters, one pass at a time. Conversely, Section 12.4 processes one training example at a time to make progress. Each has its drawbacks. Gradient descent is not particularly data efficient whenever the data is very similar. Stochastic gradient descent is not particularly computationally efficient since CPUs and GPUs cannot exploit the full power of vectorization. This suggests that there might be something in between, and in fact, that is what we have been using so far in the examples we discussed.

12.5.1. Vectorization and Caches

At the heart of the decision to use minibatches is computational efficiency. This is most easily understood when considering parallelization to multiple GPUs and multiple servers. In this case we need to send at least one image to each GPU. With 8 GPUs per server and 16 servers we already arrive at a minibatch size no smaller than 128.
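The arithmetic behind that lower bound can be written out as a small sketch. The function name `min_global_batch_size` and the parameter `examples_per_gpu` are hypothetical, introduced here only for illustration.

```python
def min_global_batch_size(gpus_per_server: int, num_servers: int,
                          examples_per_gpu: int = 1) -> int:
    """Smallest minibatch that gives every GPU at least `examples_per_gpu` examples."""
    return gpus_per_server * num_servers * examples_per_gpu


# 8 GPUs per server, 16 servers, at least one image per GPU.
print(min_global_batch_size(8, 16))  # 128
```

In practice each GPU usually receives more than one example, so the global minibatch grows as a multiple of this bound.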

Things are a bit more subtle when it comes to single GPUs or even CPUs. These devices have multiple types of memory, often multiple types of computational units, and different bandwidth constraints between them. For instance, a CPU has a small number of registers and then the L1, L2, and in some cases even L3 cache (which is shared among different processor cores). These caches are of increasing size and latency (and at the same time of decreasing bandwidth). Suffice it to say, the processor is capable of performing many more operations than the main memory interface is able to supply data for.

First, a 2GHz CPU with 16 cores and AVX-512 vectorization can process up to \(2 \cdot 10^9 \cdot 16 \cdot 32 \approx 10^{12}\) bytes per second. The capability of GPUs easily exceeds this number by a factor of 100. On the other hand, a midrange server processor might not have much more than 100 GB/s of memory bandwidth, i.e., roughly one tenth of what would be required to keep the processor fed. To make matters worse, not all memory access is created equal: memory interfaces are typically 64 bits wide or wider (e.g., up to 384 bits on GPUs), hence reading a single byte incurs the cost of a much wider access.
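The back-of-the-envelope numbers above can be reproduced in a few lines. This is only a sketch of the arithmetic: the variable names are illustrative, and the figures are the ones quoted in the text rather than measurements of any particular processor.

```python
clock_hz = 2e9          # 2 GHz clock
cores = 16
bytes_per_cycle = 32    # per-core figure used in the text

# Peak rate at which the processor could consume data.
peak_bytes_per_s = clock_hz * cores * bytes_per_cycle

# Memory bandwidth of the midrange server processor in the text (100 GB/s).
mem_bandwidth = 100e9

# How many times faster the processor is than main memory.
ratio = peak_bytes_per_s / mem_bandwidth

# A 64-bit interface moves 8 bytes per access, so fetching a single
# byte uses only a fraction of the transferred data.
bus_bytes = 64 // 8
useful_fraction = 1 / bus_bytes

print(f"peak processing rate: {peak_bytes_per_s:.3e} bytes/s")
print(f"memory bandwidth:     {mem_bandwidth:.3e} bytes/s")
print(f"compute-to-bandwidth ratio: {ratio:.2f}")
print(f"useful fraction of a single-byte read: {useful_fraction:.3f}")
```

The ratio of about 10 is the gap that the cache hierarchy has to bridge.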

Second, there is significant overhead for the first access whereas sequential access is relatively cheap (this is often called a burst read). There are many more things to keep in mind, such as caching when we have multiple sockets, chiplets, and other structures. See this Wikipedia article for a more in-depth discussion.

The way to alleviate these constraints is to use a hierarchy of CPU caches that are actually fast enough to supply the processor with data. This is the driving force behind batching in deep learning. To keep matters simple, consider matrix-matrix multiplication, say \(\mathbf{A} = \mathbf{B}\mathbf{C}\). We have a number of options for calculating \(\mathbf{A}\). For instance, we could try the following:

1. We could compute \(\mathbf{A}_{ij} = \mathbf{B}_{i,:} \mathbf{C}_{:,j}\), i.e., we could compute it elementwise by means of dot products.
2. We could compute \(\mathbf{A}_{:,j} = \mathbf{B} \mathbf{C}_{:,j}\), i.e., we could compute it one column at a time. Likewise we could compute \(\mathbf{A}\) one row \(\mathbf{A}_{i,:}\) at a time.
3. We could simply compute \(\mathbf{A} = \mathbf{B} \mathbf{C}\).
4. We could break \(\mathbf{B}\) and \(\mathbf{C}\) into smaller block matrices and compute \(\mathbf{A}\) one block at a time.

If we follow the first option, we will need to copy one row and one column vector into the CPU each time we want to compute an element \(\mathbf{A}_{ij}\). Even worse, because matrix elements are stored sequentially in memory, we are required to access many disjoint locations for one of the two vectors as we read them. The second option is much more favorable. In it, we are able to keep the column vector \(\mathbf{C}_{:,j}\) in the CPU cache while we keep traversing \(\mathbf{B}\). This halves the memory bandwidth requirement, with correspondingly faster access. Of course, option 3 is most desirable. Unfortunately, most matrices might not fit entirely into cache (this is what we are discussing after all). However, option 4 offers a practically useful alternative: we can move blocks of the matrix into cache and multiply them locally. Optimized libraries take care of this for us. Let’s have a look at how efficient these operations are in practice.
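Two of the points above can be made concrete with a minimal NumPy sketch. It is an illustration under assumptions, not what an optimized library does internally: NumPy stands in for the framework, `blocked_matmul` is a hypothetical helper, and the tile size `block=64` is arbitrary rather than tuned to any real cache. The first part shows why a column of a row-major matrix touches disjoint memory locations; the second part carries out option 4 one tile at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((256, 192))
C = rng.standard_normal((192, 320))

# NumPy arrays are row-major by default. Consecutive elements of a row
# are 8 bytes apart (one float64), whereas consecutive elements of a
# column of C are a full row (320 * 8 bytes) apart.
print(B[0, :].strides, C[:, 0].strides)  # (8,) (2560,)


def blocked_matmul(B, C, block=64):
    """Compute B @ C by accumulating products of (block x block) tiles."""
    n, k = B.shape
    k2, m = C.shape
    assert k == k2, "inner dimensions must match"
    A = np.zeros((n, m), dtype=np.result_type(B, C))
    for i in range(0, n, block):            # tile rows of A
        for j in range(0, m, block):        # tile columns of A
            for p in range(0, k, block):    # tiles along the inner dimension
                A[i:i + block, j:j + block] += (
                    B[i:i + block, p:p + block] @ C[p:p + block, j:j + block]
                )
    return A


A_blocked = blocked_matmul(B, C)
print(np.allclose(A_blocked, B @ C))  # True
```

Each tile product only needs three small blocks at once, which is what allows them to stay in cache. NumPy slicing truncates at the array boundary, so the sketch also works when the dimensions are not multiples of `block`.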

Beyond computational efficiency, the overhead introduced by Python and by the deep learning framework itself is considerable. Recall that each time we execute a command the Python interpreter sends a command to the MXNet engine which needs to insert it into the computational graph and deal with it during scheduling. Such overhead can be quite detrimental. In short, it is highly advisable to use vectorization (and matrices) whenever possible.
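A rough way to observe this overhead is to time options 1 to 3 on a small problem. The sketch below uses NumPy as a stand-in for a deep learning framework, so what it measures is Python and NumPy call overhead rather than the MXNet engine. The helper names (`matmul_elementwise`, `matmul_columnwise`, `timed`) and the matrix size are assumptions made for illustration, and the timings will vary by machine.

```python
import time

import numpy as np


def matmul_elementwise(B, C):
    """Option 1: one dot product (and one Python-level call) per entry."""
    n, m = B.shape[0], C.shape[1]
    A = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            A[i, j] = np.dot(B[i, :], C[:, j])
    return A


def matmul_columnwise(B, C):
    """Option 2: one matrix-vector product per column of the result."""
    A = np.zeros((B.shape[0], C.shape[1]))
    for j in range(C.shape[1]):
        A[:, j] = B @ C[:, j]
    return A


def timed(fn, *args):
    """Return the result of fn(*args) and the elapsed wall-clock seconds."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start


rng = np.random.default_rng(0)
B = rng.standard_normal((128, 128))
C = rng.standard_normal((128, 128))

results = {}
for name, fn in [("element-wise", matmul_elementwise),
                 ("column-wise", matmul_columnwise),
                 ("full", np.matmul)]:
    A, seconds = timed(fn, B, C)
    results[name] = A
    print(f"{name:>12}: {seconds:.5f} s")
```

All three variants compute the same matrix; only the number of separate calls issued from Python differs (16384, 128, and 1 respectively for these shapes).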
