12.6. Concise Implementation for Multiple GPUs

Implementing parallelism from scratch for every new model is no fun. Moreover, there is significant benefit in optimizing synchronization tools for high performance. In the following, we'll show how to do this using Gluon. The math and the algorithms are the same as in Section 12.5. As before, we begin by importing the required modules (quite unsurprisingly, you'll need at least two GPUs to run this notebook).

import d2l
from mxnet import autograd, gluon, init, np, npx
from mxnet.gluon import nn
npx.set_np()

12.6.1. A Toy Network

Let's use a slightly more meaningful network than LeNet from the previous section, one that is still sufficiently easy and quick to train. We pick a ResNet-18 variant [He et al., 2016a]. Since the input images are tiny, we modify the architecture slightly. In particular, the difference from Section 7.6 is that we use a smaller convolution kernel, stride, and padding at the beginning. Moreover, we remove the max-pooling layer.

# Saved in the d2l package for later use
def resnet18(num_classes):
    """A slightly modified ResNet-18 model."""
    def resnet_block(num_channels, num_residuals, first_block=False):
        blk = nn.Sequential()
        for i in range(num_residuals):
            if i == 0 and not first_block:
                blk.add(d2l.Residual(
                    num_channels, use_1x1conv=True, strides=2))
            else:
                blk.add(d2l.Residual(num_channels))
        return blk

    net = nn.Sequential()
    # This model uses a smaller convolution kernel, stride, and padding and
    # removes the maximum pooling layer
    net.add(nn.Conv2D(64, kernel_size=3, strides=1, padding=1),
            nn.BatchNorm(), nn.Activation('relu'))
    net.add(resnet_block(64, 2, first_block=True),
            resnet_block(128, 2),
            resnet_block(256, 2),
            resnet_block(512, 2))
    net.add(nn.GlobalAvgPool2D(), nn.Dense(num_classes))
    return net

12.6.2. Parameter Initialization and Logistics

The initialize method allows us to set initial defaults for parameters on a device of our choice. For a refresher see Section 4.8. What is particularly convenient is that it also lets us initialize the network on multiple devices simultaneously. Let's see how this works in practice.

net = resnet18(10)
# Get a list of GPUs
ctx = d2l.try_all_gpus()
# Initialize the network on all of them
net.initialize(init=init.Normal(sigma=0.01), ctx=ctx)

Using the split_and_load function introduced in the previous section, we can divide a minibatch of data and copy portions to the list of devices provided by the context variable. The network object automatically uses the appropriate GPU to compute the value of the forward pass. As before, we generate 4 observations and split them over the GPUs.

x = np.random.uniform(size=(4, 1, 28, 28))
gpu_x = gluon.utils.split_and_load(x, ctx)
net(gpu_x[0]), net(gpu_x[1])
(array([[ 2.2610195e-06,  2.2045997e-06, -5.4046791e-06,  1.2869948e-06,
          5.1373158e-06, -3.8297981e-06,  1.4339003e-07,  5.4683455e-06,
         -2.8279192e-06, -3.9651109e-06],
        [ 2.0698667e-06,  2.0084674e-06, -5.6382505e-06,  1.0498467e-06,
          5.5506439e-06, -4.1065477e-06,  6.0830052e-07,  5.4521779e-06,
         -3.7365023e-06, -4.1891631e-06]], ctx=gpu(0)),
 array([[ 2.4629780e-06,  2.6015523e-06, -5.4362636e-06,  1.2938224e-06,
          5.6387898e-06, -4.1360108e-06,  3.5758808e-07,  5.5125256e-06,
         -3.1957325e-06, -4.2976326e-06],
        [ 1.9431675e-06,  2.2600425e-06, -5.2698183e-06,  1.4807407e-06,
          5.4830939e-06, -3.9678880e-06,  7.5751586e-08,  5.6764347e-06,
         -3.2530236e-06, -4.0943960e-06]], ctx=gpu(1)))
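
As a quick sanity check (a sketch, assuming two GPUs as above), we can confirm that split_and_load divides the minibatch along the first axis and places each shard on its own device:

print([data.shape for data in gpu_x])  # e.g. [(2, 1, 28, 28), (2, 1, 28, 28)]
print([data.ctx for data in gpu_x])    # e.g. [gpu(0), gpu(1)]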

Once data passes through the network, the corresponding parameters are initialized on the devices the data passed through. This means that initialization happens on a per-device basis. Since we picked GPU 0 and GPU 1 for initialization, the network is initialized only there, and not on the CPU. In fact, the parameters do not even exist on the CPU. We can verify this by trying to access the parameters on the CPU and observing the error that arises.

weight = net[0].params.get('weight')

try:
    # Without specifying a device, data() uses the default context (the CPU),
    # where the parameters have not been initialized
    weight.data()
except RuntimeError:
    print('not initialized on cpu')
weight.data(ctx[0])[0], weight.data(ctx[1])[0]
not initialized on cpu
(array([[[ 0.01382882, -0.01183044,  0.01417866],
         [-0.00319718,  0.00439528,  0.02562625],
         [-0.00835081,  0.01387452, -0.01035946]]], ctx=gpu(0)),
 array([[[ 0.01382882, -0.01183044,  0.01417866],
         [-0.00319718,  0.00439528,  0.02562625],
         [-0.00835081,  0.01387452, -0.01035946]]], ctx=gpu(1)))

Lastly, let's replace the code for evaluating accuracy with a version that works in parallel across multiple devices. This serves as a replacement for the evaluate_accuracy_gpu function from Section 6.6. The main difference is that we split a batch before invoking the network. All else is essentially identical.
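
For reference, here is a minimal sketch of the split_batch helper from the previous section, which the evaluation function below uses by default (the version saved in the d2l package may differ in detail); it simply applies split_and_load to both features and labels.

def split_batch(X, y, ctx_list):
    """Split features X and labels y across multiple devices (a sketch)."""
    assert X.shape[0] == y.shape[0]
    return (gluon.utils.split_and_load(X, ctx_list),
            gluon.utils.split_and_load(y, ctx_list))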

# Saved in the d2l package for later use
def evaluate_accuracy_gpus(net, data_iter, split_f=d2l.split_batch):
    # Query the list of devices
    ctx = list(net.collect_params().values())[0].list_ctx()
    metric = d2l.Accumulator(2)  # num_correct_examples, num_examples
    for features, labels in data_iter:
        Xs, ys = split_f(features, labels, ctx)
        pys = [net(X) for X in Xs]  # Run in parallel
        metric.add(sum(float(d2l.accuracy(py, y)) for py, y in zip(pys, ys)),
                   labels.size)
    return metric[0]/metric[1]

12.6.3. Training

As before, the training code needs to perform a number of basic functions for efficient parallelism:

  • Network parameters need to be initialized across all devices.

  • While iterating over the dataset, minibatches are to be divided across all devices.

  • We compute the loss and its gradient in parallel across devices.

  • Losses are aggregated (by the trainer method) and parameters are updated accordingly.

In the end we compute the accuracy (again in parallel) to report the final performance of the network. The training routine is quite similar to implementations in previous chapters, except that we need to split and aggregate the data.

def train(num_gpus, batch_size, lr):
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
    ctx = [d2l.try_gpu(i) for i in range(num_gpus)]
    net.initialize(init=init.Normal(sigma=0.01), ctx=ctx, force_reinit=True)
    trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': lr})
    loss = gluon.loss.SoftmaxCrossEntropyLoss()
    timer, num_epochs = d2l.Timer(), 10
    animator = d2l.Animator('epoch', 'test acc', xlim=[1, num_epochs])
    for epoch in range(num_epochs):
        timer.start()
        for features, labels in train_iter:
            Xs, ys = d2l.split_batch(features, labels, ctx)
            with autograd.record():
                losses = [loss(net(X), y) for X, y in zip(Xs, ys)]
            for l in losses:
                l.backward()
            trainer.step(batch_size)
        npx.waitall()
        timer.stop()
        animator.add(epoch+1, (evaluate_accuracy_gpus(net, test_iter),))
    print('test acc: %.2f, %.1f sec/epoch on %s' % (
        animator.Y[0][-1], timer.avg(), ctx))

12.6.4. Experiments

Let’s see how this works in practice. As a warmup we train the network on a single GPU.

train(num_gpus=1, batch_size=256, lr=0.1)
test acc: 0.93, 13.4 sec/epoch on [gpu(0)]
[Figure: test accuracy over training epochs on a single GPU]

Next we use 2 GPUs for training. Compared with LeNet, ResNet-18 is considerably more complex. This is where parallelization shows its advantage: the time spent on computation is meaningfully larger than the time spent synchronizing parameters, so the parallelization overhead matters less and scalability improves.

train(num_gpus=2, batch_size=512, lr=0.2)
test acc: 0.88, 6.9 sec/epoch on [gpu(0), gpu(1)]
[Figure: test accuracy over training epochs on two GPUs]
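
To put these numbers into perspective, a quick back-of-the-envelope calculation (using the epoch times reported above; your numbers will vary with the hardware) gives the throughput speedup from adding a second GPU. Note that the test accuracy is slightly lower here, since we also doubled the batch size and the learning rate.

single_gpu_time, dual_gpu_time = 13.4, 6.9  # Seconds per epoch from the runs above
print('speedup: %.2fx' % (single_gpu_time / dual_gpu_time))  # Roughly 1.94x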

12.6.5. Summary

  • Gluon provides primitives for model initialization across multiple devices by accepting a context list.

  • Data is automatically evaluated on the devices where it can be found.

  • Take care to initialize the networks on each device before trying to access the parameters on that device. Otherwise you will encounter an error.

  • The optimization algorithms automatically aggregate over multiple GPUs.

12.6.6. Exercises

  1. This section uses ResNet-18. Try different epochs, batch sizes, and learning rates. Use more GPUs for computation. What happens if you try this on a p2.16xlarge instance with 16 GPUs?

  2. Sometimes, different devices provide different computing power. We could use the GPUs and the CPU at the same time. How should we divide the work? Is it worth the effort? Why? Why not?

  3. What happens if we drop npx.waitall()? How would you modify training such that you have an overlap of up to two steps for parallelism?

12.6.7. Discussions
