9.4. Multi-GPU Computation Implementation from Scratch¶
In this section, we will show how to use multiple GPUs for computation.
For example, we can train the same model using multiple GPUs. As you
might expect, running the programs in this section requires at least two
GPUs. In fact, installing multiple GPUs on a single machine is common
because there are usually multiple PCIe slots on the motherboard. If the
NVIDIA driver is properly installed, we can use the nvidia-smi
command to view all GPUs on the current computer.
In [1]:
!nvidia-smi
Tue Feb 19 07:49:54 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.37                 Driver Version: 396.37                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   29C    P0    45W / 150W |      0MiB /  7618MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   37C    P0    41W / 150W |      0MiB /  7618MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
As we discussed in the “Automatic Parallel Computation” section, most operations can use all the computational resources of all CPUs, or all computational resources of a single GPU. However, if we use multiple GPUs for model training, we still need to implement the corresponding algorithms. Of these, the most commonly used algorithm is called data parallelism.
9.4.1. Data Parallelism¶
In the deep learning field, Data Parallelism is currently the most widely used method for dividing model training tasks among multiple GPUs. Recall the process for training models using optimization algorithms described in the “Mini-batch Stochastic Gradient Descent” section. Now, we will demonstrate how data parallelism works using mini-batch stochastic gradient descent as an example.
Assume there are \(k\) GPUs on a machine. Given the model to be trained, each GPU will maintain a complete set of model parameters independently. In any iteration of model training, given a random mini-batch, we divide the examples in the batch into \(k\) portions and distribute one to each GPU. Then, each GPU will calculate the local gradient of the model parameters based on the mini-batch subset it was assigned and the model parameters it maintains. Next, we add together the local gradients on the \(k\) GPUs to get the current mini-batch stochastic gradient. After that, each GPU uses this mini-batch stochastic gradient to update the complete set of model parameters that it maintains. Fig. 9.1 depicts the mini-batch stochastic gradient calculation using data parallelism and two GPUs.
Fig. 9.1 Calculation of mini-batch stochastic gradient using data parallelism and two GPUs.
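To make the update concrete, suppose the mini-batch \(\mathcal{B}\) is split into disjoint subsets \(\mathcal{B}_1, \ldots, \mathcal{B}_k\), one per GPU. Writing \(\ell\) for the per-example loss, \(\mathbf{w}\) for the model parameters, and \(\eta\) for the learning rate (notation introduced here only for illustration), the aggregated gradient and the update applied on every GPU can be written as

\[\mathbf{g} = \sum_{i=1}^{k} \sum_{j \in \mathcal{B}_i} \nabla_{\mathbf{w}} \ell(\mathbf{x}_j, y_j, \mathbf{w}), \qquad \mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \mathbf{g}.\]

Since \(\mathbf{g}\) is exactly the gradient a single GPU would compute on the full mini-batch, data parallelism should, in principle, produce the same updates as single-GPU training with the same batch size and learning rate.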
In order to implement data parallelism in a multi-GPU training scenario from scratch, we first import the required packages or modules.
In [2]:
import sys
sys.path.insert(0, '..')
import d2l
import mxnet as mx
from mxnet import autograd, nd
from mxnet.gluon import loss as gloss
import time
9.4.2. Define the Model¶
We use LeNet, introduced in the “Convolutional Neural Networks (LeNet)” section, as the sample model for this section. Here, the model implementation only uses NDArray.
In [3]:
# Initialize model parameters
scale = 0.01
W1 = nd.random.normal(scale=scale, shape=(20, 1, 3, 3))
b1 = nd.zeros(shape=20)
W2 = nd.random.normal(scale=scale, shape=(50, 20, 5, 5))
b2 = nd.zeros(shape=50)
W3 = nd.random.normal(scale=scale, shape=(800, 128))
b3 = nd.zeros(shape=128)
W4 = nd.random.normal(scale=scale, shape=(128, 10))
b4 = nd.zeros(shape=10)
params = [W1, b1, W2, b2, W3, b3, W4, b4]
# Define the model
def lenet(X, params):
    h1_conv = nd.Convolution(data=X, weight=params[0], bias=params[1],
                             kernel=(3, 3), num_filter=20)
    h1_activation = nd.relu(h1_conv)
    h1 = nd.Pooling(data=h1_activation, pool_type='avg', kernel=(2, 2),
                    stride=(2, 2))
    h2_conv = nd.Convolution(data=h1, weight=params[2], bias=params[3],
                             kernel=(5, 5), num_filter=50)
    h2_activation = nd.relu(h2_conv)
    h2 = nd.Pooling(data=h2_activation, pool_type='avg', kernel=(2, 2),
                    stride=(2, 2))
    h2 = nd.flatten(h2)
    h3_linear = nd.dot(h2, params[4]) + params[5]
    h3 = nd.relu(h3_linear)
    y_hat = nd.dot(h3, params[6]) + params[7]
    return y_hat
# Cross-entropy loss function
loss = gloss.SoftmaxCrossEntropyLoss()
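Before moving on, we can sanity-check the parameter shapes by running the model on a small dummy batch on the CPU (a minimal check, assuming 28×28 single-channel inputs as in Fashion-MNIST):

# A quick shape check on the CPU: four dummy 28x28 single-channel images
X = nd.random.uniform(shape=(4, 1, 28, 28))
print(lenet(X, params).shape)  # expected: (4, 10)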
9.4.3. Synchronize Data Among Multiple GPUs¶
We need to implement some auxiliary functions to synchronize data among multiple GPUs. The following get_params function copies the model parameters to a specific GPU and attaches gradient buffers to them.
In [4]:
def get_params(params, ctx):
    new_params = [p.copyto(ctx) for p in params]
    for p in new_params:
        p.attach_grad()
    return new_params
Next, try copying the model parameters params to gpu(0).
In [5]:
new_params = get_params(params, mx.gpu(0))
print('b1 weight:', new_params[1])
print('b1 grad:', new_params[1].grad)
b1 weight:
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
<NDArray 20 @gpu(0)>
b1 grad:
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
<NDArray 20 @gpu(0)>
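Because each GPU in data parallelism maintains its own complete copy of the parameters, get_params would be called once per device; for example (assuming a second GPU is available, with a hypothetical variable name):

params_gpu1 = get_params(params, mx.gpu(1))  # independent copy on the second GPU
print('b1 is on:', params_gpu1[1].context)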
During multi-GPU training, we need to sum data (such as gradients) that resides on different GPUs. The following allreduce function adds up the data on all the GPUs and then broadcasts the result back to every GPU.
In [6]:
def allreduce(data):
    # Accumulate the data from all GPUs onto the first GPU
    for i in range(1, len(data)):
        data[0][:] += data[i].copyto(data[0].context)
    # Broadcast the summed result back to the other GPUs
    for i in range(1, len(data)):
        data[0].copyto(data[i])
Perform a simple test of the allreduce function.
In [7]:
data = [nd.ones((1, 2), ctx=mx.gpu(i)) * (i + 1) for i in range(2)]
print('before allreduce:', data)
allreduce(data)
print('after allreduce:', data)
before allreduce: [
[[1. 1.]]
<NDArray 1x2 @gpu(0)>,
[[2. 2.]]
<NDArray 1x2 @gpu(1)>]
after allreduce: [
[[3. 3.]]
<NDArray 1x2 @gpu(0)>,
[[3. 3.]]
<NDArray 1x2 @gpu(1)>]
Given a batch of data instances, the following split_and_load function splits the batch and copies each portion to the corresponding GPU.
In [8]:
def split_and_load(data, ctx):
    n, k = data.shape[0], len(ctx)
    m = n // k  # For simplicity, we assume the data is evenly divisible
    assert m * k == n, '# examples is not divisible by # devices.'
    return [data[i * m: (i + 1) * m].as_in_context(ctx[i]) for i in range(k)]
Now, we try to divide the 6 data instances equally between 2 GPUs using the split_and_load function.
In [9]:
batch = nd.arange(24).reshape((6, 4))
ctx = [mx.gpu(0), mx.gpu(1)]
splitted = split_and_load(batch, ctx)
print('input: ', batch)
print('load into', ctx)
print('output:', splitted)
input:
[[ 0.  1.  2.  3.]
 [ 4.  5.  6.  7.]
 [ 8.  9. 10. 11.]
 [12. 13. 14. 15.]
 [16. 17. 18. 19.]
 [20. 21. 22. 23.]]
<NDArray 6x4 @cpu(0)>
load into [gpu(0), gpu(1)]
output: [
[[ 0.  1.  2.  3.]
 [ 4.  5.  6.  7.]
 [ 8.  9. 10. 11.]]
<NDArray 3x4 @gpu(0)>,
[[12. 13. 14. 15.]
 [16. 17. 18. 19.]
 [20. 21. 22. 23.]]
<NDArray 3x4 @gpu(1)>]
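The same function also works on the label vector, which the train_batch function below relies on; a quick check (using a hypothetical labels variable):

labels = nd.arange(6)
print('labels split:', split_and_load(labels, ctx))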
9.4.4. Multi-GPU Training on a Single Mini-batch¶
Now we can implement multi-GPU training on a single mini-batch. Its
implementation is primarily based on the data parallelism approach
described in this section. We will use the auxiliary functions we just
discussed, allreduce and split_and_load, to synchronize the data among multiple GPUs.
In [10]:
def train_batch(X, y, gpu_params, ctx, lr):
    # When ctx contains multiple GPUs, the mini-batch of data instances is
    # divided and copied to each GPU
    gpu_Xs, gpu_ys = split_and_load(X, ctx), split_and_load(y, ctx)
    with autograd.record():  # The loss is calculated separately on each GPU
        ls = [loss(lenet(gpu_X, gpu_W), gpu_y)
              for gpu_X, gpu_y, gpu_W in zip(gpu_Xs, gpu_ys, gpu_params)]
    for l in ls:  # Backpropagation is performed separately on each GPU
        l.backward()
    # Add up all the gradients from each GPU and then broadcast them to all
    # the GPUs
    for i in range(len(gpu_params[0])):
        allreduce([gpu_params[c][i].grad for c in range(len(ctx))])
    # The model parameters are updated separately on each GPU. Since each GPU
    # now holds the gradient summed over the full mini-batch, we normalize by
    # the full batch size X.shape[0]
    for param in gpu_params:
        d2l.sgd(param, lr, X.shape[0])
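Before wiring this into a full training loop, a quick smoke test (a sketch assuming two GPUs and a random mini-batch of six examples) confirms that one update step runs end to end:

# Hypothetical smoke test: one data-parallel update on random data
ctx = [mx.gpu(0), mx.gpu(1)]
gpu_params = [get_params(params, c) for c in ctx]
X = nd.random.uniform(shape=(6, 1, 28, 28))
y = nd.array([0, 1, 2, 3, 4, 5])
train_batch(X, y, gpu_params, ctx, lr=0.2)
nd.waitall()
print('b1 after one step:', gpu_params[0][1])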
9.4.5. Training Functions¶
Now, we can define the training function. Here the training function is slightly different from the one used in the previous chapter. For example, here, we need to copy all the model parameters to multiple GPUs based on data parallelism and perform multi-GPU training on a single mini-batch for each iteration.
In [11]:
def train(num_gpus, batch_size, lr):
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
    ctx = [mx.gpu(i) for i in range(num_gpus)]
    print('running on:', ctx)
    # Copy the model parameters to num_gpus GPUs
    gpu_params = [get_params(params, c) for c in ctx]
    for epoch in range(4):
        start = time.time()
        for X, y in train_iter:
            # Perform multi-GPU training for a single mini-batch
            train_batch(X, y, gpu_params, ctx, lr)
            nd.waitall()
        train_time = time.time() - start
        def net(x):  # Verify the model on GPU 0
            return lenet(x, gpu_params[0])
        test_acc = d2l.evaluate_accuracy(test_iter, net, ctx[0])
        print('epoch %d, time: %.1f sec, test acc: %.2f'
              % (epoch + 1, train_time, test_acc))
9.4.6. Multi-GPU Training Experiment¶
We will start by training with a single GPU. Assume the batch size is 256 and the learning rate is 0.2.
In [12]:
train(num_gpus=1, batch_size=256, lr=0.2)
running on: [gpu(0)]
epoch 1, time: 2.7 sec, test acc: 0.10
epoch 2, time: 2.3 sec, test acc: 0.70
epoch 3, time: 2.2 sec, test acc: 0.77
epoch 4, time: 2.2 sec, test acc: 0.77
By keeping the batch size and learning rate unchanged and changing the number of GPUs to 2, we can see that the improvement in test accuracy is roughly the same as in the results from the previous experiment. Because of the extra communication overhead, we did not observe a significant reduction in the training time.
In [13]:
train(num_gpus=2, batch_size=256, lr=0.2)
running on: [gpu(0), gpu(1)]
epoch 1, time: 2.5 sec, test acc: 0.10
epoch 2, time: 2.2 sec, test acc: 0.64
epoch 3, time: 2.2 sec, test acc: 0.75
epoch 4, time: 2.2 sec, test acc: 0.76
9.4.7. Summary¶
- We can use data parallelism to more fully utilize the computational resources of multiple GPUs to implement multi-GPU model training.
- With the same hyper-parameters, the model reaches roughly the same accuracy regardless of the number of GPUs used.
9.4.8. Exercises¶
- In a multi-GPU training experiment, use 2 GPUs for training and double the batch_size to 512. How does the training time change? If we want a test accuracy comparable with the results of single-GPU training, how should the learning rate be adjusted?
- Change the model prediction part of the experiment to multi-GPU prediction.