Single Shot Multibox Detection (SSD)
====================================
In the previous few sections, we have introduced bounding boxes, anchor
boxes, multiscale object detection, and datasets. Now, we will use this
background knowledge to construct an object detection model: single shot
multibox detection (SSD) :cite:`Liu.Anguelov.Erhan.ea.2016`. This
quick and easy model is already widely used. Some of the design concepts
and implementation details of this model are also applicable to other
object detection models.
Model
-----
:numref:`fig_ssd` shows the design of an SSD model. The model’s main
components are a base network block and several multiscale feature
blocks connected in a series. Here, the base network block is used to
extract features of original images, and it generally takes the form of
a deep convolutional neural network. The paper on SSDs chooses to place
a truncated VGG before the classification layer
:cite:`Liu.Anguelov.Erhan.ea.2016`, but this is now commonly replaced
by ResNet. We can design the base network so that it outputs larger
heights and widths. In this way, more anchor boxes are generated based
on this feature map, allowing us to detect smaller objects. Next, each
multiscale feature block reduces the height and width of the feature map
provided by the previous layer (for example, it may reduce the sizes by
half). The blocks then use each element in the feature map to expand the
receptive field on the input image. In this way, the closer a multiscale
feature block is to the top of :numref:`fig_ssd` the smaller its
output feature map, and the fewer the anchor boxes that are generated
based on the feature map. In addition, the closer a feature block is to
the top, the larger the receptive field of each element in the feature
map and the better suited it is to detect larger objects. As the SSD
generates different numbers of anchor boxes of different sizes based on
the base network block and each multiscale feature block and then
predicts the categories and offsets (i.e., predicted bounding boxes) of
the anchor boxes in order to detect objects of different sizes, SSD is a
multiscale object detection model.
.. _fig_ssd:
.. figure:: ../img/ssd.svg
The SSD is composed of a base network block and several multiscale
feature blocks connected in a series.
Next, we will describe the implementation of the modules in
:numref:`fig_ssd`. First, we need to discuss the implementation of
category prediction and bounding box prediction.
Category Prediction Layer
~~~~~~~~~~~~~~~~~~~~~~~~~
Set the number of object categories to :math:`q`. In this case, the
number of anchor box categories is :math:`q+1`, with 0 indicating an
anchor box that only contains background. For a certain scale, set the
height and width of the feature map to :math:`h` and :math:`w`,
respectively. If we use each element as the center to generate :math:`a`
anchor boxes, we need to classify a total of :math:`hwa` anchor boxes.
If we use a fully connected layer (FCN) for the output, this will likely
result in an excessive number of model parameters. Recall how we used
convolutional layer channels to output category predictions in
:numref:`sec_nin`. SSD uses the same method to reduce the model
complexity.
Specifically, the category prediction layer uses a convolutional layer
that maintains the input height and width. Thus, the output and input
have a one-to-one correspondence to the spatial coordinates along the
width and height of the feature map. Assuming that the output and input
have the same spatial coordinates :math:`(x, y)`, the channel for the
coordinates :math:`(x, y)` on the output feature map contains the
category predictions for all anchor boxes generated using the input
feature map coordinates :math:`(x, y)` as the center. Therefore, there
are :math:`a(q+1)` output channels, with the output channels indexed as
:math:`i(q+1) + j` (:math:`0 \leq j \leq q`) representing the
predictions of the category index :math:`j` for the anchor box index
:math:`i`.
Now, we will define a category prediction layer of this type. After we
specify the parameters :math:`a` and :math:`q`, it uses a
:math:`3\times3` convolutional layer with a padding of 1. The heights
and widths of the input and output of this convolutional layer remain
unchanged.
.. code:: python
%matplotlib inline
import d2l
from mxnet import autograd, contrib, gluon, image, init, np, npx
from mxnet.gluon import nn
npx.set_np()
def cls_predictor(num_anchors, num_classes):
return nn.Conv2D(num_anchors * (num_classes + 1), kernel_size=3,
padding=1)
Bounding Box Prediction Layer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The design of the bounding box prediction layer is similar to that of
the category prediction layer. The only difference is that, here, we
need to predict 4 offsets for each anchor box, rather than :math:`q+1`
categories.
.. code:: python
def bbox_predictor(num_anchors):
return nn.Conv2D(num_anchors * 4, kernel_size=3, padding=1)
Concatenating Predictions for Multiple Scales
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As we mentioned, SSD uses feature maps based on multiple scales to
generate anchor boxes and predict their categories and offsets. Because
the shapes and number of anchor boxes centered on the same element
differ for the feature maps of different scales, the prediction outputs
at different scales may have different shapes.
In the following example, we use the same batch of data to construct
feature maps of two different scales, ``Y1`` and ``Y2``. Here, ``Y2``
has half the height and half the width of ``Y1``. Using category
prediction as an example, we assume that each element in the ``Y1`` and
``Y2`` feature maps generates five (Y1) or three (Y2) anchor boxes. When
there are 10 object categories, the number of category prediction output
channels is either :math:`5\times(10+1)=55` or :math:`3\times(10+1)=33`.
The format of the prediction output is (batch size, number of channels,
height, width). As you can see, except for the batch size, the sizes of
the other dimensions are different. Therefore, we must transform them
into a consistent format and concatenate the predictions of the multiple
scales to facilitate subsequent computation.
.. code:: python
def forward(x, block):
block.initialize()
return block(x)
Y1 = forward(np.zeros((2, 8, 20, 20)), cls_predictor(5, 10))
Y2 = forward(np.zeros((2, 16, 10, 10)), cls_predictor(3, 10))
(Y1.shape, Y2.shape)
.. parsed-literal::
:class: output
((2, 55, 20, 20), (2, 33, 10, 10))
The channel dimension contains the predictions for all anchor boxes with
the same center. We first move the channel dimension to the final
dimension. Because the batch size is the same for all scales, we can
convert the prediction results to binary format (batch size, height
:math:`\times` width :math:`\times` number of channels) to facilitate
subsequent concatenation on the :math:`1^{\mathrm{st}}` dimension.
.. code:: python
def flatten_pred(pred):
return npx.batch_flatten(pred.transpose(0, 2, 3, 1))
def concat_preds(preds):
return np.concatenate([flatten_pred(p) for p in preds], axis=1)
Thus, regardless of the different shapes of ``Y1`` and ``Y2``, we can
still concatenate the prediction results for the two different scales of
the same batch.
.. code:: python
concat_preds([Y1, Y2]).shape
.. parsed-literal::
:class: output
(2, 25300)
Height and Width Downsample Block
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For multiscale object detection, we define the following
``down_sample_blk`` block, which reduces the height and width by 50%.
This block consists of two :math:`3\times3` convolutional layers with a
padding of 1 and a :math:`2\times2` maximum pooling layer with a stride
of 2 connected in a series. As we know, :math:`3\times3` convolutional
layers with a padding of 1 do not change the shape of feature maps.
However, the subsequent pooling layer directly reduces the size of the
feature map by half. Because :math:`1\times 2+(3-1)+(3-1)=6`, each
element in the output feature map has a receptive field on the input
feature map of the shape :math:`6\times6`. As you can see, the height
and width downsample block enlarges the receptive field of each element
in the output feature map.
.. code:: python
def down_sample_blk(num_channels):
blk = nn.Sequential()
for _ in range(2):
blk.add(nn.Conv2D(num_channels, kernel_size=3, padding=1),
nn.BatchNorm(in_channels=num_channels),
nn.Activation('relu'))
blk.add(nn.MaxPool2D(2))
return blk
By testing forward computation in the height and width downsample block,
we can see that it changes the number of input channels and halves the
height and width.
.. code:: python
forward(np.zeros((2, 3, 20, 20)), down_sample_blk(10)).shape
.. parsed-literal::
:class: output
(2, 10, 10, 10)
Base Network Block
~~~~~~~~~~~~~~~~~~
The base network block is used to extract features from original images.
To simplify the computation, we will construct a small base network.
This network consists of three height and width downsample blocks
connected in a series, so it doubles the number of channels at each
step. When we input an original image with the shape
:math:`256\times256`, the base network block outputs a feature map with
the shape :math:`32 \times 32`.
.. code:: python
def base_net():
blk = nn.Sequential()
for num_filters in [16, 32, 64]:
blk.add(down_sample_blk(num_filters))
return blk
forward(np.zeros((2, 3, 256, 256)), base_net()).shape
.. parsed-literal::
:class: output
(2, 64, 32, 32)
The Complete Model
~~~~~~~~~~~~~~~~~~
The SSD model contains a total of five modules. Each module outputs a
feature map used to generate anchor boxes and predict the categories and
offsets of these anchor boxes. The first module is the base network
block, modules two to four are height and width downsample blocks, and
the fifth module is a global maximum pooling layer that reduces the
height and width to 1. Therefore, modules two to five are all multiscale
feature blocks shown in :numref:`fig_ssd`.
.. code:: python
def get_blk(i):
if i == 0:
blk = base_net()
elif i == 4:
blk = nn.GlobalMaxPool2D()
else:
blk = down_sample_blk(128)
return blk
Now, we will define the forward computation process for each module. In
contrast to the previously-described convolutional neural networks, this
module not only returns feature map ``Y`` output by convolutional
computation, but also the anchor boxes of the current scale generated
from ``Y`` and their predicted categories and offsets.
.. code:: python
def blk_forward(X, blk, size, ratio, cls_predictor, bbox_predictor):
Y = blk(X)
anchors = npx.multibox_prior(Y, sizes=size, ratios=ratio)
cls_preds = cls_predictor(Y)
bbox_preds = bbox_predictor(Y)
return (Y, anchors, cls_preds, bbox_preds)
As we mentioned, the closer a multiscale feature block is to the top in
:numref:`fig_ssd`, the larger the objects it detects and the larger
the anchor boxes it must generate. Here, we first divide the interval
from 0.2 to 1.05 into five equal parts to determine the sizes of smaller
anchor boxes at different scales: 0.2, 0.37, 0.54, etc. Then, according
to :math:`\sqrt{0.2 \times 0.37} = 0.272`,
:math:`\sqrt{0.37 \times 0.54} = 0.447`, and similar formulas, we
determine the sizes of larger anchor boxes at the different scales.
.. code:: python
sizes = [[0.2, 0.272], [0.37, 0.447], [0.54, 0.619], [0.71, 0.79],
[0.88, 0.961]]
ratios = [[1, 2, 0.5]] * 5
num_anchors = len(sizes[0]) + len(ratios[0]) - 1
Now, we can define the complete model, ``TinySSD``.
.. code:: python
class TinySSD(nn.Block):
def __init__(self, num_classes, **kwargs):
super(TinySSD, self).__init__(**kwargs)
self.num_classes = num_classes
for i in range(5):
# The assignment statement is self.blk_i = get_blk(i)
setattr(self, 'blk_%d' % i, get_blk(i))
setattr(self, 'cls_%d' % i, cls_predictor(num_anchors,
num_classes))
setattr(self, 'bbox_%d' % i, bbox_predictor(num_anchors))
def forward(self, X):
anchors, cls_preds, bbox_preds = [None] * 5, [None] * 5, [None] * 5
for i in range(5):
# getattr(self, 'blk_%d' % i) accesses self.blk_i
X, anchors[i], cls_preds[i], bbox_preds[i] = blk_forward(
X, getattr(self, 'blk_%d' % i), sizes[i], ratios[i],
getattr(self, 'cls_%d' % i), getattr(self, 'bbox_%d' % i))
# In the reshape function, 0 indicates that the batch size remains
# unchanged
anchors = np.concatenate(anchors, axis=1)
cls_preds = concat_preds(cls_preds)
cls_preds = cls_preds.reshape(
cls_preds.shape[0], -1, self.num_classes + 1)
bbox_preds = concat_preds(bbox_preds)
return anchors, cls_preds, bbox_preds
We now create an SSD model instance and use it to perform forward
computation on image minibatch ``X``, which has a height and width of
256 pixels. As we verified previously, the first module outputs a
feature map with the shape :math:`32 \times 32`. Because modules two to
four are height and width downsample blocks, module five is a global
pooling layer, and each element in the feature map is used as the center
for 4 anchor boxes, a total of
:math:`(32^2 + 16^2 + 8^2 + 4^2 + 1)\times 4 = 5444` anchor boxes are
generated for each image at the five scales.
.. code:: python
net = TinySSD(num_classes=1)
net.initialize()
X = np.zeros((32, 3, 256, 256))
anchors, cls_preds, bbox_preds = net(X)
print('output anchors:', anchors.shape)
print('output class preds:', cls_preds.shape)
print('output bbox preds:', bbox_preds.shape)
.. parsed-literal::
:class: output
output anchors: (1, 5444, 4)
output class preds: (32, 5444, 2)
output bbox preds: (32, 21776)
Training
--------
Now, we will explain, step by step, how to train the SSD model for
object detection.
Data Reading and Initialization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We read the Pikachu dataset we created in the previous section.
.. code:: python
batch_size = 32
train_iter, _ = d2l.load_data_pikachu(batch_size)
There is 1 category in the Pikachu dataset. After defining the module,
we need to initialize the model parameters and define the optimization
algorithm.
.. code:: python
ctx, net = d2l.try_gpu(), TinySSD(num_classes=1)
net.initialize(init=init.Xavier(), ctx=ctx)
trainer = gluon.Trainer(net.collect_params(), 'sgd',
{'learning_rate': 0.2, 'wd': 5e-4})
Defining Loss and Evaluation Functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Object detection is subject to two types of losses. The first is anchor
box category loss. For this, we can simply reuse the cross-entropy loss
function we used in image classification. The second loss is positive
anchor box offset loss. Offset prediction is a normalization problem.
However, here, we do not use the squared loss introduced previously.
Rather, we use the :math:`L_1` norm loss, which is the absolute value of
the difference between the predicted value and the ground-truth value.
The mask variable ``bbox_masks`` removes negative anchor boxes and
padding anchor boxes from the loss calculation. Finally, we add the
anchor box category and offset losses to find the final loss function
for the model.
.. code:: python
cls_loss = gluon.loss.SoftmaxCrossEntropyLoss()
bbox_loss = gluon.loss.L1Loss()
def calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels, bbox_masks):
cls = cls_loss(cls_preds, cls_labels)
bbox = bbox_loss(bbox_preds * bbox_masks, bbox_labels * bbox_masks)
return cls + bbox
We can use the accuracy rate to evaluate the classification results. As
we use the :math:`L_1` norm loss, we will use the average absolute error
to evaluate the bounding box prediction results.
.. code:: python
def cls_eval(cls_preds, cls_labels):
# Because the category prediction results are placed in the final
# dimension, argmax must specify this dimension
return float((cls_preds.argmax(axis=-1) == cls_labels).sum())
def bbox_eval(bbox_preds, bbox_labels, bbox_masks):
return float((np.abs((bbox_labels - bbox_preds) * bbox_masks)).sum())
Training the Model
~~~~~~~~~~~~~~~~~~
During model training, we must generate multiscale anchor boxes
(``anchors``) in the model’s forward computation process and predict the
category (``cls_preds``) and offset (``bbox_preds``) for each anchor
box. Afterwards, we label the category (``cls_labels``) and offset
(``bbox_labels``) of each generated anchor box based on the label
information ``Y``. Finally, we calculate the loss function using the
predicted and labeled category and offset values. To simplify the code,
we do not evaluate the training dataset here.
.. code:: python
num_epochs, timer = 20, d2l.Timer()
animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
legend=['class error', 'bbox mae'])
for epoch in range(num_epochs):
# accuracy_sum, mae_sum, num_examples, num_labels
metric = d2l.Accumulator(4)
train_iter.reset() # Read data from the start.
for batch in train_iter:
timer.start()
X = batch.data[0].as_in_context(ctx)
Y = batch.label[0].as_in_context(ctx)
with autograd.record():
# Generate multiscale anchor boxes and predict the category and
# offset of each
anchors, cls_preds, bbox_preds = net(X)
# Label the category and offset of each anchor box
bbox_labels, bbox_masks, cls_labels = npx.multibox_target(
anchors, Y, cls_preds.transpose(0, 2, 1))
# Calculate the loss function using the predicted and labeled
# category and offset values
l = calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels,
bbox_masks)
l.backward()
trainer.step(batch_size)
metric.add(cls_eval(cls_preds, cls_labels), cls_labels.size,
bbox_eval(bbox_preds, bbox_labels, bbox_masks),
bbox_labels.size)
cls_err, bbox_mae = 1-metric[0]/metric[1], metric[2]/metric[3]
animator.add(epoch+1, (cls_err, bbox_mae))
print('class err %.2e, bbox mae %.2e' % (cls_err, bbox_mae))
print('%.1f examples/sec on %s' % (train_iter.num_image/timer.stop(), ctx))
.. parsed-literal::
:class: output
class err 2.43e-03, bbox mae 2.77e-03
4290.1 examples/sec on gpu(0)
.. figure:: output_ssd_265054_35_1.svg
Prediction
----------
In the prediction stage, we want to detect all objects of interest in
the image. Below, we read the test image and transform its size. Then,
we convert it to the four-dimensional format required by the
convolutional layer.
.. code:: python
img = image.imread('../img/pikachu.jpg')
feature = image.imresize(img, 256, 256).astype('float32')
X = np.expand_dims(feature.transpose(2, 0, 1), axis=0)
Using the ``MultiBoxDetection`` function, we predict the bounding boxes
based on the anchor boxes and their predicted offsets. Then, we use
non-maximum suppression to remove similar bounding boxes.
.. code:: python
def predict(X):
anchors, cls_preds, bbox_preds = net(X.as_in_context(ctx))
cls_probs = npx.softmax(cls_preds).transpose(0, 2, 1)
output = npx.multibox_detection(cls_probs, bbox_preds, anchors)
idx = [i for i, row in enumerate(output[0]) if row[0] != -1]
return output[0, idx]
output = predict(X)
Finally, we take all the bounding boxes with a confidence level of at
least 0.3 and display them as the final output.
.. code:: python
def display(img, output, threshold):
d2l.set_figsize((5, 5))
fig = d2l.plt.imshow(img.asnumpy())
for row in output:
score = float(row[1])
if score < threshold:
continue
h, w = img.shape[0:2]
bbox = [row[2:6] * np.array((w, h, w, h), ctx=row.context)]
d2l.show_bboxes(fig.axes, bbox, '%.2f' % score, 'w')
display(img, output, threshold=0.3)
.. figure:: output_ssd_265054_41_0.svg
Summary
-------
- SSD is a multiscale object detection model. This model generates
different numbers of anchor boxes of different sizes based on the
base network block and each multiscale feature block and predicts the
categories and offsets of the anchor boxes to detect objects of
different sizes.
- During SSD model training, the loss function is calculated using the
predicted and labeled category and offset values.
Exercises
---------
1. Due to space limitations, we have ignored some of the implementation
details of SSD models in this experiment. Can you further improve the
model in the following areas?
Loss Function
~~~~~~~~~~~~~
For the predicted offsets, replace :math:`L_1` norm loss with
:math:`L_1` regularization loss. This loss function uses a square
function around zero for greater smoothness. This is the regularized
area controlled by the hyperparameter :math:`\sigma`:
.. math::
f(x) =
\begin{cases}
(\sigma x)^2/2,& \text{if }|x| < 1/\sigma^2\\
|x|-0.5/\sigma^2,& \text{otherwise}
\end{cases}
When :math:`\sigma` is large, this loss is similar to the :math:`L_1`
norm loss. When the value is small, the loss function is smoother.
.. code:: python
sigmas = [10, 1, 0.5]
lines = ['-', '--', '-.']
x = np.arange(-2, 2, 0.1)
d2l.set_figsize()
for l, s in zip(lines, sigmas):
y = npx.smooth_l1(x, scalar=s)
d2l.plt.plot(x.asnumpy(), y.asnumpy(), l, label='sigma=%.1f' % s)
d2l.plt.legend();
.. figure:: output_ssd_265054_43_0.svg
In the experiment, we used cross-entropy loss for category prediction.
Now, assume that the prediction probability of the actual category
:math:`j` is :math:`p_j` and the cross-entropy loss is
:math:`-\log p_j`. We can also use the focal loss
:cite:`Lin.Goyal.Girshick.ea.2017`. Given the positive
hyper-parameters :math:`\gamma` and :math:`\alpha`, this loss is defined
as:
.. math:: - \alpha (1-p_j)^{\gamma} \log p_j.
As you can see, by increasing :math:`\gamma`, we can effectively reduce
the loss when the probability of predicting the correct category is
high.
.. code:: python
def focal_loss(gamma, x):
return -(1 - x) ** gamma * np.log(x)
x = np.arange(0.01, 1, 0.01)
for l, gamma in zip(lines, [0, 1, 5]):
y = d2l.plt.plot(x.asnumpy(), focal_loss(gamma, x).asnumpy(), l,
label='gamma=%.1f' % gamma)
d2l.plt.legend();
.. figure:: output_ssd_265054_45_0.svg
Training and Prediction
~~~~~~~~~~~~~~~~~~~~~~~
2. When an object is relatively large compared to the image, the model
normally adopts a larger input image size.
3. This generally produces a large number of negative anchor boxes when
labeling anchor box categories. We can sample the negative anchor
boxes to better balance the data categories. To do this, we can set
the ``MultiBoxTarget`` function’s ``negative_mining_ratio``
parameter.
4. Assign hyper-parameters with different weights to the anchor box
category loss and positive anchor box offset loss in the loss
function.
5. Refer to the SSD paper. What methods can be used to evaluate the
precision of object detection models
:cite:`Liu.Anguelov.Erhan.ea.2016`?
`Discussions `__
-------------------------------------------------
|image0|
.. |image0| image:: ../img/qr_ssd.svg