.. _sec_ssd:

Single Shot Multibox Detection
==============================


In :numref:`sec_bbox`–:numref:`sec_object-detection-dataset`, we
introduced bounding boxes, anchor boxes, multiscale object detection,
and the dataset for object detection. Now we are ready to use such
background knowledge to design an object detection model: single shot
multibox detection (SSD) :cite:`Liu.Anguelov.Erhan.ea.2016`. This
model is simple, fast, and widely used. Although it is just one of
many object detection models, some of the design principles and
implementation details in this section also apply to other models.

Model
-----

:numref:`fig_ssd` provides an overview of the design of single-shot
multibox detection. This model mainly consists of a base network
followed by several multiscale feature map blocks. The base network is
for extracting features from the input image, so it can use a deep CNN.
For example, the original single-shot multibox detection paper adopts a
VGG network truncated before the classification layer
:cite:`Liu.Anguelov.Erhan.ea.2016`, while ResNet has also been
commonly used. We can design the base network to output larger feature
maps so that more anchor boxes are generated for detecting smaller
objects. Subsequently, each multiscale feature map block reduces
(e.g., by half) the height and width of the feature maps from the
previous block, and enables each unit of the feature maps to increase
its receptive field on the input image.

Recall the design of multiscale object detection through layerwise
representations of images by deep neural networks in
:numref:`sec_multiscale-object-detection`. Since multiscale feature
maps closer to the top of :numref:`fig_ssd` are smaller but have
larger receptive fields, they are suitable for detecting fewer but
larger objects.

In a nutshell, via its base network and several multiscale feature map
blocks, single-shot multibox detection generates a varying number of
anchor boxes with different sizes, and detects varying-size objects by
predicting the classes and offsets of these anchor boxes (and hence the
bounding boxes). It is therefore a multiscale object detection model.

.. _fig_ssd:

.. figure:: ../img/ssd.svg

   As a multiscale object detection model, single-shot multibox
   detection mainly consists of a base network followed by several
   multiscale feature map blocks.


In the following, we will describe the implementation details of
different blocks in :numref:`fig_ssd`. To begin with, we discuss how
to implement the class and bounding box prediction.

Class Prediction Layer
~~~~~~~~~~~~~~~~~~~~~~

Let the number of object classes be :math:`q`. Then anchor boxes have
:math:`q+1` classes, where class 0 is background. At some scale, suppose
that the height and width of feature maps are :math:`h` and :math:`w`,
respectively. When :math:`a` anchor boxes are generated with each
spatial position of these feature maps as their center, a total of
:math:`hwa` anchor boxes need to be classified. Doing so with fully
connected layers is often infeasible because of the large number of
parameters required. Recall how we used channels of
convolutional layers to predict classes in :numref:`sec_nin`.
Single-shot multibox detection uses the same technique to reduce model
complexity.
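
To make the cost concrete, the following back-of-the-envelope sketch
compares parameter counts for a single feature map. The sizes (64 input
channels, a :math:`32\times32` map, :math:`a=4`, :math:`q=1`) and both
helper names are illustrative assumptions, and the fully connected
variant is a hypothetical baseline rather than anything the model uses.

.. code:: python

    def fc_params(c, h, w, a, q):
        """Weights and biases of one dense layer that maps the flattened
        (c, h, w) feature maps to h * w * a * (q + 1) class scores."""
        n_in = c * h * w
        n_out = h * w * a * (q + 1)
        return n_in * n_out + n_out

    def conv_params(c, a, q, k=3):
        """Weights and biases of one k x k convolution with c input
        channels and a * (q + 1) output channels."""
        n_out = a * (q + 1)
        return k * k * c * n_out + n_out

    c, h, w, a, q = 64, 32, 32, 4, 1
    print('fully connected:', fc_params(c, h, w, a, q))
    print('convolutional:  ', conv_params(c, a, q))

Under these assumptions the dense layer needs hundreds of millions of
parameters, whereas the convolutional layer needs only a few thousand,
because its weights are shared across all spatial positions.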

Specifically, the class prediction layer uses a convolutional layer
without altering width or height of feature maps. In this way, there can
be a one-to-one correspondence between outputs and inputs at the same
spatial dimensions (width and height) of feature maps. More concretely,
channels of the output feature maps at any spatial position (:math:`x`,
:math:`y`) represent class predictions for all the anchor boxes centered
on (:math:`x`, :math:`y`) of the input feature maps. To produce valid
predictions, there must be :math:`a(q+1)` output channels, where for the
same spatial position the output channel with index :math:`i(q+1) + j`
represents the prediction of the class :math:`j`
(:math:`0 \leq j \leq q`) for the anchor box :math:`i`
(:math:`0 \leq i < a`).
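
The channel layout just described can be sketched in plain Python. The
helper names ``channel_index`` and ``anchor_and_class`` are our own and
serve only as an illustration of the indexing rule
:math:`i(q+1) + j`.

.. code:: python

    def channel_index(i, j, q):
        """Output channel that holds the score of class j for anchor box i."""
        assert 0 <= j <= q
        return i * (q + 1) + j

    def anchor_and_class(channel, q):
        """Inverse mapping: recover (anchor index i, class index j)."""
        return divmod(channel, q + 1)

    a, q = 5, 10  # 5 anchor boxes per position, 10 object classes
    print(channel_index(2, 3, q))      # anchor 2, class 3
    print(anchor_and_class(25, q))     # back to (anchor, class)

    # Every one of the a * (q + 1) channels is used exactly once.
    all_channels = sorted(channel_index(i, j, q)
                          for i in range(a) for j in range(q + 1))
    print(all_channels == list(range(a * (q + 1))))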

Below we define such a class prediction layer, specifying :math:`a` and
:math:`q` via arguments ``num_anchors`` and ``num_classes``,
respectively. This layer uses a :math:`3\times3` convolutional layer
with a padding of 1. The width and height of the input and output of
this convolutional layer remain unchanged.


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-1-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-1-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-1-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    %matplotlib inline
    import torch
    import torchvision
    from torch import nn
    from torch.nn import functional as F
    from d2l import torch as d2l
    
    
    def cls_predictor(num_inputs, num_anchors, num_classes):
        return nn.Conv2d(num_inputs, num_anchors * (num_classes + 1),
                         kernel_size=3, padding=1)


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-1-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    %matplotlib inline
    from mxnet import autograd, gluon, image, init, np, npx
    from mxnet.gluon import nn
    from d2l import mxnet as d2l
    
    npx.set_np()
    
    def cls_predictor(num_anchors, num_classes):
        return nn.Conv2D(num_anchors * (num_classes + 1), kernel_size=3,
                         padding=1)


.. raw:: html

    </div>


.. raw:: html

    </div>
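
Why a :math:`3\times3` kernel with a padding of 1 preserves height and
width follows from the standard output-size formula for a convolution
without dilation. The sketch below is a plain-Python check; the helper
name ``out_size`` is our own.

.. code:: python

    def out_size(n, kernel, padding=0, stride=1):
        """Output height (or width) of a convolution without dilation:
        floor((n + 2 * padding - kernel) / stride) + 1."""
        return (n + 2 * padding - kernel) // stride + 1

    for n in (10, 20, 32):
        # A 3 x 3 kernel with padding 1 and stride 1 leaves n unchanged.
        print(n, '->', out_size(n, kernel=3, padding=1))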

Bounding Box Prediction Layer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The design of the bounding box prediction layer is similar to that of
the class prediction layer. The only difference lies in the number of
outputs for each anchor box: here we need to predict four offsets rather
than :math:`q+1` classes.


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-3-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-3-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-3-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def bbox_predictor(num_inputs, num_anchors):
        return nn.Conv2d(num_inputs, num_anchors * 4, kernel_size=3, padding=1)


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-3-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def bbox_predictor(num_anchors):
        return nn.Conv2D(num_anchors * 4, kernel_size=3, padding=1)


.. raw:: html

    </div>


.. raw:: html

    </div>

Concatenating Predictions for Multiple Scales
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As we mentioned, single-shot multibox detection uses multiscale feature
maps to generate anchor boxes and predict their classes and offsets. At
different scales, the shapes of feature maps or the numbers of anchor
boxes centered on the same unit may vary. Therefore, shapes of the
prediction outputs at different scales may vary.

In the following example, we construct feature maps at two different
scales, ``Y1`` and ``Y2``, for the same minibatch, where the height and
width of ``Y2`` are half of those of ``Y1``. Let’s take class prediction
as an example. Suppose that 5 and 3 anchor boxes are generated for every
unit in ``Y1`` and ``Y2``, respectively. Suppose further that the number
of object classes is 10. For feature maps ``Y1`` and ``Y2`` the numbers
of channels in the class prediction outputs are :math:`5\times(10+1)=55`
and :math:`3\times(10+1)=33`, respectively, where either output shape is
(batch size, number of channels, height, width).


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-5-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-5-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-5-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def forward(x, block):
        return block(x)
    
    Y1 = forward(torch.zeros((2, 8, 20, 20)), cls_predictor(8, 5, 10))
    Y2 = forward(torch.zeros((2, 16, 10, 10)), cls_predictor(16, 3, 10))
    Y1.shape, Y2.shape


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    (torch.Size([2, 55, 20, 20]), torch.Size([2, 33, 10, 10]))


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-5-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def forward(x, block):
        block.initialize()
        return block(x)
    
    Y1 = forward(np.zeros((2, 8, 20, 20)), cls_predictor(5, 10))
    Y2 = forward(np.zeros((2, 16, 10, 10)), cls_predictor(3, 10))
    Y1.shape, Y2.shape


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    ((2, 55, 20, 20), (2, 33, 10, 10))


.. raw:: html

    </div>


.. raw:: html

    </div>

As we can see, except for the batch size dimension, the other three
dimensions all have different sizes. To concatenate these two prediction
outputs for more efficient computation, we will transform these tensors
into a more consistent format.

Note that the channel dimension holds the predictions for anchor boxes
with the same center. We first move this dimension to the innermost
(last) position.
Since the batch size remains the same for different scales, we can
transform the prediction output into a two-dimensional tensor with shape
(batch size, height :math:`\times` width :math:`\times` number of
channels). Then we can concatenate such outputs at different scales
along dimension 1.


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-7-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-7-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-7-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def flatten_pred(pred):
        return torch.flatten(pred.permute(0, 2, 3, 1), start_dim=1)
    
    def concat_preds(preds):
        return torch.cat([flatten_pred(p) for p in preds], dim=1)


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-7-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def flatten_pred(pred):
        return npx.batch_flatten(pred.transpose(0, 2, 3, 1))
    
    def concat_preds(preds):
        return np.concatenate([flatten_pred(p) for p in preds], axis=1)


.. raw:: html

    </div>


.. raw:: html

    </div>

In this way, even though ``Y1`` and ``Y2`` have different sizes in
channels, heights, and widths, we can still concatenate these two
prediction outputs at two different scales for the same minibatch.


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-9-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-9-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-9-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    concat_preds([Y1, Y2]).shape


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    torch.Size([2, 25300])


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-9-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    concat_preds([Y1, Y2]).shape


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    (2, 25300)


.. raw:: html

    </div>


.. raw:: html

    </div>
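
The length 25300 can be verified by hand, and the same arithmetic shows
why the channel dimension is moved to the innermost position first. The
following sketch uses plain Python; ``flat_len`` and ``flat_index`` are
illustrative names of our own, and the index formula assumes the
(batch size, height, width, channels) ordering produced by the
permutation above.

.. code:: python

    def flat_len(h, w, num_anchors, num_classes):
        """Per-example length of a flattened class prediction."""
        return h * w * num_anchors * (num_classes + 1)

    def flat_index(y, x, i, j, w, num_anchors, num_classes):
        """Position of the score for class j of anchor box i centered on
        (y, x) once channels are innermost and the map is flattened."""
        return ((y * w + x) * num_anchors + i) * (num_classes + 1) + j

    len_y1 = flat_len(20, 20, 5, 10)
    len_y2 = flat_len(10, 10, 3, 10)
    print(len_y1, len_y2, len_y1 + len_y2)

    # The last score of Y1 lands at the last position of its flattened part.
    print(flat_index(19, 19, 4, 10, w=20, num_anchors=5, num_classes=10))

Because the class index ``j`` varies fastest, each run of
``num_classes + 1`` consecutive entries belongs to a single anchor box.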

Downsampling Block
~~~~~~~~~~~~~~~~~~

In order to detect objects at multiple scales, we define the following
downsampling block ``down_sample_blk`` that halves the height and width
of input feature maps. In fact, this block applies the design of VGG
blocks in :numref:`subsec_vgg-blocks`. More concretely, each
downsampling block consists of two :math:`3\times3` convolutional layers
with padding of 1 followed by a :math:`2\times2` max-pooling layer with
stride of 2. As we know, :math:`3\times3` convolutional layers with
padding of 1 do not change the shape of feature maps. However, the
subsequent :math:`2\times2` max-pooling reduces the height and width of
input feature maps by half. Tracing back from one output unit of this
downsampling block, the pooling window spans 2 units and each of the two
convolutions adds :math:`3-1` more; because
:math:`1\times 2+(3-1)+(3-1)=6`, each unit in the output has a
:math:`6\times6` receptive field on the input.
Therefore, the downsampling block enlarges the receptive field of each
unit in its output feature maps.


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-11-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-11-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-11-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def down_sample_blk(in_channels, out_channels):
        blk = []
        for _ in range(2):
            blk.append(nn.Conv2d(in_channels, out_channels,
                                 kernel_size=3, padding=1))
            blk.append(nn.BatchNorm2d(out_channels))
            blk.append(nn.ReLU())
            in_channels = out_channels
        blk.append(nn.MaxPool2d(2))
        return nn.Sequential(*blk)


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-11-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def down_sample_blk(num_channels):
        blk = nn.Sequential()
        for _ in range(2):
            blk.add(nn.Conv2D(num_channels, kernel_size=3, padding=1),
                    nn.BatchNorm(in_channels=num_channels),
                    nn.Activation('relu'))
        blk.add(nn.MaxPool2D(2))
        return blk


.. raw:: html

    </div>


.. raw:: html

    </div>

In the following example, our constructed downsampling block changes the
number of input channels and halves the height and width of the input
feature maps.


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-13-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-13-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-13-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    forward(torch.zeros((2, 3, 20, 20)), down_sample_blk(3, 10)).shape


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    torch.Size([2, 10, 10, 10])


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-13-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    forward(np.zeros((2, 3, 20, 20)), down_sample_blk(10)).shape


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    (2, 10, 10, 10)


.. raw:: html

    </div>


.. raw:: html

    </div>
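
The :math:`6\times6` receptive field of the downsampling block can be
reproduced with the usual layer-by-layer recurrence. This is a minimal
sketch: ``receptive_field`` is a helper of our own, padding is ignored
(it only truncates the field at image borders), and batch normalization
and ReLU are omitted because they act pointwise.

.. code:: python

    def receptive_field(layers):
        """layers: list of (kernel_size, stride), ordered from input to
        output. Returns the side length of the receptive field of one
        output unit, measured on the input."""
        rf, jump = 1, 1  # jump = distance between adjacent units, in input units
        for kernel, stride in layers:
            rf += (kernel - 1) * jump
            jump *= stride
        return rf

    # Two 3 x 3 convolutions (stride 1), then 2 x 2 max-pooling (stride 2).
    down_sample = [(3, 1), (3, 1), (2, 2)]
    print(receptive_field(down_sample))
    print(receptive_field(down_sample * 2))  # two blocks stacked

Stacking blocks grows the receptive field faster than linearly in depth,
since every pooling layer doubles the step between neighboring units.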

Base Network Block
~~~~~~~~~~~~~~~~~~

The base network block is used to extract features from input images.
For simplicity, we construct a small base network consisting of three
downsampling blocks whose numbers of output channels double from block
to block (16, 32, and 64).
Given a :math:`256\times256` input image, this base network block
outputs :math:`32 \times 32` feature maps (:math:`256/2^3=32`).


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-15-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-15-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-15-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def base_net():
        blk = []
        num_filters = [3, 16, 32, 64]
        for i in range(len(num_filters) - 1):
            blk.append(down_sample_blk(num_filters[i], num_filters[i+1]))
        return nn.Sequential(*blk)
    
    forward(torch.zeros((2, 3, 256, 256)), base_net()).shape


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    torch.Size([2, 64, 32, 32])


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-15-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def base_net():
        blk = nn.Sequential()
        for num_filters in [16, 32, 64]:
            blk.add(down_sample_blk(num_filters))
        return blk
    
    forward(np.zeros((2, 3, 256, 256)), base_net()).shape


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    (2, 64, 32, 32)


.. raw:: html

    </div>


.. raw:: html

    </div>

The Complete Model
~~~~~~~~~~~~~~~~~~

The complete single shot multibox detection model consists of five
blocks. The feature maps produced by each block are used for both (i)
generating anchor boxes and (ii) predicting classes and offsets of these
anchor boxes. Among these five blocks, the first one is the base network
block, the second to the fourth are downsampling blocks, and the last
block uses global max-pooling to reduce both the height and width to 1.
Technically, the second to the fifth blocks are the multiscale feature
map blocks in :numref:`fig_ssd`.


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-17-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-17-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-17-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def get_blk(i):
        if i == 0:
            blk = base_net()
        elif i == 1:
            blk = down_sample_blk(64, 128)
        elif i == 4:
            blk = nn.AdaptiveMaxPool2d((1,1))
        else:
            blk = down_sample_blk(128, 128)
        return blk


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-17-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def get_blk(i):
        if i == 0:
            blk = base_net()
        elif i == 4:
            blk = nn.GlobalMaxPool2D()
        else:
            blk = down_sample_blk(128)
        return blk


.. raw:: html

    </div>


.. raw:: html

    </div>

Now we define the forward propagation for each block. Unlike in image
classification tasks, outputs here include (i) CNN feature maps
``Y``, (ii) anchor boxes generated using ``Y`` at the current scale, and
(iii) classes and offsets predicted (based on ``Y``) for these anchor
boxes.


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-19-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-19-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-19-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def blk_forward(X, blk, size, ratio, cls_predictor, bbox_predictor):
        Y = blk(X)
        anchors = d2l.multibox_prior(Y, sizes=size, ratios=ratio)
        cls_preds = cls_predictor(Y)
        bbox_preds = bbox_predictor(Y)
        return (Y, anchors, cls_preds, bbox_preds)


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-19-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def blk_forward(X, blk, size, ratio, cls_predictor, bbox_predictor):
        Y = blk(X)
        anchors = d2l.multibox_prior(Y, sizes=size, ratios=ratio)
        cls_preds = cls_predictor(Y)
        bbox_preds = bbox_predictor(Y)
        return (Y, anchors, cls_preds, bbox_preds)


.. raw:: html

    </div>


.. raw:: html

    </div>

Recall that in :numref:`fig_ssd` a multiscale feature map block that
is closer to the top is for detecting larger objects; thus, it needs to
generate larger anchor boxes. In the above forward propagation, at each
multiscale feature map block we pass in a list of two scale values via
the ``sizes`` argument of the invoked ``multibox_prior`` function
(described in :numref:`sec_anchor`). In the following, the interval
between 0.2 and 1.05 is split evenly into five sections to determine the
smaller scale values at the five blocks: 0.2, 0.37, 0.54, 0.71, and
0.88. Then their larger scale values are given by
:math:`\sqrt{0.2 \times 0.37} = 0.272`,
:math:`\sqrt{0.37 \times 0.54} = 0.447`, and so on.


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-21-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-21-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-21-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    sizes = [[0.2, 0.272], [0.37, 0.447], [0.54, 0.619], [0.71, 0.79],
             [0.88, 0.961]]
    ratios = [[1, 2, 0.5]] * 5
    num_anchors = len(sizes[0]) + len(ratios[0]) - 1


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-21-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    sizes = [[0.2, 0.272], [0.37, 0.447], [0.54, 0.619], [0.71, 0.79],
             [0.88, 0.961]]
    ratios = [[1, 2, 0.5]] * 5
    num_anchors = len(sizes[0]) + len(ratios[0]) - 1


.. raw:: html

    </div>


.. raw:: html

    </div>
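
The hard-coded scale values can be reproduced from the two endpoints
0.2 and 1.05. The sketch below uses only the standard library; the
variable names are our own, and rounding to two and three decimals is
an assumption made to match the printed values.

.. code:: python

    import math

    s_min, s_max, n_blocks = 0.2, 1.05, 5
    step = (s_max - s_min) / n_blocks
    # Smaller scale of each block, plus the upper endpoint as a sixth entry.
    small = [s_min + k * step for k in range(n_blocks + 1)]

    # Larger scale of block k: geometric mean of small[k] and small[k + 1].
    computed_sizes = [
        [round(small[k], 2), round(math.sqrt(small[k] * small[k + 1]), 3)]
        for k in range(n_blocks)]
    print(computed_sizes)

    # n scales and m ratios give n + m - 1 anchor boxes per unit.
    ratios_per_block = [1, 2, 0.5]
    anchors_per_unit = len(computed_sizes[0]) + len(ratios_per_block) - 1
    print(anchors_per_unit)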

Now we can define the complete model ``TinySSD`` as follows.


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-23-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-23-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-23-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    class TinySSD(nn.Module):
        def __init__(self, num_classes, **kwargs):
            super(TinySSD, self).__init__(**kwargs)
            self.num_classes = num_classes
            idx_to_in_channels = [64, 128, 128, 128, 128]
            for i in range(5):
                # Equivalent to the assignment statement `self.blk_i = get_blk(i)`
                setattr(self, f'blk_{i}', get_blk(i))
                setattr(self, f'cls_{i}', cls_predictor(idx_to_in_channels[i],
                                                        num_anchors, num_classes))
                setattr(self, f'bbox_{i}', bbox_predictor(idx_to_in_channels[i],
                                                          num_anchors))
    
        def forward(self, X):
            anchors, cls_preds, bbox_preds = [None] * 5, [None] * 5, [None] * 5
            for i in range(5):
                # Here `getattr(self, f'blk_{i}')` accesses `self.blk_i`
                X, anchors[i], cls_preds[i], bbox_preds[i] = blk_forward(
                    X, getattr(self, f'blk_{i}'), sizes[i], ratios[i],
                    getattr(self, f'cls_{i}'), getattr(self, f'bbox_{i}'))
            anchors = torch.cat(anchors, dim=1)
            cls_preds = concat_preds(cls_preds)
            cls_preds = cls_preds.reshape(
                cls_preds.shape[0], -1, self.num_classes + 1)
            bbox_preds = concat_preds(bbox_preds)
            return anchors, cls_preds, bbox_preds


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-23-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    class TinySSD(nn.Block):
        def __init__(self, num_classes, **kwargs):
            super(TinySSD, self).__init__(**kwargs)
            self.num_classes = num_classes
            for i in range(5):
                # Equivalent to the assignment statement `self.blk_i = get_blk(i)`
                setattr(self, f'blk_{i}', get_blk(i))
                setattr(self, f'cls_{i}', cls_predictor(num_anchors, num_classes))
                setattr(self, f'bbox_{i}', bbox_predictor(num_anchors))
    
        def forward(self, X):
            anchors, cls_preds, bbox_preds = [None] * 5, [None] * 5, [None] * 5
            for i in range(5):
                # Here `getattr(self, f'blk_{i}')` accesses `self.blk_i`
                X, anchors[i], cls_preds[i], bbox_preds[i] = blk_forward(
                    X, getattr(self, f'blk_{i}'), sizes[i], ratios[i],
                    getattr(self, f'cls_{i}'), getattr(self, f'bbox_{i}'))
            anchors = np.concatenate(anchors, axis=1)
            cls_preds = concat_preds(cls_preds)
            cls_preds = cls_preds.reshape(
                cls_preds.shape[0], -1, self.num_classes + 1)
            bbox_preds = concat_preds(bbox_preds)
            return anchors, cls_preds, bbox_preds


.. raw:: html

    </div>


.. raw:: html

    </div>

We create a model instance and use it to perform forward propagation on
a minibatch of :math:`256 \times 256` images ``X``.

As shown earlier in this section, the first block outputs
:math:`32 \times 32` feature maps. Recall that the second to fourth
downsampling blocks halve the height and width and the fifth block uses
global pooling. Since 4 anchor boxes are generated for each unit along
the spatial dimensions of the feature maps, a total of
:math:`(32^2 + 16^2 + 8^2 + 4^2 + 1)\times 4 = 5444` anchor boxes are
generated for each image across the five scales.


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-25-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-25-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-25-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    net = TinySSD(num_classes=1)
    X = torch.zeros((32, 3, 256, 256))
    anchors, cls_preds, bbox_preds = net(X)
    
    print('output anchors:', anchors.shape)
    print('output class preds:', cls_preds.shape)
    print('output bbox preds:', bbox_preds.shape)


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    output anchors: torch.Size([1, 5444, 4])
    output class preds: torch.Size([32, 5444, 2])
    output bbox preds: torch.Size([32, 21776])


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-25-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    net = TinySSD(num_classes=1)
    net.initialize()
    X = np.zeros((32, 3, 256, 256))
    anchors, cls_preds, bbox_preds = net(X)
    
    print('output anchors:', anchors.shape)
    print('output class preds:', cls_preds.shape)
    print('output bbox preds:', bbox_preds.shape)


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    output anchors: (1, 5444, 4)
    output class preds: (32, 5444, 2)
    output bbox preds: (32, 21776)


.. raw:: html

    </div>


.. raw:: html

    </div>
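
The output shapes can be checked with a few lines of plain Python. This
is a sketch under the settings used in this section (a
:math:`256\times256` input and 4 anchor boxes per unit); the helper
name ``feature_map_sides`` is our own.

.. code:: python

    def feature_map_sides(image_side=256):
        """Side length of the square feature maps after each of the
        five blocks."""
        side = image_side // 2 ** 3   # base network: three halvings
        sides = [side]
        for _ in range(3):            # three downsampling blocks
            side //= 2
            sides.append(side)
        sides.append(1)               # global max-pooling
        return sides

    sides = feature_map_sides()
    anchors_per_unit = 4
    total_anchors = sum(s * s for s in sides) * anchors_per_unit
    print(sides)
    print(total_anchors)       # number of anchor boxes per image
    print(total_anchors * 4)   # four offsets per anchor box

The last number matches the second dimension of the bounding box
predictions, since each anchor box has four predicted offsets.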

Training
--------

Now we will explain how to train the single shot multibox detection
model for object detection.

Reading the Dataset and Initializing the Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To begin with, let’s read the banana detection dataset described in
:numref:`sec_object-detection-dataset`.


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-27-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-27-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-27-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    batch_size = 32
    train_iter, _ = d2l.load_data_bananas(batch_size)


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    read 1000 training examples
    read 100 validation examples


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-27-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    batch_size = 32
    train_iter, _ = d2l.load_data_bananas(batch_size)


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    read 1000 training examples
    read 100 validation examples


.. raw:: html

    </div>


.. raw:: html

    </div>

There is only one class in the banana detection dataset. After defining
the model, we need to initialize its parameters and define the
optimization algorithm.


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-29-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-29-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-29-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    device, net = d2l.try_gpu(), TinySSD(num_classes=1)
    trainer = torch.optim.SGD(net.parameters(), lr=0.2, weight_decay=5e-4)


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-29-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    device, net = d2l.try_gpu(), TinySSD(num_classes=1)
    net.initialize(init=init.Xavier(), ctx=device)
    trainer = gluon.Trainer(net.collect_params(), 'sgd',
                            {'learning_rate': 0.2, 'wd': 5e-4})


.. raw:: html

    </div>


.. raw:: html

    </div>

Defining Loss and Evaluation Functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Object detection has two types of losses. The first loss concerns the
classes of anchor boxes: its computation can simply reuse the
cross-entropy loss function that we used for image classification. The
second loss concerns the offsets of positive (non-background) anchor
boxes: this is a regression problem. For this regression problem,
however, we do not use the squared loss described in
:numref:`subsec_normal_distribution_and_squared_loss`. Instead, we use
the :math:`\ell_1` norm loss, i.e., the absolute value of the difference
between the prediction and the ground truth. The mask variable
``bbox_masks`` filters out negative anchor boxes and illegal (padded)
anchor boxes in the loss calculation. In the end, we sum up the anchor
box class loss and the anchor box offset loss to obtain the loss
function for the model.


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-31-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-31-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-31-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    cls_loss = nn.CrossEntropyLoss(reduction='none')
    bbox_loss = nn.L1Loss(reduction='none')
    
    def calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels, bbox_masks):
        batch_size, num_classes = cls_preds.shape[0], cls_preds.shape[2]
        cls = cls_loss(cls_preds.reshape(-1, num_classes),
                       cls_labels.reshape(-1)).reshape(batch_size, -1).mean(dim=1)
        bbox = bbox_loss(bbox_preds * bbox_masks,
                         bbox_labels * bbox_masks).mean(dim=1)
        return cls + bbox


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-31-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    cls_loss = gluon.loss.SoftmaxCrossEntropyLoss()
    bbox_loss = gluon.loss.L1Loss()
    
    def calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels, bbox_masks):
        cls = cls_loss(cls_preds, cls_labels)
        bbox = bbox_loss(bbox_preds * bbox_masks, bbox_labels * bbox_masks)
        return cls + bbox


.. raw:: html

    </div>


.. raw:: html

    </div>
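
As a worked illustration of how the two losses combine, here is a minimal pure-Python sketch for a single image with two anchor boxes. The function names ``cross_entropy`` and ``toy_ssd_loss`` and all the numbers are made up for this example; it mirrors the structure of ``calc_loss`` (mean class loss plus mean masked :math:`\ell_1` offset loss) but is not the implementation used in this section.

```python
import math

def cross_entropy(logits, label):
    # -log softmax(logits)[label], using the log-sum-exp trick for stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def toy_ssd_loss(cls_logits, cls_labels, bbox_preds, bbox_labels, bbox_masks):
    # Class loss: mean cross-entropy over all anchor boxes
    cls = sum(cross_entropy(z, y)
              for z, y in zip(cls_logits, cls_labels)) / len(cls_labels)
    # Offset loss: mean absolute error after masking out negative anchor boxes
    bbox = sum(abs(p * m - t * m) for p, t, m in
               zip(bbox_preds, bbox_labels, bbox_masks)) / len(bbox_preds)
    return cls + bbox

# Two anchor boxes, two classes (0: background, 1: banana)
cls_logits = [[2.0, 0.0], [0.0, 2.0]]
cls_labels = [0, 1]
# Four offsets per anchor box; only the second anchor box is positive
bbox_preds = [9.0, 9.0, 9.0, 9.0, 0.5, 0.0, -0.5, 1.0]
bbox_labels = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
bbox_masks = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
loss = toy_ssd_loss(cls_logits, cls_labels, bbox_preds, bbox_labels,
                    bbox_masks)
print(f'{loss:.4f}')
```

The large offset predictions (``9.0``) for the background anchor box contribute nothing to the loss because their mask entries are zero.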

We can use accuracy to evaluate the classification results. Because we
use the :math:`\ell_1` norm loss for the offsets, we use the *mean
absolute error* to evaluate the predicted bounding boxes. These
prediction results are obtained from the generated anchor boxes and the
predicted offsets for them.


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-33-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-33-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-33-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def cls_eval(cls_preds, cls_labels):
        # Because the class prediction results are on the final dimension,
        # `argmax` needs to specify this dimension
        return float((cls_preds.argmax(dim=-1).type(
            cls_labels.dtype) == cls_labels).sum())
    
    def bbox_eval(bbox_preds, bbox_labels, bbox_masks):
        return float((torch.abs((bbox_labels - bbox_preds) * bbox_masks)).sum())


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-33-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def cls_eval(cls_preds, cls_labels):
        # Because the class prediction results are on the final dimension,
        # `argmax` needs to specify this dimension
        return float((cls_preds.argmax(axis=-1).astype(
            cls_labels.dtype) == cls_labels).sum())
    
    def bbox_eval(bbox_preds, bbox_labels, bbox_masks):
        return float((np.abs((bbox_labels - bbox_preds) * bbox_masks)).sum())


.. raw:: html

    </div>


.. raw:: html

    </div>
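
The following toy sketch shows how the two quantities returned by these functions turn into the reported metrics. It assumes, as in the training loop of this section, that the class error is one minus the fraction of correctly classified anchor boxes and that the mean absolute error divides by the total number of offset labels (masked entries included). The helper names and numbers are hypothetical.

```python
def count_correct(cls_scores, cls_labels):
    # Predicted class = index of the largest score for each anchor box
    preds = [max(range(len(s)), key=s.__getitem__) for s in cls_scores]
    return sum(int(p == y) for p, y in zip(preds, cls_labels))

def masked_abs_error(bbox_preds, bbox_labels, bbox_masks):
    # Sum of absolute offset errors; masked entries contribute zero
    return sum(abs((t - p) * m)
               for p, t, m in zip(bbox_preds, bbox_labels, bbox_masks))

cls_scores = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]  # 3 anchor boxes, 2 classes
cls_labels = [0, 1, 1]
bbox_preds = [0.1, -0.2, 0.3, 0.0]
bbox_labels = [0.0, 0.0, 0.0, 0.0]
bbox_masks = [1.0, 1.0, 0.0, 0.0]

cls_err = 1 - count_correct(cls_scores, cls_labels) / len(cls_labels)
bbox_mae = (masked_abs_error(bbox_preds, bbox_labels, bbox_masks)
            / len(bbox_labels))
print(f'class err {cls_err:.2e}, bbox mae {bbox_mae:.2e}')
```

Here two of the three anchor boxes are classified correctly, and only the two unmasked offsets contribute to the absolute error.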

Training the Model
~~~~~~~~~~~~~~~~~~

When training the model, we need to generate multiscale anchor boxes
(``anchors``) and predict their classes (``cls_preds``) and offsets
(``bbox_preds``) in the forward propagation. Then we label the classes
(``cls_labels``) and offsets (``bbox_labels``) of such generated anchor
boxes based on the label information ``Y``. Finally, we calculate the
loss function using the predicted and labeled values of the classes and
offsets. For conciseness, evaluation on the test dataset is omitted
here.


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-35-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-35-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-35-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    num_epochs, timer = 20, d2l.Timer()
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                            legend=['class error', 'bbox mae'])
    net = net.to(device)
    for epoch in range(num_epochs):
        # No. of correct class predictions, no. of class labels,
        # sum of absolute offset errors, no. of offset labels
        metric = d2l.Accumulator(4)
        net.train()
        for features, target in train_iter:
            timer.start()
            trainer.zero_grad()
            X, Y = features.to(device), target.to(device)
            # Generate multiscale anchor boxes and predict their classes and
            # offsets
            anchors, cls_preds, bbox_preds = net(X)
            # Label the classes and offsets of these anchor boxes
            bbox_labels, bbox_masks, cls_labels = d2l.multibox_target(anchors, Y)
            # Calculate the loss function using the predicted and labeled values
            # of the classes and offsets
            l = calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels,
                          bbox_masks)
            l.mean().backward()
            trainer.step()
            metric.add(cls_eval(cls_preds, cls_labels), cls_labels.numel(),
                       bbox_eval(bbox_preds, bbox_labels, bbox_masks),
                       bbox_labels.numel())
        cls_err, bbox_mae = 1 - metric[0] / metric[1], metric[2] / metric[3]
        animator.add(epoch + 1, (cls_err, bbox_mae))
    print(f'class err {cls_err:.2e}, bbox mae {bbox_mae:.2e}')
    print(f'{len(train_iter.dataset) / timer.stop():.1f} examples/sec on '
          f'{str(device)}')


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    class err 3.27e-03, bbox mae 3.08e-03
    4279.7 examples/sec on cuda:0


.. figure:: output_ssd_739e1b_156_1.svg


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-35-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    num_epochs, timer = 20, d2l.Timer()
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                            legend=['class error', 'bbox mae'])
    for epoch in range(num_epochs):
        # No. of correct class predictions, no. of class labels,
        # sum of absolute offset errors, no. of offset labels
        metric = d2l.Accumulator(4)
        for features, target in train_iter:
            timer.start()
            X = features.as_in_ctx(device)
            Y = target.as_in_ctx(device)
            with autograd.record():
                # Generate multiscale anchor boxes and predict their classes and
                # offsets
                anchors, cls_preds, bbox_preds = net(X)
                # Label the classes and offsets of these anchor boxes
                bbox_labels, bbox_masks, cls_labels = d2l.multibox_target(anchors,
                                                                          Y)
                # Calculate the loss function using the predicted and labeled
                # values of the classes and offsets
                l = calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels,
                              bbox_masks)
            l.backward()
            trainer.step(batch_size)
            metric.add(cls_eval(cls_preds, cls_labels), cls_labels.size,
                       bbox_eval(bbox_preds, bbox_labels, bbox_masks),
                       bbox_labels.size)
        cls_err, bbox_mae = 1 - metric[0] / metric[1], metric[2] / metric[3]
        animator.add(epoch + 1, (cls_err, bbox_mae))
    print(f'class err {cls_err:.2e}, bbox mae {bbox_mae:.2e}')
    print(f'{len(train_iter._dataset) / timer.stop():.1f} examples/sec on '
          f'{str(device)}')


.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    class err 3.48e-03, bbox mae 3.78e-03
    1968.6 examples/sec on gpu(0)


.. figure:: output_ssd_739e1b_159_1.svg


.. raw:: html

    </div>


.. raw:: html

    </div>

Prediction
----------

During prediction, the goal is to detect all the objects of interest in
the image. Below we read and resize a test image, converting it to a
four-dimensional tensor, as required by convolutional layers.


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-37-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-37-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-37-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    X = torchvision.io.read_image('../img/banana.jpg').unsqueeze(0).float()
    img = X.squeeze(0).permute(1, 2, 0).long()


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-37-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    img = image.imread('../img/banana.jpg')
    feature = image.imresize(img, 256, 256).astype('float32')
    X = np.expand_dims(feature.transpose(2, 0, 1), axis=0)


.. raw:: html

    </div>


.. raw:: html

    </div>

Using the ``multibox_detection`` function below, the predicted bounding
boxes are obtained from the anchor boxes and their predicted offsets.
Then non-maximum suppression is used to remove similar predicted
bounding boxes.


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-39-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-39-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-39-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def predict(X):
        net.eval()
        anchors, cls_preds, bbox_preds = net(X.to(device))
        cls_probs = F.softmax(cls_preds, dim=2).permute(0, 2, 1)
        output = d2l.multibox_detection(cls_probs, bbox_preds, anchors)
        idx = [i for i, row in enumerate(output[0]) if row[0] != -1]
        return output[0, idx]
    
    output = predict(X)


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-39-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def predict(X):
        anchors, cls_preds, bbox_preds = net(X.as_in_ctx(device))
        cls_probs = npx.softmax(cls_preds).transpose(0, 2, 1)
        output = d2l.multibox_detection(cls_probs, bbox_preds, anchors)
        idx = [i for i, row in enumerate(output[0]) if row[0] != -1]
        return output[0, idx]
    
    output = predict(X)


.. raw:: html

    </div>


.. raw:: html

    </div>
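
To illustrate the non-maximum suppression step mentioned above, here is a minimal pure-Python sketch of the greedy procedure. It is not the ``d2l.multibox_detection`` implementation: the function names ``iou`` and ``nms``, the corner-coordinate box format, and the threshold value of 0.5 are assumptions made for this example.

```python
def iou(a, b):
    # Boxes are (x1, y1, x2, y2) corner coordinates
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    # Visit boxes from the highest to the lowest confidence
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps too much with the kept one
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0.0, 0.0, 1.0, 1.0), (0.05, 0.05, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

The second box overlaps heavily with the first, higher-scoring box and is suppressed, while the third box is disjoint from both and survives.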

Finally, we display all the predicted bounding boxes with confidence 0.9
or above as output.


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-41-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-41-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-41-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def display(img, output, threshold):
        d2l.set_figsize((5, 5))
        fig = d2l.plt.imshow(img)
        for row in output:
            score = float(row[1])
            if score < threshold:
                continue
            h, w = img.shape[:2]
            bbox = [row[2:6] * torch.tensor((w, h, w, h), device=row.device)]
            d2l.show_bboxes(fig.axes, bbox, '%.2f' % score, 'w')
    
    display(img, output.cpu(), threshold=0.9)


.. figure:: output_ssd_739e1b_183_0.svg


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-41-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def display(img, output, threshold):
        d2l.set_figsize((5, 5))
        fig = d2l.plt.imshow(img.asnumpy())
        for row in output:
            score = float(row[1])
            if score < threshold:
                continue
            h, w = img.shape[:2]
            bbox = [row[2:6] * np.array((w, h, w, h), ctx=row.ctx)]
            d2l.show_bboxes(fig.axes, bbox, '%.2f' % score, 'w')
    
    display(img, output, threshold=0.9)


.. figure:: output_ssd_739e1b_186_0.svg


.. raw:: html

    </div>


.. raw:: html

    </div>
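
The ``display`` function combines two small steps: dropping rows whose confidence is below the threshold and rescaling the box coordinates from the unit square to pixels. The sketch below isolates those steps in plain Python. It assumes each output row has the layout ``[class, score, x1, y1, x2, y2]`` with coordinates in :math:`[0, 1]`, as the indexing in ``display`` suggests; the helper name ``boxes_to_pixels`` and the sample rows are hypothetical.

```python
def boxes_to_pixels(rows, height, width, threshold):
    kept = []
    for row in rows:
        score = row[1]
        if score < threshold:
            continue  # Skip low-confidence predictions
        x1, y1, x2, y2 = row[2:6]
        # x-coordinates scale with the width, y-coordinates with the height
        kept.append((score, (x1 * width, y1 * height,
                             x2 * width, y2 * height)))
    return kept

rows = [[0, 0.95, 0.25, 0.5, 0.75, 1.0],
        [0, 0.40, 0.10, 0.1, 0.30, 0.3]]
pixel_boxes = boxes_to_pixels(rows, height=200, width=400, threshold=0.9)
print(pixel_boxes)
```

Only the first row passes the 0.9 threshold, and its box is expressed in pixel coordinates of a 400-by-200 image.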

Summary
-------

-  Single-shot multibox detection is a multiscale object detection
   model. Via its base network and several multiscale feature map
   blocks, single-shot multibox detection generates a varying number of
   anchor boxes with different sizes, and detects varying-size objects
   by predicting classes and offsets of these anchor boxes (and thus
   their bounding boxes).
-  When training the single-shot multibox detection model, the loss
   function is calculated based on the predicted and labeled values of
   the anchor box classes and offsets.

Exercises
---------

1. Can you improve the single-shot multibox detection by improving the
   loss function? For example, replace :math:`\ell_1` norm loss with
   smooth :math:`\ell_1` norm loss for the predicted offsets. This loss
   function uses a square function around zero for smoothness, which is
   controlled by the hyperparameter :math:`\sigma`:

.. math::


   f(x) =
       \begin{cases}
       (\sigma x)^2/2,& \textrm{if }|x| < 1/\sigma^2\\
       |x|-0.5/\sigma^2,& \textrm{otherwise}
       \end{cases}

When :math:`\sigma` is very large, this loss is similar to the
:math:`\ell_1` norm loss. When its value is smaller, the loss function
is smoother.


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-43-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-43-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-43-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def smooth_l1(data, scalar):
        # Quadratic for |x| < 1 / sigma^2, linear otherwise (elementwise)
        abs_data = torch.abs(data)
        return torch.where(abs_data < 1 / scalar ** 2,
                           (scalar * data) ** 2 / 2,
                           abs_data - 0.5 / scalar ** 2)
    
    sigmas = [10, 1, 0.5]
    lines = ['-', '--', '-.']
    x = torch.arange(-2, 2, 0.1)
    d2l.set_figsize()
    
    for l, s in zip(lines, sigmas):
        y = smooth_l1(x, scalar=s)
        d2l.plt.plot(x, y, l, label='sigma=%.1f' % s)
    d2l.plt.legend();


.. figure:: output_ssd_739e1b_192_0.svg


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-43-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    sigmas = [10, 1, 0.5]
    lines = ['-', '--', '-.']
    x = np.arange(-2, 2, 0.1)
    d2l.set_figsize()
    
    for l, s in zip(lines, sigmas):
        y = npx.smooth_l1(x, scalar=s)
        d2l.plt.plot(x.asnumpy(), y.asnumpy(), l, label='sigma=%.1f' % s)
    d2l.plt.legend();


.. figure:: output_ssd_739e1b_195_0.svg


.. raw:: html

    </div>


.. raw:: html

    </div>
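
As a quick sanity check of the smooth :math:`\ell_1` formula above, the two branches agree at the switching point :math:`|x| = 1/\sigma^2`:

.. math::

   \frac{(\sigma x)^2}{2}\bigg|_{|x| = 1/\sigma^2} = \frac{\sigma^2}{2\sigma^4} = \frac{1}{2\sigma^2},
   \qquad
   \left(|x| - \frac{0.5}{\sigma^2}\right)\bigg|_{|x| = 1/\sigma^2} = \frac{1}{\sigma^2} - \frac{0.5}{\sigma^2} = \frac{1}{2\sigma^2}.

The slopes match there as well, since the quadratic branch has derivative of magnitude :math:`\sigma^2 |x| = 1` at that point. Hence :math:`f` is continuous and differentiable, and as :math:`\sigma` grows, the quadratic region :math:`|x| < 1/\sigma^2` shrinks and :math:`f(x)` approaches :math:`|x|`.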

In addition, in the experiment we used cross-entropy loss for class
prediction: denoting by :math:`p_j` the predicted probability for the
ground-truth class :math:`j`, the cross-entropy loss is
:math:`-\log p_j`. We can also use the focal loss
:cite:`Lin.Goyal.Girshick.ea.2017`: given hyperparameters
:math:`\gamma > 0` and :math:`\alpha > 0`, this loss is defined as:

.. math::  - \alpha (1-p_j)^{\gamma} \log p_j.

As we can see, increasing :math:`\gamma` can effectively reduce the
relative loss for well-classified examples (e.g., :math:`p_j > 0.5`) so
the training can focus more on those difficult examples that are
misclassified.


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar code"><a href="#pytorch-45-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-45-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-45-0">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def focal_loss(gamma, x):
        return -(1 - x) ** gamma * torch.log(x)
    
    x = torch.arange(0.01, 1, 0.01)
    for l, gamma in zip(lines, [0, 1, 5]):
        y = d2l.plt.plot(x, focal_loss(gamma, x), l, label='gamma=%.1f' % gamma)
    d2l.plt.legend();


.. figure:: output_ssd_739e1b_201_0.svg


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-45-1">

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    def focal_loss(gamma, x):
        return -(1 - x) ** gamma * np.log(x)
    
    x = np.arange(0.01, 1, 0.01)
    for l, gamma in zip(lines, [0, 1, 5]):
        y = d2l.plt.plot(x.asnumpy(), focal_loss(gamma, x).asnumpy(), l,
                         label='gamma=%.1f' % gamma)
    d2l.plt.legend();


.. figure:: output_ssd_739e1b_204_0.svg


.. raw:: html

    </div>


.. raw:: html

    </div>
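
A small numeric example makes the down-weighting explicit. The sketch below evaluates the focal loss formula above in plain Python for one well-classified and one misclassified example; the choice of :math:`\gamma = 2`, :math:`\alpha = 1`, and the probabilities 0.9 and 0.1 are illustrative assumptions only.

```python
import math

def focal_loss(p, gamma, alpha=1.0):
    # -alpha * (1 - p)^gamma * log(p); gamma = 0 recovers cross-entropy
    return -alpha * (1 - p) ** gamma * math.log(p)

ratios = {}
for p in (0.9, 0.1):
    ce = focal_loss(p, gamma=0)
    fl = focal_loss(p, gamma=2)
    # The ratio equals (1 - p)^gamma, the modulating factor
    ratios[p] = fl / ce
    print(f'p={p}: cross-entropy {ce:.4f}, focal {fl:.4f}, '
          f'ratio {ratios[p]:.2f}')
```

With :math:`\gamma = 2`, the loss of the well-classified example (:math:`p_j = 0.9`) is scaled by :math:`0.1^2 = 0.01`, whereas that of the misclassified example (:math:`p_j = 0.1`) is scaled only by :math:`0.9^2 = 0.81`.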

2. Due to space limitations, we have omitted some implementation details
   of the single-shot multibox detection model in this section. Can you
   further improve the model in the following aspects?

   1. When an object is much smaller than the image, the model
      could enlarge the input image.
   2. There are typically a vast number of negative anchor boxes. To
      make the class distribution more balanced, we could downsample
      negative anchor boxes.
   3. In the loss function, assign different weight hyperparameters to
      the class loss and the offset loss.
   4. Use other methods to evaluate the object detection model, such as
      those in the single shot multibox detection paper
      :cite:`Liu.Anguelov.Erhan.ea.2016`.


.. raw:: html

    <div class="mdl-tabs mdl-js-tabs mdl-js-ripple-effect"><div class="mdl-tabs__tab-bar text"><a href="#pytorch-47-0" onclick="tagClick('pytorch'); return false;" class="mdl-tabs__tab is-active">pytorch</a><a href="#mxnet-47-1" onclick="tagClick('mxnet'); return false;" class="mdl-tabs__tab ">mxnet</a></div>


.. raw:: html

    <div class="mdl-tabs__panel is-active" id="pytorch-47-0">

`Discussions <https://discuss.d2l.ai/t/1604>`__


.. raw:: html

    </div>


.. raw:: html

    <div class="mdl-tabs__panel " id="mxnet-47-1">

`Discussions <https://discuss.d2l.ai/t/373>`__


.. raw:: html

    </div>


.. raw:: html

    </div>