14.5. Multiscale Object Detection¶
In Section 14.4, we generated multiple anchor boxes centered
on each pixel of an input image. Essentially these anchor boxes
represent samples of different regions of the image. However, we may end
up with too many anchor boxes to compute if they are generated for
every pixel. Think of a 561 × 728 input image: if five anchor boxes with varying shapes are generated centered on each pixel, over two million anchor boxes (561 × 728 × 5) need to be labeled and predicted on the image.
14.5.1. Multiscale Anchor Boxes¶
You may realize that it is not difficult to reduce anchor boxes on an
image. For instance, we can just uniformly sample a small portion of
pixels from the input image to generate anchor boxes centered on them.
In addition, at different scales we can generate different numbers of
anchor boxes of different sizes. Intuitively, smaller objects are more
likely to appear on an image than larger ones. As an example, 1 × 1, 1 × 2, and 2 × 2 objects can appear on a 2 × 2 image in 4, 2, and 1 possible ways, respectively. Therefore, when using smaller anchor boxes to detect smaller objects, we can sample more regions, while for larger objects we can sample fewer regions.
To demonstrate how to generate anchor boxes at multiple scales, let’s read an image. Its height and width are 561 and 728 pixels, respectively.
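As a minimal sketch (assuming the d2l package and its catdog.jpg sample image; any local image path works), the image can be read and its height and width recorded for later use as h and w:

import torch
from d2l import torch as d2l

img = d2l.plt.imread('../img/catdog.jpg')  # image path is an assumption
h, w = img.shape[:2]
h, w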
(561, 728)
Recall that in Section 7.2 we refer to the two-dimensional array output of a convolutional layer as a feature map. By defining the feature map shape, we can determine centers of uniformly sampled anchor boxes on any image.
The display_anchors function is defined below. We generate anchor boxes (anchors) on the feature map (fmap) with each unit (pixel) as the anchor box center. Since the (x, y)-axis coordinate values in the anchor boxes (anchors) have been divided by the width and height of the feature map (fmap), these values are between 0 and 1, indicating the relative positions of anchor boxes in the feature map.

Since centers of the anchor boxes (anchors) are spread over all units on the feature map (fmap), these centers must be uniformly distributed on any input image in terms of their relative spatial positions. More concretely, given the width and height of the feature map fmap_w and fmap_h, respectively, the following function will uniformly sample pixels in fmap_h rows and fmap_w columns on any input image. Centered on these uniformly sampled pixels, anchor boxes of scale s (assuming the length of the list s is 1) and different aspect ratios (ratios) will be generated.
def display_anchors(fmap_w, fmap_h, s):
    d2l.set_figsize()
    # Values on the first two dimensions do not affect the output
    fmap = torch.zeros((1, 10, fmap_h, fmap_w))
    anchors = d2l.multibox_prior(fmap, sizes=s, ratios=[1, 2, 0.5])
    bbox_scale = torch.tensor((w, h, w, h))
    d2l.show_bboxes(d2l.plt.imshow(img).axes,
                    anchors[0] * bbox_scale)
First, let’s consider detection of small objects. To make them easier to distinguish when displayed, the anchor boxes with different centers here do not overlap: the anchor box scale is set to 0.15 and the height and width of the feature map are set to 4. We can see that the centers of the anchor boxes in 4 rows and 4 columns on the image are uniformly distributed.
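This experiment amounts to a call like the following (assuming the display_anchors function defined above and the image loaded earlier):

display_anchors(fmap_w=4, fmap_h=4, s=[0.15])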
We move on to reduce the height and width of the feature map by half and use larger anchor boxes to detect larger objects. When the scale is set to 0.4, some anchor boxes will overlap with each other.
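Again as a sketch under the same assumptions:

display_anchors(fmap_w=2, fmap_h=2, s=[0.4])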
Finally, we further reduce the height and width of the feature map by half and increase the anchor box scale to 0.8. Now the center of the anchor box is the center of the image.
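Under the same assumptions:

display_anchors(fmap_w=1, fmap_h=1, s=[0.8])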
14.5.2. Multiscale Detection¶
Since we have generated multiscale anchor boxes, we will use them to detect objects of various sizes at different scales. In the following we introduce a CNN-based multiscale object detection method that we will implement in Section 14.7.
At some scale, say that we have c feature maps of shape h × w. Using the method in Section 14.5.1, we generate hw sets of anchor boxes, where each set contains the same number, say a, of anchor boxes sharing a center. For example, at the first scale in the experiments in Section 14.5.1, given ten (number of channels) 4 × 4 feature maps, we generated 16 sets of anchor boxes, where each set contains 3 anchor boxes with the same center. Next, each anchor box is labeled with the class and offset based on ground-truth bounding boxes. At the current scale, the object detection model needs to predict the classes and offsets of hw sets of anchor boxes on the input image, where different sets have different centers.
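We can sanity-check this count with multibox_prior (a sketch assuming the d2l implementation from Section 14.4): a 4 × 4 feature map with one scale and three aspect ratios yields 4 × 4 × 3 = 48 anchor boxes, i.e., 16 sets of 3 boxes sharing a center.

import torch
from d2l import torch as d2l

fmap = torch.zeros((1, 10, 4, 4))  # batch size, channels, height, width
anchors = d2l.multibox_prior(fmap, sizes=[0.15], ratios=[1, 2, 0.5])
anchors.shape  # expected: torch.Size([1, 48, 4]), i.e., 16 sets of 3 boxes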
Assume that the c feature maps here are the intermediate outputs obtained by the CNN forward propagation based on the input image. Since there are hw different spatial positions on each feature map, the same spatial position can be thought of as having c units. According to the definition of receptive field in Section 7.2, these c units at the same spatial position of the feature maps have the same receptive field on the input image: they represent the input image information in the same receptive field. Therefore, we can transform the c units of the feature maps at the same spatial position into the classes and offsets of the a anchor boxes generated at this spatial position. In essence, we use the information of the input image in a certain receptive field to predict the classes and offsets of the anchor boxes that are close to that receptive field on the input image.
When the feature maps at different layers have varying-size receptive fields on the input image, they can be used to detect objects of different sizes. For example, we can design a neural network where units of feature maps that are closer to the output layer have wider receptive fields, so they can detect larger objects from the input image.
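One way to realize this transformation, sketched below assuming PyTorch and anticipating Section 14.7, is to apply 3 × 3 convolutions whose output channels encode, at every spatial position, the class scores and offsets of the a anchor boxes centered there (the function names cls_predictor and bbox_predictor are illustrative):

from torch import nn

def cls_predictor(num_inputs, num_anchors, num_classes):
    # At each spatial position, predict (num_classes + 1) scores
    # (including background) for each of the num_anchors boxes
    return nn.Conv2d(num_inputs, num_anchors * (num_classes + 1),
                     kernel_size=3, padding=1)

def bbox_predictor(num_inputs, num_anchors):
    # At each spatial position, predict 4 offsets for each anchor box
    return nn.Conv2d(num_inputs, num_anchors * 4, kernel_size=3, padding=1)

Since the convolution preserves the height and width of the feature map, the predictions stay aligned with the spatial positions that generated the anchor boxes.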
In a nutshell, we can leverage layerwise representations of images at multiple levels by deep neural networks for multiscale object detection. We will show how this works through a concrete example in Section 14.7.
14.5.3. Summary¶
At multiple scales, we can generate anchor boxes with different sizes to detect objects with different sizes.
By defining the shape of feature maps, we can determine centers of uniformly sampled anchor boxes on any image.
We use the information of the input image in a certain receptive field to predict the classes and offsets of the anchor boxes that are close to that receptive field on the input image.
Through deep learning, we can leverage its layerwise representations of images at multiple levels for multiscale object detection.
14.5.4. Exercises¶
According to our discussions in Section 8.1, deep neural networks learn hierarchical features with increasing levels of abstraction for images. In multiscale object detection, do feature maps at different scales correspond to different levels of abstraction? Why or why not?
At the first scale (fmap_w=4, fmap_h=4) in the experiments in Section 14.5.1, generate uniformly distributed anchor boxes that may overlap.

Given a feature map variable with shape 1 × c × h × w, where c, h, and w are the number of channels, height, and width of the feature maps, respectively, how can you transform this variable into the classes and offsets of anchor boxes? What is the shape of the output?