7.4. Multiple Input and Multiple Output Channels
In Section 7.1.4 we described the multiple channels that make up each image (e.g., color images have the standard RGB channels to indicate the amount of red, green, and blue) and convolutional layers for multiple channels. Until now, however, we have simplified all of our numerical examples by working with just a single input channel and a single output channel. This allowed us to think of our inputs, convolution kernels, and outputs each as two-dimensional tensors.
When we add channels into the mix, our inputs and hidden representations
both become three-dimensional tensors. For example, each RGB input image
has shape $3 \times h \times w$. We refer to this axis, with a size of 3, as the channel dimension.
7.4.1. Multiple Input Channels
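When the input data contains multiple channels, we need to construct a
convolution kernel with the same number of input channels as the input
data, so that it can perform cross-correlation with the input data.
Assuming that the number of channels for the input data is $c_i$, the number of input channels of the convolution kernel also needs to be $c_i$. If the kernel window has shape $k_h \times k_w$, then for $c_i = 1$ we can think of the kernel as just a two-dimensional tensor of shape $k_h \times k_w$.
However, when $c_i > 1$, we need a kernel that contains a tensor of shape $k_h \times k_w$ for every input channel. Concatenating these $c_i$ tensors yields a convolution kernel of shape $c_i \times k_h \times k_w$. Since the input and the kernel each have $c_i$ channels, we can cross-correlate the two-dimensional input tensor with the two-dimensional kernel tensor for each channel and add the $c_i$ results together (summing over the channels) to obtain a two-dimensional tensor.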
Fig. 7.4.1 provides an example of a two-dimensional
cross-correlation with two input channels. The shaded portions are the
first output element as well as the input and kernel tensor elements
used for the output computation: $(1 \times 1 + 2 \times 2 + 4 \times 3 + 5 \times 4) + (0 \times 0 + 1 \times 1 + 3 \times 2 + 4 \times 3) = 56$.
Fig. 7.4.1 Cross-correlation computation with two input channels.
To make sure we really understand what is going on here, we can implement cross-correlation operations with multiple input channels ourselves. Notice that all we are doing is performing a cross-correlation operation per channel and then adding up the results.
We can construct the input tensor `X` and the kernel tensor `K` corresponding to the values in Fig. 7.4.1 to validate the output of the cross-correlation operation.
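A minimal NumPy sketch of this computation follows. The helper names `corr2d` and `corr2d_multi_in`, and the concrete values of `X` and `K`, are assumptions chosen to reproduce the result printed below; any array library with the same slicing semantics would work equally well.

```python
import numpy as np

def corr2d(X, K):
    """2D cross-correlation of a single-channel input X with a kernel K."""
    h, w = K.shape
    Y = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1), dtype=X.dtype)
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Elementwise product of the current window with the kernel, then sum.
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

def corr2d_multi_in(X, K):
    """Cross-correlate each input channel with its kernel slice, then sum over channels."""
    return sum(corr2d(x, k) for x, k in zip(X, K))

# Assumed values: two 3x3 input channels and two 2x2 kernel slices.
X = np.array([[[0., 1., 2.], [3., 4., 5.], [6., 7., 8.]],
              [[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]]])
K = np.array([[[0., 1.], [2., 3.]],
              [[1., 2.], [3., 4.]]])

Y = corr2d_multi_in(X, K)
print(Y)
```

Iterating over the first axis of `X` and `K` with `zip` pairs each input channel with the matching kernel slice, which is all the multi-input case requires.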
tensor([[ 56., 72.],
[104., 120.]])
7.4.2. Multiple Output Channels
Regardless of the number of input channels, so far we have always ended up with one output channel. However, as we discussed in Section 7.1.4, it turns out to be essential to have multiple channels at each layer. In the most popular neural network architectures, we actually increase the channel dimension as we go deeper in the network, typically downsampling to trade off spatial resolution for greater channel depth. Intuitively, you could think of each channel as responding to a different set of features. The reality is a bit more complicated than this. A naive interpretation would suggest that representations are learned independently per pixel or per channel. Instead, channels are optimized to be jointly useful. Thus, rather than a single channel acting as an edge detector, it may simply be that some direction in channel space corresponds to detecting edges.
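Denote by $c_i$ and $c_o$ the number of input and output channels, respectively, and by $k_h$ and $k_w$ the height and width of the kernel. To get an output with multiple channels, we can create a kernel tensor of shape $c_i \times k_h \times k_w$ for every output channel and concatenate them along the output channel dimension, so that the convolution kernel has shape $c_o \times c_i \times k_h \times k_w$. In cross-correlation operations, the result on each output channel is calculated from the kernel corresponding to that output channel and takes input from all channels of the input tensor.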
We implement a cross-correlation function to calculate the output of multiple channels as shown below.
We construct a trivial convolution kernel with three output channels by concatenating the kernel tensor `K` with `K+1` and `K+2`.
Below, we perform cross-correlation operations on the input tensor `X` with the kernel tensor `K`. Now the output contains three channels. The first channel matches the result obtained earlier for the input tensor `X` with the multi-input-channel, single-output-channel kernel.
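As a rough NumPy sketch of the steps just described, the multi-output computation can loop over the output-channel axis of the kernel. The name `corr2d_multi_in_out` comes from the text; the helpers `corr2d` and `corr2d_multi_in` and the values of `X` and `K` are assumptions chosen to be consistent with the output printed below.

```python
import numpy as np

def corr2d(X, K):
    """2D cross-correlation of a single-channel input X with a kernel K."""
    h, w = K.shape
    Y = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1), dtype=X.dtype)
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

def corr2d_multi_in(X, K):
    """Sum the per-channel cross-correlations over the input channels."""
    return sum(corr2d(x, k) for x, k in zip(X, K))

def corr2d_multi_in_out(X, K):
    """K has shape (c_o, c_i, k_h, k_w); compute one output map per output channel."""
    return np.stack([corr2d_multi_in(X, k) for k in K], axis=0)

# Assumed values: two 3x3 input channels and a two-channel 2x2 kernel.
X = np.array([[[0., 1., 2.], [3., 4., 5.], [6., 7., 8.]],
              [[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]]])
K = np.array([[[0., 1.], [2., 3.]],
              [[1., 2.], [3., 4.]]])

# Stack K, K+1, and K+2 along a new leading axis to get three output channels.
K = np.stack((K, K + 1, K + 2), axis=0)
print(K.shape)

Y = corr2d_multi_in_out(X, K)
print(Y)
```

Each slice `K[o]` is an ordinary multi-input kernel, so the multi-output case reduces to repeating the single-output computation once per output channel.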
tensor([[[ 56., 72.],
[104., 120.]],
[[ 76., 100.],
[148., 172.]],
[[ 96., 128.],
[192., 224.]]])
7.4.3. $1 \times 1$ Convolutional Layer
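At first, a $1 \times 1$ convolution, i.e., $k_h = k_w = 1$, does not seem to make much sense. After all, a convolution correlates adjacent pixels, and a $1 \times 1$ convolution obviously does not.
Because the minimum window is used, the $1 \times 1$ convolution loses the ability of larger convolutional layers to recognize patterns consisting of interactions among adjacent elements in the height and width dimensions. The only computation of the $1 \times 1$ convolution occurs on the channel dimension.
Fig. 7.4.2 shows the cross-correlation computation using the $1 \times 1$ convolution kernel with three input channels and two output channels. The inputs and outputs have the same height and width, and each output element is a linear combination of the input elements at the same position. One can therefore think of the $1 \times 1$ convolutional layer as a fully connected layer applied at every pixel location, transforming the $c_i$ input values into $c_o$ output values with $c_o \times c_i$ weights (plus the bias) shared across locations.
Fig. 7.4.2 The cross-correlation computation uses the $1 \times 1$ convolution kernel with three input channels and two output channels. The input and output have the same height and width.
Let’s check whether this works in practice: we implement a $1 \times 1$ convolution using a fully connected layer, i.e., a matrix multiplication. The only adjustment needed is to reshape the data before and after the matrix multiplication.
When performing $1 \times 1$ convolutions, this function is equivalent to the previously implemented cross-correlation function `corr2d_multi_in_out`. Let’s check this with some sample data.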
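A minimal NumPy sketch of this check, assuming a kernel of shape $(c_o, c_i, 1, 1)$: the name `corr2d_multi_in_out_1x1`, the random sample data, and the seed are illustrative assumptions, and `corr2d_multi_in_out` is re-implemented here in compact form so that the block is self-contained.

```python
import numpy as np

def corr2d_multi_in_out(X, K):
    """Reference implementation: explicit sliding-window cross-correlation."""
    c_o, c_i, k_h, k_w = K.shape
    h_out, w_out = X.shape[1] - k_h + 1, X.shape[2] - k_w + 1
    Y = np.zeros((c_o, h_out, w_out))
    for o in range(c_o):
        for i in range(h_out):
            for j in range(w_out):
                # Sum over all input channels and the kernel window at once.
                Y[o, i, j] = (X[:, i:i + k_h, j:j + k_w] * K[o]).sum()
    return Y

def corr2d_multi_in_out_1x1(X, K):
    """1x1 convolution written as a matrix multiplication over the channel axis."""
    c_i, h, w = X.shape
    c_o = K.shape[0]
    X = X.reshape(c_i, h * w)   # each column holds one pixel's c_i channel values
    K = K.reshape(c_o, c_i)     # the 1x1 kernel is just a c_o x c_i weight matrix
    Y = K @ X                   # same linear map applied at every pixel
    return Y.reshape(c_o, h, w)

rng = np.random.default_rng(0)  # arbitrary fixed seed for reproducibility
X = rng.normal(size=(3, 3, 3))
K = rng.normal(size=(2, 3, 1, 1))

Y1 = corr2d_multi_in_out_1x1(X, K)
Y2 = corr2d_multi_in_out(X, K)
assert np.abs(Y1 - Y2).sum() < 1e-6
```

The comparison uses a small tolerance rather than exact equality because the two implementations accumulate floating-point sums in different orders.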
7.4.4. Discussion
Channels allow us to combine the best of both worlds: MLPs, which allow for significant nonlinearities, and convolutions, which allow for localized analysis of features. In particular, channels allow the CNN to reason with multiple features, such as edge and shape detectors, at the same time. They also offer a practical trade-off between the drastic parameter reduction arising from translation invariance and locality, and the need for expressive and diverse models in computer vision.
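Note, though, that this flexibility comes at a price. Given an image of
size $h \times w$, the cost of computing a $k \times k$ convolution is $\mathcal{O}(h \cdot w \cdot k^2)$. For $c_i$ input channels and $c_o$ output channels, this increases to $\mathcal{O}(h \cdot w \cdot k^2 \cdot c_i \cdot c_o)$.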
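To make the growth in cost concrete, here is a small back-of-the-envelope sketch. The function name `conv_ops` and the example sizes are illustrative assumptions; it counts one multiplication and one addition per kernel weight per output position, treats the output as having the same height and width as the input, and ignores stride and bias.

```python
def conv_ops(h, w, k, c_i=1, c_o=1):
    """Approximate operation count for a k x k convolution with an h x w output."""
    multiply_adds = h * w * k * k * c_i * c_o
    return 2 * multiply_adds  # count multiplications and additions separately

# Hypothetical sizes: a 256 x 256 image and a 5 x 5 kernel.
single = conv_ops(256, 256, 5)
multi = conv_ops(256, 256, 5, c_i=128, c_o=128)

print(f"1 -> 1 channel:       {single:,} operations")
print(f"128 -> 128 channels:  {multi:,} operations")
print(f"ratio:                {multi // single}")
```

The ratio equals $c_i \cdot c_o$, so widening both the input and the output by the same factor grows the cost quadratically in that factor.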
7.4.5. Exercises
1. Assume that we have two convolution kernels of size $k_1$ and $k_2$, respectively (with no nonlinearity in between).
   1. Prove that the result of the operation can be expressed by a single convolution.
   2. What is the dimensionality of the equivalent single convolution?
   3. Is the converse true, i.e., can you always decompose a convolution into two smaller ones?
2. Assume an input of shape $c_i \times h \times w$ and a convolution kernel of shape $c_o \times c_i \times k_h \times k_w$, padding of $(p_h, p_w)$, and stride of $(s_h, s_w)$.
   1. What is the computational cost (multiplications and additions) for the forward propagation?
   2. What is the memory footprint?
   3. What is the memory footprint for the backward computation?
   4. What is the computational cost for the backpropagation?
3. By what factor does the number of calculations increase if we double both the number of input channels $c_i$ and the number of output channels $c_o$? What happens if we double the padding?
4. Are the variables `Y1` and `Y2` in the final example of this section exactly the same? Why?
5. Express convolutions as a matrix multiplication, even when the convolution window is not $1 \times 1$.
6. Your task is to implement fast convolutions with a $k \times k$ kernel. One of the algorithm candidates is to scan horizontally across the source, reading a $k$-wide strip and computing the $1$-wide output strip one value at a time. The alternative is to read a $k + \Delta$ wide strip and compute a $\Delta$-wide output strip. Why is the latter preferable? Is there a limit to how large you should choose $\Delta$?
7. Assume that we have a $c \times c$ matrix.
   1. How much faster is it to multiply with a block-diagonal matrix if the matrix is broken up into $b$ blocks?
   2. What is the downside of having $b$ blocks? How could you fix it, at least partly?