5.1. Layers and Blocks
When we first started talking about neural networks, we introduced linear models with a single output. Here, the entire model consists of just a single neuron. By itself, a single neuron takes some set of inputs, generates a corresponding (scalar) output, and has a set of associated parameters that can be updated to optimize some objective function of interest. Then, once we started thinking about networks with multiple outputs, we leveraged vectorized arithmetic, showing how we could use linear algebra to efficiently express an entire layer of neurons. Layers too expect some inputs, generate corresponding outputs, and are described by a set of tunable parameters.
When we worked through softmax regression, a single layer was itself the model. However, when we subsequently introduced multilayer perceptrons, we developed models consisting of multiple layers. One interesting property of multilayer neural networks is that the entire model and its constituent layers share the same basic structure. The model takes the true inputs (as stated in the problem formulation), outputs predictions of the true outputs, and possesses parameters (the combined set of all parameters from all layers). Likewise, any individual constituent layer in a multilayer perceptron ingests inputs (supplied by the previous layer), generates outputs (which form the inputs to the subsequent layer), and possesses a set of tunable parameters that are updated with respect to the ultimate objective (using the signal that flows backwards through the subsequent layer).
While you might think that neurons, layers, and models give us enough abstractions to go about our business, it turns out that we will often want to express our model in terms of components that are larger than an individual layer. For example, when designing models like ResNet-152, which possess hundreds (152, thus the name) of layers, implementing the network one layer at a time can grow tedious. Moreover, this concern is not just hypothetical: such deep networks dominate numerous application areas, especially when training data is abundant. For example, the ResNet architecture mentioned above won the 2015 ImageNet and COCO computer vision competitions for both recognition and detection [He et al., 2016a]. Deep networks with many layers arranged into components with various repeating patterns are now ubiquitous in other domains, including natural language processing and speech.
To facilitate the implementation of networks consisting of components of arbitrary complexity, we introduce a new flexible concept: a neural network block. A block could describe a single neuron, a high-dimensional layer, or an arbitrarily-complex component consisting of multiple layers. From a software development standpoint, a Block is a class. Any subclass of Block must define a method called forward that transforms its input into output, and must store any necessary parameters. Note that some Blocks do not require any parameters at all! Finally, a Block must possess a backward method, for purposes of calculating gradients. Fortunately, due to some behind-the-scenes magic supplied by the autograd package (introduced in Section 2), when defining our own Block we typically need only worry about its parameters and the forward function.
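For instance, here is a minimal sketch of a Block with no parameters at all; it only implements forward (the class name CenteredLayer is purely illustrative and is not part of Gluon):

from mxnet.gluon import nn

class CenteredLayer(nn.Block):
    # A parameter-free Block: forward simply subtracts the mean from its input
    def forward(self, x):
        return x - x.mean()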
One benefit of working with the Block abstraction is that Blocks can be combined into larger artifacts, often recursively, e.g., as illustrated in Fig. 5.1.1.
Fig. 5.1.1 Multiple layers are combined into blocks
By defining code to generate Blocks of arbitrary complexity on demand, we can write surprisingly compact code and still implement complex neural networks.
To begin, we revisit the Blocks that played a role in our implementation of the multilayer perceptron (Section 4.3). The following code generates a network with one fully-connected hidden layer containing 256 units followed by a ReLU activation, and then another fully-connected layer consisting of 10 units (with no activation function). Because there are no more layers, this last 10-unit layer is regarded as the output layer and its outputs are also the model’s output.
from mxnet import np, npx
from mxnet.gluon import nn

npx.set_np()

# A random minibatch of 2 examples with 20 features each
x = np.random.uniform(size=(2, 20))

net = nn.Sequential()
net.add(nn.Dense(256, activation='relu'))  # Fully-connected hidden layer with 256 units
net.add(nn.Dense(10))                      # Fully-connected output layer with 10 units
net.initialize()
net(x)
array([[ 0.06240272, -0.03268593, 0.02582653, 0.02254182, -0.03728798,
-0.04253786, 0.00540613, -0.01364186, -0.09915452, -0.02272738],
[ 0.02816677, -0.03341204, 0.03565666, 0.02506382, -0.04136416,
-0.04941845, 0.01738528, 0.01081961, -0.09932579, -0.01176298]])
In this example, as in previous chapters, our model consists of an object returned by the nn.Sequential constructor. After instantiating an nn.Sequential and assigning it to the net variable, we repeatedly called its add method, appending layers in the order that they should be executed. We suspect that you might have already understood more or less what was going on here the first time you saw this code. You may even have understood it well enough to modify the code and design your own networks. However, the details regarding what exactly happens inside nn.Sequential have remained mysterious so far.
In short, nn.Sequential just defines a special kind of Block. Specifically, an nn.Sequential maintains a list of constituent Blocks, stored in a particular order. You might think of nn.Sequential as your first meta-Block. The add method simply facilitates the addition of each successive Block to the list. Note that each of our layers is an instance of the Dense class, which is itself a subclass of Block. The forward function is also remarkably simple: it chains the Blocks in the list together, passing the output of each as the input to the next.
Note that until now, we have been invoking our models via the construction net(x) to obtain their outputs. This is actually just shorthand for net.forward(x), a slick Python trick achieved via the Block class's __call__ function.
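The idea behind this trick can be sketched in a few lines of plain Python (this is a simplification for illustration only, not Gluon's actual __call__, which performs additional bookkeeping):

class SketchBlock:
    # Delegating __call__ to forward is what lets us write block(x)
    # instead of block.forward(x)
    def forward(self, x):
        return x

    def __call__(self, *args):
        return self.forward(*args)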
Before we dive into implementing our own custom Block, we briefly summarize the basic duties that every Block must perform:

1. Ingest input data as arguments to its forward function.
2. Generate an output via the value returned by its forward function. Note that the output may have a different shape from the input. For example, the first Dense layer in our model above ingests an input of arbitrary dimension but returns an output of dimension 256.
3. Calculate the gradient of its output with respect to its input, which can be accessed via its backward method. Typically this happens automatically.
4. Store and provide access to those parameters necessary to execute the forward computation.
5. Initialize these parameters as needed.
5.1.1. A Custom Block
Perhaps the easiest way to develop intuition about how nn.Block
works is to just dive right in and implement one ourselves. In the
following snippet, instead of relying on nn.Sequential
, we just code
up a Block from scratch that implements a multilayer perceptron with one
hidden layer, 256 hidden nodes, and 10 outputs.
Our MLP
class below inherits the Block
class. While we rely on
some predefined methods in the parent class, we need to supply our own
__init__
and forward
functions to uniquely define the behavior
of our model.
from mxnet.gluon import nn

class MLP(nn.Block):
    # Declare a layer with model parameters. Here, we declare two
    # fully-connected layers
    def __init__(self, **kwargs):
        # Call the constructor of the MLP parent class Block to perform the
        # necessary initialization. In this way, other function parameters can
        # also be specified when constructing an instance, such as the model
        # parameter, params, described in the following sections
        super(MLP, self).__init__(**kwargs)
        self.hidden = nn.Dense(256, activation='relu')  # Hidden layer
        self.output = nn.Dense(10)  # Output layer

    # Define the forward computation of the model, that is, how to return the
    # required model output based on the input x
    def forward(self, x):
        return self.output(self.hidden(x))
This code may be easiest to understand by working backwards from forward. Note that the forward method takes as input x. It first evaluates self.hidden(x) to produce the hidden representation, then passes this output as the input to the output layer self.output(...).
The constituent layers of each MLP must be instance-level variables. After all, if we instantiated two such models net1 and net2 and trained them on different data, we would expect them to represent two different learned models.
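As a quick illustrative check (the output values depend on the random initialization), two separately constructed MLPs hold their own parameters and therefore generally map the same input to different outputs:

net1, net2 = MLP(), MLP()
net1.initialize()  # each model gets its own, independently initialized parameters
net2.initialize()
net1(x), net2(x)   # the two outputs will typically differ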
The __init__ method is the most natural place to instantiate the layers that we subsequently invoke on each call to the forward method. Note that before getting on with the interesting parts, our customized __init__ method must invoke the parent class's __init__ method, super(MLP, self).__init__(**kwargs), to save us from reimplementing boilerplate code applicable to most Blocks. Then, all that is left is to instantiate our two Dense layers, assigning them to self.hidden and self.output, respectively. Again note that when dealing with standard functionality like this, we do not have to worry about backpropagation, since the backward method is generated for us automatically. The same goes for the initialize method. Let's try this out:
net = MLP()
net.initialize()
net(x)
array([[-0.03989594, -0.1041471 , 0.06799038, 0.05245074, 0.02526059,
-0.00640342, 0.04182098, -0.01665319, -0.02067346, -0.07863817],
[-0.03612847, -0.07210436, 0.09159479, 0.07890771, 0.02494172,
-0.01028665, 0.01732428, -0.02843242, 0.03772651, -0.06671704]])
As we argued earlier, the primary virtue of the Block abstraction is its versatility. We can subclass Block to create layers (such as the Dense class provided by Gluon), entire models (such as the MLP class implemented above), or various components of intermediate complexity, a pattern that we will lean on heavily throughout the next chapters on convolutional neural networks.
5.1.2. The Sequential Block
As we described earlier, the Sequential class itself is also just a subclass of Block, designed specifically for daisy-chaining other Blocks together. All we need to do to implement our own MySequential block is to define two convenience methods:

1. An add method for appending Blocks one by one to a list.
2. A forward method to pass inputs through the chain of Blocks (in the order of addition).
The following MySequential
class delivers the same functionality as
Gluon’s default Sequential class:
class MySequential(nn.Block):
    def __init__(self, **kwargs):
        super(MySequential, self).__init__(**kwargs)

    def add(self, block):
        # Here, block is an instance of a Block subclass, and we assume it has
        # a unique name. We save it in the member variable _children of the
        # Block class, and its type is OrderedDict. When the MySequential
        # instance calls the initialize function, the system automatically
        # initializes all members of _children
        self._children[block.name] = block

    def forward(self, x):
        # OrderedDict guarantees that members will be traversed in the order
        # they were added
        for block in self._children.values():
            x = block(x)
        return x
At its core is the add
method. It adds any block to the ordered
dictionary of children. These are then executed in sequence when forward
propagation is invoked. Let’s see what the MLP looks like now.
net = MySequential()
net.add(nn.Dense(256, activation='relu'))
net.add(nn.Dense(10))
net.initialize()
net(x)
array([[-0.07645682, -0.01130233, 0.04952145, -0.04651389, -0.04131573,
-0.05884133, -0.0621381 , 0.01311472, -0.01379425, -0.02514282],
[-0.05124625, 0.00711231, -0.00155935, -0.07555379, -0.06675334,
-0.01762914, 0.00589084, 0.01447191, -0.04330775, 0.03317726]])
Indeed, it can be observed that the use of the MySequential
class is
no different from the use of the Sequential class described in
Section 4.3.
5.1.3. Blocks with Code
Although the Sequential class can make model construction easier, and you do not need to define the forward method, directly inheriting the Block class can greatly expand the flexibility of model construction. In particular, we will use Python's control flow within the forward method. While we are at it, we need to introduce another concept, that of the constant parameter. These are parameters that are not updated during backpropagation. This sounds very abstract, but here is what is really going on. Assume that we have some function

\[f(\mathbf{x}, \mathbf{w}) = 3 \cdot \mathbf{w}^\top \mathbf{x}.\]

In this case 3 is a constant parameter. We could change 3 to something else, say \(c\), via

\[f(\mathbf{x}, \mathbf{w}) = c \cdot \mathbf{w}^\top \mathbf{x}.\]

Nothing has really changed, except that we can adjust the value of \(c\). It is still a constant as far as \(\mathbf{w}\) and \(\mathbf{x}\) are concerned. However, since Gluon does not know about this beforehand, it is worthwhile to give it a hand (this makes the code go faster, too, since we are not sending the Gluon engine on a wild goose chase after a parameter that does not change).
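To make the formula concrete, here is a tiny sketch (the names w_vec and x_vec are made up for this illustration) that evaluates \(f\) for one fixed value of \(c\):

c = 3.0
w_vec = np.random.uniform(size=(20,))
x_vec = np.random.uniform(size=(20,))
f = c * np.dot(w_vec, x_vec)  # c scales the result but is never updated by training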
The get_constant method is what Gluon provides to mark such a parameter as constant. Let's see what this looks like in practice.
class FancyMLP(nn.Block):
    def __init__(self, **kwargs):
        super(FancyMLP, self).__init__(**kwargs)
        # Random weight parameters created with get_constant are not updated
        # during training (i.e., they are constant parameters)
        self.rand_weight = self.params.get_constant(
            'rand_weight', np.random.uniform(size=(20, 20)))
        self.dense = nn.Dense(20, activation='relu')

    def forward(self, x):
        x = self.dense(x)
        # Use the constant parameters created above, as well as the relu and
        # dot functions
        x = npx.relu(np.dot(x, self.rand_weight.data()) + 1)
        # Reuse the fully-connected layer. This is equivalent to sharing
        # parameters between two fully-connected layers
        x = self.dense(x)
        # Control flow: here we need the result of the sum as a scalar for
        # the comparison
        while np.abs(x).sum() > 1:
            x /= 2
        if np.abs(x).sum() < 0.8:
            x *= 10
        return x.sum()
In this FancyMLP model, we used the constant weight rand_weight (note that it is not a model parameter), performed a matrix multiplication (np.dot), and reused the same Dense layer. Note that this is very different from using two dense layers with different sets of parameters. Instead, we used the same network twice. Quite often in deep networks one says that the parameters are tied when one wants to express that multiple parts of a network share the same parameters. Let's see what happens if we construct it and feed data through it.
net = FancyMLP()
net.initialize()
net(x)
array(5.2637568)
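To make the notion of tied parameters a bit more concrete, here is a separate minimal sketch (the name shared is just illustrative): adding the same Dense instance at two positions of a Sequential chain means both positions use one and the same set of weights.

shared = nn.Dense(8, activation='relu')
tied_net = nn.Sequential()
# The shared layer appears twice in the chain, but there is only one set of
# weights behind both occurrences
tied_net.add(nn.Dense(8, activation='relu'), shared, shared, nn.Dense(10))
tied_net.initialize()
tied_net(x)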
There is no reason why we could not mix and match these ways of building a network. Obviously the example below resembles more a chimera, or less charitably, a Rube Goldberg machine. That said, it combines the approaches above, building a block from individual blocks, which in turn may be blocks themselves. Furthermore, we can even combine multiple strategies inside the same forward function. To demonstrate this, here is the network.
class NestMLP(nn.Block):
    def __init__(self, **kwargs):
        super(NestMLP, self).__init__(**kwargs)
        self.net = nn.Sequential()
        self.net.add(nn.Dense(64, activation='relu'),
                     nn.Dense(32, activation='relu'))
        self.dense = nn.Dense(16, activation='relu')

    def forward(self, x):
        return self.dense(self.net(x))

chimera = nn.Sequential()
chimera.add(NestMLP(), nn.Dense(20), FancyMLP())
chimera.initialize()
chimera(x)
array(0.97720534)
5.1.4. Compilation
The avid reader is probably starting to worry about the efficiency of all this. After all, we have lots of dictionary lookups, code execution, and plenty of other Pythonic things going on in what is supposed to be a high-performance deep learning library. The problems of Python's Global Interpreter Lock are well known. In the context of deep learning, it means that a super fast GPU (or several of them) might have to wait until a puny single CPU core running Python gets a chance to tell it what to do next. This is clearly awful and there are many ways around it. The best way to speed up Python is by avoiding it altogether.
Gluon does this by allowing for hybridization (Section 12.1). With hybridization, the Python interpreter executes the block the first time it is invoked. The Gluon runtime records what is happening, and the next time around it short-circuits calls to Python. This can accelerate things considerably in some cases, but care needs to be taken with control flow. We suggest that the interested reader skip forward to the section covering hybridization and compilation after finishing the current chapter.
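As a minimal sketch of what this looks like in code (the details are deferred to the hybridization section), a HybridSequential is used just like a Sequential, and a single call to hybridize() asks Gluon to record the computation and reuse it:

hybrid_net = nn.HybridSequential()
hybrid_net.add(nn.Dense(256, activation='relu'),
               nn.Dense(10))
hybrid_net.initialize()
hybrid_net.hybridize()  # subsequent calls can bypass the Python interpreter
hybrid_net(x)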
5.1.5. Summary
- Layers are blocks.
- Many layers can constitute a block.
- Many blocks can constitute a block.
- Code can be a block.
- Blocks take care of a lot of housekeeping, such as parameter initialization, backpropagation, and related issues.
- Sequential concatenations of layers and blocks are handled by the eponymous Sequential block.
5.1.6. Exercises
1. What kind of error message will you get if you do not call the parent class's __init__ method within the __init__ function of a custom Block?
2. What kinds of problems will occur if you remove the asscalar function in the FancyMLP class?
3. What kinds of problems will occur if you change self.net, defined by the Sequential instance in the NestMLP class, to self.net = [nn.Dense(64, activation='relu'), nn.Dense(32, activation='relu')]?
4. Implement a block that takes two blocks as arguments, say net1 and net2, and returns the concatenated output of both networks in the forward pass (this is also called a parallel block).
5. Assume that you want to concatenate multiple instances of the same network. Implement a factory function that generates multiple instances of the same block and build a larger network from it.