.. _sec_deep_rnn:
Deep Recurrent Neural Networks
==============================
Up until now, we have focused on defining networks consisting of a
sequence input, a single hidden RNN layer, and an output layer. Despite
having just one hidden layer between the input at any time step and the
corresponding output, there is a sense in which these networks are deep.
Inputs from the first time step can influence the outputs at the final
time step :math:`T` (often 100s or 1000s of steps later). These inputs
pass through :math:`T` applications of the recurrent layer before
reaching the final output. However, we often also wish to retain the
ability to express complex relationships between the inputs at a given
time step and the outputs at that same time step. Thus we often
construct RNNs that are deep not only in the time direction but also in
the input-to-output direction. This is precisely the notion of depth
that we have already encountered in our development of MLPs and deep
CNNs.
The standard method for building this sort of deep RNN is strikingly
simple: we stack the RNNs on top of each other. Given a sequence of
length :math:`T`, the first RNN produces a sequence of outputs, also of
length :math:`T`. These, in turn, constitute the inputs to the next RNN
layer. In this short section, we illustrate this design pattern and
present a simple example for how to code up such stacked RNNs. Below, in
:numref:`fig_deep_rnn`, we illustrate a deep RNN with :math:`L` hidden
layers. Each hidden state operates on a sequential input and produces a
sequential output. Moreover, any RNN cell (white box in
:numref:`fig_deep_rnn`) at each time step depends on both the same
layer’s value at the previous time step and the previous layer’s value
at the same time step.
.. _fig_deep_rnn:
.. figure:: ../img/deep-rnn.svg
Architecture of a deep RNN.
Formally, suppose that we have a minibatch input
:math:`\mathbf{X}_t \in \mathbb{R}^{n \times d}` (number of examples
:math:`=n`; number of inputs in each example :math:`=d`) at time step
:math:`t`. At the same time step, let the hidden state of the
:math:`l^\textrm{th}` hidden layer (:math:`l=1,\ldots,L`) be
:math:`\mathbf{H}_t^{(l)} \in \mathbb{R}^{n \times h}` (number of hidden
units :math:`=h`) and the output layer variable be
:math:`\mathbf{O}_t \in \mathbb{R}^{n \times q}` (number of outputs:
:math:`q`). Setting :math:`\mathbf{H}_t^{(0)} = \mathbf{X}_t`, the
hidden state of the :math:`l^\textrm{th}` hidden layer that uses the
activation function :math:`\phi_l` is calculated as follows:
.. math:: \mathbf{H}_t^{(l)} = \phi_l(\mathbf{H}_t^{(l-1)} \mathbf{W}_{\textrm{xh}}^{(l)} + \mathbf{H}_{t-1}^{(l)} \mathbf{W}_{\textrm{hh}}^{(l)} + \mathbf{b}_\textrm{h}^{(l)}),
:label: eq_deep_rnn_H
where the weights
:math:`\mathbf{W}_{\textrm{xh}}^{(l)} \in \mathbb{R}^{h \times h}` and
:math:`\mathbf{W}_{\textrm{hh}}^{(l)} \in \mathbb{R}^{h \times h}`,
together with the bias
:math:`\mathbf{b}_\textrm{h}^{(l)} \in \mathbb{R}^{1 \times h}`, are the
model parameters of the :math:`l^\textrm{th}` hidden layer.
At the end, the calculation of the output layer is only based on the
hidden state of the final :math:`L^\textrm{th}` hidden layer:
.. math:: \mathbf{O}_t = \mathbf{H}_t^{(L)} \mathbf{W}_{\textrm{hq}} + \mathbf{b}_\textrm{q},
where the weight
:math:`\mathbf{W}_{\textrm{hq}} \in \mathbb{R}^{h \times q}` and the
bias :math:`\mathbf{b}_\textrm{q} \in \mathbb{R}^{1 \times q}` are the
model parameters of the output layer.
Just as with MLPs, the number of hidden layers :math:`L` and the number
of hidden units :math:`h` are hyperparameters that we can tune. Common
RNN layer widths (:math:`h`) are in the range :math:`(64, 2056)`, and
common depths (:math:`L`) are in the range :math:`(1, 8)`. In addition,
we can easily get a deep-gated RNN by replacing the hidden state
computation in :eq:`eq_deep_rnn_H` with that from an LSTM or a GRU.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import torch
from torch import nn
from d2l import torch as d2l
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
from mxnet import np, npx
from mxnet.gluon import rnn
from d2l import mxnet as d2l
npx.set_np()
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import jax
from flax import linen as nn
from jax import numpy as jnp
from d2l import jax as d2l
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import tensorflow as tf
from d2l import tensorflow as d2l
.. raw:: html
.. raw:: html
Implementation from Scratch
---------------------------
To implement a multilayer RNN from scratch, we can treat each layer as
an ``RNNScratch`` instance with its own learnable parameters.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class StackedRNNScratch(d2l.Module):
def __init__(self, num_inputs, num_hiddens, num_layers, sigma=0.01):
super().__init__()
self.save_hyperparameters()
self.rnns = nn.Sequential(*[d2l.RNNScratch(
num_inputs if i==0 else num_hiddens, num_hiddens, sigma)
for i in range(num_layers)])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class StackedRNNScratch(d2l.Module):
def __init__(self, num_inputs, num_hiddens, num_layers, sigma=0.01):
super().__init__()
self.save_hyperparameters()
self.rnns = [d2l.RNNScratch(num_inputs if i==0 else num_hiddens,
num_hiddens, sigma)
for i in range(num_layers)]
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class StackedRNNScratch(d2l.Module):
num_inputs: int
num_hiddens: int
num_layers: int
sigma: float = 0.01
def setup(self):
self.rnns = [d2l.RNNScratch(self.num_inputs if i==0 else self.num_hiddens,
self.num_hiddens, self.sigma)
for i in range(self.num_layers)]
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class StackedRNNScratch(d2l.Module):
def __init__(self, num_inputs, num_hiddens, num_layers, sigma=0.01):
super().__init__()
self.save_hyperparameters()
self.rnns = [d2l.RNNScratch(num_inputs if i==0 else num_hiddens,
num_hiddens, sigma)
for i in range(num_layers)]
.. raw:: html
.. raw:: html
The multilayer forward computation simply performs forward computation
layer by layer.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
@d2l.add_to_class(StackedRNNScratch)
def forward(self, inputs, Hs=None):
outputs = inputs
if Hs is None: Hs = [None] * self.num_layers
for i in range(self.num_layers):
outputs, Hs[i] = self.rnns[i](outputs, Hs[i])
outputs = torch.stack(outputs, 0)
return outputs, Hs
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
@d2l.add_to_class(StackedRNNScratch)
def forward(self, inputs, Hs=None):
outputs = inputs
if Hs is None: Hs = [None] * self.num_layers
for i in range(self.num_layers):
outputs, Hs[i] = self.rnns[i](outputs, Hs[i])
outputs = np.stack(outputs, 0)
return outputs, Hs
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
@d2l.add_to_class(StackedRNNScratch)
def forward(self, inputs, Hs=None):
outputs = inputs
if Hs is None: Hs = [None] * self.num_layers
for i in range(self.num_layers):
outputs, Hs[i] = self.rnns[i](outputs, Hs[i])
outputs = jnp.stack(outputs, 0)
return outputs, Hs
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
@d2l.add_to_class(StackedRNNScratch)
def forward(self, inputs, Hs=None):
outputs = inputs
if Hs is None: Hs = [None] * self.num_layers
for i in range(self.num_layers):
outputs, Hs[i] = self.rnns[i](outputs, Hs[i])
outputs = tf.stack(outputs, 0)
return outputs, Hs
.. raw:: html
.. raw:: html
As an example, we train a deep GRU model on *The Time Machine* dataset
(same as in :numref:`sec_rnn-scratch`). To keep things simple we set
the number of layers to 2.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
data = d2l.TimeMachine(batch_size=1024, num_steps=32)
rnn_block = StackedRNNScratch(num_inputs=len(data.vocab),
num_hiddens=32, num_layers=2)
model = d2l.RNNLMScratch(rnn_block, vocab_size=len(data.vocab), lr=2)
trainer = d2l.Trainer(max_epochs=100, gradient_clip_val=1, num_gpus=1)
trainer.fit(model, data)
.. figure:: output_deep-rnn_d70a11_48_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
data = d2l.TimeMachine(batch_size=1024, num_steps=32)
rnn_block = StackedRNNScratch(num_inputs=len(data.vocab),
num_hiddens=32, num_layers=2)
model = d2l.RNNLMScratch(rnn_block, vocab_size=len(data.vocab), lr=2)
trainer = d2l.Trainer(max_epochs=100, gradient_clip_val=1, num_gpus=1)
trainer.fit(model, data)
.. figure:: output_deep-rnn_d70a11_51_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
data = d2l.TimeMachine(batch_size=1024, num_steps=32)
rnn_block = StackedRNNScratch(num_inputs=len(data.vocab),
num_hiddens=32, num_layers=2)
model = d2l.RNNLMScratch(rnn_block, vocab_size=len(data.vocab), lr=2)
trainer = d2l.Trainer(max_epochs=100, gradient_clip_val=1, num_gpus=1)
trainer.fit(model, data)
.. figure:: output_deep-rnn_d70a11_54_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
data = d2l.TimeMachine(batch_size=1024, num_steps=32)
with d2l.try_gpu():
rnn_block = StackedRNNScratch(num_inputs=len(data.vocab),
num_hiddens=32, num_layers=2)
model = d2l.RNNLMScratch(rnn_block, vocab_size=len(data.vocab), lr=2)
trainer = d2l.Trainer(max_epochs=100, gradient_clip_val=1)
trainer.fit(model, data)
.. figure:: output_deep-rnn_d70a11_57_0.svg
.. raw:: html
.. raw:: html
Concise Implementation
----------------------
.. raw:: html
.. raw:: html
Fortunately many of the logistical details required to implement
multiple layers of an RNN are readily available in high-level APIs. Our
concise implementation will use such built-in functionalities. The code
generalizes the one we used previously in :numref:`sec_gru`, letting
us specify the number of layers explicitly rather than picking the
default of only one layer.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class GRU(d2l.RNN): #@save
"""The multilayer GRU model."""
def __init__(self, num_inputs, num_hiddens, num_layers, dropout=0):
d2l.Module.__init__(self)
self.save_hyperparameters()
self.rnn = nn.GRU(num_inputs, num_hiddens, num_layers,
dropout=dropout)
.. raw:: html
.. raw:: html
Fortunately many of the logistical details required to implement
multiple layers of an RNN are readily available in high-level APIs. Our
concise implementation will use such built-in functionalities. The code
generalizes the one we used previously in :numref:`sec_gru`, letting
us specify the number of layers explicitly rather than picking the
default of only one layer.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class GRU(d2l.RNN): #@save
"""The multilayer GRU model."""
def __init__(self, num_hiddens, num_layers, dropout=0):
d2l.Module.__init__(self)
self.save_hyperparameters()
self.rnn = rnn.GRU(num_hiddens, num_layers, dropout=dropout)
.. raw:: html
.. raw:: html
Flax takes a minimalistic approach while implementing RNNs. Defining the
number of layers in an RNN or combining it with dropout is not available
out of the box. Our concise implementation will use all built-in
functionalities and add ``num_layers`` and ``dropout`` features on top.
The code generalizes the one we used previously in :numref:`sec_gru`,
allowing specification of the number of layers explicitly rather than
picking the default of a single layer.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class GRU(d2l.RNN): #@save
"""The multilayer GRU model."""
num_hiddens: int
num_layers: int
dropout: float = 0
@nn.compact
def __call__(self, X, state=None, training=False):
outputs = X
new_state = []
if state is None:
batch_size = X.shape[1]
state = [nn.GRUCell.initialize_carry(jax.random.PRNGKey(0),
(batch_size,), self.num_hiddens)] * self.num_layers
GRU = nn.scan(nn.GRUCell, variable_broadcast="params",
in_axes=0, out_axes=0, split_rngs={"params": False})
# Introduce a dropout layer after every GRU layer except last
for i in range(self.num_layers - 1):
layer_i_state, X = GRU()(state[i], outputs)
new_state.append(layer_i_state)
X = nn.Dropout(self.dropout, deterministic=not training)(X)
# Final GRU layer without dropout
out_state, X = GRU()(state[-1], X)
new_state.append(out_state)
return X, jnp.array(new_state)
.. raw:: html
.. raw:: html
Fortunately many of the logistical details required to implement
multiple layers of an RNN are readily available in high-level APIs. Our
concise implementation will use such built-in functionalities. The code
generalizes the one we used previously in :numref:`sec_gru`, letting
us specify the number of layers explicitly rather than picking the
default of only one layer.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
class GRU(d2l.RNN): #@save
"""The multilayer GRU model."""
def __init__(self, num_hiddens, num_layers, dropout=0):
d2l.Module.__init__(self)
self.save_hyperparameters()
gru_cells = [tf.keras.layers.GRUCell(num_hiddens, dropout=dropout)
for _ in range(num_layers)]
self.rnn = tf.keras.layers.RNN(gru_cells, return_sequences=True,
return_state=True, time_major=True)
def forward(self, X, state=None):
outputs, *state = self.rnn(X, state)
return outputs, state
.. raw:: html
.. raw:: html
The architectural decisions such as choosing hyperparameters are very
similar to those of :numref:`sec_gru`. We pick the same number of
inputs and outputs as we have distinct tokens, i.e., ``vocab_size``. The
number of hidden units is still 32. The only difference is that we now
select a nontrivial number of hidden layers by specifying the value of
``num_layers``.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
gru = GRU(num_inputs=len(data.vocab), num_hiddens=32, num_layers=2)
model = d2l.RNNLM(gru, vocab_size=len(data.vocab), lr=2)
trainer.fit(model, data)
.. figure:: output_deep-rnn_d70a11_82_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
model.predict('it has', 20, data.vocab, d2l.try_gpu())
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
'it has for and the time th'
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
gru = GRU(num_hiddens=32, num_layers=2)
model = d2l.RNNLM(gru, vocab_size=len(data.vocab), lr=2)
# Running takes > 1h (pending fix from MXNet)
# trainer.fit(model, data)
# model.predict('it has', 20, data.vocab, d2l.try_gpu())
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
gru = GRU(num_hiddens=32, num_layers=2)
model = d2l.RNNLM(gru, vocab_size=len(data.vocab), lr=2)
trainer.fit(model, data)
.. figure:: output_deep-rnn_d70a11_89_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
model.predict('it has', 20, data.vocab, trainer.state.params)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
'it has is the prough said '
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
gru = GRU(num_hiddens=32, num_layers=2)
with d2l.try_gpu():
model = d2l.RNNLM(gru, vocab_size=len(data.vocab), lr=2)
trainer.fit(model, data)
.. figure:: output_deep-rnn_d70a11_93_0.svg
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
model.predict('it has', 20, data.vocab)
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
'it has the time traveller '
.. raw:: html
.. raw:: html
Summary
-------
In deep RNNs, the hidden state information is passed to the next time
step of the current layer and the current time step of the next layer.
There exist many different flavors of deep RNNs, such as LSTMs, GRUs, or
vanilla RNNs. Conveniently, these models are all available as parts of
the high-level APIs of deep learning frameworks. Initialization of
models requires care. Overall, deep RNNs require considerable amount of
work (such as learning rate and clipping) to ensure proper convergence.
Exercises
---------
1. Replace the GRU by an LSTM and compare the accuracy and training
speed.
2. Increase the training data to include multiple books. How low can you
go on the perplexity scale?
3. Would you want to combine sources of different authors when modeling
text? Why is this a good idea? What could go wrong?
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html