# 6.5. Building a Recurrent Neural Network from Scratch

In this section, we implement a language model from scratch. It is a character-level recurrent neural network trained on H. G. Wells’ ‘The Time Machine’. As before, we start by reading the dataset.

```
import sys
sys.path.insert(0, '..')
import d2l
import math
from mxnet import autograd, nd
from mxnet.gluon import loss as gloss
import time
(corpus_indices, char_to_idx, idx_to_char, vocab_size) = \
    d2l.load_data_time_machine()
```

## 6.5.1. One-hot Encoding

One-hot encoding provides an easy way to express characters as vectors so that a deep network can process them. In a nutshell, we map each character to a different unit vector: assume that the number of different characters in the dictionary is \(N\) (the `vocab_size`) and that each character corresponds one-to-one to an index among the successive integers from 0 to \(N-1\). If the index of a character is the integer \(i\), then we create a vector \(\mathbf{e}_i\) of all 0s with a length of \(N\) and set the element at position \(i\) to 1. This vector is the one-hot vector of the original character. The one-hot vectors with indices 0 and 2 are shown below (the length of the vector is equal to the dictionary size).

```
nd.one_hot(nd.array([0, 2]), vocab_size)
```

```
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
<NDArray 2x43 @cpu(0)>
```

The shape of the mini-batch we sample each time is (batch size, number of time steps). The following function transforms such a mini-batch into a list of matrices of shape (batch size, dictionary size) that can be fed into the network. The number of matrices equals the number of time steps. That is, the input of time step \(t\) is \(\boldsymbol{X}_t \in \mathbb{R}^{n \times d}\), where \(n\) is the batch size and \(d\) is the number of inputs, i.e. the one-hot vector length (the dictionary size).

```
# This function is saved in the d2l package for future use.
def to_onehot(X, size):
    return [nd.one_hot(x, size) for x in X.T]

X = nd.arange(10).reshape((2, 5))
inputs = to_onehot(X, vocab_size)
len(inputs), inputs[0].shape
```

```
(5, (2, 43))
```

The code above generates 5 matrices, one per time step, each containing the 2 one-hot vectors of the minibatch. Since we have a total of 43 distinct symbols in “The Time Machine”, we get 43-dimensional vectors.
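To make the transposition inside `to_onehot` concrete, here is a minimal sketch that uses plain Python lists instead of MXNet. The helper names `one_hot` and `to_onehot_lists` are made up for this illustration and are not part of `d2l`.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def to_onehot_lists(batch, size):
    """Turn a (batch_size, num_steps) grid of indices into num_steps
    matrices, each of shape (batch_size, size)."""
    num_steps = len(batch[0])
    return [[one_hot(row[t], size) for row in batch]
            for t in range(num_steps)]

# Same toy minibatch as above: 2 sequences, 5 time steps, indices 0..9.
batch = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
steps = to_onehot_lists(batch, size=43)
print(len(steps), len(steps[0]), len(steps[0][0]))  # 5 2 43
print(steps[0][1].index(1.0))  # 5: first character of the second sequence
```

Each entry of `steps` plays the role of one \(\boldsymbol{X}_t\): it groups the characters that all sequences in the minibatch see at the same time step.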

## 6.5.2. Initializing the Model Parameters

Next, we initialize the model parameters. The number of hidden units `num_hiddens` is a tunable parameter.

```
num_inputs, num_hiddens, num_outputs = vocab_size, 512, vocab_size
ctx = d2l.try_gpu()
print('Using', ctx)

# Create the parameters of the model, initialize them and attach gradients
def get_params():
    def _one(shape):
        return nd.random.normal(scale=0.01, shape=shape, ctx=ctx)

    # Hidden layer parameters
    W_xh = _one((num_inputs, num_hiddens))
    W_hh = _one((num_hiddens, num_hiddens))
    b_h = nd.zeros(num_hiddens, ctx=ctx)
    # Output layer parameters
    W_hq = _one((num_hiddens, num_outputs))
    b_q = nd.zeros(num_outputs, ctx=ctx)
    # Attach a gradient
    params = [W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        param.attach_grad()
    return params
```

```
Using gpu(0)
```

## 6.5.3. Sequence Modeling

### 6.5.3.1. RNN Model

We implement this model based on the definition of an RNN. First, we need an `init_rnn_state` function to return the hidden state at initialization. It returns a tuple containing a single NDArray filled with 0s, with shape (batch size, number of hidden units). Using tuples makes it easier to handle situations where the hidden state contains multiple NDArrays (e.g. when combining multiple layers in an RNN).

```
def init_rnn_state(batch_size, num_hiddens, ctx):
    return (nd.zeros(shape=(batch_size, num_hiddens), ctx=ctx), )
```

The following `rnn` function defines how to compute the hidden state and the output at each time step. The activation function is tanh. As described in the “Multilayer Perceptron” section, the mean value of tanh is 0 when its inputs are distributed symmetrically around 0.
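Written out as equations, one time step of the `rnn` function computes the following. This is a sketch that reuses the parameter names from `get_params` as subscripts; the symbols \(\boldsymbol{H}_t\) for the hidden state and \(\boldsymbol{O}_t\) for the output are notation assumed here, and the biases are broadcast over the batch dimension.

\[
\begin{aligned}
\boldsymbol{H}_t &= \tanh\left(\boldsymbol{X}_t \boldsymbol{W}_{xh} + \boldsymbol{H}_{t-1} \boldsymbol{W}_{hh} + \boldsymbol{b}_h\right), \\
\boldsymbol{O}_t &= \boldsymbol{H}_t \boldsymbol{W}_{hq} + \boldsymbol{b}_q.
\end{aligned}
\]

With batch size \(n\), \(d\) inputs, \(h\) hidden units and \(q\) outputs, the shapes are \(\boldsymbol{X}_t \in \mathbb{R}^{n \times d}\), \(\boldsymbol{W}_{xh} \in \mathbb{R}^{d \times h}\), \(\boldsymbol{W}_{hh} \in \mathbb{R}^{h \times h}\), \(\boldsymbol{H}_t \in \mathbb{R}^{n \times h}\), \(\boldsymbol{W}_{hq} \in \mathbb{R}^{h \times q}\) and \(\boldsymbol{O}_t \in \mathbb{R}^{n \times q}\).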

```
def rnn(inputs, state, params):
    # Both inputs and outputs are composed of num_steps matrices
    # of the shape (batch_size, vocab_size).
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    for X in inputs:
        H = nd.tanh(nd.dot(X, W_xh) + nd.dot(H, W_hh) + b_h)
        Y = nd.dot(H, W_hq) + b_q
        outputs.append(Y)
    return outputs, (H,)
```

Let’s run a simple test to check that inputs and outputs have the right shapes. In particular, we check the number of outputs, the output dimensions, and that the shape of the hidden state hasn’t changed.

```
state = init_rnn_state(X.shape[0], num_hiddens, ctx)
inputs = to_onehot(X.as_in_context(ctx), vocab_size)
params = get_params()
outputs, state_new = rnn(inputs, state, params)
len(outputs), outputs[0].shape, state_new[0].shape
```

```
(5, (2, 43), (2, 512))
```

### 6.5.3.2. Prediction Function

The following function predicts the next `num_chars` characters based on the `prefix` (a string containing several characters). This function is a bit more complicated. In it, we set the recurrent neural unit `rnn` as a function parameter, so that this function can be reused in the other recurrent neural networks described in following sections.

```
# This function is saved in the d2l package for future use.
def predict_rnn(prefix, num_chars, rnn, params, init_rnn_state,
                num_hiddens, vocab_size, ctx, idx_to_char, char_to_idx):
    state = init_rnn_state(1, num_hiddens, ctx)
    output = [char_to_idx[prefix[0]]]
    for t in range(num_chars + len(prefix) - 1):
        # The output of the previous time step is taken
        # as the input of the current time step.
        X = to_onehot(nd.array([output[-1]], ctx=ctx), vocab_size)
        # Calculate the output and update the hidden state.
        (Y, state) = rnn(X, state, params)
        # The input to the next time step is the character in
        # the prefix or the current best predicted character.
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            # This is maximum likelihood decoding, not sampling
            output.append(int(Y[0].argmax(axis=1).asscalar()))
    return ''.join([idx_to_char[i] for i in output])
```

We test the `predict_rnn` function first. We generate 10 additional characters (regardless of the prefix length) from the prefix “traveller”. Because the model parameters are still random values, the prediction is also random.

```
predict_rnn('traveller', 10, rnn, params, init_rnn_state, num_hiddens,
            vocab_size, ctx, idx_to_char, char_to_idx)
```

```
'travellerefa_)voz c'
```
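The comment in `predict_rnn` distinguishes maximum likelihood (greedy) decoding from sampling. The sketch below contrasts the two on a hypothetical score vector, using only the Python standard library. The names `softmax`, `greedy_choice` and `sample_choice` are illustrative and not part of `d2l`.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_choice(logits):
    """Index of the largest score, as argmax does in predict_rnn."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_choice(logits, rng):
    """Draw an index at random in proportion to its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]      # hypothetical scores for 3 characters
rng = random.Random(0)        # fixed seed for repeatability
print(greedy_choice(logits))  # 0
draws = [sample_choice(logits, rng) for _ in range(1000)]
print([draws.count(i) for i in range(3)])  # counts roughly follow softmax
```

Greedy decoding always returns the same continuation for a given prefix, which is one reason the generated text above tends to fall into loops. Sampling trades some likelihood for variety.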

## 6.5.4. Gradient Clipping

When solving an optimization problem, we take update steps for the weights \(\mathbf{w}\) in the general direction of the negative gradient \(\mathbf{g}_t\) on a minibatch, say \(\mathbf{w} - \eta \cdot \mathbf{g}_t\). Let’s further assume that the objective \(l\) is well behaved, i.e. it is Lipschitz continuous with constant \(L\):

\[|l(\mathbf{w}) - l(\mathbf{w}')| \leq L \|\mathbf{w} - \mathbf{w}'\|.\]

In this case we can safely assume that if we update the weight vector by \(\eta \cdot \mathbf{g}_t\) we will not observe a change by more than \(L \eta \|\mathbf{g}_t\|\). This is both a curse and a blessing. A curse since it limits the speed with which we can make progress, a blessing since it limits the extent to which things can go wrong if we move in the wrong direction.

Sometimes the gradients can be quite large and the optimization algorithm may fail to converge. We could address this by reducing the learning rate \(\eta\) or by some other higher order trick. But what if we only rarely get large gradients? In this case such an approach may appear entirely unwarranted. One alternative is to clip the gradients by projecting them back to a ball of a given radius, say \(\theta\), via

\[\mathbf{g} \leftarrow \min\left(1, \frac{\theta}{\|\mathbf{g}\|}\right) \mathbf{g}.\]

By doing so we know that the gradient norm never exceeds \(\theta\) and that the updated gradient is entirely aligned with the original direction \(\mathbf{g}\).

Back to the case at hand: optimization in RNNs. One of the issues is that the gradients in an RNN may either explode or vanish. Consider the chain of matrix products involved in backpropagation. If the largest eigenvalue of these matrices is typically larger than \(1\) in magnitude, then the product of many such matrices can be much larger than \(1\) in norm. As a result, the aggregate gradient might explode. Gradient clipping provides a quick fix. While it doesn’t entirely solve the problem, it is one of the many techniques to alleviate it.

```
# This function is saved in the d2l package for future use.
def grad_clipping(params, theta, ctx):
    norm = nd.array([0], ctx)
    for param in params:
        norm += (param.grad ** 2).sum()
    norm = norm.sqrt().asscalar()
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm
```
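As a quick check of what this kind of clipping does, the sketch below applies the same rule to plain Python lists. It is an illustration only: `global_norm` and `clip_by_global_norm` are hypothetical helpers, and unlike `grad_clipping` they return new lists rather than modifying gradients in place.

```python
import math

def global_norm(grads):
    """L2 norm taken over every entry of every gradient."""
    return math.sqrt(sum(g * g for grad in grads for g in grad))

def clip_by_global_norm(grads, theta):
    """Rescale all gradients by min(1, theta / ||g||)."""
    norm = global_norm(grads)
    scale = min(1.0, theta / norm) if norm > 0 else 1.0
    return [[g * scale for g in grad] for grad in grads]

grads = [[3.0, 0.0], [0.0, 4.0]]        # two "parameters", global norm 5
clipped = clip_by_global_norm(grads, theta=1.0)
print(global_norm(grads))               # 5.0
print(round(global_norm(clipped), 6))   # 1.0
```

Note that the norm is computed jointly over all parameters, so every gradient is scaled by the same factor and the overall update direction is preserved.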

## 6.5.5. Perplexity

One way of measuring how well a sequence model works is to check how surprising the text is. A good language model is able to predict with high accuracy what we will see next. Consider the following continuations of the phrase `It is raining`, as proposed by different language models:

1. `It is raining outside`
2. `It is raining banana tree`
3. `It is raining piouw;kcj pwepoiut`

In terms of quality, example 1 is clearly the best. The words are sensible and logically coherent. While it might not quite accurately reflect which word follows (`in San Francisco` and `in winter` would have been perfectly reasonable extensions), the model is able to capture which kind of word follows. Example 2 is considerably worse, producing a nonsensical and borderline ungrammatical extension. Nonetheless, at least the model has learned how to spell words and some degree of correlation between words. Lastly, example 3 indicates a poorly trained model that doesn’t fit the data.

One way of measuring the quality of the model is to compute \(p(w)\), i.e. the likelihood of the sequence. Unfortunately this is a number that is hard to understand and difficult to compare. After all, shorter sequences are *much* more likely than long ones, hence evaluating the model on Tolstoy’s magnum opus ‘War and Peace’ will inevitably produce a much smaller likelihood than, say, on Saint-Exupery’s novella ‘The Little Prince’. What is missing is the equivalent of an average.

Information theory comes in handy here. If we want to compress text, we can ask about estimating the next symbol given the current set of symbols. A lower bound on the number of bits is given by \(-\log_2 p(w_t|w_{t-1}, \ldots w_1)\). A good language model should allow us to predict the next word quite accurately, and thus it should allow us to spend very few bits on compressing the sequence. One way of measuring it is by the average number of bits that we need to spend per symbol, where \(|w|\) denotes the length of the sequence:

\[\frac{1}{|w|} \sum_{t} -\log_2 p(w_t|w_{t-1}, \ldots w_1) = -\frac{1}{|w|} \log_2 p(w).\]

This makes the performance on documents of different lengths comparable. For historical reasons scientists in natural language processing prefer to use a quantity called perplexity rather than bitrate. In a nutshell, it is the exponential of the above average, taken with the natural logarithm \(\log\) in place of \(\log_2\):

\[\mathrm{PPL} := \exp\left(-\frac{1}{|w|} \log p(w)\right).\]

It can be best understood as the geometric mean of the number of real choices that we have when deciding which word to pick next. Note that perplexity naturally generalizes the notion of the cross-entropy loss defined when we introduced softmax regression. That is, for a single symbol both definitions are identical bar the fact that one is the exponential of the other. Let’s look at a number of cases:

- In the best case scenario, the model always estimates the probability of the true next symbol as \(1\). In this case the perplexity of the model is \(1\).
- In the worst case scenario, the model always predicts the probability of the label category as 0. In this situation, the perplexity is infinite.
- At the baseline, the model predicts a uniform distribution over all tokens. In this case the perplexity equals the size of the dictionary `vocab_size`. In fact, if we were to store the sequence without any compression, this would be the best we could do to encode it. Hence this provides a nontrivial upper bound that any useful model must beat.
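The three cases above can be checked numerically. The following is a minimal sketch, assuming we are handed the probabilities that a model assigned to the symbols that actually occurred; the function name `perplexity` and the toy inputs are chosen for this illustration.

```python
import math

def perplexity(probs):
    """exp of the average negative log-probability that the model
    assigned to the symbols that actually occurred."""
    if any(p == 0.0 for p in probs):
        return math.inf  # one impossible symbol makes the sequence impossible
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

vocab_size = 43
print(perplexity([1.0] * 10))                       # 1.0 (perfect prediction)
print(perplexity([0.5, 0.0, 0.9]))                  # inf (a true symbol got 0)
print(round(perplexity([1 / vocab_size] * 10), 6))  # 43.0 (uniform guess)
```

The training loop in this section obtains the same quantity as `math.exp` of the mean cross-entropy loss, since that loss is the negative log-probability of the correct character.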

## 6.5.6. Training the Model

Training a sequence model proceeds quite differently from the code in previous sections. In particular, we need to take care of the following changes due to the fact that the tokens appear in order:

- We use perplexity to evaluate the model. This ensures that different tests are comparable.
- We clip the gradient before updating the model parameters. This ensures that the model doesn’t diverge even when gradients blow up at some point during the training process (effectively it reduces the stepsize automatically).
- Different sampling methods for sequential data (independent sampling and sequential partitioning) will result in differences in the initialization of hidden states. We discussed these issues in detail when we covered data processing.

### 6.5.6.1. Optimization Loop

To allow for more flexibility the call signature and the code are slightly longer. This will allow us to replace the various pieces by a Gluon implementation subsequently without the need to change the training logic.

```
# This function is saved in the d2l package for future use.
def train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                          vocab_size, ctx, corpus_indices, idx_to_char,
                          char_to_idx, is_random_iter, num_epochs, num_steps,
                          lr, clipping_theta, batch_size, pred_period,
                          pred_len, prefixes):
    if is_random_iter:
        data_iter_fn = d2l.data_iter_random
    else:
        data_iter_fn = d2l.data_iter_consecutive
    params = get_params()
    loss = gloss.SoftmaxCrossEntropyLoss()

    for epoch in range(num_epochs):
        if not is_random_iter:
            # If adjacent sampling is used, the hidden state is initialized
            # at the beginning of the epoch.
            state = init_rnn_state(batch_size, num_hiddens, ctx)
        l_sum, n, start = 0.0, 0, time.time()
        data_iter = data_iter_fn(corpus_indices, batch_size, num_steps, ctx)
        for X, Y in data_iter:
            if is_random_iter:
                # If random sampling is used, the hidden state is initialized
                # before each mini-batch update.
                state = init_rnn_state(batch_size, num_hiddens, ctx)
            else:
                # Otherwise, the detach function needs to be used to separate
                # the hidden state from the computational graph to avoid
                # backpropagation beyond the current sample. detach() returns
                # a new NDArray, so the state is rebound to the result.
                state = tuple(s.detach() for s in state)
            with autograd.record():
                inputs = to_onehot(X, vocab_size)
                # outputs is num_steps terms of shape (batch_size, vocab_size)
                (outputs, state) = rnn(inputs, state, params)
                # after stitching it is (num_steps * batch_size, vocab_size).
                outputs = nd.concat(*outputs, dim=0)
                # The shape of Y is (batch_size, num_steps), and then becomes
                # a vector with a length of batch * num_steps after
                # transposition. This gives it a one-to-one correspondence
                # with output rows.
                y = Y.T.reshape((-1,))
                # Average classification error via cross entropy loss.
                l = loss(outputs, y).mean()
            l.backward()
            grad_clipping(params, clipping_theta, ctx)  # Clip the gradient.
            d2l.sgd(params, lr, 1)
            # Since the error is the mean, no need to average gradients here.
            l_sum += l.asscalar() * y.size
            n += y.size

        if (epoch + 1) % pred_period == 0:
            print('epoch %d, perplexity %f, time %.2f sec' % (
                epoch + 1, math.exp(l_sum / n), time.time() - start))
            for prefix in prefixes:
                print(' -', predict_rnn(
                    prefix, pred_len, rnn, params, init_rnn_state,
                    num_hiddens, vocab_size, ctx, idx_to_char, char_to_idx))
```

### 6.5.6.2. Experiments with a Sequence Model

Now we can train the model. First, we need to set the model hyper-parameters. To allow for some meaningful amount of context we set the sequence length to 64. To get some intuition of how well the model works, we will have it generate 50 characters every 50 epochs of the training phase. In particular, we will see how training with random sampling and with sequential partitioning affects the performance of the model.

```
num_epochs, num_steps, batch_size, lr, clipping_theta = 500, 64, 32, 1e2, 1e-2
pred_period, pred_len, prefixes = 50, 50, ['traveller', 'time traveller']
```

Let’s use random sampling to train the model and produce some text.

```
train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                      vocab_size, ctx, corpus_indices, idx_to_char,
                      char_to_idx, True, num_epochs, num_steps, lr,
                      clipping_theta, batch_size, pred_period, pred_len,
                      prefixes)
```

```
epoch 50, perplexity 8.846857, time 0.23 sec
- traveller andimension that inger andimension that inger and
- time travellere that inger andimension that inger andimension th
epoch 100, perplexity 7.302852, time 0.22 sec
- traveller somentions of the theng the prong the prought in
- time traveller somentions of the theng the prong the prought in
epoch 150, perplexity 5.538624, time 0.22 sec
- traveller said the time traveller said the time traveller s
- time traveller said the time traveller said the time traveller s
epoch 200, perplexity 3.614842, time 0.23 sec
- traveller said but ing tha knother the presentint i a move
- time traveller the ind in a mure the trient ins of staying that
epoch 250, perplexity 2.297549, time 0.23 sec
- traveller proceeded the time traveller. ''se's and have exp
- time traveller pacenthing to exo an wal experind fourth dimensio
epoch 300, perplexity 1.752673, time 0.23 sec
- traveller here im nispledithey se tal onstre the folled mon
- time traveller here imen some pareacong now on a far and any cau
epoch 350, perplexity 1.489119, time 0.22 sec
- traveller. 'you can show black is white by argument,' said
- time traveller. 'you can show black is white by argument,' said
epoch 400, perplexity 1.416642, time 0.22 sec
- traveller (mones of that ureand he the bert and fas owes sh
- time traveller smiled. 'are yeinermokkin woled by ooknot dote th
epoch 450, perplexity 1.371737, time 0.22 sec
- traveller (for so it will be convenient to speak of him) wa
- time traveller came back, and filby's anecdote collapsed. the t
epoch 500, perplexity 1.285687, time 0.23 sec
- traveller (for so it will be convenient to speak of him) wa
- time traveller cfor so it will be convenient to speak of him) wa
```

Even though our model was rather primitive, it is nonetheless able to produce text that resembles language. In particular it learns some notion of quotations, punctuation and a basic sense of grammar, at least for frequent words. Now let’s compare this with sequential partitioning.

```
train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                      vocab_size, ctx, corpus_indices, idx_to_char,
                      char_to_idx, False, num_epochs, num_steps, lr,
                      clipping_theta, batch_size, pred_period, pred_len,
                      prefixes)
```

```
epoch 50, perplexity 8.899938, time 0.22 sec
- travellereat ing the the the the the the the the the the th
- time travellereat ing the the the the the the the the the the th
epoch 100, perplexity 6.988937, time 0.23 sec
- traveller asmention to ast all ofre the thave tou tome thav
- time traveller asmention to ast all ofre the thave tou tome thav
epoch 150, perplexity 4.438944, time 0.23 sec
- traveller space. our and the psesing dif cumensthac it ally
- time traveller space. all that it arave the the foul thas genten
epoch 200, perplexity 2.534338, time 0.23 sec
- traveller. 'you can so mine dof in aswer ald and the phoch
- time traveller coflo hap an at are areeme an and the verime sard
epoch 250, perplexity 1.542069, time 0.22 sec
- traveller to the dimentions loust of cour end whot hrac wha
- time traveller he dof count rexpstincenol by andid a fouly thang
epoch 300, perplexity 1.301869, time 0.23 sec
- traveller. 'it sovenithere are resply hewhr_ a ficulby ar
- time traveller hald is his raclld mavo. 'ack, is tlo eare trearl
epoch 350, perplexity 1.192678, time 0.22 sec
- traveller. 'it would be remarkably convenient for the hist
- time traveller held in his hand was a glittering metallic framew
epoch 400, perplexity 1.179143, time 0.23 sec
- traveller, with a slight accession of cheerfulness. 'really
- time traveller hald ro no thavk beat our a mean thoved the oldi
epoch 450, perplexity 1.132937, time 0.22 sec
- traveller. 'it would be remarkably convenient for the hitt
- time traveller sfion of shage te spalk har oumt wasceand his eri
epoch 500, perplexity 1.168394, time 0.23 sec
- traveller ffor sh il pasint 'o the ktimat ical monscone and
- time traveller (for so it will be convenient to speak of him) wa
```

The perplexity is quite a bit lower. In fact, both models reach a training perplexity close to \(1\). This means that if we were to compress the text using this simple character-based language model, we would need less than 1 bit per character to encode a symbol. In the following we will see how to improve significantly on the current model and how to make it faster and easier to implement.
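The bit count quoted above follows from taking the base-2 logarithm of the perplexity. The sketch below applies this to the epoch-500 perplexities printed in the two training runs; `bits_per_symbol` is a name introduced here for illustration.

```python
import math

def bits_per_symbol(ppl):
    """Average code length in bits per symbol implied by a perplexity."""
    return math.log2(ppl)

# Final perplexities from the two runs above, plus the uniform baseline
# over the 43-character dictionary.
for name, ppl in [('random sampling', 1.285687),
                  ('sequential partitioning', 1.168394),
                  ('uniform baseline', 43)]:
    print('%s: %.2f bits' % (name, bits_per_symbol(ppl)))
# Roughly 0.36, 0.22 and 5.43 bits respectively.
```

Any perplexity below 2 corresponds to less than 1 bit per character. Since these perplexities are measured on the training text, they say more about how well the model has memorized the book than about how it would compress unseen text.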

## 6.5.7. Summary

- Sequence models need state initialization for training.
- With sequential partitioning, you need to detach the hidden state between consecutive minibatches so that automatic differentiation does not propagate effects beyond the current sample.
- A simple RNN language model consists of an encoder, an RNN model and a decoder.
- Gradient clipping prevents gradient explosion (but it cannot fix vanishing gradients).
- Perplexity calibrates model performance across variable sequence length. It is the exponentiated average of the cross-entropy loss.
- Sequential partitioning typically leads to better models.

## 6.5.8. Problems

- Show that one-hot encoding is equivalent to picking a different embedding for each object.
- Adjust the hyperparameters to improve the perplexity.
  - How low can you go? Adjust embeddings, hidden units, learning rate, etc.
  - How well will it work on other books by H. G. Wells, e.g. The War of the Worlds?
- Run the code in this section without clipping the gradient. What happens?
- Set the `pred_period` variable to 1 to observe how the under-trained model (high perplexity) generates text. What can you learn from this?
- Change adjacent sampling so that it does not separate hidden states from the computational graph. Does the running time change? How about the accuracy?
- Replace the activation function used in this section with ReLU and repeat the experiments in this section.
- Prove that the perplexity is the inverse of the geometric mean of the conditional word probabilities.