# 8.5. Implementation of Recurrent Neural Networks from Scratch¶ Open the notebook in Colab Open the notebook in Colab Open the notebook in Colab

In this section we implement a language model introduced in Section 8 from scratch. It is based on a character-level recurrent neural network trained on H. G. Wells’ The Time Machine. As before, we start by reading the dataset first, which is introduced in Section 8.3.

%matplotlib inline
from d2l import mxnet as d2l
import math
from mxnet import autograd, np, npx, gluon
npx.set_np()

batch_size, num_steps = 32, 35

%matplotlib inline
from d2l import torch as d2l
import math
import torch
from torch import nn
from torch.nn import functional as F

batch_size, num_steps = 32, 35


## 8.5.1. One-hot Encoding¶

Remember that each token is presented as a numerical index in train_iter. Feeding these indices directly to the neural network might make it hard to learn. We often present each token as a more expressive feature vector. The easiest representation is called one-hot encoding.

In a nutshell, we map each index to a different unit vector: assume that the number of different tokens in the vocabulary is $$N$$ (the len(vocab)) and the token indices range from 0 to $$N-1$$. If the index of a token is the integer $$i$$, then we create a vector $$\mathbf{e}_i$$ of all 0s with a length of $$N$$ and set the element at position $$i$$ to 1. This vector is the one-hot vector of the original token. The one-hot vectors with indices 0 and 2 are shown below.

npx.one_hot(np.array([0, 2]), len(vocab))

array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

F.one_hot(torch.tensor([0, 2]), len(vocab))

tensor([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0]])


The shape of the minibatch we sample each time is (batch size, timestep). The one_hot function transforms such a minibatch into a 3-D tensor with the last dimension equals to the vocabulary size. We often transpose the input so that we will obtain a (timestep, batch size, vocabulary size) output that fits into a sequence model easier.

X = np.arange(10).reshape(2, 5)
npx.one_hot(X.T, 28).shape

(5, 2, 28)

X = torch.arange(10).reshape(2, 5)
F.one_hot(X.T, 28).shape

torch.Size([5, 2, 28])


## 8.5.2. Initializing the Model Parameters¶

Next, we initialize the model parameters for a RNN model. The number of hidden units num_hiddens is a tunable parameter.

def get_params(vocab_size, num_hiddens, ctx):
num_inputs = num_outputs = vocab_size

def normal(shape):
return np.random.normal(scale=0.01, size=shape, ctx=ctx)
# Hidden layer parameters
W_xh = normal((num_inputs, num_hiddens))
W_hh = normal((num_hiddens, num_hiddens))
b_h = d2l.zeros(num_hiddens, ctx=ctx)
# Output layer parameters
W_hq = normal((num_hiddens, num_outputs))
b_q = d2l.zeros(num_outputs, ctx=ctx)
params = [W_xh, W_hh, b_h, W_hq, b_q]
for param in params:
return params

def get_params(vocab_size, num_hiddens, device):
num_inputs = num_outputs = vocab_size

def normal(shape):
# Hidden layer parameters
W_xh = normal((num_inputs, num_hiddens))
W_hh = normal((num_hiddens, num_hiddens))
b_h = d2l.zeros(num_hiddens, device=device)
# Output layer parameters
W_hq = normal((num_hiddens, num_outputs))
b_q = d2l.zeros(num_outputs, device=device)
params = [W_xh, W_hh, b_h, W_hq, b_q]
for param in params:
return params


## 8.5.3. RNN Model¶

First, we need an init_rnn_state function to return the hidden state at initialization. It returns a tensor filled with 0 and with a shape of (batch size, number of hidden units). Using tuples makes it easier to handle situations where the hidden state contains multiple variables (e.g., when combining multiple layers in an RNN where each layer requires initializing).

def init_rnn_state(batch_size, num_hiddens, ctx):
return (d2l.zeros((batch_size, num_hiddens), ctx=ctx), )

def init_rnn_state(batch_size, num_hiddens, device):
return (d2l.zeros((batch_size, num_hiddens), device=device), )


The following rnn function defines how to compute the hidden state and output in a timestep. The activation function here uses the $$\tanh$$ function. As described in :numref:sec_mlp, the mean value of the $$\tanh$$ function is 0, when the elements are evenly distributed over the real numbers.

def rnn(inputs, state, params):
# Inputs shape: (num_steps, batch_size, vocab_size)
W_xh, W_hh, b_h, W_hq, b_q = params
H, = state
outputs = []
for X in inputs:
H = np.tanh(np.dot(X, W_xh) + np.dot(H, W_hh) + b_h)
Y = np.dot(H, W_hq) + b_q
outputs.append(Y)
return np.concatenate(outputs, axis=0), (H,)

def rnn(inputs, state, params):
# Inputs shape: (num_steps, batch_size, vocab_size)
W_xh, W_hh, b_h, W_hq, b_q = params
H, = state
outputs = []
for X in inputs:
H = torch.tanh(torch.mm(X, W_xh) + torch.mm(H, W_hh) + b_h)
Y = torch.mm(H, W_hq) + b_q
outputs.append(Y)


Now we have all functions defined, next we create a class to wrap these functions and store parameters.

#@save
class RNNModelScratch:
"""A RNN Model based on scratch implementations."""

def __init__(self, vocab_size, num_hiddens, ctx,
get_params, init_state, forward):
self.vocab_size, self.num_hiddens = vocab_size, num_hiddens
self.params = get_params(vocab_size, num_hiddens, ctx)
self.init_state, self.forward_fn = init_state, forward

def __call__(self, X, state):
X = npx.one_hot(X.T, self.vocab_size)
return self.forward_fn(X, state, self.params)

def begin_state(self, batch_size, ctx):
return self.init_state(batch_size, self.num_hiddens, ctx)

#@save
class RNNModelScratch:
"""A RNN Model based on scratch implementations."""

def __init__(self, vocab_size, num_hiddens, device,
get_params, init_state, forward):
self.vocab_size, self.num_hiddens = vocab_size, num_hiddens
self.params = get_params(vocab_size, num_hiddens, device)
self.init_state, self.forward_fn = init_state, forward

def __call__(self, X, state):
X = F.one_hot(X.T.long(), self.vocab_size).type(torch.float32)
return self.forward_fn(X, state, self.params)

def begin_state(self, batch_size, device):
return self.init_state(batch_size, self.num_hiddens, device)


Let us do a sanity check whether inputs and outputs have the correct dimensions, e.g., to ensure that the dimensionality of the hidden state has not changed.

num_hiddens, ctx = 512, d2l.try_gpu()
model = RNNModelScratch(len(vocab), num_hiddens, ctx, get_params,
init_rnn_state, rnn)
state = model.begin_state(X.shape, ctx)
Y, new_state = model(X.as_in_ctx(ctx), state)
Y.shape, len(new_state), new_state.shape

((10, 28), 1, (2, 512))

num_hiddens, device = 512, d2l.try_gpu()
model = RNNModelScratch(len(vocab), num_hiddens, device, get_params,
init_rnn_state, rnn)
state = model.begin_state(X.shape, device)
Y, new_state = model(X.to(device), state)
Y.shape, len(new_state), new_state.shape

(torch.Size([10, 28]), 1, torch.Size([2, 512]))


We can see that the output shape is (number steps $$\times$$ batch size, vocabulary size), while the hidden state shape remains the same, i.e., (batch size, number of hidden units).

## 8.5.4. Prediction¶

We first explain the predicting function so we can regularly check the prediction during training. This function predicts the next num_predicts characters based on the prefix (a string containing several characters). For the beginning of the sequence, we only update the hidden state. After that we begin generating new characters and emitting them.

#@save
def predict_ch8(prefix, num_predicts, model, vocab, ctx):
state = model.begin_state(batch_size=1, ctx=ctx)
outputs = [vocab[prefix]]

def get_input():
return np.array([outputs[-1]], ctx=ctx).reshape(1, 1)
for y in prefix[1:]:  # Warmup state with prefix
_, state = model(get_input(), state)
outputs.append(vocab[y])
for _ in range(num_predicts):  # Predict num_predicts steps
Y, state = model(get_input(), state)
outputs.append(int(Y.argmax(axis=1).reshape(1)))
return ''.join([vocab.idx_to_token[i] for i in outputs])

#@save
def predict_ch8(prefix, num_predicts, model, vocab, device):
state = model.begin_state(batch_size=1, device=device)
outputs = [vocab[prefix]]

def get_input():
for y in prefix[1:]:  # Warmup state with prefix
_, state = model(get_input(), state)
outputs.append(vocab[y])
for _ in range(num_predicts):  # Predict num_predicts steps
Y, state = model(get_input(), state)
outputs.append(int(Y.argmax(dim=1).reshape(1)))
return ''.join([vocab.idx_to_token[i] for i in outputs])


We test the predict_ch8 function first. Given that we did not train the network, it will generate nonsensical predictions. We initialize it with the sequence traveller and have it generate 10 additional characters.

predict_ch8('time traveller ', 10, model, vocab, ctx)

'time traveller iiiiiiiiii'

predict_ch8('time traveller ', 10, model, vocab, device)

'time traveller lbudyudyud'


For a sequence of length $$T$$, we compute the gradients over these $$T$$ timesteps in an iteration, which results in a chain of matrix-products with length $$\mathcal{O}(T)$$ during backpropagating. As mentioned in Section 4.8, it might result in numerical instability, e.g., the gradients may either explode or vanish, when $$T$$ is large. Therefore, RNN models often need extra help to stabilize the training.

Recall that when solving an optimization problem, we take update steps for the weights $$\mathbf{w}$$ in the general direction of the negative gradient $$\mathbf{g}_t$$ on a minibatch, say $$\mathbf{w} - \eta \cdot \mathbf{g}_t$$. Let us further assume that the objective is well behaved, i.e., it is Lipschitz continuous with constant $$L$$, i.e.,

(8.5.1)$|l(\mathbf{w}) - l(\mathbf{w}')| \leq L \|\mathbf{w} - \mathbf{w}'\|.$

In this case we can safely assume that if we update the weight vector by $$\eta \cdot \mathbf{g}_t$$, we will not observe a change by more than $$L \eta \|\mathbf{g}_t\|$$. This is both a curse and a blessing. A curse since it limits the speed of making progress, whereas a blessing since it limits the extent to which things can go wrong if we move in the wrong direction.

Sometimes the gradients can be quite large and the optimization algorithm may fail to converge. We could address this by reducing the learning rate $$\eta$$ or by some other higher order trick. But what if we only rarely get large gradients? In this case such an approach may appear entirely unwarranted. One alternative is to clip the gradients by projecting them back to a ball of a given radius, say $$\theta$$ via

(8.5.2)$\mathbf{g} \leftarrow \min\left(1, \frac{\theta}{\|\mathbf{g}\|}\right) \mathbf{g}.$

By doing so we know that the gradient norm never exceeds $$\theta$$ and that the updated gradient is entirely aligned with the original direction $$\mathbf{g}$$. It also has the desirable side-effect of limiting the influence any given minibatch (and within it any given sample) can exert on the weight vectors. This bestows a certain degree of robustness to the model. Gradient clipping provides a quick fix to the gradient exploding. While it does not entirely solve the problem, it is one of the many techniques to alleviate it.

Below we define a function to clip the gradients of a model that is either a RNNModelScratch instance or a Gluon model. Also note that we compute the gradient norm over all parameters.

#@save
if isinstance(model, gluon.Block):
params = [p.data() for p in model.collect_params().values()]
else:
params = model.params
norm = math.sqrt(sum((p.grad ** 2).sum() for p in params))
if norm > theta:
for param in params:

#@save
if isinstance(model, nn.Module):
params = [p.data for p in model.parameters() if p.requires_grad]
else:
params = model.params

norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
if norm > theta:
for param in params:


## 8.5.6. Training¶

Let us first define the function to train the model on one data epoch. It differs from the models training of Section 3.6 in three places:

1. Different sampling methods for sequential data (random sampling and sequential partitioning) will result in differences in the initialization of hidden states.

2. We clip the gradients before updating the model parameters. This ensures that the model does not diverge even when gradients blow up at some point during the training process, and it effectively reduces the step size automatically.

3. We use perplexity to evaluate the model. This ensures that sequences of different length are comparable.

When the sequential partitioning is used, we initialize the hidden state at the beginning of each epoch. Since the $$i^\mathrm{th}$$ example in the next minibatch is adjacent to the current $$i^\mathrm{th}$$ example, so the next minibatch can use the current hidden state directly, we only detach the gradient so that we compute the gradients within a minibatch. When using the random sampling, we need to re-initialize the hidden state for each iteration since each example is sampled with a random position. Same as the train_epoch_ch3 function in Section 3.6, we use generalized updater, which could be either a Gluon trainer or a scratched implementation.

#@save
def train_epoch_ch8(model, train_iter, loss, updater, ctx, use_random_iter):
state, timer = None, d2l.Timer()
metric = d2l.Accumulator(2)  # loss_sum, num_examples
for X, Y in train_iter:
if state is None or use_random_iter:
# Initialize state when either it is the first iteration or
# using random sampling.
state = model.begin_state(batch_size=X.shape, ctx=ctx)
else:
for s in state:
s.detach()
y = Y.T.reshape(-1)
X, y = X.as_in_ctx(ctx), y.as_in_ctx(ctx)
py, state = model(X, state)
l = loss(py, y).mean()
l.backward()
updater(batch_size=1)  # Since used mean already
return math.exp(metric/metric), metric/timer.stop()

#@save
def train_epoch_ch8(model, train_iter, loss, updater, device,
use_random_iter):
state, timer = None, d2l.Timer()
metric = d2l.Accumulator(2)  # loss_sum, num_examples
for X, Y in train_iter:
if state is None or use_random_iter:
# Initialize state when either it is the first iteration or
# using random sampling.
state = model.begin_state(batch_size=X.shape, device=device)
else:
for s in state:
s.detach_()
y = Y.T.reshape(-1)
X, y = X.to(device), y.to(device)
py, state = model(X, state)
l = loss(py, y.long()).mean()

if isinstance(updater, torch.optim.Optimizer):
l.backward()
updater.step()
else:
l.backward()
updater(batch_size=1)  # Since used mean already

return math.exp(metric/metric), metric/timer.stop()


The training function again supports either we implement the model from scratch or using Gluon.

#@save
def train_ch8(model, train_iter, vocab, lr, num_epochs, ctx,
use_random_iter=False):
# Initialize
loss = gluon.loss.SoftmaxCrossEntropyLoss()
animator = d2l.Animator(xlabel='epoch', ylabel='perplexity',
legend=['train'], xlim=[1, num_epochs])
if isinstance(model, gluon.Block):
model.initialize(ctx=ctx, force_reinit=True, init=init.Normal(0.01))
trainer = gluon.Trainer(model.collect_params(),
'sgd', {'learning_rate': lr})

def updater(batch_size):
return trainer.step(batch_size)
else:
def updater(batch_size):
return d2l.sgd(model.params, lr, batch_size)

def predict(prefix):
return predict_ch8(prefix, 50, model, vocab, ctx)

# Train and check the progress.
for epoch in range(num_epochs):
ppl, speed = train_epoch_ch8(
model, train_iter, loss, updater, ctx, use_random_iter)
if epoch % 10 == 0:
print(predict('time traveller'))
print(f'perplexity {ppl:.1f}, {speed:.1f} tokens/sec on {str(ctx)}')
print(predict('time traveller'))
print(predict('traveller'))

#@save
def train_ch8(model, train_iter, vocab, lr, num_epochs, device,
use_random_iter=False):
# Initialize
loss = nn.CrossEntropyLoss()
animator = d2l.Animator(xlabel='epoch', ylabel='perplexity',
legend=['train'], xlim=[1, num_epochs])
if isinstance(model, nn.Module):
trainer = torch.optim.SGD(model.parameters(), lr)

def updater(batch_size):
return trainer.step()
else:
def updater(batch_size):
return d2l.sgd(model.params, lr, batch_size)

def predict(prefix):
return predict_ch8(prefix, 50, model, vocab, device)

# Train and check the progress.
for epoch in range(num_epochs):
ppl, speed = train_epoch_ch8(
model, train_iter, loss, updater, device, use_random_iter)
if epoch % 10 == 0:
print(predict('time traveller'))
print(f'perplexity {ppl:.1f}, {speed:.1f} tokens/sec on {str(device)}')
print(predict('time traveller'))
print(predict('traveller'))


Now we can train a model. Since we only use $$10,000$$ tokens in the dataset, the model needs more epochs to converge.

num_epochs, lr = 500, 1
train_ch8(model, train_iter, vocab, lr, num_epochs, ctx)

perplexity 1.1, 39021.3 tokens/sec on gpu(0)
time traveller  it s against reason said filby  what reason said
traveller sit save bon tak it asforeficger deew shell tato num_epochs, lr = 500, 1
train_ch8(model, train_iter, vocab, lr, num_epochs, device)

perplexity 1.0, 63018.9 tokens/sec on cuda:0
time traveller  it s against reason said filby  what reason said
traveller  it s against reason said filby  what reason said Finally let us check the results to use a random sampling iterator.

train_ch8(model, train_iter, vocab, lr, num_epochs, ctx,
use_random_iter=True)

perplexity 1.3, 37791.5 tokens/sec on gpu(0)
time traveller  you can show black is white by argument said fil
traveller for so it will be convenient to speak of him was train_ch8(model, train_iter, vocab, lr, num_epochs, device,
use_random_iter=True)

perplexity 1.4, 63330.2 tokens/sec on cuda:0
time traveller  it s against reason said filby  what reason said
traveller smiled round at us then still smiling faintly and While implementing the above RNN model from scratch is instructive, it is not convenient. In the next section we will see how to improve significantly on the current model and how to make it faster and easier to implement.

## 8.5.7. Summary¶

• Sequence models need state initialization for training.

• Between sequential models you need to ensure to detach the gradients, to ensure that the automatic differentiation does not propagate effects beyond the current sample.

• A simple RNN language model consists of an encoder, an RNN model, and a decoder.

• Perplexity calibrates model performance across different sequence length. It is the exponentiated average of the cross-entropy loss.

• Sequential partitioning typically leads to better models.

## 8.5.8. Exercises¶

1. Show that one-hot encoding is equivalent to picking a different embedding for each object.

2. Adjust the hyperparameters to improve the perplexity.

• How low can you go? Adjust embeddings, hidden units, learning rate, etc.

• How well will it work on other books by H. G. Wells, e.g., The War of the Worlds.

3. Modify the predict function such as to use sampling rather than picking the most likely next character.

• What happens?

• Bias the model towards more likely outputs, e.g., by sampling from $$q(w_t \mid w_{t-1}, \ldots, w_1) \propto p^\alpha(w_t \mid w_{t-1}, \ldots, w_1)$$ for $$\alpha > 1$$.

4. Run the code in this section without clipping the gradient. What happens?

5. Change sequential partitioning so that it does not separate hidden states from the computational graph. Does the running time change? How about the accuracy?

6. Replace the activation function used in this section with ReLU and repeat the experiments in this section.

7. Prove that the perplexity is the inverse of the harmonic mean of the conditional word probabilities.