# 9.1. Gated Recurrent Units (GRU)

In the previous section, we discussed how gradients are calculated in a recurrent neural network. In particular, we found that long products of matrices can lead to vanishing or divergent gradients. Let us briefly think about what such gradient anomalies mean in practice:

- We might encounter a situation where an early observation is highly significant for predicting all future observations. Consider the somewhat contrived case where the first observation contains a checksum and the goal is to discern whether the checksum is correct at the end of the sequence. In this case, the influence of the first token is vital. We would like to have some mechanism for storing vital early information in a *memory cell*. Without such a mechanism, we will have to assign a very large gradient to this observation, since it affects all subsequent observations.
- We might encounter situations where some symbols carry no pertinent information. For instance, when parsing a web page there might be auxiliary HTML code that is irrelevant for the purpose of assessing the sentiment conveyed on the page. We would like to have some mechanism for *skipping such symbols* in the latent state representation.
- We might encounter situations where there is a logical break between parts of a sequence. For instance, there might be a transition between chapters in a book, or a transition between a bear and a bull market for securities. In this case it would be nice to have a means of *resetting* our internal state representation.
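The gradient behavior recalled at the start of this section can be illustrated numerically. The following is a minimal sketch in plain Python rather than MXNet; the two weight matrices and the helper names `matmul`, `frobenius_norm`, and `product_norm` are assumptions made for illustration. It multiplies a fixed matrix by itself repeatedly and prints the norm of the product, which shrinks toward zero when the largest eigenvalue is below 1 and grows rapidly when it is above 1.

```python
import math

def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def frobenius_norm(m):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in m for x in row))

def product_norm(w, steps):
    """Norm of the 2x2 matrix w multiplied by itself `steps` times."""
    result = [[1.0, 0.0], [0.0, 1.0]]  # start from the 2x2 identity
    for _ in range(steps):
        result = matmul(result, w)
    return frobenius_norm(result)

# Hypothetical recurrent weight matrices, chosen only for illustration.
small = [[0.5, 0.1], [0.1, 0.5]]  # eigenvalues 0.6 and 0.4, both below 1
large = [[1.5, 0.1], [0.1, 1.5]]  # eigenvalues 1.6 and 1.4, both above 1

for steps in (1, 10, 50):
    print(steps, product_norm(small, steps), product_norm(large, steps))
```

After 50 steps the first product is numerically negligible while the second has grown by many orders of magnitude, which is the scalar intuition behind vanishing and divergent gradients.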

A number of methods have been proposed to address these issues. One of the earliest is Long Short-Term Memory (LSTM) [Hochreiter & Schmidhuber, 1997], which we will discuss in Section 9.2. The Gated Recurrent Unit (GRU) [Cho et al., 2014] is a slightly more streamlined variant that often offers comparable performance and is significantly faster to compute. See also [Chung et al., 2014] for more details. Due to its simplicity, let us start with the GRU.

## 9.1.2. Implementation from Scratch

To gain a better understanding of the model, let us implement a GRU from scratch.

### 9.1.2.1. Reading the Dataset

We begin by reading *The Time Machine* corpus that we used in
Section 8.5. The code for reading the dataset is given
below:

```
from d2l import mxnet as d2l
from mxnet import np, npx
from mxnet.gluon import rnn
npx.set_np()
batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
```

### 9.1.2.2. Initializing Model Parameters

The next step is to initialize the model parameters. We draw the weights
from a Gaussian distribution with standard deviation \(0.01\) and set the
biases to \(0\). The hyperparameter `num_hiddens` defines the number of
hidden units. We instantiate all weights and biases relating to the
update gate, the reset gate, and the candidate hidden state, as well as
the output layer. Subsequently, we attach gradients to all the parameters.

```
def get_params(vocab_size, num_hiddens, ctx):
    num_inputs = num_outputs = vocab_size

    def normal(shape):
        return np.random.normal(scale=0.01, size=shape, ctx=ctx)

    def three():
        # Input-to-hidden weight, hidden-to-hidden weight, and bias
        return (normal((num_inputs, num_hiddens)),
                normal((num_hiddens, num_hiddens)),
                np.zeros(num_hiddens, ctx=ctx))

    W_xz, W_hz, b_z = three()  # Update gate parameters
    W_xr, W_hr, b_r = three()  # Reset gate parameters
    W_xh, W_hh, b_h = three()  # Candidate hidden state parameters
    # Output layer parameters
    W_hq = normal((num_hiddens, num_outputs))
    b_q = np.zeros(num_outputs, ctx=ctx)
    # Attach gradients
    params = [W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        param.attach_grad()
    return params
```
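As a quick sanity check on the shapes above, the number of scalar parameters can be counted by hand. The sketch below is plain Python and independent of MXNet; the function name `gru_param_count` and the vocabulary size used in the call are hypothetical and serve only as an illustration.

```python
def gru_param_count(vocab_size, num_hiddens):
    """Count the scalars created by an initializer shaped like get_params."""
    num_inputs = num_outputs = vocab_size
    # Each of the three groups (update gate, reset gate, candidate state)
    # holds an input weight, a hidden weight, and a bias.
    per_group = (num_inputs * num_hiddens
                 + num_hiddens * num_hiddens
                 + num_hiddens)
    # Output layer: weight matrix plus bias.
    output_layer = num_hiddens * num_outputs + num_outputs
    return 3 * per_group + output_layer

# vocab_size=28 is an assumed value for illustration, not read from the data.
print(gru_param_count(vocab_size=28, num_hiddens=256))
```

The three gate groups dominate the count because each contains a `num_hiddens` by `num_hiddens` matrix, roughly tripling the recurrent parameters of a basic RNN with the same hidden size.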

### 9.1.2.3. Defining the Model

Now we will define the hidden state initialization function
`init_gru_state`. Just like the `init_rnn_state` function defined in
Section 8.5, this function returns a tuple containing a single tensor of
shape (batch size, number of hidden units) whose values are all zeros.

```
def init_gru_state(batch_size, num_hiddens, ctx):
    return (np.zeros(shape=(batch_size, num_hiddens), ctx=ctx), )
```

Now we are ready to define the GRU model. Its structure is the same as the basic RNN cell, except that the update equations are more complex.

```
def gru(inputs, state, params):
    W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    for X in inputs:
        # Update gate and reset gate
        Z = npx.sigmoid(np.dot(X, W_xz) + np.dot(H, W_hz) + b_z)
        R = npx.sigmoid(np.dot(X, W_xr) + np.dot(H, W_hr) + b_r)
        # Candidate hidden state, computed from the reset-gated old state
        H_tilda = np.tanh(np.dot(X, W_xh) + np.dot(R * H, W_hh) + b_h)
        # Blend the old state and the candidate using the update gate
        H = Z * H + (1 - Z) * H_tilda
        Y = np.dot(H, W_hq) + b_q
        outputs.append(Y)
    return np.concatenate(outputs, axis=0), (H,)
```
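Written as equations, the loop body of `gru` computes the following at each timestep \(t\). This is a direct transcription of the code; the symbols \(\mathbf{X}_t\), \(\mathbf{H}_{t-1}\), and so on are chosen here to mirror the variable names, \(\sigma\) denotes the sigmoid function, and \(\odot\) denotes elementwise multiplication.

\[
\begin{aligned}
\mathbf{Z}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xz} + \mathbf{H}_{t-1} \mathbf{W}_{hz} + \mathbf{b}_z),\\
\mathbf{R}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xr} + \mathbf{H}_{t-1} \mathbf{W}_{hr} + \mathbf{b}_r),\\
\tilde{\mathbf{H}}_t &= \tanh(\mathbf{X}_t \mathbf{W}_{xh} + (\mathbf{R}_t \odot \mathbf{H}_{t-1}) \mathbf{W}_{hh} + \mathbf{b}_h),\\
\mathbf{H}_t &= \mathbf{Z}_t \odot \mathbf{H}_{t-1} + (1 - \mathbf{Z}_t) \odot \tilde{\mathbf{H}}_t,\\
\mathbf{Y}_t &= \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q.
\end{aligned}
\]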

### 9.1.2.4. Training and Prediction

Training and prediction work in exactly the same manner as before. After training for the number of epochs set below, the perplexity and the generated text look like the following.

```
vocab_size, num_hiddens, ctx = len(vocab), 256, d2l.try_gpu()
num_epochs, lr = 500, 1
model = d2l.RNNModelScratch(len(vocab), num_hiddens, ctx, get_params,
                            init_gru_state, gru)
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, ctx)
```

```
perplexity 1.1, 13498.8 tokens/sec on gpu(0)
time traveller it s against reason said filby what reason said
traveller it s against reason said filby what reason said
```
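To help read the perplexity figure in this output, recall that perplexity is the exponential of the average per-token cross-entropy. The sketch below is plain Python under that assumption; the function name `perplexity` and both probability lists are made up for illustration and are not taken from the trained model.

```python
import math

def perplexity(target_probs):
    """exp of the mean negative log-probability given to the true tokens."""
    nll = [-math.log(p) for p in target_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities that a model assigns to the correct next token.
confident = [0.95, 0.9, 0.92, 0.88]
# Uniform guessing over an assumed vocabulary of 28 tokens.
uniform = [1 / 28] * 4

print(perplexity(confident))
print(perplexity(uniform))
```

A model that guesses uniformly has a perplexity equal to the vocabulary size, while a value close to 1 means the model assigns high probability to almost every token it sees in the training text.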

## 9.1.3. Concise Implementation

In Gluon, we can directly call the `GRU` class in the `rnn` module.
This encapsulates all the configuration details that we made explicit
above. The code is significantly faster because it uses compiled operators
rather than Python for many of the steps that we spelled out before.

```
gru_layer = rnn.GRU(num_hiddens)
model = d2l.RNNModel(gru_layer, len(vocab))
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, ctx)
```

```
perplexity 1.1, 191591.0 tokens/sec on gpu(0)
time traveller it s against reason said filby what reason said
traveller said the time traveller you can show black is wh
```

## 9.1.4. Summary

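- Gated recurrent neural networks are better than basic RNNs at capturing dependencies for time series with large timestep distances.
- Reset gates help capture short-term dependencies in time series.
- Update gates help capture long-term dependencies in time series.
- GRUs contain basic RNNs as an extreme case: when the reset gate is fully on (\(\mathbf{R} = 1\)) and the update gate is fully off (\(\mathbf{Z} = 0\)), the update in `gru` reduces to a plain tanh RNN step. Conversely, when the update gate is close to \(1\), the previous hidden state is carried over almost unchanged, so the model can skip over parts of a sequence as needed.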
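The extreme cases of the gates can be checked with a scalar toy version of the update used in `gru`. This is a minimal sketch in plain Python: the gate values are supplied by hand instead of being computed by sigmoids, and the names `gru_step` and `rnn_step` and the weights `w_x`, `w_h`, and `b` are hypothetical.

```python
import math

def gru_step(x, h, z, r, w_x=0.8, w_h=0.5, b=0.0):
    """One scalar GRU step with the gate values z and r supplied directly.

    In a real GRU, z and r come from sigmoids of learned linear maps; here
    they are fixed by hand to expose the limiting behavior.
    """
    h_tilde = math.tanh(x * w_x + (r * h) * w_h + b)
    return z * h + (1 - z) * h_tilde

def rnn_step(x, h, w_x=0.8, w_h=0.5, b=0.0):
    """Plain tanh RNN step with the same (hypothetical) weights."""
    return math.tanh(x * w_x + h * w_h + b)

x, h = 1.0, 0.3
# Reset gate on, update gate off: identical to the basic RNN step.
print(math.isclose(gru_step(x, h, z=0.0, r=1.0), rnn_step(x, h)))
# Update gate on: the old state is carried over and the input is skipped.
print(math.isclose(gru_step(x, h, z=1.0, r=1.0), h))
# Reset gate off (update gate off): the old state is ignored entirely.
print(math.isclose(gru_step(x, h, z=0.0, r=0.0), math.tanh(x * 0.8)))
```

Intermediate gate values interpolate between these behaviors, which is what lets a trained GRU decide per unit and per timestep how much history to keep.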

## 9.1.5. Exercises

1. Compare runtime, perplexity, and the output strings for the `rnn.RNN` and `rnn.GRU` implementations with each other.
2. Assume that we only want to use the input for timestep \(t'\) to predict the output at timestep \(t > t'\). What are the best values for the reset and update gates for each timestep?
3. Adjust the hyperparameters and observe and analyze the impact on running time, perplexity, and the generated text.
4. What happens if you implement only parts of a GRU? That is, implement a recurrent cell that only has a reset gate. Likewise, implement a recurrent cell that only has an update gate.