9.4. Recurrent Neural Networks¶
In Section 9.3 we described Markov models and \(n\)-grams for language modeling, where the conditional probability of token \(x_t\) at time step \(t\) only depends on the \(n-1\) previous tokens. If we want to incorporate the possible effect of tokens earlier than time step \(t-(n-1)\) on \(x_t\), we need to increase \(n\). However, the number of model parameters would also increase exponentially with it, since we need to store \(|\mathcal{V}|^n\) numbers for a vocabulary set \(\mathcal{V}\). Hence, rather than modeling \(P(x_t \mid x_{t-1}, \ldots, x_{t-n+1})\) it is preferable to use a latent variable model:

\(P(x_t \mid x_{t-1}, \ldots, x_1) \approx P(x_t \mid h_{t-1}),\)   (9.4.1)
where \(h_{t-1}\) is a hidden state that stores the sequence information up to time step \(t-1\). In general, the hidden state at any time step \(t\) could be computed based on both the current input \(x_{t}\) and the previous hidden state \(h_{t-1}\):

\(h_t = f(x_{t}, h_{t-1}).\)   (9.4.2)
For a sufficiently powerful function \(f\) in (9.4.2), the latent variable model is not an approximation. After all, \(h_t\) may simply store all the data it has observed so far. However, it could potentially make both computation and storage expensive.
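To make (9.4.2) concrete, here is a minimal sketch of a recurrent hidden-state update in PyTorch. The tanh nonlinearity and the names W_xh, W_hh, and b_h are illustrative assumptions, not the parameters defined later in the book; the point is only that the same parameters are reused at every time step.

import torch

batch_size, num_inputs, num_hiddens = 2, 5, 4
W_xh = torch.randn(num_inputs, num_hiddens) * 0.01   # input-to-hidden weights (assumed names)
W_hh = torch.randn(num_hiddens, num_hiddens) * 0.01  # hidden-to-hidden weights
b_h = torch.zeros(num_hiddens)                       # hidden bias

H = torch.zeros(batch_size, num_hiddens)  # initial hidden state h_0
for t in range(3):
    X = torch.randn(batch_size, num_inputs)  # stand-in input x_t at time step t
    # The same parameters are reused at every time step, so the parameter
    # count does not grow with the length of the sequence.
    H = torch.tanh(X @ W_xh + H @ W_hh + b_h)
print(H.shape)  # torch.Size([2, 4])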
Recall that we discussed hidden layers with hidden units in Section 5. It is noteworthy that hidden layers and hidden states refer to two very different concepts. Hidden layers are, as explained, layers that are hidden from view on the path from input to output. Hidden states are, technically speaking, inputs to whatever computation we perform at a given time step, and they can only be computed by looking at data at previous time steps.
Recurrent neural networks (RNNs) are neural networks with hidden states. Before introducing the RNN model, we first revisit the MLP model introduced in Section 5.1.
The imports below correspond to the book's framework tabs; pick the one for your framework (PyTorch, MXNet, JAX, or TensorFlow).

# PyTorch
import torch
from d2l import torch as d2l

# MXNet
from mxnet import np, npx
from d2l import mxnet as d2l
npx.set_np()

# JAX
import jax
from jax import numpy as jnp
from d2l import jax as d2l

# TensorFlow
import tensorflow as tf
from d2l import tensorflow as d2l
9.4.3. RNN-Based Character-Level Language Models¶
Recall that for language modeling in Section 9.3, we aim to predict the next token based on the current and past tokens; thus we shift the original sequence by one token as the targets (labels). Bengio et al. (2003) first proposed to use a neural network for language modeling. In the following we illustrate how RNNs can be used to build a language model. Let the minibatch size be one, and the sequence of the text be “machine”. To simplify training in subsequent sections, we tokenize text into characters rather than words and consider a character-level language model. Fig. 9.4.2 demonstrates how to predict the next character based on the current and previous characters via an RNN for character-level language modeling.
Fig. 9.4.2 A character-level language model based on the RNN. The input and target sequences are “machin” and “achine”, respectively.¶
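As a toy illustration of this one-token shift (not the data pipeline used later in the book), the input and target sequences for “machine” can be formed as follows.

text = list("machine")
inputs, targets = text[:-1], text[1:]   # shift by one token
print(inputs)   # ['m', 'a', 'c', 'h', 'i', 'n']
print(targets)  # ['a', 'c', 'h', 'i', 'n', 'e']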
During the training process, we run a softmax operation on the output from the output layer for each time step, and then use the cross-entropy loss to compute the error between the model output and the target. Because of the recurrent computation of the hidden state in the hidden layer, the output, \(\mathbf{O}_3\), of time step 3 in Fig. 9.4.2 is determined by the text sequence “m”, “a”, and “c”. Since the next character of the sequence in the training data is “h”, the loss of time step 3 will depend on the probability distribution of the next character generated based on the feature sequence “m”, “a”, “c” and the target “h” of this time step.
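A rough sketch of this per-step loss (with random stand-in logits and targets, and an assumed vocabulary size of 28) could look as follows; note that F.cross_entropy applies the softmax internally and averages over the time steps.

import torch
import torch.nn.functional as F

# Illustrative only: pretend the output layer produced one row of logits per
# time step for the input "machin", and the targets are the shifted "achine".
vocab_size, num_steps = 28, 6                          # assumed vocabulary size
logits = torch.randn(num_steps, vocab_size)            # O_1, ..., O_6
targets = torch.randint(0, vocab_size, (num_steps,))   # target token indices
loss = F.cross_entropy(logits, targets)                # softmax + cross-entropy
print(loss)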
In practice, each token is represented by a \(d\)-dimensional vector, and we use a batch size \(n>1\). Therefore, the input \(\mathbf X_t\) at time step \(t\) will be an \(n\times d\) matrix, which is identical to what we discussed in Section 9.4.2.
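For instance, if each character were represented by a one-hot vector, \(d\) would equal the vocabulary size; the sketch below (with assumed values \(n=32\) and \(d=28\)) just shows the resulting shape of the input at one time step.

import torch
import torch.nn.functional as F

batch_size, vocab_size = 32, 28                     # assumed n and d
x_t = torch.randint(0, vocab_size, (batch_size,))   # token indices at time step t
X_t = F.one_hot(x_t, vocab_size).float()            # n x d one-hot input matrix
print(X_t.shape)  # torch.Size([32, 28])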
In the following sections, we will implement RNNs for character-level language models.
9.4.4. Summary¶
A neural network that uses recurrent computation for hidden states is called a recurrent neural network (RNN). The hidden state of an RNN can capture historical information of the sequence up to the current time step. With recurrent computation, the number of RNN model parameters does not grow as the number of time steps increases. As for applications, an RNN can be used to create character-level language models.
9.4.5. Exercises¶
1. If we use an RNN to predict the next character in a text sequence, what is the required dimension for any output?
2. Why can RNNs express the conditional probability of a token at some time step based on all the previous tokens in the text sequence?
3. What happens to the gradient if you backpropagate through a long sequence?
4. What are some of the problems associated with the language model described in this section?