10. Modern Recurrent Neural Networks

The previous chapter introduced the key ideas behind recurrent neural networks (RNNs). However, just as with convolutional neural networks, there has been a tremendous amount of innovation in RNN architectures, culminating in several complex designs that have proven successful in practice. In particular, the most popular designs feature mechanisms to mitigate the notorious numerical instability faced by RNNs, as typified by vanishing and exploding gradients. Recall that in Section 9 we dealt with exploding gradients by applying a blunt gradient clipping heuristic. Despite the efficacy of this hack, it leaves open the problem of vanishing gradients.
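To make the limitation concrete, here is a minimal sketch of norm-based gradient clipping in plain Python. The function name `clip_gradients` and the flat-list representation of gradients are assumptions made for illustration, not the implementation used in the previous chapter.

```python
import math

def clip_gradients(grads, max_norm):
    """Rescale a flat list of gradient values so its L2 norm is at most max_norm.

    Hypothetical helper for illustration; real frameworks operate on tensors.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        # Shrink every component by the same factor, preserving direction.
        scale = max_norm / norm
        return [g * scale for g in grads]
    # Gradients that are already small are returned unchanged.
    return list(grads)

# An "exploding" gradient with norm 5 is scaled down to norm 1.
large = clip_gradients([3.0, 4.0], max_norm=1.0)

# A "vanishing" gradient is left exactly as it was.
tiny = clip_gradients([1e-8, 1e-8], max_norm=1.0)
```

Clipping only ever shrinks gradients, so it cannot restore a signal that has already decayed toward zero. That is why vanishing gradients call for an architectural remedy instead.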

In this chapter, we introduce the key ideas behind the most successful RNN architectures for sequences, which stem from two papers published in 1997. The first paper, Long Short-Term Memory (Hochreiter and Schmidhuber, 1997), introduces the memory cell, a unit of computation that replaces traditional nodes in the hidden layer of a network. With these memory cells, networks are able to overcome difficulties with training encountered by earlier recurrent networks. Intuitively, the memory cell avoids the vanishing gradient problem by keeping values in each memory cell’s internal state cascading along a recurrent edge with weight 1 across many successive time steps. A set of multiplicative gates helps the network to determine both which inputs to allow into the memory state, and when the content of the memory state should influence the model’s output.
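The intuition about the weight-1 recurrent edge can be illustrated with a small scalar sketch. Both functions below are hypothetical simplifications: `gradient_through_time` assumes a purely linear recurrence, and `memory_cell_step` takes the gate values as given numbers in [0, 1], whereas a real LSTM computes them from learned parameters.

```python
import math

def gradient_through_time(recurrent_weight, num_steps):
    """Gradient of the final state w.r.t. the initial state for the linear
    recurrence s_t = recurrent_weight * s_{t-1} + x_t (an assumed toy model)."""
    grad = 1.0
    for _ in range(num_steps):
        grad *= recurrent_weight  # chain rule: one factor per time step
    return grad

def memory_cell_step(state, candidate, input_gate, output_gate):
    """One step of a simplified scalar memory cell.

    The internal state is carried forward along an edge with weight 1;
    the gates only scale what is written to and read from that state.
    """
    new_state = 1.0 * state + input_gate * candidate
    output = output_gate * math.tanh(new_state)
    return new_state, output

vanishing = gradient_through_time(0.9, 100)   # shrinks toward zero
exploding = gradient_through_time(1.1, 100)   # grows very large
preserved = gradient_through_time(1.0, 100)   # stays exactly 1.0

# With the input gate closed, the stored value survives the step untouched.
state, output = memory_cell_step(0.5, candidate=2.0, input_gate=0.0, output_gate=0.0)
```

With any recurrent weight other than 1, the gradient shrinks or grows geometrically with the number of steps. A weight of exactly 1 leaves it unchanged, and the gates decide when the stored value is updated or exposed.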

The second paper, Bidirectional Recurrent Neural Networks (Schuster and Paliwal, 1997), introduces an architecture in which information from both the future (subsequent time steps) and the past (preceding time steps) is used to determine the output at any point in the sequence. This is in contrast to previous networks, in which only past inputs can affect the output. Bidirectional RNNs have become a mainstay for sequence labeling tasks in natural language processing, among myriad other tasks. Fortunately, the two innovations are not mutually exclusive, and have been successfully combined for phoneme classification (Graves and Schmidhuber, 2005) and handwriting recognition (Graves et al., 2008).
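As a rough sketch of the bidirectional idea, the toy code below runs one scalar recurrence left to right and another right to left, then pairs the two hidden states at each time step. The names `run_rnn` and `bidirectional_states`, the tanh recurrence, and the fixed weights are all assumptions chosen for illustration, not the formulation from the paper.

```python
import math

def run_rnn(inputs, w_x, w_h):
    """Run a toy scalar RNN, h_t = tanh(w_x * x_t + w_h * h_{t-1}), over inputs."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def bidirectional_states(inputs, fwd=(0.5, 0.5), bwd=(0.5, 0.5)):
    """Pair a forward-pass state with a backward-pass state at every time step."""
    forward = run_rnn(inputs, *fwd)
    # Process the reversed sequence, then flip back so index t lines up again.
    backward = run_rnn(inputs[::-1], *bwd)[::-1]
    return list(zip(forward, backward))

# Two sequences that differ only in their last element.
a = bidirectional_states([1.0, 2.0, 3.0])
b = bidirectional_states([1.0, 2.0, -3.0])
```

At the first time step, the forward states of `a` and `b` agree because they have seen the same past, while the backward states differ because they summarize different futures.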

The first sections in this chapter will explain the LSTM architecture, a lighter-weight version called the gated recurrent unit (GRU), the key ideas behind bidirectional RNNs, and a brief explanation of how RNN layers are stacked together to form deep RNNs. Subsequently, we will explore the application of RNNs in sequence-to-sequence tasks, introducing machine translation along with key ideas such as encoder-decoder architectures and beam search.