.. _chap_modern_rnn:

Modern Recurrent Neural Networks
================================


The previous chapter introduced the key ideas behind recurrent neural
networks (RNNs). However, just as with convolutional neural networks,
there has been a tremendous amount of innovation in RNN architectures,
culminating in several complex designs that have proven successful in
practice. In particular, the most popular designs feature mechanisms for
mitigating the notorious numerical instability faced by RNNs, as
typified by vanishing and exploding gradients. Recall that in
:numref:`chap_rnn` we dealt with exploding gradients by applying a
blunt gradient clipping heuristic. Despite the efficacy of this hack, it
leaves open the problem of vanishing gradients.

In this chapter, we introduce the key ideas behind the most successful
RNN architectures for sequences, which stem from two papers. The first,
*Long Short-Term Memory* :cite:`Hochreiter.Schmidhuber.1997`,
introduces the *memory cell*, a unit of computation that replaces
traditional nodes in the hidden layer of a network. With these memory
cells, networks are able to overcome difficulties with training
encountered by earlier recurrent networks. Intuitively, the memory cell
avoids the vanishing gradient problem by keeping values in each memory
cell’s internal state cascading along a recurrent edge with weight 1
across many successive time steps. A set of multiplicative gates helps
the network determine not only which inputs to allow into the memory
state, but also when the content of the memory state should influence
the model’s output.
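
The weight-1 intuition can be illustrated with a deliberately simplified
sketch. It assumes a scalar, purely linear recurrence
``s_t = w * s_{t-1}``, with no gates or nonlinearities, so it is not the
LSTM itself; the function name ``backprop_factor`` is hypothetical. Under
that assumption, the gradient of the final state with respect to the
initial state is simply ``w`` multiplied by itself once per time step:

```python
def backprop_factor(recurrent_weight, num_steps):
    """Gradient of the last state w.r.t. the first for s_t = w * s_{t-1}.

    Illustrative assumption: a scalar linear recurrence, so the gradient
    is a product of `num_steps` identical factors.
    """
    factor = 1.0
    for _ in range(num_steps):
        factor *= recurrent_weight
    return factor


# Compare a shrinking, a unit, and a growing recurrent weight over 50 steps.
for w in (0.5, 1.0, 1.5):
    print(f"w = {w}: gradient factor after 50 steps = {backprop_factor(w, 50):.3e}")
```

With a weight below 1 the factor shrinks toward zero, with a weight above
1 it grows rapidly, and only a weight of exactly 1 leaves it unchanged,
which is the property the memory cell's internal state exploits.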

The second paper, *Bidirectional Recurrent Neural Networks*
:cite:`Schuster.Paliwal.1997`, introduces an architecture in which
information from both the future (subsequent time steps) and the past
(preceding time steps) is used to determine the output at any point in
the sequence. This is in contrast to previous networks, in which only
past input can affect the output. Bidirectional RNNs have become a
mainstay for sequence labeling tasks in natural language processing,
among a myriad of other tasks. Fortunately, the two innovations are not
mutually exclusive, and have been successfully combined for phoneme
classification :cite:`Graves.Schmidhuber.2005` and handwriting
recognition :cite:`graves2008novel`.
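
As a rough sketch of the bidirectional idea, the toy example below runs
one scalar ``tanh`` recurrence left to right and another right to left,
then pairs the two hidden states at each time step. The helper names
``run_rnn`` and ``bidirectional_states``, the shared fixed weights, and
the scalar inputs are all assumptions made for illustration; practical
bidirectional RNNs use separate learned parameters for each direction.

```python
import math


def run_rnn(inputs, w_x=0.5, w_h=0.8):
    """Run a scalar tanh RNN over `inputs` and return every hidden state."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


def bidirectional_states(inputs):
    """Pair a forward state (past context) with a backward state (future context)."""
    forward = run_rnn(inputs)
    # Process the reversed sequence, then flip back so index t lines up.
    backward = run_rnn(inputs[::-1])[::-1]
    return list(zip(forward, backward))


original = bidirectional_states([1.0, 2.0, 3.0])
modified = bidirectional_states([1.0, 2.0, -3.0])  # only the last input differs

# At time step 0 the forward state is unaffected by the later change,
# whereas the backward state reflects it.
print(original[0], modified[0])
```

Changing only the final input leaves the forward state at the first time
step untouched but alters the backward state there, which is how a
bidirectional model lets subsequent time steps influence earlier outputs.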

The first sections in this chapter will explain the LSTM architecture, a
lighter-weight version called the gated recurrent unit (GRU), the key
ideas behind bidirectional RNNs, and how RNN layers are stacked
together to form deep RNNs. Subsequently, we will
explore the application of RNNs in sequence-to-sequence tasks,
introducing machine translation along with key ideas such as
*encoder–decoder* architectures and *beam search*.

.. toctree::
   :maxdepth: 2

   lstm
   gru
   deep-rnn
   bi-rnn
   machine-translation-and-dataset
   encoder-decoder
   seq2seq
   beam-search