9.8. Beam Search

In Section 9.7, we discussed how to train an encoder-decoder with input and output sequences that are both of variable length. In this section, we introduce how to use the encoder-decoder to predict output sequences of variable length.

As in Section 9.5, when preparing the training dataset we normally attach a special symbol “<eos>” to the end of each sentence to indicate the termination of the sequence. We will continue to use this special symbol in the discussion below. For ease of discussion, we assume that the output of the decoder is a sequence of text. Let the size of the output text dictionary \(\mathcal{Y}\) (which contains the special symbol “<eos>”) be \(\left|\mathcal{Y}\right|\), and let the maximum length of the output sequence be \(T'\). There are a total of \(\mathcal{O}(\left|\mathcal{Y}\right|^{T'})\) possible output sequences. All the subsequences after the special symbol “<eos>” in these output sequences are discarded. As before, we denote the context vector by \(\mathbf{c}\), which encodes the information of all the hidden states from the input.

9.8.1. Greedy Search

First, we will take a look at a simple solution: greedy search. For any timestep \(t'\) of the output sequence, we search for the word with the highest conditional probability among the \(|\mathcal{Y}|\) words in the dictionary, taking

(9.8.1) \[y_{t'} = \operatorname*{argmax}_{y \in \mathcal{Y}} P(y \mid y_1, \ldots, y_{t'-1}, \mathbf{c})\]

as the output. Once the “<eos>” symbol is detected, or the output sequence has reached its maximum length \(T'\), the output is completed.

As we mentioned in our discussion of the decoder, the conditional probability of generating an output sequence based on the input sequence is \(\prod_{t'=1}^{T'} P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c})\). We will take the output sequence with the highest conditional probability as the optimal sequence. The main problem with greedy search is that there is no guarantee that the optimal sequence will be obtained.
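
As a concrete sketch (not the book's implementation), greedy decoding can be written as follows, assuming a hypothetical function next_token_probs(prefix) that returns the decoder's conditional distribution \(P(y \mid y_1, \ldots, y_{t'-1}, \mathbf{c})\) over the output dictionary:

import numpy as np

def greedy_search(next_token_probs, vocab, eos='<eos>', max_len=10):
    """Pick the most probable word at every timestep until <eos> or max_len."""
    output = []
    for _ in range(max_len):
        probs = next_token_probs(output)       # P(y | generated prefix, c)
        output.append(vocab[int(np.argmax(probs))])
        if output[-1] == eos:                  # stop once "<eos>" is emitted
            break
    return output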

Take a look at the example below. We assume that there are four words “A”, “B”, “C”, and “<eos>” in the output dictionary. The four numbers under each timestep in Fig. 9.8.1 represent the conditional probabilities of generating “A”, “B”, “C”, and “<eos>” at that timestep, respectively. At each timestep, greedy search selects the word with the highest conditional probability. Therefore, the output sequence “A”, “B”, “C”, and “<eos>” will be generated in Fig. 9.8.1. The conditional probability of this output sequence is \(0.5\times0.4\times0.4\times0.6 = 0.048\).

Fig. 9.8.1 The four numbers under each timestep represent the conditional probabilities of generating “A”, “B”, “C”, and “<eos>” at that timestep, respectively. At each timestep, greedy search selects the word with the highest conditional probability.

Now, let us look at another example, shown in Fig. 9.8.2. Unlike in Fig. 9.8.1, in Fig. 9.8.2 we select the word “C”, which has the second highest conditional probability at timestep 2. Since the output subsequence of timesteps 1 and 2, on which timestep 3 is based, has changed from “A” and “B” in Fig. 9.8.1 to “A” and “C” in Fig. 9.8.2, the conditional probability of each word generated at timestep 3 has also changed in Fig. 9.8.2. We choose the word “B”, which has the highest conditional probability at that timestep. Now the output subsequence on which timestep 4 is based is “A”, “C”, and “B”, which differs from “A”, “B”, and “C” in Fig. 9.8.1. Therefore, the conditional probability of generating each word at timestep 4 in Fig. 9.8.2 is also different from that in Fig. 9.8.1. We find that the conditional probability of the output sequence “A”, “C”, “B”, and “<eos>” is \(0.5\times0.3 \times0.6\times0.6=0.054\), which is higher than the conditional probability of the output sequence obtained by greedy search. Therefore, the output sequence “A”, “B”, “C”, and “<eos>” obtained by greedy search is not an optimal sequence.
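
The two sequence probabilities above are simply products of the per-timestep conditional probabilities read off the figures, which a quick check confirms:

p_greedy = 0.5 * 0.4 * 0.4 * 0.6   # "A", "B", "C", "<eos>" from Fig. 9.8.1
p_other  = 0.5 * 0.3 * 0.6 * 0.6   # "A", "C", "B", "<eos>" from Fig. 9.8.2
print(p_greedy, p_other)           # approximately 0.048 and 0.054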

Fig. 9.8.2 The four numbers under each timestep represent the conditional probabilities of generating “A”, “B”, “C”, and “<eos>” at that timestep. At timestep 2, the word “C”, which has the second highest conditional probability, is selected.

9.8.2. Exhaustive Search

If the goal is to obtain the optimal sequence, we may consider using exhaustive search: an exhaustive examination of all possible output sequences, which outputs the sequence with the highest conditional probability.

Although we can use an exhaustive search to obtain the optimal sequence, its computational overhead \(\mathcal{O}(\left|\mathcal{Y}\right|^{T'})\) is likely to be excessively high. For example, when \(|\mathcal{Y}|=10000\) and \(T'=10\), we would need to evaluate \(10000^{10} = 10^{40}\) sequences. This is next to impossible to complete. The computational overhead of greedy search is \(\mathcal{O}(\left|\mathcal{Y}\right|T')\), which is usually significantly less than that of an exhaustive search. For example, when \(|\mathcal{Y}|=10000\) and \(T'=10\), we only need to evaluate \(10000\times10=1\times10^5\) conditional probabilities.
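
The gap between the two is easy to appreciate with a quick back-of-the-envelope check (plain Python arithmetic, not part of the book's code):

Y, T = 10000, 10
print(Y ** T)   # exhaustive search: 10**40 candidate sequences to score
print(Y * T)    # greedy search: 10**5 conditional probabilities to evaluate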

9.8.3. Beam Search

Beam search is an improved algorithm based on greedy search. It has a hyperparameter named beam size, \(k\). At timestep 1, we select the \(k\) words with the highest conditional probability to be the first words of the \(k\) candidate output sequences. At each subsequent timestep, based on the \(k\) candidate output sequences from the previous timestep, we select the \(k\) output sequences with the highest conditional probability from the \(k\left|\mathcal{Y}\right|\) possible output sequences. These become the candidate output sequences for that timestep. Finally, from the candidate output sequences of every timestep, we take those containing the special symbol “<eos>”, discard the subsequences after it, and obtain the set of final candidate output sequences.

Fig. 9.8.3 The beam search process. The beam size is 2 and the maximum length of the output sequence is 3. The candidate output sequences are \(A\), \(C\), \(AB\), \(CE\), \(ABD\), and \(CED\).

Fig. 9.8.3 demonstrates the process of beam search with an example. Suppose that the vocabulary of the output sequence contains only five elements, \(\mathcal{Y} = \{A, B, C, D, E\}\), one of which is the special symbol “<eos>”. We set the beam size to 2 and the maximum length of the output sequence to 3. At timestep 1 of the output sequence, suppose the two words with the highest conditional probability \(P(y_1 \mid \mathbf{c})\) are \(A\) and \(C\). At timestep 2, for all \(y_2 \in \mathcal{Y}\), we compute

(9.8.2) \[P(A, y_2 \mid \mathbf{c}) = P(A \mid \mathbf{c})P(y_2 \mid A, \mathbf{c})\]

and

(9.8.3) \[P(C, y_2 \mid \mathbf{c}) = P(C \mid \mathbf{c})P(y_2 \mid C, \mathbf{c}),\]

and pick the largest two among these 10 values, say

(9.8.4) \[P(A, B \mid \mathbf{c}) \text{ and } P(C, E \mid \mathbf{c}).\]

Then at timestep 3, for all \(y_3 \in \mathcal{Y}\), we compute

(9.8.5) \[P(A, B, y_3 \mid \mathbf{c}) = P(A, B \mid \mathbf{c})P(y_3 \mid A, B, \mathbf{c})\]

and

(9.8.6) \[P(C, E, y_3 \mid \mathbf{c}) = P(C, E \mid \mathbf{c})P(y_3 \mid C, E, \mathbf{c}),\]

and pick the largest two among these 10 values, say

(9.8.7) \[P(A, B, D \mid \mathbf{c}) \text{ and } P(C, E, D \mid \mathbf{c}).\]

As a result, we obtain 6 candidate output sequences: (1) \(A\); (2) \(C\); (3) \(A\), \(B\); (4) \(C\), \(E\); (5) \(A\), \(B\), \(D\); and (6) \(C\), \(E\), \(D\). In the end, we obtain the set of final candidate output sequences based on these 6 sequences.
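
As a concrete illustration of this expand-and-prune loop, here is a minimal sketch (not the book's implementation). It again assumes a hypothetical next_token_probs(prefix) function returning the decoder's conditional distribution \(P(y \mid y_1, \ldots, y_{t'-1}, \mathbf{c})\) over the dictionary, and it works with log probabilities so that products become sums:

import numpy as np

def beam_search_candidates(next_token_probs, vocab, k=2, max_len=3):
    """Collect the candidate output sequences of every timestep."""
    beams = [([], 0.0)]            # pairs of (sequence, log probability)
    candidates = []                # candidates gathered from all timesteps
    for _ in range(max_len):
        expansions = []
        for seq, logp in beams:
            probs = next_token_probs(seq)          # distribution over vocab
            for i, word in enumerate(vocab):
                expansions.append((seq + [word], logp + np.log(probs[i])))
        # Keep the k most probable of the (at most) k * |Y| expanded sequences.
        expansions.sort(key=lambda c: c[1], reverse=True)
        beams = expansions[:k]
        candidates.extend(beams)
    return candidates              # e.g., A, C, AB, CE, ABD, CED when k = 2

For the example in Fig. 9.8.3, this sketch would return the six candidates listed above; sequences containing “<eos>” would then be truncated after that symbol before the final selection described next.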

From the set of final candidate output sequences, we take the sequence with the highest of the following scores as the output sequence:

(9.8.8) \[\frac{1}{L^\alpha} \log P(y_1, \ldots, y_{L} \mid \mathbf{c}) = \frac{1}{L^\alpha} \sum_{t'=1}^L \log P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c}),\]

where \(L\) is the length of the final candidate sequence and \(\alpha\) is usually set to 0.75. The \(L^\alpha\) in the denominator is a penalty term that normalizes the summed log probabilities of longer sequences. The computational overhead of beam search is \(\mathcal{O}(k\left|\mathcal{Y}\right|T')\), which lies between that of greedy search and that of exhaustive search. In addition, greedy search can be treated as beam search with a beam size of 1. Beam search strikes a balance between computational overhead and search quality through its flexible choice of beam size \(k\).
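
As a sketch of this final selection, assuming the candidates are stored as (sequence, sum of log conditional probabilities) pairs as in the beam search sketch above, with truncation at “<eos>” already done:

def best_sequence(candidates, alpha=0.75):
    """Pick the candidate with the highest length-normalized log probability."""
    def score(candidate):
        seq, logp = candidate
        return logp / (len(seq) ** alpha)   # (1 / L^alpha) * sum of log P
    return max(candidates, key=score)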

9.8.4. Summary

  • Methods for predicting variable-length sequences include greedy search, exhaustive search, and beam search.

  • Beam search strikes a balance between computational overhead and search quality using a flexible beam size.

9.8.5. Exercises

  1. Can we treat an exhaustive search as a beam search with a special beam size? Why?

  2. In Section 8.5, we used language models to generate sentences. Which kind of search does that generation method use? Can you improve on it?

9.8.6. Discussions
