.. _sec_seq2seq:

Sequence to Sequence
====================
The sequence to sequence (seq2seq) model is based on the encoder-decoder
architecture to generate a sequence output for a sequence input, as
demonstrated in :numref:`fig_seq2seq`. Both the encoder and the
decoder use recurrent neural networks (RNNs) to handle sequence inputs
of variable length. The hidden state of the encoder is used directly to
initialize the decoder hidden state to pass information from the encoder
to the decoder.

.. _fig_seq2seq:

.. figure:: ../img/seq2seq.svg

   The sequence to sequence model architecture.

The layers in the encoder and the decoder are illustrated in
:numref:`fig_seq2seq_details`.

.. _fig_seq2seq_details:

.. figure:: ../img/seq2seq-details.svg

   Layers in the encoder and the decoder.

In this section we will explain and implement the seq2seq model and
train it on the machine translation dataset.

.. code:: python

    import d2l
    from mxnet import np, npx, init, gluon, autograd
    from mxnet.gluon import nn, rnn
    npx.set_np()

Encoder
-------
Recall that the encoder of seq2seq can transform the inputs of variable
length to a fixed-length context vector :math:`\mathbf{c}` by encoding
the sequence information into :math:`\mathbf{c}`. We usually use RNN
layers within the encoder. Suppose that we have an input sequence
:math:`x_1, \ldots, x_T`, where :math:`x_t` is the :math:`t^\mathrm{th}`
word. At timestep :math:`t`, the RNN will have two vectors as the input:
the feature vector :math:`\mathbf{x}_t` of :math:`x_t` and the hidden
state of the last timestep :math:`\mathbf{h}_{t-1}`. Let’s denote the
transformation of the RNN’s hidden states by a function :math:`f`:

.. math:: \mathbf{h}_t = f (\mathbf{x}_t, \mathbf{h}_{t-1}).

Next, the encoder captures information of all the hidden states and
encodes it into the context vector :math:`\mathbf{c}` with a function
:math:`q`:

.. math:: \mathbf{c} = q (\mathbf{h}_1, \ldots, \mathbf{h}_T).

For example, if we choose :math:`q` as
:math:`q (\mathbf{h}_1, \ldots, \mathbf{h}_T) = \mathbf{h}_T`, then the
context vector will be the final hidden state :math:`\mathbf{h}_T`.
So far what we describe above is a unidirectional RNN, where each
timestep’s hidden state depends only on the previous timesteps’. We can
also use other forms of RNNs such as GRUs, LSTMs, and bidirectional RNNs
to encode the sequential input.
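As a toy NumPy illustration (the numbers are made up; this is not part of the d2l code), here are two common choices of :math:`q`, taking either the final hidden state or the average of all hidden states as the context vector:

.. code:: python

    import numpy as np

    # Toy hidden states for T = 3 timesteps, hidden size 2
    h = np.array([[0.1, 0.2],
                  [0.3, 0.4],
                  [0.5, 0.6]])

    # Choice 1: the context vector is the final hidden state h_T
    c_last = h[-1]

    # Choice 2 (also common): average all hidden states
    c_mean = h.mean(axis=0)

    print(c_last)  # [0.5 0.6]
    print(c_mean)  # [0.3 0.4]

The implementation below uses the first choice: the decoder is initialized from the encoder's final state.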
Now let’s implement the seq2seq encoder. Here we use a word
embedding layer to obtain the feature vector according to the word index
of the input language. Those feature vectors will be fed into a
multi-layer LSTM. The input to the encoder is a batch of sequences,
which is a 2-D tensor with shape (batch size, sequence length). The
encoder returns both the LSTM outputs, i.e., the hidden states of all
timesteps, and the hidden state and memory cell of the final
timestep.

.. code:: python

    # Saved in the d2l package for later use
    class Seq2SeqEncoder(d2l.Encoder):
        def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                     dropout=0, **kwargs):
            super(Seq2SeqEncoder, self).__init__(**kwargs)
            self.embedding = nn.Embedding(vocab_size, embed_size)
            self.rnn = rnn.LSTM(num_hiddens, num_layers, dropout=dropout)

        def forward(self, X, *args):
            X = self.embedding(X)  # X shape: (batch_size, seq_len, embed_size)
            # The RNN needs the first axis to be the timestep, i.e., seq_len
            X = X.swapaxes(0, 1)
            state = self.rnn.begin_state(batch_size=X.shape[1], ctx=X.context)
            out, state = self.rnn(X, state)
            # out shape: (seq_len, batch_size, num_hiddens)
            # state shape: (num_layers, batch_size, num_hiddens),
            # where "state" contains the hidden state and the memory cell
            return out, state

Next, we will create a minibatch sequence input with a batch size of 4
and 7 timesteps. We assume the LSTM has 2 hidden layers with 16 hidden
units each. The output shape returned by the encoder after performing
forward calculation on the input is (number of timesteps, batch size,
number of hidden units). The shape of the multi-layer hidden state in
the final timestep is (number of hidden layers, batch size, number of
hidden units). For a gated recurrent unit, the ``state`` list contains
only one element, the hidden state. Since long short-term memory is
used here, the ``state`` list also contains a second element, the
memory cell.

.. code:: python

    encoder = Seq2SeqEncoder(vocab_size=10, embed_size=8, num_hiddens=16,
                             num_layers=2)
    encoder.initialize()
    X = np.zeros((4, 7))
    output, state = encoder(X)
    output.shape

.. parsed-literal::
    :class: output

    (7, 4, 16)

Since an LSTM is used, the ``state`` list contains both the hidden
state and the memory cell, each with shape (number of hidden layers,
batch size, number of hidden units). However, if a GRU were used, the
``state`` list would contain only one element: the hidden state of the
final timestep with shape (number of hidden layers, batch size, number
of hidden units).

.. code:: python

    len(state), state[0].shape, state[1].shape

.. parsed-literal::
    :class: output

    (2, (2, 4, 16), (2, 4, 16))

.. _sec_seq2seq_decoder:

Decoder
-------
As we just introduced, the context vector :math:`\mathbf{c}` encodes the
information from the whole input sequence :math:`x_1, \ldots, x_T`.
Suppose that the given outputs in the training set are
:math:`y_1, \ldots, y_{T'}`. At each timestep :math:`t'`, the
conditional probability of output :math:`y_{t'}` will depend on the
previous output sequence :math:`y_1, \ldots, y_{t'-1}` and the context
vector :math:`\mathbf{c}`, i.e.,

.. math:: P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c}).

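By the chain rule of probability, these per-timestep conditionals
multiply to give the likelihood of the entire output sequence:

.. math:: P(y_1, \ldots, y_{T'} \mid \mathbf{c}) = \prod_{t'=1}^{T'} P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c}).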
Hence, we can use another RNN as the decoder. At timestep :math:`t'`,
the decoder will update its hidden state :math:`\mathbf{s}_{t'}` using
three inputs: the feature vector :math:`\mathbf{y}_{t'-1}` of
:math:`y_{t'-1}`, the context vector :math:`\mathbf{c}`, and the hidden
state of the last timestep :math:`\mathbf{s}_{t'-1}`. Let’s denote the
transformation of the RNN’s hidden states within the decoder by a
function :math:`g`:

.. math:: \mathbf{s}_{t'} = g(\mathbf{y}_{t'-1}, \mathbf{c}, \mathbf{s}_{t'-1}).

When implementing the decoder, we directly use the hidden state of the
encoder in the final timestep as the initial hidden state of the
decoder. This requires that the encoder and decoder RNNs have the same
numbers of layers and hidden units. The LSTM forward calculation of the
decoder is similar to that of the encoder. The only difference is that
we add a dense layer after the LSTM layers, whose output size is the
vocabulary size. The dense layer predicts a confidence score for
each word.

.. code:: python

    # Saved in the d2l package for later use
    class Seq2SeqDecoder(d2l.Decoder):
        def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                     dropout=0, **kwargs):
            super(Seq2SeqDecoder, self).__init__(**kwargs)
            self.embedding = nn.Embedding(vocab_size, embed_size)
            self.rnn = rnn.LSTM(num_hiddens, num_layers, dropout=dropout)
            self.dense = nn.Dense(vocab_size, flatten=False)

        def init_state(self, enc_outputs, *args):
            return enc_outputs[1]

        def forward(self, X, state):
            X = self.embedding(X).swapaxes(0, 1)
            out, state = self.rnn(X, state)
            # Make the batch the first dimension to simplify loss computation
            out = self.dense(out).swapaxes(0, 1)
            return out, state

We create a decoder with the same hyperparameters as the encoder. As we
can see, the output shape is changed to (batch size, sequence length,
vocabulary size).

.. code:: python

    decoder = Seq2SeqDecoder(vocab_size=10, embed_size=8,
                             num_hiddens=16, num_layers=2)
    decoder.initialize()
    state = decoder.init_state(encoder(X))
    out, state = decoder(X, state)
    out.shape, len(state), state[0].shape, state[1].shape

.. parsed-literal::
    :class: output

    ((4, 7, 10), 2, (2, 4, 16), (2, 4, 16))

The Loss Function
-----------------
For each timestep, the decoder outputs a vocabulary-size confidence
score vector to predict words. Similar to language modeling, we can
apply softmax to obtain the probabilities and then use cross-entropy
loss to calculate the loss. Note that we padded the target sentences to
make them have the same length, but we do not need to compute the loss
on the padding symbols.
To implement a loss function that filters out such entries, we will
use an operator called ``SequenceMask``. It can mask elements along the
first dimension (``axis=0``) or the second (``axis=1``). If the
latter is chosen, given a valid length vector ``len`` and a 2-dimensional
input ``X``, this operator sets ``X[i, len[i]:] = 0`` for all
:math:`i`\ ’s.

.. code:: python

    X = np.array([[1, 2, 3], [4, 5, 6]])
    npx.sequence_mask(X, np.array([1, 2]), True, axis=1)

.. parsed-literal::
    :class: output

    array([[1., 0., 0.],
           [4., 5., 0.]])

Applied to an :math:`n`-dimensional tensor :math:`X`, it sets
``X[i, len[i]:, :, ..., :] = 0``. In addition, we can specify a
filling value other than 0, such as :math:`-1`, as shown below.

.. code:: python

    X = np.ones((2, 3, 4))
    npx.sequence_mask(X, np.array([1, 2]), True, value=-1, axis=1)

.. parsed-literal::
    :class: output

    array([[[ 1.,  1.,  1.,  1.],
            [-1., -1., -1., -1.],
            [-1., -1., -1., -1.]],

           [[ 1.,  1.,  1.,  1.],
            [ 1.,  1.,  1.,  1.],
            [-1., -1., -1., -1.]]])

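The masking behavior along ``axis=1`` can be reproduced in plain NumPy
(a sketch for illustration only; the hypothetical ``sequence_mask_np``
helper below is not part of d2l or MXNet, and the real operator supports
more options):

.. code:: python

    import numpy as np

    def sequence_mask_np(X, valid_len, value=0):
        """Set X[i, valid_len[i]:] = value along the second axis."""
        X = X.copy()
        for i, l in enumerate(valid_len):
            X[i, int(l):] = value
        return X

    X = np.array([[1., 2., 3.], [4., 5., 6.]])
    print(sequence_mask_np(X, [1, 2]))
    # [[1. 0. 0.]
    #  [4. 5. 0.]]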
Now we can implement the masked version of the softmax cross-entropy
loss. Note that every Gluon loss function allows per-example weights,
which default to 1. We can therefore assign a weight of 0 to each
element we would like to exclude. Our customized loss function thus
accepts an additional ``valid_length`` argument so that padding
elements in each sequence are ignored.

.. code:: python

    # Saved in the d2l package for later use
    class MaskedSoftmaxCELoss(gluon.loss.SoftmaxCELoss):
        # pred shape: (batch_size, seq_len, vocab_size)
        # label shape: (batch_size, seq_len)
        # valid_length shape: (batch_size, )
        def forward(self, pred, label, valid_length):
            # weights shape: (batch_size, seq_len, 1)
            weights = np.expand_dims(np.ones_like(label), axis=-1)
            weights = npx.sequence_mask(weights, valid_length, True, axis=1)
            return super(MaskedSoftmaxCELoss, self).forward(pred, label, weights)

For a sanity check, we create three identical sequences and keep 4
elements for the first sequence, 2 elements for the second sequence, and
none for the last one. Then the loss for the first example should be
twice as large as that for the second, and the loss for the last
example should be 0.

.. code:: python

    loss = MaskedSoftmaxCELoss()
    loss(np.ones((3, 4, 10)), np.ones((3, 4)), np.array([4, 2, 0]))

.. parsed-literal::
    :class: output

    array([2.3025851, 1.1512926, 0.       ])

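These numbers can be verified by hand. With all-ones predictions, the
softmax is uniform over the 10 vocabulary entries, so the cross-entropy
of every token is :math:`\ln 10 \approx 2.303`. A NumPy sketch,
assuming (as the output above suggests) that the loss averages the
weighted per-token losses over the full sequence length of 4:

.. code:: python

    import numpy as np

    seq_len, vocab_size = 4, 10
    ce_uniform = np.log(vocab_size)  # per-token loss of a uniform prediction

    # Masked mean over the full sequence length for valid lengths 4, 2, 0
    expected = [ce_uniform * valid / seq_len for valid in (4, 2, 0)]
    print(expected)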
.. _sec_seq2seq_training:

Training
--------
During training, if the target sequence has length :math:`n`, we feed
the first :math:`n-1` tokens into the decoder as inputs, and use the
last :math:`n-1` tokens as ground-truth labels.
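Concretely, with a toy target sequence (tokens shown as strings for
clarity; real inputs are word indices), the decoder input and label are
shifted by one position:

.. code:: python

    # Toy target sequence of length n = 4
    Y = ['<bos>', 'je', 'suis', '<eos>']

    dec_input = Y[:-1]  # fed into the decoder
    dec_label = Y[1:]   # ground-truth labels, shifted by one position

    print(dec_input)  # ['<bos>', 'je', 'suis']
    print(dec_label)  # ['je', 'suis', '<eos>']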

.. code:: python

    # Saved in the d2l package for later use
    def train_s2s_ch9(model, data_iter, lr, num_epochs, ctx):
        model.initialize(init.Xavier(), force_reinit=True, ctx=ctx)
        trainer = gluon.Trainer(model.collect_params(),
                                'adam', {'learning_rate': lr})
        loss = MaskedSoftmaxCELoss()
        animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                                xlim=[1, num_epochs], ylim=[0, 0.25])
        for epoch in range(1, num_epochs + 1):
            timer = d2l.Timer()
            metric = d2l.Accumulator(2)  # loss_sum, num_tokens
            for batch in data_iter:
                X, X_vlen, Y, Y_vlen = [x.as_in_context(ctx) for x in batch]
                Y_input, Y_label, Y_vlen = Y[:, :-1], Y[:, 1:], Y_vlen - 1
                with autograd.record():
                    Y_hat, _ = model(X, Y_input, X_vlen, Y_vlen)
                    l = loss(Y_hat, Y_label, Y_vlen)
                l.backward()
                d2l.grad_clipping(model, 1)
                num_tokens = Y_vlen.sum()
                trainer.step(num_tokens)
                metric.add(l.sum(), num_tokens)
            if epoch % 10 == 0:
                animator.add(epoch, (metric[0] / metric[1],))
        print('loss %.3f, %d tokens/sec on %s ' % (
            metric[0] / metric[1], metric[1] / timer.stop(), ctx))

Next, we create a model instance, set its hyperparameters, and
train the model.

.. code:: python

    embed_size, num_hiddens, num_layers, dropout = 32, 32, 2, 0.0
    batch_size, num_steps = 64, 10
    lr, num_epochs, ctx = 0.005, 300, d2l.try_gpu()

    src_vocab, tgt_vocab, train_iter = d2l.load_data_nmt(batch_size, num_steps)
    encoder = Seq2SeqEncoder(
        len(src_vocab), embed_size, num_hiddens, num_layers, dropout)
    decoder = Seq2SeqDecoder(
        len(tgt_vocab), embed_size, num_hiddens, num_layers, dropout)
    model = d2l.EncoderDecoder(encoder, decoder)
    train_s2s_ch9(model, train_iter, lr, num_epochs, ctx)

.. parsed-literal::
    :class: output

    loss 0.027, 8757 tokens/sec on gpu(0)

.. figure:: output_seq2seq_9a20e0_23_1.svg
Predicting
----------
Here we implement the simplest method, greedy search, to generate an
output sequence. As illustrated in :numref:`fig_seq2seq_predict`,
during prediction we feed the same ``<bos>`` token into the decoder as
in training at timestep 0. But the input token for each later timestep
is the predicted token from the previous timestep.

.. _fig_seq2seq_predict:

.. figure:: ../img/seq2seq_predict.svg

   Sequence to sequence model predicting with greedy search


.. code:: python

    # Saved in the d2l package for later use
    def predict_s2s_ch9(model, src_sentence, src_vocab, tgt_vocab, num_steps,
                        ctx):
        src_tokens = src_vocab[src_sentence.lower().split(' ')]
        enc_valid_length = np.array([len(src_tokens)], ctx=ctx)
        src_tokens = d2l.trim_pad(src_tokens, num_steps, src_vocab['<pad>'])
        enc_X = np.array(src_tokens, ctx=ctx)
        # Add the batch_size dimension
        enc_outputs = model.encoder(np.expand_dims(enc_X, axis=0),
                                    enc_valid_length)
        dec_state = model.decoder.init_state(enc_outputs, enc_valid_length)
        dec_X = np.expand_dims(np.array([tgt_vocab['<bos>']], ctx=ctx), axis=0)
        predict_tokens = []
        for _ in range(num_steps):
            Y, dec_state = model.decoder(dec_X, dec_state)
            # The token with the highest score is used as the next timestep input
            dec_X = Y.argmax(axis=2)
            py = dec_X.squeeze(axis=0).astype('int32').item()
            if py == tgt_vocab['<eos>']:
                break
            predict_tokens.append(py)
        return ' '.join(tgt_vocab.to_tokens(predict_tokens))

Try several examples:

.. code:: python

    for sentence in ['Go .', 'Wow !', "I'm OK .", 'I won !']:
        print(sentence + ' => ' + predict_s2s_ch9(
            model, sentence, src_vocab, tgt_vocab, num_steps, ctx))

.. parsed-literal::
    :class: output

    Go . => va !
    Wow ! => <unk> !
    I'm OK . => je vais bien .
    I won ! => j'ai gagné !

Summary
-------
- The sequence to sequence (seq2seq) model is based on the
encoder-decoder architecture to generate a sequence output from a
sequence input.
- We use multiple LSTM layers for both the encoder and the decoder.
Exercises
---------
1. Can you think of other use cases of seq2seq besides neural machine
translation?
2. What if the input sequence in the example of this section is longer?
3. If we do not use the ``SequenceMask`` in the loss function, what may
happen?
`Discussions `__
-------------------------------------------------
|image0|
.. |image0| image:: ../img/qr_seq2seq.svg