.. _sec_transformer:
Transformer
===========
In previous chapters, we covered major neural network architectures
such as convolutional neural networks (CNNs) and recurrent neural networks
(RNNs). Let’s recap their pros and cons:
- **CNNs** are easy to parallelize within a layer but cannot capture
  variable-length sequential dependencies very well.
- **RNNs** are able to capture long-range, variable-length
  sequential information, but cannot be parallelized within a
  sequence.
To combine the advantages of both CNNs and RNNs,
:cite:`Vaswani.Shazeer.Parmar.ea.2017` designed a novel architecture
using the attention mechanism. This architecture, which is called the
*Transformer*, achieves parallelization by replacing recurrence with
attention while encoding each item’s position in the
sequence. As a result, the Transformer attains comparable quality with
significantly shorter training time.
Similar to the seq2seq model in :numref:`sec_seq2seq`, the Transformer is
also based on the encoder-decoder architecture. However, the Transformer
differs from the former by replacing the recurrent layers in seq2seq with
*multi-head attention* layers, incorporating position-wise
information through *positional encoding*, and applying *layer
normalization*. We compare the Transformer and seq2seq side by side in
:numref:`fig_transformer`.
Overall, these two models are similar to each other: the source sequence
embeddings are fed into :math:`n` repeated blocks. The outputs of the
last block are then used as attention memory for the decoder. The target
sequence embeddings are similarly fed into :math:`n` repeated blocks in
the decoder, and the final outputs are obtained by applying a dense
layer with vocabulary size to the last block’s outputs.
.. _fig_transformer:
.. figure:: ../img/transformer.svg
:width: 500px
The Transformer architecture.
On the flip side, the Transformer differs from the seq2seq with attention
model in the following ways:

1. **Transformer block**: a recurrent layer in seq2seq is replaced by a
   *Transformer block*. For the encoder, this block contains a
   *multi-head attention* layer and a *position-wise feed-forward
   network* with two dense layers. For the decoder, another multi-head
   attention layer is used to take in the encoder state.
2. **Add and norm**: the inputs and outputs of both the multi-head
   attention layer and the position-wise feed-forward network are
   processed by two “add and norm” layers, each containing a residual
   structure and a *layer normalization* layer.
3. **Positional encoding**: since the self-attention layer does not
   distinguish the item order in a sequence, a positional encoding
   layer is used to add sequential information to each sequence item.
In the rest of this section, we will walk you through each new component
introduced by the Transformer and get you up and running to construct a
machine translation model.
.. code:: python
import d2l
import math
from mxnet import autograd, np, npx
from mxnet.gluon import nn
npx.set_np()
Multi-Head Attention
--------------------
Before discussing the *multi-head attention* layer, let’s quickly
review the *self-attention* architecture. The self-attention model is a
normal attention model, with its query, its key, and its value
copied exactly from each item of the sequential inputs. As we
illustrate in :numref:`fig_self_attention`, self-attention outputs a
sequential output of the same length for each input item. Compared with a
recurrent layer, the output items of a self-attention layer can be computed
in parallel, so it is easy to obtain a highly efficient
implementation.
.. _fig_self_attention:
.. figure:: ../img/self-attention.svg
Self-attention architecture.
The *multi-head attention* layer consists of :math:`h` parallel
self-attention layers, each of which is called a *head*. For each head,
before feeding into the attention layer, we project the queries, keys,
and values with three dense layers of hidden sizes :math:`p_q`,
:math:`p_k`, and :math:`p_v`, respectively. The outputs of these
:math:`h` attention heads are concatenated and then processed by a final
dense layer.
.. _fig_multi_head_attention:

.. figure:: ../img/multi-head-attention.svg

   Multi-head attention.
Assume that the dimensions of a query, a key, and a value are
:math:`d_q`, :math:`d_k`, and :math:`d_v`, respectively. Then, for each
head :math:`i=1,\ldots, h`, we learn the parameters
:math:`\mathbf W_q^{(i)}\in\mathbb R^{p_q\times d_q}`,
:math:`\mathbf W_k^{(i)}\in\mathbb R^{p_k\times d_k}`, and
:math:`\mathbf W_v^{(i)}\in\mathbb R^{p_v\times d_v}`. Therefore, the
output for each head is
.. math:: \mathbf o^{(i)} = \textrm{attention}(\mathbf W_q^{(i)}\mathbf q, \mathbf W_k^{(i)}\mathbf k,\mathbf W_v^{(i)}\mathbf v),
where :math:`\textrm{attention}` can be any attention layer, such as the
``DotProductAttention`` and ``MLPAttention`` introduced earlier.
After that, the outputs of length :math:`p_v` from each of the
:math:`h` attention heads are concatenated into an output of length
:math:`h p_v`, which is then passed into the final dense layer with
:math:`d_o` hidden units. The weights of this dense layer can be denoted
by :math:`\mathbf W_o\in\mathbb R^{d_o\times h p_v}`. As a result, the
multi-head attention output will be
.. math:: \mathbf o = \mathbf W_o \begin{bmatrix}\mathbf o^{(1)}\\\vdots\\\mathbf o^{(h)}\end{bmatrix}.
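To make the computation inside each head concrete, here is a minimal NumPy sketch of scaled dot-product attention (a hypothetical standalone function, not the ``d2l`` implementation, and without masking):

```python
import numpy as np

def dot_product_attention(q, k, v):
    """Scaled dot-product attention for one head (no masking)."""
    # q: (n_q, d), k: (n_k, d), v: (n_k, d_v)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (n_q, n_k)
    # Softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v  # (n_q, d_v)

q, k = np.ones((2, 4)), np.ones((6, 4))
v = np.arange(18.0).reshape(6, 3)
out = dot_product_attention(q, k, v)
print(out.shape)  # (2, 3)
```

Because all queries and keys are identical here, the attention weights are uniform and each output row is simply the mean of the values.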
Now we can implement multi-head attention. Assume that the
multi-head attention has ``num_heads`` :math:`=h` heads, and that
the hidden size ``num_hiddens`` :math:`=p_q=p_k=p_v` is the same for
the query, key, and value dense layers. In addition, since the
multi-head attention keeps the same dimensionality between its input and
its output, the output feature size is :math:`d_o =`
``num_hiddens`` as well.
.. code:: python
# Saved in the d2l package for later use
class MultiHeadAttention(nn.Block):
def __init__(self, num_hiddens, num_heads, dropout, **kwargs):
super(MultiHeadAttention, self).__init__(**kwargs)
self.num_heads = num_heads
self.attention = d2l.DotProductAttention(dropout)
self.W_q = nn.Dense(num_hiddens, use_bias=False, flatten=False)
self.W_k = nn.Dense(num_hiddens, use_bias=False, flatten=False)
self.W_v = nn.Dense(num_hiddens, use_bias=False, flatten=False)
self.W_o = nn.Dense(num_hiddens, use_bias=False, flatten=False)
def forward(self, query, key, value, valid_length):
# For self-attention, query, key, and value shape:
# (batch_size, seq_len, dim), where seq_len is the length of input
# sequence. valid_length shape is either (batch_size, ) or
# (batch_size, seq_len).
# Project and transpose query, key, and value from
# (batch_size, seq_len, num_hiddens) to
# (batch_size * num_heads, seq_len, num_hiddens / num_heads).
query = transpose_qkv(self.W_q(query), self.num_heads)
key = transpose_qkv(self.W_k(key), self.num_heads)
value = transpose_qkv(self.W_v(value), self.num_heads)
if valid_length is not None:
# Copy valid_length by num_heads times
if valid_length.ndim == 1:
valid_length = np.tile(valid_length, self.num_heads)
else:
valid_length = np.tile(valid_length, (self.num_heads, 1))
# For self-attention, output shape:
# (batch_size * num_heads, seq_len, num_hiddens / num_heads)
output = self.attention(query, key, value, valid_length)
# output_concat shape: (batch_size, seq_len, num_hiddens)
output_concat = transpose_output(output, self.num_heads)
return self.W_o(output_concat)
Here are the definitions of the transpose functions ``transpose_qkv``
and ``transpose_output``, which are the inverse of each other.
.. code:: python
# Saved in the d2l package for later use
def transpose_qkv(X, num_heads):
# Input X shape: (batch_size, seq_len, num_hiddens).
# Output X shape:
# (batch_size, seq_len, num_heads, num_hiddens / num_heads)
X = X.reshape(X.shape[0], X.shape[1], num_heads, -1)
# X shape: (batch_size, num_heads, seq_len, num_hiddens / num_heads)
X = X.transpose(0, 2, 1, 3)
# output shape: (batch_size * num_heads, seq_len, num_hiddens / num_heads)
output = X.reshape(-1, X.shape[2], X.shape[3])
return output
# Saved in the d2l package for later use
def transpose_output(X, num_heads):
# A reversed version of transpose_qkv
X = X.reshape(-1, num_heads, X.shape[1], X.shape[2])
X = X.transpose(0, 2, 1, 3)
return X.reshape(X.shape[0], X.shape[1], -1)
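The round trip through these two functions can be checked with plain NumPy (hypothetical toy shapes):

```python
import numpy as np

batch_size, seq_len, num_hiddens, num_heads = 2, 4, 12, 3
X = np.random.rand(batch_size, seq_len, num_hiddens)

# Same reshapes as transpose_qkv
split = X.reshape(batch_size, seq_len, num_heads, -1)
split = split.transpose(0, 2, 1, 3)
split = split.reshape(-1, seq_len, num_hiddens // num_heads)
print(split.shape)  # (6, 4, 4)

# Same reshapes as transpose_output: recovers the original X
merged = split.reshape(-1, num_heads, seq_len, split.shape[2])
merged = merged.transpose(0, 2, 1, 3)
merged = merged.reshape(batch_size, seq_len, -1)
print(np.allclose(merged, X))  # True
```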
Let’s test the ``MultiHeadAttention`` model with a toy example. We create
a multi-head attention layer with the hidden size :math:`d_o = 90`; the
output will share the same batch size and sequence length as the input,
but the last dimension will be equal to ``num_hiddens``
:math:`= 90`.
.. code:: python
cell = MultiHeadAttention(90, 9, 0.5)
cell.initialize()
X = np.ones((2, 4, 5))
valid_length = np.array([2, 3])
cell(X, X, X, valid_length).shape
.. parsed-literal::
:class: output
(2, 4, 90)
Position-wise Feed-Forward Networks
-----------------------------------
Another key component in the Transformer block is called the *position-wise
feed-forward network (FFN)*. It accepts a :math:`3`-dimensional input
with shape (batch size, sequence length, feature size). The
position-wise FFN consists of two dense layers that apply to the last
dimension. Since the same two dense layers are used for each position
item in the sequence, we refer to it as *position-wise*. Indeed, it
is equivalent to applying two :math:`1 \times 1` convolution layers.
Below, ``PositionWiseFFN`` shows how to implement a position-wise
FFN with two dense layers of hidden sizes ``pw_num_hiddens`` and
``pw_num_outputs``, respectively.
.. code:: python
# Saved in the d2l package for later use
class PositionWiseFFN(nn.Block):
def __init__(self, pw_num_hiddens, pw_num_outputs, **kwargs):
super(PositionWiseFFN, self).__init__(**kwargs)
self.dense1 = nn.Dense(pw_num_hiddens, flatten=False,
activation='relu')
self.dense2 = nn.Dense(pw_num_outputs, flatten=False)
def forward(self, X):
return self.dense2(self.dense1(X))
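The “:math:`1 \times 1` convolution” equivalence can be verified with a NumPy sketch: applying the two dense layers to the whole tensor is the same as applying them to each position separately (hypothetical random weights, ReLU in between as in ``PositionWiseFFN``):

```python
import numpy as np

batch_size, seq_len, d_in, d_hid, d_out = 2, 3, 4, 8, 4
X = np.random.rand(batch_size, seq_len, d_in)
W1 = np.random.rand(d_in, d_hid)
W2 = np.random.rand(d_hid, d_out)

# Two dense layers (ReLU in between) acting on the last axis only
Y = np.maximum(X @ W1, 0) @ W2

# Looping over positions gives the identical result
for t in range(seq_len):
    assert np.allclose(Y[:, t], np.maximum(X[:, t] @ W1, 0) @ W2)
print(Y.shape)  # (2, 3, 4)
```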
Similar to multi-head attention, the position-wise feed-forward
network only changes the last dimension size of the input, i.e., the
feature dimension. In addition, if two items in the input sequence are
identical, the corresponding outputs will be identical as well.
.. code:: python
ffn = PositionWiseFFN(4, 8)
ffn.initialize()
ffn(np.ones((2, 3, 4)))[0]
.. parsed-literal::
:class: output
array([[ 9.15348239e-04, -7.27669394e-04, 1.14063594e-04,
-8.76279722e-04, -1.02867256e-03, 8.02748313e-04,
-4.53725770e-05, 2.15598906e-04],
[ 9.15348239e-04, -7.27669394e-04, 1.14063594e-04,
-8.76279722e-04, -1.02867256e-03, 8.02748313e-04,
-4.53725770e-05, 2.15598906e-04],
[ 9.15348239e-04, -7.27669394e-04, 1.14063594e-04,
-8.76279722e-04, -1.02867256e-03, 8.02748313e-04,
-4.53725770e-05, 2.15598906e-04]])
Add and Norm
------------
Besides the above two components in the Transformer block, the “add and
norm” layer within the block also plays a key role in connecting the inputs and
outputs of other layers smoothly. To explain, we add a layer that
contains a residual structure and *layer normalization* after both the
multi-head attention layer and the position-wise FFN. *Layer
normalization* is similar to batch normalization in
:numref:`sec_batch_norm`. One difference is that the mean and
variance for layer normalization are calculated along the last
dimension, e.g., ``X.mean(axis=-1)``, instead of the first (batch) dimension,
e.g., ``X.mean(axis=0)``. Layer normalization prevents the range of
values in the layers from changing too much, which leads to faster
training and better generalization.
MXNet has both ``LayerNorm`` and ``BatchNorm`` implemented within the
``nn`` block. Let’s call both of them and see the difference in the
example below.
.. code:: python
layer = nn.LayerNorm()
layer.initialize()
batch = nn.BatchNorm()
batch.initialize()
X = np.array([[1, 2], [2, 3]])
# Compute mean and variance from X in the training mode
with autograd.record():
print('layer norm:', layer(X), '\nbatch norm:', batch(X))
.. parsed-literal::
:class: output
layer norm: [[-0.99998 0.99998]
[-0.99998 0.99998]]
batch norm: [[-0.99998 -0.99998]
[ 0.99998 0.99998]]
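The same numbers can be reproduced by hand in NumPy, which makes the axis difference explicit (``eps`` here is a hypothetical small stability constant):

```python
import numpy as np

X = np.array([[1., 2.], [2., 3.]])
eps = 1e-5  # small constant for numerical stability

# Layer norm: normalize each row, i.e., along the last (feature) axis
ln = (X - X.mean(axis=-1, keepdims=True)) / np.sqrt(
    X.var(axis=-1, keepdims=True) + eps)

# Batch norm (training mode): normalize each column, i.e., the batch axis
bn = (X - X.mean(axis=0, keepdims=True)) / np.sqrt(
    X.var(axis=0, keepdims=True) + eps)

print(ln)  # rows normalized: approximately [[-1, 1], [-1, 1]]
print(bn)  # columns normalized: approximately [[-1, -1], [1, 1]]
```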
Now let’s implement the connection block ``AddNorm``.
``AddNorm`` accepts two inputs :math:`X` and :math:`Y`. We can deem
:math:`X` as the original input in the residual network, and :math:`Y`
as the outputs from either the multi-head attention layer or the
position-wise FFN. In addition, we apply dropout on :math:`Y`
for regularization.
.. code:: python
# Saved in the d2l package for later use
class AddNorm(nn.Block):
def __init__(self, dropout, **kwargs):
super(AddNorm, self).__init__(**kwargs)
self.dropout = nn.Dropout(dropout)
self.ln = nn.LayerNorm()
def forward(self, X, Y):
return self.ln(self.dropout(Y) + X)
Due to the residual connection, :math:`X` and :math:`Y` should have the
same shape.
.. code:: python
add_norm = AddNorm(0.5)
add_norm.initialize()
add_norm(np.ones((2, 3, 4)), np.ones((2, 3, 4))).shape
.. parsed-literal::
:class: output
(2, 3, 4)
Positional Encoding
-------------------
Unlike the recurrent layer, both the multi-head attention layer and the
position-wise feed-forward network compute the output of each item in
the sequence independently. This feature enables us to parallelize the
computation, but it fails to model the sequential information for a
given sequence. To better capture the sequential information, the
Transformer model uses the *positional encoding* to maintain the
positional information of the input sequence.
To explain, assume that :math:`X\in\mathbb R^{l\times d}` is the
embedding of an example, where :math:`l` is the sequence length and
:math:`d` is the embedding size. The positional encoding layer encodes
:math:`X`’s position into :math:`P\in\mathbb R^{l\times d}` and outputs :math:`P+X`.
The position :math:`P` is a 2-D matrix, where :math:`i` refers to the
order in the sentence and :math:`j` refers to the position along the
embedding vector dimension. In this way, each value in the original
sequence is maintained using the equations below:
.. math:: P_{i, 2j} = \sin(i/10000^{2j/d}),
.. math:: \quad P_{i, 2j+1} = \cos(i/10000^{2j/d}),
for :math:`i=0,\ldots, l-1` and
:math:`j=0,\ldots,\lfloor(d-1)/2\rfloor`.
:numref:`fig_positional_encoding` illustrates the positional encoding.
.. _fig_positional_encoding:
.. figure:: ../img/positional_encoding.svg
Positional encoding.
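The equations above can be checked directly with a small NumPy sketch before turning to the Gluon implementation (toy values of :math:`l` and :math:`d`):

```python
import numpy as np

l, d = 6, 8  # toy sequence length and embedding size
i = np.arange(l).reshape(-1, 1)  # positions i = 0, ..., l-1
two_j = np.arange(0, d, 2)       # even column indices 2j
angle = i / np.power(10000, two_j / d)

P = np.zeros((l, d))
P[:, 0::2] = np.sin(angle)  # even columns: sin
P[:, 1::2] = np.cos(angle)  # odd columns: cos

# Column 0 is sin(i) and column 1 is cos(i), as the formulas predict
print(np.allclose(P[:, 0], np.sin(np.arange(l))))  # True
print(np.allclose(P[:, 1], np.cos(np.arange(l))))  # True
```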
.. code:: python
# Saved in the d2l package for later use
class PositionalEncoding(nn.Block):
def __init__(self, embed_size, dropout, max_len=1000):
super(PositionalEncoding, self).__init__()
self.dropout = nn.Dropout(dropout)
# Create a long enough P
self.P = np.zeros((1, max_len, embed_size))
X = np.arange(0, max_len).reshape(-1, 1) / np.power(
10000, np.arange(0, embed_size, 2) / embed_size)
self.P[:, :, 0::2] = np.sin(X)
self.P[:, :, 1::2] = np.cos(X)
def forward(self, X):
X = X + self.P[:, :X.shape[1], :].as_in_context(X.context)
return self.dropout(X)
Now we test the ``PositionalEncoding`` class with a toy example, plotting
:math:`4` of its :math:`20` encoding dimensions. As we can see, the
:math:`4^{\mathrm{th}}` dimension has the
same frequency as the :math:`5^{\mathrm{th}}` but with a different offset.
The :math:`6^{\mathrm{th}}` and :math:`7^{\mathrm{th}}` dimensions have
a lower frequency.
.. code:: python
pe = PositionalEncoding(20, 0)
pe.initialize()
Y = pe(np.zeros((1, 100, 20)))
d2l.plot(np.arange(100), Y[0, :, 4:8].T, figsize=(6, 2.5),
legend=["dim %d" % p for p in [4, 5, 6, 7]])
.. figure:: output_transformer_fc4c21_21_0.svg
Encoder
-------
Armed with all the essential components of the Transformer, let’s first
build a Transformer encoder block. This encoder block contains a multi-head
attention layer, a position-wise feed-forward network, and two “add and
norm” connection blocks. As shown in the code, the output dimensions of
both the attention model and the position-wise FFN in the ``EncoderBlock``
are equal to ``embed_size``. This is due to the
nature of the residual block, as we need to add these outputs back to
the original value during “add and norm”.
.. code:: python
# Saved in the d2l package for later use
class EncoderBlock(nn.Block):
def __init__(self, embed_size, pw_num_hiddens, num_heads, dropout,
**kwargs):
super(EncoderBlock, self).__init__(**kwargs)
self.attention = MultiHeadAttention(embed_size, num_heads, dropout)
self.addnorm1 = AddNorm(dropout)
self.ffn = PositionWiseFFN(pw_num_hiddens, embed_size)
self.addnorm2 = AddNorm(dropout)
def forward(self, X, valid_length):
Y = self.addnorm1(X, self.attention(X, X, X, valid_length))
return self.addnorm2(Y, self.ffn(Y))
Due to the residual connections, this block will not change the input
shape. This means that the ``embed_size`` argument should be equal to the
size of the input’s last dimension. In our toy example below,
``embed_size`` :math:`= 24`, ``pw_num_hiddens`` :math:`= 48`,
``num_heads`` :math:`= 8`, and ``dropout`` :math:`= 0.5`.
.. code:: python
X = np.ones((2, 100, 24))
encoder_blk = EncoderBlock(24, 48, 8, 0.5)
encoder_blk.initialize()
encoder_blk(X, valid_length).shape
.. parsed-literal::
:class: output
(2, 100, 24)
Now it comes to the implementation of the entire Transformer encoder.
In the Transformer encoder, :math:`n` blocks of ``EncoderBlock`` are stacked
up one after another. Because of the residual connection, the embedding
layer size :math:`d` is the same as the Transformer block output size. Also
note that we multiply the embedding output by :math:`\sqrt{d}` to
prevent its values from being too small.
.. code:: python
# Saved in the d2l package for later use
class TransformerEncoder(d2l.Encoder):
def __init__(self, vocab_size, embed_size, pw_num_hiddens,
num_heads, num_layers, dropout, **kwargs):
super(TransformerEncoder, self).__init__(**kwargs)
self.embed_size = embed_size
self.embed = nn.Embedding(vocab_size, embed_size)
self.pos_encoding = PositionalEncoding(embed_size, dropout)
self.blks = nn.Sequential()
for i in range(num_layers):
self.blks.add(
EncoderBlock(embed_size, pw_num_hiddens, num_heads, dropout))
def forward(self, X, valid_length, *args):
X = self.pos_encoding(self.embed(X) * math.sqrt(self.embed_size))
for blk in self.blks:
X = blk(X, valid_length)
return X
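The effect of the :math:`\sqrt{d}` rescaling can be sketched in NumPy: since the positional encodings lie in :math:`[-1, 1]`, scaling the typically small embedding values keeps the sum :math:`P+X` from being dominated by :math:`P` (the embedding magnitudes below are hypothetical):

```python
import numpy as np

d = 24
emb = np.random.rand(2, 5, d) * 0.05  # hypothetical small embedding values
scaled = emb * np.sqrt(d)

# The scaling multiplies every value, so the spread grows by sqrt(d)
print(scaled.std() / emb.std())  # ~4.9, i.e., sqrt(24)
```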
Let’s create an encoder with two stacked Transformer encoder blocks,
whose hyperparameters are the same as before. In addition to the previous
toy example’s parameters, we set ``vocab_size`` to
:math:`200` and ``num_layers`` to :math:`2` here.
.. code:: python
encoder = TransformerEncoder(200, 24, 48, 8, 2, 0.5)
encoder.initialize()
encoder(np.ones((2, 100)), valid_length).shape
.. parsed-literal::
:class: output
(2, 100, 24)
Decoder
-------
The Transformer decoder block looks similar to the Transformer encoder
block. However, besides the two sub-layers (the multi-head attention
layer and the position-wise feed-forward network), the Transformer decoder block
contains a third sub-layer, which applies multi-head attention on the
output of the encoder stack. Similar to the Transformer encoder block,
the Transformer decoder block employs “add and norm”, i.e., the residual
connections and layer normalization, to connect each of the
sub-layers.
To be specific, at timestep :math:`t`, assume that :math:`\mathbf x_t`
is the current input, i.e., the query. As illustrated in
:numref:`fig_self_attention_predict`, the keys and values of the
self-attention layer consist of the current query with all the past
queries :math:`\mathbf x_1, \ldots, \mathbf x_{t-1}`.
.. _fig_self_attention_predict:
.. figure:: ../img/self-attention-predict.svg
Predict at timestep :math:`t` for a self-attention layer.
During training, the output for the :math:`t`-th query could observe all
the previous key-value pairs, which results in a behavior different from
prediction. Thus, during training we can eliminate the unnecessary
information by specifying the valid length to be :math:`t` for the
:math:`t^\textrm{th}` query.
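These training-time valid lengths can be pictured with a small NumPy sketch (hypothetical batch size and sequence length): query :math:`t` is allowed to attend only to positions :math:`1, \ldots, t`.

```python
import numpy as np

batch_size, seq_len = 2, 4
# Valid length for the t-th query is t (1-based), same for every example
valid_length = np.tile(np.arange(1, seq_len + 1), (batch_size, 1))
print(valid_length)
# [[1 2 3 4]
#  [1 2 3 4]]

# The equivalent boolean (query, key) mask is lower-triangular
mask = np.arange(seq_len) < valid_length[0][:, None]
print(mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```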
.. code:: python
class DecoderBlock(nn.Block):
# i means it is the i-th block in the decoder
def __init__(self, embed_size, pw_num_hiddens, num_heads,
dropout, i, **kwargs):
super(DecoderBlock, self).__init__(**kwargs)
self.i = i
self.attention1 = MultiHeadAttention(embed_size, num_heads, dropout)
self.addnorm1 = AddNorm(dropout)
self.attention2 = MultiHeadAttention(embed_size, num_heads, dropout)
self.addnorm2 = AddNorm(dropout)
self.ffn = PositionWiseFFN(pw_num_hiddens, embed_size)
self.addnorm3 = AddNorm(dropout)
def forward(self, X, state):
enc_outputs, enc_valid_length = state[0], state[1]
# state[2][i] contains the past queries for this block
if state[2][self.i] is None:
key_values = X
else:
key_values = np.concatenate((state[2][self.i], X), axis=1)
state[2][self.i] = key_values
if autograd.is_training():
batch_size, seq_len, _ = X.shape
# Shape: (batch_size, seq_len), the values in the j-th column
# are j+1
valid_length = np.tile(np.arange(1, seq_len+1, ctx=X.context),
(batch_size, 1))
else:
valid_length = None
X2 = self.attention1(X, key_values, key_values, valid_length)
Y = self.addnorm1(X, X2)
Y2 = self.attention2(Y, enc_outputs, enc_outputs, enc_valid_length)
Z = self.addnorm2(Y, Y2)
return self.addnorm3(Z, self.ffn(Z)), state
Similar to the Transformer encoder block, ``embed_size`` should be equal
to the last dimension size of :math:`X`.
.. code:: python
decoder_blk = DecoderBlock(24, 48, 8, 0.5, 0)
decoder_blk.initialize()
X = np.ones((2, 100, 24))
state = [encoder_blk(X, valid_length), valid_length, [None]]
decoder_blk(X, state)[0].shape
.. parsed-literal::
:class: output
(2, 100, 24)
The construction of the entire Transformer decoder is identical to the
Transformer encoder, except for the additional dense layer to obtain the
output confidence scores.
Let’s implement the Transformer decoder, ``TransformerDecoder``. Besides
regular hyperparameters such as ``vocab_size`` and
``embed_size``, the Transformer decoder also needs the Transformer
encoder’s outputs ``enc_outputs`` and ``enc_valid_length``.
.. code:: python
class TransformerDecoder(d2l.Decoder):
def __init__(self, vocab_size, embed_size, pw_num_hiddens,
num_heads, num_layers, dropout, **kwargs):
super(TransformerDecoder, self).__init__(**kwargs)
self.embed_size = embed_size
self.num_layers = num_layers
self.embed = nn.Embedding(vocab_size, embed_size)
self.pos_encoding = PositionalEncoding(embed_size, dropout)
self.blks = nn.Sequential()
for i in range(num_layers):
self.blks.add(
DecoderBlock(embed_size, pw_num_hiddens, num_heads,
dropout, i))
self.dense = nn.Dense(vocab_size, flatten=False)
    def init_state(self, enc_outputs, enc_valid_length, *args):
        return [enc_outputs, enc_valid_length, [None]*self.num_layers]
def forward(self, X, state):
X = self.pos_encoding(self.embed(X) * math.sqrt(self.embed_size))
for blk in self.blks:
X, state = blk(X, state)
return self.dense(X), state
Training
--------
Finally, we can build an encoder-decoder model with the Transformer
architecture. Similar to the seq2seq with attention model in
:numref:`sec_seq2seq_attention`, we use the following hyperparameters:
two Transformer blocks, with both the embedding size and the block output
size set to :math:`32`. In addition, we use :math:`4` heads and set the
hidden size to be twice the output size.
.. code:: python
embed_size, num_layers, dropout = 32, 2, 0.0
batch_size, num_steps = 64, 10
lr, num_epochs, ctx = 0.005, 100, d2l.try_gpu()
num_hiddens, num_heads = 64, 4
src_vocab, tgt_vocab, train_iter = d2l.load_data_nmt(batch_size, num_steps)
encoder = TransformerEncoder(
len(src_vocab), embed_size, num_hiddens, num_heads, num_layers,
dropout)
decoder = TransformerDecoder(
    len(tgt_vocab), embed_size, num_hiddens, num_heads, num_layers,
    dropout)
model = d2l.EncoderDecoder(encoder, decoder)
d2l.train_s2s_ch9(model, train_iter, lr, num_epochs, ctx)
.. parsed-literal::
:class: output
loss 0.033, 3249 tokens/sec on gpu(0)
.. figure:: output_transformer_fc4c21_37_1.svg
As we can see from the training time and accuracy, compared with the
seq2seq model with attention, the Transformer runs faster per epoch
and converges faster at the beginning.
We can use the trained Transformer to translate some simple sentences.
.. code:: python
for sentence in ['Go .', 'Wow !', "I'm OK .", 'I won !']:
print(sentence + ' => ' + d2l.predict_s2s_ch9(
model, sentence, src_vocab, tgt_vocab, num_steps, ctx))
.. parsed-literal::
:class: output
Go . => !
Wow ! => !
I'm OK . => ça va .
I won ! => je l'ai emporté emporté prie !
Summary
-------
- The Transformer model is based on the encoder-decoder architecture.
- The multi-head attention layer contains :math:`h` parallel attention
  layers.
- The position-wise feed-forward network consists of two dense layers that
  apply to the last dimension.
- Layer normalization differs from batch normalization by normalizing
  along the last dimension (the feature dimension) instead of the first
  (batch) dimension.
- Positional encoding is the only place that adds positional
  information to the Transformer model.
Exercises
---------
1. Try a larger number of epochs and compare the loss of the seq2seq
   model and the Transformer model.
2. Can you think of any other benefits of positional encoding?
3. Compare layer normalization and batch normalization. When should we
   apply which?
`Discussions `__
-------------------------------------------------
|image0|
.. |image0| image:: ../img/qr_transformer.svg