.. _sec_attention:
Attention Mechanism
===================
In :numref:`sec_seq2seq`, we encode the source sequence input
information in the recurrent unit state and then pass it to the decoder
to generate the target sequence. A token in the target sequence may
closely relate to one or more tokens in the source sequence, instead of
the whole source sequence. For example, when translating “Hello world.”
to “Bonjour le monde.”, “Bonjour” maps to “Hello” and “monde” maps to
“world”. In the seq2seq model, the decoder may implicitly select the
corresponding information from the state passed by the encoder. The
attention mechanism, however, makes this selection explicit.
*Attention* is a generalized pooling method with bias alignment over
inputs. The core component in the attention mechanism is the attention
layer, or called *attention* for simplicity. An input of the attention
layer is called a *query*. For a query, attention returns an output
based on the memory—a set of key-value pairs encoded in the attention
layer. To be more specific, assume that the memory contains :math:`n`
key-value pairs,
:math:`(\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_n, \mathbf{v}_n)`,
with :math:`\mathbf{k}_i \in \mathbb R^{d_k}`,
:math:`\mathbf{v}_i \in \mathbb R^{d_v}`. Given a query
:math:`\mathbf{q} \in \mathbb R^{d_q}`, the attention layer returns an
output :math:`\mathbf{o} \in \mathbb R^{d_v}` with the same shape as the
value.
.. _fig_attention:
.. figure:: ../img/attention.svg
The attention layer returns an output based on the input query and
its memory.
The full process of attention mechanism is expressed in
:numref:`fig_attention_output`. To compute the output of attention, we
first use a score function :math:`\alpha` that measures the similarity
between the query and key. Then for each key
:math:`(\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_n, \mathbf{v}_n)`,
we compute the scores :math:`a_1, \ldots, a_n` by
.. math:: a_i = \alpha(\mathbf q, \mathbf k_i).
Next we use softmax to obtain the attention weights, i.e.,
.. math::
\mathbf{b} = \mathrm{softmax}(\mathbf{a})\quad \text{, where }\quad
{b}_i = \frac{\exp(a_i)}{\sum_j \exp(a_j)}, \mathbf{b} = [b_1, \ldots, b_n]^T .
Finally, the output is a weighted sum of the values:
.. math:: \mathbf o = \sum_{i=1}^n b_i \mathbf v_i.
.. _fig_attention_output:
.. figure:: ../img/attention_output.svg
The attention output is a weighted sum of the values.
Different choices of the score function lead to different attention
layers. Below, we introduce two commonly used attention layers. Before
diving into the implementation, we first express two operators to get
you up and running: a masked version of the softmax operator
``masked_softmax`` and a specialized dot operator ``batched_dot``.
.. code:: python
import math
from mxnet import np, npx
from mxnet.gluon import nn
npx.set_np()
The masked softmax takes a 3-dimensional input and enables us to filter
out some elements by specifying a valid length for the last dimension.
(Refer to :numref:`sec_machine_translation` for the definition of a
valid length.) As a result, any value outside the valid length will be
masked as :math:`0`. Let’s implement the ``masked_softmax`` function.
.. code:: python
# Saved in the d2l package for later use
def masked_softmax(X, valid_length):
# X: 3-D tensor, valid_length: 1-D or 2-D tensor
if valid_length is None:
return npx.softmax(X)
else:
shape = X.shape
if valid_length.ndim == 1:
valid_length = valid_length.repeat(shape[1], axis=0)
else:
valid_length = valid_length.reshape(-1)
# Fill masked elements with a large negative, whose exp is 0
X = npx.sequence_mask(X.reshape(-1, shape[-1]), valid_length, True,
axis=1, value=-1e6)
return npx.softmax(X).reshape(shape)
To illustrate how this function works, we construct two
:math:`2 \times 4` matrices as the input. In addition, we specify that
the valid length equals to 2 for the first example, and 3 for the second
example. Then, as we can see from the following outputs, the values
outside valid lengths are masked as zero.
.. code:: python
masked_softmax(np.random.uniform(size=(2, 2, 4)), np.array([2, 3]))
.. parsed-literal::
:class: output
array([[[0.488994 , 0.511006 , 0. , 0. ],
[0.43654838, 0.56345165, 0. , 0. ]],
[[0.28817102, 0.3519408 , 0.3598882 , 0. ],
[0.29034293, 0.25239873, 0.45725834, 0. ]]])
Moreover, the second operator ``batched_dot`` takes two inputs :math:`X`
and :math:`Y` with shapes :math:`(b, n, m)` and :math:`(b, m, k)`,
respectively, and returns an output with shape :math:`(b, n, k)`. To be
specific, it computes :math:`b` dot products for
:math:`i= \{1,\ldots, b\}`, i.e.,
.. math:: Z[i,:,:] = X[i,:,:] Y[i,:,:].
.. code:: python
npx.batch_dot(np.ones((2, 1, 3)), np.ones((2, 3, 2)))
.. parsed-literal::
:class: output
array([[[3., 3.]],
[[3., 3.]]])
Dot Product Attention
---------------------
Equipped with the above two operators: ``masked_softmax`` and
``batched_dot``, let’s dive into the details of two widely used
attentions layers. The first one is the *dot product attention*: it
assumes that the query has the same dimension as the keys, namely
:math:`\mathbf q, \mathbf k_i \in\mathbb R^d` for all :math:`i`. The dot
product attention computes the scores by an dot product between the
query and a key, which is then divided by :math:`\sqrt{d}` to minimize
the unrelated influence of the dimension :math:`d` on the scores. In
other words,
.. math:: \alpha(\mathbf q, \mathbf k) = \langle \mathbf q, \mathbf k \rangle /\sqrt{d}.
Beyond the single-dimensional queries and keys, we can always generalize
them to multi-dimensional queries and keys. Assume that
:math:`\mathbf Q\in\mathbb R^{m\times d}` contains :math:`m` queries and
:math:`\mathbf K\in\mathbb R^{n\times d}` has all the :math:`n` keys. We
can compute all :math:`mn` scores by
.. math:: \alpha(\mathbf Q, \mathbf K) = \mathbf Q \mathbf K^\top /\sqrt{d}.
:label: eq_alpha_QK
With :eq:`eq_alpha_QK`, we can implement the dot product attention
layer ``DotProductAttention`` that supports a batch of queries and
key-value pairs. In addition, for regularization we also use a dropout
layer.
.. code:: python
# Saved in the d2l package for later use
class DotProductAttention(nn.Block):
def __init__(self, dropout, **kwargs):
super(DotProductAttention, self).__init__(**kwargs)
self.dropout = nn.Dropout(dropout)
# query: (batch_size, #queries, d)
# key: (batch_size, #kv_pairs, d)
# value: (batch_size, #kv_pairs, dim_v)
# valid_length: either (batch_size, ) or (batch_size, xx)
def forward(self, query, key, value, valid_length=None):
d = query.shape[-1]
# Set transpose_b=True to swap the last two dimensions of key
scores = npx.batch_dot(query, key, transpose_b=True) / math.sqrt(d)
attention_weights = self.dropout(masked_softmax(scores, valid_length))
return npx.batch_dot(attention_weights, value)
| Let’s test the class ``DotProductAttention`` in a toy example. First,
create two batches, where each batch has one query and 10 key-value
pairs.
| Via the ``valid_length`` argument, we specify that we will check the
first :math:`2` key-value pairs for the first batch and :math:`6` for
the second one. Therefore, even though both batches have the same
query and key-value pairs, we obtain different outputs.
.. code:: python
atten = DotProductAttention(dropout=0.5)
atten.initialize()
keys = np.ones((2, 10, 2))
values = np.arange(40).reshape(1, 10, 4).repeat(2, axis=0)
atten(np.ones((2, 1, 2)), keys, values, np.array([2, 6]))
.. parsed-literal::
:class: output
array([[[ 2. , 3. , 4. , 5. ]],
[[10. , 11. , 12.000001, 13. ]]])
As we can see above, dot product attention simply multiplies the query
and key together, and hopes to derive their similarities from there.
Whereas, the query and key may not be of the same dimension. To address
such an issue, we may resort to the multilayer perceptron attention.
Multilayer Perceptron Attention
-------------------------------
In *multilayer perceptron attention*, we project both query and keys
into :math:`\mathbb R^{h}` by learnable weights parameters. Assume that
the learnable weights are :math:`\mathbf W_k\in\mathbb R^{h\times d_k}`,
:math:`\mathbf W_q\in\mathbb R^{h\times d_q}`, and
:math:`\mathbf v\in\mathbb R^{h}`. Then the score function is defined by
.. math:: \alpha(\mathbf k, \mathbf q) = \mathbf v^\top \text{tanh}(\mathbf W_k \mathbf k + \mathbf W_q\mathbf q).
Intuitively, you can imagine
:math:`\mathbf W_k \mathbf k + \mathbf W_q\mathbf q` as concatenating
the key and value in the feature dimension and feeding them to a single
hidden layer perceptron with hidden layer size :math:`h` and output
layer size :math:`1`. In this hidden layer, the activation function is
:math:`\tanh` and no bias is applied. Now let’s implement the multilayer
perceptron attention.
.. code:: python
# Saved in the d2l package for later use
class MLPAttention(nn.Block):
def __init__(self, units, dropout, **kwargs):
super(MLPAttention, self).__init__(**kwargs)
# Use flatten=True to keep query's and key's 3-D shapes
self.W_k = nn.Dense(units, activation='tanh',
use_bias=False, flatten=False)
self.W_q = nn.Dense(units, activation='tanh',
use_bias=False, flatten=False)
self.v = nn.Dense(1, use_bias=False, flatten=False)
self.dropout = nn.Dropout(dropout)
def forward(self, query, key, value, valid_length):
query, key = self.W_k(query), self.W_q(key)
# Expand query to (batch_size, #querys, 1, units), and key to
# (batch_size, 1, #kv_pairs, units). Then plus them with broadcast
features = np.expand_dims(query, axis=2) + np.expand_dims(key, axis=1)
scores = np.squeeze(self.v(features), axis=-1)
attention_weights = self.dropout(masked_softmax(scores, valid_length))
return npx.batch_dot(attention_weights, value)
To test the above ``MLPAttention`` class, we use the same inputs as in
the previous toy example. As we can see below, despite ``MLPAttention``
containing an additional MLP model, we obtain the same outputs as for
``DotProductAttention``.
.. code:: python
atten = MLPAttention(units=8, dropout=0.1)
atten.initialize()
atten(np.ones((2, 1, 2)), keys, values, np.array([2, 6]))
.. parsed-literal::
:class: output
array([[[ 2. , 3. , 4. , 5. ]],
[[10. , 11. , 12.000001, 13. ]]])
Summary
-------
- An attention layer explicitly selects related information.
- An attention layer’s memory consists of key-value pairs, so its
output is close to the values whose keys are similar to the queries.
- Two commonly used attention models are dot product attention and
multilayer perceptron attention.
Exercises
---------
1. What are the advantages and disadvantages for dot product attention
and multilayer perceptron attention, respectively?
`Discussions `__
-----------------------------------------------------------
|image0|
.. |image0| image:: ../img/qr_attention.svg