15.5. Natural Language Inference: Using Attention

We introduced the natural language inference (NLI) task and the SNLI dataset in Section 15.4. Whereas many models rely on complex and deep architectures, Parikh et al. proposed to address NLI with attention mechanisms and called the result a “decomposable attention model” [Parikh et al., 2016]. This yields a model without recurrent or convolutional layers that achieved the best result at the time on the SNLI dataset with far fewer parameters. In this section, we will describe and implement this attention-based method (with MLPs) for NLI, as depicted in Fig. 15.5.1.


Fig. 15.5.1 This section feeds pretrained GloVe to an architecture based on attention and MLPs for natural language inference.

15.5.1. The Model

Rather than preserving the order of words in premises and hypotheses, we can simply align the words in one text sequence with every word in the other, and vice versa, and then compare and aggregate this information to predict the logical relationship between a premise and a hypothesis. Just as words are aligned between source and target sentences in machine translation, words in premises and hypotheses can be neatly aligned by attention mechanisms.


Fig. 15.5.2 NLI using attention mechanisms.

Fig. 15.5.2 depicts the NLI method using attention mechanisms. At a high level, it consists of three jointly trained steps: attending, comparing, and aggregating. We will illustrate them step by step in the following.

import d2l
import mxnet as mx
from mxnet import autograd, gluon, init, np, npx
from mxnet.contrib import text
from mxnet.gluon import nn

npx.set_np()

Attending

The first step is to align words in one text sequence to each word in the other sequence. Suppose that the premise is “i do need sleep” and the hypothesis is “i am tired”. Due to semantic similarity, we may wish to align “i” in the hypothesis with “i” in the premise, and align “tired” in the hypothesis with “sleep” in the premise. Likewise, we may wish to align “i” in the premise with “i” in the hypothesis, and align “need” and “sleep” in the premise with “tired” in the hypothesis. Note that such alignment is soft: it uses a weighted average, where ideally large weights are associated with the words to be aligned. For ease of demonstration, Fig. 15.5.2 shows such alignment in a hard way.

Now we describe the soft alignment using attention mechanisms in more detail. Denote by \(\mathbf{A} = (\mathbf{a}_1, \ldots, \mathbf{a}_m)\) and \(\mathbf{B} = (\mathbf{b}_1, \ldots, \mathbf{b}_n)\) the premise and hypothesis, whose numbers of words are \(m\) and \(n\), respectively, where \(\mathbf{a}_i, \mathbf{b}_j \in \mathbb{R}^{d}\) (\(i = 1, \ldots, m, j = 1, \ldots, n\)) are \(d\)-dimensional word embedding vectors. For soft alignment, we compute the attention weights \(e_{ij} \in \mathbb{R}\) as

(15.5.1)\[e_{ij} = f(\mathbf{a}_i)^\top f(\mathbf{b}_j),\]

where the function \(f\) is a multilayer perceptron defined in the following mlp function. The output dimension of \(f\) is specified by the num_hiddens argument of mlp.

def mlp(num_hiddens, flatten):
    net = nn.Sequential()
    net.add(nn.Dense(num_hiddens, activation='relu', flatten=flatten))
    net.add(nn.Dense(num_hiddens, activation='relu', flatten=flatten))
    return net

It should be highlighted that, in (15.5.1), \(f\) takes the inputs \(\mathbf{a}_i\) and \(\mathbf{b}_j\) separately rather than taking a pair of them together as the input. This decomposition trick leads to only \(m + n\) applications (linear complexity) of \(f\) rather than \(mn\) applications (quadratic complexity).
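To make the counting concrete, here is a minimal sketch in plain NumPy rather than MXNet. The toy sizes, the random weights, and the call counter are assumptions made for illustration; only the structure of (15.5.1) is taken from the text. It applies a stand-in for \(f\) once per word and then checks that the resulting matrix of weights matches an entry-by-entry computation over all \(mn\) pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d, num_hiddens = 4, 3, 5, 6  # toy sizes chosen for illustration

# Toy stand-in for f: a two-layer ReLU MLP with random weights
W1 = rng.normal(size=(d, num_hiddens))
W2 = rng.normal(size=(num_hiddens, num_hiddens))
counter = {"f": 0}

def f(x):
    """Map one word vector to its hidden representation and count the call."""
    counter["f"] += 1
    return np.maximum(np.maximum(x @ W1, 0) @ W2, 0)

A = rng.normal(size=(m, d))  # premise word vectors a_1, ..., a_m
B = rng.normal(size=(n, d))  # hypothesis word vectors b_1, ..., b_n

# Decomposed form: f is applied once per word, then one matrix product
f_A = np.stack([f(a) for a in A])  # (m, num_hiddens)
f_B = np.stack([f(b) for b in B])  # (n, num_hiddens)
e = f_A @ f_B.T                    # e[i, j] = f(a_i)^T f(b_j)
decomposed_calls = counter["f"]    # equals m + n

# Reference: fill in e one (i, j) pair at a time
pairs_visited = 0
e_pairwise = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        e_pairwise[i, j] = f(A[i]) @ f(B[j])
        pairs_visited += 1

print(e.shape, decomposed_calls, pairs_visited)
print(np.allclose(e, e_pairwise))
```

A scoring function that took each pair \((\mathbf{a}_i, \mathbf{b}_j)\) jointly would have to be evaluated once per visited pair, whereas the decomposed form only needs one application of \(f\) per word.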

Normalizing the attention weights in (15.5.1), we compute the weighted average of all the word embeddings in the hypothesis to obtain a representation of the hypothesis that is softly aligned with the word indexed by \(i\) in the premise:

(15.5.2)\[\boldsymbol{\beta}_i = \sum_{j=1}^{n}\frac{\exp(e_{ij})}{ \sum_{k=1}^{n} \exp(e_{ik})} \mathbf{b}_j.\]

Likewise, we compute soft alignment of premise words for each word indexed by \(j\) in the hypothesis:

(15.5.3)\[\boldsymbol{\alpha}_j = \sum_{i=1}^{m}\frac{\exp(e_{ij})}{ \sum_{k=1}^{m} \exp(e_{kj})} \mathbf{a}_i.\]
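As a quick numerical check of (15.5.2) and (15.5.3), the following sketch uses plain NumPy with random stand-in values for the weights \(e_{ij}\) and toy sizes (both are assumptions for illustration). It shows that \(\boldsymbol{\beta}_i\) normalizes over the hypothesis words (rows of \(e\)), while \(\boldsymbol{\alpha}_j\) normalizes over the premise words (columns of \(e\)).

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    exp_x = np.exp(x)
    return exp_x / exp_x.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
m, n, d = 4, 3, 5            # toy sizes chosen for illustration
A = rng.normal(size=(m, d))  # premise word vectors
B = rng.normal(size=(n, d))  # hypothesis word vectors
e = rng.normal(size=(m, n))  # stand-in for the weights from (15.5.1)

# (15.5.2): normalize each row of e over the n hypothesis words
w_beta = softmax(e, axis=1)  # (m, n), each row sums to 1
beta = w_beta @ B            # (m, d): one aligned hypothesis vector per premise word

# (15.5.3): normalize each column of e over the m premise words
w_alpha = softmax(e, axis=0)  # (m, n), each column sums to 1
alpha = w_alpha.T @ A         # (n, d): one aligned premise vector per hypothesis word

print(beta.shape, alpha.shape)
```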

Below we define the Attend class to compute the soft alignment of hypotheses (beta) with input premises A and soft alignment of premises (alpha) with input hypotheses B.

class Attend(nn.Block):
    def __init__(self, num_hiddens, **kwargs):
        super(Attend, self).__init__(**kwargs)
        self.f = mlp(num_hiddens=num_hiddens, flatten=False)

    def forward(self, A, B):
        # Shape of A/B: (batch_size, #words in sequence A/B, embed_size)
        # Shape of f_A/f_B: (batch_size, #words in sequence A/B, num_hiddens)
        f_A = self.f(A)
        f_B = self.f(B)
        # Shape of e: (batch_size, #words in sequence A, #words in sequence B)
        e = npx.batch_dot(f_A, f_B, transpose_b=True)
        # Shape of beta: (batch_size, #words in sequence A, embed_size), where
        # sequence B is softly aligned with each word (axis 1 of beta) in
        # sequence A
        beta = npx.batch_dot(npx.softmax(e), B)
        # Shape of alpha: (batch_size, #words in sequence B, embed_size),
        # where sequence A is softly aligned with each word (axis 1 of alpha)
        # in sequence B
        alpha = npx.batch_dot(npx.softmax(e.transpose(0, 2, 1)), A)
        return beta, alpha

Comparing

In the next step, we compare a word in one sequence with the other sequence that is softly aligned with that word. Note that in soft alignment, all the words from one sequence, though probably with different attention weights, will be compared with a word in the other sequence. For ease of demonstration, Fig. 15.5.2 pairs words with aligned words in a hard way. For example, if the attending step determines that “need” and “sleep” in the premise are both aligned with “tired” in the hypothesis, the pair “tired–need sleep” will be compared.

In the comparing step, we feed the concatenation (operator \([\cdot, \cdot]\)) of words from one sequence and aligned words from the other sequence into a function \(g\) (a multilayer perceptron):

(15.5.4)\[\begin{split}\mathbf{v}_{A,i} = g([\mathbf{a}_i, \boldsymbol{\beta}_i]), i = 1, \ldots, m\\ \mathbf{v}_{B,j} = g([\mathbf{b}_j, \boldsymbol{\alpha}_j]), j = 1, \ldots, n.\end{split}\]

In (15.5.4), \(\mathbf{v}_{A,i}\) is the comparison between word \(i\) in the premise and all the hypothesis words that are softly aligned with word \(i\), while \(\mathbf{v}_{B,j}\) is the comparison between word \(j\) in the hypothesis and all the premise words that are softly aligned with word \(j\). The following Compare class defines such a comparing step.

class Compare(nn.Block):
    def __init__(self, num_hiddens, **kwargs):
        super(Compare, self).__init__(**kwargs)
        self.g = mlp(num_hiddens=num_hiddens, flatten=False)

    def forward(self, A, B, beta, alpha):
        V_A = self.g(np.concatenate([A, beta], axis=2))
        V_B = self.g(np.concatenate([B, alpha], axis=2))
        return V_A, V_B

Aggregating

With two sets of comparison vectors \(\mathbf{v}_{A,i}\) (\(i = 1, \ldots, m\)) and \(\mathbf{v}_{B,j}\) (\(j = 1, \ldots, n\)) on hand, in the last step we will aggregate such information to infer the logical relationship. We begin by summing up both sets:

(15.5.5)\[\mathbf{v}_A = \sum_{i=1}^{m} \mathbf{v}_{A,i}, \quad \mathbf{v}_B = \sum_{j=1}^{n}\mathbf{v}_{B,j}.\]

Next we feed the concatenation of both summarization results into function \(h\) (a multilayer perceptron) to obtain the classification result of the logical relationship:

(15.5.6)\[\hat{\mathbf{y}} = h([\mathbf{v}_A, \mathbf{v}_B]).\]
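The comparing and aggregating equations (15.5.4) to (15.5.6) can be traced on shapes alone. The sketch below uses plain NumPy with random stand-ins for \(g\) and \(h\); the toy sizes, the helper names, and the final linear projection to three class scores are assumptions of this sketch. It illustrates that summing over the word axis yields fixed-size vectors regardless of the lengths \(m\) and \(n\).

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_hiddens, num_classes = 5, 6, 3  # toy sizes chosen for illustration

def make_mlp(in_dim, out_dim):
    """Return a toy two-layer ReLU MLP with random weights."""
    W1 = rng.normal(size=(in_dim, out_dim))
    W2 = rng.normal(size=(out_dim, out_dim))
    return lambda x: np.maximum(np.maximum(x @ W1, 0) @ W2, 0)

g = make_mlp(2 * d, num_hiddens)            # stand-in for g in (15.5.4)
h = make_mlp(2 * num_hiddens, num_hiddens)  # stand-in for h in (15.5.6)
W_out = rng.normal(size=(num_hiddens, num_classes))  # assumed final projection

def compare_and_aggregate(A, beta, B, alpha):
    # (15.5.4): concatenate each word with its soft alignment, then apply g
    V_A = g(np.concatenate([A, beta], axis=1))   # (m, num_hiddens)
    V_B = g(np.concatenate([B, alpha], axis=1))  # (n, num_hiddens)
    # (15.5.5): sum over the word axis
    v_A, v_B = V_A.sum(axis=0), V_B.sum(axis=0)  # each (num_hiddens,)
    # (15.5.6): classify the concatenated summaries
    y_hat = h(np.concatenate([v_A, v_B])) @ W_out  # (num_classes,)
    return V_A, V_B, y_hat

for m, n in [(4, 3), (7, 2)]:
    A, beta = rng.normal(size=(m, d)), rng.normal(size=(m, d))
    B, alpha = rng.normal(size=(n, d)), rng.normal(size=(n, d))
    V_A, V_B, y_hat = compare_and_aggregate(A, beta, B, alpha)
    print(m, n, V_A.shape, V_B.shape, y_hat.shape)
```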

The aggregation step is defined in the following Aggregate class.

class Aggregate(nn.Block):
    def __init__(self, num_hiddens, num_outputs, **kwargs):
        super(Aggregate, self).__init__(**kwargs)
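        self.h = mlp(num_hiddens=num_hiddens, flatten=True)
        # Project to one score per class so that num_outputs is actually used
        self.h.add(nn.Dense(num_outputs))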

    def forward(self, V_A, V_B):
        # Sum up both sets of comparison vectors
        V_A = V_A.sum(axis=1)
        V_B = V_B.sum(axis=1)
        # Feed the concatenation of both summarization results into an MLP
        Y_hat = self.h(np.concatenate([V_A, V_B], axis=1))
        return Y_hat

Putting All Things Together

By putting the attending, comparing, and aggregating steps together, we define the decomposable attention model to jointly train these three steps.

class DecomposableAttention(nn.Block):
    def __init__(self, vocab, embed_size, num_hiddens, **kwargs):
        super(DecomposableAttention, self).__init__(**kwargs)
        self.embedding = nn.Embedding(len(vocab), embed_size)
        self.attend = Attend(num_hiddens)
        self.compare = Compare(num_hiddens)
        # There are 3 possible outputs: entailment, contradiction, and neutral
        self.aggregate = Aggregate(num_hiddens, 3)

    def forward(self, X):
        premises, hypotheses = X
        A = self.embedding(premises)
        B = self.embedding(hypotheses)
        beta, alpha = self.attend(A, B)
        V_A, V_B = self.compare(A, B, beta, alpha)
        Y_hat = self.aggregate(V_A, V_B)
        return Y_hat
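One consequence of composing the three steps this way is that the prediction does not depend on word order, since words only interact through the soft alignments and the final sums. The following sketch checks this in plain NumPy with random weights. It is a toy re-implementation under assumed sizes and helper names, not the MXNet model above, and it operates directly on word vectors without an embedding layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_hiddens, num_classes = 5, 6, 3  # toy sizes chosen for illustration

def make_mlp(in_dim, out_dim):
    """Return a toy two-layer ReLU MLP with random weights."""
    W1 = rng.normal(size=(in_dim, out_dim)) * 0.5
    W2 = rng.normal(size=(out_dim, out_dim)) * 0.5
    return lambda x: np.maximum(np.maximum(x @ W1, 0) @ W2, 0)

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    exp_x = np.exp(x)
    return exp_x / exp_x.sum(axis=axis, keepdims=True)

f = make_mlp(d, num_hiddens)
g = make_mlp(2 * d, num_hiddens)
h = make_mlp(2 * num_hiddens, num_hiddens)
W_out = rng.normal(size=(num_hiddens, num_classes))  # assumed final projection

def forward(A, B):
    """Attend, compare, and aggregate for one premise A and one hypothesis B."""
    e = f(A) @ f(B).T                 # (m, n) attention weights
    beta = softmax(e, axis=1) @ B     # (m, d)
    alpha = softmax(e, axis=0).T @ A  # (n, d)
    V_A = g(np.concatenate([A, beta], axis=1))
    V_B = g(np.concatenate([B, alpha], axis=1))
    summary = np.concatenate([V_A.sum(axis=0), V_B.sum(axis=0)])
    return h(summary) @ W_out         # (num_classes,)

A = rng.normal(size=(4, d))  # premise with 4 words
B = rng.normal(size=(3, d))  # hypothesis with 3 words

y = forward(A, B)
y_shuffled_premise = forward(A[[2, 0, 3, 1]], B)     # reorder premise words
y_shuffled_hypothesis = forward(A, B[[1, 2, 0]])     # reorder hypothesis words
print(np.allclose(y, y_shuffled_premise), np.allclose(y, y_shuffled_hypothesis))
```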

15.5.2. Training and Evaluating the Model

Now we will train and evaluate the defined decomposable attention model on the SNLI dataset. We begin by reading the dataset.

Reading the Dataset

We download and read the SNLI dataset using the function defined in Section 15.4. The batch size and sequence length are set to \(256\) and \(50\), respectively.

batch_size, num_steps = 256, 50
train_iter, test_iter, vocab = d2l.load_data_snli(batch_size, num_steps)
read 549367 examples
read 9824 examples

Creating the Model

We use the pretrained \(100\)-dimensional GloVe embedding to represent the input tokens. Thus, we predefine the dimension of vectors \(\mathbf{a}_i\) and \(\mathbf{b}_j\) in (15.5.1) as \(100\). The output dimension of functions \(f\) in (15.5.1) and \(g\) in (15.5.4) is set to \(200\). Then we create a model instance, initialize its parameters, and load the GloVe embedding to initialize vectors of input tokens.

embed_size, num_hiddens, ctx = 100, 200, d2l.try_all_gpus()
net = DecomposableAttention(vocab, embed_size, num_hiddens)
net.initialize(init.Xavier(), ctx=ctx)
glove_embedding = text.embedding.create(
    'glove', pretrained_file_name='glove.6B.100d.txt')
embeds = glove_embedding.get_vecs_by_tokens(vocab.idx_to_token)
net.embedding.weight.set_data(embeds)

Training and Evaluating the Model

In contrast to the split_batch function in Section 12.5 that takes single inputs such as text sequences (or images), we define a split_batch_multi_inputs function to take multiple inputs such as premises and hypotheses in minibatches.

# Saved in the d2l package for later use
def split_batch_multi_inputs(X, y, ctx_list):
    """Split multi-input X and y into multiple devices specified by ctx"""
    X = list(zip(*[gluon.utils.split_and_load(
        feature, ctx_list, even_split=False) for feature in X]))
    return (X, gluon.utils.split_and_load(y, ctx_list, even_split=False))
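The regrouping done by zip(*...) may be easier to see on plain Python lists. In the sketch below, split is a hypothetical stand-in for gluon.utils.split_and_load (it only cuts a list into chunks and does not move data to devices, and the real function may size its shards differently), and the toy strings are placeholders for token sequences.

```python
def split(seq, num_devices):
    """Hypothetical stand-in for split_and_load: cut seq into num_devices chunks."""
    size = -(-len(seq) // num_devices)  # ceiling division
    return [seq[i * size:(i + 1) * size] for i in range(num_devices)]

premises = ["p0", "p1", "p2", "p3", "p4"]
hypotheses = ["h0", "h1", "h2", "h3", "h4"]
labels = [0, 1, 2, 0, 1]
X = (premises, hypotheses)
num_devices = 2

# per_feature[k][dev] is the shard of feature k that goes to device dev
per_feature = [split(feature, num_devices) for feature in X]

# zip(*...) transposes this, so per_device[dev] is a
# (premise shard, hypothesis shard) pair for device dev
per_device = list(zip(*per_feature))
label_shards = split(labels, num_devices)

for dev, ((p, h), y) in enumerate(zip(per_device, label_shards)):
    print(dev, p, h, y)
```

After the transposition, each device receives one tuple that can be unpacked as premises, hypotheses = X, which is the form that the forward method of DecomposableAttention expects.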

Now we can train and evaluate the model on the SNLI dataset.

lr, num_epochs = 0.001, 4
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': lr})
loss = gluon.loss.SoftmaxCrossEntropyLoss()
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, ctx,
               split_batch_multi_inputs)
loss 0.517, train acc 0.796, test acc 0.820
11442.2 examples/sec on [gpu(0), gpu(1)]
Using the Model

Finally, we define the prediction function to output the logical relationship between a premise and a hypothesis.

# Saved in the d2l package for later use
def predict_snli(net, premise, hypothesis):
    premise = np.array(vocab[premise], ctx=d2l.try_gpu())
    hypothesis = np.array(vocab[hypothesis], ctx=d2l.try_gpu())
    label = np.argmax(net([premise.reshape((1, -1)),
                           hypothesis.reshape((1, -1))]), axis=1)
    return 'entailment' if label == 0 else 'contradiction' if label == 1 \
            else 'neutral'

We can use the trained model to obtain the NLI result for a sample pair of sentences.

predict_snli(net, ['he', 'is', 'good', '.'], ['he', 'is', 'bad', '.'])
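The last step of predict_snli, mapping three class scores to a label, can be looked at in isolation. This sketch uses plain NumPy; the scores are made-up numbers for illustration, not model output, and the helper name is hypothetical. The index order of the labels follows predict_snli.

```python
import numpy as np

# Index order as in predict_snli: 0 entailment, 1 contradiction, 2 neutral
LABELS = ['entailment', 'contradiction', 'neutral']

def scores_to_label(scores):
    """Return the label whose score is the largest."""
    return LABELS[int(np.argmax(scores))]

example_scores = np.array([0.2, 1.7, -0.4])  # made-up scores for illustration
print(scores_to_label(example_scores))       # prints "contradiction"
```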

15.5.3. Summary

  • The decomposable attention model consists of three steps for predicting the logical relationships between premises and hypotheses: attending, comparing, and aggregating.

  • With attention mechanisms, we can align words in one text sequence to every word in the other, and vice versa. Such alignment is soft: it uses a weighted average, where ideally large weights are associated with the words to be aligned.

  • The decomposition trick yields linear rather than quadratic complexity in the number of applications of \(f\) when computing attention weights.

  • We can use pretrained word embeddings as the input representation for downstream NLP tasks such as NLI.

15.5.4. Exercises

  1. Train the model with other combinations of hyperparameters. Can you get better accuracy on the test set?

  2. What are major drawbacks of the decomposable attention model for NLI?

  3. Suppose that we want to get the level of semantic similarity (e.g., a continuous value between \(0\) and \(1\)) for any pair of sentences. How shall we collect and label the dataset? Can you design a model with attention mechanisms?
