.. _sec_word2vec_pretraining:
Pretraining word2vec
====================
We go on to implement the skip-gram model defined in
:numref:`sec_word2vec`. Then we will pretrain word2vec using negative
sampling on the PTB dataset. First of all, let’s obtain the data
iterator and the vocabulary for this dataset by calling the
``d2l.load_data_ptb`` function, which was described in
:numref:`sec_word2vec_data`. Throughout this section, implementations
are shown first in PyTorch and then in MXNet.
.. code:: python

    import math
    import torch
    from torch import nn
    from d2l import torch as d2l

    batch_size, max_window_size, num_noise_words = 512, 5, 5
    data_iter, vocab = d2l.load_data_ptb(batch_size, max_window_size,
                                         num_noise_words)
.. code:: python

    import math
    from mxnet import autograd, gluon, np, npx
    from mxnet.gluon import nn
    from d2l import mxnet as d2l

    npx.set_np()

    batch_size, max_window_size, num_noise_words = 512, 5, 5
    data_iter, vocab = d2l.load_data_ptb(batch_size, max_window_size,
                                         num_noise_words)
The Skip-Gram Model
-------------------
We implement the skip-gram model by using embedding layers and batch
matrix multiplications. First, let’s review how embedding layers work.
Embedding Layer
~~~~~~~~~~~~~~~
As described in :numref:`sec_seq2seq`, an embedding layer maps a
token’s index to its feature vector. The weight of this layer is a
matrix whose number of rows equals the dictionary size
(``input_dim``) and whose number of columns equals the vector dimension
for each token (``output_dim``). After a word embedding model is
trained, this weight is what we need.
.. code:: python

    embed = nn.Embedding(num_embeddings=20, embedding_dim=4)
    print(f'Parameter embedding_weight ({embed.weight.shape}, '
          f'dtype={embed.weight.dtype})')
.. parsed-literal::
    :class: output

    Parameter embedding_weight (torch.Size([20, 4]), dtype=torch.float32)
.. code:: python

    embed = nn.Embedding(input_dim=20, output_dim=4)
    embed.initialize()
    embed.weight
.. parsed-literal::
    :class: output

    Parameter embedding0_weight (shape=(20, 4), dtype=float32)
The input of an embedding layer is the index of a token (word). For any
token index :math:`i`, its vector representation can be obtained from
the :math:`i^\textrm{th}` row of the weight matrix in the embedding
layer. Since the vector dimension (``output_dim``) was set to 4, the
embedding layer returns vectors with shape (2, 3, 4) for a minibatch of
token indices with shape (2, 3).
.. code:: python

    x = torch.tensor([[1, 2, 3], [4, 5, 6]])
    embed(x)
.. parsed-literal::
    :class: output

    tensor([[[ 0.7606,  0.3872, -0.1864,  1.1732],
             [ 1.5035,  2.3623, -1.7542, -1.4990],
             [-1.2639, -1.5313,  2.1719,  0.4151]],

            [[-1.9079,  0.2434,  1.5395,  1.2990],
             [ 0.7470,  1.0129,  0.4039,  0.0591],
             [-0.6293, -0.1814, -0.4782, -0.5289]]], grad_fn=<EmbeddingBackward0>)
.. code:: python

    x = np.array([[1, 2, 3], [4, 5, 6]])
    embed(x)
.. parsed-literal::
    :class: output

    array([[[ 0.01438687,  0.05011239,  0.00628365,  0.04861524],
            [-0.01068833,  0.01729892,  0.02042518, -0.01618656],
            [-0.00873779, -0.02834515,  0.05484822, -0.06206018]],

           [[ 0.06491279, -0.03182812, -0.01631819, -0.00312688],
            [ 0.0408415 ,  0.04370362,  0.00404529, -0.0028032 ],
            [ 0.00952624, -0.01501013,  0.05958354,  0.04705103]]])
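As a quick sanity check (a minimal sketch using the PyTorch ``embed``
and ``x`` defined above), looking up token indices with the embedding
layer gives the same result as directly selecting the corresponding
rows of its weight matrix:

.. code:: python

    # Embedding lookup is equivalent to indexing rows of the weight matrix
    torch.equal(embed(x), embed.weight[x])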
Defining the Forward Propagation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In the forward propagation, the input of the skip-gram model includes
the center word indices ``center`` of shape (batch size, 1) and the
concatenated context and noise word indices ``contexts_and_negatives``
of shape (batch size, ``max_len``), where ``max_len`` is defined in
:numref:`subsec_word2vec-minibatch-loading`. These two variables are
first transformed from the token indices into vectors via the embedding
layer, then their batch matrix multiplication (described in
:numref:`subsec_batch_dot`) returns an output of shape (batch size, 1,
``max_len``). Each element in the output is the dot product of a center
word vector and a context or noise word vector.
.. code:: python

    def skip_gram(center, contexts_and_negatives, embed_v, embed_u):
        # Shape of `v`: (batch size, 1, embedding dimension)
        v = embed_v(center)
        # Shape of `u`: (batch size, `max_len`, embedding dimension)
        u = embed_u(contexts_and_negatives)
        # Shape of `pred`: (batch size, 1, `max_len`)
        pred = torch.bmm(v, u.permute(0, 2, 1))
        return pred
.. code:: python

    def skip_gram(center, contexts_and_negatives, embed_v, embed_u):
        v = embed_v(center)
        u = embed_u(contexts_and_negatives)
        pred = npx.batch_dot(v, u.swapaxes(1, 2))
        return pred
Let’s print the output shape of this ``skip_gram`` function for some
example inputs.
.. code:: python

    skip_gram(torch.ones((2, 1), dtype=torch.long),
              torch.ones((2, 4), dtype=torch.long), embed, embed).shape
.. parsed-literal::
    :class: output

    torch.Size([2, 1, 4])
.. code:: python

    skip_gram(np.ones((2, 1)), np.ones((2, 4)), embed, embed).shape
.. parsed-literal::
    :class: output

    (2, 1, 4)
Training
--------
Before training the skip-gram model with negative sampling, let’s first
define its loss function.
Binary Cross-Entropy Loss
~~~~~~~~~~~~~~~~~~~~~~~~~
According to the definition of the loss function for negative sampling
in :numref:`subsec_negative-sampling`, we will use the binary
cross-entropy loss.
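Concretely, writing :math:`\sigma` for the sigmoid function, for a
center word with vector :math:`\mathbf{v}_c`, a context word with
vector :math:`\mathbf{u}_o`, and :math:`K` sampled noise words with
vectors :math:`\mathbf{u}_{h_1}, \ldots, \mathbf{u}_{h_K}`, the loss
for this training example is

.. math:: -\log \sigma\left(\mathbf{u}_o^\top \mathbf{v}_c\right) - \sum_{k=1}^{K} \log \sigma\left(-\mathbf{u}_{h_k}^\top \mathbf{v}_c\right),

i.e., the binary cross-entropy loss applied to the dot products, with
label 1 for the context word and label 0 for each noise word.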
.. code:: python

    class SigmoidBCELoss(nn.Module):
        # Binary cross-entropy loss with masking
        def __init__(self):
            super().__init__()

        def forward(self, inputs, target, mask=None):
            out = nn.functional.binary_cross_entropy_with_logits(
                inputs, target, weight=mask, reduction="none")
            return out.mean(dim=1)

    loss = SigmoidBCELoss()
.. code:: python

    loss = gluon.loss.SigmoidBCELoss()
Recall from :numref:`subsec_word2vec-minibatch-loading` that the mask
variable excludes paddings from the loss calculation and the label
variable distinguishes context words (label 1) from noise words (label
0). The following calculates the binary cross-entropy loss for the
given variables.
.. code:: python

    pred = torch.tensor([[1.1, -2.2, 3.3, -4.4]] * 2)
    label = torch.tensor([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
    mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
    loss(pred, label, mask) * mask.shape[1] / mask.sum(axis=1)
.. parsed-literal::
    :class: output

    tensor([0.9352, 1.8462])
.. code:: python

    pred = np.array([[1.1, -2.2, 3.3, -4.4]] * 2)
    label = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
    mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]])
    loss(pred, label, mask) * mask.shape[1] / mask.sum(axis=1)
.. parsed-literal::
    :class: output

    array([0.9352101, 1.8462093])
The following shows how the above results are computed (in a less
efficient way) using the sigmoid activation function in the binary
cross-entropy loss. We can consider the two outputs as two normalized
losses that are averaged over non-masked predictions.
.. code:: python

    def sigmd(x):
        # Negative log-sigmoid: the per-position binary cross-entropy term
        return -math.log(1 / (1 + math.exp(-x)))

    print(f'{(sigmd(1.1) + sigmd(2.2) + sigmd(-3.3) + sigmd(4.4)) / 4:.4f}')
    print(f'{(sigmd(-1.1) + sigmd(-2.2)) / 2:.4f}')
.. parsed-literal::
    :class: output

    0.9352
    1.8462
Initializing Model Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We define two embedding layers for all the words in the vocabulary when
they are used as center words and context words, respectively. The word
vector dimension ``embed_size`` is set to 100.
.. code:: python

    embed_size = 100
    net = nn.Sequential(nn.Embedding(num_embeddings=len(vocab),
                                     embedding_dim=embed_size),
                        nn.Embedding(num_embeddings=len(vocab),
                                     embedding_dim=embed_size))
.. code:: python

    embed_size = 100
    net = nn.Sequential()
    net.add(nn.Embedding(input_dim=len(vocab), output_dim=embed_size),
            nn.Embedding(input_dim=len(vocab), output_dim=embed_size))
Defining the Training Loop
~~~~~~~~~~~~~~~~~~~~~~~~~~
The training loop is defined below. Because of the existence of
padding, the calculation of the loss function is slightly different
from that in the previous training functions: the masked loss of each
example is rescaled by ``mask.shape[1] / mask.sum(axis=1)`` so that it
is averaged only over the non-padding positions.
.. code:: python

    def train(net, data_iter, lr, num_epochs, device=d2l.try_gpu()):
        def init_weights(module):
            if type(module) == nn.Embedding:
                nn.init.xavier_uniform_(module.weight)
        net.apply(init_weights)
        net = net.to(device)
        optimizer = torch.optim.Adam(net.parameters(), lr=lr)
        animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                                xlim=[1, num_epochs])
        # Sum of normalized losses, no. of normalized losses
        metric = d2l.Accumulator(2)
        for epoch in range(num_epochs):
            timer, num_batches = d2l.Timer(), len(data_iter)
            for i, batch in enumerate(data_iter):
                optimizer.zero_grad()
                center, context_negative, mask, label = [
                    data.to(device) for data in batch]
                pred = skip_gram(center, context_negative, net[0], net[1])
                l = (loss(pred.reshape(label.shape).float(), label.float(), mask)
                     / mask.sum(axis=1) * mask.shape[1])
                l.sum().backward()
                optimizer.step()
                metric.add(l.sum(), l.numel())
                if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                    animator.add(epoch + (i + 1) / num_batches,
                                 (metric[0] / metric[1],))
        print(f'loss {metric[0] / metric[1]:.3f}, '
              f'{metric[1] / timer.stop():.1f} tokens/sec on {str(device)}')
.. code:: python

    def train(net, data_iter, lr, num_epochs, device=d2l.try_gpu()):
        net.initialize(ctx=device, force_reinit=True)
        trainer = gluon.Trainer(net.collect_params(), 'adam',
                                {'learning_rate': lr})
        animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                                xlim=[1, num_epochs])
        # Sum of normalized losses, no. of normalized losses
        metric = d2l.Accumulator(2)
        for epoch in range(num_epochs):
            timer, num_batches = d2l.Timer(), len(data_iter)
            for i, batch in enumerate(data_iter):
                center, context_negative, mask, label = [
                    data.as_in_ctx(device) for data in batch]
                with autograd.record():
                    pred = skip_gram(center, context_negative, net[0], net[1])
                    l = (loss(pred.reshape(label.shape), label, mask) *
                         mask.shape[1] / mask.sum(axis=1))
                l.backward()
                trainer.step(batch_size)
                metric.add(l.sum(), l.size)
                if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                    animator.add(epoch + (i + 1) / num_batches,
                                 (metric[0] / metric[1],))
        print(f'loss {metric[0] / metric[1]:.3f}, '
              f'{metric[1] / timer.stop():.1f} tokens/sec on {str(device)}')
Now we can train a skip-gram model using negative sampling.
.. code:: python

    lr, num_epochs = 0.002, 5
    train(net, data_iter, lr, num_epochs)
.. parsed-literal::
    :class: output

    loss 0.410, 223485.0 tokens/sec on cuda:0

.. figure:: output_word2vec-pretraining_d81279_93_1.svg
.. code:: python

    lr, num_epochs = 0.002, 5
    train(net, data_iter, lr, num_epochs)
.. parsed-literal::
    :class: output

    loss 0.408, 108453.4 tokens/sec on gpu(0)

.. figure:: output_word2vec-pretraining_d81279_96_1.svg
.. _subsec_apply-word-embed:
Applying Word Embeddings
------------------------
After training the word2vec model, we can use the cosine similarity of
word vectors from the trained model to find words from the dictionary
that are most semantically similar to an input word.
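Here the cosine similarity between two word vectors :math:`\mathbf{x}`
and :math:`\mathbf{y}` is

.. math:: \frac{\mathbf{x}^\top \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|} \in [-1, 1].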
.. code:: python

    def get_similar_tokens(query_token, k, embed):
        W = embed.weight.data
        x = W[vocab[query_token]]
        # Compute the cosine similarity. Add 1e-9 for numerical stability
        cos = torch.mv(W, x) / torch.sqrt(torch.sum(W * W, dim=1) *
                                          torch.sum(x * x) + 1e-9)
        topk = torch.topk(cos, k=k+1)[1].cpu().numpy().astype('int32')
        for i in topk[1:]:  # Exclude the input word
            print(f'cosine sim={float(cos[i]):.3f}: {vocab.to_tokens(i)}')

    get_similar_tokens('chip', 3, net[0])
.. parsed-literal::
    :class: output

    cosine sim=0.702: microprocessor
    cosine sim=0.649: mips
    cosine sim=0.643: intel
.. code:: python

    def get_similar_tokens(query_token, k, embed):
        W = embed.weight.data()
        x = W[vocab[query_token]]
        # Compute the cosine similarity. Add 1e-9 for numerical stability
        cos = np.dot(W, x) / np.sqrt(np.sum(W * W, axis=1) * np.sum(x * x) + 1e-9)
        topk = npx.topk(cos, k=k+1, ret_typ='indices').asnumpy().astype('int32')
        for i in topk[1:]:  # Exclude the input word
            print(f'cosine sim={float(cos[i]):.3f}: {vocab.to_tokens(i)}')

    get_similar_tokens('chip', 3, net[0])
.. parsed-literal::
    :class: output

    cosine sim=0.681: intel
    cosine sim=0.662: microprocessor
    cosine sim=0.619: memory
Summary
-------
- We can train a skip-gram model with negative sampling using embedding
  layers and the binary cross-entropy loss.
- Applications of word embeddings include finding semantically similar
  words for a given word based on the cosine similarity of word
  vectors.
Exercises
---------
1. Using the trained model, find semantically similar words for other
   input words. Can you improve the results by tuning hyperparameters?
2. When a training corpus is huge, we often sample context words and
   noise words for the center words in the current minibatch *when
   updating model parameters*. In other words, the same center word may
   have different context words or noise words in different training
   epochs. What are the benefits of this method? Try to implement this
   training method.