In PyTorch:

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    @d2l.add_to_class(d2l.TimeMachine)  #@save
    def __init__(self, batch_size, num_steps, num_train=10000, num_val=5000):
        super(d2l.TimeMachine, self).__init__()
        self.save_hyperparameters()
        corpus, self.vocab = self.build(self._download())
        array = torch.tensor([corpus[i:i+num_steps+1]
                              for i in range(len(corpus)-num_steps)])
        self.X, self.Y = array[:,:-1], array[:,1:]
In MXNet:

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    @d2l.add_to_class(d2l.TimeMachine)  #@save
    def __init__(self, batch_size, num_steps, num_train=10000, num_val=5000):
        super(d2l.TimeMachine, self).__init__()
        self.save_hyperparameters()
        corpus, self.vocab = self.build(self._download())
        array = np.array([corpus[i:i+num_steps+1]
                          for i in range(len(corpus)-num_steps)])
        self.X, self.Y = array[:,:-1], array[:,1:]
In JAX:

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    @d2l.add_to_class(d2l.TimeMachine)  #@save
    def __init__(self, batch_size, num_steps, num_train=10000, num_val=5000):
        super(d2l.TimeMachine, self).__init__()
        self.save_hyperparameters()
        corpus, self.vocab = self.build(self._download())
        array = jnp.array([corpus[i:i+num_steps+1]
                           for i in range(len(corpus)-num_steps)])
        self.X, self.Y = array[:,:-1], array[:,1:]
In TensorFlow:

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    @d2l.add_to_class(d2l.TimeMachine)  #@save
    def __init__(self, batch_size, num_steps, num_train=10000, num_val=5000):
        super(d2l.TimeMachine, self).__init__()
        self.save_hyperparameters()
        corpus, self.vocab = self.build(self._download())
        array = tf.constant([corpus[i:i+num_steps+1]
                             for i in range(len(corpus)-num_steps)])
        self.X, self.Y = array[:,:-1], array[:,1:]
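To see the windowing logic in isolation, consider the following minimal
sketch on a toy list of token IDs (the ``toy_corpus`` name and values
are illustrative, not part of the book's code): every window of
``num_steps + 1`` consecutive tokens yields an input subsequence and a
target subsequence offset from it by one token.

.. code:: python

    import torch

    num_steps = 3
    toy_corpus = [0, 1, 2, 3, 4, 5, 6]  # hypothetical token IDs

    # Each window holds num_steps + 1 tokens so that inputs and targets
    # can both be carved out of it with a one-token offset.
    windows = torch.tensor([toy_corpus[i:i + num_steps + 1]
                            for i in range(len(toy_corpus) - num_steps)])
    X, Y = windows[:, :-1], windows[:, 1:]
    print(X)  # tensor([[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]])
    print(Y)  # tensor([[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]])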
To train language models, we will randomly sample pairs of input
sequences and target sequences in minibatches. The following data
loader randomly draws a minibatch from the dataset each time it is
called. The argument ``batch_size`` specifies the number of subsequence
examples in each minibatch and ``num_steps`` is the subsequence length
in tokens.
.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    @d2l.add_to_class(d2l.TimeMachine)  #@save
    def get_dataloader(self, train):
        idx = slice(0, self.num_train) if train else slice(
            self.num_train, self.num_train + self.num_val)
        return self.get_tensorloader([self.X, self.Y], train, idx)
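Here ``get_tensorloader`` is inherited from the d2l ``DataModule`` base
class. As a rough sketch of what such a loader can look like in PyTorch
(an assumption about the backing utilities, not the exact library
code), it slices each tensor with the given indices, wraps the results
in a dataset, and shuffles only during training:

.. code:: python

    import torch

    def get_tensorloader(self, tensors, train, indices=slice(0, None)):
        """Sketch of a tensor-backed loader, assuming PyTorch utilities."""
        tensors = tuple(a[indices] for a in tensors)
        dataset = torch.utils.data.TensorDataset(*tensors)
        return torch.utils.data.DataLoader(dataset, self.batch_size,
                                           shuffle=train)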
As we can see in the following, a minibatch of target sequences can be
obtained by shifting the input sequences by one token.
In PyTorch:

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    data = d2l.TimeMachine(batch_size=2, num_steps=10)
    for X, Y in data.train_dataloader():
        print('X:', X, '\nY:', Y)
        break

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    Downloading ../data/timemachine.txt from http://d2l-data.s3-accelerate.amazonaws.com/timemachine.txt...
    X: tensor([[10,  4,  2, 21, 10, 16, 15,  0, 20,  2],
            [21,  9,  6, 19,  0, 24,  2, 26,  0, 16]])
    Y: tensor([[ 4,  2, 21, 10, 16, 15,  0, 20,  2, 10],
            [ 9,  6, 19,  0, 24,  2, 26,  0, 16,  9]])
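Indeed, within each sequence the targets are just the inputs shifted by
one token, so the overlapping parts of ``X`` and ``Y`` must agree. A
minimal check on the PyTorch tensors produced by the loop above (purely
illustrative):

.. code:: python

    # The last nine tokens of each input row equal the first nine
    # tokens of the corresponding target row.
    assert torch.equal(X[:, 1:], Y[:, :-1])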
In MXNet:

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    data = d2l.TimeMachine(batch_size=2, num_steps=10)
    for X, Y in data.train_dataloader():
        print('X:', X, '\nY:', Y)
        break

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    X: [[ 7.  7.  6. 19.  6. 15.  4.  6.  0.  3.]
     [ 6. 19.  0.  4.  2. 22.  8.  9. 21.  0.]]
    Y: [[ 7.  6. 19.  6. 15.  4.  6.  0.  3.  6.]
     [19.  0.  4.  2. 22.  8.  9. 21.  0. 21.]]
In JAX:

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    data = d2l.TimeMachine(batch_size=2, num_steps=10)
    for X, Y in data.train_dataloader():
        print('X:', X, '\nY:', Y)
        break

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    X: [[13  5  0 19  6 17 19  6 20  6]
     [20 10 14 10 13  2 19 13 26  0]]
    Y: [[ 5  0 19  6 17 19  6 20  6 15]
     [10 14 10 13  2 19 13 26  0 21]]
In TensorFlow:

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

    data = d2l.TimeMachine(batch_size=2, num_steps=10)
    for X, Y in data.train_dataloader():
        print('X:', X, '\nY:', Y)
        break

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
    :class: output

    X: tf.Tensor(
    [[10 19 14 10 21 26  0 16  7  0]
     [ 0 14 16 14  6 15 21  0  4  2]], shape=(2, 10), dtype=int32)
    Y: tf.Tensor(
    [[19 14 10 21 26  0 16  7  0 21]
     [14 16 14  6 15 21  0  4  2 15]], shape=(2, 10), dtype=int32)
Summary and Discussion
----------------------
Language models estimate the joint probability of a text sequence. For
long sequences, :math:`n`-grams provide a convenient model by
truncating the dependence. However, natural language exhibits a great
deal of structure yet too few occurrences of infrequent word
combinations for Laplace smoothing to handle them efficiently. Thus, we
will focus on neural language modeling in subsequent sections. To train
language models, we can randomly sample pairs of input sequences and
target sequences in minibatches. After training, we will use perplexity
to measure the language model quality. Language models can be scaled up
with increased data size, model size, and amount of training compute.
Large language models can perform desired tasks by predicting output
text given input text instructions. As we will discuss later (e.g.,
:numref:`sec_large-pretraining-transformers`), at the present moment
large language models form the basis of state-of-the-art systems across
diverse tasks.
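Since perplexity is just the exponential of the average per-token
cross-entropy, it is straightforward to compute from model outputs. The
following minimal sketch, assuming PyTorch and hypothetical random
logits and targets (not code from this book), shows the relationship:

.. code:: python

    import torch
    import torch.nn.functional as F

    # Hypothetical model outputs: 2 sequences of 10 steps over a
    # vocabulary of 28 symbols, plus matching target token IDs.
    logits = torch.randn(2, 10, 28)
    targets = torch.randint(0, 28, (2, 10))

    # Average cross-entropy per token, then exponentiate.
    avg_ce = F.cross_entropy(logits.reshape(-1, 28), targets.reshape(-1))
    perplexity = torch.exp(avg_ce)
    print(perplexity)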
Exercises
---------
1. Suppose there are 100,000 words in the training dataset. How many
   word frequencies and adjacent multi-word frequencies does a
   four-gram model need to store?
2. How would you model a dialogue?
3. What other methods can you think of for reading long sequence data?
4. Consider our method for discarding a uniformly random number of the
   first few tokens at the beginning of each epoch.

   1. Does it really lead to a perfectly uniform distribution over the
      sequences in the document?
   2. What would you have to do to make things even more uniform?

5. If we want a sequence example to be a complete sentence, what kind
   of problem does this introduce in minibatch sampling? How can we fix
   it?