We now construct a vocabulary for our dataset, converting the sequence
of strings into a list of numerical indices. Note that we have not lost
any information and can easily convert our dataset back to its original
(string) representation.
.. code:: python

    vocab = Vocab(tokens)
    indices = vocab[tokens[:10]]
    print('indices:', indices)
    print('words:', vocab.to_tokens(indices))

.. parsed-literal::
    :class: output

    indices: [21, 9, 6, 0, 21, 10, 14, 6, 0, 14]
    words: ['t', 'h', 'e', ' ', 't', 'i', 'm', 'e', ' ', 'm']
Putting It All Together
-----------------------
Using the above classes and methods, we package everything into the
following ``build`` method of the ``TimeMachine`` class, which returns
``corpus``, a list of token indices, and ``vocab``, the vocabulary of
*The Time Machine* corpus. The modifications we made here are: (i) we
tokenize text into characters, not words, to simplify the training in
later sections; (ii) ``corpus`` is a single list, not a list of token
lists, since each text line in *The Time Machine* dataset is not
necessarily a sentence or paragraph.
.. code:: python

    @d2l.add_to_class(TimeMachine)  #@save
    def build(self, raw_text, vocab=None):
        tokens = self._tokenize(self._preprocess(raw_text))
        if vocab is None: vocab = Vocab(tokens)
        corpus = [vocab[token] for token in tokens]
        return corpus, vocab

    corpus, vocab = data.build(raw_text)
    len(corpus), len(vocab)

.. parsed-literal::
    :class: output

    (173428, 28)
.. _subsec_natural-lang-stat:
Exploratory Language Statistics
-------------------------------
Using the real corpus and the ``Vocab`` class defined over words, we can
inspect basic statistics concerning word use in our corpus. Below, we
construct a vocabulary from words used in *The Time Machine* and print
the ten most frequently occurring of them.
.. code:: python

    words = text.split()
    vocab = Vocab(words)
    vocab.token_freqs[:10]

.. parsed-literal::
    :class: output

    [('the', 2261),
     ('i', 1267),
     ('and', 1245),
     ('of', 1155),
     ('a', 816),
     ('to', 695),
     ('was', 552),
     ('in', 541),
     ('that', 443),
     ('my', 440)]
Note that the ten most frequent words are not all that descriptive. You
might even imagine seeing a very similar list had we chosen any book at
random. Articles like “the” and “a”, pronouns like “i” and “my”, and
prepositions like “of”, “to”, and “in” occur often because they serve
common syntactic roles. Such words that are common but not particularly
descriptive are often called *stop words*; in previous generations of
text classifiers based on so-called bag-of-words representations, they
were most often filtered out. However, they carry meaning, and it is not
necessary to filter them out when working with modern RNN- and
Transformer-based neural models.
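For a quick, purely illustrative look at which words remain once such
words are set aside, we can filter the frequency list against a small
hand-picked stop-word set. The set below is ad hoc (chosen just for this
sketch, not a standard list), and ``vocab`` is the word-level vocabulary
constructed above.

.. code:: python

    # A small, hand-picked stop-word set (illustrative only, not a standard list)
    stop_words = {'the', 'i', 'and', 'of', 'a', 'to', 'was', 'in', 'that', 'my',
                  'it', 'had', 'me', 'as', 'at', 'he', 'for', 'with', 'but', 'his'}
    # The most frequent words after dropping the stop words above
    [(token, freq) for token, freq in vocab.token_freqs
     if token not in stop_words][:10]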
If you look further down the list, you will notice that word frequency
decays quickly. The :math:`10^{\textrm{th}}` most frequent word is less
than :math:`1/5` as common as the most popular one. Word frequency tends
to follow a power law distribution (specifically the Zipfian
distribution) as we go down the ranks. To get a better picture, we plot
the word frequencies below.
.. code:: python

    freqs = [freq for token, freq in vocab.token_freqs]
    d2l.plot(freqs, xlabel='token: x', ylabel='frequency: n(x)',
             xscale='log', yscale='log')

.. figure:: output_text-sequence_e0a8c3_110_0.svg
After dealing with the first few words as exceptions, all the remaining
words roughly follow a straight line on a log–log plot. This phenomenon
is captured by *Zipf’s law*, which states that the frequency :math:`n_i`
of the :math:`i^\textrm{th}` most frequent word is:
.. math:: n_i \propto \frac{1}{i^\alpha},
:label: eq_zipf_law
which, after taking logarithms of both sides, is equivalent to
.. math:: \log n_i = -\alpha \log i + c,
where :math:`\alpha` is the exponent that characterizes the distribution
and :math:`c` is a constant.
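To get a rough feel for :math:`\alpha` on this corpus, we can fit a
straight line to the log–log frequencies by least squares. This is only a
sketch: it fits over all ranks, whereas a careful estimate would discard
the flattened head (the stop words) and the noisy tail; ``freqs`` is the
list of word frequencies computed above.

.. code:: python

    import numpy as np

    # Rough least-squares fit of log n_i = -alpha * log i + c over all ranks
    ranks = np.arange(1, len(freqs) + 1)
    slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), deg=1)
    print(f'estimated exponent alpha: {-slope:.2f}')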
This should already give us pause for thought if we want to model words
by counting statistics. After all, we will significantly overestimate the
frequency of the tail, also known as the infrequent words. But what about
the other word combinations, such as two consecutive words (bigrams),
three consecutive words (trigrams), and beyond? Let’s see whether the
bigram frequency behaves in the same manner as the single-word (unigram)
frequency.
.. code:: python

    bigram_tokens = ['--'.join(pair) for pair in zip(words[:-1], words[1:])]
    bigram_vocab = Vocab(bigram_tokens)
    bigram_vocab.token_freqs[:10]

.. parsed-literal::
    :class: output

    [('of--the', 309),
     ('in--the', 169),
     ('i--had', 130),
     ('i--was', 112),
     ('and--the', 109),
     ('the--time', 102),
     ('it--was', 99),
     ('to--the', 85),
     ('as--i', 78),
     ('of--a', 73)]
One thing is notable here: of the ten most frequent word pairs, nine are
composed of two stop words and only one is relevant to the actual book,
“the time”. Furthermore, let’s see whether the trigram frequency behaves
in the same manner.
.. code:: python

    trigram_tokens = ['--'.join(triple) for triple in zip(
        words[:-2], words[1:-1], words[2:])]
    trigram_vocab = Vocab(trigram_tokens)
    trigram_vocab.token_freqs[:10]

.. parsed-literal::
    :class: output

    [('the--time--traveller', 59),
     ('the--time--machine', 30),
     ('the--medical--man', 24),
     ('it--seemed--to', 16),
     ('it--was--a', 15),
     ('here--and--there', 15),
     ('seemed--to--me', 14),
     ('i--did--not', 14),
     ('i--saw--the', 13),
     ('i--began--to', 13)]
Now, let’s visualize the token frequency among these three models:
unigrams, bigrams, and trigrams.
.. code:: python

    bigram_freqs = [freq for token, freq in bigram_vocab.token_freqs]
    trigram_freqs = [freq for token, freq in trigram_vocab.token_freqs]
    d2l.plot([freqs, bigram_freqs, trigram_freqs], xlabel='token: x',
             ylabel='frequency: n(x)', xscale='log', yscale='log',
             legend=['unigram', 'bigram', 'trigram'])

.. figure:: output_text-sequence_e0a8c3_155_0.svg
This figure is quite exciting. First, beyond unigram words, sequences of
words also appear to follow Zipf’s law, albeit with a smaller exponent
:math:`\alpha` in :eq:`eq_zipf_law`, depending on the sequence length.
Second, the number of distinct :math:`n`-grams is not that large. This
gives us hope that there is quite a lot of structure in language. Third,
many :math:`n`-grams occur very rarely. This makes certain methods
unsuitable for language modeling and motivates the use of deep learning
models. We will discuss this in the next section.
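To put a rough number on the last point, we can check how many of the
distinct bigrams and trigrams occur exactly once, reusing the
vocabularies constructed above (a quick sanity check rather than a
careful analysis).

.. code:: python

    # Count distinct n-grams that occur exactly once in the corpus
    for name, v in [('bigrams', bigram_vocab), ('trigrams', trigram_vocab)]:
        once = sum(1 for token, freq in v.token_freqs if freq == 1)
        print(f'{name} occurring once: {once} of {len(v.token_freqs)}')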
Summary
-------
Text is among the most common forms of sequence data encountered in deep
learning. Common choices for what constitutes a token are characters,
words, and word pieces. To preprocess text, we usually (i) split text
into tokens; (ii) build a vocabulary to map token strings to numerical
indices; and (iii) convert text data into token indices for models to
manipulate. In practice, the frequency of words tends to follow Zipf’s
law. This is true not just for individual words (unigrams), but also for
:math:`n`-grams.
Exercises
---------
1. In the experiment of this section, tokenize text into words and vary
the ``min_freq`` argument value of the ``Vocab`` instance.
Qualitatively characterize how changes in ``min_freq`` impact the
size of the resulting vocabulary.
2. Estimate the exponent of the Zipfian distribution for unigrams,
   bigrams, and trigrams in this corpus.
3. Find some other sources of data (download a standard machine learning
   dataset, pick another public domain book, scrape a website, etc.). For
   each, tokenize the data at both the word and character levels. How do
   the vocabulary sizes compare with *The Time Machine* corpus at
   equivalent values of ``min_freq``? Estimate the exponent of the
   Zipfian distribution corresponding to the unigram and bigram
   distributions for these corpora. How do they compare with the values
   that you observed for *The Time Machine* corpus?