We now construct a vocabulary for our dataset, converting the sequence
of strings into a list of numerical indices. Note that we have not lost
any information and can easily convert our dataset back to its original
(string) representation.
.. code:: python

    vocab = Vocab(tokens)
    indices = vocab[tokens[:10]]
    print('indices:', indices)
    print('words:', vocab.to_tokens(indices))

.. parsed-literal::
    :class: output

    indices: [21, 9, 6, 0, 21, 10, 14, 6, 0, 14]
    words: ['t', 'h', 'e', ' ', 't', 'i', 'm', 'e', ' ', 'm']
Putting It All Together
-----------------------
Using the above classes and methods, we package everything into the
following ``build`` method of the ``TimeMachine`` class, which returns
``corpus``, a list of token indices, and ``vocab``, the vocabulary of
*The Time Machine* corpus. The modifications we made here are: (i) we
tokenize text into characters, not words, to simplify the training in
later sections; (ii) ``corpus`` is a single list, not a list of token
lists, since each text line in *The Time Machine* dataset is not
necessarily a sentence or paragraph.
.. code:: python

    @d2l.add_to_class(TimeMachine)  #@save
    def build(self, raw_text, vocab=None):
        tokens = self._tokenize(self._preprocess(raw_text))
        if vocab is None: vocab = Vocab(tokens)
        corpus = [vocab[token] for token in tokens]
        return corpus, vocab

    corpus, vocab = data.build(raw_text)
    len(corpus), len(vocab)

.. parsed-literal::
    :class: output

    (173428, 28)
.. _subsec_natural-lang-stat:
Exploratory Language Statistics
-------------------------------
Using the real corpus and the ``Vocab`` class defined over words, we can
inspect basic statistics concerning word use in our corpus. Below, we
construct a vocabulary from words used in *The Time Machine* and print
the ten most frequently occurring of them.
.. code:: python

    words = text.split()
    vocab = Vocab(words)
    vocab.token_freqs[:10]

.. parsed-literal::
    :class: output

    [('the', 2261),
     ('i', 1267),
     ('and', 1245),
     ('of', 1155),
     ('a', 816),
     ('to', 695),
     ('was', 552),
     ('in', 541),
     ('that', 443),
     ('my', 440)]
Note that the ten most frequent words are not all that descriptive. You
might even imagine seeing a very similar list had we chosen any book at
random. Articles like “the” and “a”, pronouns like “i” and “my”, and
prepositions like “of”, “to”, and “in” occur often because they serve
common syntactic roles. Such words that are common but not particularly
descriptive are often called *stop words*; in previous generations of
text classifiers based on so-called bag-of-words representations, they
were most often filtered out. However, they carry meaning, and it is not
necessary to filter them out when working with modern RNN- and
Transformer-based neural models.
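For a quick, purely illustrative look at which words remain once such
words are set aside, we can filter the frequency list against a small
hand-picked stop-word set. The set below is ad hoc (chosen just for this
sketch, not a standard list), and ``vocab`` is the word-level vocabulary
constructed above.

.. code:: python

    # A small, hand-picked stop-word set (illustrative only, not a standard list)
    stop_words = {'the', 'i', 'and', 'of', 'a', 'to', 'was', 'in', 'that', 'my',
                  'it', 'had', 'me', 'as', 'at', 'he', 'for', 'with', 'but', 'his'}
    # The most frequent words after dropping the stop words above
    [(token, freq) for token, freq in vocab.token_freqs
     if token not in stop_words][:10]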
If you look further down the list, you will notice that word frequency
decays quickly. The :math:`10^{\textrm{th}}` most frequent word is less
than :math:`1/5` as common as the most popular one. Word frequency tends
to follow a power law distribution (specifically the Zipfian
distribution) as we go down the ranks. To get a better picture, we plot
the word frequencies below.
.. code:: python

    freqs = [freq for token, freq in vocab.token_freqs]
    d2l.plot(freqs, xlabel='token: x', ylabel='frequency: n(x)',
             xscale='log', yscale='log')

.. figure:: output_text-sequence_e0a8c3_110_0.svg
After dealing with the first few words as exceptions, all the remaining
words roughly follow a straight line on a log–log plot. This phenomenon
is captured by *Zipf’s law*, which states that the frequency :math:`n_i`
of the :math:`i^\textrm{th}` most frequent word is:
.. math:: n_i \propto \frac{1}{i^\alpha},
:label: eq_zipf_law
which, after taking logarithms of both sides, is equivalent to
.. math:: \log n_i = -\alpha \log i + c,
where :math:`\alpha` is the exponent that characterizes the distribution
and :math:`c` is a constant.
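To get a rough feel for :math:`\alpha` on this corpus, we can fit a
straight line to the log–log frequencies by least squares. This is only a
sketch: it fits over all ranks, whereas a careful estimate would discard
the flattened head (the stop words) and the noisy tail; ``freqs`` is the
list of word frequencies computed above.

.. code:: python

    import numpy as np

    # Rough least-squares fit of log n_i = -alpha * log i + c over all ranks
    ranks = np.arange(1, len(freqs) + 1)
    slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), deg=1)
    print(f'estimated exponent alpha: {-slope:.2f}')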
This should already give us pause for thought if we want to model words
by counting statistics. After all, we will significantly overestimate the
frequency of the tail, also known as the infrequent words. But what about
the other word combinations, such as two consecutive words (bigrams),
three consecutive words (trigrams), and beyond? Let’s see whether the
bigram frequency behaves in the same manner as the single-word (unigram)
frequency.
.. code:: python

    bigram_tokens = ['--'.join(pair) for pair in zip(words[:-1], words[1:])]
    bigram_vocab = Vocab(bigram_tokens)
    bigram_vocab.token_freqs[:10]

.. parsed-literal::
    :class: output

    [('of--the', 309),
     ('in--the', 169),
     ('i--had', 130),
     ('i--was', 112),
     ('and--the', 109),
     ('the--time', 102),
     ('it--was', 99),
     ('to--the', 85),
     ('as--i', 78),
     ('of--a', 73)]
One thing is notable here: of the ten most frequent word pairs, nine are
composed of two stop words and only one is relevant to the actual book,
“the time”. Furthermore, let’s see whether the trigram frequency behaves
in the same manner.
.. code:: python

    trigram_tokens = ['--'.join(triple) for triple in zip(
        words[:-2], words[1:-1], words[2:])]
    trigram_vocab = Vocab(trigram_tokens)
    trigram_vocab.token_freqs[:10]

.. parsed-literal::
    :class: output

    [('the--time--traveller', 59),
     ('the--time--machine', 30),
     ('the--medical--man', 24),
     ('it--seemed--to', 16),
     ('it--was--a', 15),
     ('here--and--there', 15),
     ('seemed--to--me', 14),
     ('i--did--not', 14),
     ('i--saw--the', 13),
     ('i--began--to', 13)]
Now, let’s visualize the token frequency among these three models:
unigrams, bigrams, and trigrams.
.. code:: python

    bigram_freqs = [freq for token, freq in bigram_vocab.token_freqs]
    trigram_freqs = [freq for token, freq in trigram_vocab.token_freqs]
    d2l.plot([freqs, bigram_freqs, trigram_freqs], xlabel='token: x',
             ylabel='frequency: n(x)', xscale='log', yscale='log',
             legend=['unigram', 'bigram', 'trigram'])

.. figure:: output_text-sequence_e0a8c3_155_0.svg
This figure is quite exciting. First, beyond unigram words, sequences of
words also appear to follow Zipf’s law, albeit with a smaller exponent
:math:`\alpha` in :eq:`eq_zipf_law`, depending on the sequence length.
Second, the number of distinct :math:`n`-grams is not that large. This
gives us hope that there is quite a lot of structure in language. Third,
many :math:`n`-grams occur very rarely. This makes certain methods
unsuitable for language modeling and motivates the use of deep learning
models. We will discuss this in the next section.
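To put a rough number on the last point, we can check how many of the
distinct bigrams and trigrams occur exactly once, reusing the
vocabularies constructed above (a quick sanity check rather than a
careful analysis).

.. code:: python

    # Count distinct n-grams that occur exactly once in the corpus
    for name, v in [('bigrams', bigram_vocab), ('trigrams', trigram_vocab)]:
        once = sum(1 for token, freq in v.token_freqs if freq == 1)
        print(f'{name} occurring once: {once} of {len(v.token_freqs)}')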
Summary
-------
Text is among the most common forms of sequence data encountered in deep
learning. Common choices for what constitutes a token are characters,
words, and word pieces. To preprocess text, we usually (i) split text
into tokens; (ii) build a vocabulary to map token strings to numerical
indices; and (iii) convert text data into token indices for models to
manipulate. In practice, the frequency of words tends to follow Zipf’s
law. This is true not just for individual words (unigrams), but also for
:math:`n`-grams.
Exercises
---------
1. In the experiment of this section, tokenize text into words and vary
the ``min_freq`` argument value of the ``Vocab`` instance.
Qualitatively characterize how changes in ``min_freq`` impact the
size of the resulting vocabulary.
2. Estimate the exponent of the Zipfian distribution for unigrams,
   bigrams, and trigrams in this corpus.
3. Find some other sources of data (download a standard machine learning
   dataset, pick another public domain book, scrape a website, etc.). For
   each, tokenize the data at both the word and character levels. How do
   the vocabulary sizes compare with *The Time Machine* corpus at
   equivalent values of ``min_freq``? Estimate the exponent of the
   Zipfian distribution corresponding to the unigram and bigram
   distributions for these corpora. How do they compare with the values
   that you observed for *The Time Machine* corpus?