14. Natural Language Processing: Pretraining
Humans need to communicate. Out of this basic need of the human condition, a vast amount of written text is generated every day. Given the rich text in social media, chat apps, emails, product reviews, news articles, research papers, and books, it becomes vital to enable computers to understand it, so that they can offer assistance or make decisions based on human languages.
Natural language processing studies interactions between computers and humans using natural languages. In practice, it is very common to use natural language processing techniques to process and analyze text (human natural language) data, such as language models in Section 8.3 and machine translation models in Section 9.5.
To understand text, we can begin with its representation, such as treating each word or subword as an individual text token. As we will see in this chapter, the representation of each token can be pretrained on a large corpus, using word2vec, GloVe, or subword embedding models. After pretraining, each token is represented by a vector; however, this vector remains the same no matter what the context is. For instance, the vector representation of “bank” is the same in both “go to the bank to deposit some money” and “go to the bank to sit down”. Thus, many more recent pretraining models adapt the representation of the same token to different contexts. Among them is BERT, a much deeper model based on the Transformer encoder. In this chapter, we will focus on how to pretrain such representations for text, as highlighted in Fig. 14.1.
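The context-independence of such token vectors can be illustrated with a minimal sketch. It assumes a plain lookup table with randomly initialized vectors standing in for pretrained ones; the names `EMBED_DIM`, `embedding`, and `embed` are hypothetical and are not part of word2vec, GloVe, or any library.

```python
import random

random.seed(0)
EMBED_DIM = 4  # hypothetical vector size, chosen only for illustration

sentences = [
    "go to the bank to deposit some money",
    "go to the bank to sit down",
]

# Vocabulary: every distinct whitespace-separated token in the two sentences.
vocab = sorted({tok for s in sentences for tok in s.split()})

# Static embedding table: one fixed vector per token.
# Random values stand in for pretrained vectors (an assumption of this sketch).
embedding = {tok: [random.gauss(0, 1) for _ in range(EMBED_DIM)] for tok in vocab}


def embed(sentence):
    """Look up each token's vector; the lookup ignores neighboring words."""
    return [(tok, embedding[tok]) for tok in sentence.split()]


# Retrieve the vector for "bank" from each sentence.
bank_vectors = [dict(embed(s))["bank"] for s in sentences]

# The two vectors are identical, although "bank" means something different
# in each sentence.
same = bank_vectors[0] == bank_vectors[1]
print(same)  # True
```

Because the lookup depends only on the token itself, both occurrences of “bank” receive the same vector. A context-sensitive model would instead compute each token's representation from the whole input sequence.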
As shown in Fig. 14.1, the pretrained text representations can be fed to a variety of deep learning architectures for different downstream natural language processing applications. We will cover them in Section 15.
- 14.1. Word Embedding (word2vec)
- 14.2. Approximate Training
- 14.3. The Dataset for Pretraining Word Embedding
- 14.4. Pretraining word2vec
- 14.5. Word Embedding with Global Vectors (GloVe)
- 14.6. Subword Embedding
- 14.7. Finding Synonyms and Analogies
- 14.8. Bidirectional Encoder Representations from Transformers (BERT)
- 14.9. The Dataset for Pretraining BERT
- 14.10. Pretraining BERT