etiennefd committed on
Commit
191fa46
1 Parent(s): 6b8e80e
.ipynb_checkpoints/04_mnist_basics-checkpoint.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
.ipynb_checkpoints/10_nlp-checkpoint.ipynb ADDED
@@ -0,0 +1,2286 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": null,
6
+ "metadata": {},
7
+ "outputs": [],
8
+ "source": [
9
+ "#hide\n",
10
+ "! [ -e /content ] && pip install -Uqq fastbook\n",
11
+ "import fastbook\n",
12
+ "fastbook.setup_book()"
13
+ ]
14
+ },
15
+ {
16
+ "cell_type": "code",
17
+ "execution_count": null,
18
+ "metadata": {},
19
+ "outputs": [],
20
+ "source": [
21
+ "#hide\n",
22
+ "from fastbook import *\n",
23
+ "from IPython.display import display,HTML"
24
+ ]
25
+ },
26
+ {
27
+ "cell_type": "raw",
28
+ "metadata": {},
29
+ "source": [
30
+ "[[chapter_nlp]]"
31
+ ]
32
+ },
33
+ {
34
+ "cell_type": "markdown",
35
+ "metadata": {},
36
+ "source": [
37
+ "# NLP Deep Dive: RNNs"
38
+ ]
39
+ },
40
+ {
41
+ "cell_type": "markdown",
42
+ "metadata": {},
43
+ "source": [
44
+ "In <<chapter_intro>> we saw that deep learning can be used to get great results with natural language datasets. Our example relied on using a pretrained language model and fine-tuning it to classify reviews. That example highlighted a difference between transfer learning in NLP and computer vision: in general in NLP the pretrained model is trained on a different task.\n",
45
+ "\n",
46
+ "What we call a language model is a model that has been trained to guess what the next word in a text is (having read the ones before). This kind of task is called *self-supervised learning*: we do not need to give labels to our model, just feed it lots and lots of texts. It has a process to automatically get labels from the data, and this task isn't trivial: to properly guess the next word in a sentence, the model will have to develop an understanding of the English (or other) language. Self-supervised learning can also be used in other domains; for instance, see [\"Self-Supervised Learning and Computer Vision\"](https://www.fast.ai/2020/01/13/self_supervised/) for an introduction to vision applications. Self-supervised learning is not usually used for the model that is trained directly, but instead is used for pretraining a model used for transfer learning."
47
+ ]
48
+ },
49
+ {
50
+ "cell_type": "markdown",
51
+ "metadata": {},
52
+ "source": [
53
+ "> jargon: Self-supervised learning: Training a model using labels that are embedded in the independent variable, rather than requiring external labels. For instance, training a model to predict the next word in a text."
54
+ ]
55
+ },
56
+ {
57
+ "cell_type": "markdown",
58
+ "metadata": {},
59
+ "source": [
60
+ "The language model we used in <<chapter_intro>> to classify IMDb reviews was pretrained on Wikipedia. We got great results by directly fine-tuning this language model to a movie review classifier, but with one extra step, we can do even better. The Wikipedia English is slightly different from the IMDb English, so instead of jumping directly to the classifier, we could fine-tune our pretrained language model to the IMDb corpus and then use *that* as the base for our classifier.\n",
61
+ "\n",
62
+ "Even if our language model knows the basics of the language we are using in the task (e.g., our pretrained model is in English), it helps to get used to the style of the corpus we are targeting. It may be more informal language, or more technical, with new words to learn or different ways of composing sentences. In the case of the IMDb dataset, there will be lots of names of movie directors and actors, and often a less formal style of language than that seen in Wikipedia.\n",
63
+ "\n",
64
+ "We already saw that with fastai, we can download a pretrained English language model and use it to get state-of-the-art results for NLP classification. (We expect pretrained models in many more languages to be available soon—they might well be available by the time you are reading this book, in fact.) So, why are we learning how to train a language model in detail?\n",
65
+ "\n",
66
+ "One reason, of course, is that it is helpful to understand the foundations of the models that you are using. But there is another very practical reason, which is that you get even better results if you fine-tune the (sequence-based) language model prior to fine-tuning the classification model. For instance, for the IMDb sentiment analysis task, the dataset includes 50,000 additional movie reviews that do not have any positive or negative labels attached. Since there are 25,000 labeled reviews in the training set and 25,000 in the validation set, that makes 100,000 movie reviews altogether. We can use all of these reviews to fine-tune the pretrained language model, which was trained only on Wikipedia articles; this will result in a language model that is particularly good at predicting the next word of a movie review.\n",
67
+ "\n",
68
+ "This is known as the Universal Language Model Fine-tuning (ULMFiT) approach. The [paper](https://arxiv.org/abs/1801.06146) showed that this extra stage of fine-tuning of the language model, prior to transfer learning to a classification task, resulted in significantly better predictions. Using this approach, we have three stages for transfer learning in NLP, as summarized in <<ulmfit_process>>."
69
+ ]
70
+ },
71
+ {
72
+ "cell_type": "markdown",
73
+ "metadata": {},
74
+ "source": [
75
+ "<img alt=\"Diagram of the ULMFiT process\" width=\"700\" caption=\"The ULMFiT process\" id=\"ulmfit_process\" src=\"images/att_00027.png\">"
76
+ ]
77
+ },
78
+ {
79
+ "cell_type": "markdown",
80
+ "metadata": {},
81
+ "source": [
82
+ "We'll now explore how to apply a neural network to this language modeling problem, using the concepts introduced in the last two chapters. But before reading further, pause and think about how *you* would approach this."
83
+ ]
84
+ },
85
+ {
86
+ "cell_type": "markdown",
87
+ "metadata": {},
88
+ "source": [
89
+ "## Text Preprocessing"
90
+ ]
91
+ },
92
+ {
93
+ "cell_type": "markdown",
94
+ "metadata": {},
95
+ "source": [
96
+ "It's not at all obvious how we're going to use what we've learned so far to build a language model. Sentences can be different lengths, and documents can be very long. So, how can we predict the next word of a sentence using a neural network? Let's find out!\n",
97
+ "\n",
98
+ "We've already seen how categorical variables can be used as independent variables for a neural network. The approach we took for a single categorical variable was to:\n",
99
+ "\n",
100
+ "1. Make a list of all possible levels of that categorical variable (we'll call this list the *vocab*).\n",
101
+ "1. Replace each level with its index in the vocab.\n",
102
+ "1. Create an embedding matrix for this containing a row for each level (i.e., for each item of the vocab).\n",
103
+ "1. Use this embedding matrix as the first layer of a neural network. (A dedicated embedding matrix can take as inputs the raw vocab indexes created in step 2; this is equivalent to but faster and more efficient than a matrix that takes as input one-hot-encoded vectors representing the indexes.)\n",
104
+ "\n",
105
+ "We can do nearly the same thing with text! What is new is the idea of a sequence. First we concatenate all of the documents in our dataset into one big long string and split it into words, giving us a very long list of words (or \"tokens\"). Our independent variable will be the sequence of words starting with the first word in our very long list and ending with the second to last, and our dependent variable will be the sequence of words starting with the second word and ending with the last word. \n",
106
+ "\n",
107
+ "Our vocab will consist of a mix of common words that are already in the vocabulary of our pretrained model and new words specific to our corpus (cinematographic terms or actors' names, for instance). Our embedding matrix will be built accordingly: for words that are in the vocabulary of our pretrained model, we will take the corresponding row in the embedding matrix of the pretrained model; but for new words we won't have anything, so we will just initialize the corresponding row with a random vector."
108
+ ]
109
+ },
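+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To make that last idea concrete, here is a minimal sketch (our own illustration, not fastai's actual implementation) of building an embedding matrix for a new vocab: rows for words the pretrained model already knows are copied over, and rows for new words are initialized randomly. The embedding width and the toy vocabs are made-up values."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import torch\n",
+ "\n",
+ "emb_size = 400                                      # assumed embedding width\n",
+ "old_vocab = {'the': 0, 'movie': 1, 'great': 2}      # toy pretrained vocab -> row index\n",
+ "old_wgts = torch.randn(len(old_vocab), emb_size)    # stand-in for pretrained weights\n",
+ "\n",
+ "new_vocab = ['the', 'movie', 'xxunk', 'tarantino']  # corpus-specific vocab\n",
+ "new_wgts = torch.randn(len(new_vocab), emb_size)    # random init for every row\n",
+ "for i,w in enumerate(new_vocab):\n",
+ "    if w in old_vocab: new_wgts[i] = old_wgts[old_vocab[w]]  # reuse pretrained rows\n",
+ "new_wgts.shape"
+ ]
+ },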
110
+ {
111
+ "cell_type": "markdown",
112
+ "metadata": {},
113
+ "source": [
114
+ "Each of the steps necessary to create a language model has jargon associated with it from the world of natural language processing, and fastai and PyTorch classes available to help. The steps are:\n",
115
+ "\n",
116
+ "- Tokenization:: Convert the text into a list of words (or characters, or substrings, depending on the granularity of your model)\n",
117
+ "- Numericalization:: Make a list of all of the unique words that appear (the vocab), and convert each word into a number, by looking up its index in the vocab\n",
118
+ "- Language model data loader creation:: fastai provides an `LMDataLoader` class which automatically handles creating a dependent variable that is offset from the independent variable by one token. It also handles some important details, such as how to shuffle the training data in such a way that the dependent and independent variables maintain their structure as required\n",
119
+ "- Language model creation:: We need a special kind of model that does something we haven't seen before: handles input lists which could be arbitrarily big or small. There are a number of ways to do this; in this chapter we will be using a *recurrent neural network* (RNN). We will get to the details of these RNNs in the <<chapter_nlp_dive>>, but for now, you can think of it as just another deep neural network.\n",
120
+ "\n",
121
+ "Let's take a look at how each step works in detail."
122
+ ]
123
+ },
124
+ {
125
+ "cell_type": "markdown",
126
+ "metadata": {},
127
+ "source": [
128
+ "### Tokenization"
129
+ ]
130
+ },
131
+ {
132
+ "cell_type": "markdown",
133
+ "metadata": {},
134
+ "source": [
135
+ "When we said \"convert the text into a list of words,\" we left out a lot of details. For instance, what do we do with punctuation? How do we deal with a word like \"don't\"? Is it one word, or two? What about long medical or chemical words? Should they be split into their separate pieces of meaning? How about hyphenated words? What about languages like German and Polish where we can create really long words from many, many pieces? What about languages like Japanese and Chinese that don't use spaces at all, and don't really have a well-defined idea of *word*?\n",
136
+ "\n",
137
+ "Because there is no one correct answer to these questions, there is no one approach to tokenization. There are three main approaches:\n",
138
+ "\n",
139
+ "- Word-based:: Split a sentence on spaces, as well as applying language-specific rules to try to separate parts of meaning even when there are no spaces (such as turning \"don't\" into \"do n't\"). Generally, punctuation marks are also split into separate tokens.\n",
140
+ "- Subword based:: Split words into smaller parts, based on the most commonly occurring substrings. For instance, \"occasion\" might be tokenized as \"o c ca sion.\"\n",
141
+ "- Character-based:: Split a sentence into its individual characters.\n",
142
+ "\n",
143
+ "We'll be looking at word and subword tokenization here, and we'll leave character-based tokenization for you to implement in the questionnaire at the end of this chapter."
144
+ ]
145
+ },
146
+ {
147
+ "cell_type": "markdown",
148
+ "metadata": {},
149
+ "source": [
150
+ "> jargon: token: One element of a list created by the tokenization process. It could be a word, part of a word (a _subword_), or a single character."
151
+ ]
152
+ },
153
+ {
154
+ "cell_type": "markdown",
155
+ "metadata": {},
156
+ "source": [
157
+ "### Word Tokenization with fastai"
158
+ ]
159
+ },
160
+ {
161
+ "cell_type": "markdown",
162
+ "metadata": {},
163
+ "source": [
164
+ "Rather than providing its own tokenizers, fastai instead provides a consistent interface to a range of tokenizers in external libraries. Tokenization is an active field of research, and new and improved tokenizers are coming out all the time, so the defaults that fastai uses change too. However, the API and options shouldn't change too much, since fastai tries to maintain a consistent API even as the underlying technology changes.\n",
165
+ "\n",
166
+ "Let's try it out with the IMDb dataset that we used in <<chapter_intro>>:"
167
+ ]
168
+ },
169
+ {
170
+ "cell_type": "code",
171
+ "execution_count": null,
172
+ "metadata": {},
173
+ "outputs": [],
174
+ "source": [
175
+ "from fastai.text.all import *\n",
176
+ "path = untar_data(URLs.IMDB)"
177
+ ]
178
+ },
179
+ {
180
+ "cell_type": "markdown",
181
+ "metadata": {},
182
+ "source": [
183
+ "We'll need to grab the text files in order to try out a tokenizer. Just like `get_image_files`, which we've used many times already, gets all the image files in a path, `get_text_files` gets all the text files in a path. We can also optionally pass `folders` to restrict the search to a particular list of subfolders:"
184
+ ]
185
+ },
186
+ {
187
+ "cell_type": "code",
188
+ "execution_count": null,
189
+ "metadata": {},
190
+ "outputs": [],
191
+ "source": [
192
+ "files = get_text_files(path, folders = ['train', 'test', 'unsup'])"
193
+ ]
194
+ },
195
+ {
196
+ "cell_type": "markdown",
197
+ "metadata": {},
198
+ "source": [
199
+ "Here's a review that we'll tokenize (we'll just print the start of it here to save space):"
200
+ ]
201
+ },
202
+ {
203
+ "cell_type": "code",
204
+ "execution_count": null,
205
+ "metadata": {},
206
+ "outputs": [
207
+ {
208
+ "data": {
209
+ "text/plain": [
210
+ "'This movie, which I just discovered at the video store, has apparently sit '"
211
+ ]
212
+ },
213
+ "execution_count": null,
214
+ "metadata": {},
215
+ "output_type": "execute_result"
216
+ }
217
+ ],
218
+ "source": [
219
+ "txt = files[0].open().read(); txt[:75]"
220
+ ]
221
+ },
222
+ {
223
+ "cell_type": "markdown",
224
+ "metadata": {},
225
+ "source": [
226
+ "As we write this book, the default English word tokenizer for fastai uses a library called *spaCy*. It has a sophisticated rules engine with special rules for URLs, individual special English words, and much more. Rather than directly using `SpacyTokenizer`, however, we'll use `WordTokenizer`, since that will always point to fastai's current default word tokenizer (which may not necessarily be spaCy, depending when you're reading this).\n",
227
+ "\n",
228
+ "Let's try it out. We'll use fastai's `coll_repr(collection, n)` function to display the results. This displays the first *`n`* items of *`collection`*, along with the full size—it's what `L` uses by default. Note that fastai's tokenizers take a collection of documents to tokenize, so we have to wrap `txt` in a list:"
229
+ ]
230
+ },
231
+ {
232
+ "cell_type": "code",
233
+ "execution_count": null,
234
+ "metadata": {},
235
+ "outputs": [
236
+ {
237
+ "name": "stdout",
238
+ "output_type": "stream",
239
+ "text": [
240
+ "(#201) ['This','movie',',','which','I','just','discovered','at','the','video','store',',','has','apparently','sit','around','for','a','couple','of','years','without','a','distributor','.','It',\"'s\",'easy','to','see'...]\n"
241
+ ]
242
+ }
243
+ ],
244
+ "source": [
245
+ "spacy = WordTokenizer()\n",
246
+ "toks = first(spacy([txt]))\n",
247
+ "print(coll_repr(toks, 30))"
248
+ ]
249
+ },
250
+ {
251
+ "cell_type": "markdown",
252
+ "metadata": {},
253
+ "source": [
254
+ "As you see, spaCy has mainly just separated out the words and punctuation. But it does something else here too: it has split \"it's\" into \"it\" and \"'s\". That makes intuitive sense; these are separate words, really. Tokenization is a surprisingly subtle task, when you think about all the little details that have to be handled. Fortunately, spaCy handles these pretty well for us—for instance, here we see that \".\" is separated when it terminates a sentence, but not in an acronym or number:"
255
+ ]
256
+ },
257
+ {
258
+ "cell_type": "code",
259
+ "execution_count": null,
260
+ "metadata": {},
261
+ "outputs": [
262
+ {
263
+ "data": {
264
+ "text/plain": [
265
+ "(#9) ['The','U.S.','dollar','$','1','is','$','1.00','.']"
266
+ ]
267
+ },
268
+ "execution_count": null,
269
+ "metadata": {},
270
+ "output_type": "execute_result"
271
+ }
272
+ ],
273
+ "source": [
274
+ "first(spacy(['The U.S. dollar $1 is $1.00.']))"
275
+ ]
276
+ },
277
+ {
278
+ "cell_type": "markdown",
279
+ "metadata": {},
280
+ "source": [
281
+ "fastai then adds some additional functionality to the tokenization process with the `Tokenizer` class:"
282
+ ]
283
+ },
284
+ {
285
+ "cell_type": "code",
286
+ "execution_count": null,
287
+ "metadata": {},
288
+ "outputs": [
289
+ {
290
+ "name": "stdout",
291
+ "output_type": "stream",
292
+ "text": [
293
+ "(#228) ['xxbos','xxmaj','this','movie',',','which','i','just','discovered','at','the','video','store',',','has','apparently','sit','around','for','a','couple','of','years','without','a','distributor','.','xxmaj','it',\"'s\",'easy'...]\n"
294
+ ]
295
+ }
296
+ ],
297
+ "source": [
298
+ "tkn = Tokenizer(spacy)\n",
299
+ "print(coll_repr(tkn(txt), 31))"
300
+ ]
301
+ },
302
+ {
303
+ "cell_type": "markdown",
304
+ "metadata": {},
305
+ "source": [
306
+ "Notice that there are now some tokens that start with the characters \"xx\", which is not a common word prefix in English. These are *special tokens*.\n",
307
+ "\n",
308
+ "For example, the first item in the list, `xxbos`, is a special token that indicates the start of a new text (\"BOS\" is a standard NLP acronym that means \"beginning of stream\"). By recognizing this start token, the model will be able to learn it needs to \"forget\" what was said previously and focus on upcoming words.\n",
309
+ "\n",
310
+ "These special tokens don't come from spaCy directly. They are there because fastai adds them by default, by applying a number of rules when processing text. These rules are designed to make it easier for a model to recognize the important parts of a sentence. In a sense, we are translating the original English language sequence into a simplified tokenized language—a language that is designed to be easy for a model to learn.\n",
311
+ "\n",
312
+ "For instance, the rules will replace a sequence of four exclamation points with a special *repeated character* token, followed by the number four, and then a single exclamation point. In this way, the model's embedding matrix can encode information about general concepts such as repeated punctuation rather than requiring a separate token for every number of repetitions of every punctuation mark. Similarly, a capitalized word will be replaced with a special capitalization token, followed by the lowercase version of the word. This way, the embedding matrix only needs the lowercase versions of the words, saving compute and memory resources, but can still learn the concept of capitalization.\n",
313
+ "\n",
314
+ "Here are some of the main special tokens you'll see:\n",
315
+ "\n",
316
+ "- `xxbos`:: Indicates the beginning of a text (here, a review)\n",
317
+ "- `xxmaj`:: Indicates the next word begins with a capital (since we lowercased everything)\n",
318
+ "- `xxunk`:: Indicates the word is unknown\n",
319
+ "\n",
320
+ "To see the rules that were used, you can check the default rules:"
321
+ ]
322
+ },
323
+ {
324
+ "cell_type": "code",
325
+ "execution_count": null,
326
+ "metadata": {},
327
+ "outputs": [
328
+ {
329
+ "data": {
330
+ "text/plain": [
331
+ "[<function fastai.text.core.fix_html(x)>,\n",
332
+ " <function fastai.text.core.replace_rep(t)>,\n",
333
+ " <function fastai.text.core.replace_wrep(t)>,\n",
334
+ " <function fastai.text.core.spec_add_spaces(t)>,\n",
335
+ " <function fastai.text.core.rm_useless_spaces(t)>,\n",
336
+ " <function fastai.text.core.replace_all_caps(t)>,\n",
337
+ " <function fastai.text.core.replace_maj(t)>,\n",
338
+ " <function fastai.text.core.lowercase(t, add_bos=True, add_eos=False)>]"
339
+ ]
340
+ },
341
+ "execution_count": null,
342
+ "metadata": {},
343
+ "output_type": "execute_result"
344
+ }
345
+ ],
346
+ "source": [
347
+ "defaults.text_proc_rules"
348
+ ]
349
+ },
350
+ {
351
+ "cell_type": "markdown",
352
+ "metadata": {},
353
+ "source": [
354
+ "As always, you can look at the source code of each of them in a notebook by typing:\n",
355
+ "\n",
356
+ "```\n",
357
+ "??replace_rep\n",
358
+ "```\n",
359
+ "\n",
360
+ "Here is a brief summary of what each does:\n",
361
+ "\n",
362
+ "- `fix_html`:: Replaces special HTML characters with a readable version (IMDb reviews have quite a few of these)\n",
363
+ "- `replace_rep`:: Replaces any character repeated three times or more with a special token for repetition (`xxrep`), the number of times it's repeated, then the character\n",
364
+ "- `replace_wrep`:: Replaces any word repeated three times or more with a special token for word repetition (`xxwrep`), the number of times it's repeated, then the word\n",
365
+ "- `spec_add_spaces`:: Adds spaces around / and #\n",
366
+ "- `rm_useless_spaces`:: Removes all repetitions of the space character\n",
367
+ "- `replace_all_caps`:: Lowercases a word written in all caps and adds a special token for all caps (`xxup`) in front of it\n",
368
+ "- `replace_maj`:: Lowercases a capitalized word and adds a special token for capitalized (`xxmaj`) in front of it\n",
369
+ "- `lowercase`:: Lowercases all text and adds a special token at the beginning (`xxbos`) and/or the end (`xxeos`)"
370
+ ]
371
+ },
372
+ {
373
+ "cell_type": "markdown",
374
+ "metadata": {},
375
+ "source": [
376
+ "Let's take a look at a few of them in action:"
377
+ ]
378
+ },
379
+ {
380
+ "cell_type": "code",
381
+ "execution_count": null,
382
+ "metadata": {},
383
+ "outputs": [
384
+ {
385
+ "data": {
386
+ "text/plain": [
387
+ "\"(#11) ['xxbos','©','xxmaj','fast.ai','xxrep','3','w','.fast.ai','/','xxup','index'...]\""
388
+ ]
389
+ },
390
+ "execution_count": null,
391
+ "metadata": {},
392
+ "output_type": "execute_result"
393
+ }
394
+ ],
395
+ "source": [
396
+ "coll_repr(tkn('&copy; Fast.ai www.fast.ai/INDEX'), 31)"
397
+ ]
398
+ },
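+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The rule functions can also be called on their own, which is handy if you want to see what a single rule does in isolation. This is just a quick check on our part; the exact spacing of the output may differ slightly between fastai versions:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# two of the default rules listed above, applied directly to raw strings\n",
+ "replace_rep('This was amazing!!!!'), replace_wrep('This was very very very good')"
+ ]
+ },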
399
+ {
400
+ "cell_type": "markdown",
401
+ "metadata": {},
402
+ "source": [
403
+ "Now let's take a look at how subword tokenization would work."
404
+ ]
405
+ },
406
+ {
407
+ "cell_type": "markdown",
408
+ "metadata": {},
409
+ "source": [
410
+ "### Subword Tokenization"
411
+ ]
412
+ },
413
+ {
414
+ "cell_type": "markdown",
415
+ "metadata": {},
416
+ "source": [
417
+ "In addition to the *word tokenization* approach seen in the last section, another popular tokenization method is *subword tokenization*. Word tokenization relies on an assumption that spaces provide a useful separation of components of meaning in a sentence. However, this assumption is not always appropriate. For instance, consider this sentence: 我的名字是郝杰瑞 (\"My name is Jeremy Howard\" in Chinese). That's not going to work very well with a word tokenizer, because there are no spaces in it! Languages like Chinese and Japanese don't use spaces, and in fact they don't even have a well-defined concept of a \"word.\" There are also languages, like Turkish and Hungarian, that can add many subwords together without spaces, creating very long words that include a lot of separate pieces of information.\n",
418
+ "\n",
419
+ "To handle these cases, it's generally best to use subword tokenization. This proceeds in two steps:\n",
420
+ "\n",
421
+ "1. Analyze a corpus of documents to find the most commonly occurring groups of letters. These become the vocab.\n",
422
+ "2. Tokenize the corpus using this vocab of *subword units*.\n",
423
+ "\n",
424
+ "Let's look at an example. For our corpus, we'll use the first 2,000 movie reviews:"
425
+ ]
426
+ },
427
+ {
428
+ "cell_type": "code",
429
+ "execution_count": null,
430
+ "metadata": {},
431
+ "outputs": [],
432
+ "source": [
433
+ "txts = L(o.open().read() for o in files[:2000])"
434
+ ]
435
+ },
436
+ {
437
+ "cell_type": "markdown",
438
+ "metadata": {},
439
+ "source": [
440
+ "We instantiate our tokenizer, passing in the size of the vocab we want to create, and then we need to \"train\" it. That is, we need to have it read our documents and find the common sequences of characters to create the vocab. This is done with `setup`. As we'll see shortly, `setup` is a special fastai method that is called automatically in our usual data processing pipelines. Since we're doing everything manually at the moment, however, we have to call it ourselves. Here's a function that does these steps for a given vocab size, and shows an example output:"
441
+ ]
442
+ },
443
+ {
444
+ "cell_type": "code",
445
+ "execution_count": null,
446
+ "metadata": {},
447
+ "outputs": [],
448
+ "source": [
449
+ "def subword(sz):\n",
450
+ " sp = SubwordTokenizer(vocab_sz=sz)\n",
451
+ " sp.setup(txts)\n",
452
+ " return ' '.join(first(sp([txt]))[:40])"
453
+ ]
454
+ },
455
+ {
456
+ "cell_type": "markdown",
457
+ "metadata": {},
458
+ "source": [
459
+ "Let's try it out:"
460
+ ]
461
+ },
462
+ {
463
+ "cell_type": "code",
464
+ "execution_count": null,
465
+ "metadata": {},
466
+ "outputs": [
467
+ {
468
+ "data": {
469
+ "text/html": [],
470
+ "text/plain": [
471
+ "<IPython.core.display.HTML object>"
472
+ ]
473
+ },
474
+ "metadata": {},
475
+ "output_type": "display_data"
476
+ },
477
+ {
478
+ "data": {
479
+ "text/plain": [
480
+ "'▁This ▁movie , ▁which ▁I ▁just ▁dis c over ed ▁at ▁the ▁video ▁st or e , ▁has ▁a p par ent ly ▁s it ▁around ▁for ▁a ▁couple ▁of ▁years ▁without ▁a ▁dis t ri but or . ▁It'"
481
+ ]
482
+ },
483
+ "execution_count": null,
484
+ "metadata": {},
485
+ "output_type": "execute_result"
486
+ }
487
+ ],
488
+ "source": [
489
+ "subword(1000)"
490
+ ]
491
+ },
492
+ {
493
+ "cell_type": "markdown",
494
+ "metadata": {},
495
+ "source": [
496
+ "When using fastai's subword tokenizer, the special character `▁` represents a space character in the original text.\n",
497
+ "\n",
498
+ "If we use a smaller vocab, then each token will represent fewer characters, and it will take more tokens to represent a sentence:"
499
+ ]
500
+ },
501
+ {
502
+ "cell_type": "code",
503
+ "execution_count": null,
504
+ "metadata": {},
505
+ "outputs": [
506
+ {
507
+ "data": {
508
+ "text/html": [],
509
+ "text/plain": [
510
+ "<IPython.core.display.HTML object>"
511
+ ]
512
+ },
513
+ "metadata": {},
514
+ "output_type": "display_data"
515
+ },
516
+ {
517
+ "data": {
518
+ "text/plain": [
519
+ "'▁ T h i s ▁movie , ▁w h i ch ▁I ▁ j us t ▁ d i s c o ver ed ▁a t ▁the ▁ v id e o ▁ st or e , ▁h a s'"
520
+ ]
521
+ },
522
+ "execution_count": null,
523
+ "metadata": {},
524
+ "output_type": "execute_result"
525
+ }
526
+ ],
527
+ "source": [
528
+ "subword(200)"
529
+ ]
530
+ },
531
+ {
532
+ "cell_type": "markdown",
533
+ "metadata": {},
534
+ "source": [
535
+ "On the other hand, if we use a larger vocab, then most common English words will end up in the vocab themselves, and we will not need as many to represent a sentence:"
536
+ ]
537
+ },
538
+ {
539
+ "cell_type": "code",
540
+ "execution_count": null,
541
+ "metadata": {},
542
+ "outputs": [
543
+ {
544
+ "data": {
545
+ "text/html": [],
546
+ "text/plain": [
547
+ "<IPython.core.display.HTML object>"
548
+ ]
549
+ },
550
+ "metadata": {},
551
+ "output_type": "display_data"
552
+ },
553
+ {
554
+ "data": {
555
+ "text/plain": [
556
+ "\"▁This ▁movie , ▁which ▁I ▁just ▁discover ed ▁at ▁the ▁video ▁store , ▁has ▁apparently ▁sit ▁around ▁for ▁a ▁couple ▁of ▁years ▁without ▁a ▁distributor . ▁It ' s ▁easy ▁to ▁see ▁why . ▁The ▁story ▁of ▁two ▁friends ▁living\""
557
+ ]
558
+ },
559
+ "execution_count": null,
560
+ "metadata": {},
561
+ "output_type": "execute_result"
562
+ }
563
+ ],
564
+ "source": [
565
+ "subword(10000)"
566
+ ]
567
+ },
568
+ {
569
+ "cell_type": "markdown",
570
+ "metadata": {},
571
+ "source": [
572
+ "Picking a subword vocab size represents a compromise: a larger vocab means fewer tokens per sentence, which means faster training, less memory, and less state for the model to remember; but on the downside, it means larger embedding matrices, which require more data to learn.\n",
573
+ "\n",
574
+ "Overall, subword tokenization provides a way to easily scale between character tokenization (i.e., using a small subword vocab) and word tokenization (i.e., using a large subword vocab), and handles every human language without needing language-specific algorithms to be developed. It can even handle other \"languages\" such as genomic sequences or MIDI music notation! For this reason, in the last year its popularity has soared, and it seems likely to become the most common tokenization approach (it may well already be, by the time you read this!)."
575
+ ]
576
+ },
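+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a rough, hands-on check of this compromise (our addition, reusing the `txt` and `txts` variables from above), we can count how many subword tokens the same review needs at a few different vocab sizes:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "for sz in (200, 1000, 10000):\n",
+ "    sp = SubwordTokenizer(vocab_sz=sz)  # same calls as in `subword` above\n",
+ "    sp.setup(txts)\n",
+ "    print(sz, len(first(sp([txt]))))"
+ ]
+ },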
577
+ {
578
+ "cell_type": "markdown",
579
+ "metadata": {},
580
+ "source": [
581
+ "Once our texts have been split into tokens, we need to convert them to numbers. We'll look at that next."
582
+ ]
583
+ },
584
+ {
585
+ "cell_type": "markdown",
586
+ "metadata": {},
587
+ "source": [
588
+ "### Numericalization with fastai"
589
+ ]
590
+ },
591
+ {
592
+ "cell_type": "markdown",
593
+ "metadata": {},
594
+ "source": [
595
+ "*Numericalization* is the process of mapping tokens to integers. The steps are basically identical to those necessary to create a `Category` variable, such as the dependent variable of digits in MNIST:\n",
596
+ "\n",
597
+ "1. Make a list of all possible levels of that categorical variable (the vocab).\n",
598
+ "1. Replace each level with its index in the vocab.\n",
599
+ "\n",
600
+ "Let's take a look at this in action on the word-tokenized text we saw earlier:"
601
+ ]
602
+ },
603
+ {
604
+ "cell_type": "code",
605
+ "execution_count": null,
606
+ "metadata": {},
607
+ "outputs": [
608
+ {
609
+ "name": "stdout",
610
+ "output_type": "stream",
611
+ "text": [
612
+ "(#228) ['xxbos','xxmaj','this','movie',',','which','i','just','discovered','at','the','video','store',',','has','apparently','sit','around','for','a','couple','of','years','without','a','distributor','.','xxmaj','it',\"'s\",'easy'...]\n"
613
+ ]
614
+ }
615
+ ],
616
+ "source": [
617
+ "toks = tkn(txt)\n",
618
+ "print(coll_repr(tkn(txt), 31))"
619
+ ]
620
+ },
621
+ {
622
+ "cell_type": "markdown",
623
+ "metadata": {},
624
+ "source": [
625
+ "Just like with `SubwordTokenizer`, we need to call `setup` on `Numericalize`; this is how we create the vocab. That means we'll need our tokenized corpus first. Since tokenization takes a while, it's done in parallel by fastai; but for this manual walkthrough, we'll use a small subset:"
626
+ ]
627
+ },
628
+ {
629
+ "cell_type": "code",
630
+ "execution_count": null,
631
+ "metadata": {},
632
+ "outputs": [
633
+ {
634
+ "data": {
635
+ "text/plain": [
636
+ "(#228) ['xxbos','xxmaj','this','movie',',','which','i','just','discovered','at'...]"
637
+ ]
638
+ },
639
+ "execution_count": null,
640
+ "metadata": {},
641
+ "output_type": "execute_result"
642
+ }
643
+ ],
644
+ "source": [
645
+ "toks200 = txts[:200].map(tkn)\n",
646
+ "toks200[0]"
647
+ ]
648
+ },
649
+ {
650
+ "cell_type": "markdown",
651
+ "metadata": {},
652
+ "source": [
653
+ "We can pass this to `setup` to create our vocab:"
654
+ ]
655
+ },
656
+ {
657
+ "cell_type": "code",
658
+ "execution_count": null,
659
+ "metadata": {},
660
+ "outputs": [
661
+ {
662
+ "data": {
663
+ "text/plain": [
664
+ "\"(#2000) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','a','and','of','to','is','in','i','it'...]\""
665
+ ]
666
+ },
667
+ "execution_count": null,
668
+ "metadata": {},
669
+ "output_type": "execute_result"
670
+ }
671
+ ],
672
+ "source": [
673
+ "num = Numericalize()\n",
674
+ "num.setup(toks200)\n",
675
+ "coll_repr(num.vocab,20)"
676
+ ]
677
+ },
678
+ {
679
+ "cell_type": "markdown",
680
+ "metadata": {},
681
+ "source": [
682
+ "Our special rules tokens appear first, and then every word appears once, in frequency order. The defaults to `Numericalize` are `min_freq=3,max_vocab=60000`. `max_vocab=60000` results in fastai replacing all words other than the most common 60,000 with a special *unknown word* token, `xxunk`. This is useful to avoid having an overly large embedding matrix, since that can slow down training and use up too much memory, and can also mean that there isn't enough data to train useful representations for rare words. However, this last issue is better handled by setting `min_freq`; the default `min_freq=3` means that any word appearing less than three times is replaced with `xxunk`.\n",
683
+ "\n",
684
+ "fastai can also numericalize your dataset using a vocab that you provide, by passing a list of words as the `vocab` parameter.\n",
685
+ "\n",
686
+ "Once we've created our `Numericalize` object, we can use it as if it were a function:"
687
+ ]
688
+ },
689
+ {
690
+ "cell_type": "code",
691
+ "execution_count": null,
692
+ "metadata": {},
693
+ "outputs": [
694
+ {
695
+ "data": {
696
+ "text/plain": [
697
+ "tensor([ 2, 8, 21, 28, 11, 90, 18, 59, 0, 45, 9, 351, 499, 11, 72, 533, 584, 146, 29, 12])"
698
+ ]
699
+ },
700
+ "execution_count": null,
701
+ "metadata": {},
702
+ "output_type": "execute_result"
703
+ }
704
+ ],
705
+ "source": [
706
+ "nums = num(toks)[:20]; nums"
707
+ ]
708
+ },
709
+ {
710
+ "cell_type": "markdown",
711
+ "metadata": {},
712
+ "source": [
713
+ "This time, our tokens have been converted to a tensor of integers that our model can receive. We can check that they map back to the original text:"
714
+ ]
715
+ },
716
+ {
717
+ "cell_type": "code",
718
+ "execution_count": null,
719
+ "metadata": {},
720
+ "outputs": [
721
+ {
722
+ "data": {
723
+ "text/plain": [
724
+ "'xxbos xxmaj this movie , which i just xxunk at the video store , has apparently sit around for a'"
725
+ ]
726
+ },
727
+ "execution_count": null,
728
+ "metadata": {},
729
+ "output_type": "execute_result"
730
+ }
731
+ ],
732
+ "source": [
733
+ "' '.join(num.vocab[o] for o in nums)"
734
+ ]
735
+ },
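+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `min_freq` and `max_vocab` defaults mentioned above can be overridden when creating the `Numericalize` object. As a quick experiment of ours, lowering `min_freq` keeps every word that appears in the corpus, so the vocab grows and fewer tokens get mapped to `xxunk`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "num1 = Numericalize(min_freq=1)  # keep every word seen at least once\n",
+ "num1.setup(toks200)\n",
+ "len(num.vocab), len(num1.vocab)"
+ ]
+ },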
736
+ {
737
+ "cell_type": "markdown",
738
+ "metadata": {},
739
+ "source": [
740
+ "Now that we have numbers, we need to put them in batches for our model."
741
+ ]
742
+ },
743
+ {
744
+ "cell_type": "markdown",
745
+ "metadata": {},
746
+ "source": [
747
+ "### Putting Our Texts into Batches for a Language Model"
748
+ ]
749
+ },
750
+ {
751
+ "cell_type": "markdown",
752
+ "metadata": {},
753
+ "source": [
754
+ "When dealing with images, we needed to resize them all to the same height and width before grouping them together in a mini-batch so they could stack together efficiently in a single tensor. Here it's going to be a little different, because one cannot simply resize text to a desired length. Also, we want our language model to read text in order, so that it can efficiently predict what the next word is. This means that each new batch should begin precisely where the previous one left off.\n",
755
+ "\n",
756
+ "Suppose we have the following text:\n",
757
+ "\n",
758
+ "> : In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\\nThen we will study how we build a language model and train it for a while.\n",
759
+ "\n",
760
+ "The tokenization process will add special tokens and deal with punctuation to return this text:\n",
761
+ "\n",
762
+ "> : xxbos xxmaj in this chapter , we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface . xxmaj first we will look at the processing steps necessary to convert text into numbers and how to customize it . xxmaj by doing this , we 'll have another example of the preprocessor used in the data block xxup api . \\n xxmaj then we will study how we build a language model and train it for a while .\n",
763
+ "\n",
764
+ "We now have 90 tokens, separated by spaces. Let's say we want a batch size of 6. We need to break this text into 6 contiguous parts of length 15:"
765
+ ]
766
+ },
767
+ {
768
+ "cell_type": "code",
769
+ "execution_count": null,
770
+ "metadata": {
771
+ "hide_input": false
772
+ },
773
+ "outputs": [
774
+ {
775
+ "data": {
776
+ "text/html": [
777
+ "<table border=\"1\" class=\"dataframe\">\n",
778
+ " <tbody>\n",
779
+ " <tr>\n",
780
+ " <td>xxbos</td>\n",
781
+ " <td>xxmaj</td>\n",
782
+ " <td>in</td>\n",
783
+ " <td>this</td>\n",
784
+ " <td>chapter</td>\n",
785
+ " <td>,</td>\n",
786
+ " <td>we</td>\n",
787
+ " <td>will</td>\n",
788
+ " <td>go</td>\n",
789
+ " <td>back</td>\n",
790
+ " <td>over</td>\n",
791
+ " <td>the</td>\n",
792
+ " <td>example</td>\n",
793
+ " <td>of</td>\n",
794
+ " <td>classifying</td>\n",
795
+ " </tr>\n",
796
+ " <tr>\n",
797
+ " <td>movie</td>\n",
798
+ " <td>reviews</td>\n",
799
+ " <td>we</td>\n",
800
+ " <td>studied</td>\n",
801
+ " <td>in</td>\n",
802
+ " <td>chapter</td>\n",
803
+ " <td>1</td>\n",
804
+ " <td>and</td>\n",
805
+ " <td>dig</td>\n",
806
+ " <td>deeper</td>\n",
807
+ " <td>under</td>\n",
808
+ " <td>the</td>\n",
809
+ " <td>surface</td>\n",
810
+ " <td>.</td>\n",
811
+ " <td>xxmaj</td>\n",
812
+ " </tr>\n",
813
+ " <tr>\n",
814
+ " <td>first</td>\n",
815
+ " <td>we</td>\n",
816
+ " <td>will</td>\n",
817
+ " <td>look</td>\n",
818
+ " <td>at</td>\n",
819
+ " <td>the</td>\n",
820
+ " <td>processing</td>\n",
821
+ " <td>steps</td>\n",
822
+ " <td>necessary</td>\n",
823
+ " <td>to</td>\n",
824
+ " <td>convert</td>\n",
825
+ " <td>text</td>\n",
826
+ " <td>into</td>\n",
827
+ " <td>numbers</td>\n",
828
+ " <td>and</td>\n",
829
+ " </tr>\n",
830
+ " <tr>\n",
831
+ " <td>how</td>\n",
832
+ " <td>to</td>\n",
833
+ " <td>customize</td>\n",
834
+ " <td>it</td>\n",
835
+ " <td>.</td>\n",
836
+ " <td>xxmaj</td>\n",
837
+ " <td>by</td>\n",
838
+ " <td>doing</td>\n",
839
+ " <td>this</td>\n",
840
+ " <td>,</td>\n",
841
+ " <td>we</td>\n",
842
+ " <td>'ll</td>\n",
843
+ " <td>have</td>\n",
844
+ " <td>another</td>\n",
845
+ " <td>example</td>\n",
846
+ " </tr>\n",
847
+ " <tr>\n",
848
+ " <td>of</td>\n",
849
+ " <td>the</td>\n",
850
+ " <td>preprocessor</td>\n",
851
+ " <td>used</td>\n",
852
+ " <td>in</td>\n",
853
+ " <td>the</td>\n",
854
+ " <td>data</td>\n",
855
+ " <td>block</td>\n",
856
+ " <td>xxup</td>\n",
857
+ " <td>api</td>\n",
858
+ " <td>.</td>\n",
859
+ " <td>\\n</td>\n",
860
+ " <td>xxmaj</td>\n",
861
+ " <td>then</td>\n",
862
+ " <td>we</td>\n",
863
+ " </tr>\n",
864
+ " <tr>\n",
865
+ " <td>will</td>\n",
866
+ " <td>study</td>\n",
867
+ " <td>how</td>\n",
868
+ " <td>we</td>\n",
869
+ " <td>build</td>\n",
870
+ " <td>a</td>\n",
871
+ " <td>language</td>\n",
872
+ " <td>model</td>\n",
873
+ " <td>and</td>\n",
874
+ " <td>train</td>\n",
875
+ " <td>it</td>\n",
876
+ " <td>for</td>\n",
877
+ " <td>a</td>\n",
878
+ " <td>while</td>\n",
879
+ " <td>.</td>\n",
880
+ " </tr>\n",
881
+ " </tbody>\n",
882
+ "</table>"
883
+ ],
884
+ "text/plain": [
885
+ "<IPython.core.display.HTML object>"
886
+ ]
887
+ },
888
+ "metadata": {},
889
+ "output_type": "display_data"
890
+ }
891
+ ],
892
+ "source": [
893
+ "#hide_input\n",
894
+ "stream = \"In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\\nThen we will study how we build a language model and train it for a while.\"\n",
895
+ "tokens = tkn(stream)\n",
896
+ "bs,seq_len = 6,15\n",
897
+ "d_tokens = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(bs)])\n",
898
+ "df = pd.DataFrame(d_tokens)\n",
899
+ "display(HTML(df.to_html(index=False,header=None)))"
900
+ ]
901
+ },
902
+ {
903
+ "cell_type": "markdown",
904
+ "metadata": {},
905
+ "source": [
906
+ "In a perfect world, we could then give this one batch to our model. But that approach doesn't scale, because outside of this toy example it's unlikely that a single batch containing all the texts would fit in our GPU memory (here we have 90 tokens, but all the IMDb reviews together give several million).\n",
907
+ "\n",
908
+ "So, we need to divide this array more finely into subarrays of a fixed sequence length. It is important to maintain order within and across these subarrays, because we will use a model that maintains a state so that it remembers what it read previously when predicting what comes next. \n",
909
+ "\n",
910
+ "Going back to our previous example with 6 batches of length 15, if we chose a sequence length of 5, that would mean we first feed the following array:"
911
+ ]
912
+ },
913
+ {
914
+ "cell_type": "code",
915
+ "execution_count": null,
916
+ "metadata": {
917
+ "hide_input": true
918
+ },
919
+ "outputs": [
920
+ {
921
+ "data": {
922
+ "text/html": [
923
+ "<table border=\"1\" class=\"dataframe\">\n",
924
+ " <tbody>\n",
925
+ " <tr>\n",
926
+ " <td>xxbos</td>\n",
927
+ " <td>xxmaj</td>\n",
928
+ " <td>in</td>\n",
929
+ " <td>this</td>\n",
930
+ " <td>chapter</td>\n",
931
+ " </tr>\n",
932
+ " <tr>\n",
933
+ " <td>movie</td>\n",
934
+ " <td>reviews</td>\n",
935
+ " <td>we</td>\n",
936
+ " <td>studied</td>\n",
937
+ " <td>in</td>\n",
938
+ " </tr>\n",
939
+ " <tr>\n",
940
+ " <td>first</td>\n",
941
+ " <td>we</td>\n",
942
+ " <td>will</td>\n",
943
+ " <td>look</td>\n",
944
+ " <td>at</td>\n",
945
+ " </tr>\n",
946
+ " <tr>\n",
947
+ " <td>how</td>\n",
948
+ " <td>to</td>\n",
949
+ " <td>customize</td>\n",
950
+ " <td>it</td>\n",
951
+ " <td>.</td>\n",
952
+ " </tr>\n",
953
+ " <tr>\n",
954
+ " <td>of</td>\n",
955
+ " <td>the</td>\n",
956
+ " <td>preprocessor</td>\n",
957
+ " <td>used</td>\n",
958
+ " <td>in</td>\n",
959
+ " </tr>\n",
960
+ " <tr>\n",
961
+ " <td>will</td>\n",
962
+ " <td>study</td>\n",
963
+ " <td>how</td>\n",
964
+ " <td>we</td>\n",
965
+ " <td>build</td>\n",
966
+ " </tr>\n",
967
+ " </tbody>\n",
968
+ "</table>"
969
+ ],
970
+ "text/plain": [
971
+ "<IPython.core.display.HTML object>"
972
+ ]
973
+ },
974
+ "metadata": {},
975
+ "output_type": "display_data"
976
+ }
977
+ ],
978
+ "source": [
979
+ "#hide_input\n",
980
+ "bs,seq_len = 6,5\n",
981
+ "d_tokens = np.array([tokens[i*15:i*15+seq_len] for i in range(bs)])\n",
982
+ "df = pd.DataFrame(d_tokens)\n",
983
+ "display(HTML(df.to_html(index=False,header=None)))"
984
+ ]
985
+ },
986
+ {
987
+ "cell_type": "markdown",
988
+ "metadata": {},
989
+ "source": [
990
+ "Then this one:"
991
+ ]
992
+ },
993
+ {
994
+ "cell_type": "code",
995
+ "execution_count": null,
996
+ "metadata": {
997
+ "hide_input": true
998
+ },
999
+ "outputs": [
1000
+ {
1001
+ "data": {
1002
+ "text/html": [
1003
+ "<table border=\"1\" class=\"dataframe\">\n",
1004
+ " <tbody>\n",
1005
+ " <tr>\n",
1006
+ " <td>,</td>\n",
1007
+ " <td>we</td>\n",
1008
+ " <td>will</td>\n",
1009
+ " <td>go</td>\n",
1010
+ " <td>back</td>\n",
1011
+ " </tr>\n",
1012
+ " <tr>\n",
1013
+ " <td>chapter</td>\n",
1014
+ " <td>1</td>\n",
1015
+ " <td>and</td>\n",
1016
+ " <td>dig</td>\n",
1017
+ " <td>deeper</td>\n",
1018
+ " </tr>\n",
1019
+ " <tr>\n",
1020
+ " <td>the</td>\n",
1021
+ " <td>processing</td>\n",
1022
+ " <td>steps</td>\n",
1023
+ " <td>necessary</td>\n",
1024
+ " <td>to</td>\n",
1025
+ " </tr>\n",
1026
+ " <tr>\n",
1027
+ " <td>xxmaj</td>\n",
1028
+ " <td>by</td>\n",
1029
+ " <td>doing</td>\n",
1030
+ " <td>this</td>\n",
1031
+ " <td>,</td>\n",
1032
+ " </tr>\n",
1033
+ " <tr>\n",
1034
+ " <td>the</td>\n",
1035
+ " <td>data</td>\n",
1036
+ " <td>block</td>\n",
1037
+ " <td>xxup</td>\n",
1038
+ " <td>api</td>\n",
1039
+ " </tr>\n",
1040
+ " <tr>\n",
1041
+ " <td>a</td>\n",
1042
+ " <td>language</td>\n",
1043
+ " <td>model</td>\n",
1044
+ " <td>and</td>\n",
1045
+ " <td>train</td>\n",
1046
+ " </tr>\n",
1047
+ " </tbody>\n",
1048
+ "</table>"
1049
+ ],
1050
+ "text/plain": [
1051
+ "<IPython.core.display.HTML object>"
1052
+ ]
1053
+ },
1054
+ "metadata": {},
1055
+ "output_type": "display_data"
1056
+ }
1057
+ ],
1058
+ "source": [
1059
+ "#hide_input\n",
1060
+ "bs,seq_len = 6,5\n",
1061
+ "d_tokens = np.array([tokens[i*15+seq_len:i*15+2*seq_len] for i in range(bs)])\n",
1062
+ "df = pd.DataFrame(d_tokens)\n",
1063
+ "display(HTML(df.to_html(index=False,header=None)))"
1064
+ ]
1065
+ },
1066
+ {
1067
+ "cell_type": "markdown",
1068
+ "metadata": {},
1069
+ "source": [
1070
+ "And finally:"
1071
+ ]
1072
+ },
1073
+ {
1074
+ "cell_type": "code",
1075
+ "execution_count": null,
1076
+ "metadata": {
1077
+ "hide_input": true
1078
+ },
1079
+ "outputs": [
1080
+ {
1081
+ "data": {
1082
+ "text/html": [
1083
+ "<table border=\"1\" class=\"dataframe\">\n",
1084
+ " <tbody>\n",
1085
+ " <tr>\n",
1086
+ " <td>over</td>\n",
1087
+ " <td>the</td>\n",
1088
+ " <td>example</td>\n",
1089
+ " <td>of</td>\n",
1090
+ " <td>classifying</td>\n",
1091
+ " </tr>\n",
1092
+ " <tr>\n",
1093
+ " <td>under</td>\n",
1094
+ " <td>the</td>\n",
1095
+ " <td>surface</td>\n",
1096
+ " <td>.</td>\n",
1097
+ " <td>xxmaj</td>\n",
1098
+ " </tr>\n",
1099
+ " <tr>\n",
1100
+ " <td>convert</td>\n",
1101
+ " <td>text</td>\n",
1102
+ " <td>into</td>\n",
1103
+ " <td>numbers</td>\n",
1104
+ " <td>and</td>\n",
1105
+ " </tr>\n",
1106
+ " <tr>\n",
1107
+ " <td>we</td>\n",
1108
+ " <td>'ll</td>\n",
1109
+ " <td>have</td>\n",
1110
+ " <td>another</td>\n",
1111
+ " <td>example</td>\n",
1112
+ " </tr>\n",
1113
+ " <tr>\n",
1114
+ " <td>.</td>\n",
1115
+ " <td>\\n</td>\n",
1116
+ " <td>xxmaj</td>\n",
1117
+ " <td>then</td>\n",
1118
+ " <td>we</td>\n",
1119
+ " </tr>\n",
1120
+ " <tr>\n",
1121
+ " <td>it</td>\n",
1122
+ " <td>for</td>\n",
1123
+ " <td>a</td>\n",
1124
+ " <td>while</td>\n",
1125
+ " <td>.</td>\n",
1126
+ " </tr>\n",
1127
+ " </tbody>\n",
1128
+ "</table>"
1129
+ ],
1130
+ "text/plain": [
1131
+ "<IPython.core.display.HTML object>"
1132
+ ]
1133
+ },
1134
+ "metadata": {},
1135
+ "output_type": "display_data"
1136
+ }
1137
+ ],
1138
+ "source": [
1139
+ "#hide_input\n",
1140
+ "bs,seq_len = 6,5\n",
1141
+ "d_tokens = np.array([tokens[i*15+10:i*15+15] for i in range(bs)])\n",
1142
+ "df = pd.DataFrame(d_tokens)\n",
1143
+ "display(HTML(df.to_html(index=False,header=None)))"
1144
+ ]
1145
+ },
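+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The three tables above were laid out by hand; the sketch below (our own illustration in plain PyTorch, not the actual `LMDataLoader` code) shows the same idea programmatically: reshape one long stream into `bs` contiguous mini-streams, then walk along them `seq_len` tokens at a time, with the targets offset by one token."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import torch\n",
+ "\n",
+ "stream = torch.arange(96)           # stand-in stream; 96 so 6 rows of 16 slice evenly\n",
+ "bs, seq_len = 6, 5\n",
+ "mini_streams = stream.view(bs, -1)  # 6 rows of 16 contiguous tokens each\n",
+ "\n",
+ "for i in range(0, mini_streams.shape[1] - seq_len, seq_len):\n",
+ "    x = mini_streams[:, i:i+seq_len]      # independent variable\n",
+ "    y = mini_streams[:, i+1:i+seq_len+1]  # same tokens, offset by one\n",
+ "    print(x.shape, y.shape)"
+ ]
+ },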
1146
+ {
1147
+ "cell_type": "markdown",
1148
+ "metadata": {},
1149
+ "source": [
1150
+ "Going back to our movie reviews dataset, the first step is to transform the individual texts into a stream by concatenating them together. As with images, it's best to randomize the order of the inputs, so at the beginning of each epoch we will shuffle the entries to make a new stream (we shuffle the order of the documents, not the order of the words inside them, or the texts would not make sense anymore!).\n",
1151
+ "\n",
1152
+ "We then cut this stream into a certain number of batches (which is our *batch size*). For instance, if the stream has 50,000 tokens and we set a batch size of 10, this will give us 10 mini-streams of 5,000 tokens. What is important is that we preserve the order of the tokens (so from 1 to 5,000 for the first mini-stream, then from 5,001 to 10,000...), because we want the model to read continuous rows of text (as in the preceding example). An `xxbos` token is added at the start of each text during preprocessing, so that the model knows, as it reads the stream, when a new entry begins.\n",
1153
+ "\n",
1154
+ "So to recap, at every epoch we shuffle our collection of documents and concatenate them into a stream of tokens. We then cut that stream into a batch of fixed-size consecutive mini-streams. Our model will then read the mini-streams in order, and thanks to an inner state, it will produce the same activation whatever sequence length we picked.\n",
1155
+ "\n",
1156
+ "This is all done behind the scenes by the fastai library when we create an `LMDataLoader`. We do this by first applying our `Numericalize` object to the tokenized texts:"
1157
+ ]
1158
+ },
1159
+ {
1160
+ "cell_type": "code",
1161
+ "execution_count": null,
1162
+ "metadata": {},
1163
+ "outputs": [],
1164
+ "source": [
1165
+ "nums200 = toks200.map(num)"
1166
+ ]
1167
+ },
1168
+ {
1169
+ "cell_type": "markdown",
1170
+ "metadata": {},
1171
+ "source": [
1172
+ "and then passing that to `LMDataLoader`:"
1173
+ ]
1174
+ },
1175
+ {
1176
+ "cell_type": "code",
1177
+ "execution_count": null,
1178
+ "metadata": {},
1179
+ "outputs": [],
1180
+ "source": [
1181
+ "dl = LMDataLoader(nums200)"
1182
+ ]
1183
+ },
1184
+ {
1185
+ "cell_type": "markdown",
1186
+ "metadata": {},
1187
+ "source": [
1188
+ "Let's confirm that this gives the expected results, by grabbing the first batch:"
1189
+ ]
1190
+ },
1191
+ {
1192
+ "cell_type": "code",
1193
+ "execution_count": null,
1194
+ "metadata": {},
1195
+ "outputs": [
1196
+ {
1197
+ "data": {
1198
+ "text/plain": [
1199
+ "(torch.Size([64, 72]), torch.Size([64, 72]))"
1200
+ ]
1201
+ },
1202
+ "execution_count": null,
1203
+ "metadata": {},
1204
+ "output_type": "execute_result"
1205
+ }
1206
+ ],
1207
+ "source": [
1208
+ "x,y = first(dl)\n",
1209
+ "x.shape,y.shape"
1210
+ ]
1211
+ },
1212
+ {
1213
+ "cell_type": "markdown",
1214
+ "metadata": {},
1215
+ "source": [
1216
+ "and then looking at the first row of the independent variable, which should be the start of the first text:"
1217
+ ]
1218
+ },
1219
+ {
1220
+ "cell_type": "code",
1221
+ "execution_count": null,
1222
+ "metadata": {},
1223
+ "outputs": [
1224
+ {
1225
+ "data": {
1226
+ "text/plain": [
1227
+ "'xxbos xxmaj this movie , which i just xxunk at the video store , has apparently sit around for a'"
1228
+ ]
1229
+ },
1230
+ "execution_count": null,
1231
+ "metadata": {},
1232
+ "output_type": "execute_result"
1233
+ }
1234
+ ],
1235
+ "source": [
1236
+ "' '.join(num.vocab[o] for o in x[0][:20])"
1237
+ ]
1238
+ },
1239
+ {
1240
+ "cell_type": "markdown",
1241
+ "metadata": {},
1242
+ "source": [
1243
+ "The dependent variable is the same thing offset by one token:"
1244
+ ]
1245
+ },
1246
+ {
1247
+ "cell_type": "code",
1248
+ "execution_count": null,
1249
+ "metadata": {},
1250
+ "outputs": [
1251
+ {
1252
+ "data": {
1253
+ "text/plain": [
1254
+ "'xxmaj this movie , which i just xxunk at the video store , has apparently sit around for a couple'"
1255
+ ]
1256
+ },
1257
+ "execution_count": null,
1258
+ "metadata": {},
1259
+ "output_type": "execute_result"
1260
+ }
1261
+ ],
1262
+ "source": [
1263
+ "' '.join(num.vocab[o] for o in y[0][:20])"
1264
+ ]
1265
+ },
1266
+ {
1267
+ "cell_type": "markdown",
1268
+ "metadata": {},
1269
+ "source": [
1270
+ "This concludes all the preprocessing steps we need to apply to our data. We are now ready to train our text classifier."
1271
+ ]
1272
+ },
1273
+ {
1274
+ "cell_type": "markdown",
1275
+ "metadata": {},
1276
+ "source": [
1277
+ "## Training a Text Classifier"
1278
+ ]
1279
+ },
1280
+ {
1281
+ "cell_type": "markdown",
1282
+ "metadata": {},
1283
+ "source": [
1284
+ "As we saw at the beginning of this chapter, there are two steps to training a state-of-the-art text classifier using transfer learning: first we need to fine-tune our language model pretrained on Wikipedia to the corpus of IMDb reviews, and then we can use that model to train a classifier.\n",
1285
+ "\n",
1286
+ "As usual, let's start with assembling our data."
1287
+ ]
1288
+ },
1289
+ {
1290
+ "cell_type": "markdown",
1291
+ "metadata": {},
1292
+ "source": [
1293
+ "### Language Model Using DataBlock"
1294
+ ]
1295
+ },
1296
+ {
1297
+ "cell_type": "markdown",
1298
+ "metadata": {},
1299
+ "source": [
1300
+ "fastai handles tokenization and numericalization automatically when `TextBlock` is passed to `DataBlock`. All of the arguments that can be passed to `Tokenize` and `Numericalize` can also be passed to `TextBlock`. In the next chapter we'll discuss the easiest ways to run each of these steps separately, to ease debugging—but you can always just debug by running them manually on a subset of your data as shown in the previous sections. And don't forget about `DataBlock`'s handy `summary` method, which is very useful for debugging data issues.\n",
1301
+ "\n",
1302
+ "Here's how we use `TextBlock` to create a language model, using fastai's defaults:"
1303
+ ]
1304
+ },
1305
+ {
1306
+ "cell_type": "code",
1307
+ "execution_count": null,
1308
+ "metadata": {},
1309
+ "outputs": [],
1310
+ "source": [
1311
+ "get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])\n",
1312
+ "\n",
1313
+ "dls_lm = DataBlock(\n",
1314
+ " blocks=TextBlock.from_folder(path, is_lm=True),\n",
1315
+ " get_items=get_imdb, splitter=RandomSplitter(0.1)\n",
1316
+ ").dataloaders(path, path=path, bs=128, seq_len=80)"
1317
+ ]
1318
+ },
1319
+ {
1320
+ "cell_type": "markdown",
1321
+ "metadata": {},
1322
+ "source": [
1323
+ "One thing that's different to previous types we've used in `DataBlock` is that we're not just using the class directly (i.e., `TextBlock(...)`, but instead are calling a *class method*. A class method is a Python method that, as the name suggests, belongs to a *class* rather than an *object*. (Be sure to search online for more information about class methods if you're not familiar with them, since they're commonly used in many Python libraries and applications; we've used them a few times previously in the book, but haven't called attention to them.) The reason that `TextBlock` is special is that setting up the numericalizer's vocab can take a long time (we have to read and tokenize every document to get the vocab). To be as efficient as possible it performs a few optimizations: \n",
1324
+ "\n",
1325
+ "- It saves the tokenized documents in a temporary folder, so it doesn't have to tokenize them more than once\n",
1326
+ "- It runs multiple tokenization processes in parallel, to take advantage of your computer's CPUs\n",
1327
+ "\n",
1328
+ "We need to tell `TextBlock` how to access the texts, so that it can do this initial preprocessing—that's what `from_folder` does.\n",
1329
+ "\n",
1330
+ "`show_batch` then works in the usual way:"
1331
+ ]
1332
+ },
1333
+ {
1334
+ "cell_type": "code",
1335
+ "execution_count": null,
1336
+ "metadata": {},
1337
+ "outputs": [
1338
+ {
1339
+ "data": {
1340
+ "text/html": [
1341
+ "<table border=\"1\" class=\"dataframe\">\n",
1342
+ " <thead>\n",
1343
+ " <tr style=\"text-align: right;\">\n",
1344
+ " <th></th>\n",
1345
+ " <th>text</th>\n",
1346
+ " <th>text_</th>\n",
1347
+ " </tr>\n",
1348
+ " </thead>\n",
1349
+ " <tbody>\n",
1350
+ " <tr>\n",
1351
+ " <th>0</th>\n",
1352
+ " <td>xxbos xxmaj it 's awesome ! xxmaj in xxmaj story xxmaj mode , your going from punk to pro . xxmaj you have to complete goals that involve skating , driving , and walking . xxmaj you create your own skater and give it a name , and you can make it look stupid or realistic . xxmaj you are with your friend xxmaj eric throughout the game until he betrays you and gets you kicked off of the skateboard</td>\n",
1353
+ " <td>xxmaj it 's awesome ! xxmaj in xxmaj story xxmaj mode , your going from punk to pro . xxmaj you have to complete goals that involve skating , driving , and walking . xxmaj you create your own skater and give it a name , and you can make it look stupid or realistic . xxmaj you are with your friend xxmaj eric throughout the game until he betrays you and gets you kicked off of the skateboard xxunk</td>\n",
1354
+ " </tr>\n",
1355
+ " <tr>\n",
1356
+ " <th>1</th>\n",
1357
+ " <td>what xxmaj i 've read , xxmaj death xxmaj bed is based on an actual dream , xxmaj george xxmaj barry , the director , successfully transferred dream to film , only a genius could accomplish such a task . \\n\\n xxmaj old mansions make for good quality horror , as do portraits , not sure what to make of the killer bed with its killer yellow liquid , quite a bizarre dream , indeed . xxmaj also , this</td>\n",
1358
+ " <td>xxmaj i 've read , xxmaj death xxmaj bed is based on an actual dream , xxmaj george xxmaj barry , the director , successfully transferred dream to film , only a genius could accomplish such a task . \\n\\n xxmaj old mansions make for good quality horror , as do portraits , not sure what to make of the killer bed with its killer yellow liquid , quite a bizarre dream , indeed . xxmaj also , this is</td>\n",
1359
+ " </tr>\n",
1360
+ " </tbody>\n",
1361
+ "</table>"
1362
+ ],
1363
+ "text/plain": [
1364
+ "<IPython.core.display.HTML object>"
1365
+ ]
1366
+ },
1367
+ "metadata": {},
1368
+ "output_type": "display_data"
1369
+ }
1370
+ ],
1371
+ "source": [
1372
+ "dls_lm.show_batch(max_n=2)"
1373
+ ]
1374
+ },
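+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a brief aside, here is a tiny standalone sketch of a class method. The `Greeter` class below is hypothetical and exists only to illustrate the `TextBlock.from_folder` pattern of calling a method on the class itself rather than on an instance:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# hypothetical class, for illustration only -- not part of fastai\n",
+ "class Greeter:\n",
+ "    def __init__(self, greeting): self.greeting = greeting\n",
+ "    @classmethod\n",
+ "    def from_name(cls, name):   # called on the class itself, like TextBlock.from_folder\n",
+ "        return cls(f\"hello, {name}\")\n",
+ "\n",
+ "g = Greeter.from_name(\"IMDb\")   # no Greeter instance needed beforehand\n",
+ "g.greeting"
+ ]
+ },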
1375
+ {
1376
+ "cell_type": "markdown",
1377
+ "metadata": {},
1378
+ "source": [
1379
+ "Now that our data is ready, we can fine-tune the pretrained language model."
1380
+ ]
1381
+ },
1382
+ {
1383
+ "cell_type": "markdown",
1384
+ "metadata": {},
1385
+ "source": [
1386
+ "### Fine-Tuning the Language Model"
1387
+ ]
1388
+ },
1389
+ {
1390
+ "cell_type": "markdown",
1391
+ "metadata": {},
1392
+ "source": [
1393
+ "To convert the integer word indices into activations that we can use for our neural network, we will use embeddings, just like we did for collaborative filtering and tabular modeling. Then we'll feed those embeddings into a *recurrent neural network* (RNN), using an architecture called *AWD-LSTM* (we will show you how to write such a model from scratch in <<chapter_nlp_dive>>). As we discussed earlier, the embeddings in the pretrained model are merged with random embeddings added for words that weren't in the pretraining vocabulary. This is handled automatically inside `language_model_learner`:"
1394
+ ]
1395
+ },
1396
+ {
1397
+ "cell_type": "code",
1398
+ "execution_count": null,
1399
+ "metadata": {},
1400
+ "outputs": [],
1401
+ "source": [
1402
+ "learn = language_model_learner(\n",
1403
+ " dls_lm, AWD_LSTM, drop_mult=0.3, \n",
1404
+ " metrics=[accuracy, Perplexity()]).to_fp16()"
1405
+ ]
1406
+ },
1407
+ {
1408
+ "cell_type": "markdown",
1409
+ "metadata": {},
1410
+ "source": [
1411
+ "The loss function used by default is cross-entropy loss, since we essentially have a classification problem (the different categories being the words in our vocab). The *perplexity* metric used here is often used in NLP for language models: it is the exponential of the loss (i.e., `torch.exp(cross_entropy)`). We also include the accuracy metric, to see how many times our model is right when trying to predict the next word, since cross-entropy (as we've seen) is both hard to interpret, and tells us more about the model's confidence than its accuracy.\n",
1412
+ "\n",
1413
+ "Let's go back to the process diagram from the beginning of this chapter. The first arrow has been completed for us and made available as a pretrained model in fastai, and we've just built the `DataLoaders` and `Learner` for the second stage. Now we're ready to fine-tune our language model!"
1414
+ ]
1415
+ },
1416
+ {
1417
+ "cell_type": "markdown",
1418
+ "metadata": {},
1419
+ "source": [
1420
+ "<img alt=\"Diagram of the ULMFiT process\" width=\"450\" src=\"images/att_00027.png\">"
1421
+ ]
1422
+ },
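+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Here is the quick numeric sketch of perplexity mentioned above. The logits and targets are made up purely for illustration; the point is just that perplexity is the exponential of the cross-entropy loss:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# made-up numbers, for illustration only\n",
+ "import torch\n",
+ "import torch.nn.functional as F\n",
+ "logits  = torch.randn(3, 10)           # 3 predictions over a toy 10-word vocab\n",
+ "targets = torch.tensor([1, 4, 7])      # the \"correct\" next-word indices\n",
+ "loss = F.cross_entropy(logits, targets)\n",
+ "loss, torch.exp(loss)                  # the second value is the perplexity"
+ ]
+ },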
1423
+ {
1424
+ "cell_type": "markdown",
1425
+ "metadata": {},
1426
+ "source": [
1427
+ "It takes quite a while to train each epoch, so we'll be saving the intermediate model results during the training process. Since `fine_tune` doesn't do that for us, we'll use `fit_one_cycle`. Just like `vision_learner`, `language_model_learner` automatically calls `freeze` when using a pretrained model (which is the default), so this will only train the embeddings (the only part of the model that contains randomly initialized weights—i.e., embeddings for words that are in our IMDb vocab, but aren't in the pretrained model vocab):"
1428
+ ]
1429
+ },
1430
+ {
1431
+ "cell_type": "code",
1432
+ "execution_count": null,
1433
+ "metadata": {},
1434
+ "outputs": [
1435
+ {
1436
+ "data": {
1437
+ "text/html": [
1438
+ "<table border=\"1\" class=\"dataframe\">\n",
1439
+ " <thead>\n",
1440
+ " <tr style=\"text-align: left;\">\n",
1441
+ " <th>epoch</th>\n",
1442
+ " <th>train_loss</th>\n",
1443
+ " <th>valid_loss</th>\n",
1444
+ " <th>accuracy</th>\n",
1445
+ " <th>perplexity</th>\n",
1446
+ " <th>time</th>\n",
1447
+ " </tr>\n",
1448
+ " </thead>\n",
1449
+ " <tbody>\n",
1450
+ " <tr>\n",
1451
+ " <td>0</td>\n",
1452
+ " <td>4.120048</td>\n",
1453
+ " <td>3.912788</td>\n",
1454
+ " <td>0.299565</td>\n",
1455
+ " <td>50.038246</td>\n",
1456
+ " <td>11:39</td>\n",
1457
+ " </tr>\n",
1458
+ " </tbody>\n",
1459
+ "</table>"
1460
+ ],
1461
+ "text/plain": [
1462
+ "<IPython.core.display.HTML object>"
1463
+ ]
1464
+ },
1465
+ "metadata": {},
1466
+ "output_type": "display_data"
1467
+ }
1468
+ ],
1469
+ "source": [
1470
+ "learn.fit_one_cycle(1, 2e-2)"
1471
+ ]
1472
+ },
1473
+ {
1474
+ "cell_type": "markdown",
1475
+ "metadata": {},
1476
+ "source": [
1477
+ "This model takes a while to train, so it's a good opportunity to talk about saving intermediary results. "
1478
+ ]
1479
+ },
1480
+ {
1481
+ "cell_type": "markdown",
1482
+ "metadata": {},
1483
+ "source": [
1484
+ "### Saving and Loading Models"
1485
+ ]
1486
+ },
1487
+ {
1488
+ "cell_type": "markdown",
1489
+ "metadata": {},
1490
+ "source": [
1491
+ "You can easily save the state of your model like so:"
1492
+ ]
1493
+ },
1494
+ {
1495
+ "cell_type": "code",
1496
+ "execution_count": null,
1497
+ "metadata": {},
1498
+ "outputs": [],
1499
+ "source": [
1500
+ "learn.save('1epoch')"
1501
+ ]
1502
+ },
1503
+ {
1504
+ "cell_type": "markdown",
1505
+ "metadata": {},
1506
+ "source": [
1507
+ "This will create a file in `learn.path/models/` named *1epoch.pth*. If you want to load your model in another machine after creating your `Learner` the same way, or resume training later, you can load the content of this file with:"
1508
+ ]
1509
+ },
1510
+ {
1511
+ "cell_type": "code",
1512
+ "execution_count": null,
1513
+ "metadata": {},
1514
+ "outputs": [],
1515
+ "source": [
1516
+ "learn = learn.load('1epoch')"
1517
+ ]
1518
+ },
1519
+ {
1520
+ "cell_type": "markdown",
1521
+ "metadata": {},
1522
+ "source": [
1523
+ "Once the initial training has completed, we can continue fine-tuning the model after unfreezing:"
1524
+ ]
1525
+ },
1526
+ {
1527
+ "cell_type": "code",
1528
+ "execution_count": null,
1529
+ "metadata": {},
1530
+ "outputs": [
1531
+ {
1532
+ "data": {
1533
+ "text/html": [
1534
+ "<table border=\"1\" class=\"dataframe\">\n",
1535
+ " <thead>\n",
1536
+ " <tr style=\"text-align: left;\">\n",
1537
+ " <th>epoch</th>\n",
1538
+ " <th>train_loss</th>\n",
1539
+ " <th>valid_loss</th>\n",
1540
+ " <th>accuracy</th>\n",
1541
+ " <th>perplexity</th>\n",
1542
+ " <th>time</th>\n",
1543
+ " </tr>\n",
1544
+ " </thead>\n",
1545
+ " <tbody>\n",
1546
+ " <tr>\n",
1547
+ " <td>0</td>\n",
1548
+ " <td>3.893486</td>\n",
1549
+ " <td>3.772820</td>\n",
1550
+ " <td>0.317104</td>\n",
1551
+ " <td>43.502548</td>\n",
1552
+ " <td>12:37</td>\n",
1553
+ " </tr>\n",
1554
+ " <tr>\n",
1555
+ " <td>1</td>\n",
1556
+ " <td>3.820479</td>\n",
1557
+ " <td>3.717197</td>\n",
1558
+ " <td>0.323790</td>\n",
1559
+ " <td>41.148880</td>\n",
1560
+ " <td>12:30</td>\n",
1561
+ " </tr>\n",
1562
+ " <tr>\n",
1563
+ " <td>2</td>\n",
1564
+ " <td>3.735622</td>\n",
1565
+ " <td>3.659760</td>\n",
1566
+ " <td>0.330321</td>\n",
1567
+ " <td>38.851997</td>\n",
1568
+ " <td>12:09</td>\n",
1569
+ " </tr>\n",
1570
+ " <tr>\n",
1571
+ " <td>3</td>\n",
1572
+ " <td>3.677086</td>\n",
1573
+ " <td>3.624794</td>\n",
1574
+ " <td>0.333960</td>\n",
1575
+ " <td>37.516987</td>\n",
1576
+ " <td>12:12</td>\n",
1577
+ " </tr>\n",
1578
+ " <tr>\n",
1579
+ " <td>4</td>\n",
1580
+ " <td>3.636646</td>\n",
1581
+ " <td>3.601300</td>\n",
1582
+ " <td>0.337017</td>\n",
1583
+ " <td>36.645859</td>\n",
1584
+ " <td>12:05</td>\n",
1585
+ " </tr>\n",
1586
+ " <tr>\n",
1587
+ " <td>5</td>\n",
1588
+ " <td>3.553636</td>\n",
1589
+ " <td>3.584241</td>\n",
1590
+ " <td>0.339355</td>\n",
1591
+ " <td>36.026001</td>\n",
1592
+ " <td>12:04</td>\n",
1593
+ " </tr>\n",
1594
+ " <tr>\n",
1595
+ " <td>6</td>\n",
1596
+ " <td>3.507634</td>\n",
1597
+ " <td>3.571892</td>\n",
1598
+ " <td>0.341353</td>\n",
1599
+ " <td>35.583862</td>\n",
1600
+ " <td>12:08</td>\n",
1601
+ " </tr>\n",
1602
+ " <tr>\n",
1603
+ " <td>7</td>\n",
1604
+ " <td>3.444101</td>\n",
1605
+ " <td>3.565988</td>\n",
1606
+ " <td>0.342194</td>\n",
1607
+ " <td>35.374371</td>\n",
1608
+ " <td>12:08</td>\n",
1609
+ " </tr>\n",
1610
+ " <tr>\n",
1611
+ " <td>8</td>\n",
1612
+ " <td>3.398597</td>\n",
1613
+ " <td>3.566283</td>\n",
1614
+ " <td>0.342647</td>\n",
1615
+ " <td>35.384815</td>\n",
1616
+ " <td>12:11</td>\n",
1617
+ " </tr>\n",
1618
+ " <tr>\n",
1619
+ " <td>9</td>\n",
1620
+ " <td>3.375563</td>\n",
1621
+ " <td>3.568166</td>\n",
1622
+ " <td>0.342528</td>\n",
1623
+ " <td>35.451500</td>\n",
1624
+ " <td>12:05</td>\n",
1625
+ " </tr>\n",
1626
+ " </tbody>\n",
1627
+ "</table>"
1628
+ ],
1629
+ "text/plain": [
1630
+ "<IPython.core.display.HTML object>"
1631
+ ]
1632
+ },
1633
+ "metadata": {},
1634
+ "output_type": "display_data"
1635
+ }
1636
+ ],
1637
+ "source": [
1638
+ "learn.unfreeze()\n",
1639
+ "learn.fit_one_cycle(10, 2e-3)"
1640
+ ]
1641
+ },
1642
+ {
1643
+ "cell_type": "markdown",
1644
+ "metadata": {},
1645
+ "source": [
1646
+ "Once this is done, we save all of our model except the final layer that converts activations to probabilities of picking each token in our vocabulary. The model not including the final layer is called the *encoder*. We can save it with `save_encoder`:"
1647
+ ]
1648
+ },
1649
+ {
1650
+ "cell_type": "code",
1651
+ "execution_count": null,
1652
+ "metadata": {},
1653
+ "outputs": [],
1654
+ "source": [
1655
+ "learn.save_encoder('finetuned')"
1656
+ ]
1657
+ },
1658
+ {
1659
+ "cell_type": "markdown",
1660
+ "metadata": {},
1661
+ "source": [
1662
+ "> jargon: Encoder: The model not including the task-specific final layer(s). This term means much the same thing as _body_ when applied to vision CNNs, but \"encoder\" tends to be more used for NLP and generative models."
1663
+ ]
1664
+ },
1665
+ {
1666
+ "cell_type": "markdown",
1667
+ "metadata": {},
1668
+ "source": [
1669
+ "This completes the second stage of the text classification process: fine-tuning the language model. We can now use it to fine-tune a classifier using the IMDb sentiment labels."
1670
+ ]
1671
+ },
1672
+ {
1673
+ "cell_type": "markdown",
1674
+ "metadata": {},
1675
+ "source": [
1676
+ "### Text Generation"
1677
+ ]
1678
+ },
1679
+ {
1680
+ "cell_type": "markdown",
1681
+ "metadata": {},
1682
+ "source": [
1683
+ "Before we move on to fine-tuning the classifier, let's quickly try something different: using our model to generate random reviews. Since it's trained to guess what the next word of the sentence is, we can use the model to write new reviews:"
1684
+ ]
1685
+ },
1686
+ {
1687
+ "cell_type": "code",
1688
+ "execution_count": null,
1689
+ "metadata": {},
1690
+ "outputs": [
1691
+ {
1692
+ "data": {
1693
+ "text/html": [],
1694
+ "text/plain": [
1695
+ "<IPython.core.display.HTML object>"
1696
+ ]
1697
+ },
1698
+ "metadata": {},
1699
+ "output_type": "display_data"
1700
+ },
1701
+ {
1702
+ "data": {
1703
+ "text/html": [],
1704
+ "text/plain": [
1705
+ "<IPython.core.display.HTML object>"
1706
+ ]
1707
+ },
1708
+ "metadata": {},
1709
+ "output_type": "display_data"
1710
+ }
1711
+ ],
1712
+ "source": [
1713
+ "TEXT = \"I liked this movie because\"\n",
1714
+ "N_WORDS = 40\n",
1715
+ "N_SENTENCES = 2\n",
1716
+ "preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) \n",
1717
+ " for _ in range(N_SENTENCES)]"
1718
+ ]
1719
+ },
1720
+ {
1721
+ "cell_type": "code",
1722
+ "execution_count": null,
1723
+ "metadata": {},
1724
+ "outputs": [
1725
+ {
1726
+ "name": "stdout",
1727
+ "output_type": "stream",
1728
+ "text": [
1729
+ "i liked this movie because of its story and characters . The story line was very strong , very good for a sci - fi film . The main character , Alucard , was very well developed and brought the whole story\n",
1730
+ "i liked this movie because i like the idea of the premise of the movie , the ( very ) convenient virus ( which , when you have to kill a few people , the \" evil \" machine has to be used to protect\n"
1731
+ ]
1732
+ }
1733
+ ],
1734
+ "source": [
1735
+ "print(\"\\n\".join(preds))"
1736
+ ]
1737
+ },
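+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The two completions differ because each next word is sampled from the probabilities the model predicts; a lower `temperature` makes the sampling stick more closely to the most likely words, while higher values make it more adventurous. Here is a toy sketch of temperature sampling in general, with a made-up four-word vocabulary and made-up scores (this is just an illustration, not fastai's internal code):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# toy sketch of temperature sampling; the vocabulary and scores are made up\n",
+ "import torch\n",
+ "vocab  = ['great', 'terrible', 'boring', 'fun']\n",
+ "logits = torch.tensor([2.0, 1.0, 0.5, 1.5])      # pretend model scores for the next word\n",
+ "probs  = torch.softmax(logits / 0.75, dim=0)     # lower temperature -> sharper distribution\n",
+ "idx = torch.multinomial(probs, 1).item()         # draw one word at random\n",
+ "vocab[idx], probs"
+ ]
+ },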
1738
+ {
1739
+ "cell_type": "markdown",
1740
+ "metadata": {},
1741
+ "source": [
1742
+ "As you can see, we add some randomness (we pick a random word based on the probabilities returned by the model) so we don't get exactly the same review twice. Our model doesn't have any programmed knowledge of the structure of a sentence or grammar rules, yet it has clearly learned a lot about English sentences: we can see it capitalizes properly (*I* is just transformed to *i* because our rules require two characters or more to consider a word as capitalized, so it's normal to see it lowercased) and is using consistent tense. The general review makes sense at first glance, and it's only if you read carefully that you can notice something is a bit off. Not bad for a model trained in a couple of hours! \n",
1743
+ "\n",
1744
+ "But our end goal wasn't to train a model to generate reviews, but to classify them... so let's use this model to do just that."
1745
+ ]
1746
+ },
1747
+ {
1748
+ "cell_type": "markdown",
1749
+ "metadata": {},
1750
+ "source": [
1751
+ "### Creating the Classifier DataLoaders"
1752
+ ]
1753
+ },
1754
+ {
1755
+ "cell_type": "markdown",
1756
+ "metadata": {},
1757
+ "source": [
1758
+ "We're now moving from language model fine-tuning to classifier fine-tuning. To recap, a language model predicts the next word of a document, so it doesn't need any external labels. A classifier, however, predicts some external label—in the case of IMDb, it's the sentiment of a document.\n",
1759
+ "\n",
1760
+ "This means that the structure of our `DataBlock` for NLP classification will look very familiar. It's actually nearly the same as we've seen for the many image classification datasets we've worked with:"
1761
+ ]
1762
+ },
1763
+ {
1764
+ "cell_type": "code",
1765
+ "execution_count": null,
1766
+ "metadata": {},
1767
+ "outputs": [],
1768
+ "source": [
1769
+ "dls_clas = DataBlock(\n",
1770
+ " blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),\n",
1771
+ " get_y = parent_label,\n",
1772
+ " get_items=partial(get_text_files, folders=['train', 'test']),\n",
1773
+ " splitter=GrandparentSplitter(valid_name='test')\n",
1774
+ ").dataloaders(path, path=path, bs=128, seq_len=72)"
1775
+ ]
1776
+ },
1777
+ {
1778
+ "cell_type": "markdown",
1779
+ "metadata": {},
1780
+ "source": [
1781
+ "Just like with image classification, `show_batch` shows the dependent variable (sentiment, in this case) with each independent variable (movie review text):"
1782
+ ]
1783
+ },
1784
+ {
1785
+ "cell_type": "code",
1786
+ "execution_count": null,
1787
+ "metadata": {},
1788
+ "outputs": [
1789
+ {
1790
+ "data": {
1791
+ "text/html": [
1792
+ "<table border=\"1\" class=\"dataframe\">\n",
1793
+ " <thead>\n",
1794
+ " <tr style=\"text-align: right;\">\n",
1795
+ " <th></th>\n",
1796
+ " <th>text</th>\n",
1797
+ " <th>category</th>\n",
1798
+ " </tr>\n",
1799
+ " </thead>\n",
1800
+ " <tbody>\n",
1801
+ " <tr>\n",
1802
+ " <th>0</th>\n",
1803
+ " <td>xxbos i rate this movie with 3 skulls , only coz the girls knew how to scream , this could 've been a better movie , if actors were better , the twins were xxup ok , i believed they were evil , but the eldest and youngest brother , they sucked really bad , it seemed like they were reading the scripts instead of acting them … . spoiler : if they 're vampire 's why do they freeze the blood ? vampires ca n't drink frozen blood , the sister in the movie says let 's drink her while she is alive … .but then when they 're moving to another house , they take on a cooler they 're frozen blood . end of spoiler \\n\\n it was a huge waste of time , and that made me mad coz i read all the reviews of how</td>\n",
1804
+ " <td>neg</td>\n",
1805
+ " </tr>\n",
1806
+ " <tr>\n",
1807
+ " <th>1</th>\n",
1808
+ " <td>xxbos i have read all of the xxmaj love xxmaj come xxmaj softly books . xxmaj knowing full well that movies can not use all aspects of the book , but generally they at least have the main point of the book . i was highly disappointed in this movie . xxmaj the only thing that they have in this movie that is in the book is that xxmaj missy 's father comes to xxunk in the book both parents come ) . xxmaj that is all . xxmaj the story line was so twisted and far fetch and yes , sad , from the book , that i just could n't enjoy it . xxmaj even if i did n't read the book it was too sad . i do know that xxmaj pioneer life was rough , but the whole movie was a downer . xxmaj the rating</td>\n",
1809
+ " <td>neg</td>\n",
1810
+ " </tr>\n",
1811
+ " <tr>\n",
1812
+ " <th>2</th>\n",
1813
+ " <td>xxbos xxmaj this , for lack of a better term , movie is lousy . xxmaj where do i start … … \\n\\n xxmaj cinemaphotography - xxmaj this was , perhaps , the worst xxmaj i 've seen this year . xxmaj it looked like the camera was being tossed from camera man to camera man . xxmaj maybe they only had one camera . xxmaj it gives you the sensation of being a volleyball . \\n\\n xxmaj there are a bunch of scenes , haphazardly , thrown in with no continuity at all . xxmaj when they did the ' split screen ' , it was absurd . xxmaj everything was squished flat , it looked ridiculous . \\n\\n xxmaj the color tones were way off . xxmaj these people need to learn how to balance a camera . xxmaj this ' movie ' is poorly made , and</td>\n",
1814
+ " <td>neg</td>\n",
1815
+ " </tr>\n",
1816
+ " </tbody>\n",
1817
+ "</table>"
1818
+ ],
1819
+ "text/plain": [
1820
+ "<IPython.core.display.HTML object>"
1821
+ ]
1822
+ },
1823
+ "metadata": {},
1824
+ "output_type": "display_data"
1825
+ }
1826
+ ],
1827
+ "source": [
1828
+ "dls_clas.show_batch(max_n=3)"
1829
+ ]
1830
+ },
1831
+ {
1832
+ "cell_type": "markdown",
1833
+ "metadata": {},
1834
+ "source": [
1835
+ "Looking at the `DataBlock` definition, every piece is familiar from previous data blocks we've built, with two important exceptions:\n",
1836
+ "\n",
1837
+ "- `TextBlock.from_folder` no longer has the `is_lm=True` parameter.\n",
1838
+ "- We pass the `vocab` we created for the language model fine-tuning.\n",
1839
+ "\n",
1840
+ "The reason that we pass the `vocab` of the language model is to make sure we use the same correspondence of token to index. Otherwise the embeddings we learned in our fine-tuned language model won't make any sense to this model, and the fine-tuning step won't be of any use.\n",
1841
+ "\n",
1842
+ "By passing `is_lm=False` (or not passing `is_lm` at all, since it defaults to `False`) we tell `TextBlock` that we have regular labeled data, rather than using the next tokens as labels. There is one challenge we have to deal with, however, which is to do with collating multiple documents into a mini-batch. Let's see with an example, by trying to create a mini-batch containing the first 10 documents. First we'll numericalize them:"
1843
+ ]
1844
+ },
1845
+ {
1846
+ "cell_type": "code",
1847
+ "execution_count": null,
1848
+ "metadata": {},
1849
+ "outputs": [],
1850
+ "source": [
1851
+ "nums_samp = toks200[:10].map(num)"
1852
+ ]
1853
+ },
1854
+ {
1855
+ "cell_type": "markdown",
1856
+ "metadata": {},
1857
+ "source": [
1858
+ "Let's now look at how many tokens each of these 10 movie reviews have:"
1859
+ ]
1860
+ },
1861
+ {
1862
+ "cell_type": "code",
1863
+ "execution_count": null,
1864
+ "metadata": {},
1865
+ "outputs": [
1866
+ {
1867
+ "data": {
1868
+ "text/plain": [
1869
+ "(#10) [228,238,121,290,196,194,533,124,581,155]"
1870
+ ]
1871
+ },
1872
+ "execution_count": null,
1873
+ "metadata": {},
1874
+ "output_type": "execute_result"
1875
+ }
1876
+ ],
1877
+ "source": [
1878
+ "nums_samp.map(len)"
1879
+ ]
1880
+ },
1881
+ {
1882
+ "cell_type": "markdown",
1883
+ "metadata": {},
1884
+ "source": [
1885
+ "Remember, PyTorch `DataLoader`s need to collate all the items in a batch into a single tensor, and a single tensor has a fixed shape (i.e., it has some particular length on every axis, and all items must be consistent). This should sound familiar: we had the same issue with images. In that case, we used cropping, padding, and/or squishing to make all the inputs the same size. Cropping might not be a good idea for documents, because it seems likely we'd remove some key information (having said that, the same issue is true for images, and we use cropping there; data augmentation hasn't been well explored for NLP yet, so perhaps there are actually opportunities to use cropping in NLP too!). You can't really \"squish\" a document. So that leaves padding!\n",
1886
+ "\n",
1887
+ "We will expand the shortest texts to make them all the same size. To do this, we use a special padding token that will be ignored by our model. Additionally, to avoid memory issues and improve performance, we will batch together texts that are roughly the same lengths (with some shuffling for the training set). We do this by (approximately, for the training set) sorting the documents by length prior to each epoch. The result of this is that the documents collated into a single batch will tend to be of similar lengths. We won't pad every batch to the same size, but will instead use the size of the largest document in each batch as the target size. (It is possible to do something similar with images, which is especially useful for irregularly sized rectangular images, but at the time of writing no library provides good support for this yet, and there aren't any papers covering it. It's something we're planning to add to fastai soon, however, so keep an eye on the book's website; we'll add information about this as soon as we have it working well.)\n",
1888
+ "\n",
1889
+ "The sorting and padding are automatically done by the data block API for us when using a `TextBlock`, with `is_lm=False`. (We don't have this same issue for language model data, since we concatenate all the documents together first, and then split them into equally sized sections.)\n",
1890
+ "\n",
1891
+ "We can now create a model to classify our texts:"
1892
+ ]
1893
+ },
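+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The sketch below uses plain PyTorch and a few made-up numericalized documents; it is not fastai's internal implementation, just an illustration of padding every document up to the length of the longest one in the batch (assuming, as in the default fastai vocab, that index 1 is the padding token):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# hand-rolled padding sketch; documents and pad index are made up for illustration\n",
+ "import torch\n",
+ "pad_idx = 1                                  # assume index 1 is the padding token\n",
+ "docs = [torch.tensor([5, 8, 2]),\n",
+ "        torch.tensor([7, 3]),\n",
+ "        torch.tensor([9, 4, 6, 2, 8])]\n",
+ "max_len = max(len(d) for d in docs)          # pad to the longest document in the batch\n",
+ "batch = torch.stack([torch.cat([d, torch.full((max_len-len(d),), pad_idx)]) for d in docs])\n",
+ "batch"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can now create a model to classify our texts:"
+ ]
+ },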
1894
+ {
1895
+ "cell_type": "code",
1896
+ "execution_count": null,
1897
+ "metadata": {},
1898
+ "outputs": [],
1899
+ "source": [
1900
+ "learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, \n",
1901
+ " metrics=accuracy).to_fp16()"
1902
+ ]
1903
+ },
1904
+ {
1905
+ "cell_type": "markdown",
1906
+ "metadata": {},
1907
+ "source": [
1908
+ "The final step prior to training the classifier is to load the encoder from our fine-tuned language model. We use `load_encoder` instead of `load` because we only have pretrained weights available for the encoder; `load` by default raises an exception if an incomplete model is loaded:"
1909
+ ]
1910
+ },
1911
+ {
1912
+ "cell_type": "code",
1913
+ "execution_count": null,
1914
+ "metadata": {},
1915
+ "outputs": [],
1916
+ "source": [
1917
+ "learn = learn.load_encoder('finetuned')"
1918
+ ]
1919
+ },
1920
+ {
1921
+ "cell_type": "markdown",
1922
+ "metadata": {},
1923
+ "source": [
1924
+ "### Fine-Tuning the Classifier"
1925
+ ]
1926
+ },
1927
+ {
1928
+ "cell_type": "markdown",
1929
+ "metadata": {},
1930
+ "source": [
1931
+ "The last step is to train with discriminative learning rates and *gradual unfreezing*. In computer vision we often unfreeze the model all at once, but for NLP classifiers, we find that unfreezing a few layers at a time makes a real difference:"
1932
+ ]
1933
+ },
1934
+ {
1935
+ "cell_type": "code",
1936
+ "execution_count": null,
1937
+ "metadata": {},
1938
+ "outputs": [
1939
+ {
1940
+ "data": {
1941
+ "text/html": [
1942
+ "<table border=\"1\" class=\"dataframe\">\n",
1943
+ " <thead>\n",
1944
+ " <tr style=\"text-align: left;\">\n",
1945
+ " <th>epoch</th>\n",
1946
+ " <th>train_loss</th>\n",
1947
+ " <th>valid_loss</th>\n",
1948
+ " <th>accuracy</th>\n",
1949
+ " <th>time</th>\n",
1950
+ " </tr>\n",
1951
+ " </thead>\n",
1952
+ " <tbody>\n",
1953
+ " <tr>\n",
1954
+ " <td>0</td>\n",
1955
+ " <td>0.347427</td>\n",
1956
+ " <td>0.184480</td>\n",
1957
+ " <td>0.929320</td>\n",
1958
+ " <td>00:33</td>\n",
1959
+ " </tr>\n",
1960
+ " </tbody>\n",
1961
+ "</table>"
1962
+ ],
1963
+ "text/plain": [
1964
+ "<IPython.core.display.HTML object>"
1965
+ ]
1966
+ },
1967
+ "metadata": {},
1968
+ "output_type": "display_data"
1969
+ }
1970
+ ],
1971
+ "source": [
1972
+ "learn.fit_one_cycle(1, 2e-2)"
1973
+ ]
1974
+ },
1975
+ {
1976
+ "cell_type": "markdown",
1977
+ "metadata": {},
1978
+ "source": [
1979
+ "In just one epoch we get the same result as our training in <<chapter_intro>>: not too bad! We can pass `-2` to `freeze_to` to freeze all except the last two parameter groups:"
1980
+ ]
1981
+ },
1982
+ {
1983
+ "cell_type": "code",
1984
+ "execution_count": null,
1985
+ "metadata": {},
1986
+ "outputs": [
1987
+ {
1988
+ "data": {
1989
+ "text/html": [
1990
+ "<table border=\"1\" class=\"dataframe\">\n",
1991
+ " <thead>\n",
1992
+ " <tr style=\"text-align: left;\">\n",
1993
+ " <th>epoch</th>\n",
1994
+ " <th>train_loss</th>\n",
1995
+ " <th>valid_loss</th>\n",
1996
+ " <th>accuracy</th>\n",
1997
+ " <th>time</th>\n",
1998
+ " </tr>\n",
1999
+ " </thead>\n",
2000
+ " <tbody>\n",
2001
+ " <tr>\n",
2002
+ " <td>0</td>\n",
2003
+ " <td>0.247763</td>\n",
2004
+ " <td>0.171683</td>\n",
2005
+ " <td>0.934640</td>\n",
2006
+ " <td>00:37</td>\n",
2007
+ " </tr>\n",
2008
+ " </tbody>\n",
2009
+ "</table>"
2010
+ ],
2011
+ "text/plain": [
2012
+ "<IPython.core.display.HTML object>"
2013
+ ]
2014
+ },
2015
+ "metadata": {},
2016
+ "output_type": "display_data"
2017
+ }
2018
+ ],
2019
+ "source": [
2020
+ "learn.freeze_to(-2)\n",
2021
+ "learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))"
2022
+ ]
2023
+ },
2024
+ {
2025
+ "cell_type": "markdown",
2026
+ "metadata": {},
2027
+ "source": [
2028
+ "Then we can unfreeze a bit more, and continue training:"
2029
+ ]
2030
+ },
2031
+ {
2032
+ "cell_type": "code",
2033
+ "execution_count": null,
2034
+ "metadata": {},
2035
+ "outputs": [
2036
+ {
2037
+ "data": {
2038
+ "text/html": [
2039
+ "<table border=\"1\" class=\"dataframe\">\n",
2040
+ " <thead>\n",
2041
+ " <tr style=\"text-align: left;\">\n",
2042
+ " <th>epoch</th>\n",
2043
+ " <th>train_loss</th>\n",
2044
+ " <th>valid_loss</th>\n",
2045
+ " <th>accuracy</th>\n",
2046
+ " <th>time</th>\n",
2047
+ " </tr>\n",
2048
+ " </thead>\n",
2049
+ " <tbody>\n",
2050
+ " <tr>\n",
2051
+ " <td>0</td>\n",
2052
+ " <td>0.193377</td>\n",
2053
+ " <td>0.156696</td>\n",
2054
+ " <td>0.941200</td>\n",
2055
+ " <td>00:45</td>\n",
2056
+ " </tr>\n",
2057
+ " </tbody>\n",
2058
+ "</table>"
2059
+ ],
2060
+ "text/plain": [
2061
+ "<IPython.core.display.HTML object>"
2062
+ ]
2063
+ },
2064
+ "metadata": {},
2065
+ "output_type": "display_data"
2066
+ }
2067
+ ],
2068
+ "source": [
2069
+ "learn.freeze_to(-3)\n",
2070
+ "learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))"
2071
+ ]
2072
+ },
2073
+ {
2074
+ "cell_type": "markdown",
2075
+ "metadata": {},
2076
+ "source": [
2077
+ "And finally, the whole model!"
2078
+ ]
2079
+ },
2080
+ {
2081
+ "cell_type": "code",
2082
+ "execution_count": null,
2083
+ "metadata": {},
2084
+ "outputs": [
2085
+ {
2086
+ "data": {
2087
+ "text/html": [
2088
+ "<table border=\"1\" class=\"dataframe\">\n",
2089
+ " <thead>\n",
2090
+ " <tr style=\"text-align: left;\">\n",
2091
+ " <th>epoch</th>\n",
2092
+ " <th>train_loss</th>\n",
2093
+ " <th>valid_loss</th>\n",
2094
+ " <th>accuracy</th>\n",
2095
+ " <th>time</th>\n",
2096
+ " </tr>\n",
2097
+ " </thead>\n",
2098
+ " <tbody>\n",
2099
+ " <tr>\n",
2100
+ " <td>0</td>\n",
2101
+ " <td>0.172888</td>\n",
2102
+ " <td>0.153770</td>\n",
2103
+ " <td>0.943120</td>\n",
2104
+ " <td>01:01</td>\n",
2105
+ " </tr>\n",
2106
+ " <tr>\n",
2107
+ " <td>1</td>\n",
2108
+ " <td>0.161492</td>\n",
2109
+ " <td>0.155567</td>\n",
2110
+ " <td>0.942640</td>\n",
2111
+ " <td>00:57</td>\n",
2112
+ " </tr>\n",
2113
+ " </tbody>\n",
2114
+ "</table>"
2115
+ ],
2116
+ "text/plain": [
2117
+ "<IPython.core.display.HTML object>"
2118
+ ]
2119
+ },
2120
+ "metadata": {},
2121
+ "output_type": "display_data"
2122
+ }
2123
+ ],
2124
+ "source": [
2125
+ "learn.unfreeze()\n",
2126
+ "learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))"
2127
+ ]
2128
+ },
2129
+ {
2130
+ "cell_type": "markdown",
2131
+ "metadata": {},
2132
+ "source": [
2133
+ "We reached 94.3% accuracy, which was state-of-the-art performance just three years ago. By training another model on all the texts read backwards and averaging the predictions of those two models, we can even get to 95.1% accuracy, which was the state of the art introduced by the ULMFiT paper. It was only beaten a few months ago, by fine-tuning a much bigger model and using expensive data augmentation techniques (translating sentences in another language and back, using another model for translation).\n",
2134
+ "\n",
2135
+ "Using a pretrained model let us build a fine-tuned language model that was pretty powerful, to either generate fake reviews or help classify them. This is exciting stuff, but it's good to remember that this technology can also be used for malign purposes."
2136
+ ]
2137
+ },
2138
+ {
2139
+ "cell_type": "markdown",
2140
+ "metadata": {},
2141
+ "source": [
2142
+ "## Disinformation and Language Models"
2143
+ ]
2144
+ },
2145
+ {
2146
+ "cell_type": "markdown",
2147
+ "metadata": {},
2148
+ "source": [
2149
+ "Even simple algorithms based on rules, before the days of widely available deep learning language models, could be used to create fraudulent accounts and try to influence policymakers. Jeff Kao, now a computational journalist at ProPublica, analyzed the comments that were sent to the US Federal Communications Commission (FCC) regarding a 2017 proposal to repeal net neutrality. In his article [\"More than a Million Pro-Repeal Net Neutrality Comments Were Likely Faked\"](https://hackernoon.com/more-than-a-million-pro-repeal-net-neutrality-comments-were-likely-faked-e9f0e3ed36a6), he reports how he discovered a large cluster of comments opposing net neutrality that seemed to have been generated by some sort of Mad Libs-style mail merge. In <<disinformation>>, the fake comments have been helpfully color-coded by Kao to highlight their formulaic nature."
2150
+ ]
2151
+ },
2152
+ {
2153
+ "cell_type": "markdown",
2154
+ "metadata": {},
2155
+ "source": [
2156
+ "<img src=\"images/ethics/image16.png\" width=\"700\" id=\"disinformation\" caption=\"Comments received by the FCC during the net neutrality debate\">"
2157
+ ]
2158
+ },
2159
+ {
2160
+ "cell_type": "markdown",
2161
+ "metadata": {},
2162
+ "source": [
2163
+ "Kao estimated that \"less than 800,000 of the 22M+ comments… could be considered truly unique\" and that \"more than 99% of the truly unique comments were in favor of keeping net neutrality.\"\n",
2164
+ "\n",
2165
+ "Given advances in language modeling that have occurred since 2017, such fraudulent campaigns could be nearly impossible to catch now. You now have all the necessary tools at your disposal to create a compelling language model—that is, something that can generate context-appropriate, believable text. It won't necessarily be perfectly accurate or correct, but it will be plausible. Think about what this technology would mean when put together with the kinds of disinformation campaigns we have learned about in recent years. Take a look at the Reddit dialogue shown in <<ethics_reddit>>, where a language model based on OpenAI's GPT-2 algorithm is having a conversation with itself about whether the US government should cut defense spending."
2166
+ ]
2167
+ },
2168
+ {
2169
+ "cell_type": "markdown",
2170
+ "metadata": {},
2171
+ "source": [
2172
+ "<img src=\"images/ethics/image14.png\" id=\"ethics_reddit\" caption=\"An algorithm talking to itself on Reddit\" alt=\"An algorithm talking to itself on Reddit\" width=\"600\">"
2173
+ ]
2174
+ },
2175
+ {
2176
+ "cell_type": "markdown",
2177
+ "metadata": {},
2178
+ "source": [
2179
+ "In this case, it was explicitly said that an algorithm was used, but imagine what would happen if a bad actor decided to release such an algorithm across social networks. They could do it slowly and carefully, allowing the algorithm to gradually develop followers and trust over time. It would not take many resources to have literally millions of accounts doing this. In such a situation we could easily imagine getting to a point where the vast majority of discourse online was from bots, and nobody would have any idea that it was happening.\n",
2180
+ "\n",
2181
+ "We are already starting to see examples of machine learning being used to generate identities. For example, <<katie_jones>> shows a LinkedIn profile for Katie Jones."
2182
+ ]
2183
+ },
2184
+ {
2185
+ "cell_type": "markdown",
2186
+ "metadata": {},
2187
+ "source": [
2188
+ "<img src=\"images/ethics/image15.jpeg\" width=\"400\" id=\"katie_jones\" caption=\"Katie Jones's LinkedIn profile\">"
2189
+ ]
2190
+ },
2191
+ {
2192
+ "cell_type": "markdown",
2193
+ "metadata": {},
2194
+ "source": [
2195
+ "Katie Jones was connected on LinkedIn to several members of mainstream Washington think tanks. But she didn't exist. That image you see was auto-generated by a generative adversarial network, and somebody named Katie Jones has not, in fact, graduated from the Center for Strategic and International Studies.\n",
2196
+ "\n",
2197
+ "Many people assume or hope that algorithms will come to our defense here—that we will develop classification algorithms that can automatically recognise autogenerated content. The problem, however, is that this will always be an arms race, in which better classification (or discriminator) algorithms can be used to create better generation algorithms."
2198
+ ]
2199
+ },
2200
+ {
2201
+ "cell_type": "markdown",
2202
+ "metadata": {},
2203
+ "source": [
2204
+ "## Conclusion"
2205
+ ]
2206
+ },
2207
+ {
2208
+ "cell_type": "markdown",
2209
+ "metadata": {},
2210
+ "source": [
2211
+ "In this chapter we explored the last application covered out of the box by the fastai library: text. We saw two types of models: language models that can generate texts, and a classifier that determines if a review is positive or negative. To build a state-of-the art classifier, we used a pretrained language model, fine-tuned it to the corpus of our task, then used its body (the encoder) with a new head to do the classification.\n",
2212
+ "\n",
2213
+ "Before we end this section, we'll take a look at how the fastai library can help you assemble your data for your specific problems."
2214
+ ]
2215
+ },
2216
+ {
2217
+ "cell_type": "markdown",
2218
+ "metadata": {},
2219
+ "source": [
2220
+ "## Questionnaire"
2221
+ ]
2222
+ },
2223
+ {
2224
+ "cell_type": "markdown",
2225
+ "metadata": {},
2226
+ "source": [
2227
+ "1. What is \"self-supervised learning\"?\n",
2228
+ "1. What is a \"language model\"?\n",
2229
+ "1. Why is a language model considered self-supervised?\n",
2230
+ "1. What are self-supervised models usually used for?\n",
2231
+ "1. Why do we fine-tune language models?\n",
2232
+ "1. What are the three steps to create a state-of-the-art text classifier?\n",
2233
+ "1. How do the 50,000 unlabeled movie reviews help us create a better text classifier for the IMDb dataset?\n",
2234
+ "1. What are the three steps to prepare your data for a language model?\n",
2235
+ "1. What is \"tokenization\"? Why do we need it?\n",
2236
+ "1. Name three different approaches to tokenization.\n",
2237
+ "1. What is `xxbos`?\n",
2238
+ "1. List four rules that fastai applies to text during tokenization.\n",
2239
+ "1. Why are repeated characters replaced with a token showing the number of repetitions and the character that's repeated?\n",
2240
+ "1. What is \"numericalization\"?\n",
2241
+ "1. Why might there be words that are replaced with the \"unknown word\" token?\n",
2242
+ "1. With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain? (Careful—students often get this one wrong! Be sure to check your answer on the book's website.)\n",
2243
+ "1. Why do we need padding for text classification? Why don't we need it for language modeling?\n",
2244
+ "1. What does an embedding matrix for NLP contain? What is its shape?\n",
2245
+ "1. What is \"perplexity\"?\n",
2246
+ "1. Why do we have to pass the vocabulary of the language model to the classifier data block?\n",
2247
+ "1. What is \"gradual unfreezing\"?\n",
2248
+ "1. Why is text generation always likely to be ahead of automatic identification of machine-generated texts?"
2249
+ ]
2250
+ },
2251
+ {
2252
+ "cell_type": "markdown",
2253
+ "metadata": {},
2254
+ "source": [
2255
+ "### Further Research"
2256
+ ]
2257
+ },
2258
+ {
2259
+ "cell_type": "markdown",
2260
+ "metadata": {},
2261
+ "source": [
2262
+ "1. See what you can learn about language models and disinformation. What are the best language models today? Take a look at some of their outputs. Do you find them convincing? How could a bad actor best use such a model to create conflict and uncertainty?\n",
2263
+ "1. Given the limitation that models are unlikely to be able to consistently recognize machine-generated texts, what other approaches may be needed to handle large-scale disinformation campaigns that leverage deep learning?"
2264
+ ]
2265
+ },
2266
+ {
2267
+ "cell_type": "code",
2268
+ "execution_count": null,
2269
+ "metadata": {},
2270
+ "outputs": [],
2271
+ "source": []
2272
+ }
2273
+ ],
2274
+ "metadata": {
2275
+ "jupytext": {
2276
+ "split_at_heading": true
2277
+ },
2278
+ "kernelspec": {
2279
+ "display_name": "Python 3 (ipykernel)",
2280
+ "language": "python",
2281
+ "name": "python3"
2282
+ }
2283
+ },
2284
+ "nbformat": 4,
2285
+ "nbformat_minor": 2
2286
+ }
.ipynb_checkpoints/Modèle électoral 2022-checkpoint.ipynb ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [],
3
+ "metadata": {},
4
+ "nbformat": 4,
5
+ "nbformat_minor": 5
6
+ }
04_mnist_basics.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
10_nlp.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
pet.html DELETED
@@ -1,24 +0,0 @@
1
- ---
2
- title: Pet Classifier
3
- layout: page
4
- ---
5
-
6
- <input id="photo" type="file">
7
- <div id="results"></div>
8
- <script>
9
- async function loaded(reader) {
10
- const response = await fetch('https://hf.space/embed/etiennefd/minimal/+/api/predict',
11
- method: "POST", body: JSON.stringify({"data": [reader.result] }),
12
- headers: { "Content-Type": "application/json"}
13
- )};
14
- const json = await response.json();
15
- const label = json['data'][0]['confidences'][0]['label'];
16
- results.innerHTML = `<br/><img src="${reader.result}" width="300"> <p>${label}</p>`
17
- }
18
- function read() {
19
- const reader = new FileReader();
20
- reader.addEventListener('load', () => loaded(reader))
21
- reader.readAsDataURL(photo.files[0]);
22
- }
23
- photo.addEventListener('input', read);
24
- </script>
 
tmp/spm.model ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:48efb23ff1996f5005edbd1f260df7a51863c4b8a123b8991d8c126cb0578209
3
+ size 252436
tmp/spm.vocab ADDED
@@ -0,0 +1,1000 @@
1
+ ▁xxunk 0
2
+ ▁xxpad 0
3
+ ▁xxbos 0
4
+ ▁xxeos 0
5
+ ▁xxfld 0
6
+ ▁xxrep 0
7
+ ▁xxwrep 0
8
+ ▁xxup 0
9
+ ▁xxmaj 0
10
+ <unk> 0
11
+ s -3.19829
12
+ ▁the -3.67774
13
+ . -3.68837
14
+ , -3.82466
15
+ t -4.00543
16
+ ▁a -4.0367
17
+ ▁ -4.26249
18
+ ▁to -4.36181
19
+ ▁and -4.40718
20
+ ▁of -4.42516
21
+ ' -4.48387
22
+ ing -4.55321
23
+ e -4.58733
24
+ n -4.63651
25
+ br -4.64626
26
+ ed -4.65064
27
+ d -4.6589
28
+ y -4.68727
29
+ ▁/> -4.71012
30
+ ▁is -4.72249
31
+ ▁in -4.72949
32
+ < -4.75644
33
+ ▁I -4.80554
34
+ ▁it -4.90788
35
+ o -4.91286
36
+ r -4.91691
37
+ a -4.92049
38
+ ▁that -5.09248
39
+ i -5.13736
40
+ ▁this -5.14387
41
+ l -5.15105
42
+ er -5.18685
43
+ ly -5.24766
44
+ c -5.26258
45
+ u -5.28416
46
+ ▁was -5.2854
47
+ ar -5.3533
48
+ ▁movie -5.35952
49
+ m -5.44068
50
+ p -5.44785
51
+ - -5.46707
52
+ ▁for -5.47157
53
+ re -5.47733
54
+ g -5.48261
55
+ or -5.53825
56
+ b -5.55854
57
+ ▁f -5.61339
58
+ ▁p -5.61612
59
+ ▁with -5.63214
60
+ al -5.6419
61
+ le -5.6781
62
+ ▁The -5.68457
63
+ ▁film -5.68693
64
+ ▁be -5.69314
65
+ in -5.70513
66
+ h -5.74278
67
+ ▁on -5.74765
68
+ ▁c -5.77298
69
+ ▁S -5.78798
70
+ ▁but -5.7927
71
+ " -5.80218
72
+ on -5.8039
73
+ f -5.84405
74
+ w -5.85899
75
+ ▁have -5.85959
76
+ en -5.87657
77
+ ▁" -5.88874
78
+ ▁you -5.89871
79
+ ▁as -5.90426
80
+ k -5.90725
81
+ ▁not -5.91836
82
+ ▁( -5.92484
83
+ ri -5.96125
84
+ ▁re -5.97736
85
+ ▁are -5.97879
86
+ ▁he -5.98644
87
+ an -5.99265
88
+ it -5.99745
89
+ ▁A -6.00678
90
+ ic -6.0074
91
+ ▁so -6.07008
92
+ ur -6.11247
93
+ ve -6.1212
94
+ ▁one -6.16579
95
+ es -6.16973
96
+ ▁like -6.17796
97
+ ▁B -6.18004
98
+ ▁his -6.18848
99
+ ll -6.21352
100
+ ? -6.22826
101
+ ▁me -6.23449
102
+ ▁st -6.237
103
+ ▁an -6.24077
104
+ ) -6.24407
105
+ I -6.24666
106
+ ▁all -6.25179
107
+ ent -6.26858
108
+ ▁at -6.29605
109
+ as -6.29619
110
+ ▁no -6.2962
111
+ ▁they -6.32666
112
+ ! -6.33426
113
+ ▁just -6.33635
114
+ st -6.34189
115
+ ir -6.34934
116
+ ▁w -6.36279
117
+ ▁or -6.36345
118
+ ▁out -6.37112
119
+ ▁who -6.38177
120
+ A -6.38672
121
+ ▁from -6.40097
122
+ v -6.40158
123
+ ▁de -6.40166
124
+ ter -6.40464
125
+ at -6.40627
126
+ th -6.40681
127
+ ▁by -6.41576
128
+ ce -6.42458
129
+ ▁g -6.43046
130
+ ▁some -6.44438
131
+ ch -6.44674
132
+ ra -6.4525
133
+ ▁M -6.45358
134
+ ▁h -6.4551
135
+ ▁about -6.45676
136
+ ▁C -6.46433
137
+ te -6.48912
138
+ il -6.51193
139
+ ▁W -6.52699
140
+ ▁do -6.52986
141
+ ▁t -6.5399
142
+ ne -6.54578
143
+ ▁can -6.5558
144
+ la -6.56588
145
+ ▁her -6.57673
146
+ ▁co -6.58456
147
+ ▁d -6.60329
148
+ ▁get -6.60599
149
+ ▁H -6.61056
150
+ ▁would -6.61342
151
+ ▁time -6.61618
152
+ ▁has -6.62633
153
+ ▁P -6.62853
154
+ ▁bad -6.64626
155
+ ▁up -6.64685
156
+ us -6.65064
157
+ ▁O -6.6668
158
+ ▁even -6.668
159
+ 0 -6.66802
160
+ im -6.66978
161
+ ▁G -6.67474
162
+ un -6.67522
163
+ ... -6.68199
164
+ ▁we -6.68367
165
+ ▁L -6.68711
166
+ ▁D -6.693
167
+ ro -6.69404
168
+ ▁It -6.69552
169
+ ol -6.70147
170
+ ▁good -6.70955
171
+ ad -6.71365
172
+ ▁what -6.71754
173
+ ▁there -6.71768
174
+ ion -6.72014
175
+ S -6.72155
176
+ ▁if -6.74083
177
+ ate -6.74687
178
+ The -6.74721
179
+ ▁make -6.75121
180
+ ul -6.75454
181
+ ▁con -6.76063
182
+ ▁tr -6.76683
183
+ O -6.76776
184
+ ut -6.76845
185
+ se -6.7739
186
+ ▁more -6.78233
187
+ ▁see -6.7866
188
+ el -6.79352
189
+ z -6.7988
190
+ vi -6.80098
191
+ ▁e -6.80375
192
+ ▁only -6.80717
193
+ ▁T -6.81895
194
+ ation -6.81917
195
+ ▁This -6.81976
196
+ ▁were -6.82104
197
+ lo -6.82454
198
+ ▁un -6.82993
199
+ E -6.83126
200
+ is -6.83326
201
+ ▁F -6.84749
202
+ ▁had -6.85633
203
+ am -6.85747
204
+ ▁really -6.85818
205
+ ▁J -6.86472
206
+ ▁could -6.88358
207
+ T -6.8865
208
+ ▁ex -6.88848
209
+ ▁b -6.90019
210
+ ▁look -6.90159
211
+ ▁my -6.90446
212
+ ous -6.90458
213
+ ▁very -6.90655
214
+ ge -6.90738
215
+ all -6.90841
216
+ ▁E -6.91837
217
+ and -6.92315
218
+ ▁story -6.93111
219
+ ▁ch -6.94678
220
+ ▁when -6.96013
221
+ et -6.96299
222
+ ▁mo -6.96556
223
+ age -6.96621
224
+ ers -6.9678
225
+ ▁show -6.97363
226
+ ow -6.97378
227
+ : -6.97822
228
+ ill -6.9801
229
+ ry -6.98264
230
+ able -6.98579
231
+ ▁she -6.98858
232
+ ▁than -6.99564
233
+ ▁been -7.00482
234
+ ▁bo -7.01498
235
+ ▁much -7.01685
236
+ ▁m -7.01945
237
+ ies -7.02172
238
+ N -7.02742
239
+ om -7.03799
240
+ ie -7.03974
241
+ ▁other -7.04342
242
+ ive -7.04574
243
+ ▁any -7.04693
244
+ ho -7.04992
245
+ ver -7.06684
246
+ ke -7.06932
247
+ ▁their -7.06987
248
+ ▁sp -7.0774
249
+ ▁watch -7.08005
250
+ op -7.0821
251
+ ist -7.0833
252
+ ity -7.08495
253
+ oo -7.09093
254
+ ▁into -7.09467
255
+ ick -7.09945
256
+ ▁which -7.10617
257
+ ▁R -7.10765
258
+ ▁N -7.10851
259
+ ec -7.11236
260
+ ud -7.11376
261
+ ▁end -7.11514
262
+ id -7.13008
263
+ ment -7.13082
264
+ ▁off -7.13333
265
+ pp -7.13432
266
+ ▁ma -7.13819
267
+ ally -7.14648
268
+ ▁people -7.14718
269
+ less -7.15402
270
+ L -7.15617
271
+ ish -7.16123
272
+ ▁made -7.16643
273
+ ▁over -7.16809
274
+ ▁know -7.17224
275
+ ting -7.17619
276
+ ▁- -7.17767
277
+ ▁su -7.1843
278
+ ▁don -7.18606
279
+ ▁dis -7.18613
280
+ he -7.18673
281
+ tion -7.18879
282
+ ▁think -7.19165
283
+ ight -7.19307
284
+ x -7.19917
285
+ ▁K -7.20513
286
+ ▁because -7.2127
287
+ ▁say -7.2144
288
+ ap -7.22086
289
+ ▁go -7.22699
290
+ ▁ra -7.22824
291
+ D -7.23126
292
+ R -7.23142
293
+ ▁plot -7.2362
294
+ ▁too -7.23882
295
+ ▁how -7.24131
296
+ ▁way -7.24443
297
+ ▁sc -7.25163
298
+ ▁most -7.25304
299
+ ant -7.25447
300
+ ther -7.25461
301
+ est -7.273
302
+ ▁pro -7.27933
303
+ ▁He -7.28314
304
+ ian -7.2864
305
+ ▁sa -7.29022
306
+ ack -7.31295
307
+ ▁will -7.31959
308
+ end -7.32429
309
+ ▁play -7.33133
310
+ ine -7.3316
311
+ ▁acting -7.3347
312
+ ▁want -7.33772
313
+ ck -7.34363
314
+ ig -7.34391
315
+ ▁la -7.35936
316
+ ▁character -7.3634
317
+ C -7.37096
318
+ ▁them -7.37109
319
+ um -7.3773
320
+ mo -7.37791
321
+ ▁lo -7.38012
322
+ ▁better -7.38159
323
+ ak -7.38241
324
+ ▁pre -7.38616
325
+ ma -7.38906
326
+ ff -7.38932
327
+ ▁him -7.39063
328
+ M -7.40098
329
+ ▁And -7.40131
330
+ ▁should -7.40514
331
+ ▁characters -7.40573
332
+ ▁ever -7.40682
333
+ ▁seen -7.4069
334
+ ▁sh -7.41381
335
+ ▁man -7.41528
336
+ H -7.41691
337
+ ▁did -7.41776
338
+ ▁No -7.41874
339
+ ep -7.41952
340
+ ▁movies -7.42664
341
+ ▁real -7.42755
342
+ ). -7.42851
343
+ ac -7.42949
344
+ ▁never -7.43368
345
+ ▁th -7.43769
346
+ ▁part -7.44063
347
+ no -7.44166
348
+ sh -7.45083
349
+ ▁ba -7.45315
350
+ ▁first -7.45863
351
+ ful -7.46186
352
+ ard -7.46269
353
+ ted -7.46475
354
+ man -7.46628
355
+ W -7.47005
356
+ ving -7.47146
357
+ ▁also -7.47357
358
+ co -7.47436
359
+ tic -7.47486
360
+ ▁But -7.49088
361
+ ▁In -7.50149
362
+ ▁where -7.50191
363
+ pe -7.50295
364
+ ▁your -7.50942
365
+ ▁well -7.51092
366
+ ag -7.51138
367
+ ▁give -7.51585
368
+ ▁work -7.51959
369
+ ▁i -7.53691
370
+ ▁ro -7.54397
371
+ / -7.54456
372
+ if -7.54501
373
+ ang -7.54598
374
+ ▁scene -7.54759
375
+ ress -7.549
376
+ ▁V -7.54954
377
+ ▁2 -7.55105
378
+ B -7.5512
379
+ ance -7.56114
380
+ ▁little -7.56314
381
+ ▁take -7.56332
382
+ ame -7.56395
383
+ ▁guy -7.56564
384
+ ven -7.56661
385
+ ; -7.56784
386
+ ▁after -7.57038
387
+ one -7.57106
388
+ ical -7.57236
389
+ ▁cr -7.57477
390
+ me -7.57487
391
+ ▁through -7.57969
392
+ ▁thing -7.58434
393
+ ▁kill -7.58694
394
+ ▁many -7.59315
395
+ 5 -7.59409
396
+ ▁come -7.59646
397
+ ▁j -7.59729
398
+ ha -7.60331
399
+ ▁something -7.61123
400
+ ▁two -7.61124
401
+ ▁being -7.61193
402
+ ▁didn -7.61546
403
+ mp -7.61968
404
+ ise -7.61989
405
+ ▁li -7.62028
406
+ ence -7.63153
407
+ ▁love -7.63392
408
+ ▁scenes -7.64296
409
+ ▁There -7.6436
410
+ bo -7.65194
411
+ ab -7.65362
412
+ unt -7.67499
413
+ ail -7.67944
414
+ cent -7.68253
415
+ ty -7.68282
416
+ ▁great -7.68521
+ ▁nothing -7.6881
+ ▁watching -7.70456
+ ip -7.71292
+ ated -7.71739
+ ), -7.7176
+ ▁If -7.71935
+ em -7.72033
+ ▁comp -7.72755
+ ▁back -7.72821
+ ▁does -7.7301
+ 1 -7.73384
+ ster -7.73676
+ ▁bu -7.74436
+ ▁am -7.74828
+ ▁' -7.74988
+ ▁actually -7.75087
+ ▁point -7.75087
+ ▁films -7.75378
+ mm -7.75443
+ son -7.76276
+ the -7.76595
+ dy -7.7666
+ tra -7.76867
+ der -7.76905
+ ▁o -7.77369
+ P -7.77404
+ ▁doesn -7.77512
+ for -7.77587
+ ▁actors -7.77601
+ ▁du -7.78241
+ ▁seem -7.78246
+ ▁director -7.78255
+ ▁script -7.78541
+ ice -7.78933
+ ▁life -7.79133
+ ▁again -7.79423
+ ▁worst -7.79427
+ 9 -7.79976
+ lie -7.80095
+ ▁< -7.80106
+ ▁going -7.80146
+ ia -7.80342
+ ▁minutes -7.80372
+ ▁Mo -7.80565
+ ▁big -7.8079
+ F -7.81001
+ ▁laugh -7.81212
+ Y -7.81305
+ ock -7.81528
+ ▁feel -7.81825
+ G -7.82225
+ ite -7.82259
+ ▁act -7.82332
+ ▁sl -7.82445
+ ▁Ch -7.82545
+ old -7.83268
+ ▁these -7.83627
+ ▁such -7.83637
+ ▁find -7.83676
+ ▁poor -7.83988
+ ▁why -7.849
+ ▁down -7.84901
+ ▁1 -7.85051
+ ary -7.85203
+ ▁funny -7.86149
+ per -7.8632
+ side -7.86719
+ ug -7.86937
+ ▁ha -7.87002
+ min -7.87068
+ ▁fact -7.87085
+ ▁U -7.87529
+ ▁still -7.8759
+ K -7.87607
+ ▁old -7.88312
+ ▁then -7.88574
+ U -7.88587
+ j -7.88914
+ ▁Co -7.89035
+ ▁few -7.89056
+ ▁di -7.89306
+ io -7.8954
+ ▁before -7.90012
+ q -7.90639
+ ▁start -7.91357
+ row -7.91386
+ ▁fan -7.91511
+ ex -7.91638
+ use -7.92177
+ ture -7.92356
+ ▁around -7.93023
+ ▁least -7.93348
+ ▁pretty -7.9369
+ ▁lot -7.93951
+ ▁en -7.93984
+ 8 -7.94371
+ ▁set -7.95334
+ ▁original -7.95406
+ ▁those -7.95756
+ ▁best -7.95955
+ ron -7.96048
+ ▁Ma -7.96225
+ ▁mean -7.96794
+ ▁happen -7.96802
+ be -7.97033
+ ▁imp -7.97063
+ ▁us -7.97146
+ ▁here -7.97427
+ ▁every -7.98012
+ ▁same -7.98416
+ ▁reason -7.98582
+ ▁mis -7.98717
+ ▁car -7.98755
+ ▁turn -7.99305
+ ▁enough -7.99334
+ ness -7.99462
+ ▁bl -7.99571
+ up -7.99588
+ ▁bit -7.99614
+ ▁thought -7.99788
+ com -8.00483
+ ▁need -8.01544
+ ough -8.01874
+ 2 -8.01917
+ ▁another -8.02224
+ ▁hard -8.02914
+ ▁girl -8.0297
+ ▁got -8.03009
+ ward -8.03381
+ vo -8.03523
+ ▁seems -8.03782
+ ▁line -8.03883
+ ▁read -8.0404
+ ▁stupid -8.04091
+ ▁vi -8.04496
+ ▁while -8.0485
+ ial -8.04856
+ ure -8.0555
+ * -8.05615
+ ▁long -8.06008
+ ▁far -8.06048
+ ▁19 -8.06437
+ ▁direct -8.06765
+ 7 -8.06773
+ ber -8.07274
+ ▁To -8.07326
+ ▁whole -8.07862
+ ▁book -8.08036
+ ▁Y -8.08566
+ act -8.08646
+ ▁friend -8.08736
+ ▁An -8.08815
+ ▁use -8.09475
+ ▁horror -8.09529
+ ▁cast -8.10409
+ ▁Da -8.10449
+ ever -8.10809
+ ▁kid -8.1115
+ ▁might -8.11388
+ ▁screen -8.11549
+ ▁10 -8.12123
+ ▁na -8.12487
+ ▁though -8.13104
+ ▁things -8.13329
+ ▁ga -8.13485
+ ▁Ro -8.13525
+ ▁years -8.13538
+ ▁interesting -8.13722
+ ▁waste -8.13983
+ ▁money -8.14026
+ ▁De -8.14088
+ tain -8.1409
+ ach -8.14181
+ ▁star -8.14336
+ ▁must -8.14876
+ ▁You -8.14906
+ -- -8.15347
+ ▁action -8.15478
+ ▁care -8.15676
+ ▁sound -8.15768
+ ize -8.15893
+ ▁shot -8.16161
+ ▁name -8.16314
+ low -8.16334
+ 3 -8.16428
+ ▁anything -8.16563
+ ▁3 -8.16686
+ ▁believe -8.16988
+ ▁role -8.1724
+ land -8.17916
+ ating -8.17937
+ ible -8.1853
+ ▁fun -8.19114
+ ▁now -8.19311
+ ▁view -8.19604
+ ▁person -8.20052
+ ▁sex -8.20129
+ ▁kind -8.20488
+ V -8.20752
+ out -8.20802
+ & -8.21382
+ ▁idea -8.21425
+ ▁They -8.21545
+ ▁right -8.21862
+ ▁under -8.22036
+ ▁high -8.22287
+ ▁awful -8.22297
+ ▁new -8.22764
+ ound -8.23447
+ 6 -8.23655
+ ▁put -8.23941
+ uck -8.24065
+ ▁What -8.24125
+ ▁quite -8.24579
+ ▁enjoy -8.25044
+ ▁expect -8.25512
+ ▁else -8.25526
+ ▁without -8.25906
+ ▁She -8.26544
+ ▁lack -8.27111
+ ▁Le -8.27347
+ ▁Be -8.27472
+ ▁comedy -8.27508
+ ▁effects -8.27825
+ ▁tell -8.28402
+ ▁away -8.2863
+ ▁boring -8.28891
+ ▁trying -8.28893
+ ▁completely -8.29119
+ ▁sure -8.29209
+ ▁creat -8.29618
+ ▁day -8.29645
+ ▁gra -8.29886
+ ▁place -8.30313
+ ▁let -8.30968
+ ▁run -8.31266
+ ide -8.31722
+ almost -8.31801
+ way -8.32261
+ atch -8.32541
+ ▁someone -8.3266
+ ook -8.32891
+ ▁saw -8.33095
+ ▁done -8.33174
+ ▁For -8.33612
+ ▁Ha -8.33874
+ ▁worth -8.34324
+ ▁music -8.34339
+ ▁main -8.34516
+ ▁anyone -8.34527
+ ▁both -8.34589
+ que -8.34824
+ ▁making -8.34844
+ ade -8.35944
+ ▁sub -8.36026
+ ▁That -8.36037
+ ▁war -8.36225
+ ▁lead -8.36416
+ ▁probably -8.3691
+ ▁4 -8.3712
+ ▁ru -8.37183
+ ▁world -8.37436
+ ▁own -8.37497
+ ▁goes -8.37724
+ ▁obvious -8.37965
+ ▁sense -8.37967
+ ▁American -8.38706
+ ▁Re -8.38836
+ ▁terrible -8.3957
+ ▁Mar -8.40095
+ ▁young -8.40706
+ ▁problem -8.412
+ ▁talk -8.41208
+ ▁performance -8.41215
+ day -8.41736
+ ▁hour -8.41774
+ ink -8.41965
+ ( -8.41996
+ ▁hand -8.42496
+ ER -8.4252
+ ative -8.42821
+ ▁left -8.42859
+ ▁TV -8.42996
+ ▁attempt -8.43978
+ ▁appear -8.44543
+ ▁wonder -8.44544
+ ▁help -8.44578
+ ▁Di -8.44585
+ ▁low -8.45074
+ ▁audience -8.45111
+ ▁become -8.45132
+ spect -8.45187
+ ▁Th -8.45489
+ ▁review -8.45683
+ ▁found -8.46141
+ ▁worse -8.46263
+ ▁crap -8.47232
+ ▁add -8.47675
+ ▁instead -8.48002
+ qua -8.48061
+ 4 -8.48354
+ ▁since -8.48592
+ ▁always -8.49182
+ ▁camera -8.49924
+ ▁head -8.50222
+ ▁half -8.50387
+ produc -8.50392
+ ank -8.50595
+ ▁live -8.50747
+ ▁final -8.50825
+ line -8.51122
+ This -8.51389
+ ▁actor -8.51912
+ ▁between -8.52194
+ ▁budget -8.52807
+ ▁special -8.52807
+ ▁When -8.52819
+ ▁fight -8.52963
+ ▁last -8.53067
+ ▁rather -8.53083
+ ▁supposed -8.531
+ ▁woman -8.5343
+ ▁flick -8.53434
+ ▁Pa -8.54274
+ ▁everything -8.54674
+ !!! -8.54676
+ port -8.55152
+ ▁piece -8.55303
+ ident -8.55739
+ istic -8.56273
+ ▁guess -8.56572
+ ▁either -8.56572
+ ▁wrong -8.57212
+ ▁sit -8.57232
+ Z -8.57856
+ ▁entire -8.57856
+ ▁moment -8.59575
+ ▁comment -8.59823
+ ▁Why -8.59839
+ ▁nu -8.59868
+ ▁art -8.599
+ ▁horrible -8.60479
+ ▁joke -8.61147
+ ▁course -8.61148
+ ▁dialogue -8.6115
+ ▁writer -8.61151
+ ▁short -8.61182
+ ▁keep -8.61821
+ ▁DVD -8.62492
+ ▁job -8.625
+ ▁sw -8.62633
+ where -8.6323
+ ▁hav -8.64301
+ ▁recommend -8.64547
+ ▁second -8.64547
+ ▁understand -8.64547
+ ▁came -8.64638
+ ▁tea -8.64793
+ ▁said -8.64876
+ ▁different -8.65242
+ ▁decide -8.65245
+ ▁rent -8.65306
+ J -8.65925
+ ▁Well -8.65967
+ ▁remember -8.66645
+ ▁open -8.6667
+ ▁serious -8.67354
+ ▁nice -8.67392
+ ▁word -8.67393
+ like -8.67466
+ ▁face -8.68217
+ ▁our -8.68772
+ ▁John -8.68788
+ ▁night -8.68848
+ ▁each -8.69015
+ ▁series -8.69515
+ ▁absolutely -8.70243
+ ▁monster -8.70246
+ ▁talent -8.7025
+ ▁three -8.70258
+ ▁mind -8.70292
+ ▁video -8.70978
+ ▁everyone -8.71095
+ ▁sort -8.7119
+ ▁production -8.7172
+ ▁women -8.71721
+ ▁case -8.71794
+ ▁drama -8.72466
+ ▁close -8.73235
+ lthough -8.73241
+ ▁yet -8.73309
+ ▁simply -8.73974
+ ▁version -8.73975
+ gue -8.74695
+ ▁full -8.74742
+ ▁together -8.74744
+ ▁mention -8.74759
+ ▁change -8.74835
+ ▁ridiculous -8.75507
+ ▁next -8.76283
+ ▁possib -8.76283
+ ▁example -8.77063
+ ▁THE -8.77064
+ ▁follow -8.77064
+ ▁however -8.77209
+ ▁except -8.78646
+ ▁slow -8.79004
+ Q -8.79444
+ ▁death -8.79464
+ ▁episode -8.80251
+ ▁consider -8.80254
+ ▁wife -8.80257
+ ▁hope -8.80345
+ ▁ski -8.80505
+ ▁family -8.81064
+ ▁cheap -8.8107
+ ▁home -8.81098
+ ▁interest -8.81658
+ ▁However -8.81884
+ ▁school -8.81884
+ ▁release -8.8189
+ ▁house -8.81922
+ ▁black -8.82042
+ void -8.82088
+ ▁Hollywood -8.8271
+ ▁super -8.82945
+ play -8.83001
+ ▁quality -8.83543
+ ▁small -8.83546
+ ▁felt -8.83598
+ ▁Some -8.83638
+ ▁usual -8.83639
+ ▁clear -8.84419
+ ▁near -8.84456
+ ▁year -8.84519
+ ▁especially -8.85231
+ ▁certain -8.85235
+ ▁himself -8.86086
+ ▁Chris -8.86252
+ ▁murder -8.86949
+ ▁cinema -8.87819
+ ▁couple -8.87819
+ ▁along -8.87819
+ ▁matter -8.87839
+ ▁town -8.88081
+ – -8.88695
+ ▁itself -8.88696
+ ▁true -8.88734
+ ▁plan -8.89006
+ some -8.89368
+ ▁already -8.8958
+ ▁blood -8.89584
+ ▁pain -8.89646
+ ▁etc -8.89717
+ ▁eye -8.90878
+ ▁feature -8.91373
+ ▁sever -8.91522
+ ship -8.92255
+ ▁involve -8.92283
+ ▁portray -8.932
+ ▁sequence -8.932
+ ▁humor -8.93203
+ ▁After -8.9321
+ /10 -8.94126
+ ▁God -8.94191
+ ▁leave -8.95061
+ ▁maybe -8.95063
+ ▁title -8.95073
+ ▁walk -8.95075
+ ▁complete -8.95478
+ ▁written -8.96004
+ ▁writing -8.9696
+ ▁credit -8.96967
+ ▁sea -8.97841
+ ▁sequel -8.97918
+ $ -8.98888
+ ▁father -8.98904
+ ▁annoying -9.00859
+ ▁stuff -9.00864
+ ▁myself -9.01859
+ ▁emotion -9.0186
+ ▁particular -9.02869
+ *** -9.02869
+ ▁lu -9.03085
+ ▁unti -9.03591
+ ▁beginning -9.03899
+ cap -9.04398
+ ▁dialog -9.04914
+ ▁silly -9.04929
+ ▁song -9.05028
+ ▁picture -9.05965
+ ▁build -9.05968
+ ▁light -9.06016
+ ▁Also -9.07018
+ ▁stay -9.07128
+ ▁female -9.08079
+ ING -9.08079
+ ▁local -9.08094
+ ▁group -9.09155
+ ▁dumb -9.09175
+ ▁shoot -9.09182
+ ▁extreme -9.10241
+ ▁predictable -9.10241
+ ▁camp -9.10273
+ ▁begin -9.11373
+ ▁entertaining -9.12451
+ ▁human -9.12459
+ ▁wait -9.12484
+ ▁Sta -9.12527
+ ▁develop -9.13575
+ ▁themselves -9.13575
+ ▁bunch -9.13583
+ ▁Every -9.13584
+ ▁drive -9.13718
+ ▁cliché -9.14711
+ ▁disappointed -9.14711
+ ▁surprise -9.14711
+ ▁dark -9.14728
+ X -9.1586
+ ▁deliver -9.15861
+ tract -9.16371
+ ▁beautiful -9.17023
+ ▁Robert -9.17025
+ ▁white -9.17025
+ ▁genre -9.17044
+ ▁effort -9.182
+ ▁voice -9.18202
+ ▁children -9.18206
+ ▁child -9.18208
+ ▁explain -9.1939
+ ▁pathetic -9.1939
+ ▁power -9.19419
+ ▁question -9.20595
+ ▁continu -9.20595
+ ▁strange -9.20597
+ ▁Just -9.20634
+ ▁daughter -9.21815
+ ▁self -9.21817
+ ▁type -9.2183
+ ▁truly -9.23091
+ ▁NOT -9.243
+ ▁figure -9.24301
+ ▁twist -9.25569
+ ▁First -9.2557
+ ▁huge -9.2558
+ ▁With -9.25594
+ ▁value -9.26847
+ ▁brother -9.26849
+ ▁across -9.28146
+ _ -9.29462
+ ▁speak -9.29467
+ ▁footage -9.30795
+ ▁situation -9.30796
+ ▁experience -9.32146
+ … -9.60236
+ ` -9.62054
+ % -9.94242
+ é -9.96807
+ # -10.0214
+ — -10.2343
+ + -10.2343
+ > -10.8152
+ = -10.8152
+ ] -11.0158
+ ’ -11.0928
+ @ -11.1761
+ – -11.367
+ ó -11.4781
+ ç -11.6031
+ í -11.746
+ ö -11.9126
+ ñ -12.3626
+ å -12.6952
+ ‘ -12.6953
+ ê -12.6954
+ £ -12.6955
+ { -12.6956
+ ã -12.6957
+ ́ -12.6958
+ ~ -12.6959
+ ü -12.696
+ [ -12.696
+ } -12.696
+ • -12.696
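
Editor's note: the added lines above appear to be SentencePiece-style vocabulary entries, one subword piece per line followed by its unigram log-probability (the "▁" marker denotes a word-initial piece). As a minimal sketch only, assuming the committed file follows the usual "<piece><TAB><score>" layout of a SentencePiece .vocab file, it could be inspected in Python like this; the path "tmp/spm.vocab" is hypothetical and should be replaced with the file added in this commit.

# Minimal sketch (assumption): parse a SentencePiece-style vocab file into
# (piece, log-probability) pairs. Each non-empty line is expected to be
# "<piece>\t<score>"; the path below is a placeholder, not from this commit.
from pathlib import Path

def load_spm_vocab(path="tmp/spm.vocab"):
    entries = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        piece, score = line.rsplit("\t", 1)  # split on the last tab only
        entries.append((piece, float(score)))
    return entries

if __name__ == "__main__":
    vocab = load_spm_vocab()
    # Show the ten lowest-probability pieces (most negative log-probability),
    # which in the listing above are rare punctuation and accented characters.
    print(sorted(vocab, key=lambda item: item[1])[:10])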