Next token prediction with GPT
Tokens
Language models don't process English, Spanish, or any other human language directly. Language models process token IDs and generate token IDs. For example, when I send GPT-2 the sentence "The dog eats the apples.", this sentence is decomposed into the tokens ["The", " dog", " eats", " the", " apples", "."] and GPT-2 processes the token IDs [464, 3290, 25365, 262, 22514, 13] (see the image below).
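As a minimal sketch of this round trip, here is what it looks like with the Hugging Face transformers tokenizer for GPT-2 (an assumption on my part that transformers is installed; any GPT-2 tokenizer implementation behaves the same way):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "The dog eats the apples."
token_ids = tokenizer.encode(text)
print(token_ids)
# [464, 3290, 25365, 262, 22514, 13]

# Decode each ID individually to recover the token strings
tokens = [tokenizer.decode([token_id]) for token_id in token_ids]
print(tokens)
# ['The', ' dog', ' eats', ' the', ' apples', '.']
```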
The token decomposition and the token IDs depend on the language model, but the general idea holds. To get a feeling for what tokens are and/or to search for a particular token, use this interactive demo: https://alonsosilva-tokenizer.hf.space/

Note: The three interactive demos in this post are built with Solara, a pure Python, React-style framework to build ultra-responsive web apps.
Next token prediction
The language model receives these tokens and predicts the next token. From an abstract point of view, predicting the next token is a multi-class classification task with a very large number of classes (50,257 classes for GPT-2, one for each possible token in its vocabulary).
For example, when we send the sentence "One, two," GPT-2 receives the token IDs [3198, 11, 734, 11] and predicts with a probability of 39.71% that the next token ID will be 1115, which corresponds to the token " three".
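Here is a minimal sketch of that single prediction step with transformers (an assumed dependency; the exact probability can vary slightly across library and model revisions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("One, two,", return_tensors="pt").input_ids
print(input_ids)  # tensor([[3198, 11, 734, 11]])

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (1, sequence_length, 50257)

# The probability distribution over the next token comes from the last position
probs = torch.softmax(logits[0, -1], dim=-1)
next_token_id = int(probs.argmax())
print(next_token_id, repr(tokenizer.decode([next_token_id])), f"{probs[next_token_id]:.2%}")
# 1115 ' three' 39.71%
```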
Generating a single token is cool and all, but what about entire sentences, paragraphs, etc.? Well, we can generate full sentences by iteratively asking the model for the next token prediction. At each iteration, we append the predicted token back onto the input. This kind of model is called autoregressive (for example, GPT-4 is an autoregressive model).
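A bare-bones version of this autoregressive loop could look like the following sketch, where for simplicity I pick the most likely token at each step (sampling strategies are discussed just below):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("One, two,", return_tensors="pt").input_ids
for _ in range(10):
    with torch.no_grad():
        logits = model(input_ids).logits
    next_token_id = logits[0, -1].argmax()
    # The autoregressive step: append the prediction back onto the input
    input_ids = torch.cat([input_ids, next_token_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
# One, two, three, four, five, six, seven,
```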
The problem then consists of predicting the next token extremely well. Given a text, the language model gives us the probabilities of every possible next token (see the image above).
There are many different sampling strategies, but the simplest one is to always pick the next token with the highest probability. If we follow that greedy strategy, we obtain the following:
Then, by completing the prompt "One, two," with this greedy strategy, we obtain: "One, two, three, four, five, six, seven, eight, nine, ten,"...
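The same greedy completion can be reproduced with the transformers generate API, which falls back to greedy decoding when sampling is disabled (a sketch; max_new_tokens is chosen here just to fit the quoted output):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("One, two,", return_tensors="pt").input_ids
# do_sample=False disables sampling, so generate picks the argmax at each step
output_ids = model.generate(input_ids, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(output_ids[0]))
# One, two, three, four, five, six, seven, eight, nine, ten,
```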
To explore next token prediction, you can take a look at the following app I built: https://alonsosilva-nexttokenprediction.hf.space/
Since the model gives us, for a given text, the probabilities of the next token it expects, we can estimate how surprised the model is to encounter a particular token. For example, after having seen "One, two, three, four," the model is very surprised to find that the next token is " mango" (see the image below):
Indeed, if we hover over the token " mango", the probability the model estimated for the token " mango" is 0.00%.
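We can check this number directly. The sketch below looks up the probability the model assigns to the first piece of each candidate continuation after "One, two, three, four," (same transformers assumption as above; displayed probabilities are rounded):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("One, two, three, four,", return_tensors="pt").input_ids
with torch.no_grad():
    probs = torch.softmax(model(input_ids).logits[0, -1], dim=-1)

for candidate in [" five", " mango"]:
    ids = tokenizer.encode(candidate)
    # Probability of the first sub-token of each candidate continuation
    print(f"{candidate!r}: {probs[ids[0]]:.2%}")
# ' five' gets most of the probability mass; ' mango' rounds to 0.00%
```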
To explore how surprised the model is by each token in a sentence you provide, you can take a look at the following app: https://alonsosilva-perplexity.hf.space/
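For reference, here is a minimal sketch of the underlying computation: score each token of a sentence by the probability the model assigned to it given all the previous tokens (the perplexity reported by such tools is the exponential of the average negative log-probability):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("One, two, three, four, mango", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(input_ids).logits

# The logits at position i predict the token at position i + 1
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
target_ids = input_ids[0, 1:]
token_log_probs = log_probs[torch.arange(target_ids.size(0)), target_ids]

# Per-token probabilities: low values mean the model was surprised
for token_id, log_prob in zip(target_ids.tolist(), token_log_probs):
    print(f"{tokenizer.decode([token_id])!r}: {log_prob.exp():.2%}")

perplexity = torch.exp(-token_log_probs.mean())
print(f"Perplexity: {perplexity:.2f}")
```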