Rambling
README.md
@@ -42,7 +42,7 @@ Here are the improvements that I wanted to make:
The original GPTed project relied on two features of the legacy OpenAI /v1/completions API:
> logprobs: Include the log probabilities on the `logprobs` most likely output tokens, as well the chosen tokens. For example, if `logprobs` is 5, the API will return a list of the 5 most likely tokens. The API will always return the `logprob` of the sampled token, so there may be up to `logprobs+1` elements in the response. The maximum value for `logprobs` is 5.
> echo: Echo back the prompt in addition to the completion
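A minimal sketch of how these two flags combine to score an existing text, assuming the `openai` Python client and a model that still serves the legacy completions endpoint; the model name `gpt-3.5-turbo-instruct` and the -5.0 flagging threshold are illustrative choices, not taken from the original project:

```python
# Sketch: score an existing text via the legacy /v1/completions API.
from openai import OpenAI

client = OpenAI()

resp = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # illustrative; any model on the legacy endpoint
    prompt="The quick brown fox jumped over the lazy dog.",
    max_tokens=0,   # generate nothing...
    echo=True,      # ...but echo the prompt back,
    logprobs=5,     # with logprobs for each prompt token (at most 5 alternatives)
)

lp = resp.choices[0].logprobs
for token, logprob in zip(lp.tokens, lp.token_logprobs):
    # The first token has no left context, so its logprob is None.
    flag = " <-- unlikely?" if logprob is not None and logprob < -5.0 else ""
    print(f"{token!r}\t{logprob}{flag}")
```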
@@ -203,11 +203,20 @@ The main limitation of using decoder-only models like GPT or Llama for this task
### Other potential possibilities / ideas
- Instead of using a local model, investigate using the API of a provider that exposes logprobs, e.g. Replicate
## A digression on encoder vs decoder, unidirectional vs bidirectional attention, and whether we could use bidirectional attention for text generation
It is a common misconception that autoregressive text generation _requires_ unidirectional attention, whereas in reality it is only a matter of efficiency (at both training and inference time). It is possible to use models with bidirectional attention autoregressively, and arguably it would give better quality than unidirectional attention: the bidirectional flow of information between tokens in the current prefix can only be beneficial. For example, if we are generating the next token after "the quick brown fox jumped over", there is no benefit in not letting "fox" see "jumped". However, with bidirectional attention we cannot learn from every token of a text by passing a single instance of it through the model; we would have to train on each position separately. And at inference time it rules out techniques such as KV caching, which are used ubiquitously in modern LLM deployments, because all attention would have to be recomputed for every new prefix.
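To make the efficiency point concrete, here is an illustrative sketch of autoregressive generation without any KV cache; `bidirectional_lm` is a hypothetical callable, not a real API, standing in for a full forward pass of a bidirectional model that returns next-token logits at the last position. Generating T new tokens costs T full forward passes over ever-growing prefixes.

```python
# Illustrative sketch: using a bidirectional model autoregressively.
# `bidirectional_lm(token_ids)` is hypothetical: a full forward pass over the
# whole prefix, returning a list of next-token logits at the last position.
def generate(bidirectional_lm, prompt_ids, n_new_tokens):
    token_ids = list(prompt_ids)
    for _ in range(n_new_tokens):
        # No KV cache is possible: appending one token changes the attention
        # of *every* earlier position, so the entire prefix is re-encoded here.
        logits = bidirectional_lm(token_ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        token_ids.append(next_id)
    return token_ids
```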
## Part 2
Applying encoder-only models (those with bidirectional attention) to this task presents several challenges.
Whereas unidirectional attention lets decoder-only models be trained efficiently on next-token prediction and used for autoregressive text generation, with the important byproduct of returning logprobs for every input token, encoder-only models like BERT are trained on masked token prediction (and, in BERT's case, next sentence prediction), and it is from this fact that the difficulties arise:
- We cannot get logprobs for all tokens in a given text by passing a single instance of it through the model. Because we need to mask each token individually, we have to replicate the input as many times as there are tokens. It can still be done in one pass / one batch, but the batch size is N_tokens instead of 1 as in the decoder-only case (see the sketch after this list).
- For multi-token words, it is not clear whether replacing them with a sequence of mask tokens would give sensible results (it depends on whether the model was trained to predict multiple adjacent mask tokens).
- Generating replacements poses an additional difficulty: we don't know beforehand how many tokens the replacement word would consist of, so naively we would need to try all possible sequences `<mask>`, `<mask><mask>`, `<mask><mask><mask>`, and so on, up to a reasonable limit on the number of tokens in a word.
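A sketch of the batching trick from the first point above, assuming a Hugging Face `transformers` masked LM (`bert-base-uncased` is just an example): the batch contains one copy of the input per token, each copy with a different position masked, and we read off the logprob of the original token at the masked position.

```python
# Sketch: per-token "pseudo-logprobs" from an encoder-only (masked) LM.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "The quick brown fox jumped over the lazy dog."
input_ids = tokenizer(text, return_tensors="pt")["input_ids"][0]  # [seq_len]
seq_len = input_ids.shape[0]

# One copy of the input per token, each with a different position masked.
# (For simplicity this also masks [CLS]/[SEP]; in practice skip special tokens.)
batch = input_ids.repeat(seq_len, 1)                  # [seq_len, seq_len]
positions = torch.arange(seq_len)
batch[positions, positions] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(batch).logits                      # [seq_len, seq_len, vocab]

log_probs = torch.log_softmax(logits, dim=-1)
# Logprob of the original token at the position that was masked in each copy.
token_logprobs = log_probs[positions, positions, input_ids]

for tok_id, lp in zip(input_ids.tolist(), token_logprobs.tolist()):
    print(f"{tokenizer.decode([tok_id]):>12}  {lp:8.3f}")
```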
Even if we get the logprobs for a sequence of mask tokens, how do we interpret them? What we need in order to generate candidate words (and to compute their probability) are _conditional_ probabilities: of the second token given the first, of the third given the first two, and so on, but the logprobs for a sequence of mask tokens don't give us that.
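Not something GPTed's original approach provides, but one way to recover genuinely conditional probabilities from a masked LM is to fill the masks left to right, re-running the model after each committed choice. A greedy sketch, reusing the `model` and `tokenizer` objects from the previous snippet:

```python
# Sketch: greedy left-to-right filling of a span of [MASK] tokens, so the k-th
# prediction is conditioned on the k-1 tokens already filled in.
import torch

def fill_masks_left_to_right(model, tokenizer, input_ids):
    """Returns (filled_ids, total_logprob) for a greedy fill of all [MASK]s."""
    ids = input_ids.clone()                       # shape [1, seq_len]
    total_logprob = 0.0
    while (ids == tokenizer.mask_token_id).any():
        pos = (ids == tokenizer.mask_token_id).nonzero()[0, 1]  # leftmost mask
        with torch.no_grad():
            logits = model(ids).logits
        log_probs = torch.log_softmax(logits[0, pos], dim=-1)
        best = log_probs.argmax()
        total_logprob += log_probs[best].item()   # conditional logprob of this choice
        ids[0, pos] = best                        # commit, then re-run for the next mask
    return ids, total_logprob

enc = tokenizer("The quick brown [MASK] [MASK] over the lazy dog.", return_tensors="pt")
filled, logprob = fill_masks_left_to_right(model, tokenizer, enc["input_ids"])
print(tokenizer.decode(filled[0], skip_special_tokens=True), logprob)
```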
Speculation: either the logprobs at the second `<mask>` in a sequence represent probabilities of tokens at that position _given that the previous token is `<mask>`_ (and, of course, given all the other mask and non-mask tokens at all other positions), or they represent probabilities for tokens at the second position averaged over all possible tokens at position 1, perhaps roughly weighted by the probabilities of the tokens at position 1 (FIXME: is there even a way to know this?)
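One assumption-laden way to probe that FIXME empirically (again reusing `model` and `tokenizer` from above): compare the distribution at the second mask with a probability-weighted mixture of the distributions obtained after filling the first mask with each of its top candidates. If the two are close, the "roughly weighted average" reading looks more plausible; if not, the second position really is being conditioned on a literal mask token.

```python
# Probe sketch: does the second-mask distribution look like a weighted average
# over candidate fillings of the first mask, or like conditioning on [MASK]?
import torch

enc = tokenizer("The quick brown [MASK] [MASK] over the lazy dog.", return_tensors="pt")
ids = enc["input_ids"]
mask_pos = (ids[0] == tokenizer.mask_token_id).nonzero().squeeze(-1)
p1, p2 = mask_pos[0].item(), mask_pos[1].item()

with torch.no_grad():
    logits = model(ids).logits[0]
dist1 = torch.softmax(logits[p1], dim=-1)         # p(token at pos1 | both masked)
dist2_masked = torch.softmax(logits[p2], dim=-1)  # p(token at pos2 | pos1 = [MASK])

# Fill pos1 with each of its top-k candidates and recompute the pos2 distribution.
topk = dist1.topk(20)
mixture = torch.zeros_like(dist2_masked)
for prob, tok in zip(topk.values, topk.indices):
    filled = ids.clone()
    filled[0, p1] = tok
    with torch.no_grad():
        cond = torch.softmax(model(filled).logits[0, p2], dim=-1)
    mixture += prob * cond
mixture /= topk.values.sum()                      # renormalise over the top-k

# Small KL => the "weighted average" reading; large KL => literal conditioning on [MASK].
kl = torch.sum(dist2_masked * ((dist2_masked + 1e-12).log() - (mixture + 1e-12).log()))
print("KL(masked || mixture) =", kl.item())
```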