---
license: cc-by-2.0
tags:
- logprobs
- logits
- CausalLM
---

The *OpenAI* API can return per-token log-probabilities (for both prompt and completion tokens) through the ``logprobs`` return argument. The ``CausalLM`` models currently return only ``logits``, which are the prediction scores of the language modeling head (one score per vocabulary token, before SoftMax).

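To make the relation between ``logits`` and log-probabilities concrete, here is a minimal sketch in plain Python, assuming a made-up three-token vocabulary. The helper ``log_softmax`` and the values in ``toy_logits`` are illustrative and not part of any library:

```python
import math


def log_softmax(logits):
    """Convert a list of raw scores into log-probabilities."""
    # Subtract the maximum first so that exp() cannot overflow.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


# Hypothetical logits over a three-token vocabulary.
toy_logits = [2.0, 1.0, 0.1]
toy_logprobs = log_softmax(toy_logits)
```

Exponentiating the entries of ``toy_logprobs`` gives probabilities that sum to one. On tensors, ``torch.nn.functional.log_softmax`` performs the same computation.
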
The following code shows how to retrieve per-token log-probabilities from a ``CausalLM`` with the Hugging Face API:

```python
import torch
import torch.nn.functional as F


def logprobs_from_prompt(prompt, tokenizer, model):
    encoded = tokenizer(prompt, return_tensors="pt").to("cpu")
    input_ids = encoded["input_ids"]
    with torch.no_grad():  # inference only, no gradients needed
        output = model(input_ids=input_ids)
    # The logits at position i score the token at position i + 1,
    # so drop the first label and the last logit row to align them.
    shift_labels = input_ids[..., 1:].contiguous()
    shift_logits = output.logits[..., :-1, :].contiguous()
    # The first token has no preceding context, hence no log-probability.
    log_probs = [(tokenizer.decode(input_ids[0].tolist()[0]), None)]
    for label_id, logit in zip(shift_labels[0].tolist(), shift_logits[0]):
        logprob = F.log_softmax(logit, dim=0)[label_id].item()
        log_probs.append((tokenizer.decode(label_id), logprob))
    return log_probs
```

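The shift in ``logprobs_from_prompt`` pairs each row of logits with the token that follows it. A minimal sketch with plain lists may help; the token ids and the placeholder row names are made up for illustration:

```python
# Hypothetical token ids for a four-token prompt.
input_ids = [5, 17, 42, 8]
# One placeholder "logit row" per input position; row i predicts token i + 1.
logit_rows = ["row0", "row1", "row2", "row3"]

shift_labels = input_ids[1:]   # tokens that have a preceding context
shift_rows = logit_rows[:-1]   # the last row predicts a token we do not have

# Each pair is (row of scores, id of the token those scores are judged on).
pairs = list(zip(shift_rows, shift_labels))
```

Here ``row0`` is scored against token ``17``, ``row1`` against ``42``, and ``row2`` against ``8``; the first token ``5`` is never scored, which is why its entry in the result is ``None``.
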
An example call would be:

```python
from transformers import GPT2Tokenizer, OPTForCausalLM

# "facebook/opt-125m" is an assumed example; any OPT checkpoint should work.
checkpoint = "facebook/opt-125m"
tokenizer = GPT2Tokenizer.from_pretrained(checkpoint)
model = OPTForCausalLM.from_pretrained(checkpoint)
prompt = "The horse raced past the barn fell."
logprobs = logprobs_from_prompt(prompt, tokenizer, model)
```

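The list returned by ``logprobs_from_prompt`` can be aggregated into sequence-level scores. The sketch below assumes the ``(token, logprob)`` format shown earlier; the helper ``sequence_stats`` and the values in ``example_logprobs`` are made up for illustration and are not model output:

```python
import math

# Made-up values in the (token, logprob) format; the first entry has no
# log-probability because nothing precedes it.
example_logprobs = [("The", None), (" horse", -6.0), (" raced", -8.0), (" past", -2.0)]


def sequence_stats(token_logprobs):
    """Return (total log-probability, perplexity) over the scored tokens."""
    scored = [lp for _, lp in token_logprobs if lp is not None]
    total = sum(scored)
    # Perplexity is the exponential of the mean negative log-probability.
    perplexity = math.exp(-total / len(scored))
    return total, perplexity


total, perplexity = sequence_stats(example_logprobs)
```

Skipping the ``None`` entry keeps the unscored first token out of both the sum and the token count.
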
For the derivation and an explanation, see this [discussion](https://huggingface.co/bigscience/bloom/discussions/89#6321dcc9b97c618f9a5e3dac).