Pwicke/logprobs_for_CausalLMs

Pwicke committed on Oct 7, 2022

Commit eca6ced • 1 Parent(s): 96dc616

Create README.md

Files changed (1)

README.md +29 -0

README.md ADDED

	@@ -0,0 +1,29 @@

+The *OpenAI* API can return per-token log-probabilities (for both prompt and completion tokens) through the ``logprobs`` return argument. Currently, the ``CausalLM`` classes only return ``logits``, which are the prediction scores of the language modeling head (one score per vocabulary token, before SoftMax).
+The following code shows one way to retrieve the per-token log-probabilities of ``CausalLMs`` with the Hugging Face ``transformers`` API:
+```python
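+import torch
+import torch.nn.functional as F
+
+def logprobs_from_prompt(prompt, tokenizer, model):
+    """Return a list of (token, log-probability) pairs for every token in the prompt."""
+    encoded = tokenizer(prompt, return_tensors="pt").to("cpu")
+    input_ids = encoded["input_ids"]
+    with torch.no_grad():  # inference only, no gradients needed
+        output = model(input_ids=input_ids)
+    # The logits at position i score the token at position i + 1, so shift by one.
+    shift_labels = input_ids[..., 1:].contiguous()
+    shift_logits = output.logits[..., :-1, :].contiguous()
+    # The first token has no preceding context, so it gets no log-probability.
+    log_probs = [(tokenizer.decode(input_ids[0, 0].item()), None)]
+    for label_id, logit in zip(shift_labels[0].tolist(), shift_logits[0]):
+        # Normalize the vocabulary scores, then pick the entry of the observed token.
+        logprob = F.log_softmax(logit, dim=0)[label_id].item()
+        log_probs.append((tokenizer.decode(label_id), logprob))
+    return log_probs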
+```
+An example call would be:
+```python
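+from transformers import GPT2Tokenizer, OPTForCausalLM
+
+# "facebook/opt" alone is not a Hub model ID; "facebook/opt-125m" is assumed here as a small OPT checkpoint.
+tokenizer = GPT2Tokenizer.from_pretrained("facebook/opt-125m")
+model = OPTForCausalLM.from_pretrained("facebook/opt-125m")
+prompt = "The horse raced past the barn fell."
+logprobs = logprobs_from_prompt(prompt, tokenizer, model)
+# logprobs is a list of (token, log-probability) pairs; the first pair has None as its value.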
+```
+For its derivation and explanation see this [discussion](https://huggingface.co/bigscience/bloom/discussions/89#6321dcc9b97c618f9a5e3dac).
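
The shift-by-one alignment used in ``logprobs_from_prompt`` can be checked without downloading a model. The following is a minimal sketch in plain Python, assuming made-up logits for a three-token vocabulary. The helper names ``log_softmax`` and ``token_logprobs`` and the toy data are illustrative only and are not part of ``transformers``.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_norm for s in scores]

def token_logprobs(input_ids, logits):
    """Pair each token ID with the log-probability assigned to it.

    logits[i] holds the vocabulary scores predicted after reading
    input_ids[: i + 1], so it is matched with the token at position i + 1.
    The first token has no preceding context and gets None.
    """
    result = [(input_ids[0], None)]
    for label_id, row in zip(input_ids[1:], logits[:-1]):
        result.append((label_id, log_softmax(row)[label_id]))
    return result

# Toy data (made up for illustration): 3-token vocabulary, 3-token sequence.
toy_ids = [0, 2, 1]
toy_logits = [
    [0.0, 0.0, 0.0],  # uniform prediction for the token at position 1
    [1.0, 3.0, 1.0],  # favors token 1 for position 2
    [5.0, 0.0, 0.0],  # prediction for a token beyond the prompt; unused
]
toy_logprobs = token_logprobs(toy_ids, toy_logits)

# Summing the per-token values gives the log-probability of the
# sequence conditioned on its first token.
total = sum(lp for _, lp in toy_logprobs[1:])
```

The last row of logits is dropped because it predicts a token that is not part of the prompt, which mirrors ``output.logits[..., :-1, :]`` in the function above.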