`_compute` raises `ValueError: Asking to pad ...` when called with batch_size=1 and a tokenizer without a pad_token

#8
opened by spyysalo

The current implementation only adds a pad_token when batch_size > 1 (https://github.com/huggingface/evaluate/blob/main/metrics/perplexity/perplexity.py#L122), but the tokenizer invocation requires a pad_token regardless of batch_size:

```python
>>> perplexity.compute(predictions=['Hello.'], model_id='gpt2')
{'perplexities': [394.9745178222656], 'mean_perplexity': 394.9745178222656}
>>> perplexity.compute(predictions=['Hello.'], model_id='gpt2', batch_size=1)
Traceback (most recent call last):
[...]
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
```
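For reference, the failure is reproducible with the tokenizer alone. Below is a minimal sketch (not the metric's own code), assuming the metric tokenizes with padding=True, which the error message indicates; the fix on the metric's side would presumably be to assign a pad_token whenever the tokenizer lacks one, rather than only when batch_size > 1:

```python
from transformers import AutoTokenizer

# Minimal sketch of the underlying issue (assumed to mirror what the
# metric does internally): GPT-2 ships without a pad_token, so any
# tokenizer call with padding=True fails until one is assigned,
# independent of how many texts are being padded.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

if tokenizer.pad_token is None:
    # Reuse the existing eos_token as pad_token, as the error message suggests.
    tokenizer.pad_token = tokenizer.eos_token

# With a pad_token set, padding succeeds even for a single input.
encodings = tokenizer(['Hello.'], padding=True, return_tensors="pt")
print(encodings["input_ids"].shape)
```

Until the guard is changed, the workaround is simply to leave batch_size at its default (as in the first call above), which triggers the existing pad_token assignment.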
