Batched perplexity incorrect for shorter inputs if padding_side == 'left'

#7 opened by spyysalo

When using a tokenizer with padding_side == 'left', perplexity results differ between individual and batched inputs whenever at least one tokenized input in the batch is shorter than the longest one. For example:

>>> from evaluate import load
>>> perplexity = load("perplexity", module_type="metric")
>>> perplexity.compute(predictions=['Hello.'], model_id='bigscience/bloom-560m')
{'perplexities': [1127.75439453125], 'mean_perplexity': 1127.75439453125}
>>> perplexity.compute(predictions=['Hello there.'], model_id='bigscience/bloom-560m')
{'perplexities': [152.9550018310547], 'mean_perplexity': 152.9550018310547}
>>> perplexity.compute(predictions=['Hello.', 'Hello there.'], model_id='bigscience/bloom-560m')
{'perplexities': [230469824.0, 152.96017456054688], 'mean_perplexity': 115234988.48008728}

This is not an issue if padding_side == 'right', for example:

>>> perplexity.compute(predictions=['Hello.'], model_id='gpt2')
{'perplexities': [394.9745178222656], 'mean_perplexity': 394.9745178222656}
>>> perplexity.compute(predictions=['Hello there.'], model_id='gpt2')
{'perplexities': [93.656982421875], 'mean_perplexity': 93.656982421875}
>>> perplexity.compute(predictions=['Hello.', 'Hello there.'], model_id='gpt2')
{'perplexities': [394.9747314453125, 93.65707397460938], 'mean_perplexity': 244.31590270996094}

The current implementation appears to implicitly assume right padding when shifting the attention mask: https://github.com/huggingface/evaluate/blob/main/metrics/perplexity/perplexity.py#L183 always drops the first column, which, with left padding, corresponds to pad-token positions for the shorter inputs. Replacing this line with

if tokenizer.padding_side == 'left':
    # pads are at the start: drop the last column so that predictions made
    # from pad-only context are masked out
    shift_attention_mask_batch = attn_mask[..., :-1].contiguous()
else:
    # pads (if any) are at the end: dropping the first column is safe
    shift_attention_mask_batch = attn_mask[..., 1:].contiguous()

appears to resolve the issue.
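
To make the misalignment concrete, here is a small standalone sketch (not code from the metric) comparing what the two slicings keep for a left-padded attention mask, assuming the standard causal-LM shift of labels by one position:

import torch

# toy left-padded attention mask: a 3-token input and a 5-token input
# padded to length 5 (1 = real token, 0 = pad)
attn_mask = torch.tensor([[0, 0, 1, 1, 1],
                          [1, 1, 1, 1, 1]])

# current slicing (drop the first column): the entry weighting the loss for
# the shorter input's first real token, predicted from pad-only context, stays 1
print(attn_mask[..., 1:])   # [[0, 1, 1, 1], [1, 1, 1, 1]]

# proposed slicing for left padding (drop the last column): that entry becomes 0,
# so the pad-context prediction is excluded
print(attn_mask[..., :-1])  # [[0, 0, 1, 1], [1, 1, 1, 1]]

With the current slicing, the shorter input's first real token is scored from a context consisting only of pad tokens, which would explain the inflated perplexity seen in the batched BLOOM example above.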
