Results on Wikitext-2 with GPT2 don't match paper

#1
by brucardoso2 - opened

Hey, I ran the example code and compared the results against those reported at https://huggingface.co/docs/transformers/perplexity, using gpt2 and wikitext-2-raw-v1.

  • The values reported in that page range from 16.44 to 19.64 (depending on the size of the stride)
  • The value achieved using this lib is 546.62

The difference is quite big. Am I missing something?


Code:

```python
import datasets
import evaluate

# Load the raw test split and drop empty lines
input_texts = datasets.load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"]
input_texts = [s for s in input_texts if s != ""]

perplexity = evaluate.load("perplexity", module_type="measurement")
results = perplexity.compute(model_id="gpt2", data=input_texts)
print(results["mean_perplexity"])
```
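One likely source of the gap: the `perplexity` measurement scores each line of Wikitext independently and then averages the per-line perplexities, while the docs page computes a single token-weighted perplexity over the whole concatenated corpus with a sliding window. Averaging per-sequence perplexities over many very short lines can inflate the result dramatically. A minimal sketch with made-up per-token losses (the numbers are illustrative, not measured):

```python
import math

# Hypothetical (avg negative log-likelihood in nats, token count) per snippet.
# Short lines with no context tend to get high losses.
seqs = [
    (2.0, 5),    # short line, high loss
    (6.0, 3),    # very short line, very high loss
    (2.5, 200),  # long passage, moderate loss
]

# What `mean_perplexity` does: average the per-sequence perplexities.
mean_ppl = sum(math.exp(nll) for nll, _ in seqs) / len(seqs)

# Token-weighted corpus perplexity, closer to the strided evaluation
# described in the transformers perplexity docs.
total_nll = sum(nll * n for nll, n in seqs)
total_tokens = sum(n for _, n in seqs)
corpus_ppl = math.exp(total_nll / total_tokens)

print(round(mean_ppl, 2), round(corpus_ppl, 2))  # mean is much larger
```

The two quantities only coincide when every sequence has the same length and loss, which is far from true for line-split Wikitext.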