GSM8K score largely different from local run

#591
by mobicham - opened

When I run the model locally I get a GSM8K (5-shot) score of 58.60, while the leaderboard reports 54.89: https://huggingface.co/datasets/open-llm-leaderboard/details_mobiuslabsgmbh__aanaphi2-v0.1
The rest of the scores are also slightly different, but GSM8K is the only one that is off by a lot (-3.71 points).
Is there some flag to set or something? I basically run it like this (with the latest lm-eval version):

import lm_eval

# model and tokenizer are assumed to be loaded already (e.g. via transformers)
model.eval()
model.config.use_cache = False

lm_eval.tasks.initialize_tasks()
model_eval = lm_eval.models.huggingface.HFLM(pretrained=model, tokenizer=tokenizer)
result = lm_eval.evaluator.simple_evaluate(model_eval, tasks=["gsm8k"], num_fewshot=5, batch_size=8)["results"]

Thank you in advance!

Good question, I'd like to know more about the GSM8K eval as well.
One thing that jumps out is that you're using a batch_size of 8, while HF uses a batch_size of 1.
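
If you build the HFLM object yourself, I think the batch size is picked up from the HFLM constructor rather than from simple_evaluate, so something like this should match the leaderboard setting (a rough sketch reusing your variables, untested on my side):

model_eval = lm_eval.models.huggingface.HFLM(pretrained=model, tokenizer=tokenizer, batch_size=1)
result_bs1 = lm_eval.evaluator.simple_evaluate(model_eval, tasks=["gsm8k"], num_fewshot=5)["results"]
print(result_bs1["gsm8k"])  # compare against the leaderboard's reported GSM8K metrics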

I ran it with different batch sizes and was still getting 58.60, but I haven't tried a batch_size of 1. Will try that, thanks!

Hugging Face H4 org

Hi! You can find reproducibility info in the About tab of the leaderboard. Let us know if you encounter any more issues!
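
If you want to see where the scores diverge, the per-sample predictions behind the leaderboard number are in the details dataset you linked above. Something along these lines should load them; the config and split names here are guesses on my part, so check the dataset card for the exact ones:

from datasets import load_dataset

# "harness_gsm8k_5" and "latest" are assumed names; the dataset card lists the real configs/splits.
details = load_dataset(
    "open-llm-leaderboard/details_mobiuslabsgmbh__aanaphi2-v0.1",
    "harness_gsm8k_5",
    split="latest",
)
print(details[0])  # one prompt/prediction/target record to compare against your local run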

Thanks! It says "for GSM8K, we select the score obtained in the paper after finetuning a 6B model on the full GSM8K training set for 50 epochs"; that might explain why the scores are so different. Curious to know why not just use the accuracy from GSM8K?

Hugging Face H4 org

Regarding the comment you pointed out from the paper, I assume they would simply have gotten a worse score without the fine-tuning. A lot of scores reported in papers/tech reports are not produced in a reproducible setup, but in one that is advantageous for the evaluated model (like using CoT instead of few-shot prompting, or reporting results on a fine-tuned model instead of the base one, as in the case you pointed out).
This is typically one of the use cases for which the leaderboard is an interesting resource: we evaluate all models in exactly the same setup, so that scores are actually comparable.

Hugging Face H4 org

Side note: when you try to reproduce results from the leaderboard, please make sure that you use the same harness commit as we do :)
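
For instance, something like this is a quick sanity check of what is actually installed in your environment; the commit hash below is just a placeholder, the real one is listed in the About tab, and the printed version only tells you the package release, so keep track of the commit you installed from:

# Install the exact revision the leaderboard uses, e.g. (placeholder hash):
#   pip install "git+https://github.com/EleutherAI/lm-evaluation-harness.git@<leaderboard_commit>"
import importlib.metadata
print(importlib.metadata.version("lm_eval"))  # prints the installed harness package version
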
Since it would seem that your issue is explained by the above comment, I'm going to close, but feel free to reopen if needed.

clefourrier changed discussion status to closed
