GSM8K score largely different from local run

#591
by mobicham - opened

When I run the model locally I get a GSM8K (5-shot) score of 58.60, while the leaderboard reports 54.89: https://huggingface.co/datasets/open-llm-leaderboard/details_mobiuslabsgmbh__aanaphi2-v0.1
The rest of the scores are also slightly different, but GSM8K is the only one that is off by a lot (-3.71 points).
Is there some flag to set or something? I basically run it like this (with the latest lm-eval version):

import lm_eval

# model and tokenizer are assumed to be loaded already (e.g. via transformers)
model.eval()
model.config.use_cache = False

lm_eval.tasks.initialize_tasks()
model_eval = lm_eval.models.huggingface.HFLM(pretrained=model, tokenizer=tokenizer)
result = lm_eval.evaluator.simple_evaluate(model_eval, tasks=["gsm8k"], num_fewshot=5, batch_size=8)["results"]

Thank you in advance!

Good question, I'd like to know more about the GSM8K eval as well.
One thing that jumps out is that you're using a batch_size of 8, while HF uses a batch_size of 1.
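
If you build the HFLM object yourself, I think the batch size is picked up from the HFLM constructor rather than from simple_evaluate, so something like this should match the leaderboard setting (a rough sketch reusing your variables, untested on my side):

model_eval = lm_eval.models.huggingface.HFLM(pretrained=model, tokenizer=tokenizer, batch_size=1)
result_bs1 = lm_eval.evaluator.simple_evaluate(model_eval, tasks=["gsm8k"], num_fewshot=5)["results"]
print(result_bs1["gsm8k"])  # compare against the leaderboard's reported GSM8K metrics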

I ran it with different batch sizes and was still getting 58.60, but I haven't tried a batch_size of 1. Will try that, thanks!

Hugging Face H4 org

Hi! You can find reproducibility info in the About tab of the leaderboard. Let us know if you encounter any more issues!
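
If you want to see where the scores diverge, the per-sample predictions behind the leaderboard number are in the details dataset you linked above. Something along these lines should load them; the config and split names here are guesses on my part, so check the dataset card for the exact ones:

from datasets import load_dataset

# "harness_gsm8k_5" and "latest" are assumed names; the dataset card lists the real configs/splits.
details = load_dataset(
    "open-llm-leaderboard/details_mobiuslabsgmbh__aanaphi2-v0.1",
    "harness_gsm8k_5",
    split="latest",
)
print(details[0])  # one prompt/prediction/target record to compare against your local run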

Thanks! It says "for GSM8K, we select the score obtained in the paper after finetuning a 6B model on the full GSM8K training set for 50 epochs"; that might explain why the scores are so different. Curious to know why not just use the accuracy from GSM8K?

Hugging Face H4 org

Regarding the comment you pointed out from the paper, I assume they would simply have gotten a worse score without the fine-tuning. A lot of scores reported in papers/tech reports are not produced in a reproducible setup, but in one that is advantageous for the evaluated model (like using CoT instead of few-shot prompting, or reporting results on a fine-tuned model instead of the base one, as in the case you pointed out).
This is typically one of the use cases for which the leaderboard is an interesting resource: we evaluate all models in exactly the same setup, so that scores are actually comparable.

Hugging Face H4 org

Side note: when you try to reproduce results from the leaderboard, please make sure that you use the same harness commit as we do :)
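
For instance, something like this is a quick sanity check of what is actually installed in your environment; the commit hash below is just a placeholder, the real one is listed in the About tab, and the printed version only tells you the package release, so keep track of the commit you installed from:

# Install the exact revision the leaderboard uses, e.g. (placeholder hash):
#   pip install "git+https://github.com/EleutherAI/lm-evaluation-harness.git@<leaderboard_commit>"
import importlib.metadata
print(importlib.metadata.version("lm_eval"))  # prints the installed harness package version
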
Since it would seem that your issue is explained by the above comment, I'm going to close, but feel free to reopen if needed.

clefourrier changed discussion status to closed
