Cannot reproduce GSM8K accuracy of mncai/Llama2-7B-guanaco-dolphin-500

#527
by zhentaocc - opened

With batch size = 1, the result I got was 13.12, while the reported score is 5.99.
I was using: `python main.py --model=hf-causal-experimental --model_args="pretrained=mncai/Llama2-7B-guanaco-dolphin-500" --tasks=gsm8k --num_fewshot=5 --batch_size=1 --no_cache`
I also found that different batch size settings result in different accuracies.
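
In case it helps, here is roughly how I swept the batch sizes (a minimal sketch, assuming you run it from the root of the lm-evaluation-harness checkout at the pinned commit; the `--output_path` flag should exist there, but check `python main.py --help`):

```python
# Sketch: re-run the same GSM8K eval at several batch sizes and save each
# result file separately, so the scores can be compared afterwards.
import subprocess

for bs in (1, 2, 8):
    subprocess.run(
        [
            "python", "main.py",
            "--model=hf-causal-experimental",
            "--model_args=pretrained=mncai/Llama2-7B-guanaco-dolphin-500",
            "--tasks=gsm8k",
            "--num_fewshot=5",
            f"--batch_size={bs}",
            "--no_cache",
            f"--output_path=results_bs{bs}.json",
        ],
        check=True,
    )
```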

Hugging Face H4 org

Hi! Did you use the specific commit we report in our About page, and the same precision as the evaluation mentioned above?
If yes, could you please link to the request and result files?

Where is the result file?

Hugging Face H4 org

Hi @zhentaocc ,
Please follow the steps in the FAQ (About tab of the leaderboard) to find the request and results files for your specific model of interest.
Can you also confirm that you used the same commit as we did?
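
For example, something along these lines should pull the stored files down (a sketch assuming the public `open-llm-leaderboard/results` dataset keeps one folder per model; adjust the repo id and pattern if the FAQ says otherwise):

```python
# Sketch: download the leaderboard's stored result files for this model.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="open-llm-leaderboard/results",
    repo_type="dataset",
    allow_patterns=["mncai/Llama2-7B-guanaco-dolphin-500/*"],
)
print(local_dir)  # the JSON result files live under this path
```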

Yes, I used the same commit. @clefourrier

Hugging Face H4 org

You can find your result file here. Could you link it so that we can take a look?

Any update here? @SaylorTwift @clefourrier

Hugging Face H4 org
edited Feb 19

Hi!
Can you compare the predictions of your run to the detailed predictions stored here for this model?

> I also found that different batch size settings result in different accuracies.

Yes, this is a known issue of the harness at this commit.

Hugging Face H4 org

Hi @zhentaocc ,
I meant actually logging the different predictions you get for each sample :)
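
A quick diff along these lines would do (a sketch; `my_run.json`, `leaderboard_run.json`, and the `"prediction"` field are placeholders for whatever your dumps actually contain):

```python
# Sketch: print the samples where two runs' predictions disagree.
# Assumes both files hold a JSON list of records in the same sample order.
import json

with open("my_run.json") as f1, open("leaderboard_run.json") as f2:
    mine, theirs = json.load(f1), json.load(f2)

for i, (a, b) in enumerate(zip(mine, theirs)):
    if a["prediction"] != b["prediction"]:
        print(f"sample {i}:")
        print("  mine:  ", a["prediction"])
        print("  theirs:", b["prediction"])
```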

I wonder if you have tried running the benchmark for this model yourselves, and what result you got. Were you able to reproduce the reported number?

Hugging Face H4 org

Hi @zhentaocc ,
Thanks a lot for providing the details of your outputs! 🙏
They allowed me to pinpoint the problem: looking in detail at the difference between your outputs and ours, it seems like the outputs on our side were truncated on `.\n` too early.

It's a known bug we identified last year for some models and fixed in December by re-running 150 models (a mistake on our side: we accidentally used a test version in prod; we communicated about it on Twitter at the time).
I'm very sorry we missed your model! I have relaunched its evaluations.
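
To illustrate what the bug looked like (this is illustrative only, not the actual harness code): cutting the generation at the first `.\n` stop sequence chops off a multi-step GSM8K answer before the final answer line.

```python
# Illustrative only — shows how an overly eager ".\n" stop sequence
# truncates a chain-of-thought answer before the "#### <number>" line.
generation = (
    "She sells 16 - 3 - 4 = 9 eggs.\n"
    "She makes 9 * 2 = 18 dollars.\n"
    "#### 18"
)
truncated = generation.split(".\n")[0]
print(truncated)  # "She sells 16 - 3 - 4 = 9 eggs" — final answer lost
```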

Closing, as the problem was identified; we'll re-check all results files to make sure no other models fell through.
Feel free to reopen if needed.

clefourrier changed discussion status to closed

@clefourrier I see the new result now, and it's much more reasonable. But it's still a bit different from my side: 13.12 vs 12.74.

Hugging Face H4 org

This is the kind of difference that falls within expected margins between different hardware setups, for example; I'm not surprised.
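
For scale, a rough back-of-the-envelope (assuming the full 1,319-question GSM8K test split) puts that gap at about five questions:

```python
# Rough scale check: how many questions does a 0.38-point gap represent?
gsm8k_test_size = 1319           # size of the GSM8K test split
gap_points = 13.12 - 12.74       # accuracy difference in percentage points
print(round(gsm8k_test_size * gap_points / 100))  # -> 5
```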
