cannot reproduce gsm8k score with vllm

#596
by HanNayeoniee - opened

Dear Maintainers,
thanks for making and sharing this leaderboard!

This is what I've done so far.

I submitted my model to this leaderboard, and I was able to reproduce the score with the harness version specified in the About tab (b281b09).
However, it took me more than 15 hours to evaluate gsm8k alone, which is too long.

So I tried evaluating it with vllm from the main branch, which took only about 1.5 hours (well worth the wait).
But even though I used the same few-shot examples and batch_size as the leaderboard, I couldn't reproduce the score.

I got a gsm8k score of 72.48 when using vllm, while the leaderboard reports 68.54: https://huggingface.co/datasets/open-llm-leaderboard/details_HanNayeoniee__LHK_DPO_v1

Is there any way to reproduce the score using vllm instead of hf-causal?

Hugging Face H4 org
edited Feb 19

Hi, thanks for your issue!

I don't know what the differences are between the vllm and hf-causal inference implementations in the harness, but we will keep using the latter for now, to ensure full reproducibility across the different model evals.
If you want to reproduce your model's results in an acceptable time, you could run hf-causal with max samples = 20, and check whether you get the same logprobs/generations for that selection, wdyt?
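For reference, that comparison might look something like the sketch below. This assumes the harness CLI at the pinned commit (b281b09), where the entry point is `main.py` and the backend is selected with `--model hf-causal`; exact flag names can differ between harness versions, and the model id is taken from the details dataset above, so treat both commands as illustrative rather than exact:

```shell
# Baseline: hf-causal backend, limited to the first 20 samples so it finishes quickly
python main.py \
    --model hf-causal \
    --model_args pretrained=HanNayeoniee/LHK_DPO_v1 \
    --tasks gsm8k \
    --num_fewshot 5 \
    --batch_size 1 \
    --limit 20 \
    --output_path results_hf.json

# Same selection through the vllm backend (newer harness versions), for comparison
python main.py \
    --model vllm \
    --model_args pretrained=HanNayeoniee/LHK_DPO_v1 \
    --tasks gsm8k \
    --num_fewshot 5 \
    --batch_size 1 \
    --limit 20 \
    --output_path results_vllm.json
```

If the per-sample generations in the two output files already diverge on those 20 samples, the discrepancy lies between the two backends rather than in the leaderboard setup.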

If you reproduce our results, then the discrepancy between hf-causal and vllm should be raised on the harness side. If you don't, we might have a bug somewhere and we'll investigate asap.

Hugging Face H4 org

Closing for inactivity, feel free to reopen if needed

clefourrier changed discussion status to closed
