The lm-evaluation-harness results are different from the leaderboard results.
When I run gsm8k and the other benchmarks with lm-evaluation-harness, the results are very different from the leaderboard results. Has anyone else experienced this? Can anyone tell me what the cause could be?
Below are the commands I used:
lm_eval --model hf --model_args pretrained=$1 --tasks arc_challenge --device cuda:1 --num_fewshot 25 --batch_size 2 --output_path $2/arc
lm_eval --model hf --model_args pretrained=$1 --tasks hellaswag --device cuda:1 --num_fewshot 10 --batch_size 1 --output_path $2/hellaswag
lm_eval --model hf --model_args pretrained=$1 --tasks mmlu --device cuda:1 --num_fewshot 5 --batch_size 2 --output_path $2/mmlu
lm_eval --model hf --model_args pretrained=$1 --tasks truthfulqa --device cuda:1 --num_fewshot 0 --batch_size 2 --output_path $2/truthfulqa
lm_eval --model hf --model_args pretrained=$1 --tasks winogrande --device cuda:1 --num_fewshot 5 --batch_size 1 --output_path $2/winogrande
lm_eval --model hf --model_args pretrained=$1 --tasks gsm8k --device cuda:1 --num_fewshot 5 --batch_size 1 --output_path $2/gsm8k
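For convenience, the per-task commands above can also be driven from a single loop. This is only a sketch: it reuses the same tasks and few-shot settings as above, with $1 as the model path and $2 as the output directory, and fixes --batch_size to 1 for simplicity (the original commands vary it between 1 and 2).

#!/bin/bash
# Sketch: run the same tasks and few-shot settings as the commands above in one loop.
# $1 = pretrained model path, $2 = output directory, as in the individual commands.
MODEL=$1
OUT=$2

# Few-shot counts taken from the commands above.
declare -A FEWSHOT=(
  [arc_challenge]=25
  [hellaswag]=10
  [mmlu]=5
  [truthfulqa]=0
  [winogrande]=5
  [gsm8k]=5
)

for task in "${!FEWSHOT[@]}"; do
  lm_eval --model hf \
    --model_args pretrained="$MODEL" \
    --tasks "$task" \
    --device cuda:1 \
    --num_fewshot "${FEWSHOT[$task]}" \
    --batch_size 1 \
    --output_path "$OUT/$task"
done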
Hi!
Did you make sure to follow the steps for reproducibility in the About section, and to use the same lm_eval commit as we do?
The way evaluations are computed has changed quite a lot in the harness over the last year.
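For reference, pinning the harness to the exact commit listed in the leaderboard's About section looks roughly like this; <leaderboard_commit> below is a placeholder, so substitute the actual hash given there.

# Sketch: install lm-evaluation-harness at the commit pinned by the leaderboard.
# <leaderboard_commit> is a placeholder; copy the real hash from the About section.
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout <leaderboard_commit>
pip install -e .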
Thank you for the answer :)
I checked my lm_eval version and it indeed differs from the commit the leaderboard uses, which is what was causing the discrepancy.