The lm-evaluation-harness results are different from the leaderboard results.

#659
by jisukim8873 - opened

When I run gsm8k and the other benchmarks with lm-evaluation-harness, the results are very different from the leaderboard results. Has anyone else experienced the same thing? Can anyone tell me what could be the cause?

Below are the commands I used:

lm_eval --model hf --model_args pretrained=$1 --tasks arc_challenge --device cuda:1 --num_fewshot 25 --batch_size 2 --output_path $2/arc
lm_eval --model hf --model_args pretrained=$1 --tasks hellaswag --device cuda:1 --num_fewshot 10 --batch_size 1 --output_path $2/hellaswag
lm_eval --model hf --model_args pretrained=$1 --tasks mmlu --device cuda:1 --num_fewshot 5 --batch_size 2 --output_path $2/mmlu
lm_eval --model hf --model_args pretrained=$1 --tasks truthfulqa --device cuda:1 --num_fewshot 0 --batch_size 2 --output_path $2/truthfulqa
lm_eval --model hf --model_args pretrained=$1 --tasks winogrande --device cuda:1 --num_fewshot 5 --batch_size 1 --output_path $2/winogrande
lm_eval --model hf --model_args pretrained=$1 --tasks gsm8k --device cuda:1 --num_fewshot 5 --batch_size 1 --output_path $2/gsm8k

Open LLM Leaderboard org

Hi!
Did you make sure to follow the steps for reproducibility in the About tab, and to use the same lm_eval commit as we do?
The way evaluations are computed has changed quite a lot in the harness over the last year.
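For reference, a minimal sketch of pinning the harness to the commit the leaderboard uses; <leaderboard-commit> is a placeholder for the hash documented in the About tab, not a value from this thread:

# Sketch: install the harness from the specific commit listed in the leaderboard's About tab
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout <leaderboard-commit>   # placeholder for the documented commit hash
pip install -e .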

Thank you for the answer :)

I checked my lm_eval install and it is on a different commit than the one the leaderboard uses, which was causing the discrepancy.
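For anyone hitting the same mismatch, a quick way to see which commit a local clone is on (assuming the harness was installed in editable mode from a git checkout):

# Sketch: show the currently checked-out commit of the local harness clone
cd lm-evaluation-harness
git log -1 --oneline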

jisukim8873 changed discussion status to closed
