The lm-evaluation-harness results are different from the leaderboard results.
When I run gsm8k and the other benchmarks with lm-evaluation-harness, the results are very different from the leaderboard results. Has anyone else experienced this? Can anyone tell me what the cause could be?
Below are the commands I used:
lm_eval --model hf --model_args pretrained=$1 --tasks arc_challenge --device cuda:1 --num_fewshot 25 --batch_size 2 --output_path $2/arc
lm_eval --model hf --model_args pretrained=$1 --tasks hellaswag --device cuda:1 --num_fewshot 10 --batch_size 1 --output_path $2/hellaswag
lm_eval --model hf --model_args pretrained=$1 --tasks mmlu --device cuda:1 --num_fewshot 5 --batch_size 2 --output_path $2/mmlu
lm_eval --model hf --model_args pretrained=$1 --tasks truthfulqa --device cuda:1 --num_fewshot 0 --batch_size 2 --output_path $2/truthfulqa
lm_eval --model hf --model_args pretrained=$1 --tasks winogrande --device cuda:1 --num_fewshot 5 --batch_size 1 --output_path $2/winogrande
lm_eval --model hf --model_args pretrained=$1 --tasks gsm8k --device cuda:1 --num_fewshot 5 --batch_size 1 --output_path $2/gsm8k
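For convenience, the per-task commands above can also be driven from a single loop. This is only a sketch: it reuses the same tasks and few-shot settings as above, with $1 as the model path and $2 as the output directory, and fixes --batch_size to 1 for simplicity (the original commands vary it between 1 and 2).

#!/bin/bash
# Sketch: run the same tasks and few-shot settings as the commands above in one loop.
# $1 = pretrained model path, $2 = output directory, as in the individual commands.
MODEL=$1
OUT=$2

# Few-shot counts taken from the commands above.
declare -A FEWSHOT=(
  [arc_challenge]=25
  [hellaswag]=10
  [mmlu]=5
  [truthfulqa]=0
  [winogrande]=5
  [gsm8k]=5
)

for task in "${!FEWSHOT[@]}"; do
  lm_eval --model hf \
    --model_args pretrained="$MODEL" \
    --tasks "$task" \
    --device cuda:1 \
    --num_fewshot "${FEWSHOT[$task]}" \
    --batch_size 1 \
    --output_path "$OUT/$task"
done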
Hi!
Did you make sure to follow the steps for reproducibility in the About section, and to use the same lm_eval commit as we do?
The way evaluations are computed has changed quite a lot in the harness over the last year.
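For reference, pinning the harness to the exact commit listed in the leaderboard's About section looks roughly like this; <leaderboard_commit> below is a placeholder, so substitute the actual hash given there.

# Sketch: install lm-evaluation-harness at the commit pinned by the leaderboard.
# <leaderboard_commit> is a placeholder; copy the real hash from the About section.
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout <leaderboard_commit>
pip install -e .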
Thank you for the answer :)
I checked my lm_eval version and it indeed differs from the commit the leaderboard uses, which is what was causing the discrepancy.