Failing to reproduce results on several benchmarks using lm-evaluation-harness

#18 opened by Zhuangl

I am trying to reproduce gemma-2b's accuracy on the HellaSwag benchmark, but I only get acc@hellaswag: 0.3415 using lm-evaluation-harness in the zero-shot setting, far from the 0.714 reported in the model card. In addition, results on other benchmarks, such as ARC-e, ARC-c, and PIQA, also fail to match the reported numbers.
Is there anything I missed to reproduce your results? The commands I used are as follows:

for task in wikitext lambada_openai winogrande piqa sciq wsc arc_easy arc_challenge logiqa hellaswag mmlu boolq openbookqa
do
    lm_eval --model hf \
        --model_args pretrained=/path/to/gemma-2b/ \
        --tasks $task \
        --device cuda:0 \
        --batch_size 1
done
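
(For what it's worth, I believe recent lm-eval versions also accept a comma-separated task list and an --output_path flag, so the loop above can be collapsed into a single call along these lines; the paths are placeholders.)

# single call over several tasks; results are written under ./results
lm_eval --model hf \
    --model_args pretrained=/path/to/gemma-2b/ \
    --tasks hellaswag,arc_easy,arc_challenge,piqa \
    --device cuda:0 \
    --batch_size 1 \
    --output_path ./results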

Hi @Zhuangl!
I suggest you use the lighteval library (https://github.com/huggingface/lighteval), which should support Gemma. cc @clefourrier @SaylorTwift
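
In case it helps, here is a rough sketch of what a lighteval v0 run looks like, based on its README at the time; the script name, flag names, and the task string format (suite|task|num_fewshot|truncate) are assumptions here and may have changed, so please check the repo's current README:

# install from source
git clone https://github.com/huggingface/lighteval && cd lighteval
pip install -e .

# zero-shot HellaSwag on gemma-2b; model id and output dir are placeholders
python run_evals_accelerate.py \
    --model_args "pretrained=google/gemma-2b" \
    --tasks "lighteval|hellaswag|0|0" \
    --output_dir ./evals/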

Thanks @ybelkada!
I will try lighteval.

Hi!
If you want to reproduce the numbers we report on the Open LLM Leaderboard, you can use this version of the Eleuther AI Harness with the following command (details are on the About page of the Leaderboard):

python main.py --model=hf-causal-experimental --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>" --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=1 --output_path=<output_path>
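
For example, a hypothetical instantiation for gemma-2b on HellaSwag could look like this (the model id, revision, and output path are placeholders; check the About page for the per-task few-shot counts, e.g. 10-shot for HellaSwag):

# hypothetical example: gemma-2b on HellaSwag with the Leaderboard's 10-shot setting
python main.py \
    --model=hf-causal-experimental \
    --model_args="pretrained=google/gemma-2b,use_accelerate=True,revision=main" \
    --tasks=hellaswag \
    --num_fewshot=10 \
    --batch_size=1 \
    --output_path=./results/gemma-2b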

However, you can also use lighteval, which is in v0 at the moment, to experiment with evaluation; I think @SaylorTwift bumped transformers to a version supporting Gemma yesterday (but it is not meant for reproducing Open LLM Leaderboard results).


Hi @Zhuangl, did this end up fixing the differences?
