Failing to reproduce results on several benchmarks using lm-evaluation-harness

#18 opened by Zhuangl

I am trying to reproduce gemma-2b's accuracy on the HellaSwag benchmark, but I only get acc@hellaswag: 0.3415 using lm-evaluation-harness in the zero-shot setting, far from the 0.714 reported in the model card. In addition, results on other benchmarks, such as ARC-e, ARC-c, and PIQA, also fail to match the reported numbers.
Is there anything I missed to reproduce your results? The commands I used are as follows:

for task in wikitext lambada_openai winogrande piqa sciq wsc arc_easy arc_challenge logiqa hellaswag mmlu boolq openbookqa
do
    lm_eval --model hf \
        --model_args pretrained=/path/to/gemma-2b/ \
        --tasks $task \
        --device cuda:0 \
        --batch_size 1
done
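
(For what it's worth, I believe recent lm-eval versions also accept a comma-separated task list and an --output_path flag, so the loop above can be collapsed into a single call along these lines; the paths are placeholders.)

# single call over several tasks; results are written under ./results
lm_eval --model hf \
    --model_args pretrained=/path/to/gemma-2b/ \
    --tasks hellaswag,arc_easy,arc_challenge,piqa \
    --device cuda:0 \
    --batch_size 1 \
    --output_path ./results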

Hi @Zhuangl!
I suggest you use the lighteval library (https://github.com/huggingface/lighteval), which should support Gemma. cc @clefourrier @SaylorTwift
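
In case it helps, here is a rough sketch of what a lighteval v0 run looks like, based on its README at the time; the script name, flag names, and the task string format (suite|task|num_fewshot|truncate) are assumptions here and may have changed, so please check the repo's current README:

# install from source
git clone https://github.com/huggingface/lighteval && cd lighteval
pip install -e .

# zero-shot HellaSwag on gemma-2b; model id and output dir are placeholders
python run_evals_accelerate.py \
    --model_args "pretrained=google/gemma-2b" \
    --tasks "lighteval|hellaswag|0|0" \
    --output_dir ./evals/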

Thanks @ybelkada!
I will try lighteval.

Hi!
If you want to reproduce the numbers we report on the Open LLM Leaderboard, you can use this version of the Eleuther AI Harness with the following command (details are on the About page of the Leaderboard):

python main.py --model=hf-causal-experimental --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>" --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=1 --output_path=<output_path>
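
For example, a hypothetical instantiation for gemma-2b on HellaSwag could look like this (the model id, revision, and output path are placeholders; check the About page for the per-task few-shot counts, e.g. 10-shot for HellaSwag):

# hypothetical example: gemma-2b on HellaSwag with the Leaderboard's 10-shot setting
python main.py \
    --model=hf-causal-experimental \
    --model_args="pretrained=google/gemma-2b,use_accelerate=True,revision=main" \
    --tasks=hellaswag \
    --num_fewshot=10 \
    --batch_size=1 \
    --output_path=./results/gemma-2b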

However, you can also use lighteval, which is in v0 at the moment, to experiment with evaluation; I think @SaylorTwift bumped transformers to a version supporting Gemma yesterday (but it is not meant for reproducing Open LLM Leaderboard results).


Hi @Zhuangl, did this end up fixing the differences?
