Rating issue with the "gsm8k" eval

#398
by Korabbit - opened

I ran my own evaluation to check whether the leaderboard scores match what I get locally.
The following tasks and scores were evaluated.

[image.png: per-task scores from my local run compared to the leaderboard]

As you can see in the image, only the "GSM8K" score is significantly different.
Is there a reason for this result?
My "gsm8k" settings are as follows:

python main.py --model=hf-causal \
            --model_args="pretrained=Korabbit/Llama-2-7b-chat-hf-afr-100step-v2" \
            --tasks gsm8k \
            --num_fewshot 5 \
            --batch_size 4 \
            --output_path result/gsm8k \
            --device cuda:2
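
For reference, the same run expressed programmatically (a minimal sketch, assuming the installed harness exposes lm_eval.evaluator.simple_evaluate with these arguments, as its main.py does):

from lm_eval import evaluator

# Same settings as the command above; results["results"] is keyed by task name.
results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=Korabbit/Llama-2-7b-chat-hf-afr-100step-v2",
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=4,
    device="cuda:2",
)
print(results["results"]["gsm8k"])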
Hugging Face H4 org
edited Nov 24, 2023

Hi, thank you for opening this issue!

Are you using the same version of the harness as we are? (see the command in the About tab of the leaderboard)
We are also using a batch size of 1, which can change results slightly.

I have the same issue as @Korabbit

I tested two versions of the harness: the current master branch and commit "b281b0921b".
I believe "b281b0921b" is the same version the Open LLM Leaderboard uses.
My model scored 0.5746 on GSM8K in my local run, yet it got 0.2631 on the leaderboard.
(num_fewshot=5, batch_size=2, dtype=float16)

I think there are some differences between the leaderboard settings (or script) and mine.
It also seems that other models are affected.
(The official meta-math/MetaMath-70B-V1.0 model got 0.4466.)
Can you provide more detailed information about how you run the gsm8k task? @clefourrier

Hugging Face H4 org

Hi @feidfoe and @Korabbit ,

We have reproduced some of your claims and are currently investigating this very seriously. I'll get back to you both ASAP.

Hugging Face H4 org
edited Dec 4, 2023

Hi @feidfoe and @Korabbit ,

Thanks a lot for your input! We inspected our results and found that we had launched ~150 models with a test version of our pipeline rather than the usual one. We re-ran everything this weekend and the models should now be fixed!
(Communication about it here)

clefourrier changed discussion status to closed
Hugging Face H4 org

(For meta-math specifically, it seems to require a very specific system prompt, and we don't support system prompts at the moment.)

Hugging Face H4 org

Hi @aigeek0x0 ,
This was fixed in December and should not be occurring for models evaluated after Dec 4.

To help us investigate your problem, can you:

  • check the details of your model on GSM8K? (you can access the details dataset by clicking the icon next to the model name, then use the datasets library to load the GSM8K subset; see the sketch after this list)
  • try to reproduce our results/outputs using the commands in the About tab of the leaderboard?
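
For the first point, here is a minimal sketch using the datasets library. The repository id, configuration name, and split below are assumptions based on typical details-dataset naming, so copy the exact values from the dataset page linked next to your model:

from datasets import load_dataset

# Hypothetical identifiers: replace them with the details dataset and the
# GSM8K config shown on the page linked from your model's leaderboard entry.
details = load_dataset(
    "open-llm-leaderboard/details_Korabbit__Llama-2-7b-chat-hf-afr-100step-v2",
    "harness_gsm8k_5",
    split="latest",
)
print(details[0])  # inspect the prompt, model output, and metric for one example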
