Spaces:

open-llm-leaderboard
/

open_llm_leaderboard

Running on CPU Upgrade

App Files Files Community

1016

rating issue about "gsm8k" eval.

#398

by Korabbit - opened Nov 24, 2023

Discussion

Korabbit

Nov 24, 2023

I did my own evaluation to verify that the leaderboard evaluation was the same as mine.
The following tasks and scores were evaluated.

As you can see in the image, only the performance of the "GSM8K" is significantly different.
Is there a reason for this result?
My "gsm8k" settings are as follows

python main.py --model=hf-causal \
            --model_args="pretrained=Korabbit/Llama-2-7b-chat-hf-afr-100step-v2" \
            --tasks gsm8k \
            --num_fewshot 5 \
            --batch_size 4 \
            --output_path result/gsm8k \
            --device cuda:2

clefourrier

Open LLM Leaderboard org Nov 24, 2023

•

edited Nov 24, 2023

Hi, thank you for opening this issue!

Are you using the same version of the harness as we are? (see the command in the about)
We are also using a batch size of 1, which can change results slightly.

feidfoe

Dec 1, 2023

I have the same issue with @Korabbit

I tested on two version of harness: "current master branch" and "b281b0921b".
I believe "b281b0921b" is the same version with the Open LLM Leaderboard.
The performance of my model was 0.5746 in GSM8K on local run, yet I got 0.2631 on the leaderboard.
(num_fewshot=5, batch size=2, dtype=float16)

I think there is some differences between leaderboard setting( or script) and mine.
And it seems other models are also experiencing this.
(The official meta-math/MetaMath-70B-V1.0 model got 0.4466.)
Can you provide more detailed information about how you run the gsm8k task? @clefourrier

clefourrier

Open LLM Leaderboard org Dec 1, 2023

Hi @feidfoe and @Korabbit ,

We have reproduced some of your claims, and are currently investigating this very seriously, I'll come back to you both asap.

clefourrier

Open LLM Leaderboard org Dec 4, 2023

•

edited Dec 4, 2023

Hi @feidfoe and @Korabbit ,

Thanks a lot for your inputs! We inspected our results, and found out we had launched ~150 models with a test version and not our usual pipeline. We re-ran everything this weekend and models should be fixed!
(Communication about it here)

clefourrier changed discussion status to closed Dec 4, 2023

clefourrier

Open LLM Leaderboard org Dec 4, 2023

(For meta-math specifically, it seems that it requires a very specific system prompt, and we don't support system prompts atm)

aigeek0x0

Feb 19

Is this still an ongoing issue? Getting a score of 0 for this model: https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/Radiantloom/radintloom-mistral-7b-fusion/results_2024-02-19T11-01-46.934466.json

clefourrier

Open LLM Leaderboard org Feb 20

Hi @aigeek0x0 ,
This was fixed in December and should not be occurring for models evaluated after Dec 4.

To help us investigate your problem, can you:

check the details of your model on GSM8K? (you can access the dataset clicking the icon next to the model name, and use datasets to load the subset)
try to reproduce our results/outputs using the commands in the About tab of the leaderboard?

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment