Rating issue with the "gsm8k" eval.
I ran my own evaluation to check whether the leaderboard's evaluation matches mine.
The following tasks were evaluated, with the scores shown in the image below.
As you can see, only the GSM8K score differs significantly.
Is there a reason for this result?
My "gsm8k" settings are as follows
python main.py --model=hf-causal \
--model_args="pretrained=Korabbit/Llama-2-7b-chat-hf-afr-100step-v2" \
--tasks gsm8k \
--num_fewshot 5 \
--batch_size 4 \
--output_path result/gsm8k \
--device cuda:2
Hi, thank you for opening this issue!
Are you using the same version of the harness as we are? (see the command in the About tab of the leaderboard)
We are also using a batch size of 1, which can change results slightly.
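(If you installed the harness from a git checkout, you can check which commit you are running and compare it with the one listed in the About tab, for example:
git -C lm-evaluation-harness rev-parse HEAD
The directory name here is only an assumption about where you cloned the repository.)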
I have the same issue as @Korabbit.
I tested two versions of the harness: the current master branch and "b281b0921b".
I believe "b281b0921b" is the same version as the one used by the Open LLM Leaderboard.
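(For reference, I pinned the harness to that commit by installing directly from that revision, roughly like this, assuming the public EleutherAI repository is the one the leaderboard uses:
pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@b281b0921b
)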
My model scored 0.5746 on GSM8K in my local run, yet it got 0.2631 on the leaderboard.
(num_fewshot=5, batch size=2, dtype=float16)
I think there are some differences between the leaderboard's settings (or script) and mine.
And it seems other models are also experiencing this.
(The official meta-math/MetaMath-70B-V1.0 model got 0.4466.)
Can you provide more detailed information about how you run the gsm8k task?
@clefourrier
(For meta-math specifically, it seems that it requires a very specific system prompt, and we don't support system prompts atm)
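(To make that concrete, here is a purely hypothetical sketch, not the harness's actual code: with system-prompt support, the GSM8K prompt would be assembled roughly like this, where SYSTEM_PROMPT stands in for the model-specific text from the model card:

# Hypothetical sketch only: prepend a model-specific system prompt to the few-shot prompt.
# The harness currently sends just the few-shot examples followed by the question.
SYSTEM_PROMPT = "<model-specific system prompt from the model card>"  # placeholder

def build_gsm8k_prompt(fewshot_examples: str, question: str) -> str:
    return f"{SYSTEM_PROMPT}\n\n{fewshot_examples}Question: {question}\nAnswer:"

Models trained to expect that prefix can answer in a different format without it, which hurts the exact-match scoring.)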
Is this still an ongoing issue? I'm getting a score of 0 for this model: https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/Radiantloom/radintloom-mistral-7b-fusion/results_2024-02-19T11-01-46.934466.json
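For reference, this is roughly how I pulled that results file to check the score (a sketch; it assumes huggingface_hub is installed, and the exact key names inside the JSON may differ):

import json
from huggingface_hub import hf_hub_download

# Download the results file linked above from the leaderboard's results dataset.
path = hf_hub_download(
    repo_id="open-llm-leaderboard/results",
    repo_type="dataset",
    filename="Radiantloom/radintloom-mistral-7b-fusion/results_2024-02-19T11-01-46.934466.json",
)
with open(path) as f:
    results = json.load(f)

# Print any GSM8K entries (the top-level "results" key and the task naming are assumptions;
# inspect the JSON if they differ).
for task, scores in results.get("results", {}).items():
    if "gsm8k" in task.lower():
        print(task, scores)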
Hi @aigeek0x0,
This was fixed in December and should not be occurring for models evaluated after Dec 4.
To help us investigate your problem, can you:
- check the details of your model on GSM8K? (You can access the details dataset by clicking the icon next to the model name, and use the datasets library to load the subset; see the sketch below.)
- try to reproduce our results/outputs using the commands in the About tab of the leaderboard?
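A minimal sketch of the first point (the repository and configuration names below are assumptions for illustration; the exact names are shown when you click the icon next to the model name on the leaderboard):

from datasets import load_dataset

# Load the per-sample GSM8K details for the model (names are assumptions, see above).
details = load_dataset(
    "open-llm-leaderboard/details_Radiantloom__radintloom-mistral-7b-fusion",
    "harness_gsm8k_5",
    split="latest",
)
# Each row should contain the full prompt, the model's generation, and the parsed answer.
print(details[0])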