Score gap in arc challenge

#115
by wonhosong - opened

Hello, first of all, thank you very much for running the Open LLM Leaderboard.

I saw the score for our model (upstage/llama-30b-instruct-2048) on the leaderboard and noticed a gap, so I'm reaching out.
The arc_challenge score on the leaderboard is 58.3, but in my local reproduction it is 65.19.
Here are the scripts I used for the local evaluation.

# download model weights
git clone https://huggingface.co/upstage/llama-30b-instruct-2048

# load evaluation code
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout b281b0921b636bc36ad05c0b0b0763bd6dd43463

# run evaluation scripts
python main.py --model=hf-causal --model_args="pretrained=../llama-30b-instruct-2048" --tasks=arc_challenge --num_fewshot=25 --batch_size=2 --no_cache

I also saw other teams reporting score gaps, and I understand that there was an error in the evaluation code.
Could you please rerun our models as well?

The models are upstage/llama-30b-instruct-2048 and upstage/llama-30b-instruct.

I am also experiencing this with ariellee/SuperPlatty-30B and lilloukas/GPlatty-30B: the leaderboard lists 59.2 and 60.1, respectively, instead of 66.1 and 66. Could you rerun those two as well? Thank you!

Open LLM Leaderboard org

Hi! @SaylorTwift is re-running all llama based models atm, since llama models have a different management of white space tokens, which means they were handicapped by the previous version of the Harness. We'll update the leaderboard as soon as possible :)

@clefourrier @SaylorTwift Big thanks for your effort!

Open LLM Leaderboard org

Hey @wonhosong! Thanks for your feedback :) When you say 65.19, is that the acc or the acc_norm score? I just reran your model and I get acc=0.620 and acc_norm=0.649 on ARC challenge.

@SaylorTwift It's acc_norm! I've confirmed that our model was evaluated correctly; there is no difference between the local and public scores in the updated leaderboard. Thank you :)
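
For readers comparing the two numbers above: as far as I understand the harness, acc picks the answer choice with the highest raw log-likelihood, while acc_norm first divides each log-likelihood by the length of the choice text, so long answers are not penalized just for being long. The sketch below only illustrates that idea. The function name `score_item`, the example choices, and the log-likelihood values are all made up for illustration, and normalizing by character count is my assumption about how the harness does it.

```python
import numpy as np


def score_item(loglikelihoods, choices, gold):
    """Return (acc, acc_norm) for a single multiple-choice question.

    loglikelihoods: one summed log-likelihood per answer choice
    choices:        the answer choice strings
    gold:           index of the correct choice
    """
    ll = np.asarray(loglikelihoods, dtype=float)

    # acc: the prediction is the choice with the highest raw log-likelihood.
    acc = float(np.argmax(ll) == gold)

    # acc_norm: normalize by choice length first (assumed here: characters).
    lengths = np.array([float(len(c)) for c in choices])
    acc_norm = float(np.argmax(ll / lengths) == gold)

    return acc, acc_norm


# Hypothetical item: the long correct answer has a lower raw log-likelihood
# than the short wrong one, but a higher per-character log-likelihood.
choices = ["iron", "a mixture of nitrogen and oxygen"]
loglikelihoods = [-6.0, -9.0]
gold = 1

print(score_item(loglikelihoods, choices, gold))  # (0.0, 1.0)
```

Averaged over a benchmark, the two metrics can differ by a few points, which is why it matters which one a local run is compared against.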

SaylorTwift changed discussion status to closed
