Models with multiple submissions.

#322
by xzuyn - opened

Some models are being submitted multiple times.

These are just the 7B models I saw with two or more submissions (a sketch for spotting duplicates programmatically follows the list):

ehartford/dolphin-2.1-mistral-7b

Open-Orca/Mistral-7B-SlimOrca

TheBloke/Llama-2-7B-GPTQ

TheTravellingEngineer/llama2-7b-chat-hf-v2

TheTravellingEngineer/llama2-7b-chat-hf-v3

TheTravellingEngineer/llama2-7b-chat-hf-v4

codellama/CodeLlama-7b-Instruct-hf

codellama/CodeLlama-7b-Python-hf

garage-bAInd/Platypus2-7B

kfkas/Llama-2-ko-7b-Chat

kittn/mistral-7B-v0.1-hf

lmsys/vicuna-7b-v1.5

lmsys/vicuna-7b-v1.5-16k

meta-llama/Llama-2-7b-hf

mosaicml/mpt-7b

mosaicml/mpt-7b-8k-chat

mosaicml/mpt-7b-8k-instruct

mosaicml/mpt-7b-storywriter

PocketDoc/Dans-TotSirocco-7b

tiiuae/falcon-7b-instruct

togethercomputer/LLaMA-2-7B-32K

togethercomputer/Llama-2-7B-32K-Instruct

wenge-research/yayi-7b-llama2
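A minimal sketch for reproducing a list like the one above, assuming the leaderboard table has been exported to a hypothetical `leaderboard.csv` with `model` and `precision` columns (the file name and column names are assumptions, not the leaderboard's actual export format):

```python
# Sketch: spot models that appear two or more times in an exported
# leaderboard table. "leaderboard.csv" and its "model"/"precision"
# columns are assumptions for illustration.
import csv
from collections import defaultdict

by_model = defaultdict(list)
with open("leaderboard.csv", newline="") as f:
    for row in csv.DictReader(f):
        by_model[row["model"]].append(row["precision"])

for model, precisions in sorted(by_model.items()):
    if len(precisions) >= 2:  # two or more submissions of the same model
        print(f"{model}: {len(precisions)} entries ({', '.join(precisions)})")
```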

Open LLM Leaderboard org

Hi!
Are these models the same at the precision and commit level?

They are different precisions and categories, but the evals seem to be within the margin of error, so it's basically the same result listed multiple times.

Example: [screenshot, 2023-10-12: the same model listed twice with near-identical scores]
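To make "within the margin of error" concrete, a sketch of the comparison being described, using the four benchmark columns the leaderboard had at the time; the 0.5-point tolerance and the sample scores are made-up assumptions for illustration:

```python
# Sketch: decide whether two duplicate entries are effectively the same
# result. The tolerance and the scores below are illustrative assumptions.
BENCHMARKS = ["ARC", "HellaSwag", "MMLU", "TruthfulQA"]

def effectively_equal(a: dict, b: dict, tol: float = 0.5) -> bool:
    """True if every benchmark score differs by at most `tol` points."""
    return all(abs(a[k] - b[k]) <= tol for k in BENCHMARKS)

float16_run = {"ARC": 63.3, "HellaSwag": 84.9, "MMLU": 63.3, "TruthfulQA": 55.1}
bfloat16_run = {"ARC": 63.4, "HellaSwag": 84.9, "MMLU": 63.2, "TruthfulQA": 55.0}
print(effectively_equal(float16_run, bfloat16_run))  # True: same result twice
```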

I don't know how this should be dealt with, though; I just thought I'd bring it up.

Open LLM Leaderboard org

The way we are dealing with it is by having filters on precision. If, however, the same model has two different categories (like the model you just showed), this is a mistake in the request file.
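To illustrate the request-file mistake being described (just a sketch; the entry structure here is an assumption, not the leaderboard's actual schema), flagging a model whose request entries disagree on category might look like:

```python
# Sketch: flag models whose request entries disagree on category, i.e. the
# "mistake in the request file" case. Entry structure is hypothetical.
from collections import defaultdict

requests = [
    {"model": "example/model-7b", "precision": "float16", "category": "fine-tuned"},
    {"model": "example/model-7b", "precision": "bfloat16", "category": "pretrained"},
]

categories = defaultdict(set)
for req in requests:
    categories[req["model"]].add(req["category"])

for model, cats in categories.items():
    if len(cats) > 1:
        print(f"Inconsistent categories for {model}: {sorted(cats)}")
```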

There's probably not going to be any useful difference between float16 and bfloat16 evals though.

Also, filtering by precision doesn't exactly solve this, since you don't get to compare all models: some may be submitted only as float16, bfloat16, or 8bit, so either you filter to one of those and lose some models, or you don't filter and see duplicates.
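One client-side workaround for this dilemma (a sketch only, not something the leaderboard does): collapse duplicates by keeping a single row per model, chosen by a precision preference order, so no model disappears and none shows up twice.

```python
# Sketch: keep one row per model, preferring float16, then bfloat16, then
# 8bit. The preference order and row structure are assumptions.
PREFERENCE = {"float16": 0, "bfloat16": 1, "8bit": 2}

def collapse_duplicates(rows: list[dict]) -> list[dict]:
    best: dict[str, dict] = {}
    for row in rows:
        rank = PREFERENCE.get(row["precision"], len(PREFERENCE))
        kept = best.get(row["model"])
        if kept is None or rank < PREFERENCE.get(kept["precision"], len(PREFERENCE)):
            best[row["model"]] = row
    return list(best.values())

rows = [
    {"model": "example/model-7b", "precision": "bfloat16", "average": 55.1},
    {"model": "example/model-7b", "precision": "float16", "average": 55.0},
    {"model": "example/other-7b", "precision": "8bit", "average": 48.2},
]
print(collapse_duplicates(rows))  # one row per model; float16 kept for the duplicate
```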

Another example of no noticeable difference: [screenshot, 2023-10-12: another duplicated entry with matching scores]

Open LLM Leaderboard org

@xzuyn We don't plan on changing this mechanism. We understand that it brings a bit of redundancy between the bfloat16 and float16 models, but since you can hide the quantized models from a given search, it should still allow people to compare models quickly. Thank you for your interest in the leaderboard!
Closing as it is not an issue but a feature.

clefourrier changed discussion status to closed
