Fairness?
I have been wondering several times why OpenAI's models somehow always end up at the top.
If there is a non-OpenAI model in the lead, it won't be long before that changes.
So, I took screenshots to understand what's happening.
Yesterday: [screenshot of the leaderboard]
Today: [screenshot of the leaderboard]
Why does DeepSeek suddenly have lower scores?
Kind regards,
Ben
Thanks for the notice!
Several outputs of DeepSeek V3 changed when we evaluated the full set (after their official announcement), which led to lower scores on the hard subset. We don't know what specifically changed in their API backend. If the inference setup (e.g., hardware) has changed, the performance will likely change accordingly. The same thing happens if you use different batch-inference configurations with vLLM.
If you look through the evaluation codebase, you will see that we consistently use temperature 0 by default for all models, except those that cannot be set to 0 (e.g., the o1 models). However, temperature is only one factor in output consistency. Other factors come down to hardware and the serving stack, which we cannot control when we use the model APIs.
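To illustrate the vLLM point, here is a minimal sketch (the model name is just a placeholder and the prompts are arbitrary): even with greedy decoding, the same prompt can come back differently depending on the batch it is scheduled with, because batching changes kernel selection and floating-point accumulation order.

```python
from vllm import LLM, SamplingParams

# Placeholder model; any causal LM served by vLLM shows the same effect.
llm = LLM(model="deepseek-ai/deepseek-coder-6.7b-instruct")
greedy = SamplingParams(temperature=0, max_tokens=128)

prompt = "Write a Python function that reverses a linked list."
filler = ["Summarize the plot of Hamlet."] * 31

# Same prompt, same greedy decoding, different batch compositions.
solo = llm.generate([prompt], greedy)[0].outputs[0].text
batched = llm.generate([prompt] + filler, greedy)[0].outputs[0].text

# Not guaranteed to be True: batching can flip near-tie tokens,
# and the divergence compounds over the rest of the generation.
print(solo == batched)
```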
> I have been wondering several times why OpenAI's models somehow always end up at the top.
We have no solid evidence for why some older OpenAI models rank at the top. Because the o1 models run at a high temperature, it's unfair to compare their results directly with other models'. That said, given the nature of BigCodeBench tasks, the current leaderboard mainly tells you how well these models generalize to task solving via Python. Although most of the task goals are very practical, the design and structure of these tasks hardly exist in the wild, so they should be considered a bit out-of-distribution.
Hope it helps!
Cheers,
Terry
Hello Terry,
Thank you very much for clarifying this.
When I think about it, that actually makes sense.
Having used DeepSeek's API over the last couple of days for a repetitive task (thousands of very similar requests), I have noticed that the response sometimes looks more like V2.5 than V3.
Maybe some of their servers are still serving V2.5; I can only guess.
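One way to make that guess testable (a rough sketch; the endpoint and model name are as DeepSeek documents them, and the prompt is just an example): send the identical request many times at temperature 0 and count the distinct responses. A single consistent backend should produce one dominant answer; a mix of backends tends to show up as a few recurring variants.

```python
import os
from collections import Counter

from openai import OpenAI  # DeepSeek's API is OpenAI-compatible

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)

counts = Counter()
for _ in range(20):
    resp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": "Reverse the string 'hello' in Python."}],
        temperature=0,
    )
    counts[resp.choices[0].message.content] += 1

# One dominant answer suggests a consistent backend; several
# recurring variants at temperature 0 hint at backend variance.
print(len(counts), "distinct responses out of 20 calls")
```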
Thank you and kind regards,
Ben