Difference in Elo between HF Leaderboard and Colab Notebook

#28
by eduagarcia - opened

There is a slight difference in Elo Rating between the HF Leaderboard and the one calculated by the Colab Notebook

HF Leaderboard: (elo_results_20240329.pkl file)

[screenshot: HF Leaderboard Elo ratings]

Colab Notebook:

[screenshot: Colab Notebook Elo ratings]

When I run the elo_analysis.py script from the lm-sys/FastChat GitHub repository with the default arguments, I also get exactly the same Elo values as the notebook version.
My question is: do you use different parameters from the elo_analysis.py defaults to generate the elo_results_$DATE.pkl files? If so, which ones?

Large Model Systems Organization org

Hey @eduagarcia, thanks for reporting this issue. I investigated this and verified that the data & parameters are exactly the same as in elo_analysis.py.
The difference comes from numerical error when solving the MLE problem with logistic regression.

On our machine (this is the one published on 3/29), with lr = LogisticRegression(fit_intercept=False, penalty=None):

Number of battles: 511252          
claude-3-opus-20240229     1254.64 
gpt-4-1106-preview         1251.88  
gpt-4-0125-preview         1249.17 
bard-jan-24-gemini-pro     1204.35 
claude-3-sonnet-20240229   1200.29 

When I set a tighter tolerance of 1e-8, with lr = LogisticRegression(fit_intercept=False, penalty=None, tol=1e-8):

Number of battles: 511252          
claude-3-opus-20240229     1254.40 
gpt-4-1106-preview         1251.44  
gpt-4-0125-preview         1248.89 
bard-jan-24-gemini-pro     1204.28 
claude-3-sonnet-20240229   1200.06

it matches the one from the notebook:
[screenshot: Colab Notebook Elo ratings]

I'll update our code to set a tighter tolerance in our next release.
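For anyone else comparing the two, here is a minimal sketch of how a Bradley-Terry MLE Elo fit via scikit-learn's LogisticRegression might look, and where the tol argument enters. This is only an illustration under assumed inputs (a battles DataFrame with model_a, model_b, and winner columns, ties ignored), not the actual elo_analysis.py code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def fit_mle_elo(battles: pd.DataFrame, scale=400, base=10, init_rating=1000, tol=1e-8):
    """Sketch of a Bradley-Terry MLE Elo fit (ties not handled)."""
    models = pd.unique(battles[["model_a", "model_b"]].values.ravel())
    idx = {m: i for i, m in enumerate(models)}
    n, p = len(battles), len(models)

    # Design matrix: +log(base) for model_a and -log(base) for model_b per battle.
    X = np.zeros((n, p))
    X[np.arange(n), battles["model_a"].map(idx).to_numpy()] = np.log(base)
    X[np.arange(n), battles["model_b"].map(idx).to_numpy()] = -np.log(base)

    # Target: 1 if model_a won, 0 if model_b won.
    y = (battles["winner"] == "model_a").astype(float).to_numpy()

    # tol is the solver's convergence tolerance; the sklearn default is 1e-4.
    lr = LogisticRegression(fit_intercept=False, penalty=None, tol=tol)
    lr.fit(X, y)

    # Rescale logistic-regression coefficients to the Elo scale.
    ratings = scale * lr.coef_[0] + init_rating
    return pd.Series(ratings, index=models).sort_values(ascending=False)
```

With the default tol=1e-4 the solver can stop slightly earlier, which would be consistent with the ~0.2-0.4 point differences in the tables above.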

Large Model Systems Organization org

Does it make sense to you, @eduagarcia?

Got it, it does.

Thank you for your time.

eduagarcia changed discussion status to closed
