lmarena-ai/chatbot-arena-leaderboard · Human level representation?

Dec 23, 2023

I know it is hard to do online, but maybe we could have offline, human-written responses to user queries. That way, we could see how models fare against human-level intelligence.

endolith

Dec 27, 2023

Can they do research? Otherwise it's going to be a lot of "I don't know" on encyclopedic questions.

endolith

Dec 27, 2023

edited Jan 11, 2024

The "Both are bad" button could be counted as a win for human intelligence against both models. ~~Otherwise, the "Both are bad" and "Tie" buttons have no effect on the Elo ranking, since it's based purely on pairwise defeats~~.

        if winner == "model_a":
            sa = 1
        elif winner == "model_b":
            sa = 0
        elif winner == "tie" or winner == "tie (bothbad)":
            sa = 0.5
        else:
            raise Exception(f"unexpected vote {winner}")

Wait, no, that's wrong. It makes their Elo scores more similar:

> In the case of a tie game, the lower-ranked team gains the Elo points (albeit less than if they would have won!) while the higher-ranked team loses that exact amount.
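To make the effect of a tie concrete, here is a minimal sketch of a single Elo update. The function names and the constants (`k=4`, `scale=400`, `base=10`) are assumptions chosen for illustration, not necessarily the values the leaderboard uses.

```python
def expected_score(r_a, r_b, scale=400, base=10):
    """Expected score of player A against player B under the Elo model."""
    return 1 / (1 + base ** ((r_b - r_a) / scale))


def elo_update(r_a, r_b, sa, k=4):
    """Return updated (r_a, r_b) after one game.

    sa is A's actual score: 1 for a win, 0 for a loss, 0.5 for a tie.
    The constants are illustrative assumptions.
    """
    ea = expected_score(r_a, r_b)
    delta = k * (sa - ea)
    # Whatever A gains, B loses, so the total rating is conserved.
    return r_a + delta, r_b - delta


# A tie between a higher-rated and a lower-rated model:
high, low = elo_update(1100.0, 1000.0, sa=0.5)
print(high, low)  # the higher rating drops slightly, the lower one rises
```

Because the higher-rated model was expected to score more than 0.5, a tie counts as underperformance for it, so the two ratings move toward each other.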

But the "Both are bad" and "Tie" buttons currently have the same effect. They could be changed so that "Both are bad" counts as a loss for each model against a hypothetical perfect AI, with the Elo score of that "Perfect AI" also listed on the leaderboard for comparison.
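A rough sketch of how that could work, assuming battles arrive as `(model_a, model_b, winner)` tuples. The `perfect_ai` player, the function name, and the constants are all hypothetical and not part of the existing code:

```python
from collections import defaultdict

PERFECT = "perfect_ai"  # hypothetical virtual player, not a real model


def compute_elo_with_perfect(battles, k=4, scale=400, base=10, init=1000):
    """Online Elo where "tie (bothbad)" is a loss for both models
    against a virtual perfect player. All constants are assumptions."""
    rating = defaultdict(lambda: init)

    def update(a, b, sa):
        ra, rb = rating[a], rating[b]
        ea = 1 / (1 + base ** ((rb - ra) / scale))
        delta = k * (sa - ea)
        rating[a] = ra + delta
        rating[b] = rb - delta

    for model_a, model_b, winner in battles:
        if winner == "model_a":
            update(model_a, model_b, 1)
        elif winner == "model_b":
            update(model_a, model_b, 0)
        elif winner == "tie":
            update(model_a, model_b, 0.5)
        elif winner == "tie (bothbad)":
            # Each model loses one game to the virtual perfect player.
            update(model_a, PERFECT, 0)
            update(model_b, PERFECT, 0)
        else:
            raise ValueError(f"unexpected vote {winner}")
    return dict(rating)


ratings = compute_elo_with_perfect([
    ("m1", "m2", "model_a"),
    ("m1", "m2", "tie"),
    ("m1", "m2", "tie (bothbad)"),
])
print(ratings)
```

In this sketch a "Both are bad" vote no longer pulls the two models' ratings toward each other. Whether it should also count as a tie between them is a separate design choice.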

binga

Jan 4, 2024

+1 on the above suggestion that "tie (bothbad)" should have a different effect than a simple tie. Is there a reason this isn't the current design?

endolith

Jan 25, 2024

Well, without the "hypothetical perfect AI" concept to compare against, there isn't anything else you can do with ties. I'm not sure why they have both a "Both are bad" and a "Tie" button, though, since they do the same thing.