[Alpaca](https://github.com/tatsu-lab/alpaca_eval) is an LLM
evaluation framework. It maintains a set of prompts, along with
responses to those prompts from a collection of LLMs. It presents
pairs of responses to a judge who determines which response better
addresses the prompt's intention. Rather than compare all response
pairs, the framework designates one model as the baseline and compares
every other model's responses to it. Its primary method of ranking
models is by win percentage over the baseline.
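As a rough sketch of that ranking scheme (the model names and judge verdicts below are made up for illustration, not real Alpaca results):

```python
# Hypothetical judge verdicts: each entry records whether a model's
# response beat the baseline's response on one prompt.
verdicts = {
    "model-x": [True, True, False, True],
    "model-y": [False, True, False, False],
}

# Alpaca-style ranking: win percentage against the fixed baseline.
win_rate = {
    model: 100 * sum(outcomes) / len(outcomes)
    for model, outcomes in verdicts.items()
}
ranking = sorted(win_rate, key=win_rate.get, reverse=True)
```

Note that every model is compared only to the baseline, never to the other candidates, which is the gap the Bradley–Terry approach below addresses.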
This Space presents an alternative method of ranking based on the
[Bradley–Terry
model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model)
(BT). Given a collection of items, Bradley–Terry estimates the
_ability_ of each item based on pairwise comparisons between
them. Once calculated, ability can be used to estimate the probability
that one item will be better than another, even if those items have
never been directly compared. In sports, for example, ability might
correspond to a team's strength within their league. Ability could
then be used to predict outcomes between teams that have yet to play.
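A minimal sketch of the idea: the standard maximum-likelihood fit of Bradley–Terry abilities via the classic minorize–maximize iteration, on invented pairwise win counts (this is a toy illustration, not the Bayesian fit this Space actually uses):

```python
# Pairwise win counts between three hypothetical models:
# wins[(i, j)] = number of times i's response beat j's.
wins = {
    ("A", "B"): 7, ("B", "A"): 3,
    ("A", "C"): 8, ("C", "A"): 2,
    ("B", "C"): 6, ("C", "B"): 4,
}
models = sorted({m for pair in wins for m in pair})

# MM iteration for the Bradley-Terry maximum-likelihood abilities:
# a_i <- W_i / sum_j n_ij / (a_i + a_j), where W_i is i's total wins
# and n_ij is the number of comparisons between i and j.
ability = {m: 1.0 for m in models}
for _ in range(100):
    new = {}
    for i in models:
        w_i = sum(w for (a, b), w in wins.items() if a == i)
        denom = sum(
            (wins.get((i, j), 0) + wins.get((j, i), 0))
            / (ability[i] + ability[j])
            for j in models if j != i
        )
        new[i] = w_i / denom
    # Normalize: only ability ratios matter, so the scale is arbitrary.
    total = sum(new.values())
    ability = {m: v / total for m, v in new.items()}

# Under the fitted model, P(i preferred to j) = a_i / (a_i + a_j),
# even for pairs that were never directly compared.
p_a_beats_c = ability["A"] / (ability["A"] + ability["C"])
```

The closing probability formula is what makes Bradley–Terry useful here: abilities estimated from any set of pairwise judgments yield a preference probability for every pair of models.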
The Alpaca project presents a good opportunity to apply BT in
practice, especially since BT fits nicely into a Bayesian analysis
framework. As LLMs become more pervasive, so too does the importance
of accounting for evaluation uncertainty when comparing them,
something Bayesian frameworks do well.
This Space is divided into two primary sections. The first presents a
ranking of models based on estimated ability: the figure on the right
visualizes this ranking for the top 10 models, while the table below
presents the full set. The second section estimates the probability
that one model will be preferred to another.