[AlpacaEval](https://github.com/tatsu-lab/alpaca_eval) is an LLM
evaluation framework. It maintains a set of prompts, along with
responses to those prompts from a collection of LLMs. It presents
pairs of responses to a judge, who determines which response better
addresses the prompt's intention. Rather than compare all response
pairs, the framework designates one model as a baseline, then
compares every other model's responses to the baseline's. Its primary
method of ranking models is by win percentage over the baseline.
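
As a concrete illustration, here is a minimal sketch of that
baseline win-percentage ranking. The data frame and its column names
(`model`, `preferred`) are hypothetical stand-ins, not AlpacaEval's
actual data format:

```python
import pandas as pd

# Hypothetical judged comparisons against the baseline model.
# Each row: a competing model and whether the judge preferred its
# response over the baseline's for one prompt (1 = beat the baseline).
comparisons = pd.DataFrame({
    "model": ["m1", "m1", "m1", "m2", "m2", "m2"],
    "preferred": [1, 0, 1, 0, 0, 1],
})

# AlpacaEval-style ranking: win percentage over the baseline.
win_rates = comparisons.groupby("model")["preferred"].mean().mul(100)
print(win_rates.sort_values(ascending=False))
```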
This Space presents an alternative method of ranking based on the
[Bradley–Terry
model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model)
(BT). Given a collection of items, Bradley–Terry estimates the
_ability_ of each item based on pairwise comparisons between
them. Once calculated, ability can be used to estimate the probability
that one item will be better than another, even if those items have
never been directly compared. In sports, for example, ability might
correspond to a team's strength within its league. Ability could
then be used to predict outcomes between teams that have yet to play.
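
Formally, BT assigns each item $i$ a latent ability $\beta_i$ and
models the probability that $i$ beats $j$ as
$P(i \succ j) = \sigma(\beta_i - \beta_j)$, where $\sigma$ is the
logistic function. The sketch below fits abilities by maximum
likelihood; the `winners` and `losers` arrays are made-up outcomes,
not AlpacaEval data:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical comparison outcomes: in comparison k, item winners[k]
# was judged better than item losers[k].
winners = np.array([0, 0, 1, 2, 0])
losers = np.array([1, 2, 2, 1, 1])
n_items = 3

def neg_log_likelihood(free_abilities):
    # Pin item 0's ability at zero: BT is only identified up to a shift.
    beta = np.concatenate([[0.0], free_abilities])
    diff = beta[winners] - beta[losers]
    # -log sigmoid(diff), written stably as log(1 + exp(-diff)).
    return np.sum(np.logaddexp(0.0, -diff))

result = minimize(neg_log_likelihood, x0=np.zeros(n_items - 1))
abilities = np.concatenate([[0.0], result.x])
print(abilities)  # higher ability = stronger item
```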
The AlpacaEval project presents a good opportunity to apply BT in
practice, especially since BT fits nicely into a Bayesian analysis
framework. As LLMs become more pervasive, so too does the need to
account for evaluation uncertainty when comparing them, something
that Bayesian frameworks handle well.
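
To give a flavor of the Bayesian treatment, here is a minimal PyMC
sketch of a Bayesian BT model. It illustrates the general approach,
not necessarily the exact model this Space fits, and the comparison
arrays are again made up:

```python
import numpy as np
import pymc as pm

# Hypothetical outcomes: comparison k was won by model winner_idx[k]
# over model loser_idx[k].
winner_idx = np.array([0, 0, 1, 2, 0])
loser_idx = np.array([1, 2, 2, 1, 1])
n_models = 3

with pm.Model() as bt_model:
    # Latent ability per model; the prior softly anchors the scale,
    # which Bradley-Terry leaves unidentified on its own.
    ability = pm.Normal("ability", mu=0.0, sigma=1.0, shape=n_models)
    # Each recorded comparison is a "success" for the winner.
    pm.Bernoulli(
        "wins",
        logit_p=ability[winner_idx] - ability[loser_idx],
        observed=np.ones_like(winner_idx),
    )
    trace = pm.sample(1000, tune=1000)
```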
This Space is divided into two primary sections: the first presents a
ranking of models based on estimated ability. The figure on the right
visualizes this ranking for the top 10 models, while the table below
presents the full set. The second section estimates the probability
that one model will be preferred to another.
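
Under a posterior like the one in the sketch above, that preference
probability can be estimated by averaging the logistic win probability
over posterior draws of the abilities. For example, assuming the
`trace` object from the previous sketch:

```python
import numpy as np

# Posterior draws of abilities, shape (n_models, n_samples).
draws = trace.posterior["ability"].stack(sample=("chain", "draw")).values

def preference_probability(i, j):
    # Posterior mean of sigmoid(ability_i - ability_j).
    diff = draws[i] - draws[j]
    return float(np.mean(1.0 / (1.0 + np.exp(-diff))))

print(preference_probability(0, 1))
```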