Alpaca is an LLM evaluation framework. It maintains a set of prompts, along with responses to those prompts from a collection of LLMs. It presents pairs of responses to a judge, who determines which response better addresses the prompt's intention. Rather than compare all response pairs, the framework designates one model as the baseline and compares every other model's responses against it. Models are then ranked primarily by their win percentage over the baseline.
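As a rough illustration (not the framework's own code), ranking by win percentage over a baseline might look like the sketch below, where `judgements` is a hypothetical list of per-prompt judge verdicts against the baseline:

```python
from collections import defaultdict

# Hypothetical judge results: (model, prompt_id, did the model beat the baseline?)
judgements = [
    ("model-a", 1, True), ("model-a", 2, False), ("model-a", 3, True),
    ("model-b", 1, False), ("model-b", 2, False), ("model-b", 3, True),
]

wins = defaultdict(int)
totals = defaultdict(int)
for model, _, won in judgements:
    totals[model] += 1
    wins[model] += int(won)

# Rank models by their win percentage against the baseline.
ranking = sorted(totals, key=lambda m: wins[m] / totals[m], reverse=True)
for model in ranking:
    print(f"{model}: {100 * wins[model] / totals[model]:.1f}% wins vs baseline")
```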
This Space presents an alternative method of ranking based on the Bradley–Terry model (BT). Given a collection of items, Bradley–Terry estimates the ability of each item based on pairwise comparisons between them. Once calculated, ability can be used to estimate the probability that one item will be better than another, even if those items have never been directly compared. In sports, for example, ability might correspond to a team's strength within their league. Ability could then be used to predict outcomes between teams that have yet to play.
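The core of BT is the relation P(i preferred over j) = a_i / (a_i + a_j), where a_i is item i's ability. The sketch below is only illustrative: it fits abilities to a toy win matrix with the classic Zermelo fixed-point iteration (not this Space's implementation) and then predicts a pair that need not have been compared directly:

```python
import numpy as np

n_items = 3
# wins[i, j] = number of times item i was preferred over item j (toy data).
wins = np.array([
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
], dtype=float)

ability = np.ones(n_items)
total_wins = wins.sum(axis=1)
pair_games = wins + wins.T
for _ in range(200):  # Zermelo fixed-point updates for the BT likelihood.
    denom = (pair_games / (ability[:, None] + ability[None, :])).sum(axis=1)
    ability = total_wins / denom
    ability = ability / ability.sum()  # only ratios of abilities matter

# P(item i preferred over item j) = a_i / (a_i + a_j).
i, j = 0, 2
print(f"P(item {i} beats item {j}) = {ability[i] / (ability[i] + ability[j]):.3f}")
```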
The Alpaca project presents a good opportunity to apply BT in practice, especially since BT fits nicely into a Bayesian analysis framework. As LLMs become more pervasive, so too does the need to account for evaluation uncertainty when comparing them; something that Bayesian frameworks do well.
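To make the uncertainty point concrete, here is a minimal Bayesian BT sketch. It uses PyMC purely as an illustration (not necessarily what this Space uses), with a hypothetical `comparisons` array of (winner, loser) index pairs; the posterior spread on each ability is a direct measure of evaluation uncertainty:

```python
import numpy as np
import pymc as pm

# Hypothetical data: each row is (winner_index, loser_index) among 4 models.
comparisons = np.array([[0, 1], [0, 2], [1, 2], [0, 3], [3, 2], [1, 3]])
n_models = 4

with pm.Model() as bt_model:
    # Latent log-ability for each model.
    ability = pm.Normal("ability", mu=0.0, sigma=1.0, shape=n_models)
    # Bradley–Terry on the log scale: P(i beats j) = sigmoid(ability_i - ability_j).
    logit_p = ability[comparisons[:, 0]] - ability[comparisons[:, 1]]
    pm.Bernoulli("wins", logit_p=logit_p, observed=np.ones(len(comparisons)))
    idata = pm.sample(1000, tune=1000, chains=2, progressbar=False)

posterior = idata.posterior["ability"]
print(posterior.mean(dim=("chain", "draw")).values)  # point estimates of ability
print(posterior.std(dim=("chain", "draw")).values)   # uncertainty in those estimates
```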
This Space is divided into two primary sections. The first presents a ranking of models based on estimated ability: the figure on the right visualizes this ranking for the top 10 models, while the table below it presents the full set. The second section estimates the probability that one model will be preferred over another.