[AlpacaEval](https://github.com/tatsu-lab/alpaca_eval) is an LLM evaluation framework. It maintains a set of prompts, along with responses to those prompts from a collection of LLMs. It then presents pairs of responses to a judge, which determines which response better addresses the prompt. Rather than comparing all pairs of responses, the framework designates a baseline model and compares every other model to it. The standard method of ranking models is to sort by win percentage against that baseline.

This Space presents an alternative ranking based on the [Bradley–Terry model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) (BT). Given a collection of items, Bradley–Terry estimates the _ability_ of each item from pairwise comparisons between them. In sports, for example, that might be the ability of a team based on the games it has played within its league. Once estimated, ability can be used to calculate the probability that one item will be preferred to another, even if those two items have never been formally compared. The AlpacaEval project is a good opportunity to apply BT in practice, especially since BT fits nicely into a Bayesian analysis framework. As LLMs become more pervasive, quantifying the uncertainty in their evaluation is increasingly important, and Bayesian methods are well suited to that.

This Space is divided into two primary sections. The first presents a ranking of models based on estimated ability: the figure on the right shows the ranking for the top 10 models, while the table below it lists the full set. The second section estimates the probability that one model will be preferred to another. A final section at the bottom is a disclaimer that presents details about the workflow.
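To make the ability estimate concrete, here is a minimal sketch of a maximum-likelihood Bradley–Terry fit via gradient ascent. Under BT, the probability that model *i* is preferred to model *j* is `sigmoid(beta_i - beta_j)`, where `beta` holds the latent abilities. This is illustrative only, not the Space's actual workflow, and the `wins` counts below are hypothetical:

```python
import numpy as np

def fit_bradley_terry(wins, n_iters=2000, lr=0.01):
    """Maximum-likelihood Bradley-Terry abilities from a win-count matrix.

    wins[i, j] = number of comparisons in which model i was preferred to j.
    Only differences in ability are identified, so the abilities are
    centered to sum to zero.
    """
    n = wins.shape[0]
    beta = np.zeros(n)
    games = wins + wins.T  # total comparisons between each pair
    for _ in range(n_iters):
        # P(i preferred to j) = sigmoid(beta_i - beta_j)
        p = 1.0 / (1.0 + np.exp(-(beta[:, None] - beta[None, :])))
        # Gradient of the log-likelihood with respect to each ability
        grad = (wins - games * p).sum(axis=1)
        beta += lr * grad
        beta -= beta.mean()
    return beta

# Hypothetical counts: model 0 was preferred to model 1 in 70 of 100 judgments, etc.
wins = np.array([[ 0., 70., 90.],
                 [30.,  0., 60.],
                 [10., 40.,  0.]])
beta = fit_bradley_terry(wins)

# Estimated probability that model 0 is preferred to model 2,
# even though only 100 of their comparisons informed the fit:
p_0_beats_2 = 1.0 / (1.0 + np.exp(-(beta[0] - beta[2])))
```

The last line shows the property the section relies on: once abilities are fit, any pair of models can be compared through `sigmoid(beta_a - beta_b)`, whether or not they were judged head-to-head.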
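For the Bayesian side, one way such an analysis could look is to place a prior over the abilities and sample the posterior, so the ranking comes with uncertainty attached. The sketch below uses PyMC and made-up per-comparison data; the library choice, the `sigma=1.0` prior, and all data are assumptions for illustration, not the Space's actual implementation:

```python
import numpy as np
import pymc as pm

# Hypothetical per-comparison data: in row k, model i_idx[k] was compared
# to model j_idx[k], and y[k] = 1 if the judge preferred i over j.
n_models = 3
i_idx = np.array([0, 0, 0, 1, 1, 2])
j_idx = np.array([1, 1, 2, 2, 0, 1])
y     = np.array([1, 1, 1, 1, 0, 0])

with pm.Model():
    # Prior over latent abilities (sigma=1.0 is an assumed choice)
    beta = pm.Normal("beta", mu=0.0, sigma=1.0, shape=n_models)
    # Bradley-Terry likelihood: P(i preferred to j) = sigmoid(beta_i - beta_j)
    p = pm.math.sigmoid(beta[i_idx] - beta[j_idx])
    pm.Bernoulli("obs", p=p, observed=y)
    trace = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# The posterior over abilities quantifies how certain the ranking is.
print(trace.posterior["beta"].mean(dim=("chain", "draw")).values)
```

Rather than a single point estimate per model, the posterior draws give a distribution over each ability, which is what makes statements like "model A beats model B with probability p, plus or minus some credible interval" possible.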