[Alpaca](https://github.com/tatsu-lab/alpaca_eval) is an LLM
evaluation framework. It maintains a set of prompts, along with
responses to those prompts from a collection of LLMs. It presents
pairs of responses to a judge who determines which response better
addresses the prompt's intention. Rather than compare all response
pairs, the framework designates one model as the baseline and compares
every other model's responses to it. Its primary method of ranking
models is by win percentage over the baseline.
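As a rough sketch of that ranking scheme (the model names and judge verdicts below are made up for illustration, not real Alpaca results):

```python
# Hypothetical judge verdicts: each entry records whether a model's
# response beat the baseline's response on one prompt.
verdicts = {
    "model-x": [True, True, False, True],
    "model-y": [False, True, False, False],
}

# Alpaca-style ranking: win percentage against the fixed baseline.
win_rate = {
    model: 100 * sum(outcomes) / len(outcomes)
    for model, outcomes in verdicts.items()
}
ranking = sorted(win_rate, key=win_rate.get, reverse=True)
```

Note that every model is compared only to the baseline, never to the other candidates, which is the gap the Bradley–Terry approach below addresses.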
This Space presents an alternative method of ranking based on the
[Bradley–Terry
model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model)
(BT). Given a collection of items, Bradley–Terry estimates the
_ability_ of each item based on pairwise comparisons between
them. Once calculated, ability can be used to estimate the probability
that one item will be better than another, even if those items have
never been directly compared. In sports, for example, ability might
correspond to a team's strength within their league. Ability could
then be used to predict outcomes between teams that have yet to play.
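A minimal sketch of the idea: the standard maximum-likelihood fit of Bradley–Terry abilities via the classic minorize–maximize iteration, on invented pairwise win counts (this is a toy illustration, not the Bayesian fit this Space actually uses):

```python
# Pairwise win counts between three hypothetical models:
# wins[(i, j)] = number of times i's response beat j's.
wins = {
    ("A", "B"): 7, ("B", "A"): 3,
    ("A", "C"): 8, ("C", "A"): 2,
    ("B", "C"): 6, ("C", "B"): 4,
}
models = sorted({m for pair in wins for m in pair})

# MM iteration for the Bradley-Terry maximum-likelihood abilities:
# a_i <- W_i / sum_j n_ij / (a_i + a_j), where W_i is i's total wins
# and n_ij is the number of comparisons between i and j.
ability = {m: 1.0 for m in models}
for _ in range(100):
    new = {}
    for i in models:
        w_i = sum(w for (a, b), w in wins.items() if a == i)
        denom = sum(
            (wins.get((i, j), 0) + wins.get((j, i), 0))
            / (ability[i] + ability[j])
            for j in models if j != i
        )
        new[i] = w_i / denom
    # Normalize: only ability ratios matter, so the scale is arbitrary.
    total = sum(new.values())
    ability = {m: v / total for m, v in new.items()}

# Under the fitted model, P(i preferred to j) = a_i / (a_i + a_j),
# even for pairs that were never directly compared.
p_a_beats_c = ability["A"] / (ability["A"] + ability["C"])
```

The closing probability formula is what makes Bradley–Terry useful here: abilities estimated from any set of pairwise judgments yield a preference probability for every pair of models.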
The Alpaca project presents a good opportunity to apply BT in
practice, especially since BT fits nicely into a Bayesian analysis
framework. As LLMs become more pervasive, so too does the importance
of accounting for evaluation uncertainty when comparing them,
something Bayesian frameworks do well.
This Space is divided into two primary sections. The first presents a
ranking of models based on estimated ability: the figure on the right
visualizes this ranking for the top 10 models, while the table below
presents the full set. The second section estimates the probability
that one model will be preferred to another.