This Space applies [item response
theory](https://en.wikipedia.org/wiki/Item_response_theory) to the
[Alpaca](https://github.com/tatsu-lab/alpaca_eval) LLM evaluation
framework.
Alpaca maintains a set of prompts, along with responses to those
prompts from a collection of LLMs. It evaluates models based on
average response preference. To establish preference, it first
designates one LLM as a "baseline." For every other model _m_ and
every prompt _p_, it presents a judge with the baseline's response to
_p_ and _m_'s response. The judge determines which response better
addresses the prompt, assigning a "win" to the preferred model. Alpaca
[ranks models](https://tatsu-lab.github.io/alpaca_eval/) based on
their win percentage.
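
As a rough illustration of that ranking rule, the sketch below
computes win percentages from a set of judge decisions. The data
layout and names are assumptions for illustration only; this is not
the Alpaca codebase.

```python
from collections import defaultdict

# Hypothetical judge decisions: (model, prompt, judge preferred the
# model's response over the baseline's).
judgments = [
    ("model-a", "p1", True),
    ("model-a", "p2", False),
    ("model-b", "p1", True),
    ("model-b", "p2", True),
]

wins = defaultdict(int)
totals = defaultdict(int)
for model, _prompt, preferred in judgments:
    wins[model] += preferred
    totals[model] += 1

# Rank models by their win percentage against the baseline.
leaderboard = sorted(
    ((wins[m] / totals[m], m) for m in totals),
    reverse=True,
)
for rate, model in leaderboard:
    print(f"{model}: {rate:.0%}")
```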
This Space presents an alternative view of model comparison. Using
IRT, models are ranked by their latent ability, and prompts by their
latent difficulty and discrimination. These values come from the
posterior of a fitted 2PL IRT model. From this perspective, a model's
response to a prompt is "correct" if the judge prefers it to the
baseline's response.
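
For reference, the standard 2PL item response function is sketched
below. The parameter names are illustrative; the actual
parameterization follows the Stan tutorial cited under Resources.

```python
import math

def p_win(theta_m, a_p, b_p):
    """Standard 2PL item response function: the probability that model m
    "answers" prompt p correctly (i.e., the judge prefers its response
    to the baseline's), given the model's latent ability theta_m and
    the prompt's discrimination a_p and difficulty b_p."""
    return 1.0 / (1.0 + math.exp(-a_p * (theta_m - b_p)))

# A more able model is more likely to beat the baseline on a given
# prompt; higher discrimination sharpens that relationship.
print(p_win(theta_m=1.0, a_p=1.5, b_p=0.2))  # ~0.77
```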
This Space is a work in progress. Comments and suggestions are
welcome; please use the Community tab to share them.

Resources:
* Parameters were estimated using Stan, following
[this](https://mc-stan.org/users/documentation/case-studies/tutorial_twopl.html)
tutorial.
* Code for parsing Alpaca, running Stan, and transforming the output
can be found [here](https://github.com/jerome-white/alpaca-bda). See
`bin/item-response.sh` for the workflow overview.
* Transformed data, including Stan's sampler `__` variables.