# alpaca-item-response
This Space applies [item response
theory](https://en.wikipedia.org/wiki/Item_response_theory) to the
[Alpaca](https://github.com/tatsu-lab/alpaca_eval) LLM evaluation
framework.
## Overview
Alpaca maintains a set of prompts, along with responses to those
prompts from a collection of LLMs. It evaluates models based on
average response preference. To establish preference, it first
designates one LLM as a "baseline." For every other model _m_, and
every prompt _p_, it presents the baseline's response to _p_ and _m_'s
response to _p_ to a judge. The judge determines which response better
addresses the prompt, assigning a "win" to the preferred model. Alpaca
[ranks models](https://tatsu-lab.github.io/alpaca_eval/) based on
their win percentage.
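As a rough illustration of that ranking statistic (not AlpacaEval's actual code; the records and field names below are made up), the win percentage is just a per-model average over judged prompts:

```python
from collections import defaultdict

# Made-up judgments: one per (model, prompt) pair. "win" means the judge
# preferred the model's response to the baseline's response for that prompt.
judgments = [
    {"model": "model-a", "prompt": "p1", "win": True},
    {"model": "model-a", "prompt": "p2", "win": False},
    {"model": "model-b", "prompt": "p1", "win": True},
    {"model": "model-b", "prompt": "p2", "win": True},
]

wins = defaultdict(int)
totals = defaultdict(int)
for j in judgments:
    totals[j["model"]] += 1
    wins[j["model"]] += int(j["win"])

# Rank by the fraction of prompts on which each model beat the baseline.
for model in sorted(totals, key=lambda m: wins[m] / totals[m], reverse=True):
    print(f"{model}: {wins[model] / totals[model]:.0%}")
```

The real framework handles details such as annotator configuration and ties; the point here is only that the leaderboard statistic is an average over prompts.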
An alternative view of model comparison, based on item response theory
(IRT), is presented here. Item response theory is often used to
estimate student ability jointly with exam difficulty. With respect to
Alpaca, the models are the students and the collection of prompts is
their exam. A model's answer to a prompt is "correct" if the judge
prefers it to the baseline's. The Alpaca data was fit to a
two-parameter IRT model. Items are ranked by the medians of their
respective parameter posteriors; uncertainty in those posteriors is
presented as 95% HDIs ([highest density
intervals](https://cran.r-project.org/package=HDInterval)).
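Concretely, in the standard two-parameter logistic (2PL) formulation (the form used in the Stan tutorial linked under Resources), the probability that the judge prefers model _m_'s response to prompt _p_ over the baseline's is modeled roughly as

$$
\Pr(y_{pm} = 1 \mid \theta_m, a_p, b_p) = \operatorname{logit}^{-1}\bigl(a_p(\theta_m - b_p)\bigr),
$$

where $\theta_m$ is model _m_'s ability, $b_p$ is prompt _p_'s difficulty, and $a_p$ is its discrimination (how sharply the prompt separates strong models from weak ones).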
This Space is a work in progress. Comments and suggestions are
welcome; please share them via the Community tab.
## Resources
* Parameters were estimated using Stan, following
[this](https://mc-stan.org/users/documentation/case-studies/tutorial_twopl.html)
tutorial.
* Code for running the workflow can be found
[here](https://github.com/jerome-white/alpaca-bda). See
`bin/item-response.sh`.
* Transformed data, including Stan's `__`-suffixed variables, can be
found
[here](https://huggingface.co/datasets/jerome-white/alpaca-irt-stan).
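
For orientation, here is a minimal sketch of how a vector of posterior draws can be reduced to the median and 95% HDI summaries described in the Overview. The draws below are synthetic, and this is not necessarily how the workflow itself computes the summaries (the linked R package, HDInterval, does the equivalent job):

```python
import numpy as np

def hdi(draws, prob=0.95):
    """Narrowest interval containing `prob` of the draws (assumes unimodality)."""
    draws = np.sort(np.asarray(draws))
    n = len(draws)
    window = int(np.ceil(prob * n))
    # Width of every candidate interval spanning `window` consecutive draws.
    widths = draws[window - 1:] - draws[: n - window + 1]
    lo = int(np.argmin(widths))
    return float(draws[lo]), float(draws[lo + window - 1])

# Synthetic draws standing in for one parameter's posterior.
rng = np.random.default_rng(0)
draws = rng.normal(loc=0.3, scale=0.2, size=4000)
print("median:", float(np.median(draws)))
print("95% HDI:", hdi(draws))
```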