This Space applies [item response
theory](https://en.wikipedia.org/wiki/Item_response_theory) to the
[Alpaca](https://github.com/tatsu-lab/alpaca_eval) LLM evaluation
framework.

## Overview
Alpaca maintains a set of prompts, along with responses to those
prompts from a collection of LLMs. It evaluates models based on
average response preference. To establish preference, it first
designates one LLM as a "baseline." For every other model _m_, and
every prompt _p_, it presents the baseline's response to _p_ and _m_'s
response to _p_ to a judge. The judge determines which response better
addresses the prompt, assigning a "win" to the preferred model. Alpaca
[ranks models](https://tatsu-lab.github.io/alpaca_eval/) based on
their win percentage.
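
The win-percentage ranking described above can be sketched as follows. This is a minimal illustration, not Alpaca's actual code: the `win_rates` function, the tuple layout, and the model and prompt names are all hypothetical, and the judgements are made-up values.

```python
from collections import defaultdict


def win_rates(judgements):
    """Return each model's win percentage against the baseline.

    `judgements` is an iterable of (model, prompt, won) tuples, where
    `won` is True when the judge preferred the model's response over
    the baseline's response to the same prompt. (Hypothetical layout.)
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for model, _prompt, won in judgements:
        totals[model] += 1
        wins[model] += int(won)
    return {m: 100.0 * wins[m] / totals[m] for m in totals}


# Made-up judgements for two hypothetical models on three prompts.
judgements = [
    ("model-a", "p1", True),
    ("model-a", "p2", True),
    ("model-a", "p3", False),
    ("model-b", "p1", False),
    ("model-b", "p2", True),
    ("model-b", "p3", False),
]

rates = win_rates(judgements)
# Rank models from highest to lowest win percentage.
ranking = sorted(rates, key=rates.get, reverse=True)
```

Note that this ranking treats every prompt as equally informative, which is the assumption the item response view relaxes.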

This Space presents an alternative view of model comparison based on
item response theory (IRT). Item response theory is sometimes used to
estimate student ability in conjunction with exam rigor. With respect
to Alpaca, models are considered students and the collection of
prompts their exam. A model's answer to a prompt is "correct" if the
judge prefers it to the baseline's answer. Alpaca data was fit to a
two-parameter IRT model. Items are ranked based on the medians of
their respective parameter posteriors. Uncertainty in those posteriors
is presented as 95% HDIs ([highest density
intervals](https://cran.r-project.org/package=HDInterval)).
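
To make the two-parameter model and the posterior summaries concrete, here is a minimal sketch. It assumes the common parameterization in which the probability of a "correct" answer is the inverse logit of `alpha * (theta - beta)`, with `theta` the model's ability, `alpha` the prompt's discrimination, and `beta` the prompt's difficulty; the exact form used in this Space may differ. The function names are hypothetical, and the random draws merely stand in for real posterior samples from Stan.

```python
import math
import random
import statistics


def two_pl(theta, alpha, beta):
    """Probability that a model with ability `theta` beats the baseline
    on a prompt with discrimination `alpha` and difficulty `beta`
    (assumed 2PL parameterization)."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))


def hdi(samples, mass=0.95):
    """Narrowest interval containing `mass` of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    width = math.ceil(mass * n)  # number of samples the interval must cover
    if width < 1 or width > n:
        raise ValueError("mass must be in (0, 1] and samples non-empty")
    # Slide a window of `width` samples and keep the narrowest one.
    start = min(
        range(n - width + 1),
        key=lambda i: ordered[i + width - 1] - ordered[i],
    )
    return ordered[start], ordered[start + width - 1]


# Stand-in for posterior draws of one parameter (not real results).
rng = random.Random(0)
draws = [rng.gauss(1.0, 0.5) for _ in range(4000)]

median = statistics.median(draws)  # point estimate used for ranking
low, high = hdi(draws)             # 95% HDI used to show uncertainty
```

The sliding-window approach assumes a unimodal posterior; for multimodal posteriors a single interval can be misleading.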

This Space is a work in progress. Comments and suggestions are
welcome; please share them in the Community tab.

## Resources | |
* Parameters were estimated using Stan, following | |
[this](https://mc-stan.org/users/documentation/case-studies/tutorial_twopl.html) | |
tutorial. | |
* Code for running the workflow can be found | |
[here](https://github.com/jerome-white/alpaca-bda). See | |
`bin/item-response.sh`. | |
* Transformed data, including Stan's `__`-suffixed variables, can be | |
found | |
[here](https://huggingface.co/datasets/jerome-white/alpaca-irt-stan). | |