This Space applies [item response theory](https://en.wikipedia.org/wiki/Item_response_theory) to the [Alpaca](https://github.com/tatsu-lab/alpaca_eval) LLM evaluation framework.

## Overview

Alpaca maintains a set of prompts, along with responses to those prompts from a collection of LLMs. It evaluates models based on average response preference. To establish preference, it first designates one LLM as a "baseline." For every other model _m_ and every prompt _p_, it presents the baseline's response to _p_ and _m_'s response to _p_ to a judge. The judge determines which response better addresses the prompt, assigning a "win" to the preferred model. Alpaca [ranks models](https://tatsu-lab.github.io/alpaca_eval/) based on their win percentage.

This Space presents an alternative view of model comparison based on item response theory (IRT). Item response theory is often used to jointly estimate student ability and exam rigor. With respect to Alpaca, models are treated as students and the collection of prompts as their exam. A model's answer to a question is "correct" if the judge prefers it to the baseline's.

The Alpaca data was fit to a two-parameter IRT model. Items are ranked based on the medians of their respective parameter posteriors. Uncertainty in those posteriors is presented as 95% HDIs ([highest density intervals](https://cran.r-project.org/package=HDInterval)).

This Space is a work in progress. Comments and suggestions are welcome; please use the Community tab to do so.

## Resources

* Parameters were estimated using Stan, following [this](https://mc-stan.org/users/documentation/case-studies/tutorial_twopl.html) tutorial.
* Code for running the workflow can be found [here](https://github.com/jerome-white/alpaca-bda). See `bin/item-response.sh`.
* Transformed data, including Stan's `__`-suffixed variables, can be found [here](https://huggingface.co/datasets/jerome-white/alpaca-irt-stan).
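
As a rough illustration of what the two-parameter logistic (2PL) model in the tutorial above entails, the sketch below assigns each prompt a discrimination and a difficulty parameter and each model an ability parameter; the judge's preference plays the role of a correct/incorrect response. The priors and exact parameterization shown here are assumptions for illustration only; the model actually used by this Space is defined in the linked repository.

```stan
data {
  int<lower=1> I;                     // number of prompts (items)
  int<lower=1> J;                     // number of models (students)
  int<lower=1> N;                     // number of (prompt, model) observations
  array[N] int<lower=1, upper=I> ii;  // prompt for observation n
  array[N] int<lower=1, upper=J> jj;  // model for observation n
  array[N] int<lower=0, upper=1> y;   // 1 if the judge preferred model jj[n] over the baseline
}
parameters {
  vector<lower=0>[I] alpha;  // prompt discrimination
  vector[I] beta;            // prompt difficulty
  vector[J] theta;           // model ability
}
model {
  // Illustrative, weakly informative priors; the tutorial and repository may differ.
  alpha ~ lognormal(0, 1);
  beta ~ normal(0, 3);
  theta ~ normal(0, 1);

  // 2PL likelihood: P(win) = logit^-1(alpha_i * (theta_j - beta_i))
  y ~ bernoulli_logit(alpha[ii] .* (theta[jj] - beta[ii]));
}
```

In this framing, posterior medians of `theta` would drive the model ranking, while `alpha` and `beta` characterize the prompts themselves.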