This Space applies [item response theory](https://en.wikipedia.org/wiki/Item_response_theory) (IRT) to the [Alpaca](https://github.com/tatsu-lab/alpaca_eval) LLM evaluation framework. Alpaca maintains a set of prompts, along with responses to those prompts from a collection of LLMs, and evaluates models based on average response preference. To establish preference, it first designates one LLM as a "baseline." For every other model _m_ and every prompt _p_, it presents the baseline's response to _p_ and _m_'s response to _p_ to a judge. The judge determines which response better addresses the prompt, assigning a "win" to the preferred model. Alpaca [ranks models](https://tatsu-lab.github.io/alpaca_eval/) by their win percentage.

This Space presents an alternative view of model comparison. Using IRT, models are ranked by their latent ability, and prompts are ranked by their latent difficulty and discrimination. These values come from the posterior of a fitted 2PL IRT model. From this perspective, a model's answer to a prompt is "correct" if the judge prefers it to the baseline's.

This Space is a work in progress. Comments and suggestions are welcome; please share them via the Community tab.

Resources:

* Parameters were estimated using Stan, following [this](https://mc-stan.org/users/documentation/case-studies/tutorial_twopl.html) tutorial.
* Code for parsing Alpaca, running Stan, and transforming the output can be found [here](https://github.com/jerome-white/alpaca-bda). See `bin/item-response.sh` for an overview of the workflow.
* Transformed data, including Stan sample `__` variables.
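
As a rough illustration of the approach, the Stan program below sketches a 2PL model of judge decisions. It is not the exact model from the repository; the variable names, data layout, and priors here are assumptions chosen for clarity (the priors loosely follow the linked tutorial). Each data row is one judged comparison: a prompt, a model, and whether the judge preferred that model's response to the baseline's.

```stan
data {
  int<lower=1> I;                          // number of prompts (items)
  int<lower=1> J;                          // number of models (respondents)
  int<lower=1> N;                          // number of judged comparisons
  array[N] int<lower=1, upper=I> prompt;   // prompt index for each comparison
  array[N] int<lower=1, upper=J> llm;      // model index for each comparison
  array[N] int<lower=0, upper=1> win;      // 1 if the judge preferred the model to the baseline
}
parameters {
  vector<lower=0>[I] alpha;   // prompt discrimination
  vector[I] beta;             // prompt difficulty
  vector[J] theta;            // model ability
}
model {
  // Weakly informative priors (placeholders, loosely following the tutorial).
  alpha ~ lognormal(0.5, 1);
  beta ~ normal(0, 10);
  theta ~ normal(0, 1);

  // 2PL likelihood: P(win) = inv_logit(alpha_p * (theta_m - beta_p)).
  win ~ bernoulli_logit(alpha[prompt] .* (theta[llm] - beta[prompt]));
}
```

Under this sketch, the posterior over `theta` orders models by ability, while the posteriors over `beta` and `alpha` order prompts by difficulty and discrimination, which is the view this Space presents.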