This Space applies [item response theory](https://en.wikipedia.org/wiki/Item_response_theory) (IRT) to the [Alpaca](https://github.com/tatsu-lab/alpaca_eval) LLM evaluation framework. Alpaca maintains a set of prompts, along with responses to those prompts from a collection of LLMs, and evaluates models based on average response preference. To establish preference, it first designates one LLM as a "baseline." For every other model _m_ and every prompt _p_, it presents the baseline's response to _p_ and _m_'s response to _p_ to a judge. The judge determines which response better addresses the prompt, assigning a "win" to the preferred model. Alpaca [ranks models](https://tatsu-lab.github.io/alpaca_eval/) by their win percentage.

This Space presents an alternative view of model comparison. Using IRT, models are ranked by their latent ability, and prompts are ranked by their latent difficulty and discrimination. These values come from the posterior of a fitted 2PL IRT model. From this perspective, a model's answer to a prompt is "correct" if the judge prefers it to the baseline's.

This Space is a work in progress. Comments and suggestions are welcome; please share them via the Community tab.

Resources:

* Parameters were estimated using Stan, following [this](https://mc-stan.org/users/documentation/case-studies/tutorial_twopl.html) tutorial.
* Code for parsing Alpaca, running Stan, and transforming the output can be found [here](https://github.com/jerome-white/alpaca-bda). See `bin/item-response.sh` for an overview of the workflow.
* Transformed data, including Stan sample `__` variables.
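
As a rough illustration of the approach, the Stan program below sketches a 2PL model of judge decisions. It is not the exact model from the repository; the variable names, data layout, and priors here are assumptions chosen for clarity (the priors loosely follow the linked tutorial). Each data row is one judged comparison: a prompt, a model, and whether the judge preferred that model's response to the baseline's.

```stan
data {
  int<lower=1> I;                          // number of prompts (items)
  int<lower=1> J;                          // number of models (respondents)
  int<lower=1> N;                          // number of judged comparisons
  array[N] int<lower=1, upper=I> prompt;   // prompt index for each comparison
  array[N] int<lower=1, upper=J> llm;      // model index for each comparison
  array[N] int<lower=0, upper=1> win;      // 1 if the judge preferred the model to the baseline
}
parameters {
  vector<lower=0>[I] alpha;   // prompt discrimination
  vector[I] beta;             // prompt difficulty
  vector[J] theta;            // model ability
}
model {
  // Weakly informative priors (placeholders, loosely following the tutorial).
  alpha ~ lognormal(0.5, 1);
  beta ~ normal(0, 10);
  theta ~ normal(0, 1);

  // 2PL likelihood: P(win) = inv_logit(alpha_p * (theta_m - beta_p)).
  win ~ bernoulli_logit(alpha[prompt] .* (theta[llm] - beta[prompt]));
}
```

Under this sketch, the posterior over `theta` orders models by ability, while the posteriors over `beta` and `alpha` order prompts by difficulty and discrimination, which is the view this Space presents.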