This Space applies [item response theory](https://en.wikipedia.org/wiki/Item_response_theory) to the [Alpaca](https://github.com/tatsu-lab/alpaca_eval) LLM evaluation framework.

## Overview

Alpaca maintains a set of prompts, along with responses to those prompts from a collection of LLMs. It evaluates models based on average response preference. To establish preference, it first designates one LLM as a "baseline." For every other model _m_ and every prompt _p_, it presents the baseline's response to _p_ and _m_'s response to _p_ to a judge. The judge determines which response better addresses the prompt, assigning a "win" to the preferred model. Alpaca [ranks models](https://tatsu-lab.github.io/alpaca_eval/) based on their win percentage.

This Space presents an alternative view of model comparison based on item response theory (IRT). Item response theory is often used to jointly estimate student ability and exam rigor. With respect to Alpaca, models are treated as students and the collection of prompts as their exam. A model's answer to a question is "correct" if the judge prefers it to the baseline's.

The Alpaca data was fit to a two-parameter IRT model. Items are ranked based on the medians of their respective parameter posteriors. Uncertainty in those posteriors is presented as 95% HDIs ([highest density intervals](https://cran.r-project.org/package=HDInterval)).

This Space is a work in progress. Comments and suggestions are welcome; please use the Community tab to do so.

## Resources

* Parameters were estimated using Stan, following [this](https://mc-stan.org/users/documentation/case-studies/tutorial_twopl.html) tutorial.
* Code for running the workflow can be found [here](https://github.com/jerome-white/alpaca-bda). See `bin/item-response.sh`.
* Transformed data, including Stan's `__`-suffixed variables, can be found [here](https://huggingface.co/datasets/jerome-white/alpaca-irt-stan).
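
As a rough illustration of what the two-parameter logistic (2PL) model in the tutorial above entails, the sketch below assigns each prompt a discrimination and a difficulty parameter and each model an ability parameter; the judge's preference plays the role of a correct/incorrect response. The priors and exact parameterization shown here are assumptions for illustration only; the model actually used by this Space is defined in the linked repository.

```stan
data {
  int<lower=1> I;                     // number of prompts (items)
  int<lower=1> J;                     // number of models (students)
  int<lower=1> N;                     // number of (prompt, model) observations
  array[N] int<lower=1, upper=I> ii;  // prompt for observation n
  array[N] int<lower=1, upper=J> jj;  // model for observation n
  array[N] int<lower=0, upper=1> y;   // 1 if the judge preferred model jj[n] over the baseline
}
parameters {
  vector<lower=0>[I] alpha;  // prompt discrimination
  vector[I] beta;            // prompt difficulty
  vector[J] theta;           // model ability
}
model {
  // Illustrative, weakly informative priors; the tutorial and repository may differ.
  alpha ~ lognormal(0, 1);
  beta ~ normal(0, 3);
  theta ~ normal(0, 1);

  // 2PL likelihood: P(win) = logit^-1(alpha_i * (theta_j - beta_i))
  y ~ bernoulli_logit(alpha[ii] .* (theta[jj] - beta[ii]));
}
```

In this framing, posterior medians of `theta` would drive the model ranking, while `alpha` and `beta` characterize the prompts themselves.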