alpaca-item-response

This Space applies item response theory (IRT) to the Alpaca LLM evaluation framework.

Alpaca maintains a set of prompts, along with responses to those prompts from a collection of LLMs. It evaluates models based on average response preference. To establish preference, it first designates one LLM as the "baseline." For every other model m and every prompt p, it presents the baseline's response to p and m's response to p to a judge. The judge determines which response better addresses the prompt, assigning a "win" to the preferred model. Alpaca ranks models by their win percentage.
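To make that ranking concrete, here is a minimal sketch of the win-percentage calculation; the data layout and names are illustrative and not taken from the Alpaca codebase:

```python
from collections import defaultdict

# Each judgement records which non-baseline model was compared and
# whether the judge preferred its response over the baseline's.
# (Illustrative structure; Alpaca's actual output format differs.)
judgements = [
    {"model": "model-a", "prompt": "p1", "win": True},
    {"model": "model-a", "prompt": "p2", "win": False},
    {"model": "model-b", "prompt": "p1", "win": True},
    {"model": "model-b", "prompt": "p2", "win": True},
]

wins = defaultdict(int)
totals = defaultdict(int)
for j in judgements:
    totals[j["model"]] += 1
    wins[j["model"]] += int(j["win"])

# Rank models by their fraction of judged wins against the baseline.
ranking = sorted(totals, key=lambda m: wins[m] / totals[m], reverse=True)
print(ranking)  # ['model-b', 'model-a']
```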

This Space presents an alternative view of model comparison. Using IRT, models are ranked by their latent ability, and prompts are ranked by their latent difficulty and discrimination. These values come from the posterior of a fitted two-parameter logistic (2PL) IRT model. From this perspective, a model's answer to a prompt is "correct" if the judge deems it better than the baseline's response.
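For reference, the standard 2PL formulation (the notation below is generic, not lifted from the Space's Stan code) models the probability that model i's response to prompt j is judged better than the baseline's as

$$
\Pr(y_{ij} = 1) = \operatorname{logit}^{-1}\!\bigl(a_j\,(\theta_i - b_j)\bigr),
$$

where $\theta_i$ is model i's latent ability, $b_j$ is prompt j's latent difficulty, and $a_j > 0$ is its discrimination.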

This Space is a work in progress. Comments and suggestions are welcome; please use the Community tab to do so. Resources:

  • Parameters were estimated using Stan, following this tutorial; a rough sketch of the fitting step appears after this list.

  • Code for parsing Alpaca, running Stan, and transforming the output can be found here. See bin/item-response.sh for the workflow overview.

  • Transformed data, including Stan sample __ variables.
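
A minimal sketch of the estimation step, assuming a 2PL Stan program saved as item-response.stan and fitted with CmdStanPy; the actual file names, data layout, and priors in the linked repository may differ:

```python
import pandas as pd
from cmdstanpy import CmdStanModel

# Illustrative input: one row per judged comparison, with integer codes
# for the model and prompt, and y = 1 when the judge preferred the
# model's response over the baseline's.
judged = pd.DataFrame({
    "model": [1, 1, 2, 2],
    "prompt": [1, 2, 1, 2],
    "y": [1, 0, 1, 1],
})

data = {
    "I": judged["model"].nunique(),   # number of models
    "J": judged["prompt"].nunique(),  # number of prompts
    "N": len(judged),                 # number of judged comparisons
    "ii": judged["model"].tolist(),
    "jj": judged["prompt"].tolist(),
    "y": judged["y"].tolist(),
}

# Compile and sample; "item-response.stan" is a hypothetical 2PL program
# with parameters theta (ability), b (difficulty), and a (discrimination).
model = CmdStanModel(stan_file="item-response.stan")
fit = model.sample(data=data, chains=4)

# Posterior means give one way to rank models by latent ability.
theta = fit.stan_variable("theta")  # shape: (draws, I)
print(theta.mean(axis=0))
```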