Spaces: jerome-white/alpaca-item-response

jerome-white committed on Feb 29

Commit cbe4373 • 1 Parent(s): d641f66

Clarity

Files changed (1)

_README.md CHANGED

@@ -10,13 +10,15 @@ designates one LLM as a "baseline." For every other model _m_, and
 every prompt _p_, it presents the baselines response to _p_ and _m_'s
 response to _p_ to a judge. The judge determines which response better
 addresses the prompt, assigning a "win" to the preferred model. Alpaca
-ranks models based on their win percentage.
+[ranks models](https://tatsu-lab.github.io/alpaca_eval/) based on
+their win percentage.
 An alternative view on model comparison is presented here. Using IRT,
-it rank models based on their latent ability, and ranks questions
-based on latent difficulty and discrimination; both from parameters of
-the 2PL IRT model. Under this view, a models answer to a question is
-correct if a judge feels it is better than the baseline.
+models are ranked based on their latent ability, and questions are
+ranked based on latent difficulty and discrimination. These values
+come from the posterior of a fitted 2PL IRT model. From this
+perspective, a model's answer to a question is "correct" if a judge
+feels it is better than the baseline.
 This Space is a work in progress. Comments and suggestions are
 welcome; please use the Community for doing so. Resources: