jerome-white committed on
Commit cbe4373
1 Parent(s): d641f66
Files changed (1)
  1. _README.md +7 -5
_README.md CHANGED
@@ -10,13 +10,15 @@ designates one LLM as a "baseline." For every other model _m_, and
  every prompt _p_, it presents the baselines response to _p_ and _m_'s
  response to _p_ to a judge. The judge determines which response better
  addresses the prompt, assigning a "win" to the preferred model. Alpaca
- ranks models based on their win percentage.
+ [ranks models](https://tatsu-lab.github.io/alpaca_eval/) based on
+ their win percentage.

  An alternative view on model comparison is presented here. Using IRT,
- it rank models based on their latent ability, and ranks questions
- based on latent difficulty and discrimination; both from parameters of
- the 2PL IRT model. Under this view, a models answer to a question is
- correct if a judge feels it is better than the baseline.
+ models are ranked based on their latent ability, and questions are
+ ranked based on latent difficulty and discrimination. These values
+ come from the posterior of a fitted 2PL IRT model. From this
+ perspective, a model's answer to a question is "correct" if a judge
+ feels it is better than the baseline.

  This Space is a work in progress. Comments and suggestions are
  welcome; please use the Community for doing so. Resources:
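
For context on the 2PL IRT model mentioned in the new text: it models the probability that a model "wins" on a prompt as a logistic function of the model's latent ability and the prompt's difficulty and discrimination. The sketch below is a minimal, illustrative version of that item response function only; the parameter names (`theta`, `a`, `b`) and the NumPy implementation are assumptions, not the Space's actual fitting code, which is not shown in this diff.

```python
import numpy as np

def p_win(theta: float, a: float, b: float) -> float:
    """Standard 2PL item response function.

    theta: latent ability of the model
    a:     discrimination of the prompt (question)
    b:     difficulty of the prompt (question)

    Returns the probability that the model's response is judged
    better than the baseline's (i.e., counted as "correct").
    """
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Example: an average-ability model (theta = 0) on an easy (b = -1),
# fairly discriminating (a = 1.5) prompt.
print(p_win(theta=0.0, a=1.5, b=-1.0))  # ~0.82
```

Under this parameterization, a larger discrimination `a` makes the win probability more sensitive to the gap between a model's ability and a prompt's difficulty, which is what makes per-question discrimination a useful ranking signal alongside difficulty.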