Commit: cbe4373
Parent(s): d641f66
Clarity
_README.md CHANGED (+7 -5)
@@ -10,13 +10,15 @@ designates one LLM as a "baseline." For every other model _m_, and
 every prompt _p_, it presents the baselines response to _p_ and _m_'s
 response to _p_ to a judge. The judge determines which response better
 addresses the prompt, assigning a "win" to the preferred model. Alpaca
-ranks models based on
+[ranks models](https://tatsu-lab.github.io/alpaca_eval/) based on
+their win percentage.
 
 An alternative view on model comparison is presented here. Using IRT,
-
-based on latent difficulty and discrimination
-
-
+models are ranked based on their latent ability, and questions are
+ranked based on latent difficulty and discrimination. These values
+come from the posterior of a fitted 2PL IRT model. From this
+perspective, a model's answer to a question is "correct" if a judge
+feels it is better than the baseline.
 
 This Space is a work in progress. Comments and suggestions are
 welcome; please use the Community for doing so. Resources:
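
The added README text contrasts two ways of scoring a model: its win percentage against the baseline, and its latent ability under a 2PL IRT model. As a rough illustration only, the sketch below computes a win rate from judge outcomes and evaluates the standard 2PL item response function. The function names and the example outcomes are hypothetical; they are not taken from the Space's code, and nothing here performs the posterior fitting the README mentions.

```python
import math


def win_rate(outcomes):
    """Fraction of prompts on which the judge preferred the model over the baseline."""
    return sum(outcomes) / len(outcomes)


def p_correct_2pl(ability, difficulty, discrimination):
    """Standard 2PL item response function: sigmoid(discrimination * (ability - difficulty))."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


# Hypothetical judge outcomes for one model: 1 = preferred over the baseline, 0 = not.
outcomes = [1, 0, 1, 1]
print(win_rate(outcomes))  # 0.75

# When ability equals difficulty, the model is "correct" with probability 0.5,
# regardless of the discrimination value.
print(p_correct_2pl(ability=0.0, difficulty=0.0, discrimination=1.5))  # 0.5
```

In this framing, a higher discrimination value makes the probability of a "correct" answer change more sharply as ability moves away from the question's difficulty.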