Commit 99722b5 (1 parent: 7b5deb7), committed by jerome-white
Intro clarification
Files changed: _README.md +4 -4
_README.md
CHANGED
@@ -5,7 +5,7 @@ pairs of responses to a judge who determines which response better
 addresses the prompt's intention. Rather than compare all response
 pairs, the framework sets one model as a baseline, then individually
 compares all responses to that. Its primary method of ranking models
-
+is with win percentages over the baseline.
 
 This Space presents an alternative method of ranking based on the
 [Bradley–Terry
@@ -20,12 +20,12 @@ then be used to predict outcomes between teams that have yet to play.
 
 The Alpaca project presents a good opportunity to apply BT in
 practice; especially since BT fits nicely into a Bayesian analysis
-framework. As LLMs become more pervasive,
-
+framework. As LLMs become more pervasive, so to is considering
+evaluation uncertainty when comparing them; something that Bayesian
 frameworks do well.
 
 This Space is divided into two primary sections: the first presents a
 ranking of models based on estimated ability. The figure on the right
 visualizes this ranking for the top 10 models, while the table below
-
+presents the full set. The second section estimates the probability
 that one model will be preferred to another.
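The README text in this diff contrasts ranking by win percentage over a baseline with ranking by Bradley–Terry ability, and mentions estimating the probability that one model is preferred to another. The sketch below illustrates both quantities. It is not taken from the Space's code: the function names are hypothetical, abilities are assumed to be on a log scale, and it uses plain point estimates rather than the Bayesian analysis the README describes.

```python
import math


def win_rate_over_baseline(outcomes):
    """Fraction of comparisons won against the baseline.

    `outcomes` is a hypothetical list of 1 (win) / 0 (loss) judgements.
    """
    return sum(outcomes) / len(outcomes)


def bt_preference_probability(ability_i, ability_j):
    """Bradley-Terry probability that model i is preferred to model j.

    Assumes log-scale abilities, so
    P(i > j) = exp(a_i) / (exp(a_i) + exp(a_j)),
    written here in the equivalent logistic form.
    """
    return 1.0 / (1.0 + math.exp(ability_j - ability_i))
```

Under this parameterization, two models with equal ability are each preferred with probability 0.5, and the two directed probabilities for any pair sum to 1. In a Bayesian fit, the same function could be applied to each posterior draw of the abilities to carry the uncertainty through to the preference probability.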