jerome-white committed

Commit 568b91b
1 Parent(s): 77c903f

Clarify text

Files changed (2):

1. _DISCLAIMER.md +15 -11
2. _README.md +18 -19
_DISCLAIMER.md CHANGED

@@ -1,12 +1,14 @@
 # Disclaimer
 
-This Space is primarily intended for exploration. Until otherwise
-stated, its results should be treated as points of reference rather
-than absolute fact. Viewers are encouraged to study the pipeline and
-understand the model before broadcasting strong opinions of model
-rankings based on what is seen here. Suggestions for improving this
-Space from those familiar with Alpaca or Bayesian data analysis are
-welcome!
+This Space is primarily intended for exploration. For now, its results
+should be treated as points of reference rather than absolute
+facts. Viewers are encouraged to study the pipeline and understand the
+model to help put the results into context.
+
+Suggestions for improving this Space from those familiar with Alpaca
+or Bayesian data analysis are welcome! Please use the
+[community tab](https://huggingface.co/spaces/jerome-white/alpaca-eval/discussions)
+to do so.
 
 ## Resources
 
@@ -15,9 +17,11 @@ welcome!
 
 ## TODO
 
-[] Extend the Stan model to incorporate ties and response presentation
-ordering
+* Extend the Stan model to incorporate ties and response presentation
+  ordering
+
+* Add details of the MCMC chains
 
-[] Add details of the MCMC chains
+* Automate data processing
 
-[] Automate data processing
+* Explicit documentation of the process
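
For readers wondering what the Stan model in the first TODO item
refers to: the core of a Bradley–Terry ranking is small enough to
sketch. The Python below is an illustrative stand-in, not this Space's
actual pipeline; the function names and the toy gradient-ascent fit
are assumptions (the Space samples a Bayesian posterior with MCMC
rather than maximizing a likelihood).

```python
import math

def log_likelihood(abilities, comparisons):
    """Bradley-Terry log-likelihood: each comparison is a (winner, loser)
    pair and P(winner beats loser) = sigmoid(ability gap)."""
    return sum(
        -math.log1p(math.exp(abilities[loser] - abilities[winner]))
        for winner, loser in comparisons
    )

def fit(models, comparisons, steps=2000, lr=0.05):
    """Toy maximum-likelihood fit by gradient ascent (hypothetical
    stand-in for the Space's Stan/MCMC estimation)."""
    abilities = {m: 0.0 for m in models}
    for _ in range(steps):
        grad = {m: 0.0 for m in models}
        for winner, loser in comparisons:
            # P(winner beats loser) under the current ability estimates.
            p = 1 / (1 + math.exp(abilities[loser] - abilities[winner]))
            grad[winner] += 1 - p
            grad[loser] -= 1 - p
        for m in models:
            abilities[m] += lr * grad[m]
    # Abilities are only identified up to a constant shift; center them.
    mean = sum(abilities.values()) / len(abilities)
    return {m: a - mean for m, a in abilities.items()}

# Toy data: a model beats the baseline in 8 of 10 judged comparisons.
data = [("model-a", "baseline")] * 8 + [("baseline", "model-a")] * 2
print(fit(["model-a", "baseline"], data))  # model-a gets the higher ability
```

The TODO items would extend this same likelihood, for example with a
tie outcome and a term for which response was presented first.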
_README.md CHANGED

@@ -1,32 +1,31 @@
 [Alpaca](https://github.com/tatsu-lab/alpaca_eval) is an LLM
 evaluation framework. It maintains a set of prompts, along with
-responses to those prompts from a collection of LLMs. It then presents
-pairs of responses to a judge that determines which response better
-addresses the prompt. Rather than compare all response pairs, the
-framework identifies a baseline model and compares all models to
-that. The standard method of ranking models is to sort by baseline
-model win percentage.
+responses to those prompts from a collection of LLMs. It presents
+pairs of responses to a judge who determines which response better
+addresses the request of the prompt. Rather than compare all response
+pairs, the framework sets one model as a baseline, then individually
+compares all responses to that. Its primary method of ranking models
+is via win percentages over the baseline.
 
 This Space presents an alternative method of ranking based on the
 [Bradley–Terry
 model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model)
 (BT). Given a collection of items, Bradley–Terry estimates the
-_ability_ of each item based on pairwise comparisons between them. In
-sports, for example, that might be the ability of a given team based
-on games that team has played within a league. Once calculated,
-ability can be used to estimate the probability that one item will be
-better-than another, even if those items have yet to be formally
-compared.
+_ability_ of each item based on pairwise comparisons between
+them. Once calculated, ability can be used to estimate the probability
+that one item will be better than another, even if those items have
+not been formally compared. In sports, for example, ability might
+correspond to a team's strength within its league. Ability could then
+be used to predict outcomes between teams that have yet to play.
 
 The Alpaca project presents a good opportunity to apply BT in
 practice; especially since BT fits nicely into a Bayesian analysis
-framework. As LLMs become more pervasive, quantifying the uncertainty
-in their evaluation is increasingly important. Bayesian frameworks are
-good at that.
+framework. As LLMs become more pervasive, quantifying uncertainty in
+their evaluation is increasingly important, something that Bayesian
+frameworks do well.
 
 This Space is divided into two primary sections: the first presents a
 ranking of models based on estimated ability. The figure on the right
-presents this ranking for the top 10 models, while the table below
-presents the full set. The second section estimates the probability
-that one model will be preferred to another. A final section at the
-bottom is a disclaimer that presents details about the workflow.
+visualizes this ranking for the top 10 models, while the table below
+it presents the full set. The second section estimates the probability
+that one model will be preferred to another.
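
To make the "probability that one model will be preferred to another"
concrete: under Bradley–Terry it is a logistic function of the ability
gap, and with a Bayesian fit it can be averaged over posterior draws
so the estimate carries the uncertainty along. A minimal sketch,
assuming ability draws are available as plain lists; the helper name
and inputs are hypothetical, not this Space's actual code:

```python
import math

def preference_probability(draws_a, draws_b):
    """P(model A preferred to model B), averaged over paired posterior
    draws of ability so the answer reflects estimation uncertainty."""
    probs = [1 / (1 + math.exp(b - a)) for a, b in zip(draws_a, draws_b)]
    return sum(probs) / len(probs)

# With point estimates (single "draws"), this reduces to plain
# Bradley-Terry: an ability gap of 1.1 gives roughly a 75% preference.
print(preference_probability([1.4], [0.3]))
```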