jerome-white committed

Commit 568b91b
1 Parent(s): 77c903f

Clarify text

Files changed (2):

1. _DISCLAIMER.md +15 -11
2. _README.md +18 -19
_DISCLAIMER.md CHANGED

@@ -1,12 +1,14 @@
 # Disclaimer
 
-This Space is primarily intended for exploration. Until otherwise
-stated, its results should be treated as points of reference rather
-than absolute fact. Viewers are encouraged to study the pipeline and
-understand the model before broadcasting strong opinions of model
-rankings based on what is seen here. Suggestions for improving this
-Space from those familiar with Alpaca or Bayesian data analysis are
-welcome!
+This Space is primarily intended for exploration. For now, its results
+should be treated as points of reference rather than absolute
+facts. Viewers are encouraged to study the pipeline and understand the
+model to help put the results into context.
+
+Suggestions for improving this Space from those familiar with Alpaca
+or Bayesian data analysis are welcome! Please use the
+[community tab](https://huggingface.co/spaces/jerome-white/alpaca-eval/discussions)
+to do so.
 
 ## Resources
 
@@ -15,9 +17,11 @@ welcome!
 
 ## TODO
 
-[] Extend the Stan model to incorporate ties and response presentation
-ordering
+* Extend the Stan model to incorporate ties and response presentation
+  ordering
+
+* Add details of the MCMC chains
 
-[] Add details of the MCMC chains
+* Automate data processing
 
-[] Automate data processing
+* Explicit documentation of the process
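
For readers wondering what the Stan model in the first TODO item
refers to: the core of a Bradley–Terry ranking is small enough to
sketch. The Python below is an illustrative stand-in, not this Space's
actual pipeline; the function names and the toy gradient-ascent fit
are assumptions (the Space samples a Bayesian posterior with MCMC
rather than maximizing a likelihood).

```python
import math

def log_likelihood(abilities, comparisons):
    """Bradley-Terry log-likelihood: each comparison is a (winner, loser)
    pair and P(winner beats loser) = sigmoid(ability gap)."""
    return sum(
        -math.log1p(math.exp(abilities[loser] - abilities[winner]))
        for winner, loser in comparisons
    )

def fit(models, comparisons, steps=2000, lr=0.05):
    """Toy maximum-likelihood fit by gradient ascent (hypothetical
    stand-in for the Space's Stan/MCMC estimation)."""
    abilities = {m: 0.0 for m in models}
    for _ in range(steps):
        grad = {m: 0.0 for m in models}
        for winner, loser in comparisons:
            # P(winner beats loser) under the current ability estimates.
            p = 1 / (1 + math.exp(abilities[loser] - abilities[winner]))
            grad[winner] += 1 - p
            grad[loser] -= 1 - p
        for m in models:
            abilities[m] += lr * grad[m]
    # Abilities are only identified up to a constant shift; center them.
    mean = sum(abilities.values()) / len(abilities)
    return {m: a - mean for m, a in abilities.items()}

# Toy data: a model beats the baseline in 8 of 10 judged comparisons.
data = [("model-a", "baseline")] * 8 + [("baseline", "model-a")] * 2
print(fit(["model-a", "baseline"], data))  # model-a gets the higher ability
```

The TODO items would extend this same likelihood, for example with a
tie outcome and a term for which response was presented first.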
_README.md CHANGED

@@ -1,32 +1,31 @@
 [Alpaca](https://github.com/tatsu-lab/alpaca_eval) is an LLM
 evaluation framework. It maintains a set of prompts, along with
-responses to those prompts from a collection of LLMs. It then presents
-pairs of responses to a judge that determines which response better
-addresses the prompt. Rather than compare all response pairs, the
-framework identifies a baseline model and compares all models to
-that. The standard method of ranking models is to sort by baseline
-model win percentage.
+responses to those prompts from a collection of LLMs. It presents
+pairs of responses to a judge who determines which response better
+addresses the request of the prompt. Rather than compare all response
+pairs, the framework sets one model as a baseline, then individually
+compares all responses to that. Its primary method of ranking models
+is via win percentages over the baseline.
 
 This Space presents an alternative method of ranking based on the
 [Bradley–Terry
 model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model)
 (BT). Given a collection of items, Bradley–Terry estimates the
-_ability_ of each item based on pairwise comparisons between them. In
-sports, for example, that might be the ability of a given team based
-on games that team has played within a league. Once calculated,
-ability can be used to estimate the probability that one item will be
-better-than another, even if those items have yet to be formally
-compared.
+_ability_ of each item based on pairwise comparisons between
+them. Once calculated, ability can be used to estimate the probability
+that one item will be better than another, even if those items have
+not been formally compared. In sports, for example, ability might
+correspond to a team's strength within its league. Ability could then
+be used to predict outcomes between teams that have yet to play.
 
 The Alpaca project presents a good opportunity to apply BT in
 practice; especially since BT fits nicely into a Bayesian analysis
-framework. As LLMs become more pervasive, quantifying the uncertainty
-in their evaluation is increasingly important. Bayesian frameworks are
-good at that.
+framework. As LLMs become more pervasive, quantifying uncertainty in
+their evaluation is increasingly important, something that Bayesian
+frameworks do well.
 
 This Space is divided into two primary sections: the first presents a
 ranking of models based on estimated ability. The figure on the right
-presents this ranking for the top 10 models, while the table below
-presents the full set. The second section estimates the probability
-that one model will be preferred to another. A final section at the
-bottom is a disclaimer that presents details about the workflow.
+visualizes this ranking for the top 10 models, while the table below
+it presents the full set. The second section estimates the probability
+that one model will be preferred to another.
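
To make the "probability that one model will be preferred to another"
concrete: under Bradley–Terry it is a logistic function of the ability
gap, and with a Bayesian fit it can be averaged over posterior draws
so the estimate carries the uncertainty along. A minimal sketch,
assuming ability draws are available as plain lists; the helper name
and inputs are hypothetical, not this Space's actual code:

```python
import math

def preference_probability(draws_a, draws_b):
    """P(model A preferred to model B), averaged over paired posterior
    draws of ability so the answer reflects estimation uncertainty."""
    probs = [1 / (1 + math.exp(b - a)) for a, b in zip(draws_a, draws_b)]
    return sum(probs) / len(probs)

# With point estimates (single "draws"), this reduces to plain
# Bradley-Terry: an ability gap of 1.1 gives roughly a 75% preference.
print(preference_probability([1.4], [0.3]))
```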