jerome-white committed on
Commit
b01565f
1 Parent(s): 6ee7b0c

Work around Hugging Face managed writes

README.md DELETED
@@ -1,38 +0,0 @@
- ---
- title: alpaca-bt-eval
- app_file: app.py
- sdk: gradio
- sdk_version: 4.19.1
- ---
- [Alpaca](https://github.com/tatsu-lab/alpaca_eval) is an LLM
- evaluation framework. It maintains a set of prompts, along with
- responses to those prompts from a collection of LLMs. It then presents
- pairs of responses to a judge that determines which response better
- addresses the prompt. Rather than compare all response pairs, the
- framework identifies a baseline model and compares all models to
- that. The standard method of ranking models is to sort by baseline
- model win percentage.
-
- This Space presents an alternative method of ranking based on the
- [Bradley–Terry
- model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model)
- (BT). Given a collection of items, Bradley–Terry estimates the
- _ability_ of each item based on pairwise comparisons between them. In
- sports, for example, that might be the ability of a given team based
- on games that team has played within a league. Once calculated,
- ability can be used to estimate the probability that one item will be
- better than another, even if those items have yet to be formally
- compared.
-
- The Alpaca project presents a good opportunity to apply BT in
- practice, especially since BT fits nicely into a Bayesian analysis
- framework. As LLMs become more pervasive, quantifying the uncertainty
- in their evaluation is increasingly important. Bayesian frameworks are
- good at that.
-
- This Space is divided into two primary sections: the first presents a
- ranking of models based on estimated ability. The figure on the right
- presents this ranking for the top 10 models, while the table below
- presents the full set. The second section estimates the probability
- that one model will be preferred to another. A final section at the
- bottom is a disclaimer that presents details about the workflow.
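
The Bradley–Terry win probability described in the deleted README can be sketched as follows. This is a minimal illustration, not part of the commit; the model names and ability values are made up:

```python
# Bradley-Terry: given positive ability scores a_i and a_j, the
# probability that item i is preferred to item j is a_i / (a_i + a_j).
def bt_win_probability(ability_i: float, ability_j: float) -> float:
    return ability_i / (ability_i + ability_j)

# Hypothetical abilities for two models (illustrative values only).
abilities = {"model-a": 2.0, "model-b": 1.0}
p = bt_win_probability(abilities["model-a"], abilities["model-b"])
print(round(p, 3))  # -> 0.667: model-a preferred with probability 2/3
```

With abilities estimated from the pairwise judgments Alpaca already collects, this formula gives a preference probability even for model pairs that were never directly compared.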
DISCLAIMER.md → _DISCLAIMER.md RENAMED
File without changes
OVERVIEW.md → _README.md RENAMED
@@ -1,9 +1,3 @@
- ---
- title: alpaca-bt-eval
- app_file: app.py
- sdk: gradio
- sdk_version: 4.19.1
- ---
  [Alpaca](https://github.com/tatsu-lab/alpaca_eval) is an LLM
  evaluation framework. It maintains a set of prompts, along with
  responses to those prompts from a collection of LLMs. It then presents