pavlichenko committed
Commit a084e60
1 parent: 47c80cd

Update app.py

Files changed (1)
  1. app.py +5 -5
app.py CHANGED
@@ -11,9 +11,9 @@ We used human evaluation to rate model responses to real prompts."""
  description = """The Toloka LLM leaderboard provides a human evaluation framework. Here, we ask [Toloka](https://toloka.ai/) domain experts to assess the model's responses. For this purpose, responses are generated by open-source LLMs based on a dataset of real-world user prompts. These prompts are categorized as per the [InstructGPT paper](https://arxiv.org/abs/2203.02155). Subsequently, annotators evaluate these responses in the manner of [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/). It's worth noting that we employ [Guanaco 13B](https://huggingface.co/timdettmers/guanaco-13b) instead of text-davinci-003. This is because Guanaco 13B is the closest counterpart to the now-deprecated text-davinci-003 in AlpacaEval.
  The metrics on the leaderboard represent the win rate of the respective model in comparison to Guanaco 13B across various prompt categories. The "Total" category denotes the aggregation of all prompts and is not a mere average of metrics from individual categories.
 
- ### The evaluation method
+ ### 📊 The evaluation method
 
- #### Stage 1: Prompt collection
+ #### 🖊 Stage 1: Prompt collection
 
  We collected our own dataset of organic prompts for LLM evaluation.
 
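The docstring above defines the leaderboard metric as a per-category win rate against the reference model, with "Total" computed over the full prompt set rather than averaged across categories. A minimal sketch of that computation, assuming one aggregated verdict per prompt; the column names and category labels here are hypothetical and not taken from app.py:

```python
import pandas as pd

# Hypothetical aggregated verdicts: one row per prompt, "winner" records whether
# the evaluated model or the reference (Guanaco 13B) won the side-by-side comparison.
verdicts = pd.DataFrame({
    "category": ["brainstorming", "brainstorming", "open_qa", "open_qa", "rewrite"],
    "winner":   ["model", "reference", "model", "model", "reference"],
})

# Win rate per category: share of comparisons won by the evaluated model.
per_category = (
    verdicts.assign(win=verdicts["winner"].eq("model"))
    .groupby("category")["win"]
    .mean()
)

# "Total" is computed over all prompts at once, not as a mean of the
# per-category rates, so larger categories carry more weight.
total = verdicts["winner"].eq("model").mean()

print(per_category)
print(f"Total: {total:.2f}")
```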
@@ -40,7 +40,7 @@ After collecting the prompts, we manually classified them by category and got th
  We intentionally excluded prompts about coding. If you are interested in comparing coding abilities, you can refer to specific benchmarks such as [HumanEval](https://paperswithcode.com/sota/code-generation-on-humaneval).
 
 
- #### Stage 2: Human evaluation
+ #### 🧠 Stage 2: Human evaluation
 
  Human evaluation of prompts was conducted by [Toloka’s domain experts](https://toloka.ai/blog/ai-tutors/).
  Our experts were given a prompt and responses to this prompt from two different models: the reference model (Guanaco 13B) and the model under evaluation. In a side-by-side comparison, experts selected the best output according to the [harmlessness, truthfulness, and helpfulness principles](https://arxiv.org/pdf/2203.02155.pdf).
@@ -53,7 +53,7 @@ Most importantly, we ensured the accuracy of human judgments by using advanced q
  - Overlap of 3 with Dawid-Skene aggregation of the results (each prompt was evaluated by 3 experts and aggregated to achieve a single verdict).
  - Monitoring individual accuracy by comparing each expert’s results with the majority vote; those who fell below the accuracy threshold were removed from the evaluation project.
 
- #### Ready to compare LLMs?
+ #### 👉 Ready to compare LLMs?
 
  Find your AI application’s use case categories on our leaderboard and see how the models stack up. It never hurts to check more leaderboards ([Hugging Face Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?tab=evaluation), [LMSYS](https://leaderboard.lmsys.org/), or others) for the big picture before you pick a model and start experimenting.
 
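The quality-control bullets above (overlap of 3, Dawid-Skene aggregation, accuracy monitoring against the majority vote) map naturally onto Toloka's open-source crowd-kit library. The sketch below is illustrative only and assumes crowd-kit's task/worker/label DataFrame convention; it is not the pipeline used for the leaderboard, and the accuracy threshold is a made-up value:

```python
import pandas as pd
from crowdkit.aggregation import DawidSkene, MajorityVote

# Hypothetical raw judgements: each prompt (task) is judged by 3 experts
# (overlap of 3); the label records which side of the comparison won.
raw = pd.DataFrame({
    "task":   ["p1", "p1", "p1", "p2", "p2", "p2"],
    "worker": ["e1", "e2", "e3", "e1", "e2", "e3"],
    "label":  ["model", "model", "reference", "reference", "reference", "reference"],
})

# One verdict per prompt via Dawid-Skene aggregation.
verdicts = DawidSkene(n_iter=100).fit_predict(raw)

# Monitor each expert against the majority vote and keep only those above
# an accuracy threshold (the real threshold is not published).
majority = MajorityVote().fit_predict(raw).rename("mv")
accuracy = (
    raw.merge(majority, left_on="task", right_index=True)
    .assign(correct=lambda df: df["label"] == df["mv"])
    .groupby("worker")["correct"]
    .mean()
)
trusted_experts = accuracy[accuracy >= 0.8].index
```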
@@ -144,7 +144,7 @@ st.dataframe(
  st.markdown(description)
  st.link_button('🚀 Evaluate my model', url='https://toloka.ai/talk-to-us/')
  prompt_examples = """
- ### Prompt Examples
+ ### 🔍 Prompt Examples
 
  | Prompt | Model | Output |
  | ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
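The context lines in the hunk above come from the Streamlit layer of app.py (st.dataframe, st.markdown, st.link_button). For orientation, a stripped-down sketch of that kind of page with hypothetical leaderboard data; it is not the actual app:

```python
import pandas as pd
import streamlit as st

# Hypothetical per-category win rates against Guanaco 13B.
results = pd.DataFrame({
    "Model": ["model-a", "model-b"],
    "Total": [0.54, 0.47],
    "Brainstorming": [0.58, 0.45],
})

st.title("Toloka LLM Leaderboard")
st.dataframe(results)  # leaderboard table
st.markdown("Win rate vs. Guanaco 13B across prompt categories.")
st.link_button("🚀 Evaluate my model", url="https://toloka.ai/talk-to-us/")
```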
 