pavlichenko committed
Commit 0d63e38
1 Parent(s): 26c8773

Update app.py

Files changed (1)
  1. app.py +3 -3
app.py CHANGED
@@ -7,14 +7,14 @@ header = """Toloka compares and ranks LLM output in multiple categories, using G
 
  We use human evaluation to rate model responses to real prompts."""
 
- description = """The Toloka LLM leaderboard provides a human evaluation framework. Here, we invite annotators from the [Toloka](https://toloka.ai/) crowdsourcing platform to assess the model's responses. For this purpose, responses are generated by open-source LLMs based on a dataset of real-world user prompts. These prompts are categorized as per the [InstructGPT paper](https://arxiv.org/abs/2203.02155). Subsequently, annotators evaluate these responses in the manner of [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/). It's worth noting that we employ [Guanaco 13B](https://huggingface.co/timdettmers/guanaco-13b) instead of text-davinci-003. This is because Guanaco 13B is the closest counterpart to the now-deprecated text-davinci-003 in AlpacaEval.
- The metrics on the leaderboard represent the win rate of the respective model in comparison to Guanaco 13B across various prompt categories. The "all" category denotes the aggregation of all prompts and is not a mere average of metrics from individual categories.
+ description = """The Toloka LLM leaderboard provides a human evaluation framework. Here, we ask [Toloka](https://toloka.ai/) domain experts to assess the model's responses. For this purpose, responses are generated by open-source LLMs based on a dataset of real-world user prompts. These prompts are categorized as per the [InstructGPT paper](https://arxiv.org/abs/2203.02155). Subsequently, annotators evaluate these responses in the manner of [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/). It's worth noting that we employ [Guanaco 13B](https://huggingface.co/timdettmers/guanaco-13b) instead of text-davinci-003. This is because Guanaco 13B is the closest counterpart to the now-deprecated text-davinci-003 in AlpacaEval.
+ The metrics on the leaderboard represent the win rate of the respective model in comparison to Guanaco 13B across various prompt categories. The "Total" category denotes the aggregation of all prompts and is not a mere average of metrics from individual categories.
 
  ### The evaluation method
 
  #### Stage 1: Prompt collection
 
- We collected our own dataset of organicreal-world prompts for LLM evaluation.
+ We collected our own dataset of organic prompts for LLM evaluation.
 
  The alternative is to use open-source prompts, but they are not reliable enough for high-quality evaluation. Using open-source datasets can be restrictive for several reasons:
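
The updated description states that leaderboard metrics are win rates against Guanaco 13B per prompt category, and that "Total" aggregates all prompts rather than averaging the per-category scores. A minimal sketch of that kind of computation, assuming pairwise verdicts of "model" / "baseline" / "tie" with ties counted as half a win (the field names and tie rule are illustrative assumptions, not taken from app.py):

```python
from collections import defaultdict

def win_rates(judgments):
    """Compute win rates vs. the baseline per category, plus a pooled "Total".

    judgments: iterable of dicts like
        {"category": "Generation", "verdict": "model"}  # or "baseline" / "tie"
    Ties are counted as half a win (an assumption for this sketch).
    """
    wins = defaultdict(float)
    totals = defaultdict(int)
    for j in judgments:
        cat = j["category"]
        totals[cat] += 1
        if j["verdict"] == "model":
            wins[cat] += 1.0
        elif j["verdict"] == "tie":
            wins[cat] += 0.5
    rates = {cat: wins[cat] / totals[cat] for cat in totals}
    # "Total" pools every judgment before dividing, so it is not
    # a simple average of the per-category rates.
    rates["Total"] = sum(wins.values()) / sum(totals.values())
    return rates

print(win_rates([
    {"category": "Generation", "verdict": "model"},
    {"category": "Generation", "verdict": "baseline"},
    {"category": "Brainstorming", "verdict": "model"},
    {"category": "Brainstorming", "verdict": "tie"},
]))
# {'Generation': 0.5, 'Brainstorming': 0.75, 'Total': 0.625}
```

Note how "Total" (0.625) differs from the mean of the two category rates; with unequal numbers of prompts per category, pooling and averaging diverge further.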