pavlichenko committed
Commit a084e60
1 parent: 47c80cd

Update app.py

Files changed (1)
  1. app.py +5 -5
app.py CHANGED
@@ -11,9 +11,9 @@ We used human evaluation to rate model responses to real prompts."""
  description = """The Toloka LLM leaderboard provides a human evaluation framework. Here, we ask [Toloka](https://toloka.ai/) domain experts to assess the model's responses. For this purpose, responses are generated by open-source LLMs based on a dataset of real-world user prompts. These prompts are categorized as per the [InstructGPT paper](https://arxiv.org/abs/2203.02155). Subsequently, annotators evaluate these responses in the manner of [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/). It's worth noting that we employ [Guanaco 13B](https://huggingface.co/timdettmers/guanaco-13b) instead of text-davinci-003. This is because Guanaco 13B is the closest counterpart to the now-deprecated text-davinci-003 in AlpacaEval.
  The metrics on the leaderboard represent the win rate of the respective model in comparison to Guanaco 13B across various prompt categories. The "Total" category denotes the aggregation of all prompts and is not a mere average of metrics from individual categories.
 
- ### The evaluation method
+ ### 📊 The evaluation method
 
- #### Stage 1: Prompt collection
+ #### 🖊 Stage 1: Prompt collection
 
  We collected our own dataset of organic prompts for LLM evaluation.
 
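The docstring above defines the leaderboard metric as a per-category win rate against the reference model, with "Total" computed over the full prompt set rather than averaged across categories. A minimal sketch of that computation, assuming one aggregated verdict per prompt; the column names and category labels here are hypothetical and not taken from app.py:

```python
import pandas as pd

# Hypothetical aggregated verdicts: one row per prompt, "winner" records whether
# the evaluated model or the reference (Guanaco 13B) won the side-by-side comparison.
verdicts = pd.DataFrame({
    "category": ["brainstorming", "brainstorming", "open_qa", "open_qa", "rewrite"],
    "winner":   ["model", "reference", "model", "model", "reference"],
})

# Win rate per category: share of comparisons won by the evaluated model.
per_category = (
    verdicts.assign(win=verdicts["winner"].eq("model"))
    .groupby("category")["win"]
    .mean()
)

# "Total" is computed over all prompts at once, not as a mean of the
# per-category rates, so larger categories carry more weight.
total = verdicts["winner"].eq("model").mean()

print(per_category)
print(f"Total: {total:.2f}")
```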
@@ -40,7 +40,7 @@ After collecting the prompts, we manually classified them by category and got th
  We intentionally excluded prompts about coding. If you are interested in comparing coding abilities, you can refer to specific benchmarks such as [HumanEval](https://paperswithcode.com/sota/code-generation-on-humaneval).
 
 
- #### Stage 2: Human evaluation
+ #### 🧠 Stage 2: Human evaluation
 
  Human evaluation of prompts was conducted by [Toloka’s domain experts](https://toloka.ai/blog/ai-tutors/).
  Our experts were given a prompt and responses to this prompt from two different models: the reference model (Guanaco 13B) and the model under evaluation. In a side-by-side comparison, experts selected the best output according to the [harmlessness, truthfulness, and helpfulness principles](https://arxiv.org/pdf/2203.02155.pdf).
@@ -53,7 +53,7 @@ Most importantly, we ensured the accuracy of human judgments by using advanced q
  - Overlap of 3 with Dawid-Skene aggregation of the results (each prompt was evaluated by 3 experts and aggregated to achieve a single verdict).
  - Monitoring individual accuracy by comparing each expert’s results with the majority vote; those who fell below the accuracy threshold were removed from the evaluation project.
 
- #### Ready to compare LLMs?
+ #### 👉 Ready to compare LLMs?
 
  Find your AI application’s use case categories on our leaderboard and see how the models stack up. It never hurts to check more leaderboards ([Hugging Face Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?tab=evaluation), [LMSYS](https://leaderboard.lmsys.org/), or others) for the big picture before you pick a model and start experimenting.
 
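The quality-control bullets above (overlap of 3, Dawid-Skene aggregation, accuracy monitoring against the majority vote) map naturally onto Toloka's open-source crowd-kit library. The sketch below is illustrative only and assumes crowd-kit's task/worker/label DataFrame convention; it is not the pipeline used for the leaderboard, and the accuracy threshold is a made-up value:

```python
import pandas as pd
from crowdkit.aggregation import DawidSkene, MajorityVote

# Hypothetical raw judgements: each prompt (task) is judged by 3 experts
# (overlap of 3); the label records which side of the comparison won.
raw = pd.DataFrame({
    "task":   ["p1", "p1", "p1", "p2", "p2", "p2"],
    "worker": ["e1", "e2", "e3", "e1", "e2", "e3"],
    "label":  ["model", "model", "reference", "reference", "reference", "reference"],
})

# One verdict per prompt via Dawid-Skene aggregation.
verdicts = DawidSkene(n_iter=100).fit_predict(raw)

# Monitor each expert against the majority vote and keep only those above
# an accuracy threshold (the real threshold is not published).
majority = MajorityVote().fit_predict(raw).rename("mv")
accuracy = (
    raw.merge(majority, left_on="task", right_index=True)
    .assign(correct=lambda df: df["label"] == df["mv"])
    .groupby("worker")["correct"]
    .mean()
)
trusted_experts = accuracy[accuracy >= 0.8].index
```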
@@ -144,7 +144,7 @@ st.dataframe(
  st.markdown(description)
  st.link_button('🚀 Evaluate my model', url='https://toloka.ai/talk-to-us/')
  prompt_examples = """
- ### Prompt Examples
+ ### 🔍 Prompt Examples
 
  | Prompt | Model | Output |
  | ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
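The context lines in the hunk above come from the Streamlit layer of app.py (st.dataframe, st.markdown, st.link_button). For orientation, a stripped-down sketch of that kind of page with hypothetical leaderboard data; it is not the actual app:

```python
import pandas as pd
import streamlit as st

# Hypothetical per-category win rates against Guanaco 13B.
results = pd.DataFrame({
    "Model": ["model-a", "model-b"],
    "Total": [0.54, 0.47],
    "Brainstorming": [0.58, 0.45],
})

st.title("Toloka LLM Leaderboard")
st.dataframe(results)  # leaderboard table
st.markdown("Win rate vs. Guanaco 13B across prompt categories.")
st.link_button("🚀 Evaluate my model", url="https://toloka.ai/talk-to-us/")
```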
 