pavlichenko committed
Commit 8ebb2ea
1 Parent(s): 34bef94

Update app.py

Files changed (1): app.py +15 -0
app.py CHANGED
@@ -24,6 +24,21 @@ We find it’s tricky to use open-source datasets of prompts due to the following
 
  To mitigate these issues, we collected our own dataset of prompts, consisting of prompts that Toloka employees sent to ChatGPT and paraphrased real-world conversations with ChatGPT that we found on the internet. This way, we ensure that the prompts represent real-world use cases and have not leaked into LLM training sets. For the same reasons, we decided not to release the full evaluation set.
 
+ Distribution of prompts by categories:
+
+ * Brainstorming: 15.48%
+ * Chat: 1.59%
+ * Classification: 0.2%
+ * Closed QA: 3.77%
+ * Extraction: 0.6%
+ * Generation: 38.29%
+ * Open QA: 32.94%
+ * Rewrite: 5.16%
+ * Summarization: 1.98%
+
+ We report win rates only for categories that contain enough prompts to make the comparison fair.
+
+
 #### How Did We Set Up Human Evaluation
 
 Annotators on the Toloka crowdsourcing platform are given a prompt and two responses to it: one from the reference model and one from the model being evaluated. Annotators then choose the better response according to harmlessness, truthfulness, and helpfulness. In short, we follow the Alpaca Eval scheme, but with real human annotators instead of GPT-4.
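
The commit adds only the text above; the aggregation it describes is not part of the diff. As a minimal sketch of how per-category win rates could be computed from these pairwise judgments, assuming one aggregated human verdict per prompt and a hypothetical `min_prompts` cutoff (the actual threshold, data format, and function name are assumptions, not taken from app.py):

```python
from collections import defaultdict

def category_win_rates(annotations, min_prompts=25):
    """Compute the evaluated model's win rate per prompt category.

    annotations: iterable of dicts like
        {"category": "Open QA", "winner": "candidate" or "reference"},
    one aggregated human verdict per prompt (assumed format).
    Categories with fewer than `min_prompts` prompts are skipped,
    mirroring the fairness cutoff described above (value assumed).
    """
    wins, totals = defaultdict(int), defaultdict(int)
    for a in annotations:
        totals[a["category"]] += 1
        if a["winner"] == "candidate":
            wins[a["category"]] += 1
    return {cat: wins[cat] / n for cat, n in totals.items() if n >= min_prompts}

# Toy usage: 2 of 3 Open QA verdicts favour the candidate model.
demo = [
    {"category": "Open QA", "winner": "candidate"},
    {"category": "Open QA", "winner": "candidate"},
    {"category": "Open QA", "winner": "reference"},
]
print(category_win_rates(demo, min_prompts=1))  # -> {'Open QA': 0.666...}
```

With a cutoff like this, a rare category such as Classification (0.2% of prompts) would be excluded, which is why win rates are reported only for the larger categories.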