pavlichenko committed
Commit 8ebb2ea
1 Parent(s): 34bef94

Update app.py

Files changed (1): app.py +15 -0
app.py CHANGED
@@ -24,6 +24,21 @@ We find it’s tricky to use open-source datasets of prompts due to the following
 
  To mitigate these issues, we collected our own dataset of prompts, consisting of prompts that Toloka employees sent to ChatGPT and paraphrased real-world conversations with ChatGPT that we found on the internet. This way, we ensure that the prompts represent real-world use cases and have not leaked into LLM training sets. For the same reasons, we decided not to release the full evaluation set.
 
+ Distribution of prompts by categories:
+
+ * Brainstorming: 15.48%
+ * Chat: 1.59%
+ * Classification: 0.2%
+ * Closed QA: 3.77%
+ * Extraction: 0.6%
+ * Generation: 38.29%
+ * Open QA: 32.94%
+ * Rewrite: 5.16%
+ * Summarization: 1.98%
+
+ We report win rates only for categories that contain enough prompts to make the comparison fair.
+
+
 #### How Did We Set Up Human Evaluation
 
 Annotators on the Toloka crowdsourcing platform are given a prompt and two responses to it: one from the reference model and one from the model being evaluated. Annotators then choose the better response according to harmlessness, truthfulness, and helpfulness. In short, we follow the Alpaca Eval scheme, but with real human annotators instead of GPT-4.
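
The commit adds only the text above; the aggregation it describes is not part of the diff. As a minimal sketch of how per-category win rates could be computed from these pairwise judgments, assuming one aggregated human verdict per prompt and a hypothetical `min_prompts` cutoff (the actual threshold, data format, and function name are assumptions, not taken from app.py):

```python
from collections import defaultdict

def category_win_rates(annotations, min_prompts=25):
    """Compute the evaluated model's win rate per prompt category.

    annotations: iterable of dicts like
        {"category": "Open QA", "winner": "candidate" or "reference"},
    one aggregated human verdict per prompt (assumed format).
    Categories with fewer than `min_prompts` prompts are skipped,
    mirroring the fairness cutoff described above (value assumed).
    """
    wins, totals = defaultdict(int), defaultdict(int)
    for a in annotations:
        totals[a["category"]] += 1
        if a["winner"] == "candidate":
            wins[a["category"]] += 1
    return {cat: wins[cat] / n for cat, n in totals.items() if n >= min_prompts}

# Toy usage: 2 of 3 Open QA verdicts favour the candidate model.
demo = [
    {"category": "Open QA", "winner": "candidate"},
    {"category": "Open QA", "winner": "candidate"},
    {"category": "Open QA", "winner": "reference"},
]
print(category_win_rates(demo, min_prompts=1))  # -> {'Open QA': 0.666...}
```

With a cutoff like this, a rare category such as Classification (0.2% of prompts) would be excluded, which is why win rates are reported only for the larger categories.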