Commit 3677ce0, committed by pavlichenko
1 Parent(s): 8ebb2ea
Update app.py
app.py CHANGED
@@ -39,7 +39,7 @@ Distribution of prompts by categories:
 We report win rates only on categories where the number of prompts is large enough to make a comparison fair.
 
 
-#### How Did We Set Up Human Evaluation
+#### How Did We Set Up Human Evaluation?
 
 Annotators on Toloka crowdsourcing platform are given a prompt and responses to this prompt from two different models: the reference model and a model that we evaluate. Annotators then choose the best response according to harmlessness, truthfulness, and helpfulness. In simple words, we follow the Alpaca Eval scheme but instead of GPT-4, we use real humans as annotators.
 """
@@ -102,7 +102,7 @@ row = [reference_model_name] + [50.0] * len(pretty_categories)
 table = pd.concat([table, pd.DataFrame([pd.Series(row, index=table.columns)])], ignore_index=True)
 table = table.sort_values(by=['Total'], ascending=False)
 
-table.index = range(
+table.index = ["🥇 1", "🥈 2", "🥉 3"] + list(range(4, len(table) + 1))
 
 for category in pretty_category_names.values():
     table[category] = table[category].map('{:,.2f}%'.format)
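Review note: the new `table.index` line hard-codes three labels for the top rows, so it appears to assume the leaderboard always has at least three rows; with fewer, the list would be longer than the table and the assignment should fail with a length mismatch. The sketch below is one possible way to build the same labels for any row count. It assumes the three labels are the gold, silver and bronze medal emoji followed by the rank, and the helper name `medal_index` is hypothetical, not part of `app.py`.

```python
# Hypothetical helper, not taken from app.py: builds rank labels for a
# leaderboard that is already sorted best-first.
# Assumption: the top three labels are medal emoji followed by the rank.
MEDALS = ["🥇", "🥈", "🥉"]


def medal_index(n_rows):
    """Return index labels for n_rows rows: medals for ranks 1-3, plain ints after."""
    labels = []
    for rank in range(1, n_rows + 1):
        if rank <= len(MEDALS):
            # Top ranks get a string label such as "🥇 1".
            labels.append(f"{MEDALS[rank - 1]} {rank}")
        else:
            # Remaining ranks stay plain integers, as in the committed line.
            labels.append(rank)
    return labels
```

With this helper, the changed line could be written as `table.index = medal_index(len(table))`, which yields the same labels as the committed version whenever the table has three or more rows.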