Spaces: toloka/open-llm-leaderboard
pavlichenko committed on Oct 20, 2023

Commit 3677ce0 • 1 Parent(s): 8ebb2ea

Update app.py

Files changed (1)

app.py CHANGED

@@ -39,7 +39,7 @@ Distribution of prompts by categories:
 We report win rates only on categories where the number of prompts is large enough to make a comparison fair.
-#### How Did We Set Up Human Evaluation
 Annotators on Toloka crowdsourcing platform are given a prompt and responses to this prompt from two different models: the reference model and a model that we evaluate. Annotators then choose the best response according to harmlessness, truthfulness, and helpfulness. In simple words, we follow the Alpaca Eval scheme but instead of GPT-4, we use real humans as annotators.
 """
@@ -102,7 +102,7 @@ row = [reference_model_name] + [50.0] * len(pretty_categories)
 table = pd.concat([table, pd.DataFrame([pd.Series(row, index=table.columns)])], ignore_index=True)
 table = table.sort_values(by=['Total'], ascending=False)
-table.index = range(1, len(table) + 1)
 for category in pretty_category_names.values():
     table[category] = table[category].map('{:,.2f}%'.format)

 We report win rates only on categories where the number of prompts is large enough to make a comparison fair.
+#### How Did We Set Up Human Evaluation?
 Annotators on Toloka crowdsourcing platform are given a prompt and responses to this prompt from two different models: the reference model and a model that we evaluate. Annotators then choose the best response according to harmlessness, truthfulness, and helpfulness. In simple words, we follow the Alpaca Eval scheme but instead of GPT-4, we use real humans as annotators.
 """
 table = pd.concat([table, pd.DataFrame([pd.Series(row, index=table.columns)])], ignore_index=True)
 table = table.sort_values(by=['Total'], ascending=False)
+table.index = ["🥇 1", "🥈 2", "🥉 3"] + list(range(4, len(table) + 1))
 for category in pretty_category_names.values():
     table[category] = table[category].map('{:,.2f}%'.format)
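
Review note: the new `table.index` assignment always produces at least three labels, so pandas would raise a length-mismatch error if `table` ever had fewer than three rows. The sketch below is one possible way to build the same labels for any row count. The helper name `rank_labels` and the `MEDALS` constant are hypothetical and are not part of `app.py`.

```python
# Hypothetical helper, not part of app.py: builds the rank labels used for
# table.index without assuming the table has at least three rows.
MEDALS = ["🥇", "🥈", "🥉"]


def rank_labels(n_rows):
    """Return labels for ranks 1..n_rows, with a medal prefix on the top three."""
    labels = []
    for rank in range(1, n_rows + 1):
        if rank <= len(MEDALS):
            # Top three ranks become strings such as "🥇 1".
            labels.append(f"{MEDALS[rank - 1]} {rank}")
        else:
            # Remaining ranks stay plain integers, as in the committed line.
            labels.append(rank)
    return labels


# Assumed usage inside app.py, after sorting by 'Total':
# table.index = rank_labels(len(table))
```

For tables with three or more rows this yields the same index as the committed line. For shorter tables it simply returns fewer labels.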