Terry Zhuo committed
Commit de4c2d6 · 1 Parent(s): 2970f67
Files changed (2):
  1. app.py (+1, -1)
  2. src/display/utils.py (+1, -1)
app.py CHANGED

```diff
@@ -396,7 +396,7 @@ with main_block as demo:
   - <u>Instruct</u> (🔥Vibe Check🔥): Code Generation based on the (less verbose) NL-oriented instructions. This split tests if the models are really capable enough to understand human intents to code.
   - `Complete` and `Instruct` represent the calibrated Pass@1 score on the BigCodeBench benchmark splits.
   - `Average` is the average of `Complete` and `Instruct` when both are available.
-  - `Elo Rating` represents the task-level Bootstrap of Maximum Likelihood Elo rating on the BigCodeBench-Complete split. The rating starts from 1000 and is bootstrapped 500 times.
+  - `Elo Rating` represents the task-level Bootstrap of Maximum Likelihood Elo rating on the Complete + Instruct splits. The rating starts from 1000 and is bootstrapped 500 times. We only consider the models having both `Complete` and `Instruct` scores.
   - `#Act Params (B)` is the number of activated model parameters during inference.
   - Model providers have the responsibility to avoid data contamination. Models trained on close data can be affected by contamination.
   - For more details check the 📝 About section.
```
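The reworded bullet describes the rating pipeline only at a high level, so here is a minimal sketch of what a task-level "Bootstrap of Maximum Likelihood Elo" typically looks like (Bradley-Terry ratings fit by logistic regression, resampled 500 times from a base rating of 1000). This is not the leaderboard's actual code: the `battles` frame, its `model_a`/`model_b`/`winner_a` columns, and both helper functions are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_mle_elo(battles: pd.DataFrame, scale: float = 400.0, base: float = 1000.0) -> pd.Series:
    """Maximum-likelihood (Bradley-Terry) Elo fit over pairwise task outcomes."""
    models = pd.unique(battles[["model_a", "model_b"]].values.ravel())
    idx = {m: i for i, m in enumerate(models)}
    # Design matrix: +1 for the first model of a pair, -1 for the second.
    X = np.zeros((len(battles), len(models)))
    for row, (a, b) in enumerate(zip(battles["model_a"], battles["model_b"])):
        X[row, idx[a]], X[row, idx[b]] = 1.0, -1.0
    y = battles["winner_a"].to_numpy()  # 1 if model_a solved the task and model_b did not
    lr = LogisticRegression(fit_intercept=False)
    lr.fit(X, y)
    # Map log-odds coefficients onto the Elo scale, centred at the base rating.
    elo = scale / np.log(10) * lr.coef_[0]
    return pd.Series(elo - elo.mean() + base, index=models)

def bootstrap_elo(battles: pd.DataFrame, rounds: int = 500, seed: int = 0) -> pd.Series:
    """Resample task outcomes with replacement `rounds` times; report the median rating."""
    rng = np.random.default_rng(seed)
    fits = [
        fit_mle_elo(battles.sample(frac=1.0, replace=True,
                                   random_state=int(rng.integers(1 << 31))))
        for _ in range(rounds)
    ]
    return pd.concat(fits, axis=1).median(axis=1).sort_values(ascending=False)
```

Under this reading, the new "both `Complete` and `Instruct`" restriction amounts to filtering `battles` down to qualifying models before fitting.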
src/display/utils.py CHANGED

```diff
@@ -105,7 +105,7 @@ auto_eval_column_dict.append(["size", ColumnContent, ColumnContent(column_map["s
 auto_eval_column_dict.append(["lazy", ColumnContent, ColumnContent(column_map["lazy"], "bool", False, True)])
 auto_eval_column_dict.append(["moe", ColumnContent, ColumnContent(column_map["moe"], "str", False, True)])
 auto_eval_column_dict.append(["openness", ColumnContent, ColumnContent(column_map["openness"], "str", False, True)])
-auto_eval_column_dict.append(["direct_complete", ColumnContent, ColumnContent(column_map["direct_complete"], "bool", False)])
+# auto_eval_column_dict.append(["direct_complete", ColumnContent, ColumnContent(column_map["direct_complete"], "bool", False)])
 
 # We use make dataclass to dynamically fill the scores from Tasks
 AutoEvalColumn = make_dataclass("AutoEvalColumn", auto_eval_column_dict, frozen=True)
```
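For context on the pattern being edited: each `["attr", ColumnContent, default]` triple appended to `auto_eval_column_dict` becomes one field of the dynamically built frozen dataclass, which is why dropping the `direct_complete` column needs nothing more than commenting out its `append`. A runnable sketch with simplified stand-ins (this `ColumnContent` and `column_map` are illustrative, not the leaderboard's real definitions):

```python
from dataclasses import dataclass, make_dataclass

@dataclass(frozen=True)
class ColumnContent:
    name: str               # header shown in the leaderboard table
    type: str               # column type, e.g. "str" or "bool"
    displayed_by_default: bool
    hidden: bool = False

column_map = {"moe": "MoE", "openness": "Openness"}

auto_eval_column_dict = [
    ["moe", ColumnContent, ColumnContent(column_map["moe"], "str", False, True)],
    ["openness", ColumnContent, ColumnContent(column_map["openness"], "str", False, True)],
]

# Each ["attr", type, default] triple becomes one field of a frozen dataclass,
# so removing (or commenting out) an append removes the column everywhere.
AutoEvalColumn = make_dataclass("AutoEvalColumn", auto_eval_column_dict, frozen=True)

print(AutoEvalColumn.moe.name)  # -> "MoE"
```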