Jae-Won Chung
committed on
Commit
·
55aeee4
1
Parent(s):
8595b18
More explanations, default plot, compute average
- LEADERBOARD.md +24 -11
- app.py +30 -7
- data/2023-06-17/score.csv +21 -21
LEADERBOARD.md
CHANGED
@@ -6,14 +6,16 @@ The energy consumption of running inference on a model will depends on factors s
|
|
6 |
However, even if we run models with the exact same architecture and size on the same GPU, the average energy consumption **per prompt** is different because different models have **different verbosity**.
|
7 |
That is, when asked the same thing, different models answer in different lengths.
|
8 |
|
9 |
-
##
|
10 |
|
11 |
-
- `gpu`: NVIDIA GPU model name
|
12 |
- `task`: Name of the task. See *Tasks* below for details.
|
13 |
- `throughput` (token/s): The average number of tokens generated per second.
|
14 |
-
- `response_length` (token): The average number of tokens in the model's response.
|
15 |
- `latency` (s): The average time it took for the model to generate a response.
|
16 |
-
- `
|
17 |
- `parameters`: The number of parameters the model has, in units of billion.
|
18 |
|
19 |
## Tasks
|
@@ -27,6 +29,9 @@ For each task, every model uses the same system prompt. We still account for dif
|
|
27 |
| instruct | Below is an instruction that describes a task. Write a response that appropriately completes the request. |
|
28 |
| instruct-concise | Below is an instruction that describes a task. Write a response that appropriately completes the request. The response should be very concise. |
|
29 |
|
30 |
## Setup
|
31 |
|
32 |
Find our benchmark script for one model [here](https://github.com/ml-energy/leaderboard/blob/master/benchmark.py).
|
@@ -34,12 +39,14 @@ Find our benchmark script for one model [here](https://github.com/ml-energy/lead
|
|
34 |
### Software
|
35 |
|
36 |
- PyTorch 2.0.1
|
37 |
-
- [
|
38 |
-
- [
|
|
|
39 |
|
40 |
### Hardware
|
41 |
|
42 |
- NVIDIA A40 GPU
|
|
|
43 |
|
44 |
### Parameters
|
45 |
|
@@ -61,18 +68,24 @@ We used identical system prompts for all models (while respecting their own *rol
|
|
61 |
A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
|
62 |
```
|
63 |
|
64 |
## Upcoming
|
65 |
|
66 |
-
-
|
67 |
-
- More GPU
|
68 |
-
- More models
|
69 |
|
70 |
# License
|
71 |
|
72 |
This leaderboard is a research preview intended for non-commercial use only.
|
73 |
The use of LLaMA weights are subject to their [license](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md).
|
74 |
-
Please direct inquiries
|
75 |
|
76 |
# Acknowledgements
|
77 |
|
78 |
-
We thank [Chameleon Cloud](https://www.chameleoncloud.org/) for the A100 GPU nodes (`gpu_a100_pcie`) and [CloudLab](https://cloudlab.us/) for the V100 GPU nodes (`r7525`).
|
|
|
6 |
However, even if we run models with the exact same architecture and size on the same GPU, the average energy consumption **per prompt** is different because different models have **different verbosity**.
|
7 |
That is, when asked the same thing, different models answer in different lengths.
|
8 |
|
9 |
+
## Columns
|
10 |
|
11 |
+
- `gpu`: NVIDIA GPU model name. Note that NLP evaluation was only run once on our A40 GPUs, so this column only changes system-level measurements like latency and energy.
|
12 |
- `task`: Name of the task. See *Tasks* below for details.
|
13 |
+
- `energy_efficiency`: The average NLP evaluation metric attained per Joule of energy.
|
14 |
+
- `energy` (J): The average energy consumed by the model to generate a response.
|
15 |
+
- `nlp_average`: The arithmetic average of the NLP evaluation metrics we obtained. See *NLP evaluation metrics* below for details.
|
16 |
- `throughput` (token/s): The average number of tokens generated per second.
|
|
|
17 |
- `latency` (s): The average time it took for the model to generate a response.
|
18 |
+
- `response_length` (token): The average number of tokens in the model's response.
|
19 |
- `parameters`: The number of parameters the model has, in units of billion.
|
20 |
|
21 |
## Tasks
|
|
|
29 |
| instruct | Below is an instruction that describes a task. Write a response that appropriately completes the request. |
|
30 |
| instruct-concise | Below is an instruction that describes a task. Write a response that appropriately completes the request. The response should be very concise. |
|
31 |
|
32 |
+
You can see that response length is shorter on average for the `-concise` variants of the tasks.
|
33 |
+
This affects the number of decoding iterations the model has to run in order to finish responding, thus affecting latency and energy consumption per prompt.
|
34 |
+
|
35 |
## Setup
|
36 |
|
37 |
Find our benchmark script for one model [here](https://github.com/ml-energy/leaderboard/blob/master/benchmark.py).
|
|
|
39 |
### Software
|
40 |
|
41 |
- PyTorch 2.0.1
|
42 |
+
- [Zeus](https://ml.energy/zeus) -- For GPU time and energy measurement
|
43 |
+
- [FastChat](https://github.com/lm-sys/fastchat) -- For running inference on various models
|
44 |
+
- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/commit/72b7f0c00a6ff94632c5b873fc24e093ae74fa47) -- For NLP evaluation metrics
|
45 |
|
46 |
### Hardware
|
47 |
|
48 |
- NVIDIA A40 GPU
|
49 |
+
- NVIDIA A100 GPU
|
50 |
|
51 |
### Parameters
|
52 |
|
|
|
68 |
A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
|
69 |
```
|
70 |
|
71 |
+
## NLP evaluation metrics
|
72 |
+
|
73 |
+
- `arc`: [AI2 Reasoning Challenge](https://allenai.org/data/arc)'s `challenge` dataset, measures capability to do grade-school level question answering, 25 shot
|
74 |
+
- `hellaswag`: [HellaSwag dataset](https://allenai.org/data/hellaswag), measuring grounded commonsense, 10 shot
|
75 |
+
- `truthfulqa`: [TruthfulQA dataset](https://arxiv.org/abs/2109.07958), measuring truthfulness against questions that elicit common falsehoods, 0 shot
|
76 |
+
|
77 |
## Upcoming
|
78 |
|
79 |
+
- More optimized inference runtimes, like TensorRT.
|
80 |
+
- More GPU models, like V100.
|
81 |
+
- More models, like RWKV.
|
82 |
|
83 |
# License
|
84 |
|
85 |
This leaderboard is a research preview intended for non-commercial use only.
|
86 |
The use of LLaMA weights are subject to their [license](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md).
|
87 |
+
Please direct inquiries/reports of potential violation to Jae-Won Chung.
|
88 |
|
89 |
# Acknowledgements
|
90 |
|
91 |
+
We thank [Chameleon Cloud](https://www.chameleoncloud.org/) for the A100 80GB GPU nodes (`gpu_a100_pcie`) and [CloudLab](https://cloudlab.us/) for the V100 GPU nodes (`r7525`).
|
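The note added to LEADERBOARD.md about `-concise` tasks (shorter responses mean fewer decoding iterations, hence lower latency and energy per prompt) can be illustrated with a rough sketch. It assumes a constant energy cost per generated token, which is a simplification and not something the leaderboard states; the function name and all numbers below are hypothetical.

```python
def energy_per_prompt(response_length_tokens: float, joules_per_token: float) -> float:
    """Estimate energy (J) for one response, assuming a constant per-token cost."""
    return response_length_tokens * joules_per_token


# Hypothetical values, not leaderboard measurements.
JOULES_PER_TOKEN = 2.0

verbose = energy_per_prompt(300, JOULES_PER_TOKEN)  # longer answer, more decoding steps
concise = energy_per_prompt(120, JOULES_PER_TOKEN)  # shorter answer, fewer decoding steps

print(f"verbose: {verbose} J, concise: {concise} J")
```

Under this assumption the per-prompt energy scales linearly with `response_length`, which is why two models of the same size on the same GPU can still differ in energy per prompt.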
app.py
CHANGED
@@ -19,9 +19,9 @@ class TableManager:
|
|
19 |
"""Load leaderboard data from CSV files in data_dir."""
|
20 |
# Load and merge CSV files.
|
21 |
df = self._read_tables(data_dir)
|
22 |
-
models = json.load(open(f"{data_dir}/models.json"))
|
23 |
|
24 |
# Add the #params column.
|
|
|
25 |
df["parameters"] = df["model"].apply(lambda x: models[x]["params"])
|
26 |
|
27 |
# Make the first column (model) an HTML anchor to the model's website.
|
@@ -34,8 +34,8 @@ class TableManager:
|
|
34 |
)
|
35 |
df["model"] = df["model"].apply(format_model_link)
|
36 |
|
37 |
-
# Sort by energy.
|
38 |
-
df = df.sort_values(by="
|
39 |
|
40 |
# The full table where all the data are.
|
41 |
self.full_df = df
|
@@ -48,6 +48,11 @@ class TableManager:
|
|
48 |
"""Read tables."""
|
49 |
df_score = pd.read_csv(f"{data_dir}/score.csv")
|
50 |
|
51 |
with open(f"{data_dir}/schema.yaml") as file:
|
52 |
self.schema: dict[str, list] = yaml.safe_load(file)
|
53 |
|
@@ -66,7 +71,24 @@ class TableManager:
|
|
66 |
if res_df.empty:
|
67 |
raise ValueError(f"No benchmark CSV files were read from {data_dir=}.")
|
68 |
|
69 |
-
|
70 |
|
71 |
def _format_msg(self, text: str) -> str:
|
72 |
"""Formats into HTML that prints in Monospace font."""
|
@@ -111,8 +133,8 @@ class TableManager:
|
|
111 |
def get_dropdown(self):
|
112 |
columns = self.full_df.columns.tolist()[1:] # include gpu and task in the dropdown
|
113 |
return [
|
114 |
-
gr.Dropdown(choices=columns, label="X"),
|
115 |
-
gr.Dropdown(choices=columns, label="Y"),
|
116 |
gr.Dropdown(choices=columns, label="Z (optional)"),
|
117 |
]
|
118 |
|
@@ -320,7 +342,8 @@ with block:
|
|
320 |
plot_width_input = gr.Textbox("600", lines=1, label="Width (px)")
|
321 |
plot_height_input = gr.Textbox("600", lines=1, label="Height (px)")
|
322 |
with gr.Row():
|
323 |
-
plot
|
|
|
324 |
with gr.Row():
|
325 |
plot_message = gr.HTML("")
|
326 |
add_col_btn.click(TableManager.update_dropdown, inputs=tbm, outputs=axis_dropdowns) # type: ignore
|
|
|
19 |
"""Load leaderboard data from CSV files in data_dir."""
|
20 |
# Load and merge CSV files.
|
21 |
df = self._read_tables(data_dir)
|
|
|
22 |
|
23 |
# Add the #params column.
|
24 |
+
models = json.load(open(f"{data_dir}/models.json"))
|
25 |
df["parameters"] = df["model"].apply(lambda x: models[x]["params"])
|
26 |
|
27 |
# Make the first column (model) an HTML anchor to the model's website.
|
|
|
34 |
)
|
35 |
df["model"] = df["model"].apply(format_model_link)
|
36 |
|
37 |
+
# Sort by our 'energy efficiency' score.
|
38 |
+
df = df.sort_values(by="energy_efficiency", ascending=True)
|
39 |
|
40 |
# The full table where all the data are.
|
41 |
self.full_df = df
|
|
|
48 |
"""Read tables."""
|
49 |
df_score = pd.read_csv(f"{data_dir}/score.csv")
|
50 |
|
51 |
+
# Compute average NLP metrics
|
52 |
+
columns = df_score.columns.to_list()
|
53 |
+
columns.remove("model")
|
54 |
+
df_score["nlp_average"] = df_score[columns].mean(axis=1)
|
55 |
+
|
56 |
with open(f"{data_dir}/schema.yaml") as file:
|
57 |
self.schema: dict[str, list] = yaml.safe_load(file)
|
58 |
|
|
|
71 |
if res_df.empty:
|
72 |
raise ValueError(f"No benchmark CSV files were read from {data_dir=}.")
|
73 |
|
74 |
+
df = pd.merge(res_df, df_score, on=["model"])
|
75 |
+
|
76 |
+
# Energy efficiency is defined as the amount of average NLP performance
|
77 |
+
# the model gets per Joule of energy.
|
78 |
+
df["energy_efficiency"] = df["nlp_average"] / df["energy"]
|
79 |
+
|
80 |
+
# Order columns.
|
81 |
+
columns = df.columns.to_list()
|
82 |
+
cols_to_order = ["model"]
|
83 |
+
cols_to_order.extend(self.schema.keys())
|
84 |
+
cols_to_order.extend(["energy_efficiency", "energy", "nlp_average"])
|
85 |
+
columns = cols_to_order + [col for col in columns if col not in cols_to_order]
|
86 |
+
df = df[columns]
|
87 |
+
|
88 |
+
# Delete rows with *any* NaN values.
|
89 |
+
df = df.dropna()
|
90 |
+
|
91 |
+
return df.round(2)
|
92 |
|
93 |
def _format_msg(self, text: str) -> str:
|
94 |
"""Formats into HTML that prints in Monospace font."""
|
|
|
133 |
def get_dropdown(self):
|
134 |
columns = self.full_df.columns.tolist()[1:] # include gpu and task in the dropdown
|
135 |
return [
|
136 |
+
gr.Dropdown("nlp_average", choices=columns, label="X"),
|
137 |
+
gr.Dropdown("energy_efficiency", choices=columns, label="Y"),
|
138 |
gr.Dropdown(choices=columns, label="Z (optional)"),
|
139 |
]
|
140 |
|
|
|
342 |
plot_width_input = gr.Textbox("600", lines=1, label="Width (px)")
|
343 |
plot_height_input = gr.Textbox("600", lines=1, label="Height (px)")
|
344 |
with gr.Row():
|
345 |
+
# By default show a plot of average model quality vs energy consumption.
|
346 |
+
plot = gr.Plot(global_tbm.plot_scatter("600", "600", "gpu", "nlp_average", "energy")[0])
|
347 |
with gr.Row():
|
348 |
plot_message = gr.HTML("")
|
349 |
add_col_btn.click(TableManager.update_dropdown, inputs=tbm, outputs=axis_dropdowns) # type: ignore
|
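The new tail of `_read_tables` plus the sort in the constructor can be summarized with a small standalone sketch. It uses plain dictionaries instead of pandas so it runs without dependencies; the helper name `add_energy_efficiency`, the model names, and the energy values are all hypothetical, and only the order of steps (divide, drop NaN rows, round to two decimals, sort ascending) follows the diff.

```python
import math

# Hypothetical merged rows; values are made up for illustration.
rows = [
    {"model": "model-a", "nlp_average": 60.0, "energy": 400.0},
    {"model": "model-b", "nlp_average": 50.0, "energy": 250.0},
    {"model": "model-c", "nlp_average": float("nan"), "energy": 300.0},
]


def add_energy_efficiency(rows):
    """Add energy_efficiency, drop rows containing NaN, round, and sort ascending."""
    result = []
    for row in rows:
        row = dict(row)  # do not mutate the caller's data
        # NLP performance attained per Joule of energy.
        row["energy_efficiency"] = row["nlp_average"] / row["energy"]
        # Mirrors df.dropna(): skip the row if any value is NaN.
        if any(isinstance(v, float) and math.isnan(v) for v in row.values()):
            continue
        # Mirrors df.round(2) on numeric columns.
        result.append(
            {k: round(v, 2) if isinstance(v, float) else v for k, v in row.items()}
        )
    # Mirrors sort_values(by="energy_efficiency", ascending=True).
    return sorted(result, key=lambda r: r["energy_efficiency"])


table = add_energy_efficiency(rows)
for r in table:
    print(r["model"], r["energy_efficiency"])
```

Note that an ascending sort places the row with the lowest `energy_efficiency` first.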
data/2023-06-17/score.csv
CHANGED
@@ -1,21 +1,21 @@
|
|
1 |
-
model,
|
2 |
-
lmsys/vicuna-7B,
|
3 |
-
lmsys/vicuna-13B,
|
4 |
-
tatsu-lab/alpaca-7B,
|
5 |
-
metaai/llama-7B,
|
6 |
-
metaai/llama-13B,
|
7 |
-
camel-ai/CAMEL-13B-Combined-Data,
|
8 |
-
BlinkDL/RWKV-4-Raven-7B-v12-Eng98%-Other2%-20230521-ctx8192.pth,NaN,NaN,NaN
|
9 |
-
databricks/dolly-v2-12b,
|
10 |
-
FreedomIntelligence/phoenix-inst-chat-7b,
|
11 |
-
h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-7b-preview-300bt-v2,
|
12 |
-
lmsys/fastchat-t5-3b-v1.0,
|
13 |
-
Neutralzz/BiLLa-7B-SFT,
|
14 |
-
nomic-ai/gpt4all-13b-snoozy,
|
15 |
-
openaccess-ai-collective/manticore-13b-chat-pyg,
|
16 |
-
OpenAssistant/oasst-sft-1-pythia-12b,
|
17 |
-
project-baize/baize-v2-7B,
|
18 |
-
BAIR/koala-7b,
|
19 |
-
BAIR/koala-13b,
|
20 |
-
StabilityAI/stablelm-tuned-alpha-7b,
|
21 |
-
togethercomputer/RedPajama-INCITE-7B-Chat,
|
|
|
1 |
+
model,arc,hellaswag,truthfulqa
|
2 |
+
lmsys/vicuna-7B,53.5,77.5,49.0
|
3 |
+
lmsys/vicuna-13B,52.9,80.1,51.8
|
4 |
+
tatsu-lab/alpaca-7B,52.6,76.9,39.6
|
5 |
+
metaai/llama-7B,51.1,77.7,34.1
|
6 |
+
metaai/llama-13B,56.3,80.9,39.9
|
7 |
+
camel-ai/CAMEL-13B-Combined-Data,55.5,79.3,47.3
|
8 |
+
BlinkDL/RWKV-4-Raven-7B-v12-Eng98%-Other2%-20230521-ctx8192.pth,NaN,NaN,NaN
|
9 |
+
databricks/dolly-v2-12b,42.2,71.8,33.4
|
10 |
+
FreedomIntelligence/phoenix-inst-chat-7b,45.0,63.2,47.1
|
11 |
+
h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-7b-preview-300bt-v2,36.9,61.6,37.9
|
12 |
+
lmsys/fastchat-t5-3b-v1.0,35.9,46.4,48.8
|
13 |
+
Neutralzz/BiLLa-7B-SFT,27.7,26.0,49.0
|
14 |
+
nomic-ai/gpt4all-13b-snoozy,56.1,78.7,48.4
|
15 |
+
openaccess-ai-collective/manticore-13b-chat-pyg,58.7,82.0,48.9
|
16 |
+
OpenAssistant/oasst-sft-1-pythia-12b,45.6,69.9,39.2
|
17 |
+
project-baize/baize-v2-7B,48.5,75.0,41.7
|
18 |
+
BAIR/koala-7b,47.1,73.7,46.0
|
19 |
+
BAIR/koala-13b,52.9,77.5,50.1
|
20 |
+
StabilityAI/stablelm-tuned-alpha-7b,31.9,53.6,40.2
|
21 |
+
togethercomputer/RedPajama-INCITE-7B-Chat,42.2,70.8,36.1
|
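As a worked example of the `nlp_average` column, the sketch below averages the three metrics for a few rows copied from the new `score.csv`. It uses the standard `csv` module rather than pandas; the helper `nlp_average` is a hypothetical stand-in that skips NaN values the way `DataFrame.mean` does by default, so an all-NaN row yields NaN.

```python
import csv
import io
import math

# Three rows copied from the updated score.csv.
SCORE_CSV = """model,arc,hellaswag,truthfulqa
lmsys/vicuna-7B,53.5,77.5,49.0
lmsys/vicuna-13B,52.9,80.1,51.8
BlinkDL/RWKV-4-Raven-7B-v12-Eng98%-Other2%-20230521-ctx8192.pth,NaN,NaN,NaN
"""


def nlp_average(row: dict) -> float:
    """Arithmetic mean of all metric columns, ignoring NaN entries."""
    scores = [float(value) for key, value in row.items() if key != "model"]
    valid = [s for s in scores if not math.isnan(s)]
    return sum(valid) / len(valid) if valid else float("nan")


averages = {
    row["model"]: nlp_average(row)
    for row in csv.DictReader(io.StringIO(SCORE_CSV))
}

for model, avg in averages.items():
    print(f"{model}: {avg:.2f}")
```

The RWKV row has no scores yet, so its average is NaN and the row is removed later by the `dropna()` call in `app.py`.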