Jae-Won Chung committed on
Commit: 55aeee4
1 Parent(s): 8595b18

More explanations, default plot, compute average

Files changed (3):
  1. LEADERBOARD.md +24 -11
  2. app.py +30 -7
  3. data/2023-06-17/score.csv +21 -21
LEADERBOARD.md CHANGED
@@ -6,14 +6,16 @@ The energy consumption of running inference on a model will depends on factors s
 However, even if we run models with the exact same architecture and size on the same GPU, the average energy consumption **per prompt** is different because different models have **different verbosity**.
 That is, when asked the same thing, different models answer in different lengths.
 
-## Metrics
+## Columns
 
-- `gpu`: NVIDIA GPU model name
+- `gpu`: NVIDIA GPU model name. Note that NLP evaluation was only run once on our A40 GPUs, so this column only changes system-level measurements like latency and energy.
 - `task`: Name of the task. See *Tasks* below for details.
+- `energy_efficiency`: The average NLP evaluation metric attained per Joule of energy.
+- `energy` (J): The average energy consumed by the model to generate a response.
+- `nlp_average`: The arithmetic average of the NLP evaluation metrics we obtained. See *NLP evaluation metrics* below for details.
 - `throughput` (token/s): The average number of tokens generated per second.
-- `response_length` (token): The average number of tokens in the model's response.
 - `latency` (s): The average time it took for the model to generate a response.
-- `energy` (J): The average energy consumed by the model to generate a response.
+- `response_length` (token): The average number of tokens in the model's response.
 - `parameters`: The number of parameters the model has, in units of billion.
 
 ## Tasks
@@ -27,6 +29,9 @@ For each task, every model uses the same system prompt. We still account for dif
 | instruct | Below is an instruction that describes a task. Write a response that appropriately completes the request. |
 | instruct-concise | Below is an instruction that describes a task. Write a response that appropriately completes the request. The response should be very concise. |
 
+You can see that response length is shorter on average for the `-concise` variants of the tasks.
+This affects the number of decoding iterations the model has to run in order to finish responding, thus affecting latency and energy consumption per prompt.
+
 ## Setup
 
 Find our benchmark script for one model [here](https://github.com/ml-energy/leaderboard/blob/master/benchmark.py).
@@ -34,12 +39,14 @@ Find our benchmark script for one model [here](https://github.com/ml-energy/lead
 ### Software
 
 - PyTorch 2.0.1
-- [FastChat](https://github.com/lm-sys/fastchat) -- For various model support
-- [Zeus](https://ml.energy/zeus) -- For GPU energy measurement
+- [Zeus](https://ml.energy/zeus) -- For GPU time and energy measurement
+- [FastChat](https://github.com/lm-sys/fastchat) -- For running inference on various models
+- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/commit/72b7f0c00a6ff94632c5b873fc24e093ae74fa47) -- For NLP evaluation metrics
 
 ### Hardware
 
 - NVIDIA A40 GPU
+- NVIDIA A100 GPU
 
 ### Parameters
 
@@ -61,18 +68,24 @@ We used identical system prompts for all models (while respecting their own *rol
 A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
 ```
 
+## NLP evaluation metrics
+
+- `arc`: [AI2 Reasoning Challenge](https://allenai.org/data/arc)'s `challenge` dataset, measures capability to do grade-school level question answering, 25 shot
+- `hellaswag`: [HellaSwag dataset](https://allenai.org/data/hellaswag), measuring grounded commonsense, 10 shot
+- `truthfulqa`: [TruthfulQA dataset](https://arxiv.org/abs/2109.07958), measuring truthfulness against questions that elicit common falsehoods, 0 shot
+
 ## Upcoming
 
-- Compare energy numbers against more optimized inference runtimes, like TensorRT.
-- More GPU types
-- More models
+- More optimized inference runtimes, like TensorRT.
+- More GPU models, like V100.
+- More models, like RWKV.
 
 # License
 
 This leaderboard is a research preview intended for non-commercial use only.
 The use of LLaMA weights are subject to their [license](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md).
-Please direct inquiries and reports of potential license/copyright violation to Jae-Won Chung.
+Please direct inquiries/reports of potential violation to Jae-Won Chung.
 
 # Acknowledgements
 
-We thank [Chameleon Cloud](https://www.chameleoncloud.org/) for the A100 GPU nodes (`gpu_a100_pcie`) and [CloudLab](https://cloudlab.us/) for the V100 GPU nodes (`r7525`).
+We thank [Chameleon Cloud](https://www.chameleoncloud.org/) for the A100 80GB GPU nodes (`gpu_a100_pcie`) and [CloudLab](https://cloudlab.us/) for the V100 GPU nodes (`r7525`).
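To make the new `energy_efficiency` column concrete, here is a minimal Python sketch of the definition given in the Columns section above (the average NLP metric attained per Joule). The numbers are illustrative only, not leaderboard measurements:

```python
# Illustrative only: the metric scores and energy below are made-up values.
arc, hellaswag, truthfulqa = 53.5, 77.5, 49.0
energy = 1500.0  # hypothetical average Joules consumed per response

nlp_average = (arc + hellaswag + truthfulqa) / 3   # arithmetic mean of the NLP metrics
energy_efficiency = nlp_average / energy           # NLP performance per Joule

print(round(nlp_average, 2), round(energy_efficiency, 4))  # 60.0 0.04
```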
app.py CHANGED
@@ -19,9 +19,9 @@ class TableManager:
         """Load leaderboard data from CSV files in data_dir."""
         # Load and merge CSV files.
         df = self._read_tables(data_dir)
-        models = json.load(open(f"{data_dir}/models.json"))
 
         # Add the #params column.
+        models = json.load(open(f"{data_dir}/models.json"))
         df["parameters"] = df["model"].apply(lambda x: models[x]["params"])
 
         # Make the first column (model) an HTML anchor to the model's website.
@@ -34,8 +34,8 @@ class TableManager:
         )
         df["model"] = df["model"].apply(format_model_link)
 
-        # Sort by energy.
-        df = df.sort_values(by="energy", ascending=True)
+        # Sort by our 'energy efficiency' score.
+        df = df.sort_values(by="energy_efficiency", ascending=True)
 
         # The full table where all the data are.
         self.full_df = df
@@ -48,6 +48,11 @@
         """Read tables."""
         df_score = pd.read_csv(f"{data_dir}/score.csv")
 
+        # Compute average NLP metrics
+        columns = df_score.columns.to_list()
+        columns.remove("model")
+        df_score["nlp_average"] = df_score[columns].mean(axis=1)
+
         with open(f"{data_dir}/schema.yaml") as file:
             self.schema: dict[str, list] = yaml.safe_load(file)
 
@@ -66,7 +71,24 @@
         if res_df.empty:
             raise ValueError(f"No benchmark CSV files were read from {data_dir=}.")
 
-        return pd.merge(res_df, df_score, on=["model"]).round(2)
+        df = pd.merge(res_df, df_score, on=["model"])
+
+        # Energy efficiency is defined as the amount of average NLP performance
+        # the model gets per Joule of energy.
+        df["energy_efficiency"] = df["nlp_average"] / df["energy"]
+
+        # Order columns.
+        columns = df.columns.to_list()
+        cols_to_order = ["model"]
+        cols_to_order.extend(self.schema.keys())
+        cols_to_order.extend(["energy_efficiency", "energy", "nlp_average"])
+        columns = cols_to_order + [col for col in columns if col not in cols_to_order]
+        df = df[columns]
+
+        # Delete rows with *any* NaN values.
+        df = df.dropna()
+
+        return df.round(2)
 
     def _format_msg(self, text: str) -> str:
         """Formats into HTML that prints in Monospace font."""
@@ -111,8 +133,8 @@
     def get_dropdown(self):
         columns = self.full_df.columns.tolist()[1:] # include gpu and task in the dropdown
         return [
-            gr.Dropdown(choices=columns, label="X"),
-            gr.Dropdown(choices=columns, label="Y"),
+            gr.Dropdown("nlp_average", choices=columns, label="X"),
+            gr.Dropdown("energy_efficiency", choices=columns, label="Y"),
             gr.Dropdown(choices=columns, label="Z (optional)"),
         ]
 
@@ -320,7 +342,8 @@
         plot_width_input = gr.Textbox("600", lines=1, label="Width (px)")
         plot_height_input = gr.Textbox("600", lines=1, label="Height (px)")
     with gr.Row():
-        plot = gr.Plot()
+        # By default show a plot of average model quality vs energy consumption.
+        plot = gr.Plot(global_tbm.plot_scatter("600", "600", "gpu", "nlp_average", "energy")[0])
    with gr.Row():
         plot_message = gr.HTML("")
     add_col_btn.click(TableManager.update_dropdown, inputs=tbm, outputs=axis_dropdowns) # type: ignore
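The new `_read_tables` post-processing can be exercised in isolation. The sketch below reproduces the merge, `nlp_average`, and `energy_efficiency` steps on hypothetical in-memory frames (model names and numbers are placeholders, not real leaderboard data); `model-c` mimics the RWKV row in score.csv, whose all-NaN metrics cause it to be removed by `dropna()`:

```python
import pandas as pd

# Hypothetical stand-ins for the benchmark results and data/<date>/score.csv.
res_df = pd.DataFrame({
    "model": ["model-a", "model-b", "model-c"],
    "gpu": ["A40", "A40", "A40"],
    "task": ["chat", "chat", "chat"],
    "energy": [1200.0, 800.0, 950.0],
})
df_score = pd.DataFrame({
    "model": ["model-a", "model-b", "model-c"],
    "arc": [53.5, 42.2, float("nan")],
    "hellaswag": [77.5, 71.8, float("nan")],
    "truthfulqa": [49.0, 33.4, float("nan")],
})

# Average the NLP metrics into a single nlp_average column.
metric_cols = [c for c in df_score.columns if c != "model"]
df_score["nlp_average"] = df_score[metric_cols].mean(axis=1)

# Merge with the benchmark results, compute efficiency, and drop rows with
# missing values (models that have no NLP evaluation scores yet).
df = pd.merge(res_df, df_score, on=["model"])
df["energy_efficiency"] = df["nlp_average"] / df["energy"]
df = df.dropna().round(2)
print(df)
```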
data/2023-06-17/score.csv CHANGED
@@ -1,21 +1,21 @@
-model,average,ARC (25-s),HellaSwag (10-s),TruthfulQA (MC) (0-s)
-lmsys/vicuna-7B,60.0,53.5,77.5,49.0
-lmsys/vicuna-13B,61.6,52.9,80.1,51.8
-tatsu-lab/alpaca-7B,56.4,52.6,76.9,39.6
-metaai/llama-7B,54.3,51.1,77.7,34.1
-metaai/llama-13B,59.0,56.3,80.9,39.9
-camel-ai/CAMEL-13B-Combined-Data,60.7,55.5,79.3,47.3
-BlinkDL/RWKV-4-Raven-7B-v12-Eng98%-Other2%-20230521-ctx8192.pth,NaN,NaN,NaN,NaN
-databricks/dolly-v2-12b,49.1,42.2,71.8,33.4
-FreedomIntelligence/phoenix-inst-chat-7b,51.8,45.0,63.2,47.1
-h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-7b-preview-300bt-v2,45.5,36.9,61.6,37.9
-lmsys/fastchat-t5-3b-v1.0,43.7,35.9,46.4,48.8
-Neutralzz/BiLLa-7B-SFT,34.2,27.7,26.0,49.0
-nomic-ai/gpt4all-13b-snoozy,61.1,56.1,78.7,48.4
-openaccess-ai-collective/manticore-13b-chat-pyg,63.2,58.7,82.0,48.9
-OpenAssistant/oasst-sft-1-pythia-12b,51.6,45.6,69.9,39.2
-project-baize/baize-v2-7B,55.1,48.5,75.0,41.7
-BAIR/koala-7b,55.6,47.1,73.7,46.0
-BAIR/koala-13b,60.2,52.9,77.5,50.1
-StabilityAI/stablelm-tuned-alpha-7b,41.9,31.9,53.6,40.2
-togethercomputer/RedPajama-INCITE-7B-Chat,49.7,42.2,70.8,36.1
+model,arc,hellaswag,truthfulqa
+lmsys/vicuna-7B,53.5,77.5,49.0
+lmsys/vicuna-13B,52.9,80.1,51.8
+tatsu-lab/alpaca-7B,52.6,76.9,39.6
+metaai/llama-7B,51.1,77.7,34.1
+metaai/llama-13B,56.3,80.9,39.9
+camel-ai/CAMEL-13B-Combined-Data,55.5,79.3,47.3
+BlinkDL/RWKV-4-Raven-7B-v12-Eng98%-Other2%-20230521-ctx8192.pth,NaN,NaN,NaN
+databricks/dolly-v2-12b,42.2,71.8,33.4
+FreedomIntelligence/phoenix-inst-chat-7b,45.0,63.2,47.1
+h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-7b-preview-300bt-v2,36.9,61.6,37.9
+lmsys/fastchat-t5-3b-v1.0,35.9,46.4,48.8
+Neutralzz/BiLLa-7B-SFT,27.7,26.0,49.0
+nomic-ai/gpt4all-13b-snoozy,56.1,78.7,48.4
+openaccess-ai-collective/manticore-13b-chat-pyg,58.7,82.0,48.9
+OpenAssistant/oasst-sft-1-pythia-12b,45.6,69.9,39.2
+project-baize/baize-v2-7B,48.5,75.0,41.7
+BAIR/koala-7b,47.1,73.7,46.0
+BAIR/koala-13b,52.9,77.5,50.1
+StabilityAI/stablelm-tuned-alpha-7b,31.9,53.6,40.2
+togethercomputer/RedPajama-INCITE-7B-Chat,42.2,70.8,36.1
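For reference, the schema change above (dropping the precomputed `average` column, which app.py now recomputes as `nlp_average`, and renaming the per-benchmark columns) could be reproduced with a short pandas snippet; the `old_score.csv` path is hypothetical:

```python
import pandas as pd

# Hypothetical copy of the previous-format score file.
old = pd.read_csv("old_score.csv")

# Drop the precomputed average and rename benchmark columns to the new schema.
new = old.drop(columns=["average"]).rename(columns={
    "ARC (25-s)": "arc",
    "HellaSwag (10-s)": "hellaswag",
    "TruthfulQA (MC) (0-s)": "truthfulqa",
})
new.to_csv("score.csv", index=False)
```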