Jae-Won Chung committed on
Commit: 55aeee4
1 Parent(s): 8595b18

More explanations, default plot, compute average

Files changed (3):
  1. LEADERBOARD.md +24 -11
  2. app.py +30 -7
  3. data/2023-06-17/score.csv +21 -21
LEADERBOARD.md CHANGED
@@ -6,14 +6,16 @@ The energy consumption of running inference on a model will depends on factors s
 However, even if we run models with the exact same architecture and size on the same GPU, the average energy consumption **per prompt** is different because different models have **different verbosity**.
 That is, when asked the same thing, different models answer in different lengths.
 
-## Metrics
+## Columns
 
-- `gpu`: NVIDIA GPU model name
+- `gpu`: NVIDIA GPU model name. Note that NLP evaluation was only run once on our A40 GPUs, so this column only changes system-level measurements like latency and energy.
 - `task`: Name of the task. See *Tasks* below for details.
+- `energy_efficiency`: The average NLP evaluation metric attained per Joule of energy.
+- `energy` (J): The average energy consumed by the model to generate a response.
+- `nlp_average`: The arithmetic average of the NLP evaluation metrics we obtained. See *NLP evaluation metrics* below for details.
 - `throughput` (token/s): The average number of tokens generated per second.
-- `response_length` (token): The average number of tokens in the model's response.
 - `latency` (s): The average time it took for the model to generate a response.
-- `energy` (J): The average energy consumed by the model to generate a response.
+- `response_length` (token): The average number of tokens in the model's response.
 - `parameters`: The number of parameters the model has, in units of billion.
 
 ## Tasks
@@ -27,6 +29,9 @@ For each task, every model uses the same system prompt. We still account for dif
 | instruct | Below is an instruction that describes a task. Write a response that appropriately completes the request. |
 | instruct-concise | Below is an instruction that describes a task. Write a response that appropriately completes the request. The response should be very concise. |
 
+You can see that response length is shorter on average for the `-concise` variants of the tasks.
+This affects the number of decoding iterations the model has to run in order to finish responding, thus affecting latency and energy consumption per prompt.
+
 ## Setup
 
 Find our benchmark script for one model [here](https://github.com/ml-energy/leaderboard/blob/master/benchmark.py).
@@ -34,12 +39,14 @@ Find our benchmark script for one model [here](https://github.com/ml-energy/lead
 ### Software
 
 - PyTorch 2.0.1
-- [FastChat](https://github.com/lm-sys/fastchat) -- For various model support
-- [Zeus](https://ml.energy/zeus) -- For GPU energy measurement
+- [Zeus](https://ml.energy/zeus) -- For GPU time and energy measurement
+- [FastChat](https://github.com/lm-sys/fastchat) -- For running inference on various models
+- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/commit/72b7f0c00a6ff94632c5b873fc24e093ae74fa47) -- For NLP evaluation metrics
 
 ### Hardware
 
 - NVIDIA A40 GPU
+- NVIDIA A100 GPU
 
 ### Parameters
 
@@ -61,18 +68,24 @@ We used identical system prompts for all models (while respecting their own *rol
 A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
 ```
 
+## NLP evaluation metrics
+
+- `arc`: [AI2 Reasoning Challenge](https://allenai.org/data/arc)'s `challenge` dataset, measures capability to do grade-school level question answering, 25 shot
+- `hellaswag`: [HellaSwag dataset](https://allenai.org/data/hellaswag), measuring grounded commonsense, 10 shot
+- `truthfulqa`: [TruthfulQA dataset](https://arxiv.org/abs/2109.07958), measuring truthfulness against questions that elicit common falsehoods, 0 shot
+
 ## Upcoming
 
-- Compare energy numbers against more optimized inference runtimes, like TensorRT.
-- More GPU types
-- More models
+- More optimized inference runtimes, like TensorRT.
+- More GPU models, like V100.
+- More models, like RWKV.
 
 # License
 
 This leaderboard is a research preview intended for non-commercial use only.
 The use of LLaMA weights are subject to their [license](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md).
-Please direct inquiries and reports of potential license/copyright violation to Jae-Won Chung.
+Please direct inquiries/reports of potential violation to Jae-Won Chung.
 
 # Acknowledgements
 
-We thank [Chameleon Cloud](https://www.chameleoncloud.org/) for the A100 GPU nodes (`gpu_a100_pcie`) and [CloudLab](https://cloudlab.us/) for the V100 GPU nodes (`r7525`).
+We thank [Chameleon Cloud](https://www.chameleoncloud.org/) for the A100 80GB GPU nodes (`gpu_a100_pcie`) and [CloudLab](https://cloudlab.us/) for the V100 GPU nodes (`r7525`).
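To make the new `energy_efficiency` column concrete, here is a minimal Python sketch of the definition given in the Columns section above (the average NLP metric attained per Joule). The numbers are illustrative only, not leaderboard measurements:

```python
# Illustrative only: the metric scores and energy below are made-up values.
arc, hellaswag, truthfulqa = 53.5, 77.5, 49.0
energy = 1500.0  # hypothetical average Joules consumed per response

nlp_average = (arc + hellaswag + truthfulqa) / 3   # arithmetic mean of the NLP metrics
energy_efficiency = nlp_average / energy           # NLP performance per Joule

print(round(nlp_average, 2), round(energy_efficiency, 4))  # 60.0 0.04
```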
app.py CHANGED
@@ -19,9 +19,9 @@ class TableManager:
         """Load leaderboard data from CSV files in data_dir."""
         # Load and merge CSV files.
         df = self._read_tables(data_dir)
-        models = json.load(open(f"{data_dir}/models.json"))
 
         # Add the #params column.
+        models = json.load(open(f"{data_dir}/models.json"))
         df["parameters"] = df["model"].apply(lambda x: models[x]["params"])
 
         # Make the first column (model) an HTML anchor to the model's website.
@@ -34,8 +34,8 @@ class TableManager:
         )
         df["model"] = df["model"].apply(format_model_link)
 
-        # Sort by energy.
-        df = df.sort_values(by="energy", ascending=True)
+        # Sort by our 'energy efficiency' score.
+        df = df.sort_values(by="energy_efficiency", ascending=True)
 
         # The full table where all the data are.
         self.full_df = df
@@ -48,6 +48,11 @@
         """Read tables."""
         df_score = pd.read_csv(f"{data_dir}/score.csv")
 
+        # Compute average NLP metrics
+        columns = df_score.columns.to_list()
+        columns.remove("model")
+        df_score["nlp_average"] = df_score[columns].mean(axis=1)
+
         with open(f"{data_dir}/schema.yaml") as file:
             self.schema: dict[str, list] = yaml.safe_load(file)
 
@@ -66,7 +71,24 @@
         if res_df.empty:
             raise ValueError(f"No benchmark CSV files were read from {data_dir=}.")
 
-        return pd.merge(res_df, df_score, on=["model"]).round(2)
+        df = pd.merge(res_df, df_score, on=["model"])
+
+        # Energy efficiency is defined as the amount of average NLP performance
+        # the model gets per Joule of energy.
+        df["energy_efficiency"] = df["nlp_average"] / df["energy"]
+
+        # Order columns.
+        columns = df.columns.to_list()
+        cols_to_order = ["model"]
+        cols_to_order.extend(self.schema.keys())
+        cols_to_order.extend(["energy_efficiency", "energy", "nlp_average"])
+        columns = cols_to_order + [col for col in columns if col not in cols_to_order]
+        df = df[columns]
+
+        # Delete rows with *any* NaN values.
+        df = df.dropna()
+
+        return df.round(2)
 
     def _format_msg(self, text: str) -> str:
         """Formats into HTML that prints in Monospace font."""
@@ -111,8 +133,8 @@
     def get_dropdown(self):
         columns = self.full_df.columns.tolist()[1:] # include gpu and task in the dropdown
         return [
-            gr.Dropdown(choices=columns, label="X"),
-            gr.Dropdown(choices=columns, label="Y"),
+            gr.Dropdown("nlp_average", choices=columns, label="X"),
+            gr.Dropdown("energy_efficiency", choices=columns, label="Y"),
             gr.Dropdown(choices=columns, label="Z (optional)"),
         ]
 
@@ -320,7 +342,8 @@
         plot_width_input = gr.Textbox("600", lines=1, label="Width (px)")
         plot_height_input = gr.Textbox("600", lines=1, label="Height (px)")
     with gr.Row():
-        plot = gr.Plot()
+        # By default show a plot of average model quality vs energy consumption.
+        plot = gr.Plot(global_tbm.plot_scatter("600", "600", "gpu", "nlp_average", "energy")[0])
    with gr.Row():
         plot_message = gr.HTML("")
     add_col_btn.click(TableManager.update_dropdown, inputs=tbm, outputs=axis_dropdowns) # type: ignore
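The new `_read_tables` post-processing can be exercised in isolation. The sketch below reproduces the merge, `nlp_average`, and `energy_efficiency` steps on hypothetical in-memory frames (model names and numbers are placeholders, not real leaderboard data); `model-c` mimics the RWKV row in score.csv, whose all-NaN metrics cause it to be removed by `dropna()`:

```python
import pandas as pd

# Hypothetical stand-ins for the benchmark results and data/<date>/score.csv.
res_df = pd.DataFrame({
    "model": ["model-a", "model-b", "model-c"],
    "gpu": ["A40", "A40", "A40"],
    "task": ["chat", "chat", "chat"],
    "energy": [1200.0, 800.0, 950.0],
})
df_score = pd.DataFrame({
    "model": ["model-a", "model-b", "model-c"],
    "arc": [53.5, 42.2, float("nan")],
    "hellaswag": [77.5, 71.8, float("nan")],
    "truthfulqa": [49.0, 33.4, float("nan")],
})

# Average the NLP metrics into a single nlp_average column.
metric_cols = [c for c in df_score.columns if c != "model"]
df_score["nlp_average"] = df_score[metric_cols].mean(axis=1)

# Merge with the benchmark results, compute efficiency, and drop rows with
# missing values (models that have no NLP evaluation scores yet).
df = pd.merge(res_df, df_score, on=["model"])
df["energy_efficiency"] = df["nlp_average"] / df["energy"]
df = df.dropna().round(2)
print(df)
```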
data/2023-06-17/score.csv CHANGED
@@ -1,21 +1,21 @@
-model,average,ARC (25-s),HellaSwag (10-s),TruthfulQA (MC) (0-s)
-lmsys/vicuna-7B,60.0,53.5,77.5,49.0
-lmsys/vicuna-13B,61.6,52.9,80.1,51.8
-tatsu-lab/alpaca-7B,56.4,52.6,76.9,39.6
-metaai/llama-7B,54.3,51.1,77.7,34.1
-metaai/llama-13B,59.0,56.3,80.9,39.9
-camel-ai/CAMEL-13B-Combined-Data,60.7,55.5,79.3,47.3
-BlinkDL/RWKV-4-Raven-7B-v12-Eng98%-Other2%-20230521-ctx8192.pth,NaN,NaN,NaN,NaN
-databricks/dolly-v2-12b,49.1,42.2,71.8,33.4
-FreedomIntelligence/phoenix-inst-chat-7b,51.8,45.0,63.2,47.1
-h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-7b-preview-300bt-v2,45.5,36.9,61.6,37.9
-lmsys/fastchat-t5-3b-v1.0,43.7,35.9,46.4,48.8
-Neutralzz/BiLLa-7B-SFT,34.2,27.7,26.0,49.0
-nomic-ai/gpt4all-13b-snoozy,61.1,56.1,78.7,48.4
-openaccess-ai-collective/manticore-13b-chat-pyg,63.2,58.7,82.0,48.9
-OpenAssistant/oasst-sft-1-pythia-12b,51.6,45.6,69.9,39.2
-project-baize/baize-v2-7B,55.1,48.5,75.0,41.7
-BAIR/koala-7b,55.6,47.1,73.7,46.0
-BAIR/koala-13b,60.2,52.9,77.5,50.1
-StabilityAI/stablelm-tuned-alpha-7b,41.9,31.9,53.6,40.2
-togethercomputer/RedPajama-INCITE-7B-Chat,49.7,42.2,70.8,36.1
+model,arc,hellaswag,truthfulqa
+lmsys/vicuna-7B,53.5,77.5,49.0
+lmsys/vicuna-13B,52.9,80.1,51.8
+tatsu-lab/alpaca-7B,52.6,76.9,39.6
+metaai/llama-7B,51.1,77.7,34.1
+metaai/llama-13B,56.3,80.9,39.9
+camel-ai/CAMEL-13B-Combined-Data,55.5,79.3,47.3
+BlinkDL/RWKV-4-Raven-7B-v12-Eng98%-Other2%-20230521-ctx8192.pth,NaN,NaN,NaN
+databricks/dolly-v2-12b,42.2,71.8,33.4
+FreedomIntelligence/phoenix-inst-chat-7b,45.0,63.2,47.1
+h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-7b-preview-300bt-v2,36.9,61.6,37.9
+lmsys/fastchat-t5-3b-v1.0,35.9,46.4,48.8
+Neutralzz/BiLLa-7B-SFT,27.7,26.0,49.0
+nomic-ai/gpt4all-13b-snoozy,56.1,78.7,48.4
+openaccess-ai-collective/manticore-13b-chat-pyg,58.7,82.0,48.9
+OpenAssistant/oasst-sft-1-pythia-12b,45.6,69.9,39.2
+project-baize/baize-v2-7B,48.5,75.0,41.7
+BAIR/koala-7b,47.1,73.7,46.0
+BAIR/koala-13b,52.9,77.5,50.1
+StabilityAI/stablelm-tuned-alpha-7b,31.9,53.6,40.2
+togethercomputer/RedPajama-INCITE-7B-Chat,42.2,70.8,36.1
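For reference, the schema change above (dropping the precomputed `average` column, which app.py now recomputes as `nlp_average`, and renaming the per-benchmark columns) could be reproduced with a short pandas snippet; the `old_score.csv` path is hypothetical:

```python
import pandas as pd

# Hypothetical copy of the previous-format score file.
old = pd.read_csv("old_score.csv")

# Drop the precomputed average and rename benchmark columns to the new schema.
new = old.drop(columns=["average"]).rename(columns={
    "ARC (25-s)": "arc",
    "HellaSwag (10-s)": "hellaswag",
    "TruthfulQA (MC) (0-s)": "truthfulqa",
})
new.to_csv("score.csv", index=False)
```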