Jae-Won Chung committed on
Commit
64e7ccb
1 Parent(s): 48843fe

Better About tab, fetch leaderboard date from git

LEADERBOARD.md CHANGED
@@ -65,14 +65,32 @@ Find our benchmark script for one model [here](https://github.com/ml-energy/lead
 We randomly sampled around 3000 prompts from the [cleaned ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered).
 See [here](https://github.com/ml-energy/leaderboard/tree/master/sharegpt) for more detail on how we created the benchmark dataset.
 
+## Contributing
+
+Any kind of contribution is more than welcome!
+Please look around our [repository](https://github.com/ml-energy/leaderboard).
+
+In particular, if you want to see a specific model on the leaderboard, please consider adding support for it.
+We'll consider running it on the hardware we have.
+First, see if the model is available on the Hugging Face Hub and compatible with lm-evaluation-harness.
+Then, in our [`benchmark.py`](https://github.com/ml-energy/leaderboard/blob/master/scripts/benchmark.py), implement a way to load the weights of the model and run generative inference.
+
+Currently, we use FastChat to load models and run inference, but we'll eventually abstract the model executor, making it easier to add models that FastChat does not support.
+
 ## Limitations
 
 Currently, inference is run with basically bare PyTorch with batch size 1, which is unrealistic assuming a production serving scenario.
 Hence, absolute latency, throughput, and energy numbers should not be used to estimate figures in real production settings, while relative comparison makes some sense.
 
+Batch size 1 is, in some sense, the lowest possible hardware utilization.
+We'll soon benchmark batch sizes larger than 1 without continuous batching for comparison.
+This would show what happens in the case of very high hardware utilization (at least with PyTorch), assuming an ideal case where all sequences in each batch generate the same number of output tokens.
+By doing this, we can provide numbers for reasonable comparison without being tied to any existing generative model serving system.
+
 ## Upcoming
 
 - Within the Summer, we'll add an online text generation interface for real time energy consumption measurement!
+- Batched inference.
 - More optimized inference runtimes, like TensorRT.
 - Larger models with distributed inference, like Falcon 40B.
 - More models, like RWKV.
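The batched-inference plan above (every sequence in a batch generating the same number of output tokens) makes per-token energy accounting simple arithmetic. A minimal sketch with hypothetical names, assuming the total energy is measured once for the whole batch:

```python
def energy_per_token(total_energy_j: float, batch_size: int, output_tokens_per_seq: int) -> float:
    """Energy per generated token, under the ideal assumption that every
    sequence in the batch generates the same number of output tokens."""
    total_tokens = batch_size * output_tokens_per_seq
    return total_energy_j / total_tokens

# With batch size 1, all measured energy is attributed to one sequence's tokens;
# larger batches amortize the same forward passes over more tokens.
```

This is illustrative only; the actual benchmark's accounting may differ once batched measurement lands.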
app.py CHANGED
@@ -1,10 +1,11 @@
 from __future__ import annotations
 
-import os
 import json
 import yaml
+import subprocess
 import itertools
 import contextlib
+from dateutil import parser
 
 import numpy as np
 import gradio as gr
@@ -158,7 +159,7 @@ class TableManager:
             gr.Dropdown.update(choices=["None", *columns]),
         ]
 
-    def set_filter_get_df(self, *filters):
+    def set_filter_get_df(self, *filters) -> pd.DataFrame:
         """Set the current set of filters and return the filtered DataFrame."""
         # If the filter is empty, we default to the first choice for each key.
         if not filters:
@@ -200,16 +201,21 @@
         return fig, width, height, ""
 
 
-# Find the latest version of the CSV files in data/
-# and initialize the global TableManager.
-latest_date = sorted(os.listdir("data/"))[-1]
-
 # The global instance of the TableManager should only be used when
 # initializing components in the Gradio interface. If the global instance
 # is mutated while handling user sessions, the change will be reflected
 # in every user session. Instead, the instance provided by gr.State should
 # be used.
-global_tbm = TableManager(f"data/{latest_date}")
+global_tbm = TableManager("data")
+
+# Run git log to get the latest commit date.
+proc = subprocess.run(
+    ["git", "log", "-1", "--format=%cd"],
+    stdout=subprocess.PIPE,
+    stderr=subprocess.PIPE,
+    encoding="utf-8",
+)
+current_date = parser.parse(proc.stdout.strip()).strftime("%Y-%m-%d")
 
 # Custom JS.
 # XXX: This is a hack to make the model names clickable.
@@ -397,7 +403,7 @@ with block:
 
     # Block 5: Leaderboard date.
     with gr.Row():
-        gr.HTML(f"<h3 style='color: gray'>Date: {latest_date}</h3>")
+        gr.HTML(f"<h3 style='color: gray'>Date: {current_date}</h3>")
 
     # Tab 2: About page.
     with gr.TabItem("About"):
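The date logic added above shells out to `git log` and parses the timestamp with `dateutil`. The same idea can be sketched with only the standard library; the helper names here are ours, not the app's:

```python
import subprocess
from datetime import datetime

def format_commit_date(raw: str) -> str:
    """Parse git's default %cd output, e.g. 'Wed Jul 5 12:00:00 2023 -0400'."""
    return datetime.strptime(raw.strip(), "%a %b %d %H:%M:%S %Y %z").strftime("%Y-%m-%d")

def latest_commit_date() -> str:
    """Return the commit date of HEAD in the current repository."""
    proc = subprocess.run(
        ["git", "log", "-1", "--format=%cd"],
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise if we're not inside a git repository
    )
    return format_commit_date(proc.stdout)
```

Unlike the committed version, `check=True` surfaces a failure (e.g. running outside a git checkout) as an exception instead of silently parsing empty output.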
data/{2023-07-05/A100_chat-concise_benchmark.csv → A100_chat-concise_benchmark.csv} RENAMED (file without changes)
data/{2023-07-05/A100_chat_benchmark.csv → A100_chat_benchmark.csv} RENAMED (file without changes)
data/{2023-07-05/A100_instruct-concise_benchmark.csv → A100_instruct-concise_benchmark.csv} RENAMED (file without changes)
data/{2023-07-05/A100_instruct_benchmark.csv → A100_instruct_benchmark.csv} RENAMED (file without changes)
data/{2023-07-05/A40_chat-concise_benchmark.csv → A40_chat-concise_benchmark.csv} RENAMED (file without changes)
data/{2023-07-05/A40_chat_benchmark.csv → A40_chat_benchmark.csv} RENAMED (file without changes)
data/{2023-07-05/A40_instruct-concise_benchmark.csv → A40_instruct-concise_benchmark.csv} RENAMED (file without changes)
data/{2023-07-05/A40_instruct_benchmark.csv → A40_instruct_benchmark.csv} RENAMED (file without changes)
data/{2023-07-05/V100_chat-concise_benchmark.csv → V100_chat-concise_benchmark.csv} RENAMED (file without changes)
data/{2023-07-05/V100_chat_benchmark.csv → V100_chat_benchmark.csv} RENAMED (file without changes)
data/{2023-07-05/V100_instruct-concise_benchmark.csv → V100_instruct-concise_benchmark.csv} RENAMED (file without changes)
data/{2023-07-05/V100_instruct_benchmark.csv → V100_instruct_benchmark.csv} RENAMED (file without changes)
data/{2023-07-05/models.json → models.json} RENAMED (file without changes)
data/{2023-07-05/schema.yaml → schema.yaml} RENAMED (file without changes)
data/{2023-07-05/score.csv → score.csv} RENAMED (file without changes)
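With the dated subdirectory flattened away by the renames above, `TableManager("data")` now reads CSVs directly from `data/`. A sketch of how such files could be enumerated; the function name and glob pattern are our assumptions, not the actual TableManager logic:

```python
from pathlib import Path

def list_benchmark_csvs(data_dir: str = "data") -> list[str]:
    """Collect benchmark CSVs named like <GPU>_<setting>_benchmark.csv."""
    # Hypothetical helper; the real TableManager may discover files differently.
    return sorted(p.name for p in Path(data_dir).glob("*_benchmark.csv"))
```

Dropping the date from the directory layout is what makes the git-based date in app.py necessary: the data's version is now tracked by the commit history rather than by the path.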