AppleSwing committed on
Commit a4829c2
2 Parent(s): 2a18e0a f3caf97

Merge branch 'main' into pr/15

README.md CHANGED
@@ -15,20 +15,71 @@ tags:
 
  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
- ## Local development
+ # Contributing to Open-MOE-LLM-Leaderboard
 
- Create a virtual environment and install the dependencies:
+ Thank you for your interest in contributing to the Open-MOE-LLM-Leaderboard project! We welcome contributions from everyone. Below you'll find guidance on how to set up your development environment, understand our architecture, and contribute effectively. If you have any questions or wish to discuss your contributions, please reach out to Yao Fu via email at [Y.Fu@ed.ac.uk](mailto:y.fu@ed.ac.uk).
+
+ ## What We're Looking For in Contributions
+
+ We are looking for contributions in several key areas to enhance the Open-MOE-LLM-Leaderboard project:
+
+ 1. **General Bug Fixes/Reports**: We welcome reports of any bugs found in the frontend interface or backend, as well as fixes for these issues.
+
+ 2. **Adding New Tasks (Benchmark Datasets)**: If you have ideas for new benchmark datasets that could be added, your contributions would be greatly appreciated.
+
+ 3. **Supporting New Inference Frameworks**: Expanding our project to support new inference frameworks is crucial for our growth. If you can contribute in this area, please reach out.
+
+ 4. **Testing More Models**: To make our leaderboard as comprehensive as possible, we need to test a wide range of models. Contributions in this area are highly valuable.
+
+ Documentation is currently of lower priority, but if you have thoughts or suggestions, please feel free to raise them.
+
+ Your contributions are crucial to the success and improvement of the Open-MOE-LLM-Leaderboard project. We look forward to collaborating with you.
+
+
+ ## Development Setup
+
+ To start contributing, set up your development environment as follows:
 
  ```bash
- conda create -n <env_name> python=3.10
- conda activate <env_name>
+ conda create -n leaderboard python=3.10
+ conda activate leaderboard
  pip install -r requirements.txt
+ pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ moe-infinity
+ pip install pydantic==2.6.4 # Resolves a dependency conflict with moe-infinity
+ python -m spacy download en # Required for selfcheckgpt
+ ```
+
+ ## Architecture Overview
+
+ The Open-MOE-LLM-Leaderboard project uses the following architecture:
+
+ - **User Interface (Gradio)** ->upload-> **HuggingFace Dataset (Request)** ->download-> **Backend GPU Server** ->upload-> **HuggingFace Dataset (Result)** ->download-> **User Interface (Gradio)**
+
+ In brief:
+ 1. Users submit model benchmarking requests through the Gradio interface ([app.py](./app.py)). These requests are then recorded in a HuggingFace dataset ([sparse-generative-ai/requests](https://huggingface.co/datasets/sparse-generative-ai/requests)).
+ 2. The backend ([backend-cli.py](./backend-cli.py)), running on a GPU server, processes these requests, performs the benchmarking tasks, and uploads the results to another HuggingFace dataset ([sparse-generative-ai/results](https://huggingface.co/datasets/sparse-generative-ai/results)).
+ 3. Finally, the Gradio interface retrieves and displays these results to the users.
+
+ ## Running the Gradio Interface
+
+ To launch the Gradio interface, execute:
+
+ ```bash
+ python app.py
  ```
 
- **Follow the instructions in Dockerfile to install other necessary dependencies.**
+ Then, open your browser and navigate to http://127.0.0.1:7860.
 
- Start the backend server in debug mode:
+ ## Running the Backend
+
+ To start the backend process, use:
 
  ```bash
  python backend-cli.py --debug
- ```
+ ```
+
+ For additional details, please consult the [backend-cli.py](./backend-cli.py) script.
+
+ ---
+
+ We look forward to your contributions and are here to help guide you through the process. Thank you for supporting the Open-MOE-LLM-Leaderboard project!
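To make the request/result flow described in the Architecture Overview above concrete, here is a minimal sketch of how a submission record could be pushed to the requests dataset with `huggingface_hub`. The file name and request fields are illustrative assumptions, not the leaderboard's exact schema:

```python
import json
import tempfile

from huggingface_hub import HfApi

# Hypothetical request record; the real schema is defined by the leaderboard code.
request = {
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "inference_framework": "moe-infinity",
    "status": "PENDING",
}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(request, f)

HfApi().upload_file(
    path_or_fileobj=f.name,
    path_in_repo="mistralai/Mixtral-8x7B-Instruct-v0.1_request.json",  # hypothetical path
    repo_id="sparse-generative-ai/requests",
    repo_type="dataset",
)
```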
requirements.txt CHANGED
@@ -18,7 +18,7 @@ tqdm
  wandb
  transformers>=4.36.0
  tokenizers>=0.15.0
- lm_eval[ifeval] @ git+https://github.com/EleutherAI/lm-evaluation-harness.git
+ lm_eval[ifeval] @ git+https://github.com/EleutherAI/lm-evaluation-harness.git@v0.4.2
  accelerate
  sentencepiece
  langdetect
src/backend/hflm_with_measurement.py CHANGED
@@ -68,6 +68,226 @@ class HFLMWithMeasurement(HFLM):
      def __init__(self, **kwargs):
          super().__init__(**kwargs)
 
+     def _loglikelihood_tokens(
+         self,
+         requests: List[Tuple[Tuple[str, str], List[int], List[int]]],
+         disable_tqdm: bool = False,
+         override_bs: int = None,
+     ) -> List[Tuple[float, bool]]:
+         # TODO: implement some kind of efficient-request-middleware that lumps together requests with the same context
+         res = []
+
+         def _collate(req: Tuple[Tuple[str, str], List[int], List[int]]):
+             """Defines the key for the sorted method"""
+             # the negative sign on len(toks) sorts descending - this has a few advantages:
+             # - time estimates will always be over not underestimates, which is more useful for planning
+             # - to know the size of a batch when going through the list, you know the first one is always the batch
+             #   padded context length. this is useful to simplify the batching logic and more importantly to make
+             #   automatic adaptive batches much much easier to implement
+             # - any OOMs will happen right away rather than near the end
+
+             toks = req[1] + req[2]
+             return -len(toks), tuple(toks)
+
+         def _lookup_one_token_cont(req: Tuple[Tuple[str, str], List[int], List[int]]):
+             """Defines the key to group and lookup one-token continuations"""
+             # Use with group_by="contexts" (optional)"
+             # allows for the creation of a lookup, so we can reuse logits in case of one-token continuations.
+             # speeds up some multiple-choice tasks proportionally to the number of choices.
+             # groups requests by context+continuation[:-1] and infer on one request/group.
+             return req[-2] + req[-1][:-1]
+
+         re_ord = Collator(
+             requests,
+             sort_fn=_collate,
+             group_by="contexts"
+             if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM
+             and self.logits_cache
+             else None,
+             group_fn=_lookup_one_token_cont,
+         )
+
+         # automatic (variable) batch size detection for vectorization
+         # pull longest context sample from request
+         n_reordered_requests = len(re_ord)
+         batch_size = (
+             self.batch_size
+             if self.batch_size != "auto"
+             else override_bs
+             if override_bs is not None
+             else 0
+         )
+         batch_fn = (
+             self._batch_scheduler
+             if self.batch_size == "auto"
+             and n_reordered_requests > 0
+             and not override_bs
+             else None
+         )
+
+         chunks = re_ord.get_batched(n=batch_size, batch_fn=batch_fn)
+         pbar = tqdm(
+             total=len(requests),
+             disable=(disable_tqdm or (self.rank != 0)),
+             desc="Running loglikelihood requests",
+         )
+         for chunk in chunks:
+             inps = []
+             cont_toks_list = []
+             inplens = []
+
+             conts = []
+             encoder_attns = []
+
+             padding_len_inp = None
+             padding_len_cont = None
+             # because vectorizing is annoying, we first convert each (context, continuation) pair to padded
+             # tensors, then we pack them together into a batch, call the model, and then pick it all apart
+             # again because vectorizing is annoying
+
+             for _, context_enc, continuation_enc in chunk:
+                 # sanity check
+                 assert len(context_enc) > 0
+                 assert len(continuation_enc) > 0
+                 assert len(continuation_enc) <= self.max_length
+
+                 # how this all works (illustrated on a causal decoder-only setup):
+                 #          CTX      CONT
+                 # inp    0 1 2 3|4 5 6 7 8 9   <- last token is deleted by inp[:, :-1]
+                 # model  \               \
+                 # logits   1 2 3|4 5 6 7 8 9   <- the ctx half gets tossed out by the
+                 # cont_toks    4 5 6 7 8 9      [:, -len(continuation_enc):, :self.vocab_size] slice
+
+                 # when too long to fit in context, truncate from the left
+                 if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM:
+                     inp = torch.tensor(
+                         (context_enc + continuation_enc)[-(self.max_length + 1) :][:-1],
+                         dtype=torch.long,
+                         device=self.device,
+                     )
+                     (inplen,) = inp.shape
+                 elif self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM:
+                     inp = torch.tensor(
+                         (context_enc)[-self.max_length :],
+                         dtype=torch.long,
+                         device=self.device,
+                     )
+                     (inplen,) = inp.shape
+
+                     # build encoder attn masks
+                     encoder_attns.append(torch.ones_like(inp))
+
+                     cont = torch.tensor(
+                         (continuation_enc)[-self.max_length :],
+                         # TODO: left-shift these?
+                         # TODO: our code assumes we never end up truncating conts for either model type
+                         dtype=torch.long,
+                         device=self.device,
+                     )
+                     (contlen,) = cont.shape
+
+                     conts.append(cont)
+
+                     padding_len_cont = (
+                         max(padding_len_cont, contlen)
+                         if padding_len_cont is not None
+                         else contlen
+                     )
+
+                 padding_len_inp = (
+                     max(padding_len_inp, inplen)
+                     if padding_len_inp is not None
+                     else inplen
+                 )
+
+                 inps.append(inp)  # [1, inp_length]
+                 cont_toks_list.append(continuation_enc)
+                 inplens.append(inplen)
+
+             # create encoder attn mask and batched conts, if seq2seq
+             call_kwargs = {}
+             if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM:
+                 batched_inps = pad_and_concat(
+                     padding_len_inp, inps, padding_side="right"
+                 )  # [batch, padding_len_inp]
+             elif self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM:
+                 # TODO: left-pad encoder inps and mask?
+                 batched_inps = pad_and_concat(
+                     padding_len_inp, inps
+                 )  # [batch, padding_len_inp]
+                 batched_conts = pad_and_concat(
+                     padding_len_cont, conts
+                 )  # [batch, padding_len_cont]
+                 batched_encoder_mask = pad_and_concat(
+                     padding_len_inp, encoder_attns
+                 )  # [batch, padding_len_inp]
+                 call_kwargs = {
+                     "attn_mask": batched_encoder_mask,
+                     "labels": batched_conts,
+                 }
+
+             start = time()
+             intermediate_res = self._model_call(batched_inps, **call_kwargs)
+             end = time()
+             multi_logits = F.log_softmax(
+                 intermediate_res, dim=-1
+             )  # [batch, padding_length (inp or cont), vocab]
+             per_sample_time = (end - start) / len(multi_logits)
+
+             for (request_str, ctx_tokens, _), logits, inplen, cont_toks in zip(
+                 chunk, multi_logits, inplens, cont_toks_list
+             ):
+                 # Slice to original seq length
+                 contlen = len(cont_toks)
+                 # take only logits in the continuation
+                 # (discard context toks if decoder-only ; discard right-padding)
+                 # also discards + checks for "virtual tokens" in the causal LM's input window
+                 # from prompt/prefix tuning tokens, if applicable
+                 ctx_len = (
+                     inplen + (logits.shape[0] - padding_len_inp)
+                     if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM
+                     else None
+                 )
+                 logits = self._select_cont_toks(logits, contlen=contlen, inplen=ctx_len)
+                 logits = logits.unsqueeze(0)  # [1, seq, vocab]
+
+                 # Check if per-token argmax is exactly equal to continuation
+                 greedy_tokens = logits.argmax(dim=-1)
+
+                 # check for one-token continuation cache hits.
+                 # noop in case group_by != "contexts" or no cache hit and returns the
+                 # original args. Otherwise, expands the logits batch dimension and yields each
+                 # batch along with matching continuation tokens and prompt strings.
+                 # logits -> [1, seq, vocab]
+                 for request_str, cont_toks, logits in re_ord.get_cache(
+                     req_str=request_str,
+                     cxt_toks=ctx_tokens,
+                     cont_toks=cont_toks,
+                     logits=logits,
+                 ):
+                     cont_toks = torch.tensor(
+                         cont_toks, dtype=torch.long, device=self.device
+                     ).unsqueeze(0)  # [1, seq]
+                     max_equal = (greedy_tokens == cont_toks).all()
+
+                     # Obtain log-probs at the corresponding continuation token indices
+                     # last_token_slice = logits[:, -1, :].squeeze(0).tolist()
+                     logits = torch.gather(logits, 2, cont_toks.unsqueeze(-1)).squeeze(
+                         -1
+                     )  # [1, seq]
+
+                     # Answer: (log prob, is-exact-match)
+                     answer = (float(logits.sum()), bool(max_equal))
+
+                     res.append((answer, per_sample_time, 0, 0))
+
+                     self.cache_hook.add_partial("loglikelihood", request_str, answer)
+                     pbar.update(1)
+
+         pbar.close()
+
+         return re_ord.get_original(res)
+
      def _model_generate(self, context, max_length, stop, **generation_kwargs):
          # temperature = 0.0 if not set
          # if do_sample is false and temp==0.0:
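The substantive change in this override, relative to the upstream harness method, is the `time()` wrapper around the batched `_model_call` and the extended per-request result tuple `(answer, per_sample_time, 0, 0)`. A stripped-down sketch of that measurement pattern, using a stand-in `model_call` rather than the class above:

```python
from time import time

def timed_batch_call(model_call, batch):
    """Time one batched forward pass and amortize the cost per sample."""
    start = time()
    outputs = model_call(batch)  # one forward pass over the whole batch
    per_sample_time = (time() - start) / len(batch)
    # Each record carries (payload, end_to_end_time, prefilling_time, decoding_throughput);
    # the loglikelihood path above fills the last two slots with 0.
    return [(out, per_sample_time, 0, 0) for out in outputs]

# Usage with a trivial stand-in "model":
records = timed_batch_call(lambda xs: [x * 2 for x in xs], [1, 2, 3])
```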
src/backend/run_eval_suite.py CHANGED
@@ -1,13 +1,57 @@
  from lm_eval import evaluator
  from lm_eval.tasks import TaskManager
+ from lm_eval.api.metrics import mean
+ from lm_eval.api.task import ConfigurableTask
 
  from src.backend.manage_requests import EvalRequest
 
- from src.backend.tasks.xsum.task import XSum
- from src.backend.tasks.xsum.task_v2 import XSumv2
 
- from src.backend.tasks.cnndm.task import CNNDM
- from src.backend.tasks.cnndm.task_v2 import CNNDMv2
+ orig_process_results = ConfigurableTask.process_results
+ orig_aggregation = ConfigurableTask.aggregation
+ orig_higher_is_better = ConfigurableTask.higher_is_better
+
+ def process_results_decorator(func):
+     def wrapper(self, doc, results, *args, **kwargs):
+         processed_results = [r[0] for r in results]
+
+         end_to_end_time = sum([r[1] for r in results]) / len(results)
+         prefilling_time = sum([r[2] for r in results]) / len(results)
+         decoding_throughput = sum([r[3] for r in results]) / len(results)
+         # print(f"end_to_end_time: {end_to_end_time}, prefilling_time: {prefilling_time}, decoding_throughput: {decoding_throughput}")
+
+         result_dict = func(self, doc, processed_results, *args, **kwargs)
+         result_dict["end_to_end_time"] = end_to_end_time
+         result_dict["prefilling_time"] = prefilling_time
+         result_dict["decoding_throughput"] = decoding_throughput
+         return result_dict
+     return wrapper
+ ConfigurableTask.process_results = process_results_decorator(orig_process_results)
+
+ def aggregation_decorator(func):
+     def wrapper(self, *args, **kwargs):
+         aggregation_list = func(self, *args, **kwargs)
+         aggregation_list["end_to_end_time"] = mean
+         aggregation_list["prefilling_time"] = mean
+         aggregation_list["decoding_throughput"] = mean
+         return aggregation_list
+     return wrapper
+ ConfigurableTask.aggregation = aggregation_decorator(orig_aggregation)
+
+ def higher_is_better_decorator(func):
+     def wrapper(self, *args, **kwargs):
+         higher_is_better_dict = func(self, *args, **kwargs)
+         higher_is_better_dict["end_to_end_time"] = False
+         higher_is_better_dict["prefilling_time"] = False
+         higher_is_better_dict["decoding_throughput"] = True
+         return higher_is_better_dict
+     return wrapper
+ ConfigurableTask.higher_is_better = higher_is_better_decorator(orig_higher_is_better)
+
+ # from src.backend.tasks.xsum.task import XSum
+ # from src.backend.tasks.xsum.task_v2 import XSumv2
+
+ # from src.backend.tasks.cnndm.task import CNNDM
+ # from src.backend.tasks.cnndm.task_v2 import CNNDMv2
 
  from src.backend.tasks.selfcheckgpt.task import SelfCheckGPT
 
src/backend/tasks/measurement_task_utils.py CHANGED
@@ -12,7 +12,7 @@ def process_results_decorator(func):
          end_to_end_time = sum([r[1] for r in results]) / len(results)
          prefilling_time = sum([r[2] for r in results]) / len(results)
          decoding_throughput = sum([r[3] for r in results]) / len(results)
-         print(f"end_to_end_time: {end_to_end_time}, prefilling_time: {prefilling_time}, decoding_throughput: {decoding_throughput}")
+         # print(f"end_to_end_time: {end_to_end_time}, prefilling_time: {prefilling_time}, decoding_throughput: {decoding_throughput}")
 
          # Now call the original process_results with the processed results
          result_dict = func(self, doc, processed_results, *args, **kwargs)
src/display/about.py CHANGED
@@ -1,13 +1,25 @@
  from src.display.utils import ModelType
 
- TITLE = """<h1 align="center" id="space-title">MOE LLM GPU-Poor Leaderboard</h1>"""
+ TITLE = """<h1 align="center" id="space-title">OPEN-MOE-LLM-LEADERBOARD</h1>"""
 
  INTRODUCTION_TEXT = """
- 📐 The MOE LLM GPU-Poor Leaderboard aims to evaluate LLMs.
+ The OPEN-MOE-LLM-LEADERBOARD is specifically designed to assess the performance and efficiency of various Mixture of Experts (MoE) Large Language Models (LLMs). This initiative, driven by the open-source community, aims to comprehensively evaluate these advanced MoE LLMs. We extend our gratitude to Hugging Face for the GPU community grant that supported the initial debugging process, and to [NetMind.AI](https://netmind.ai/home) for their generous GPU donation, which ensures the continuous operation of the Leaderboard.
 
+ The OPEN-MOE-LLM-LEADERBOARD includes generation and multiple-choice tasks to measure the performance and efficiency of MoE LLMs.
 
 
- """
+ Tasks:
+ - **Generation Self-consistency** -- [SelfCheckGPT](https://github.com/potsawee/selfcheckgpt)
+ - **Multiple Choice Performance** -- [MMLU](https://arxiv.org/abs/2009.03300)
+
+ Columns and Metrics:
+ - Method: The MoE LLM inference framework.
+ - E2E(s): Average end-to-end generation time in seconds.
+ - PRE(s): Prefilling time of the input prompt in seconds.
+ - T/s: Decoding throughput in tokens per second.
+ - Precision: The precision of the evaluated model.
+
+ """
  LLM_BENCHMARKS_TEXT = f"""
 
  """
src/display/utils.py CHANGED
@@ -7,6 +7,11 @@ import pandas as pd
  def fields(raw_class):
      return [v for k, v in raw_class.__dict__.items() if k[:2] != "__" and k[-2:] != "__"]
 
+ E2Es = "E2E(s)"  # "End-to-end time (s)"
+ PREs = "PRE(s)"  # "Prefilling time (s)"
+ TS = "T/s"  # Decoding throughput (tok/s)
+ InFrame = "Method"  # "Inference framework"
+ MULTIPLE_CHOICEs = ["mmlu"]
 
  @dataclass
  class Task:
@@ -46,7 +51,7 @@ class Tasks(Enum):
 
      # # XXX include me back at some point
      selfcheck = Task("selfcheckgpt", "max-selfcheckgpt", "SelfCheckGPT")
-     mmlu = Task("mmlu", "acc", "MMLU/Acc (5-shot)")
+     mmlu = Task("mmlu", "acc", "MMLU")  # MMLU/Acc (5-shot)
 
 
  # These classes are for user facing column names,
@@ -71,20 +76,22 @@ auto_eval_column_dict.append(["model", ColumnContent, ColumnContent("Model", "ma
  # # auto_eval_column_dict.append(["average", ColumnContent, ColumnContent("Avg", "number", True)])
 
  # Inference framework
- auto_eval_column_dict.append(["inference_framework", ColumnContent, ColumnContent("Inference framework", "str", True)])
+ auto_eval_column_dict.append(["inference_framework", ColumnContent, ColumnContent(f"{InFrame}", "str", True)])
 
  for task in Tasks:
      auto_eval_column_dict.append([task.name, ColumnContent, ColumnContent(task.value.col_name, "number", True)])
      # System performance metrics
-     auto_eval_column_dict.append([f"{task.name}_end_to_end_time", ColumnContent, ColumnContent(f"{task.value.col_name} End-to-end time (s)", "number", True)])
-     auto_eval_column_dict.append([f"{task.name}_prefilling_time", ColumnContent, ColumnContent(f"{task.value.col_name} Prefilling time (s)", "number", True)])
-     auto_eval_column_dict.append([f"{task.name}_decoding_throughput", ColumnContent, ColumnContent(f"{task.value.col_name} Decoding throughput (tok/s)", "number", True)])
+     auto_eval_column_dict.append([f"{task.name}_end_to_end_time", ColumnContent, ColumnContent(f"{task.value.col_name}-{E2Es}", "number", True)])
+     if task.value.benchmark in MULTIPLE_CHOICEs:
+         continue
+     auto_eval_column_dict.append([f"{task.name}_prefilling_time", ColumnContent, ColumnContent(f"{task.value.col_name}-{PREs}", "number", True)])
+     auto_eval_column_dict.append([f"{task.name}_decoding_throughput", ColumnContent, ColumnContent(f"{task.value.col_name}-{TS}", "number", True)])
 
  # Model information
  auto_eval_column_dict.append(["model_type", ColumnContent, ColumnContent("Type", "str", False)])
  auto_eval_column_dict.append(["architecture", ColumnContent, ColumnContent("Architecture", "str", False)])
  auto_eval_column_dict.append(["weight_type", ColumnContent, ColumnContent("Weight type", "str", False, True)])
- auto_eval_column_dict.append(["precision", ColumnContent, ColumnContent("Precision", "str", False)])
+ auto_eval_column_dict.append(["precision", ColumnContent, ColumnContent("Precision", "str", True)])
  auto_eval_column_dict.append(["license", ColumnContent, ColumnContent("Hub License", "str", False)])
  auto_eval_column_dict.append(["params", ColumnContent, ColumnContent("#Params (B)", "number", False)])
  auto_eval_column_dict.append(["likes", ColumnContent, ColumnContent("Hub ❤️", "number", False)])
@@ -144,7 +151,7 @@ class InferenceFramework(Enum):
 
      def to_str(self):
          return self.value.name
-
+
      @staticmethod
      def from_str(inference_framework: str):
          if inference_framework in ["moe-infinity"]:
@@ -152,7 +159,7 @@ class InferenceFramework(Enum):
          if inference_framework in ["hf-chat"]:
              return InferenceFramework.HF_Chat
          return InferenceFramework.Unknown
-
+
 
  class WeightType(Enum):
      Adapter = ModelDetails("Adapter")
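For orientation (the consumer of this registry is not part of the diff): a list of `[attribute_name, type, default]` triples like `auto_eval_column_dict` is commonly materialized into a frozen dataclass with `dataclasses.make_dataclass`, so that columns can be referenced as attributes. A minimal sketch under that assumption, with a simplified `ColumnContent`:

```python
from dataclasses import dataclass, field, make_dataclass

@dataclass(frozen=True)
class ColumnContent:
    name: str
    type: str
    displayed_by_default: bool = True

# Tiny registry in the same [attr_name, type, default] shape as auto_eval_column_dict.
columns = [
    ["model", ColumnContent, field(default=ColumnContent("Model", "markdown"))],
    ["precision", ColumnContent, field(default=ColumnContent("Precision", "str"))],
]

AutoEvalColumn = make_dataclass("AutoEvalColumn", columns, frozen=True)
print(AutoEvalColumn().precision.name)  # "Precision"
```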
src/populate.py CHANGED
@@ -12,7 +12,7 @@ from src.leaderboard.read_evals import get_raw_eval_results, EvalResult, update_
 
  from src.backend.envs import Tasks as BackendTasks
  from src.display.utils import Tasks
-
+ from src.display.utils import E2Es, PREs, TS
 
  def get_leaderboard_df(
      results_path: str,
@@ -47,9 +47,9 @@ def get_leaderboard_df(
 
      # bm_to_name_map = {bm: name for name, bm in name_to_bm_map.items()}
      system_metrics_to_name_map = {
-         "end_to_end_time": "End-to-end time (s)",
-         "prefilling_time": "Prefilling time (s)",
-         "decoding_throughput": "Decoding throughput (tok/s)",
+         "end_to_end_time": f"{E2Es}",
+         "prefilling_time": f"{PREs}",
+         "decoding_throughput": f"{TS}",
      }
 
      all_data_json = []