AppleSwing
committed on
Merge branch 'main' into pr/15
- README.md +58 -7
- requirements.txt +1 -1
- src/backend/hflm_with_measurement.py +220 -0
- src/backend/run_eval_suite.py +48 -4
- src/backend/tasks/measurement_task_utils.py +1 -1
- src/display/about.py +15 -3
- src/display/utils.py +15 -8
- src/populate.py +4 -4
README.md
CHANGED
@@ -15,20 +15,71 @@ tags:
 
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
-
+# Contributing to Open-MOE-LLM-Leaderboard
 
-
+Thank you for your interest in contributing to the Open-MOE-LLM-Leaderboard project! We welcome contributions from everyone. Below you'll find guidance on how to set up your development environment, understand our architecture, and contribute effectively. If you have any questions or wish to discuss your contributions, please reach out to Yao Fu via email at [Y.Fu@ed.ac.uk](mailto:y.fu@ed.ac.uk).
+
+## What We're Looking For in Contributions
+
+We are looking for contributions in several key areas to enhance the Open-MOE-LLM-Leaderboard project:
+
+1. **General Bug Fixes/Reports**: We welcome reports of any bugs found in the frontend interface or backend, as well as fixes for these issues.
+
+2. **Adding New Tasks (Benchmark Datasets)**: If you have ideas for new benchmark datasets that could be added, your contributions would be greatly appreciated.
+
+3. **Supporting New Inference Frameworks**: Expanding our project to support new inference frameworks is crucial for our growth. If you can contribute in this area, please reach out.
+
+4. **Testing More Models**: To make our leaderboard as comprehensive as possible, we need to test a wide range of models. Contributions in this area are highly valuable.
+
+Documentation is currently of lower priority, but if you have thoughts or suggestions, please feel free to raise them.
+
+Your contributions are crucial to the success and improvement of the Open-MOE-LLM-Leaderboard project. We look forward to collaborating with you.
+
+## Development Setup
+
+To start contributing, set up your development environment as follows:
 
 ```bash
-conda create -n
-conda activate
+conda create -n leaderboard python=3.10
+conda activate leaderboard
 pip install -r requirements.txt
+pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ moe-infinity
+pip install pydantic==2.6.4  # Resolves a dependency conflict with moe-infinity
+python -m spacy download en  # Required for selfcheckgpt
 ```
+
+## Architecture Overview
+
+The Open-MOE-LLM-Leaderboard project uses the following architecture:
+
+- **User Interface (Gradio)** ->upload-> **HuggingFace Dataset (Request)** ->download-> **Backend GPU Server** ->upload-> **HuggingFace Dataset (Result)** ->download-> **User Interface (Gradio)**
+
+In brief:
+1. Users submit model benchmarking requests through the Gradio interface ([app.py](./app.py)). These requests are then recorded in a HuggingFace dataset ([sparse-generative-ai/requests](https://huggingface.co/datasets/sparse-generative-ai/requests)).
+2. The backend ([backend-cli.py](./backend-cli.py)), running on a GPU server, processes these requests, performs the benchmarking tasks, and uploads the results to another HuggingFace dataset ([sparse-generative-ai/results](https://huggingface.co/datasets/sparse-generative-ai/results)).
+3. Finally, the Gradio interface retrieves and displays these results to the users.
+
+## Running the Gradio Interface
+
+To launch the Gradio interface, execute:
+
+```bash
+python app.py
+```
 
-
+Then, open your browser and navigate to http://127.0.0.1:7860.
 
-
+## Running the Backend
+
+To start the backend process, use:
 
 ```bash
 python backend-cli.py --debug
 ```
+
+For additional details, please consult the [backend-cli.py](./backend-cli.py) script.
+
+---
+
+We look forward to your contributions and are here to help guide you through the process. Thank you for supporting the Open-MOE-LLM-Leaderboard project!
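The Architecture Overview added above describes a round trip through two HuggingFace dataset repos. As a rough, hypothetical sketch of that flow (not part of this commit; the real file layout and field names used by backend-cli.py may differ), the backend's interaction with the request and result repos could look like this:

```python
# Hypothetical sketch of the request/result round trip described in the
# Architecture Overview; the repo ids come from the README, everything else
# (file paths, JSON layout) is illustrative only.
import json
from huggingface_hub import HfApi, hf_hub_download

REQUESTS_REPO = "sparse-generative-ai/requests"
RESULTS_REPO = "sparse-generative-ai/results"

def fetch_request(path_in_repo: str) -> dict:
    # Pull one pending request file from the requests dataset.
    local_path = hf_hub_download(
        repo_id=REQUESTS_REPO, filename=path_in_repo, repo_type="dataset"
    )
    with open(local_path) as f:
        return json.load(f)

def upload_result(result: dict, path_in_repo: str) -> None:
    # Push the finished benchmark result so the Gradio UI can display it.
    HfApi().upload_file(
        path_or_fileobj=json.dumps(result, indent=2).encode(),
        path_in_repo=path_in_repo,
        repo_id=RESULTS_REPO,
        repo_type="dataset",
    )
```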
requirements.txt
CHANGED
@@ -18,7 +18,7 @@ tqdm
 wandb
 transformers>=4.36.0
 tokenizers>=0.15.0
-lm_eval[ifeval] @ git+https://github.com/EleutherAI/lm-evaluation-harness.git
+lm_eval[ifeval] @ git+https://github.com/EleutherAI/lm-evaluation-harness.git@v0.4.2
 accelerate
 sentencepiece
 langdetect
src/backend/hflm_with_measurement.py
CHANGED
@@ -68,6 +68,226 @@ class HFLMWithMeasurement(HFLM):
     def __init__(self, **kwargs):
         super().__init__(**kwargs)
 
+    def _loglikelihood_tokens(
+        self,
+        requests: List[Tuple[Tuple[str, str], List[int], List[int]]],
+        disable_tqdm: bool = False,
+        override_bs: int = None,
+    ) -> List[Tuple[float, bool]]:
+        # TODO: implement some kind of efficient-request-middleware that lumps together requests with the same context
+        res = []
+
+        def _collate(req: Tuple[Tuple[str, str], List[int], List[int]]):
+            """Defines the key for the sorted method"""
+            # the negative sign on len(toks) sorts descending - this has a few advantages:
+            # - time estimates will always be over not underestimates, which is more useful for planning
+            # - to know the size of a batch when going through the list, you know the first one is always the batch
+            #   padded context length. this is useful to simplify the batching logic and more importantly to make
+            #   automatic adaptive batches much much easier to implement
+            # - any OOMs will happen right away rather than near the end
+
+            toks = req[1] + req[2]
+            return -len(toks), tuple(toks)
+
+        def _lookup_one_token_cont(req: Tuple[Tuple[str, str], List[int], List[int]]):
+            """Defines the key to group and lookup one-token continuations"""
+            # Use with group_by="contexts" (optional)
+            # allows for the creation of a lookup, so we can reuse logits in case of one-token continuations.
+            # speeds up some multiple-choice tasks proportionally to the number of choices.
+            # groups requests by context+continuation[:-1] and infers on one request/group.
+            return req[-2] + req[-1][:-1]
+
+        re_ord = Collator(
+            requests,
+            sort_fn=_collate,
+            group_by="contexts"
+            if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM
+            and self.logits_cache
+            else None,
+            group_fn=_lookup_one_token_cont,
+        )
+
+        # automatic (variable) batch size detection for vectorization
+        # pull longest context sample from request
+        n_reordered_requests = len(re_ord)
+        batch_size = (
+            self.batch_size
+            if self.batch_size != "auto"
+            else override_bs
+            if override_bs is not None
+            else 0
+        )
+        batch_fn = (
+            self._batch_scheduler
+            if self.batch_size == "auto"
+            and n_reordered_requests > 0
+            and not override_bs
+            else None
+        )
+
+        chunks = re_ord.get_batched(n=batch_size, batch_fn=batch_fn)
+        pbar = tqdm(
+            total=len(requests),
+            disable=(disable_tqdm or (self.rank != 0)),
+            desc="Running loglikelihood requests",
+        )
+        for chunk in chunks:
+            inps = []
+            cont_toks_list = []
+            inplens = []
+
+            conts = []
+            encoder_attns = []
+
+            padding_len_inp = None
+            padding_len_cont = None
+            # because vectorizing is annoying, we first convert each (context, continuation) pair to padded
+            # tensors, then we pack them together into a batch, call the model, and then pick it all apart
+            # again because vectorizing is annoying
+
+            for _, context_enc, continuation_enc in chunk:
+                # sanity check
+                assert len(context_enc) > 0
+                assert len(continuation_enc) > 0
+                assert len(continuation_enc) <= self.max_length
+
+                # how this all works (illustrated on a causal decoder-only setup):
+                #          CTX      CONT
+                # inp    0 1 2 3|4 5 6 7 8 9   <- last token is deleted by inp[:, :-1]
+                # model  \               \
+                # logits   1 2 3|4 5 6 7 8 9   <- the ctx half gets tossed out by the
+                # cont_toks      4 5 6 7 8 9      [:, -len(continuation_enc):, :self.vocab_size] slice
+
+                # when too long to fit in context, truncate from the left
+                if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM:
+                    inp = torch.tensor(
+                        (context_enc + continuation_enc)[-(self.max_length + 1) :][:-1],
+                        dtype=torch.long,
+                        device=self.device,
+                    )
+                    (inplen,) = inp.shape
+                elif self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM:
+                    inp = torch.tensor(
+                        (context_enc)[-self.max_length :],
+                        dtype=torch.long,
+                        device=self.device,
+                    )
+                    (inplen,) = inp.shape
+
+                    # build encoder attn masks
+                    encoder_attns.append(torch.ones_like(inp))
+
+                    cont = torch.tensor(
+                        (continuation_enc)[-self.max_length :],
+                        # TODO: left-shift these?
+                        # TODO: our code assumes we never end up truncating conts for either model type
+                        dtype=torch.long,
+                        device=self.device,
+                    )
+                    (contlen,) = cont.shape
+
+                    conts.append(cont)
+
+                    padding_len_cont = (
+                        max(padding_len_cont, contlen)
+                        if padding_len_cont is not None
+                        else contlen
+                    )
+
+                padding_len_inp = (
+                    max(padding_len_inp, inplen)
+                    if padding_len_inp is not None
+                    else inplen
+                )
+
+                inps.append(inp)  # [1, inp_length]
+                cont_toks_list.append(continuation_enc)
+                inplens.append(inplen)
+
+            # create encoder attn mask and batched conts, if seq2seq
+            call_kwargs = {}
+            if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM:
+                batched_inps = pad_and_concat(
+                    padding_len_inp, inps, padding_side="right"
+                )  # [batch, padding_len_inp]
+            elif self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM:
+                # TODO: left-pad encoder inps and mask?
+                batched_inps = pad_and_concat(
+                    padding_len_inp, inps
+                )  # [batch, padding_len_inp]
+                batched_conts = pad_and_concat(
+                    padding_len_cont, conts
+                )  # [batch, padding_len_cont]
+                batched_encoder_mask = pad_and_concat(
+                    padding_len_inp, encoder_attns
+                )  # [batch, padding_len_inp]
+                call_kwargs = {
+                    "attn_mask": batched_encoder_mask,
+                    "labels": batched_conts,
+                }
+
+            start = time()
+            intermediate_res = self._model_call(batched_inps, **call_kwargs)
+            end = time()
+            multi_logits = F.log_softmax(
+                intermediate_res, dim=-1
+            )  # [batch, padding_length (inp or cont), vocab]
+            per_sample_time = (end - start) / len(multi_logits)
+
+            for (request_str, ctx_tokens, _), logits, inplen, cont_toks in zip(
+                chunk, multi_logits, inplens, cont_toks_list
+            ):
+                # Slice to original seq length
+                contlen = len(cont_toks)
+                # take only logits in the continuation
+                # (discard context toks if decoder-only; discard right-padding)
+                # also discards + checks for "virtual tokens" in the causal LM's input window
+                # from prompt/prefix tuning tokens, if applicable
+                ctx_len = (
+                    inplen + (logits.shape[0] - padding_len_inp)
+                    if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM
+                    else None
+                )
+                logits = self._select_cont_toks(logits, contlen=contlen, inplen=ctx_len)
+                logits = logits.unsqueeze(0)  # [1, seq, vocab]
+
+                # Check if per-token argmax is exactly equal to continuation
+                greedy_tokens = logits.argmax(dim=-1)
+
+                # check for one-token continuation cache hits.
+                # noop in case group_by != "contexts" or no cache hit and returns the
+                # original args. Otherwise, expands the logits batch dimension and yields each
+                # batch along with matching continuation tokens and prompt strings.
+                # logits -> [1, seq, vocab]
+                for request_str, cont_toks, logits in re_ord.get_cache(
+                    req_str=request_str,
+                    cxt_toks=ctx_tokens,
+                    cont_toks=cont_toks,
+                    logits=logits,
+                ):
+                    cont_toks = torch.tensor(
+                        cont_toks, dtype=torch.long, device=self.device
+                    ).unsqueeze(0)  # [1, seq]
+                    max_equal = (greedy_tokens == cont_toks).all()
+
+                    # Obtain log-probs at the corresponding continuation token indices
+                    # last_token_slice = logits[:, -1, :].squeeze(0).tolist()
+                    logits = torch.gather(logits, 2, cont_toks.unsqueeze(-1)).squeeze(
+                        -1
+                    )  # [1, seq]
+
+                    # Answer: (log prob, is-exact-match)
+                    answer = (float(logits.sum()), bool(max_equal))
+
+                    res.append((answer, per_sample_time, 0, 0))
+
+                    self.cache_hook.add_partial("loglikelihood", request_str, answer)
+                    pbar.update(1)
+
+        pbar.close()
+
+        return re_ord.get_original(res)
+
     def _model_generate(self, context, max_length, stop, **generation_kwargs):
         # temperature = 0.0 if not set
         # if do_sample is false and temp==0.0:
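The new `_loglikelihood_tokens` above follows the upstream lm-evaluation-harness implementation, but wraps the batched `self._model_call` in wall-clock timing and appends each answer as `(answer, per_sample_time, 0, 0)`; the two trailing zeros fill the prefilling-time and decoding-throughput slots that the decorators in src/backend/run_eval_suite.py average over. A standalone sketch of that timing pattern (illustrative only, using a generic HF-style causal LM rather than the class above):

```python
from time import time

import torch

def timed_forward(model: torch.nn.Module, batched_inps: torch.Tensor):
    # Wall-clock the batched forward pass and amortize the cost over the batch,
    # mirroring the start/end timing placed around self._model_call above.
    start = time()
    with torch.no_grad():
        logits = model(batched_inps).logits  # assumes a HF CausalLM-style output
    per_sample_time = (time() - start) / batched_inps.shape[0]
    return logits, per_sample_time
```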
src/backend/run_eval_suite.py
CHANGED
@@ -1,13 +1,57 @@
 from lm_eval import evaluator
 from lm_eval.tasks import TaskManager
+from lm_eval.api.metrics import mean
+from lm_eval.api.task import ConfigurableTask
 
 from src.backend.manage_requests import EvalRequest
 
-from src.backend.tasks.xsum.task import XSum
-from src.backend.tasks.xsum.task_v2 import XSumv2
 
-
-
+orig_process_results = ConfigurableTask.process_results
+orig_aggregation = ConfigurableTask.aggregation
+orig_higher_is_better = ConfigurableTask.higher_is_better
+
+def process_results_decorator(func):
+    def wrapper(self, doc, results, *args, **kwargs):
+        processed_results = [r[0] for r in results]
+
+        end_to_end_time = sum([r[1] for r in results]) / len(results)
+        prefilling_time = sum([r[2] for r in results]) / len(results)
+        decoding_throughput = sum([r[3] for r in results]) / len(results)
+        # print(f"end_to_end_time: {end_to_end_time}, prefilling_time: {prefilling_time}, decoding_throughput: {decoding_throughput}")
+
+        result_dict = func(self, doc, processed_results, *args, **kwargs)
+        result_dict["end_to_end_time"] = end_to_end_time
+        result_dict["prefilling_time"] = prefilling_time
+        result_dict["decoding_throughput"] = decoding_throughput
+        return result_dict
+    return wrapper
+ConfigurableTask.process_results = process_results_decorator(orig_process_results)
+
+def aggregation_decorator(func):
+    def wrapper(self, *args, **kwargs):
+        aggregation_list = func(self, *args, **kwargs)
+        aggregation_list["end_to_end_time"] = mean
+        aggregation_list["prefilling_time"] = mean
+        aggregation_list["decoding_throughput"] = mean
+        return aggregation_list
+    return wrapper
+ConfigurableTask.aggregation = aggregation_decorator(orig_aggregation)
+
+def higher_is_better_decorator(func):
+    def wrapper(self, *args, **kwargs):
+        higher_is_better_dict = func(self, *args, **kwargs)
+        higher_is_better_dict["end_to_end_time"] = False
+        higher_is_better_dict["prefilling_time"] = False
+        higher_is_better_dict["decoding_throughput"] = True
+        return higher_is_better_dict
+    return wrapper
+ConfigurableTask.higher_is_better = higher_is_better_decorator(orig_higher_is_better)
+
+# from src.backend.tasks.xsum.task import XSum
+# from src.backend.tasks.xsum.task_v2 import XSumv2
+
+# from src.backend.tasks.cnndm.task import CNNDM
+# from src.backend.tasks.cnndm.task_v2 import CNNDMv2
 
 from src.backend.tasks.selfcheckgpt.task import SelfCheckGPT
 
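The three decorators above monkey-patch `ConfigurableTask` so that every task's `process_results` output also carries the averaged timing fields produced by the measurement-aware model, and `aggregation` / `higher_is_better` know how to treat them. A toy demonstration of the `process_results` wrapper, assuming the module above is importable; `DummyTask` and its numbers are made up purely for illustration:

```python
from src.backend.run_eval_suite import process_results_decorator

class DummyTask:
    def process_results(self, doc, results):
        # A plain task only sees the answers, not the timing fields.
        return {"acc": float(results[0])}

DummyTask.process_results = process_results_decorator(DummyTask.process_results)

# Each result is (answer, end_to_end_time, prefilling_time, decoding_throughput).
out = DummyTask().process_results(
    doc={}, results=[(1.0, 0.8, 0.1, 42.0), (1.0, 1.2, 0.3, 38.0)]
)
# out == {"acc": 1.0, "end_to_end_time": 1.0, "prefilling_time": 0.2, "decoding_throughput": 40.0}
```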
src/backend/tasks/measurement_task_utils.py
CHANGED
@@ -12,7 +12,7 @@ def process_results_decorator(func):
         end_to_end_time = sum([r[1] for r in results]) / len(results)
         prefilling_time = sum([r[2] for r in results]) / len(results)
         decoding_throughput = sum([r[3] for r in results]) / len(results)
-        print(f"end_to_end_time: {end_to_end_time}, prefilling_time: {prefilling_time}, decoding_throughput: {decoding_throughput}")
+        # print(f"end_to_end_time: {end_to_end_time}, prefilling_time: {prefilling_time}, decoding_throughput: {decoding_throughput}")
 
         # Now call the original process_results with the processed results
         result_dict = func(self, doc, processed_results, *args, **kwargs)
src/display/about.py
CHANGED
@@ -1,13 +1,25 @@
 from src.display.utils import ModelType
 
-TITLE = """<h1 align="center" id="space-title">MOE
+TITLE = """<h1 align="center" id="space-title">OPEN-MOE-LLM-LEADERBOARD</h1>"""
 
 INTRODUCTION_TEXT = """
-
+The OPEN-MOE-LLM-LEADERBOARD is specifically designed to assess the performance and efficiency of various Mixture of Experts (MoE) Large Language Models (LLMs). This initiative, driven by the open-source community, aims to comprehensively evaluate these advanced MoE LLMs. We extend our gratitude to Huggingface for the GPU community grant that supported the initial debugging process, and to [NetMind.AI](https://netmind.ai/home) for their generous GPU donation, which ensures the continuous operation of the Leaderboard.
 
+The OPEN-MOE-LLM-LEADERBOARD includes generation and multiple-choice tasks to measure the performance and efficiency of MoE LLMs.
 
-"""
+Tasks:
+- **Generation Self-consistency** -- [SelfCheckGPT](https://github.com/potsawee/selfcheckgpt)
+- **Multiple Choice Performance** -- [MMLU](https://arxiv.org/abs/2009.03300)
+
+Columns and Metrics:
+- Method: The MoE LLM inference framework.
+- E2E(s): Average end-to-end generation time in seconds.
+- PRE(s): Prefilling time of the input prompt in seconds.
+- T/s: Decoding throughput in tokens per second.
+- Precision: The precision of the evaluated model.
+
+"""
 LLM_BENCHMARKS_TEXT = f"""
 
 """
src/display/utils.py
CHANGED
@@ -7,6 +7,11 @@ import pandas as pd
 def fields(raw_class):
     return [v for k, v in raw_class.__dict__.items() if k[:2] != "__" and k[-2:] != "__"]
 
+E2Es = "E2E(s)"  # "End-to-end time (s)"
+PREs = "PRE(s)"  # "Prefilling time (s)"
+TS = "T/s"  # "Decoding throughput (tok/s)"
+InFrame = "Method"  # "Inference framework"
+MULTIPLE_CHOICEs = ["mmlu"]
 
 @dataclass
 class Task:
@@ -46,7 +51,7 @@ class Tasks(Enum):
 
     # # XXX include me back at some point
     selfcheck = Task("selfcheckgpt", "max-selfcheckgpt", "SelfCheckGPT")
-    mmlu = Task("mmlu", "acc", "MMLU/Acc (5-shot)
+    mmlu = Task("mmlu", "acc", "MMLU")  # MMLU/Acc (5-shot)
 
 
 # These classes are for user facing column names,
@@ -71,20 +76,22 @@ auto_eval_column_dict.append(["model", ColumnContent, ColumnContent("Model", "ma
 # # auto_eval_column_dict.append(["average", ColumnContent, ColumnContent("Avg", "number", True)])
 
 # Inference framework
-auto_eval_column_dict.append(["inference_framework", ColumnContent, ColumnContent("
+auto_eval_column_dict.append(["inference_framework", ColumnContent, ColumnContent(f"{InFrame}", "str", True)])
 
 for task in Tasks:
     auto_eval_column_dict.append([task.name, ColumnContent, ColumnContent(task.value.col_name, "number", True)])
     # System performance metrics
-    auto_eval_column_dict.append([f"{task.name}_end_to_end_time", ColumnContent, ColumnContent(f"{task.value.col_name}
-
-
+    auto_eval_column_dict.append([f"{task.name}_end_to_end_time", ColumnContent, ColumnContent(f"{task.value.col_name}-{E2Es}", "number", True)])
+    if task.value.benchmark in MULTIPLE_CHOICEs:
+        continue
+    auto_eval_column_dict.append([f"{task.name}_prefilling_time", ColumnContent, ColumnContent(f"{task.value.col_name}-{PREs}", "number", True)])
+    auto_eval_column_dict.append([f"{task.name}_decoding_throughput", ColumnContent, ColumnContent(f"{task.value.col_name}-{TS}", "number", True)])
 
 # Model information
 auto_eval_column_dict.append(["model_type", ColumnContent, ColumnContent("Type", "str", False)])
 auto_eval_column_dict.append(["architecture", ColumnContent, ColumnContent("Architecture", "str", False)])
 auto_eval_column_dict.append(["weight_type", ColumnContent, ColumnContent("Weight type", "str", False, True)])
-auto_eval_column_dict.append(["precision", ColumnContent, ColumnContent("Precision", "str",
+auto_eval_column_dict.append(["precision", ColumnContent, ColumnContent("Precision", "str", True)])
 auto_eval_column_dict.append(["license", ColumnContent, ColumnContent("Hub License", "str", False)])
 auto_eval_column_dict.append(["params", ColumnContent, ColumnContent("#Params (B)", "number", False)])
 auto_eval_column_dict.append(["likes", ColumnContent, ColumnContent("Hub ❤️", "number", False)])
@@ -144,7 +151,7 @@ class InferenceFramework(Enum):
 
     def to_str(self):
        return self.value.name
-
+
     @staticmethod
     def from_str(inference_framework: str):
         if inference_framework in ["moe-infinity"]:
@@ -152,7 +159,7 @@ class InferenceFramework(Enum):
         if inference_framework in ["hf-chat"]:
             return InferenceFramework.HF_Chat
         return InferenceFramework.Unknown
-
+
 
 class WeightType(Enum):
     Adapter = ModelDetails("Adapter")
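To make the column-building loop above concrete, here is a small standalone sketch (the constants and task values are copied from the diff; the loop is simplified to collect only display names) showing which leaderboard columns each task ends up with; multiple-choice tasks such as MMLU skip the prefilling and throughput columns:

```python
E2Es, PREs, TS = "E2E(s)", "PRE(s)", "T/s"
MULTIPLE_CHOICEs = ["mmlu"]
# (benchmark, display name) pairs as defined in the Tasks enum above.
tasks = [("selfcheckgpt", "SelfCheckGPT"), ("mmlu", "MMLU")]

columns = []
for benchmark, col_name in tasks:
    columns.append(col_name)
    columns.append(f"{col_name}-{E2Es}")
    if benchmark in MULTIPLE_CHOICEs:
        continue  # no prefilling/throughput columns for multiple-choice tasks
    columns.append(f"{col_name}-{PREs}")
    columns.append(f"{col_name}-{TS}")

print(columns)
# ['SelfCheckGPT', 'SelfCheckGPT-E2E(s)', 'SelfCheckGPT-PRE(s)', 'SelfCheckGPT-T/s', 'MMLU', 'MMLU-E2E(s)']
```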
src/populate.py
CHANGED
@@ -12,7 +12,7 @@ from src.leaderboard.read_evals import get_raw_eval_results, EvalResult, update_
 
 from src.backend.envs import Tasks as BackendTasks
 from src.display.utils import Tasks
-
+from src.display.utils import E2Es, PREs, TS
 
 def get_leaderboard_df(
     results_path: str,
@@ -47,9 +47,9 @@ def get_leaderboard_df(
 
     # bm_to_name_map = {bm: name for name, bm in name_to_bm_map.items()}
     system_metrics_to_name_map = {
-        "end_to_end_time": "
-        "prefilling_time": "
-        "decoding_throughput": "
+        "end_to_end_time": f"{E2Es}",
+        "prefilling_time": f"{PREs}",
+        "decoding_throughput": f"{TS}",
     }
 
     all_data_json = []