TITLE = """<h1 align="center" id="space-title">🤗 LLM-Perf Leaderboard 🏋️</h1>"""

INTRODUCTION_TEXT = f"""
The 🤗 LLM-Perf Leaderboard 🏋️ aims to benchmark the performance (latency, throughput & memory) of Large Language Models (LLMs) across different hardware, backends and optimizations using [Optimum-Benchmark](https://github.com/huggingface/optimum-benchmark) and [Optimum](https://github.com/huggingface/optimum) flavors.

Anyone from the community can request a model or a hardware/backend/optimization configuration for automated benchmarking:
- Model evaluation requests should be made in the [🤗 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) and will be added to the 🤗 LLM-Perf Leaderboard 🏋️ automatically.
- Hardware/Backend/Optimization performance requests should be made in the [community discussions](https://huggingface.co/spaces/optimum/llm-perf-leaderboard/discussions) to assess their relevance and feasibility.
"""

ABOUT_TEXT = """<h3>About the 🤗 LLM-Perf Leaderboard 🏋️</h3>
<ul>
    <li>To avoid communication-dependent results, only one GPU is used.</li>
    <li>Score is the average evaluation score obtained from the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard">🤗 Open LLM Leaderboard</a>.</li>
    <li>LLMs are run with a batch size of 1 and a prompt size of 512 tokens, generating 1000 new tokens.</li>
    <li>Peak memory is measured in MB during the generate pass using Py3NVML while ensuring the GPU's isolation.</li>
    <li>Energy consumption is measured in kWh using CodeCarbon and taking into consideration the GPU, CPU, RAM and location of the machine.</li>
    <li>Each pair of (Model Type, Weight Class) is represented by its best-scoring model. This LLM is the one used for all the hardware/backend/optimization experiments.</li>
</ul>
"""

EXAMPLE_CONFIG_TEXT = """
Here's an example of the configuration file used to benchmark the models with Optimum-Benchmark:
```yaml
defaults:
  - backend: pytorch # default backend
  - benchmark: inference # default benchmark
  - experiment # inheriting from experiment config
  - _self_ # for hydra 1.1 compatibility
  - override hydra/job_logging: colorlog # colorful logging
  - override hydra/hydra_logging: colorlog # colorful logging

hydra:
  run:
    dir: llm-experiments/{experiment_name}
  job:
    chdir: true

experiment_name: {experiment_name}

model: {model}

device: cuda

backend:
  no_weights: true
  delete_cache: true
  torch_dtype: float16
  quantization_strategy: gptq
  bettertransformer: true

benchmark:
  memory: true

  input_shapes:
    batch_size: 1
    sequence_length: 512

  new_tokens: 1000
```
"""
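
# A minimal sketch, not part of the original module: the template above
# contains the placeholders {experiment_name} and {model}, and this
# hypothetical helper shows one way they could be filled in before the text
# is displayed. The function name and its parameters are assumptions made
# for illustration only.
def render_example_config(template: str, experiment_name: str, model: str) -> str:
    """Return `template` with its two placeholders substituted."""
    # str.replace is used rather than str.format so that any other braces
    # that may appear in a template are left untouched.
    return template.replace("{experiment_name}", experiment_name).replace("{model}", model)


# Example usage (the values are illustrative):
#   render_example_config(EXAMPLE_CONFIG_TEXT, "my-experiment", "gpt2")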


CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results."
CITATION_BUTTON_TEXT = r"""@misc{llm-perf-leaderboard,
  author = {Ilyas Moutawwakil and Régis Pierrard},
  title = {LLM-Perf Leaderboard},
  year = {2023},
  publisher = {Hugging Face},
  howpublished = "\url{https://huggingface.co/spaces/optimum/llm-perf-leaderboard}",
}
@software{optimum-benchmark,
  author = {Ilyas Moutawwakil and Régis Pierrard},
  publisher = {Hugging Face},
  title = {Optimum-Benchmark: A framework for benchmarking the performance of Transformers models with different hardwares, backends and optimizations.},
}
"""