Jae-Won Chung committed
Commit d846882
1 Parent(s): 511ed5e

Better About tab

Files changed (1):
  1. LEADERBOARD.md +16 -12
LEADERBOARD.md CHANGED
@@ -1,14 +1,23 @@
  The goal of the ML.ENERGY Leaderboard is to give people a sense of how much **energy** LLMs would consume.
 
+ The code for the leaderboard, backing data, and scripts for benchmarking are all open-source in our [repository](https://github.com/ml-energy/leaderboard).
+ We'll see you at the [Discussion board](https://github.com/ml-energy/leaderboard/discussions), where you can ask questions, suggest improvement ideas, or just discuss leaderboard results!
+
  ## Columns
 
- - `gpu`: NVIDIA GPU model name. Note that NLP evaluation was only run once on our A40 GPUs, so this column only changes system-level measurements like latency and energy.
+ - `gpu`: NVIDIA GPU model name.
  - `task`: Name of the task. See *Tasks* below for details.
  - `energy` (J): The average GPU energy consumed by the model to generate a response.
  - `throughput` (token/s): The average number of tokens generated per second.
  - `latency` (s): The average time it took for the model to generate a response.
  - `response_length` (token): The average number of tokens in the model's response.
  - `parameters`: The number of parameters the model has, in billions.
+ - `arc`: [AI2 Reasoning Challenge](https://allenai.org/data/arc)'s `challenge` dataset. Measures the capability to do grade-school level question answering, 25-shot.
+ - `hellaswag`: [HellaSwag dataset](https://allenai.org/data/hellaswag). Measures grounded commonsense reasoning, 10-shot.
+ - `truthfulqa`: [TruthfulQA dataset](https://arxiv.org/abs/2109.07958). Measures truthfulness against questions that elicit common falsehoods, 0-shot.
+
+ NLP evaluation metrics (`arc`, `hellaswag`, and `truthfulqa`) were only run once each on A40 GPUs because their results do not depend on the GPU type.
+ Hence, all GPU model rows for the same model share the same NLP evaluation numbers.
 
  ## Tasks
 
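To make the `energy`, `latency`, and `throughput` columns above concrete, here is a minimal sketch of how GPU energy and time could be measured for a single response. It is not the leaderboard's actual benchmark script (that lives in the repository linked in the diff); it assumes NVML's cumulative energy counter exposed through `pynvml`, and the model name is only a placeholder.

```python
import time

import pynvml
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "lmsys/vicuna-7b-v1.3"  # placeholder; the leaderboard covers many models

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).to("cuda:0")

prompt = "Explain how GPUs consume energy during LLM inference."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

# NVML reports cumulative GPU energy in millijoules, so the energy of one
# response is the difference between two readings taken around generate().
energy_start_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
time_start = time.monotonic()

outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
torch.cuda.synchronize()

latency_s = time.monotonic() - time_start
energy_j = (pynvml.nvmlDeviceGetTotalEnergyConsumption(handle) - energy_start_mj) / 1000.0

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"energy={energy_j:.1f} J  latency={latency_s:.2f} s  throughput={new_tokens / latency_s:.1f} token/s")
```

Since the columns report averages, an actual run would repeat this over the full prompt set and average the per-response readings.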
@@ -39,6 +48,7 @@ Find our benchmark script for one model [here](https://github.com/ml-energy/lead
 
  - NVIDIA A40 GPU
  - NVIDIA A100 GPU
+ - NVIDIA V100 GPU
 
  ### Parameters
 
@@ -50,17 +60,11 @@ Find our benchmark script for one model [here](https://github.com/ml-energy/lead
  - Temperature 0.7
  - Repetition penalty 1.0
 
- ## Data used for benchmarking
+ ### Data
 
  We randomly sampled around 3000 prompts from the [cleaned ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered).
  See [here](https://github.com/ml-energy/leaderboard/tree/master/sharegpt) for more detail on how we created the benchmark dataset.
 
- ## NLP evaluation metrics
-
- - `arc`: [AI2 Reasoning Challenge](https://allenai.org/data/arc)'s `challenge` dataset, measures capability to do grade-school level question answering, 25 shot
- - `hellaswag`: [HellaSwag dataset](https://allenai.org/data/hellaswag), measuring grounded commonsense, 10 shot
- - `truthfulqa`: [TruthfulQA dataset](https://arxiv.org/abs/2109.07958), measuring truthfulness against questions that elicit common falsehoods, 0 shot
-
  ## Limitations
 
  Currently, inference is run with essentially bare PyTorch at batch size 1, which is unrealistic for a production serving scenario.
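To tie the parameters and data above together, here is a hedged sketch of the generation loop they describe: prompts sampled from a ShareGPT-style file, generated one at a time (batch size 1, as noted under Limitations) with temperature 0.7 and repetition penalty 1.0. The `sharegpt_prompts.json` filename and its structure are assumptions made for illustration; the real dataset and scripts live in the repository's `sharegpt` directory.

```python
import json
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed layout: a JSON list of prompt strings derived from the cleaned
# ShareGPT dataset (the leaderboard's actual data may be structured differently).
with open("sharegpt_prompts.json") as f:
    prompts = json.load(f)

# The leaderboard samples around 3000 prompts; mirror that here.
prompts = random.sample(prompts, k=min(3000, len(prompts)))

MODEL = "lmsys/vicuna-7b-v1.3"  # placeholder model name
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).to("cuda:0")

for prompt in prompts:  # batch size 1: one request at a time
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.7,         # benchmark parameter
        repetition_penalty=1.0,  # benchmark parameter
        max_new_tokens=512,      # illustrative cap, not a leaderboard setting
    )
    response = tokenizer.decode(
        outputs[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    print(response)
```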
@@ -68,18 +72,18 @@ Hence, absolute latency, throughput, and energy numbers should not be used to es
 
  ## Upcoming
 
- - Within the Summer, we'll add an LLM Arena for energy consumption!
+ - Within the summer, we'll add an online text generation interface for real-time energy consumption measurement!
  - More optimized inference runtimes, like TensorRT.
  - Larger models with distributed inference, like Falcon 40B.
  - More models, like RWKV.
 
- # License
+ ## License
 
  This leaderboard is a research preview intended for non-commercial use only.
  Model weights were taken as-is from the Hugging Face Hub if available and are subject to their licenses.
  The use of LLaMA weights is subject to their [license](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md).
  Please direct inquiries/reports of potential violation to Jae-Won Chung.
 
- # Acknowledgements
+ ## Acknowledgements
 
- We thank [Chameleon Cloud](https://www.chameleoncloud.org/) for the A100 80GB GPU nodes (`gpu_a100_pcie`) and [CloudLab](https://cloudlab.us/) for the V100 GPU nodes (`r7525`).
+ We thank [Chameleon Cloud](https://www.chameleoncloud.org/) and [CloudLab](https://cloudlab.us/) for the GPU nodes.