The goal of the ML.ENERGY Leaderboard is to give people a sense of how much energy LLMs would consume.

The code for the leaderboard, backing data, and scripts for benchmarking are all open-source in our repository. We'll see you at the Discussion board, where you can ask questions, suggest improvement ideas, or just discuss leaderboard results!

Columns

gpu: NVIDIA GPU model name.
task: Name of the task. See Tasks below for details.
energy (J): The average GPU energy consumed by the model to generate a response.
throughput (token/s): The average number of tokens generated per second.
latency (s): The average time it took for the model to generate a response.
response_length (token): The average number of tokens in the model's response.
parameters: The number of parameters the model has, in billions.
arc: AI2 Reasoning Challenge's challenge dataset. Measures the ability to answer grade-school level questions, 25-shot.
hellaswag: HellaSwag dataset. Measures grounded commonsense, 10-shot.
truthfulqa: TruthfulQA dataset. Measures truthfulness on questions that elicit common falsehoods, 0-shot.
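
To make the relationship between the measurement columns concrete, here is a minimal sketch of how per-request measurements could be averaged into one leaderboard row. The `RequestRecord` layout and the `summarize` helper are hypothetical, and averaging per-request throughput (rather than dividing total tokens by total time) is an assumption, not necessarily what the benchmark script does.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    """One benchmarked request (hypothetical record layout)."""
    energy_j: float       # GPU energy consumed for this response, in joules
    latency_s: float      # time taken to generate the response, in seconds
    response_tokens: int  # number of tokens in the response


def summarize(records):
    """Average per-request measurements into leaderboard-style columns."""
    return {
        "energy (J)": mean(r.energy_j for r in records),
        # Assumption: throughput is averaged per request.
        "throughput (token/s)": mean(r.response_tokens / r.latency_s for r in records),
        "latency (s)": mean(r.latency_s for r in records),
        "response_length (token)": mean(r.response_tokens for r in records),
    }


# Two made-up requests, purely for illustration.
records = [
    RequestRecord(energy_j=600.0, latency_s=4.0, response_tokens=100),
    RequestRecord(energy_j=1200.0, latency_s=8.0, response_tokens=240),
]
summary = summarize(records)
```

Longer responses need more decoding iterations, so within one model and GPU, energy and latency tend to grow with response length.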

NLP evaluation metrics (arc, hellaswag, and truthfulqa) were run only once per model, on A40 GPUs, because their results do not depend on the GPU type. Hence, all GPU rows for the same model share the same NLP evaluation numbers.

Tasks

For each task, every model uses the same system prompt. We still account for differences in role names across models, e.g. USER, HUMAN, ASSISTANT, GPT.

Name	System prompt
chat	A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
chat-concise	A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant's answers are very concise.
instruct	Below is an instruction that describes a task. Write a response that appropriately completes the request.
instruct-concise	Below is an instruction that describes a task. Write a response that appropriately completes the request. The response should be very concise.

You can see that response length is shorter on average for the -concise variants of the tasks. This affects the number of decoding iterations the model has to run in order to finish responding, thus affecting latency and energy consumption per prompt.
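
As a rough illustration of how a shared system prompt can be combined with model-specific role names, here is a minimal sketch. The `build_prompt` helper, the separator, and the `ROLE: message` layout are assumptions for illustration only; the actual benchmark relies on FastChat for prompt formatting, and real templates differ per model.

```python
# System prompts copied from the table above.
SYSTEM_PROMPTS = {
    "chat": (
        "A chat between a human user (prompter) and an artificial intelligence (AI) "
        "assistant. The assistant gives helpful, detailed, and polite answers to the "
        "user's questions."
    ),
    "instruct": (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request."
    ),
}
# The -concise variants append one sentence to the base prompt.
SYSTEM_PROMPTS["chat-concise"] = (
    SYSTEM_PROMPTS["chat"] + " The assistant's answers are very concise."
)
SYSTEM_PROMPTS["instruct-concise"] = (
    SYSTEM_PROMPTS["instruct"] + " The response should be very concise."
)


def build_prompt(task, roles, user_message, sep="\n"):
    """Join the task's system prompt with one user turn (hypothetical layout)."""
    user_role, assistant_role = roles
    return (
        f"{SYSTEM_PROMPTS[task]}{sep}"
        f"{user_role}: {user_message}{sep}"
        f"{assistant_role}:"
    )


# Same task and system prompt, different role names.
prompt_a = build_prompt("chat-concise", ("USER", "ASSISTANT"), "What is a GPU?")
prompt_b = build_prompt("chat-concise", ("HUMAN", "GPT"), "What is a GPU?")
```

Both prompts share the system prompt and the user message; only the role labels change.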

Setup

Find our benchmark script for one model here.

Software

PyTorch 2.0.1
Zeus -- For GPU time and energy measurement
FastChat -- For running inference on various models
lm-evaluation-harness -- For NLP evaluation metrics

Hardware

NVIDIA A40 GPU
NVIDIA A100 GPU
NVIDIA V100 GPU

Parameters

Model
- Batch size 1
- FP16
Sampling (decoding)
- Sampling from the multinomial distribution
- Temperature 0.7
- Repetition penalty 1.0
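
To show what the temperature and repetition penalty settings do during decoding, here is a minimal pure-Python sketch of one sampling step. The `sample_next_token` function is hypothetical and is not the code path used by FastChat or PyTorch. The penalty rule (divide positive logits, multiply negative ones) follows a commonly used convention and is an assumption here.

```python
import math
import random


def sample_next_token(logits, generated, temperature=0.7,
                      repetition_penalty=1.0, rng=random):
    """Sample one token id from `logits` (illustrative sketch only)."""
    adjusted = list(logits)
    # Penalize tokens that already appeared. A penalty of 1.0 changes nothing.
    for tok in set(generated):
        if adjusted[tok] > 0:
            adjusted[tok] /= repetition_penalty
        else:
            adjusted[tok] *= repetition_penalty
    # Temperatures below 1 sharpen the distribution toward likely tokens.
    scaled = [x / temperature for x in adjusted]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token id according to the resulting probabilities.
    token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token, probs


logits = [2.0, 1.0, 0.0]  # made-up scores for a three-token vocabulary
token, probs = sample_next_token(logits, generated=[0], rng=random.Random(0))
_, probs_t1 = sample_next_token(logits, generated=[], temperature=1.0)
```

With a repetition penalty of 1.0, previously generated tokens are not penalized, so only the temperature shapes the distribution.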

Data

We randomly sampled around 3000 prompts from the cleaned ShareGPT dataset. See here for more detail on how we created the benchmark dataset.

Limitations

Currently, inference runs on essentially bare PyTorch with batch size 1, which is unrealistic for a production serving scenario. Hence, the absolute latency, throughput, and energy numbers should not be used to estimate figures in real production settings, although relative comparisons still make some sense.

Upcoming

Within the summer, we'll add an online text generation interface for real-time energy consumption measurement!
More optimized inference runtimes, like TensorRT.
Larger models with distributed inference, like Falcon 40B.
More models, like RWKV.

License

This leaderboard is a research preview intended for non-commercial use only. Model weights were taken as is from the Hugging Face Hub if available and are subject to their licenses. The use of LLaMA weights is subject to their license. Please direct inquiries and reports of potential violations to Jae-Won Chung.

Acknowledgements

We thank Chameleon Cloud and CloudLab for the GPU nodes.