
Llama performance on AWS Inferentia2 (Latency & Throughput)

How fast is Llama on Inferentia2? Let’s find out!

For this benchmark we will use the Llama 2 7B and 13B models with different configurations:

| Model type                  | num cores | batch_size |
|-----------------------------|-----------|------------|
| Llama2 7B - L (latency)     | 24        | 1          |
| Llama2 7B - T (throughput)  | 24        | 4          |
| Llama2 13B - L (latency)    | 24        | 1          |
| Llama2 13B - T (throughput) | 24        | 4          |

Note: all models are compiled with a maximum sequence length of 2048.

All models are compiled to use all of the cores available on the inf2.48xlarge instance.

Note: please refer to the Inferentia2 product page for details on the available instances.
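
As an illustration, here is a minimal sketch of how one of these configurations could be exported with optimum-neuron’s NeuronModelForCausalLM. The model id, compiler arguments and output directory are assumptions for the example, not the exact script used to produce this benchmark:

```python
from optimum.neuron import NeuronModelForCausalLM

# "Latency" configuration for Llama 2 7B: batch_size=1, 24 Neuron cores,
# static sequence length of 2048 (values taken from the table above).
compiler_args = {"num_cores": 24, "auto_cast_type": "fp16"}
input_shapes = {"batch_size": 1, "sequence_length": 2048}

model = NeuronModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # assumed checkpoint, requires gated access
    export=True,
    **compiler_args,
    **input_shapes,
)
model.save_pretrained("llama2-7b-neuron-latency")  # hypothetical output directory
```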

We created two “latency” oriented configurations for the Llama 2 7B and 13B models that serve only one request at a time, but at full speed, and two “throughput” oriented configurations that serve up to four requests in parallel.

To evaluate the models, we generate tokens up to a total sequence length of 1024, starting from 256 input tokens (i.e. we generate 256, 512 and 768 tokens).

Encoding time (time to first token)

The encoding time or time to first token is the time required to process the input tokens and generate the first output token. It is a very important metric, as it corresponds to the latency directly perceived by the user when streaming generated tokens.

We test the encoding time for increasing context sizes: 256 input tokens correspond roughly to a typical Q/A usage, while 768 is more typical of a Retrieval Augmented Generation (RAG) use case.

Encoding time is expressed in seconds.

Figure: Llama 2 encoding time (time to first token) on Inferentia2

We can see that all deployed models exhibit excellent response times, even for long contexts.
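
As a rough sketch of how this metric can be measured with the exported model (the model directory, tokenizer checkpoint and prompt below are placeholders), timing a generate call capped at a single new token approximates the time to first token:

```python
import time

from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

# Hypothetical: reload the model exported in the earlier example.
model = NeuronModelForCausalLM.from_pretrained("llama2-7b-neuron-latency")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

prompt = "..."  # a ~256-token context, e.g. a question plus supporting passages
inputs = tokenizer(prompt, return_tensors="pt")

start = time.perf_counter()
model.generate(**inputs, max_new_tokens=1)  # stop right after the first new token
print(f"Time to first token: {time.perf_counter() - start:.2f} s")
```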

End-to-end Latency

The end-to-end latency corresponds to the total time to reach a sequence length of 1024 tokens.

It therefore includes the encoding and generation time.

Latency is expressed in seconds.

Figure: Llama 2 end-to-end latency on Inferentia2

All models deployed on the high-end instance exhibit good latency, even those configured to optimize throughput.
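
Continuing the same hypothetical setup, the end-to-end latency can be measured by timing a full generation up to the 1024-token total sequence length (an early end-of-sequence token would shorten the run):

```python
import time

# 256 input tokens + 768 new tokens reaches the 1024-token total used above.
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=768)
end_to_end_latency = time.perf_counter() - start
print(f"End-to-end latency: {end_to_end_latency:.2f} s for {output.shape[-1]} total tokens")
```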

Throughput

We adopt the same convention as other benchmarks to evaluate the throughput: we divide the total number of input and output tokens by the end-to-end latency. In other words, we divide batch_size * sequence_length by the end-to-end latency to obtain the number of tokens generated per second.

Throughput is expressed in tokens/second.
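
For example, with placeholder numbers (not measured results), the computation looks like this:

```python
# Throughput as defined above: (batch_size * sequence_length) / end-to-end latency.
batch_size = 4             # "throughput" configuration
sequence_length = 1024     # 256 input tokens + 768 generated tokens
end_to_end_latency = 25.0  # hypothetical end-to-end latency, in seconds

throughput = batch_size * sequence_length / end_to_end_latency
print(f"{throughput:.1f} tokens/second")
```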

Figure: Llama 2 throughput on Inferentia2

Again, the models deployed on the high-end instance achieve very good throughput, even those optimized for latency.