Why are the float32 versions a few tokens per second faster than their float16 versions?

#1
by vcoyk - opened

Shouldn't using fp16 be significantly faster than fp32?

Hugging Face Optimum org

I think that for small models, the latency of generating 1000 tokens still fluctuates, but the difference is generally less than 5% of the reported mean.
There's also the fact that for these small models, a non-negligible amount of time is spent outside the forward pass, which adds to the variance of the estimate.

It should be noted that what we report is an average throughput, computed from the time it takes to generate 1000 tokens.
In reality, the first few dozen tokens are generated much faster than the last ones (by which point the model has probably reached a steady state).
I might end up implementing a custom generate method that gives more details about the evolution of token throughput throughout the generation process.
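For what it's worth, here is a minimal sketch (not optimum-benchmark code) of how per-token latency could be tracked, assuming a recent transformers version that supports the `streamer` argument of `generate`; the model id and prompt are just placeholders:

```python
import time

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.streamers import BaseStreamer


class LatencyStreamer(BaseStreamer):
    """Records a timestamp every time generate() emits something."""

    def __init__(self):
        self.timestamps = []

    def put(self, value):
        # Called once with the prompt ids, then once per generated token.
        self.timestamps.append(time.perf_counter())

    def end(self):
        self.timestamps.append(time.perf_counter())


model_id = "gpt2"  # placeholder, swap in the model being benchmarked
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
streamer = LatencyStreamer()
model.generate(**inputs, max_new_tokens=100, do_sample=False, streamer=streamer)

stamps = streamer.timestamps[:-1]  # drop the end() timestamp
deltas = [b - a for a, b in zip(stamps, stamps[1:])]
# deltas[0] is the first-token (prefill) latency, the rest are decode latencies
print(f"first 10 token latencies (s): {[round(d, 4) for d in deltas[:10]]}")
print(f"last 10 token latencies  (s): {[round(d, 4) for d in deltas[-10:]]}")
# the single averaged number we currently report would correspond to:
print(f"mean throughput (tokens/s): {len(deltas) / (stamps[-1] - stamps[0]):.2f}")
```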
Contributions are very welcome 🤗: https://github.com/huggingface/optimum-benchmark

Hugging Face Optimum org

This thread gives an example of a case where there's no performance gain from using float16, even on an optimized backend like TensorRT:
https://forums.developer.nvidia.com/t/no-performance-difference-between-float16-and-float32-optimized-tensorrt-models/185140

IlyasMoutawwakil changed discussion status to closed
