How to design an inference performance benchmark for a text model?

#254
by ricardoooo - opened

I'm new to transformers; as an application developer, I'm curious about inference performance in the real world. I haven't found any benchmark docs about LLMs, but I do care about how many tokens a model can produce per second on a specific hardware platform (such as 8× A100 40GB). How should I design the strategy for this test? A rough sketch of what I have in mind is below.
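Here is a minimal sketch of the kind of measurement I'm considering, assuming a transformers causal LM; the model name, prompt, and generation settings are just placeholders, and a real benchmark would sweep batch size, prompt length, and max_new_tokens:

```python
# Minimal throughput sketch (placeholders: model_id, prompt, max_new_tokens).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Explain the transformer architecture in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=256)
torch.cuda.synchronize()  # make sure GPU work is finished before timing
elapsed = time.perf_counter() - start

# Count only the newly generated tokens, not the prompt tokens.
new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tokens/s")
```

Is this roughly the right approach, or is there a more standard methodology?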
I noticed that most text models have a generation parameter named max_new_tokens. But not every request will actually generate that many tokens, so should I count the tokens generated for each input by locating the eos_token_id in the corresponding output? Something like the helper sketched below is what I have in mind.
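A rough sketch of that counting logic, assuming batched output from model.generate() where sequences that stop early are padded out to the batch length, and assuming all prompts in the batch share the same length (the helper name is just my own):

```python
# Count the tokens actually generated per sequence by slicing off the prompt
# and stopping at the first eos_token_id (sequences that finish early are
# padded to the longest sequence in the batch).
import torch

def count_generated_tokens(output_ids, prompt_len, eos_token_id):
    """output_ids: (batch, seq_len) tensor returned by model.generate()."""
    counts = []
    for seq in output_ids:
        generated = seq[prompt_len:]  # drop the prompt tokens
        eos_positions = (generated == eos_token_id).nonzero(as_tuple=True)[0]
        if len(eos_positions) > 0:
            counts.append(int(eos_positions[0]) + 1)  # include the EOS token
        else:
            counts.append(len(generated))  # ran until max_new_tokens
    return counts
```

Then the per-request throughput would be the sum of these counts divided by the wall-clock generation time. Does that sound right?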
