Recommendations on how to improve inference time

#35
by angeligareta - opened

Hello,

First of all thank you for the release of the quantized versions, this is going to change the possibilities of using Yi-34B.

I would like to ask for recommendations on how to improve inference time for the Yi-34B models. I am using a MAX_INPUT_LENGTH of 3500 tokens and setting MAX_BATCH_PREFILL_TOKENS to X times MAX_INPUT_LENGTH, trying several values until there is no memory overflow when launching the app.
Is this a good approach? What other TGI parameters should I check?

Also, in the table where you compared the quantized versions based on batch size, could you elaborate on how to interpret that batch_size with respect to MAX_INPUT_LENGTH?

Thank you in advance!

> Also, in the table where you compared the quantized versions based on batch size, could you elaborate on how to interpret that batch_size with respect to MAX_INPUT_LENGTH?

The table might not be accurate for your scenario. It serves as guidance, indicating the general memory footprint under typical conditions. The testing was performed with transformers==4.35.2, a prompt length of 512, and a maximum generation of 1000 tokens.
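For a rough way to relate the table's batch_size to your own MAX_INPUT_LENGTH, you can compare token budgets. This is only a sketch based on the test conditions stated above (512-token prompt, 1000 generated tokens) and the 3500-token input from the question; the table batch size used below is a hypothetical row, and real memory use also depends on weights and framework overhead.

```python
# Sketch: translate a table row's batch_size into a token budget, then see
# what batch size the same budget would allow at a longer input length.
table_prompt_len = 512        # prompt length used for the table
table_gen_len = 1000          # max generated tokens used for the table
table_batch_size = 8          # hypothetical row from the table

tokens_in_flight = table_batch_size * (table_prompt_len + table_gen_len)
print(f"Table row implies roughly {tokens_in_flight} tokens in flight")

my_input_len = 3500           # MAX_INPUT_LENGTH from the question
my_gen_len = 1000             # assumed generation budget
approx_batch = tokens_in_flight // (my_input_len + my_gen_len)
print(f"Same token budget at 3500-token inputs -> batch size ~{approx_batch}")
```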

> I would like to ask for recommendations on how to improve inference time for the Yi-34B models.

I can't provide specific suggestions since it depends on your workload and your TGI setup. But MAX_INPUT_LENGTH and MAX_BATCH_PREFILL_TOKENS do affect how much memory is reserved for the KV cache at runtime, which in turn limits the batch size, i.e. the concurrency. I would try decreasing MAX_INPUT_LENGTH while increasing MAX_BATCH_PREFILL_TOKENS.
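To make the KV-cache point concrete, here is a back-of-the-envelope estimate in Python. The Yi-34B architecture numbers (60 layers, 8 grouped key/value heads, head_dim 128), the fp16 cache dtype, the 20 GB cache budget, and the MAX_BATCH_PREFILL_TOKENS value are assumptions on my part, so treat the results as order-of-magnitude figures rather than TGI-exact numbers.

```python
# Back-of-the-envelope KV-cache estimate for Yi-34B.
# Architecture numbers (60 layers, 8 GQA key/value heads, head_dim 128)
# and the fp16 cache dtype are assumptions; adjust to the actual config.
num_layers = 60
num_kv_heads = 8
head_dim = 128
bytes_per_value = 2                      # fp16

# Both keys and values are cached, hence the factor of 2.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(f"KV cache per token: ~{kv_bytes_per_token / 1e6:.2f} MB")            # ~0.25 MB

max_input_length = 3500                  # MAX_INPUT_LENGTH from the question
max_new_tokens = 1000                    # assumed generation budget
per_sequence = kv_bytes_per_token * (max_input_length + max_new_tokens)
print(f"KV cache per full-length sequence: ~{per_sequence / 1e9:.2f} GB")   # ~1.1 GB

# If, say, 20 GB of VRAM remains for the KV cache after loading the weights,
# the achievable concurrency (batch size) is roughly:
kv_budget_bytes = 20e9
print(f"Rough max concurrent sequences: {int(kv_budget_bytes // per_sequence)}")

# MAX_BATCH_PREFILL_TOKENS bounds how many tokens go through prefill in one
# batch, so it also roughly caps how many long prompts are prefilled together:
max_batch_prefill_tokens = 14000         # hypothetical value (4 x 3500)
print(f"Prompts prefilled per batch: ~{max_batch_prefill_tokens // max_input_length}")
```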

@angeligareta Try installing flash-attn2 when using an NVIDIA GPU as the inference device; it saves a lot of attention memory. Also try the GPTQ 4-bit version made by TheBloke, paired with exllamav2 (if you mostly use Yi for single-batch inference) or vLLM (which serves multi-batch workloads well), all with flash-attn2.
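As a hedged starting point for the vLLM route, something like the sketch below should work with a GPTQ repo such as TheBloke/Yi-34B-GPTQ; the exact repo name, max_model_len, memory fraction, and sampling settings are assumptions you will likely need to adjust for your GPU and prompts.

```python
# Minimal vLLM sketch for serving a GPTQ-quantized Yi-34B.
# Repo name and settings are assumptions; tune max_model_len and
# gpu_memory_utilization to your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Yi-34B-GPTQ",   # assumed GPTQ repo id
    quantization="gptq",
    max_model_len=4500,             # room for a 3500-token prompt plus generation
    gpu_memory_utilization=0.90,
)

sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=1000)
outputs = llm.generate(["Explain the difference between GPTQ and AWQ."], sampling)
for out in outputs:
    print(out.outputs[0].text)
```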

As it seems like your question has been answered, if there is nothing else we can help you with on this matter, we will be closing this discussion for now.

If you have any further questions, feel free to reopen this discussion or start a new one.

Thank you for your contribution to this community!

richardllin changed discussion status to closed
