How many tokens per second when running llama-13b-hf inference on an A10G?

#2
by baby1 - opened

Which generation parameter is the most important for accelerating inference speed?

max_length is 1000? It seems like it could be very slow!

generation_config:
temperature: 0.90
top_p: 0.75
num_beams: 1
use_cache: True
max_length: 1000
min_length: 0
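
For reference, here is a minimal sketch of how these fields would map onto transformers' GenerationConfig and model.generate(); the model id, prompt, and dtype are placeholders, not this Space's actual setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

# Placeholder model id; substitute whichever llama-13b-hf checkpoint you actually use.
model_id = "path/to/llama-13b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# Mirrors the generation_config above. Note that max_length counts the prompt
# tokens too, and with num_beams=1 wall-clock time is dominated by how many
# new tokens are actually generated.
gen_config = GenerationConfig(
    temperature=0.90,
    top_p=0.75,
    num_beams=1,
    use_cache=True,
    max_length=1000,
    min_length=0,
    do_sample=True,  # temperature / top_p only take effect when sampling
)

inputs = tokenizer("Hello, how fast is this?", return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, generation_config=gen_config)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```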

That is not the case.

Unfortunately, the Hugging Face library does not support streaming generation at the moment, so one has to write a sort of monkey patch to enable it. The parameters you posted are used in batch generation mode, which is not how this Space works.

    instruction_prompt,
    max_tokens=128,
    temperature=1,
    top_p=0.9,
    cache=True

These are the only parameters supported in streaming mode at the moment. Generation could be faster if I removed the window of conversation history that is looked back at.
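
For readers wondering what the monkey-patch style of streaming looks like in practice, here is a minimal sketch assuming a causal LM and tokenizer from transformers; the stream_generate name, the sampling loop, and the parameter handling are all illustrative, not the actual code of this Space:

```python
import torch

def stream_generate(model, tokenizer, instruction_prompt,
                    max_tokens=128, temperature=1.0, top_p=0.9, cache=True):
    """Illustrative token-by-token streaming: sample one token per forward pass,
    reuse the key/value cache, and yield the decoded text so far."""
    input_ids = tokenizer(instruction_prompt, return_tensors="pt").input_ids.to(model.device)
    past_key_values = None
    generated_ids = []

    for _ in range(max_tokens):
        with torch.no_grad():
            out = model(
                input_ids=input_ids if past_key_values is None else input_ids[:, -1:],
                past_key_values=past_key_values,
                use_cache=cache,
            )
        past_key_values = out.past_key_values if cache else None

        # Temperature scaling followed by nucleus (top-p) sampling.
        logits = out.logits[:, -1, :] / temperature
        probs = torch.softmax(logits, dim=-1)
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        sorted_probs[cumulative - sorted_probs > top_p] = 0.0
        sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
        next_token = sorted_idx.gather(-1, torch.multinomial(sorted_probs, 1))

        if next_token.item() == tokenizer.eos_token_id:
            break
        generated_ids.append(next_token.item())
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        yield tokenizer.decode(generated_ids, skip_special_tokens=True)
```

Each yield returns the cumulative decoded text, which a chat UI can render as it arrives.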

Also, note that what you see in the chat UI is 3 tokens generated at a time. I aggregate n tokens into a chunk and yield it. This is an experiment to see whether yielding every single token is costly.
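
As a sketch of that chunking, here is a hypothetical wrapper that buffers a token-level stream and yields every n tokens (chunk_size=3 mirrors the 3-tokens-at-a-time behavior; the names are illustrative):

```python
def stream_in_chunks(token_stream, chunk_size=3):
    """Aggregate individual token strings into chunks of `chunk_size` and yield
    each chunk, so the UI updates every few tokens rather than on every token."""
    buffer = []
    for token_text in token_stream:
        buffer.append(token_text)
        if len(buffer) == chunk_size:
            yield "".join(buffer)
            buffer = []
    if buffer:  # flush any remaining tokens at the end of generation
        yield "".join(buffer)
```

The trade-off being tested is UI update frequency versus the per-yield overhead of pushing every single token to the client.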

Also, one could build the application so that another request takes its turn while the chunk of tokens generated for the previous request is being yielded, if I could insert asyncio.sleep(0.01) or something similar. However, Gradio does not support async generators at the moment.
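
If Gradio accepted async generators, that cooperative handoff could look roughly like this (a toy sketch with simulated chunks; handle_request and the chunk lists are made up, and real inference would still need to run off the event loop):

```python
import asyncio

async def handle_request(chunks):
    """Yield chunks for one request, pausing briefly after each one so the
    event loop can let another request make progress in between."""
    for chunk in chunks:
        yield chunk
        await asyncio.sleep(0.01)

async def main():
    async def consume(name, chunks):
        async for chunk in handle_request(chunks):
            print(f"{name}: {chunk}")

    # The two simulated requests interleave because each yields control
    # back to the event loop between chunks.
    await asyncio.gather(
        consume("request-1", ["Hel", "lo ", "wor", "ld"]),
        consume("request-2", ["Hi ", "the", "re!"]),
    )

asyncio.run(main())
```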

chansung changed discussion status to closed

How many tokens per second in local mode? I am hoping to test it with 512 tokens after buying a Colab subscription, but I would like to know whether I can try 1024 or 2048 tokens.
