How many tokens per second when running llama-13b-hf inference on an A10G?

#2
by baby1 - opened

Which generation parameter is the most important for accelerating inference speed?

max_length is 1000? It seems like it could be very slow!

generation_config:
temperature: 0.90
top_p: 0.75
num_beams: 1
use_cache: True
max_length: 1000
min_length: 0
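
For reference, here is a minimal sketch of how these fields would map onto transformers' GenerationConfig and model.generate(); the model id, prompt, and dtype are placeholders, not this Space's actual setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

# Placeholder model id; substitute whichever llama-13b-hf checkpoint you actually use.
model_id = "path/to/llama-13b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# Mirrors the generation_config above. Note that max_length counts the prompt
# tokens too, and with num_beams=1 wall-clock time is dominated by how many
# new tokens are actually generated.
gen_config = GenerationConfig(
    temperature=0.90,
    top_p=0.75,
    num_beams=1,
    use_cache=True,
    max_length=1000,
    min_length=0,
    do_sample=True,  # temperature / top_p only take effect when sampling
)

inputs = tokenizer("Hello, how fast is this?", return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, generation_config=gen_config)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```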

That is not the case.

Unfortunately, the Hugging Face library does not support streaming generation at the moment, so one has to write a sort of monkey patch to enable it. The parameters you posted are used in batch generation mode, which is not how this Space works.

    instruction_prompt,
    max_tokens=128,
    temperature=1,
    top_p=0.9,
    cache=True

These are the only parameters supported in streaming mode at the moment. Generation could be faster if I removed the window of conversation history that is looked back at.
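
For readers wondering what the monkey-patch style of streaming looks like in practice, here is a minimal sketch assuming a causal LM and tokenizer from transformers; the stream_generate name, the sampling loop, and the parameter handling are all illustrative, not the actual code of this Space:

```python
import torch

def stream_generate(model, tokenizer, instruction_prompt,
                    max_tokens=128, temperature=1.0, top_p=0.9, cache=True):
    """Illustrative token-by-token streaming: sample one token per forward pass,
    reuse the key/value cache, and yield the decoded text so far."""
    input_ids = tokenizer(instruction_prompt, return_tensors="pt").input_ids.to(model.device)
    past_key_values = None
    generated_ids = []

    for _ in range(max_tokens):
        with torch.no_grad():
            out = model(
                input_ids=input_ids if past_key_values is None else input_ids[:, -1:],
                past_key_values=past_key_values,
                use_cache=cache,
            )
        past_key_values = out.past_key_values if cache else None

        # Temperature scaling followed by nucleus (top-p) sampling.
        logits = out.logits[:, -1, :] / temperature
        probs = torch.softmax(logits, dim=-1)
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        sorted_probs[cumulative - sorted_probs > top_p] = 0.0
        sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
        next_token = sorted_idx.gather(-1, torch.multinomial(sorted_probs, 1))

        if next_token.item() == tokenizer.eos_token_id:
            break
        generated_ids.append(next_token.item())
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        yield tokenizer.decode(generated_ids, skip_special_tokens=True)
```

Each yield returns the cumulative decoded text, which a chat UI can render as it arrives.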

Also, note that what you see in the chat UI is 3 tokens generated at a time. I aggregate n tokens into a chunk and yield it. This is an experiment to see whether yielding every single token is costly.
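
As a sketch of that chunking, here is a hypothetical wrapper that buffers a token-level stream and yields every n tokens (chunk_size=3 mirrors the 3-tokens-at-a-time behavior; the names are illustrative):

```python
def stream_in_chunks(token_stream, chunk_size=3):
    """Aggregate individual token strings into chunks of `chunk_size` and yield
    each chunk, so the UI updates every few tokens rather than on every token."""
    buffer = []
    for token_text in token_stream:
        buffer.append(token_text)
        if len(buffer) == chunk_size:
            yield "".join(buffer)
            buffer = []
    if buffer:  # flush any remaining tokens at the end of generation
        yield "".join(buffer)
```

The trade-off being tested is UI update frequency versus the per-yield overhead of pushing every single token to the client.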

Also, one could build the application so that another request takes its turn while the chunk of tokens generated for the previous request is being yielded, if I could insert asyncio.sleep(0.01) or something similar. However, Gradio does not support async generators at the moment.
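
If Gradio accepted async generators, that cooperative handoff could look roughly like this (a toy sketch with simulated chunks; handle_request and the chunk lists are made up, and real inference would still need to run off the event loop):

```python
import asyncio

async def handle_request(chunks):
    """Yield chunks for one request, pausing briefly after each one so the
    event loop can let another request make progress in between."""
    for chunk in chunks:
        yield chunk
        await asyncio.sleep(0.01)

async def main():
    async def consume(name, chunks):
        async for chunk in handle_request(chunks):
            print(f"{name}: {chunk}")

    # The two simulated requests interleave because each yields control
    # back to the event loop between chunks.
    await asyncio.gather(
        consume("request-1", ["Hel", "lo ", "wor", "ld"]),
        consume("request-2", ["Hi ", "the", "re!"]),
    )

asyncio.run(main())
```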

chansung changed discussion status to closed

How many tokens per second in local mode? I am hoping to test it with 512 tokens after buying a Colab subscription, but I would like to know whether I can try 1024 or 2048 tokens.
