Slow inference time with the newer version of transformers
I found that the '4.21.1' version of transformers gives better inference time than '4.28.1'.
For 4.21.1, the downloaded models look like this
Also, with 4.21.1, passing load_in_8bit gives a "not implemented" error.
Generated text is identical (do_sample=False).
Here are a few comparisons I did. It seems the difference becomes more pronounced for longer texts.
Here is the code to generate it:
from transformers import AutoModelForCausalLM, AutoTokenizer

# cache_dir and query are defined elsewhere in the script.
model_name = "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    # load_in_8bit=True,
    torch_dtype="auto",
    cache_dir=cache_dir,
)

inputs = tokenizer(query, return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_new_tokens=128, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)
output = tokenizer.decode(tokens[0])
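
A minimal timing sketch, reusing model, tokenizer, and inputs from the snippet above, that can be run under both versions to measure the difference:

import time

# Time a single generate() call and report per-token latency.
start = time.perf_counter()
tokens = model.generate(**inputs, max_new_tokens=128, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)
elapsed = time.perf_counter() - start

new_tokens = tokens.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} new tokens in {elapsed:.2f}s "
      f"({elapsed / new_tokens:.3f} s/token)")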
This is an interesting finding... I was wondering why the model feels slow even in 8-bit, while the LLaMA 30B model is lightning quick at https://open-assistant.io/chat.
Have you tried generating the output multiple times?
I use the Hugging Face pipeline with the model loaded in 8-bit and call it repeatedly to generate. While the first token takes some time, the subsequent tokens take around 0.5 s each. I haven't tried much to optimise, TBH. Also, there is generate_stream available now, but I haven't explored it yet.
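
For context, a minimal sketch of that setup, assuming model and tokenizer were loaded as in the code above but with load_in_8bit=True:

from transformers import pipeline

# Wrap the already-loaded 8-bit model in a text-generation pipeline.
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Repeated calls reuse the loaded weights; only the first call pays the warm-up cost.
# return_full_text=False returns only the newly generated text.
result = generator(query, max_new_tokens=128, do_sample=False, return_full_text=False)
print(result[0]["generated_text"])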
I used 2 x 3060 GPUs to load the model in 8-bit (with load_in_8bit=True), where each GPU takes about 7 GB of memory. The first time I called model.generate, it took more than 2 min to generate the output. However, subsequent calls took merely 2~3 seconds to generate a response.
I think the speed of inference depends a lot on your hardware setup.
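
For reference, a minimal sketch of that 8-bit, multi-GPU loading, assuming the same model_name and cache_dir as in the code above (the max_memory caps are illustrative values, not the ones used here):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,                    # 8-bit weights via bitsandbytes
    device_map="auto",                    # shard layers across the available GPUs
    max_memory={0: "10GiB", 1: "10GiB"},  # optional per-GPU cap (illustrative)
    cache_dir=cache_dir,
)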
How many tokens are generated in those 2 to 3 seconds?
I just ran an experiment with different prompts. On my hardware setup,
- for longer outputs (> 200 tokens), it's about 0.22 sec per output token.
- for shorter outputs (< 20 tokens), it's about 0.15 sec per output token.
These are just rough estimates.
That is nice. I use a single Nvidia A10G GPU. Maybe it's a pipeline wrapper issue? I'll just use model.generate() and try... but the issue is I'll lose the return_full_text argument, which is handy in the pipeline.
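
A minimal sketch of reproducing the effect of return_full_text=False with plain model.generate(), by slicing the prompt tokens off before decoding (assuming the model, tokenizer, and query from the code above):

inputs = tokenizer(query, return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_new_tokens=128, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)

# Drop the prompt tokens so only the newly generated text is decoded,
# mimicking the pipeline's return_full_text=False.
prompt_len = inputs["input_ids"].shape[1]
new_text = tokenizer.decode(tokens[0][prompt_len:], skip_special_tokens=True)
print(new_text)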
I dropped transformers and used https://github.com/huggingface/text-generation-inference instead.
This is super fast and supports streaming as well.
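
For anyone trying it, a minimal sketch of calling a running text-generation-inference server over its HTTP API; the host, port, and parameter values below are assumptions, not details from this thread:

import requests

# Assumed local server address; adjust to wherever the TGI server is running.
TGI_URL = "http://127.0.0.1:8080/generate"

payload = {
    "inputs": "What is the capital of France?",
    "parameters": {"max_new_tokens": 128, "do_sample": False},
}

response = requests.post(TGI_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["generated_text"])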
@sonatasv Cool. I am not sure if the text-generation-inference lib supports 8-bit quantization. Are you using 8-bit?
Yes, it can be enabled with the --quantize argument.
Is this totally free, or do we need a Hugging Face API key for it?