Slow inference time for new version of transformers

#7
by russellsparadox - opened

I found that version 4.21.1 of transformers gives better inference time than 4.28.1.

For 4.21.1 the downloaded model files look like this:

[screenshot: model files downloaded by transformers 4.21.1]

While for 4.28.1:

[screenshot: model files downloaded by transformers 4.28.1]

Also, version 4.21.1 reports that load_in_8bit is not implemented.

Generated text is identical (do_sample=False).

Here are a few comparisons I did. It seems the difference becomes more pronounced for longer texts.

[screenshot: inference-time comparison between 4.21.1 and 4.28.1]

Here is the code to generate it:

model_name = "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"

tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)

model = AutoModelForCausalLM.from_pretrained(model_name, 
                                             device_map="auto",
                                             # load_in_8bit=True,
                                             torch_dtype='auto',
                                             cache_dir=cache_dir
                                            )
inputs = tokenizer(query, return_tensors="pt").to(model.device)
tokens = model.generate(**inputs,  max_new_tokens=128, do_sample=False, pad_token_id=tokenizer.eos_token_id)
output = tokenizer.decode(tokens[0])

This is an interesting finding... I was wondering why this model feels slow even in 8-bit, while the LLaMA 30B model is lightning quick at https://open-assistant.io/chat .

Have you tried generating the output multiple times?

I use the Hugging Face pipeline with the model loaded in 8-bit and call it repeatedly to generate. While the first token takes some time, the subsequent tokens take around 0.5 s each. I haven't tried much to optimise it, TBH. Also, there is generate_stream available now, but I haven't explored it yet.
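
For reference, a setup along those lines might look roughly like this (a sketch of my own, not the commenter's exact code; it assumes bitsandbytes and accelerate are installed, and the prompt strings are just illustrations):

# Sketch of an 8-bit pipeline setup (my reconstruction, not the commenter's exact code).
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             load_in_8bit=True)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Repeated calls reuse the already-loaded model, so only the first call pays the warm-up cost.
for query in ["<|prompter|>What is a transformer?<|endoftext|><|assistant|>",
              "<|prompter|>Summarise attention in one sentence.<|endoftext|><|assistant|>"]:
    result = generator(query, max_new_tokens=128, do_sample=False,
                       return_full_text=False,
                       pad_token_id=tokenizer.eos_token_id)
    print(result[0]["generated_text"])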

I used 2 x 3060 GPUs to load the model in 8-bit (with load_in_8bit=True), where each GPU takes about 7 GB of memory. The first time I called model.generate, it took more than 2 minutes to generate the output. However, subsequent calls took merely 2 to 3 seconds to generate a response.

I think the speed of inference depends a lot on your hardware setup.
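
For anyone reproducing that two-GPU 8-bit setup, a quick way to check how the model was sharded and how much memory each card ended up holding (a rough sketch; assumes bitsandbytes and accelerate are installed):

import torch
from transformers import AutoModelForCausalLM

model_name = "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             load_in_8bit=True)

# Which modules accelerate placed on which device.
print(model.hf_device_map)

# Memory currently allocated on each visible GPU.
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.memory_allocated(i) / 1024**3:.1f} GB allocated")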

How many tokens are generated in those 2 to 3 seconds?

I just ran an experiment with different prompts. On my hardware setup,

  1. for longer outputs (> 200 tokens), it's about 0.22 sec per output token.
  2. for shorter outputs (< 20 tokens), it's about 0.15 sec per output token.

The above is just a rough estimate.
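
For comparison, here is roughly how such a per-token estimate can be measured (a sketch; the prompt and token budget are placeholders, not the original experiment):

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             load_in_8bit=True)

query = "<|prompter|>Explain attention in one paragraph.<|endoftext|><|assistant|>"
inputs = tokenizer(query, return_tensors="pt").to(model.device)

start = time.perf_counter()
tokens = model.generate(**inputs, max_new_tokens=200, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Only the newly generated tokens count towards the per-token estimate.
new_tokens = tokens.shape[1] - inputs["input_ids"].shape[1]
print(f"{elapsed:.1f} s for {new_tokens} new tokens "
      f"-> {elapsed / new_tokens:.2f} s per output token")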

That is nice. I use a single Nvidia A10G GPU. Maybe it's a pipeline wrapper issue? I'll just use model.generate() and try... but the issue is I'll lose the return_full_text argument, which is handy in the pipeline.
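
One way around losing return_full_text with plain model.generate is to decode only the newly generated tokens (a sketch, reusing the model, tokenizer and query from the code at the top of the thread):

# Mimic return_full_text=False: slice off the prompt tokens before decoding.
# Assumes `model`, `tokenizer` and `query` as in the code at the top of the thread.
inputs = tokenizer(query, return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_new_tokens=128, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)

prompt_len = inputs["input_ids"].shape[1]
completion = tokenizer.decode(tokens[0][prompt_len:], skip_special_tokens=True)
print(completion)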

I dropped transformers and used https://github.com/huggingface/text-generation-inference instead.

This is super fast and supports streaming as well.

@sonatasv Cool. I am not sure if the text-generation-inference lib supports 8-bit quantization. Are you using 8-bit?

Yes, it can be enabled with the --quantize argument.
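
For context, querying a text-generation-inference server from Python could look roughly like this (a sketch using the text_generation client; it assumes a TGI server has already been launched separately, e.g. with quantization enabled, and is listening on localhost port 8080):

# Sketch of calling a running text-generation-inference server.
# Assumes the server was started separately (e.g. with --quantize) and is
# reachable at http://127.0.0.1:8080 -- adjust to your deployment.
from text_generation import Client

client = Client("http://127.0.0.1:8080")

# One-shot generation.
response = client.generate("What is deep learning?", max_new_tokens=64)
print(response.generated_text)

# Streaming, token by token.
for event in client.generate_stream("What is deep learning?", max_new_tokens=64):
    if not event.token.special:
        print(event.token.text, end="", flush=True)
print()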

Is this totally free, or do we need a Hugging Face key for this?
