Slow inference time with the newer version of transformers
I found that the '4.21.1' version of transformers gives better inference time than '4.28.1'.
For 4.21.1, the downloaded models look like this
Also, with 4.21.1, passing load_in_8bit gives a "not implemented" error.
Generated text is identical (do_sample=False).
Here are a few comparisons I did. It seems the difference becomes more pronounced for longer texts.
Here is the code to generate it:
from transformers import AutoModelForCausalLM, AutoTokenizer

# cache_dir and query are defined elsewhere in the script.
model_name = "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    # load_in_8bit=True,
    torch_dtype="auto",
    cache_dir=cache_dir,
)

inputs = tokenizer(query, return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_new_tokens=128, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)
output = tokenizer.decode(tokens[0])
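
A minimal timing sketch, reusing model, tokenizer, and inputs from the snippet above, that can be run under both versions to measure the difference:

import time

# Time a single generate() call and report per-token latency.
start = time.perf_counter()
tokens = model.generate(**inputs, max_new_tokens=128, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)
elapsed = time.perf_counter() - start

new_tokens = tokens.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} new tokens in {elapsed:.2f}s "
      f"({elapsed / new_tokens:.3f} s/token)")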
This is an interesting finding... I was wondering why the model feels slow even in 8-bit, while the LLaMA 30B model is lightning quick at https://open-assistant.io/chat.
Have you tried generating the output multiple times?
I use the Hugging Face pipeline with the model loaded in 8-bit and call it repeatedly to generate. While the first token takes some time, the subsequent tokens take around 0.5 s each. I haven't tried much to optimise, TBH. Also, there is generate_stream available now, but I haven't explored it yet.
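
For context, a minimal sketch of that setup, assuming model and tokenizer were loaded as in the code above but with load_in_8bit=True:

from transformers import pipeline

# Wrap the already-loaded 8-bit model in a text-generation pipeline.
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Repeated calls reuse the loaded weights; only the first call pays the warm-up cost.
# return_full_text=False returns only the newly generated text.
result = generator(query, max_new_tokens=128, do_sample=False, return_full_text=False)
print(result[0]["generated_text"])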
I used 2 x 3060 GPUs to load the model in 8-bit (with load_in_8bit=True), where each GPU takes about 7 GB of memory. The first time I called model.generate, it took more than 2 min to generate the output. However, subsequent calls took merely 2~3 seconds to generate a response.
I think the speed of inference depends a lot on your hardware setup.
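
For reference, a minimal sketch of that 8-bit, multi-GPU loading, assuming the same model_name and cache_dir as in the code above (the max_memory caps are illustrative values, not the ones used here):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,                    # 8-bit weights via bitsandbytes
    device_map="auto",                    # shard layers across the available GPUs
    max_memory={0: "10GiB", 1: "10GiB"},  # optional per-GPU cap (illustrative)
    cache_dir=cache_dir,
)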
How many tokens are generated in those 2 to 3 seconds?
I just ran an experiment with different prompts. On my hardware setup,
- for longer outputs (> 200 tokens), it's about 0.22 sec per output token.
- for shorter outputs (< 20 tokens), it's about 0.15 sec per output token.
These are just rough estimates.
That is nice. I use a single Nvidia A10G GPU. Maybe it's a pipeline wrapper issue? I'll just use model.generate() and try... but the issue is I'll lose the return_full_text argument, which is handy in the pipeline.
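
A minimal sketch of reproducing the effect of return_full_text=False with plain model.generate(), by slicing the prompt tokens off before decoding (assuming the model, tokenizer, and query from the code above):

inputs = tokenizer(query, return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_new_tokens=128, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)

# Drop the prompt tokens so only the newly generated text is decoded,
# mimicking the pipeline's return_full_text=False.
prompt_len = inputs["input_ids"].shape[1]
new_text = tokenizer.decode(tokens[0][prompt_len:], skip_special_tokens=True)
print(new_text)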
I dropped transformers and used https://github.com/huggingface/text-generation-inference instead.
This is super fast and supports streaming as well.
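
For anyone trying it, a minimal sketch of calling a running text-generation-inference server over its HTTP API; the host, port, and parameter values below are assumptions, not details from this thread:

import requests

# Assumed local server address; adjust to wherever the TGI server is running.
TGI_URL = "http://127.0.0.1:8080/generate"

payload = {
    "inputs": "What is the capital of France?",
    "parameters": {"max_new_tokens": 128, "do_sample": False},
}

response = requests.post(TGI_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["generated_text"])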
@sonatasv Cool. I am not sure if the text-generation-inference lib supports 8-bit quantization. Are you using 8-bit?
Yes, it can be enabled with the --quantize argument.
Is this totally free, or do we need a Hugging Face API key for it?