Falcon models slow inference

#59
by mikeytrw - opened

Hi, I'm seeing very poor performance during inference compared to Llama-based models. Understandably, given the parameter count, there should be some difference between 30B and 40B, but it's half the speed or worse when generating tokens.

I'm running the FP16 version on 2xA100 80GB

Am I missing something or misconfiguring it?
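For reference, this is roughly the setup I mean, as a minimal sketch: a stock transformers load in FP16 sharded across both GPUs. The model id and generation parameters here are placeholders, not my exact server code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b"   # assumed checkpoint; swap in the one you are running
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # FP16 weights
    device_map="auto",           # shard across both A100s (requires accelerate)
    trust_remote_code=True,      # Falcon ships its own modelling code (modelling_RW.py)
)

inputs = tokenizer("Falcon models are", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```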

Hi,

Were you able to figure it out? I am facing the same issue.

No, I'm currently running the H2O OpenAssistant fine-tuned model on 8x A100 80GB and it's still sloooooow, well, at least compared to Llama.

Please run inference using safetensors; I would suggest using Hugging Face Text Generation Inference with the Falcon model. It will be faster. Additionally, the number of tokens generated will affect the speed. Please try it and share your feedback. Thanks.
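For what it's worth, here is a minimal sketch of the safetensors part of that suggestion, assuming the checkpoint actually publishes `.safetensors` shards; `use_safetensors=True` asks transformers to load those instead of the pickle-based `.bin` files. Text Generation Inference is a separate server with its own launch options documented in its repo, so it isn't shown here.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
    use_safetensors=True,   # load *.safetensors shards; errors out if the repo has none
)
```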

Thanks for the reply. Can you please share some samples, a Colab, or a GitHub repo? Thanks for your time.

I don't want to use Text Generation Inference because this is part of a server codebase I already have. Can you explain a bit more about using safetensors, or link to a code example? Thanks.

Dear Mike, I need the same thing in my server codebase and have not yet explored the safetensors implementation that exists in Hugging Face Text Generation Inference. I will do so at a later stage. In your case, I would suggest implementing your own, using the logic and implementation already in Hugging Face's Text Generation Inference.

I'm also running inference on the same 2x A100 80GB GPUs, and the inference time is high.

Inference time for the out-of-the-box Falcon models scales directly with the number of max_new_tokens being generated. This is because of a faulty incorporation of past_key_values and the rotary embeddings: the former is used to cache the transformer keys and values as each token is generated so that they are not recomputed at every timestep, and the latter is responsible for the positional embeddings. There is also a bug in the causal masking section of the attention mechanism being called. All of this has been mentioned in this thread.
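To make the caching point concrete, here is a minimal sketch (not Falcon-specific) of what a greedy decode loop looks like when past_key_values is threaded through correctly: each step feeds only the newest token back in, whereas the buggy code effectively recomputes attention over the whole prefix at every step.

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids, max_new_tokens=32):
    past_key_values = None
    generated = input_ids
    next_input = input_ids
    for _ in range(max_new_tokens):
        out = model(input_ids=next_input, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values           # reuse the cached keys/values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
        next_input = next_token                         # only the new token is fed back
    return generated
```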

I have raised a pull request today fixing the slowness. Just replace modelling_RW.py with the one in this pull request. Let me know if you find any issues with it. Note that pretty much all Falcon-family models need the same changes to speed up generation. As some people in this thread mention, the slowness is mainly because the Falcon model recomputes everything from the beginning for every new token.
This is the pull request.

https://huggingface.co/tiiuae/falcon-40b/discussions/85
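For anyone who wants to try it without editing files by hand, the Hub exposes pull requests as git refs, so something like the sketch below should pull the patched modelling_RW.py straight from that PR. The ref name assumes the PR number in the URL above is the one carrying the fix.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b"
revision = "refs/pr/85"   # the pull request linked above

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision=revision,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,   # pulls the patched modelling code from that revision
)
```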
