Falcon models slow inference

#59
by mikeytrw - opened

Hi, I'm seeing very poor performance during inference compared to Llama-based models. Understandably, given the parameter count, there should be some difference between 30B and 40B, but it's half the speed or worse when generating tokens.

I'm running the FP16 version on 2xA100 80GB

Am I missing something or misconfiguring it?
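For reference, this is roughly the setup I mean, as a minimal sketch: a stock transformers load in FP16 sharded across both GPUs. The model id and generation parameters here are placeholders, not my exact server code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b"   # assumed checkpoint; swap in the one you are running
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # FP16 weights
    device_map="auto",           # shard across both A100s (requires accelerate)
    trust_remote_code=True,      # Falcon ships its own modelling code (modelling_RW.py)
)

inputs = tokenizer("Falcon models are", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```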

Hi,

Were you able to figure it out? I am facing the same issue.

No, I'm currently running the H2O OpenAssistant fine-tuned model on 8x A100 80GB and it's still sloooooow, well, at least compared to Llama.

Please run inference using safetensors; I would suggest using Hugging Face Text Generation Inference with the Falcon model. It will be faster. Additionally, the number of tokens generated will affect the speed. Please try it and share your feedback. Thanks.
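For what it's worth, here is a minimal sketch of the safetensors part of that suggestion, assuming the checkpoint actually publishes `.safetensors` shards; `use_safetensors=True` asks transformers to load those instead of the pickle-based `.bin` files. Text Generation Inference is a separate server with its own launch options documented in its repo, so it isn't shown here.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
    use_safetensors=True,   # load *.safetensors shards; errors out if the repo has none
)
```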

Thanks for the reply. Can you please share some samples, a Colab, or a GitHub repo? Thanks for your time.

I don't want to use Text Generation Inference because this is part of a server codebase I already have. Can you explain a bit more about using safetensors, or link to a code example? Thanks.

Dear Mike, I need the same thing in my server codebase and have not yet explored the safetensors implementation that exists in Hugging Face Text Generation Inference. I will do so at a later stage. In your case, I would suggest implementing your own, using the logic and implementation already in Hugging Face's Text Generation Inference.

I'm also running inference on the same 2x A100 80GB GPUs, and the inference time is high.

Inference time for the out-of-the-box Falcon models scales directly with the number of max_new_tokens being generated. This is because of a faulty incorporation of past_key_values and the rotary embeddings: the former is used to cache the transformer keys and values as each token is generated so that they are not recomputed at every timestep, and the latter is responsible for the positional embeddings. There is also a bug in the causal masking section of the attention mechanism being called. All of this has been mentioned in this thread.
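To make the caching point concrete, here is a minimal sketch (not Falcon-specific) of what a greedy decode loop looks like when past_key_values is threaded through correctly: each step feeds only the newest token back in, whereas the buggy code effectively recomputes attention over the whole prefix at every step.

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids, max_new_tokens=32):
    past_key_values = None
    generated = input_ids
    next_input = input_ids
    for _ in range(max_new_tokens):
        out = model(input_ids=next_input, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values           # reuse the cached keys/values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
        next_input = next_token                         # only the new token is fed back
    return generated
```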

I have raised a pull request today fixing the slowness. Just replace modelling_RW.py with the one in this pull request. Let me know if you find any issues with it. Note that pretty much all Falcon-family models need the same changes to speed up generation. As some people in this thread mention, the slowness is mainly because the Falcon model recomputes everything from the beginning for every new token.
This is the pull request.

https://huggingface.co/tiiuae/falcon-40b/discussions/85
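For anyone who wants to try it without editing files by hand, the Hub exposes pull requests as git refs, so something like the sketch below should pull the patched modelling_RW.py straight from that PR. The ref name assumes the PR number in the URL above is the one carrying the fix.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b"
revision = "refs/pr/85"   # the pull request linked above

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    revision=revision,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,   # pulls the patched modelling code from that revision
)
```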
