Changes in modelling_RW.py to handle past_key_values for faster generation
The current code does not pass past_key_values through each forward pass during token generation, which results in a lot of recomputation. The modelling_RW.py I am uploading handles this the way generation/utils.py in the Hugging Face transformers (PyTorch) package expects. All the changes basically revolve around passing past_key_values everywhere. I think the same changes apply to pretty much all of the Falcon family models with slow generation. The specific changes are listed below.
Class RotaryEmbedding forward method
Include past_seq_length in the forward pass and apply the rotary embedding according to the position of the query token ---- if/else condition added (lines 98-101)
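To illustrate the idea of offsetting rotary positions by the cached length, here is a minimal sketch in plain Python. It is not the code from the pull request: the function name `rotary_angles`, the `base` of 10000 and the list-based layout are assumptions made for illustration. The point is only that a new token decoded with a cache of length `past_len` must get the angle for position `past_len`, not position 0.

```python
def rotary_angles(seq_len, head_dim, past_len=0, base=10000.0):
    """Return rotation angles for each new token, one list per position.

    Hypothetical helper: positions start at past_len so that a token decoded
    with a cache gets the same angle it would get in a full forward pass.
    """
    # One inverse frequency per pair of channels in the head.
    inv_freq = [base ** (-2.0 * k / head_dim) for k in range(head_dim // 2)]
    positions = range(past_len, past_len + seq_len)
    return [[pos * f for f in inv_freq] for pos in positions]


# Decoding the 4th token with 3 cached tokens gives the same angles
# as the 4th row of a full 4-token pass.
full = rotary_angles(4, head_dim=4)
step = rotary_angles(1, head_dim=4, past_len=3)
print(step == full[3:])
```

Without the offset, every incrementally decoded token would be rotated as if it sat at position 0, which would change the attention scores relative to a full recompute.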
_make_causal_mask function
Changed to build the mask the way F.scaled_dot_product_attention behaves. F.scaled_dot_product_attention treats the attention_mask matrix in terms of which attention each token receives. For example, if attention_mask is
[[True, False], [True, True]], it means the first token is "receiving" attention from the first token and not from the second token. This is unlike what we generally end up thinking, which is that the first token gives attention to itself and not to the second one. For this reason, the past_key_values positions are all set to True in the _make_causal_mask function. I have also reversed the inequality above it for the same reason. ---- (line 111 for the inequality, line 114 for setting the attention mask to True)
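As a rough sketch of the mask shape described here, the following plain-Python function builds a boolean causal mask where True means "this key position may take part in attention", which is the boolean convention F.scaled_dot_product_attention uses. The function name and the nested-list representation are assumptions for illustration, not the code in the pull request. Rows are the new query tokens, columns are all key positions (cached ones first).

```python
def make_causal_mask(q_len, past_len=0):
    """Boolean mask of shape (q_len, past_len + q_len); True = allowed.

    Hypothetical helper: query i sits at absolute position past_len + i,
    so it may see every cached key and every new key up to itself.
    """
    kv_len = past_len + q_len
    return [[j <= past_len + i for j in range(kv_len)] for i in range(q_len)]


print(make_causal_mask(2))              # [[True, False], [True, True]]
print(make_causal_mask(1, past_len=3))  # [[True, True, True, True]]
```

With a cache, the columns belonging to past keys are always True, which matches the change described above.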
Class Attention forward method
a) the past_key_value length is passed to the rotary function ---- if/else block added (lines 276-280)
b) the past key and the current key are concatenated after permuting the past key to match the current key's shape ---- (lines 283-290)
c) to keep the key_layer shape consistent with the expected output shape, which is (batch_size, head_dim, seq_length), another permutation is done before creating "present" to return in the output ---- (lines 294-298)
d) added an if/else depending on whether an attention mask has been created or not; currently it is just ignored ---- (lines 305-311)
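Steps b) and c) can be sketched roughly as follows, using nested lists for a single head and no batch dimension. The helper names and the list layout are assumptions for illustration only; the real code works on tensors. The cached key is stored as (head_dim, seq_length), the new key arrives as (seq_length, head_dim), so the cache is transposed before concatenation and transposed back before being returned as "present".

```python
def transpose(matrix):
    """Swap the two axes of a 2-D nested list."""
    return [list(row) for row in zip(*matrix)]


def append_to_key_cache(past_key, new_key):
    """Hypothetical single-head sketch of steps b) and c).

    past_key: cached keys in (head_dim, past_len) layout, or None.
    new_key:  keys for the new tokens in (new_len, head_dim) layout.
    Returns (key, present): key in (total_len, head_dim) for attention,
    present in (head_dim, total_len) to hand back as the updated cache.
    """
    new_rows = [list(row) for row in new_key]
    if past_key is None:
        key = new_rows
    else:
        # b) permute the cache to the current key layout, then concatenate
        # along the sequence axis.
        key = transpose(past_key) + new_rows
    # c) permute back so the returned cache keeps its expected layout.
    present = transpose(key)
    return key, present


key, present = append_to_key_cache(None, [[1, 2], [3, 4]])
key, present = append_to_key_cache(present, [[5, 6]])
print(key)      # [[1, 2], [3, 4], [5, 6]]
print(present)  # [[1, 3, 5], [2, 4, 6]]
```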
Class RWModel prepare_attn_mask method
Removed the src_length > 1 condition for building the causal mask (line 554).
RW causal LM prepare inputs for generation
Read past_key_values from the input coming from the Hugging Face generate method and don't call the convert_to_rw_cache method (lines 749-757)
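For context, here is a simplified sketch of what an input-preparation step with a cache typically looks like. It is a hypothetical plain-Python stand-in, not the code in the pull request: token ids are nested lists, and the assumption is that once past_key_values is available only the last token needs to go through the model.

```python
def prepare_inputs_for_generation(input_ids, past_key_values=None):
    """Hypothetical simplification of the input-preparation step.

    input_ids: list of token-id lists, one per sequence in the batch.
    When a cache is present, keep only the newest token of each sequence,
    since the keys and values of earlier tokens are already cached.
    """
    if past_key_values is not None:
        input_ids = [row[-1:] for row in input_ids]
    return {
        "input_ids": input_ids,
        "past_key_values": past_key_values,
        "use_cache": True,
    }


first = prepare_inputs_for_generation([[5, 6, 7]])
later = prepare_inputs_for_generation([[5, 6, 7, 8]], past_key_values=("k", "v"))
print(first["input_ids"])  # [[5, 6, 7]]
print(later["input_ids"])  # [[8]]
```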
Ya, Falcon seemed slow af, and this seemed to help, thanks!
Is this why Llama is beating Falcon, because they're overlooking simple things? I'm quite surprised since this has far superior licensing to Llama.
I don't know why this was overlooked. The updated code runs at around 2.5x speed on CPU; I have not been able to measure the speedup on GPU for the fp16 model. The quantized version of the model did not seem to get any speedup on GPU from this change. What precision did you use @YoYo1234Qwerty, and how much speedup did you get if it was on GPU?
Also, Falcon models run with the Flash Attention algorithm by default, so they are already heavily I/O optimized. Any change in the algorithm has to keep up with that I/O optimization to get a speedup on GPU. I am guessing they might have chosen not to reuse the past key values because it might not have made any difference in inference speed and would have led to more complicated code. Not sure.
Is there a similar fix for Falcon-7b?
Yeah, here is the one for falcon-7b-instruct: https://huggingface.co/tiiuae/falcon-7b-instruct/discussions/60#64ad2eae4beffa272de2610c. Note that with all these changes I did not get a speedup on GPU for falcon-7b-instruct, only on CPU.
Thanks for your response! Just to make sure: How can I find the exact file path for modelling_RW.py which is used by the falcon model? When I search on my system I find multiple copies of this file (at least one in hub directory and one in modules directory) so how to determine which one should be replaced with the new version?
I did not understand what you mean here. The directory you load the model from with AutoModelForCausalLM.from_pretrained() should contain one modelling_RW.py file, and that is the one to replace with the file from the pull request above.
I understand what you mean now. I guess you are downloading the model directly from the Hugging Face model hub, which ends up creating two locations for modelling_RW.py, including one in the Hugging Face cache modules directory. I would recommend cloning the repo directly (with git lfs installed on your system) and using that directory for everything.