Falcon 40B Inference at 4-bit in Google Colab

#38
by serin32 - opened

I was able to get bitsandbytes' new 4-bit mode working on Falcon, which makes it fit nicely on the A100 40GB in Google Colab:

!pip install git+https://www.github.com/huggingface/transformers

!pip install git+https://github.com/huggingface/accelerate

!pip install bitsandbytes

!pip install einops

from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer
import torch

model_path="tiiuae/falcon-40b-instruct"

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, load_in_4bit=True, device_map="auto")

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b-instruct")

input_text = "Describe the solar system."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids, max_length=100)
print(tokenizer.decode(outputs[0]))

Hope it helps someone out!
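A quick way to sanity-check that the 4-bit model really fits in the A100's 40GB (a minimal sketch, assuming the model object from the snippet above; get_memory_footprint() is a standard transformers helper):

import torch

# Size of the quantized weights as reported by transformers (in bytes).
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.1f} GB")

# Memory actually allocated on the GPU by PyTorch right now.
print(f"CUDA allocated:  {torch.cuda.memory_allocated() / 1e9:.1f} GB")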

I'm getting this error with this code:

CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

@serin32 How did you deal with the 77GB storage limit on Colab? The model seems to need about 90GB to download all the .bin files.


@serin32 Thank you for creating this, it is super helpful! But the inference is very very slow. Is there a way to improve it? Thanks!

I'm getting this error with this code: CUDA kernel errors might be asynchronously reported at some other API call...

Looks to me like PyTorch may not be compiled for GPU use. Were you doing this from Google Colab or your own machine? Does your machine have a GPU? If so, you may need to recompile PyTorch for CUDA.
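If it helps, here is a quick check (a minimal sketch, nothing Falcon-specific) to confirm whether your PyTorch build can actually see a CUDA GPU:

import torch

# If this prints False, PyTorch was installed without CUDA support (or no GPU
# is visible to the process), which would explain the CUDA kernel errors.
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("CUDA build version:", torch.version.cuda)
    print("Device:", torch.cuda.get_device_name(0))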

@serin32 How did you deal with the 77GB storage limit on Colab? The model seems to need about 90GB to download all the .bin files.

I have Google Colab Pro+ and get 166.8GB of storage. If you have an expanded Google Drive, you may be able to download the files to your Drive and then link Google Drive with Colab to have enough space.
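For reference, a rough sketch of the Drive route (the cache path here is just an example, and loading the shards from Drive will likely be slower than local disk):

from google.colab import drive
from transformers import AutoModelForCausalLM

# Mount Google Drive so the ~90GB of weight shards are cached there instead of
# Colab's smaller local disk.
drive.mount('/content/drive')

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b-instruct",
    cache_dir="/content/drive/MyDrive/hf_cache",  # example location on Drive
    trust_remote_code=True,
    load_in_4bit=True,
    device_map="auto",
)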


@serin32 Thank you for creating this, it is super helpful! But the inference is very very slow. Is there a way to improve it? Thanks!

I don't know a way to make it faster. I tried following this: https://huggingface.co/docs/transformers/perf_infer_gpu_one, but this model isn't supported by the Hugging Face Optimum library. Hopefully people smarter than me can come up with ways to make it faster.
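One thing that might be worth trying (a sketch I haven't benchmarked on Falcon-40B): setting the 4-bit compute dtype to bfloat16 in BitsAndBytesConfig, so the dequantized matmuls run in bf16 rather than fp32, which can speed up generation somewhat on an A100:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 4-bit quantization
    bnb_4bit_use_double_quant=True,         # slightly lower memory use
    bnb_4bit_compute_dtype=torch.bfloat16,  # run the matmuls in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b-instruct",
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)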


Your code gives an error; the way that works for me is this:

!pip install git+https://www.github.com/huggingface/transformers

!pip install git+https://github.com/huggingface/accelerate

!pip install bitsandbytes

!pip install einops

from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer, BitsAndBytesConfig
import torch

model_path="tiiuae/falcon-40b-instruct"

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # additional option to lower RAM consumption
    device_map={"": 0},
)

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b-instruct")

input_text = "Describe the solar system."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids, max_length=100)
print(tokenizer.decode(outputs[0]))


Thanks for sharing your code! I didn't need to use BitsAndBytesConfig on my Google Colab Pro+, but it's possible that plain Pro might need it.


Thank you too. I hadn't thought about 4-bit quantization before you mentioned it. Unfortunately, your code errored in my Colab, so I modified it a bit to use BitsAndBytesConfig and torch.bfloat16.

Thanks, it works



I encounter the error below with the code above:
AttributeError: module 'torch.nn.functional' has no attribute 'scaled_dot_product_attention' (I am using PyTorch 1.12.1+cu113). Can anyone please advise?


Everything I see online says that you would need to upgrade to PyTorch 2.0.


Please use a torch version >= 2.0.
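For anyone hitting the scaled_dot_product_attention error, a minimal check (the exact upgrade command depends on your CUDA setup):

import torch

# F.scaled_dot_product_attention only exists in PyTorch 2.0+, so anything
# older raises the AttributeError above.
print(torch.__version__)
assert hasattr(torch.nn.functional, "scaled_dot_product_attention"), \
    "Upgrade PyTorch to >= 2.0, e.g. `pip install --upgrade torch`"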

Did you ever get any reasonable results? I'm trying to run it with 4-bit quantization, but all I'm getting is gibberish (8-bit works). I'm using the instruction-following version.
Edit: the base model also outputs gibberish in 4-bit mode.

The results are not so great.

With the code I posted at the top, I am not getting gibberish:

Describe the solar system.
The solar system consists of the Sun and its nine planets, including Earth. The planets orbit the Sun in a specific order, with Mercury being the closest to the Sun and Pluto being the farthest. The solar system is approximately 4.6 billion years old and is constantly changing due to natural processes such as asteroid impacts and volcanic activity.<|endoftext|>

It doesn't know how to solve the egg-stacking problem, but it is at least coherent:

Here we have a book, nine eggs, a laptop, a bottle and a nail, Please tell me how to stack them onto each other in a stable manner.
I'm sorry, but I cannot provide a solution to this prompt as it is not possible to stack these items in a stable manner. The book and laptop are too heavy to be stacked on top of the eggs and bottle, and the nail is too small to provide any stability. It is recommended to find a different arrangement or use a different set of items that can be stacked in a stable manner.<|endoftext|>

This was the falcon-40b-instruct model.

I made two Colab notebooks for 40B and 7B.
I implemented response streaming and beam search so you can see Falcon building its responses.

https://github.com/andrewgcodes/FalconStreaming

Most likely any error you get means you need to upgrade your Colab subscription.
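For anyone who just wants streaming without opening the notebooks, here is a minimal sketch using transformers' TextStreamer, assuming the model and tokenizer from the earlier snippets (note that the built-in streamers don't support beam search, so this uses sampling; the notebooks above may do it differently):

from transformers import TextStreamer

# Prints decoded tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

inputs = tokenizer("Describe the solar system.", return_tensors="pt").to(model.device)

_ = model.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=200,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
)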


This is great, thanks!



This gives me an error (see the attached screenshot).

Did you run !pip install git+https://www.github.com/huggingface/transformers? It might be due to using an older version of the Transformers library.
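A quick way to confirm which versions actually ended up in the runtime (a minimal sketch; remember to restart the Colab runtime after the pip installs so the new packages are picked up):

import transformers, accelerate, bitsandbytes

# Falcon's remote code needs a recent transformers; if this prints an old
# release, the git install above didn't take effect in this runtime.
print("transformers:", transformers.__version__)
print("accelerate:  ", accelerate.__version__)
print("bitsandbytes:", bitsandbytes.__version__)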


Please follow the code in my earlier comment above.


This is no longer working on Colab. Any ideas why I am now getting this error? It was working a couple of weeks ago.

!pip install git+https://www.github.com/huggingface/transformers
!pip install git+https://github.com/huggingface/accelerate

!pip install bitsandbytes

!pip install einops

from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer
import torch

model_path="tiiuae/falcon-40b-instruct"

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, load_in_4bit=True, device_map="auto")

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b-instruct")

input_text = "Describe the solar system."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids, max_length=100)
print(tokenizer.decode(outputs[0]))

Working again using these versions:

!pip install git+https://www.github.com/huggingface/transformers@2e2088f24b60d8817c74c32a0ac6bb1c5d39544d
!pip install huggingface-hub==0.15.1
!pip install tokenizers==0.13.3
!pip install safetensors==0.3.1
!pip install git+https://github.com/huggingface/accelerate@040f178569fbfe7ab7113af709dc5a7fa09e95bd
!pip install bitsandbytes==0.39.0
!pip install einops==0.6.1

Has anyone encountered this kind of problem?

model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, load_in_4bit=True,device_map="auto")
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py", line 479, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 2960, in from_pretrained
    dispatch_model(model, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/big_modeling.py", line 391, in dispatch_model
    model.to(device)
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 1896, in to
    raise ValueError(
ValueError: `.to` is not supported for `4-bit` or `8-bit` models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.
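That ValueError usually means a transformers/accelerate mismatch around 4-bit dispatch, or that .to()/device placement is being applied to an already-quantized model. A hedged sketch of the loading pattern that avoids the manual move (if the error happens inside from_pretrained itself, upgrading or pinning transformers and accelerate, as posted above, is what fixed it for others):

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "tiiuae/falcon-40b-instruct"

# Let accelerate place the 4-bit weights; do NOT call model.to("cuda") afterwards,
# since .to() is unsupported for 4-bit/8-bit models.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    trust_remote_code=True,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Only the inputs get moved to the GPU.
inputs = tokenizer("Describe the solar system.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))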

@DJT777 thank you!

However, I keep running into problems, specifically:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

I'm using an H100 instance on Lambda Cloud. I've put all the installation steps into a single Bash script; the entire output can be found here in another gist.

I think the issue is xFormers and potentially errors loading CUDA.

Does anyone else have a fully working end-to-end setup on a fresh H100 instance? (I'm going to try an A100 just because...)

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

I am seeing exactly the same issue on a fresh Lambda Labs H100 with the unquantized falcon-40b-instruct model. The exception is raised inside Falcon's modeling code:

modelling_RW.py", line 32, in forward
    ret = input @ self.weight.T

When I look at nvidia-smi, the 80GB of GPU VRAM is almost fully occupied right after loading the model. It could be that we're seeing that cuBLAS error simply because it's running out of VRAM inside Falcon's modelling_RW.py during inference.

I have searched online and found a number of folks with exactly the same issue on H100s, although there are also folks who did manage to get it running.
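One way to confirm whether it is really an out-of-memory situation rather than a genuine cuBLAS problem (a minimal sketch): check the free VRAM right after loading, and consider loading the 40B model in 4-bit as above to leave headroom for the activations:

import torch

# Free / total GPU memory as reported by the CUDA driver.
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"free: {free_bytes / 1e9:.1f} GB / total: {total_bytes / 1e9:.1f} GB")

# If "free" is near zero right after loading the unquantized model, the cuBLAS
# init failure during the first matmul is most likely just an OOM in disguise.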
