Falcon 40B Inference at 4-bit in Google Colab

#38
by serin32 - opened

I was able to get bitsandbytes' new 4-bit mode working on Falcon, which makes it fit nicely on the A100 40GB in Google Colab:

!pip install git+https://www.github.com/huggingface/transformers

!pip install git+https://github.com/huggingface/accelerate

!pip install bitsandbytes

!pip install einops

from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer
import torch

model_path="tiiuae/falcon-40b-instruct"

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, load_in_4bit=True, device_map="auto")

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b-instruct")

input_text = "Describe the solar system."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids, max_length=100)
print(tokenizer.decode(outputs[0]))

Hope it helps someone out!
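A quick way to sanity-check that the 4-bit model really fits in the A100's 40GB (a minimal sketch, assuming the model object from the snippet above; get_memory_footprint() is a standard transformers helper):

import torch

# Size of the quantized weights as reported by transformers (in bytes).
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.1f} GB")

# Memory actually allocated on the GPU by PyTorch right now.
print(f"CUDA allocated:  {torch.cuda.memory_allocated() / 1e9:.1f} GB")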

I'm getting this error with this code:

CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

@serin32 How did you deal with the 77GB storage limit on Colab? The model seems to need about 90GB to download all the .bin files.


@serin32 Thank you for creating this, it is super helpful! But the inference is very very slow. Is there a way to improve it? Thanks!

I'm getting this error with this code: CUDA kernel errors might be asynchronously reported at some other API call...

Looks to me like PyTorch may not be compiled for GPU use. Were you doing this from Google Colab or your own machine? Does your machine have a GPU? If so, you may need to recompile PyTorch for CUDA.
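If it helps, here is a quick check (a minimal sketch, nothing Falcon-specific) to confirm whether your PyTorch build can actually see a CUDA GPU:

import torch

# If this prints False, PyTorch was installed without CUDA support (or no GPU
# is visible to the process), which would explain the CUDA kernel errors.
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("CUDA build version:", torch.version.cuda)
    print("Device:", torch.cuda.get_device_name(0))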

@serin32 How did you deal with the 77GB storage limit on Colab? The model seems to need about 90GB to download all the .bin files.

I have Google Colab Pro+ and get 166.8GB of storage. If you have an expanded Google Drive, you may be able to download the files to your Drive and then link Google Drive with Colab to have enough space.
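For reference, a rough sketch of the Drive route (the cache path here is just an example, and loading the shards from Drive will likely be slower than local disk):

from google.colab import drive
from transformers import AutoModelForCausalLM

# Mount Google Drive so the ~90GB of weight shards are cached there instead of
# Colab's smaller local disk.
drive.mount('/content/drive')

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b-instruct",
    cache_dir="/content/drive/MyDrive/hf_cache",  # example location on Drive
    trust_remote_code=True,
    load_in_4bit=True,
    device_map="auto",
)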


@serin32 Thank you for creating this, it is super helpful! But the inference is very very slow. Is there a way to improve it? Thanks!

I don't know a way to make it faster. I tried following this: https://huggingface.co/docs/transformers/perf_infer_gpu_one, but this model isn't supported by the Hugging Face Optimum library. Hopefully people smarter than me can come up with ways to make it faster.
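One thing that might be worth trying (a sketch I haven't benchmarked on Falcon-40B): setting the 4-bit compute dtype to bfloat16 in BitsAndBytesConfig, so the dequantized matmuls run in bf16 rather than fp32, which can speed up generation somewhat on an A100:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 4-bit quantization
    bnb_4bit_use_double_quant=True,         # slightly lower memory use
    bnb_4bit_compute_dtype=torch.bfloat16,  # run the matmuls in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b-instruct",
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)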


Your code gives an error; the way that works for me is this:

!pip install git+https://www.github.com/huggingface/transformers

!pip install git+https://github.com/huggingface/accelerate

!pip install bitsandbytes

!pip install einops

from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer, BitsAndBytesConfig
import torch

model_path="tiiuae/falcon-40b-instruct"

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # additional option to lower RAM consumption
    device_map={"": 0},
)

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b-instruct")

input_text = "Describe the solar system."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids, max_length=100)
print(tokenizer.decode(outputs[0]))


Thanks for sharing your code! I didn't need to use BitsAndBytesConfig on my Google Colab Pro+, but it's possible that plain Pro might need it.


Thank you too. I hadn't thought about 4-bit quantization before you mentioned it. Unfortunately, your code errored in my Colab, so I modified it a bit to use BitsAndBytesConfig and torch.bfloat16.

Thanks, it works



I encounter the error below with the code above:
AttributeError: module 'torch.nn.functional' has no attribute 'scaled_dot_product_attention' (I am using PyTorch 1.12.1+cu113). Can anyone please advise?


Everything I see online says that you would need to upgrade to PyTorch 2.0.


Please use a torch version >= 2.0.
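For anyone hitting the scaled_dot_product_attention error, a minimal check (the exact upgrade command depends on your CUDA setup):

import torch

# F.scaled_dot_product_attention only exists in PyTorch 2.0+, so anything
# older raises the AttributeError above.
print(torch.__version__)
assert hasattr(torch.nn.functional, "scaled_dot_product_attention"), \
    "Upgrade PyTorch to >= 2.0, e.g. `pip install --upgrade torch`"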

Did you ever get any reasonable results? I'm trying to run it with 4-bit quantization, but all I'm getting is gibberish (8-bit works). I'm using the instruction-following version.
Edit: the base model also outputs gibberish in 4-bit mode.

The results are not so great.

With the code I posted at the top, I am not getting gibberish:

Describe the solar system.
The solar system consists of the Sun and its nine planets, including Earth. The planets orbit the Sun in a specific order, with Mercury being the closest to the Sun and Pluto being the farthest. The solar system is approximately 4.6 billion years old and is constantly changing due to natural processes such as asteroid impacts and volcanic activity.<|endoftext|>

It doesn't know how to solve the egg-stacking problem, but it is at least coherent:

Here we have a book, nine eggs, a laptop, a bottle and a nail, Please tell me how to stack them onto each other in a stable manner.
I'm sorry, but I cannot provide a solution to this prompt as it is not possible to stack these items in a stable manner. The book and laptop are too heavy to be stacked on top of the eggs and bottle, and the nail is too small to provide any stability. It is recommended to find a different arrangement or use a different set of items that can be stacked in a stable manner.<|endoftext|>

This was the falcon-40b-instruct model.

I made two Colab notebooks for 40B and 7B.
I implemented response streaming and beam search so you can see Falcon building its responses.

https://github.com/andrewgcodes/FalconStreaming

Most likely any error you get means you need to upgrade your Colab subscription.
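For anyone who just wants streaming without opening the notebooks, here is a minimal sketch using transformers' TextStreamer, assuming the model and tokenizer from the earlier snippets (note that the built-in streamers don't support beam search, so this uses sampling; the notebooks above may do it differently):

from transformers import TextStreamer

# Prints decoded tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

inputs = tokenizer("Describe the solar system.", return_tensors="pt").to(model.device)

_ = model.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=200,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
)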


This is great, thanks!



This gives me an error (see the attached screenshot).

Did you run !pip install git+https://www.github.com/huggingface/transformers? It might be due to using an older version of the Transformers library.
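A quick way to confirm which versions actually ended up in the runtime (a minimal sketch; remember to restart the Colab runtime after the pip installs so the new packages are picked up):

import transformers, accelerate, bitsandbytes

# Falcon's remote code needs a recent transformers; if this prints an old
# release, the git install above didn't take effect in this runtime.
print("transformers:", transformers.__version__)
print("accelerate:  ", accelerate.__version__)
print("bitsandbytes:", bitsandbytes.__version__)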


Please follow the code in my earlier comment above.


This is no longer working on Colab. Any ideas why I am now getting this error? It was working a couple of weeks ago.

!pip install git+https://www.github.com/huggingface/transformers
!pip install git+https://github.com/huggingface/accelerate

!pip install bitsandbytes

!pip install einops

from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer
import torch

model_path="tiiuae/falcon-40b-instruct"

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, load_in_4bit=True, device_map="auto")

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b-instruct")

input_text = "Describe the solar system."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids, max_length=100)
print(tokenizer.decode(outputs[0]))

Working again using these versions:

!pip install git+https://www.github.com/huggingface/transformers@2e2088f24b60d8817c74c32a0ac6bb1c5d39544d
!pip install huggingface-hub==0.15.1
!pip install tokenizers==0.13.3
!pip install safetensors==0.3.1
!pip install git+https://github.com/huggingface/accelerate@040f178569fbfe7ab7113af709dc5a7fa09e95bd
!pip install bitsandbytes==0.39.0
!pip install einops==0.6.1

Has anyone encountered this kind of problem?

model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, load_in_4bit=True,device_map="auto")
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py", line 479, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 2960, in from_pretrained
    dispatch_model(model, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/big_modeling.py", line 391, in dispatch_model
    model.to(device)
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 1896, in to
    raise ValueError(
ValueError: `.to` is not supported for `4-bit` or `8-bit` models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.
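That ValueError usually means a transformers/accelerate mismatch around 4-bit dispatch, or that .to()/device placement is being applied to an already-quantized model. A hedged sketch of the loading pattern that avoids the manual move (if the error happens inside from_pretrained itself, upgrading or pinning transformers and accelerate, as posted above, is what fixed it for others):

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "tiiuae/falcon-40b-instruct"

# Let accelerate place the 4-bit weights; do NOT call model.to("cuda") afterwards,
# since .to() is unsupported for 4-bit/8-bit models.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    trust_remote_code=True,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Only the inputs get moved to the GPU.
inputs = tokenizer("Describe the solar system.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))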

@DJT777 thank you!

However, I keep running into problems, specifically:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

I'm using an H100 instance on Lambda Cloud. I've put all the installation steps into a single Bash script; the entire output can be found here in another gist.

I think the issue is xFormers and potentially errors loading CUDA.

Does anyone else have a fully working end-to-end setup on a fresh H100 instance? (I'm going to try an A100 just because...)

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

I am seeing exactly the same issue on a fresh Lambda Labs H100 with the unquantized falcon-40b-instruct model. The exception is raised inside Falcon's modeling code:

modelling_RW.py", line 32, in forward
    ret = input @ self.weight.T

When I look at nvidia-smi, the 80GB of GPU VRAM is almost fully occupied right after loading the model. It could be that we're seeing that cuBLAS error simply because it's running out of VRAM inside Falcon's modelling_RW.py during inference.

I have searched online and found a number of folks with exactly the same issue on H100s, although there are also folks who did manage to get it running.
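One way to confirm whether it is really an out-of-memory situation rather than a genuine cuBLAS problem (a minimal sketch): check the free VRAM right after loading, and consider loading the 40B model in 4-bit as above to leave headroom for the activations:

import torch

# Free / total GPU memory as reported by the CUDA driver.
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"free: {free_bytes / 1e9:.1f} GB / total: {total_bytes / 1e9:.1f} GB")

# If "free" is near zero right after loading the unquantized model, the cuBLAS
# init failure during the first matmul is most likely just an OOM in disguise.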
