How to load Falcon-40B on Nvidia H100 GPU with 80GB VRAM?

#18
by airtable - opened

Even with load_in_8bit=True, the model doesn't load on the GPU. How can I load it for inference? @FalconLLM

Technology Innovation Institute org

Unfortunately, I do not currently have access to an H100, so it will be hard to debug issues there specifically. Some people do seem to be able to run on H100: https://www.youtube.com/watch?v=iEuf1PrmZ0Q, maybe seeing what they do might be of some help?

80GB is going to be very tight though, so it will require some CPU offloading with accelerate. If I understand things correctly, accelerate is able to automatically offload to CPU memory, but I am not too familiar with this process.
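
For reference, here is a minimal sketch of what that automatic offload looks like through the accelerate integration in transformers (untested here; the memory caps are hypothetical and would need tuning for the actual machine):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets accelerate fill the GPU first and spill remaining layers to CPU RAM.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "75GiB", "cpu": "120GiB"},  # hypothetical budgets: one 80GB GPU plus host RAM
    offload_folder="offload",                  # spill to disk if CPU RAM also runs out
)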

The smallest setup we've run it on is 4x A10 (4x24GB = 96GB).
Sorry not to be of more help; hopefully other people who have managed to make it run can chime in.

I have not been able to run it at all, even on a massive deployment with 240 GB of GPU memory. I used the code from the main page. It is clearly a memory issue, because 7B runs (but even that takes up more than 50% of VRAM on the 240 GB setup). Any ideas, can you help?

Technology Innovation Institute org

@dstatch, Which/How many GPUs were you trying to run it on?

4 x 80 GB. I tried RunPod and DataCrunch, and it fails in both places. It seems that it is not even a VRAM issue, but one of inter-GPU communication. Really excited about the potential of this, but as it stands, even throwing very large resources at it does not help.

I am not sure what the issue is, just positive that I am not the only one experiencing it, since I have tried in multiple places.

I'm running it with the following code on a DataCrunch 80GB A100 (using 8-bit mode).
Credit where credit is due: I basically lifted this code from Sam Witteveen's excellent YouTube video & Colab:
https://www.youtube.com/watch?v=5M1ZpG2Zz90

It should work on an H100 as well.

import torch
import transformers
from transformers import GenerationConfig, pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig
import bitsandbytes as bnb
from torch.cuda.amp import autocast

model = "tiiuae/falcon-40b"

tokenizer = AutoTokenizer.from_pretrained(model)

model = AutoModelForCausalLM.from_pretrained(model,
        load_in_8bit=True,
        trust_remote_code=True,
        device_map='auto',
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
)



pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

with autocast(dtype=torch.float16):
    sequences = pipeline(
       "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
        max_length=200,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
    )
    for seq in sequences:
        print(f"Result: {seq['generated_text']}")

I'm running the following conda env (kind of a mess, but it seems to work):

conda create --name llm python=3.10
conda activate llm
conda install pytorch==2.0.0 pytorch-cuda=11.8 transformers -c pytorch -c nvidia
pip install einops accelerate
pip install -q -U bitsandbytes
pip install -q -U git+https://github.com/huggingface/transformers.git
pip install -q -U git+https://github.com/huggingface/peft.git
pip install -q -U git+https://github.com/huggingface/accelerate.git
pip -q install sentencepiece Xformers einops
pip -q install langchain

Would this run on five 12GB VRAM (RTX 3060) GPUs? I run a mining rig at home.
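
A rough back-of-envelope, not an authoritative answer: in 8-bit the weights alone take roughly one byte per parameter, so about 40 GB before activations, the KV cache, and any overhead. 5 x 12 GB = 60 GB in total could work only if the layers are sharded across the cards (e.g. device_map="auto"), and the limited PCIe bandwidth typical of a mining rig would likely make inference slow.

# Weights-only estimate for 8-bit Falcon-40B (ignores activations, KV cache, fragmentation).
params = 40e9               # nominal parameter count
bytes_per_param = 1         # int8 quantized weights
print(f"~{params * bytes_per_param / 1e9:.0f} GB of weights")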

Thanks @tinkertank! I tried your install + run on an H100 (Lambda Labs) but I'm getting cuBLAS errors...

File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-40b/b0462812b2f53caab9ccc64051635a74662fc73b/modelling_RW.py", line 252, in forward
    fused_qkv = self.query_key_value(hidden_states)  # [batch_size, seq_length, 3 x hidden_size]
  File "/home/ubuntu/miniconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/llm/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 388, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/ubuntu/miniconda3/envs/llm/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 559, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/ubuntu/miniconda3/envs/llm/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/ubuntu/miniconda3/envs/llm/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 397, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/home/ubuntu/miniconda3/envs/llm/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1781, in igemmlt
    raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!

Any ideas?

I was also getting the error @Adrians was getting. It looked to me like some issue in 8-bit, probably because a wrong operation is being called. So I skipped it, and the below worked for me on an H100 from Lambda.

Just checked, and the below worked on a fresh instance (I ran no other commands).

Install miniconda

We only do this because the torch/CUDA install works smoothly via conda.

# Download latest miniconda.
wget -nc https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Install. -b is used to skip prompt
bash Miniconda3-latest-Linux-x86_64.sh -b

# Activate.
eval "$(/home/ubuntu/miniconda3/bin/conda shell.bash hook)"

# (optional) Add activation cmd to bashrc so you don't have to run the above every time.
printf '\neval "$(/home/ubuntu/miniconda3/bin/conda shell.bash hook)"' >> ~/.bashrc

Setup env

Note: I don't think you need to install transformers from github if you do device_map={"": 0} later instead of device_map=0, but I haven't checked.
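
If you'd rather stay on the released transformers build, the alternative that the note refers to would look something like this (a sketch I haven't verified):

import torch
from transformers import AutoModelForCausalLM

model_id = "tiiuae/falcon-40b"
# {"": 0} maps the whole model onto GPU 0; the plain integer form device_map=0
# needed a newer transformers at the time (see the version note further down the thread).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
)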

# Create and activate env. -y skips confirmation prompt.
conda create -n falcon-env python=3.9 -y
conda activate falcon-env

# newest torch with cuda 11.8
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# For transformers, the commit I installed was f49a3453caa6fe606bb31c571423f72264152fce
pip install -U accelerate einops sentencepiece git+https://github.com/huggingface/transformers.git

Run it

This will use up basically all the memory, but it works.

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer


model = "tiiuae/falcon-40b"
tokenizer = AutoTokenizer.from_pretrained(model)
model = AutoModelForCausalLM.from_pretrained(model, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map=0)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map=0,
)
sequences = pipeline(
    "To make the perfect chocolate chip cookies,",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Hi guys

I am back. The code from @nateraw worked on my Lambda H100 instance; I only needed to upgrade Transformers from 4.29.2 to 4.30.0. Without that it was giving a "device_map int type doesn't have .values()" error, which took me a while to figure out.
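
For anyone hitting the same error, the fix is just a version bump (assuming transformers is managed by pip in this env):

# Integer device_map values need transformers >= 4.30.0.
pip install -U "transformers>=4.30.0"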

But it looks like the model only just fits: GPU memory usage is at 99.1%.


Next up: loading it in LangChain.
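
For reference, a minimal sketch of hooking the pipeline above into LangChain (assuming langchain is installed, e.g. pip install langchain; the exact import path can differ between versions):

from langchain.llms import HuggingFacePipeline

# Wrap the existing transformers text-generation pipeline so LangChain can call it as an LLM.
llm = HuggingFacePipeline(pipeline=pipeline)
print(llm("To make the perfect chocolate chip cookies,"))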

airtable changed discussion status to closed

How much was the inference time on this? @nateraw
