How to load Falcon-40B on Nvidia H100 GPU with 80GB VRAM?

#18
by airtable - opened

Even with load_in_8bit=True, the model doesn't load on the GPU. How can I load it for inference? @FalconLLM

Technology Innovation Institute org

Unfortunately, I do not currently have access to an H100, so it will be hard to debug issues there specifically. Some people do seem to be able to run on H100: https://www.youtube.com/watch?v=iEuf1PrmZ0Q, maybe seeing what they do might be of some help?

80GB is going to be very tight though, so it will require some CPU offloading with accelerate. If I understand things correctly, accelerate is able to automatically offload to CPU memory, but I am not too familiar with this process.
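
For reference, here is a minimal sketch of what that automatic offload looks like through the accelerate integration in transformers (untested here; the memory caps are hypothetical and would need tuning for the actual machine):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets accelerate fill the GPU first and spill remaining layers to CPU RAM.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "75GiB", "cpu": "120GiB"},  # hypothetical budgets: one 80GB GPU plus host RAM
    offload_folder="offload",                  # spill to disk if CPU RAM also runs out
)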

The smallest setup we've run it on is 4x A10 (4x24GB = 96GB).
Sorry not to be of more help; hopefully other people who have managed to make it run can chime in.

I have not been able to run it at all, even on a massive deployment with 240 GB of GPU memory. I used the code from the main page. It is clearly a memory issue, because 7B runs (but even that takes up more than 50% of VRAM on the 240 GB setup). Any ideas, can you help?

Technology Innovation Institute org

@dstatch, Which/How many GPUs were you trying to run it on?

4 x 80 GB. I tried RunPod and DataCrunch, and it fails in both places. It seems that it is not even a VRAM issue, but one of inter-GPU communication. Really excited about the potential of this, but as it stands, even throwing very large resources at it does not help.

I am not sure what the issue is, just positive that I am not the only one experiencing it, since I have tried in multiple places.

I'm running it with the following code on a DataCrunch 80GB A100 (using 8-bit mode).
Credit where credit is due: I basically lifted this code from Sam Witteveen's excellent YouTube video & Colab:
https://www.youtube.com/watch?v=5M1ZpG2Zz90

It should work on an H100 as well.

import torch
import transformers
from transformers import GenerationConfig, pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig
import bitsandbytes as bnb
from torch.cuda.amp import autocast

model = "tiiuae/falcon-40b"

tokenizer = AutoTokenizer.from_pretrained(model)

model = AutoModelForCausalLM.from_pretrained(model,
        load_in_8bit=True,
        trust_remote_code=True,
        device_map='auto',
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
)



pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

with autocast(dtype=torch.float16):
    sequences = pipeline(
       "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
        max_length=200,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
    )
    for seq in sequences:
        print(f"Result: {seq['generated_text']}")

I'm running the following conda env (kind of a mess, but it seems to work):

conda create --name llm python=3.10
conda activate llm
conda install pytorch==2.0.0 pytorch-cuda=11.8 transformers -c pytorch -c nvidia
pip install einops accelerate
pip install -q -U bitsandbytes
pip install -q -U git+https://github.com/huggingface/transformers.git
pip install -q -U git+https://github.com/huggingface/peft.git
pip install -q -U git+https://github.com/huggingface/accelerate.git
pip -q install sentencepiece Xformers einops
pip -q install langchain

Would this run on five 12GB VRAM (RTX 3060) GPUs? I run a mining rig at home.
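
A rough back-of-envelope, not an authoritative answer: in 8-bit the weights alone take roughly one byte per parameter, so about 40 GB before activations, the KV cache, and any overhead. 5 x 12 GB = 60 GB in total could work only if the layers are sharded across the cards (e.g. device_map="auto"), and the limited PCIe bandwidth typical of a mining rig would likely make inference slow.

# Weights-only estimate for 8-bit Falcon-40B (ignores activations, KV cache, fragmentation).
params = 40e9               # nominal parameter count
bytes_per_param = 1         # int8 quantized weights
print(f"~{params * bytes_per_param / 1e9:.0f} GB of weights")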

Thanks @tinkertank! I tried your install + run on an H100 (Lambda Labs) but I'm getting cuBLAS errors...

File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-40b/b0462812b2f53caab9ccc64051635a74662fc73b/modelling_RW.py", line 252, in forward
    fused_qkv = self.query_key_value(hidden_states)  # [batch_size, seq_length, 3 x hidden_size]
  File "/home/ubuntu/miniconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/llm/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 388, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/ubuntu/miniconda3/envs/llm/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 559, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/ubuntu/miniconda3/envs/llm/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/ubuntu/miniconda3/envs/llm/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 397, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/home/ubuntu/miniconda3/envs/llm/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1781, in igemmlt
    raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!

Any ideas?

I was also getting the error @Adrians was getting. It looked to me like some issue in 8-bit, probably because a wrong operation is being called. So I skipped it, and the below worked for me on an H100 from Lambda.

Just checked, and the below worked on a fresh instance (I ran no other commands).

Install miniconda

We only do this because the torch/CUDA install works smoothly via conda.

# Download latest miniconda.
wget -nc https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Install. -b is used to skip prompt
bash Miniconda3-latest-Linux-x86_64.sh -b

# Activate.
eval "$(/home/ubuntu/miniconda3/bin/conda shell.bash hook)"

# (optional) Add activation cmd to bashrc so you don't have to run the above every time.
printf '\neval "$(/home/ubuntu/miniconda3/bin/conda shell.bash hook)"' >> ~/.bashrc

Setup env

Note: I don't think you need to install transformers from github if you do device_map={"": 0} later instead of device_map=0, but I haven't checked.
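
If you'd rather stay on the released transformers build, the alternative that the note refers to would look something like this (a sketch I haven't verified):

import torch
from transformers import AutoModelForCausalLM

model_id = "tiiuae/falcon-40b"
# {"": 0} maps the whole model onto GPU 0; the plain integer form device_map=0
# needed a newer transformers at the time (see the version note further down the thread).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
)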

# Create and activate env. -y skips confirmation prompt.
conda create -n falcon-env python=3.9 -y
conda activate falcon-env

# newest torch with cuda 11.8
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# For transformers, the commit I installed was f49a3453caa6fe606bb31c571423f72264152fce
pip install -U accelerate einops sentencepiece git+https://github.com/huggingface/transformers.git

Run it

This will use up basically all the memory, but it works.

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer


model = "tiiuae/falcon-40b"
tokenizer = AutoTokenizer.from_pretrained(model)
model = AutoModelForCausalLM.from_pretrained(model, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map=0)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map=0,
)
sequences = pipeline(
    "To make the perfect chocolate chip cookies,",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Hi guys

I am back. The code from @nateraw worked on my Lambda H100 instance; I only needed to upgrade Transformers from 4.29.2 to 4.30.0. Without that it was giving a "device_map int type doesn't have .values()" error, which took me a while to figure out.
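
For anyone hitting the same error, the fix is just a version bump (assuming transformers is managed by pip in this env):

# Integer device_map values need transformers >= 4.30.0.
pip install -U "transformers>=4.30.0"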

But it looks like the model only just fits: GPU memory usage is at 99.1%.


Next up: loading it in LangChain.
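
For reference, a minimal sketch of hooking the pipeline above into LangChain (assuming langchain is installed, e.g. pip install langchain; the exact import path can differ between versions):

from langchain.llms import HuggingFacePipeline

# Wrap the existing transformers text-generation pipeline so LangChain can call it as an LLM.
llm = HuggingFacePipeline(pipeline=pipeline)
print(llm("To make the perfect chocolate chip cookies,"))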

airtable changed discussion status to closed

How much was the inference time on this? @nateraw
