multiple gpu?

#3
by bdambrosio - opened

3x 4090s; it won't load in 8-bit using the script on the model card. I tried device_map='auto', tried reducing n_ctxt in config.json, tried setting load_in_8bit_fp32/16 in the quantization config, and even tried removing the "don't quantize mamba" setting.
Am I out of luck?
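
Roughly what I'm running, a sketch following the model card's 8-bit example with device_map='auto' added (the local path is mine, and the exact arguments may differ slightly from my actual script):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 8-bit quantization, keeping the Mamba blocks unquantized as the model card suggests
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["mamba"],
)

model = AutoModelForCausalLM.from_pretrained(
    "/home/bruce/Downloads/models/Jamba",  # local checkout of the Jamba weights
    torch_dtype=torch.bfloat16,
    device_map="auto",                     # try to shard layers across the 3x 4090s
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained("/home/bruce/Downloads/models/Jamba")
```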

Traceback (most recent call last):
  File "/home/bruce/Downloads/owl/tests/owl/jamba.py", line 9, in <module>
    model = AutoModelForCausalLM.from_pretrained("/home/bruce/Downloads/models/Jamba",
  File "/home/bruce/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
    return model_class.from_pretrained(
  File "/home/bruce/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3481, in from_pretrained
    hf_quantizer.validate_environment(device_map=device_map)
  File "/home/bruce/.local/lib/python3.10/site-packages/transformers/quantizers/quantizer_bnb_8bit.py", line 86, in validate_environment
    raise ValueError(
ValueError:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the
quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules
in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom `device_map` to
`from_pretrained`. Check
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.
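
The error points at FP32 CPU offload with a custom device_map, so something like the sketch below would be the next thing to try (untested on my side; note that in BitsAndBytesConfig the flag is spelled llm_int8_enable_fp32_cpu_offload, and the max_memory values are placeholders for 24 GB cards):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Allow modules that don't fit on the GPUs to stay on the CPU in 32-bit
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["mamba"],
    llm_int8_enable_fp32_cpu_offload=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "/home/bruce/Downloads/models/Jamba",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
    device_map="auto",
    # Cap per-GPU usage so accelerate spills the remainder to CPU RAM;
    # the numbers below are placeholders, not tuned values.
    max_memory={0: "22GiB", 1: "22GiB", 2: "22GiB", "cpu": "100GiB"},
)
```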

I have 4 Tesla M40s and a 1080: 100 GB of VRAM total and 128 GB of system RAM.

How many resources do we even need to run this thing?

@Tom-Neverwinter - I believe the problem is that, at the moment, it won't accept multi-GPU configurations.
Hopefully this isn't a hard fix...

But do note that 100 GB is not enough to run a 51B+ parameter model in FP16; even if/when it does support multi-GPU, you will still need to quantize, like me.
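
For a rough sense of scale, here is the weights-only arithmetic (taking Jamba at roughly 52B total parameters, and ignoring the KV/SSM cache, activations, and framework overhead):

```python
# Back-of-envelope VRAM needed just for the model weights
params = 52e9  # approximate total parameter count
for name, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB")
# fp16/bf16: ~104 GB, int8: ~52 GB, 4-bit: ~26 GB
```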

@Tom-Neverwinter Interesting. I have two A6000s, and it works for me in 8-bit but doesn't work in half-precision (though the model successfully loads into VRAM). It says that "Fast Mamba kernels are not available. Make sure they are installed and that the mamba module is on a CUDA device." I reinstalled mamba-ssm from source, but I'm still getting this message.
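
In case it helps anyone hitting the same warning, a quick sanity check I'd run (a sketch: the imports mirror what the Jamba fast path looks for, which needs causal-conv1d installed alongside mamba-ssm, and the hf_device_map line assumes the model was loaded with a device_map):

```python
# Check that the fast-path kernels import cleanly and that CUDA is visible
import torch

try:
    from mamba_ssm.ops.selective_scan_interface import mamba_inner_fn, selective_scan_fn  # noqa: F401
    from causal_conv1d import causal_conv1d_fn, causal_conv1d_update  # noqa: F401
    print("fast mamba kernels importable")
except ImportError as exc:
    print("kernel import failed:", exc)

print("CUDA available:", torch.cuda.is_available())

# After loading, verify no mamba layers were placed on the CPU, e.g.:
# print({name: dev for name, dev in model.hf_device_map.items() if dev == "cpu"})
```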

[Screenshots attached: Screenshot from 2024-03-28 23-41-22.png, Screenshot from 2024-03-28 23-42-50.png]
