Nvidia H100 Finetuning Error on BitsandBytes

#82
by ashmitbhattarai - opened

I am trying to fine-tune the model on an H100 80GB GPU. The same code runs on an A100 40GB. It is a 4-bit quantized model:


import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

# 4-bit NF4 quantization with bfloat16 compute
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="cuda:0",
    trust_remote_code=True,
    # max_seq_len=8192
    # use_safetensors=True
)

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_fast=True,
)

## Prepare the model for k-bit training
base_model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(base_model)

# ... trainer setup omitted ...
trainer.train()

During training I get: CUDA error: an illegal instruction was encountered.

Again, the same code runs fine on the A100, just not on the H100. FYI: inference with the base model works fine; only the fine-tuning step fails.

I am using a notebook as a reference.
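In case it helps with debugging, here is a small environment check I run to rule out a kernel built without support for the H100's compute capability (sm_90). It only uses standard torch/bitsandbytes attributes; the suggestion that older bitsandbytes builds may lack sm_90 kernels is my assumption, not a confirmed cause.

import torch
import bitsandbytes as bnb

# Report the toolchain the quantized kernels will run against.
print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("bitsandbytes:", bnb.__version__)

# The H100 reports compute capability (9, 0). If the installed bitsandbytes
# build does not include sm_90 kernels, that can (as an assumption) surface
# as illegal-instruction errors during training.
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {torch.cuda.get_device_name(0)} (sm_{major}{minor})")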

I have the same issue...

I have the same issue. How can this be solved?
