QLoRA fine-tuning with a longer sequence length (max_length=2048, padding=True) causes RuntimeError: CUDA error: device-side assert triggered; shortening the length to 512 works!

#46
by nps798

Hey everyone. I know there are some Google-searchable articles saying that "RuntimeError: CUDA error: device-side assert triggered" may be linked to a mismatch between the number of classification labels and the model's output layer, but those are about classification with BERT, which is not the case here (Mistral is a decoder-only transformer).

So, I ran into the problem stated in the title. I am using the Hugging Face transformers library together with bitsandbytes to do QLoRA fine-tuning.
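
For context, the model itself is loaded in 4-bit via bitsandbytes, roughly along these lines (a minimal sketch; the model id and the exact quantization flags here are placeholders, not necessarily the exact values I used):

import torch
from peft import prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"  # assumed base model

# typical QLoRA-style 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # standard step before attaching LoRA adapters
tokenizer = AutoTokenizer.from_pretrained(model_id)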

The error

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
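
For reference, the CUDA_LAUNCH_BLOCKING flag suggested in the traceback has to be set before anything touches the GPU, e.g.:

import os

# must be set before the first CUDA call (i.e. before the model is loaded),
# so the kernel launch that actually fails shows up in the stack trace
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"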

Some tests that I've done:

When using training data of longer sequence length, I encounter the error:

tokenizer(max_length=2048, padding=True), with the tokenizer initialized without setting model_max_length or padding_side, causes this problem (rough sketch below).
I have plotted the sequence length of all the data; the examples have a spread of different lengths.
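
A rough sketch of the failing setup (the dataset and the "text" column are placeholders, mapping the pad token to EOS is an assumption since Mistral's tokenizer ships without one, and truncation=True is added here so max_length actually takes effect):

from datasets import Dataset
from transformers import AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"                       # assumed base model
raw = Dataset.from_dict({"text": ["prompt 1", "prompt 2"]})  # stand-in for my real dataset

tokenizer = AutoTokenizer.from_pretrained(model_id)  # no model_max_length / padding_side set
tokenizer.pad_token = tokenizer.eos_token            # assumption: pad token mapped to EOS

def tokenize_fn(batch):
    # padding=True pads to the longest example in the batch, so lengths still vary
    return tokenizer(batch["text"], max_length=2048, padding=True, truncation=True)

tokenized_training_data = raw.map(tokenize_fn, batched=True, remove_columns=["text"])

# look at the sequence-length distribution
lengths = [len(ids) for ids in tokenized_training_data["input_ids"]]
print(f"min={min(lengths)}, max={max(lengths)}, n={len(lengths)}")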

When using training data of shorter sequence length, there is no error and I was able to do further QLoRA fine-tuning without problems!

tokenizer(max_length=512, padding=True), with the tokenizer initialized with model_max_length=512 and padding_side="left" (roughly as in the sketch below).
I manually removed data with a length over 512.
As before, the tokenized examples still have different lengths.
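
And the working 512 setup, continuing from the sketch above (same placeholder names; here I skip truncation and instead drop the long examples after tokenization):

tokenizer = AutoTokenizer.from_pretrained(
    model_id,              # same assumed base model as above
    model_max_length=512,
    padding_side="left",
)
tokenizer.pad_token = tokenizer.eos_token  # assumption, as above

def tokenize_fn(batch):
    return tokenizer(batch["text"], max_length=512, padding=True)

tokenized = raw.map(tokenize_fn, batched=True, remove_columns=["text"])

# manually remove examples whose tokenized length exceeds 512
tokenized_training_data = tokenized.filter(lambda ex: len(ex["input_ids"]) <= 512)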

What's happening under the hood?

My settings:

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
model.print_trainable_parameters()

run_name =  '1008-....XXXX'
local_path = '/.../mistral-chinese-alpaca-qlora'

training_arguments = TrainingArguments(
    output_dir=f"{local_path}/output_dir",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=6,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    evaluation_strategy="steps",
    eval_steps=1000,
    save_strategy="steps",
    save_steps=2000,
    logging_steps=100,
    num_train_epochs=1,
    report_to="wandb",
    run_name=run_name,
)

trainer = Trainer(
    model=model,
    train_dataset=tokenized_training_data,
    eval_dataset=tokenized_testing_data,
    args=training_arguments,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
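
The error above then shows up once training starts in the 2048 setup, while the 512 setup runs fine:

trainer.train()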
