QLoRA fine-tuning with a longer sequence length (max_length=2048, padding=True) causes RuntimeError: CUDA error: device-side assert triggered; shortening the length to 512 works!
Hey everyone, I know there are some Google-searchable articles saying that "RuntimeError: CUDA error: device-side assert triggered" may be linked to the number of classification labels not matching the model's output layer, but those are about BERT classification, which is not the case here (a decoder-only transformer such as Mistral).
So, I'm running into the problem stated in the title. I am using the Hugging Face transformers library together with bitsandbytes to do QLoRA fine-tuning.
The error
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
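As the message says, the failing kernel may be reported at a later API call, so the trace can point at the wrong line. A quick way to get an accurate trace is to force synchronous kernel launches before anything touches CUDA (the script name below is just a placeholder):

# Force synchronous CUDA kernel launches so the Python stack trace
# points at the op that actually triggered the assert.
# From the shell:
#   CUDA_LAUNCH_BLOCKING=1 python train_qlora.py
# or at the very top of the script, before torch initializes CUDA:
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"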
Some tests that I've done:
When using training data with longer sequences, I encounter the error:
tokenizer(..., max_length=2048, padding=True), with the tokenizer initialized without setting model_max_length or padding_side, causes this problem (see the sketch after this list).
I have plotted the sequence lengths of all the data; the lengths vary quite a bit across examples.
When using training data with shorter sequences, there is no error and I was able to continue the QLoRA fine-tuning without issues:
tokenizer(..., max_length=512, padding=True), with the tokenizer initialized with (model_max_length=512, padding_side="left")
I manually removed data longer than 512 tokens.
As before, the tokenized examples still have different lengths.
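For reference, here is roughly what the two tokenizer setups look like. This is only a sketch, not my exact code: the model name and the texts list are placeholders, and how the pad token is set is not shown in the post.

from transformers import AutoTokenizer

texts = ["example one", "example two"]      # placeholder data
model_name = "mistralai/Mistral-7B-v0.1"    # placeholder model name

# Failing setup: no model_max_length / padding_side at init,
# max_length=2048 and padding=True at call time.
tok_long = AutoTokenizer.from_pretrained(model_name)
tok_long.pad_token = tok_long.eos_token     # padding needs a pad token; Mistral has none by default
batch_long = tok_long(texts, max_length=2048, padding=True)

# Working setup: model_max_length=512 and padding_side="left" at init,
# plus manually dropping examples longer than 512 tokens.
tok_short = AutoTokenizer.from_pretrained(
    model_name, model_max_length=512, padding_side="left"
)
tok_short.pad_token = tok_short.eos_token
lengths = [len(tok_short(t)["input_ids"]) for t in texts]
texts_short = [t for t, n in zip(texts, lengths) if n <= 512]
batch_short = tok_short(texts_short, max_length=512, padding=True)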
What's happening under the hood?
My settings:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

run_name = '1008-....XXXX'
local_path = '/.../mistral-chinese-alpaca-qlora'

training_arguments = TrainingArguments(
    output_dir=f"{local_path}/output_dir",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=6,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    evaluation_strategy="steps",
    eval_steps=1000,
    save_strategy="steps",
    save_steps=2000,
    logging_steps=100,
    num_train_epochs=1,
    report_to='wandb',
    run_name=run_name,
)

trainer = Trainer(
    model=model,
    train_dataset=tokenized_training_data,
    eval_dataset=tokenized_testing_data,
    args=training_arguments,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)