Fine-tuning is not improving domain knowledge; could you help?
Hi, thank you for the awesome model; I really like its output. I am trying to fine-tune it for a domain-specific use case with this QLoRA configuration:
sequence_len: 4000
sample_packing: true
pad_to_sequence_len: true
trust_remote_code: true
adapter: qlora
lora_r: 256
lora_alpha: 512
lora_dropout: 0.05
lora_target_linear: true
lora_target_modules:
- q_proj
- v_proj
- k_proj
- o_proj
- gate_proj
- down_proj
- up_proj
gradient_accumulation_steps: 1
micro_batch_size: 2
num_epochs: 4
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.00002
warmup_steps: 100
evals_per_epoch: 4
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
Although the loss is going down, the plot looks like this: [training loss plot]
But when I evaluate it, the performance is worse than the original model:
Phi-3-mini-4k-instruct (base) - average domain accuracy: 40%
QLoRA Phi-3-mini-4k-instruct (with the above config) - average domain accuracy: 35%
Are there any issues with the hyperparameters (e.g., the learning rate), or do you have any recommendations on how we can fine-tune this model?
Try lowering the sequence length you are using to tune the model to something like 2k.
We have seen several reports of the model going off the rails with extremely long prompts.
A combination of an “off the rails” instruct model and additional long-sequence fine-tuning could be diminishing performance.
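In the config above that would be a one-line change; the 2048 here is an assumed value for illustration, not a tested recommendation:

# assumed shorter context to avoid the long-prompt failure mode
sequence_len: 2048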
If possible, maybe run just a couple of steps with and without LoRA and compare the performance? Or even try disabling the dropout?
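As a rough sketch of such a diagnostic run in the same config style (the zeroed dropout and the max_steps value are assumptions for a cheap A/B comparison, not final settings):

# hypothetical quick run with dropout disabled; compare against an
# otherwise identical run keeping the original lora_dropout: 0.05
lora_dropout: 0.0
max_steps: 20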
Are you using a validation set during training? That would give us something to track performance on.
Since the training loss is going down while the final performance is also going down, there might be some inflection point where the model starts to overfit.
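If no validation split is configured yet, a minimal sketch in the same config style would be something like this (the 5% split and the patience value are assumptions, not tuned recommendations):

# hold out a small validation split so evals_per_epoch: 4 has data to score
val_set_size: 0.05
# optional guard for the suspected overfitting point; in Axolotl this may
# require the save/eval cadence to line up
early_stopping_patience: 3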