Fine-tuning is not improving domain knowledge? It is very complicated, could you help?

#50
by aaditya - opened

Hi, thank you for the awesome model; I really like its output. I am trying to fine-tune it for a domain-specific use case with this QLoRA configuration:

sequence_len: 4000
sample_packing: true
pad_to_sequence_len: true
trust_remote_code: true
adapter: qlora
lora_r: 256
lora_alpha: 512
lora_dropout: 0.05
lora_target_linear: true
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - down_proj
  - up_proj

gradient_accumulation_steps: 1
micro_batch_size: 2
num_epochs: 4
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.00002
warmup_steps: 100
evals_per_epoch: 4
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
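One thing worth double-checking in the config above is the LoRA update strength. Assuming axolotl follows the standard LoRA scaling rule (the low-rank update BA is multiplied by alpha / r, as in the reference implementation), a minimal sketch of what these values imply:

```python
# Sketch of how lora_r / lora_alpha interact, assuming the standard LoRA
# scaling rule where the low-rank update BA is multiplied by alpha / r.
def lora_scaling(lora_r: int, lora_alpha: int) -> float:
    """Effective multiplier applied to the low-rank update BA."""
    return lora_alpha / lora_r

# Values from the config above: r=256, alpha=512 -> a scaling of 2.0,
# which is a fairly aggressive update strength on top of a high rank.
scale = lora_scaling(256, 512)
print(scale)  # 2.0
```

With r=256 and alpha=512, the adapter update is doubled on top of an already-high rank, which can make domain fine-tuning more prone to drifting away from the base model's behavior.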

Although the loss is going down , the plot looks like this

[Screenshot 2024-05-01: training-loss curve trending downward over steps]

But on evaluation, the performance is worse than the original model:

Phi-3-mini-4k-instruct (base): average domain accuracy 40%
Phi-3-mini-4k-instruct (QLoRA, config above): average domain accuracy 35%

Are there any issues with the hyperparameters (e.g., the learning rate), or do you have any recommendations on how to fine-tune this model?

Microsoft org

Try lowering the sequence length you are using to tune the model, to something like 2k.

We have seen several reports of the model going off the rails with extremely long prompts.

A combination of an "off-the-rails" instruct model + additional long-sequence fine-tuning could be diminishing the performance.
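In axolotl terms this suggestion is a one-line config change (`sequence_len: 2048` instead of `4000`). A minimal sketch of what the 2k cap means for the training samples, using dummy token IDs rather than a real tokenizer:

```python
# Sketch of the suggested change: cap training sequences at 2k tokens
# instead of 4k, so no sample pushes the model into the long-prompt
# regime where it reportedly goes off the rails.
def truncate_to_max_len(token_ids, max_len=2048):
    """Drop tokens beyond max_len so no training sample exceeds the cap."""
    return token_ids[:max_len]

sample = list(range(4000))  # a 4000-token sample, as in the original config
print(len(truncate_to_max_len(sample)))  # 2048
```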

@gugarosa Update: I tried two epochs with a 2k sequence length and the same config as above. As before, the loss went down, but during evaluation the accuracy is still worse than the base model.

Microsoft org

If possible, maybe try just a couple of steps with and without LoRA and compare the performance? Or even try disabling the dropout?

@gugarosa I tried all three: full fine-tuning (FFT), QLoRA, and LoRA. The issue is the same each time: the performance goes down while the loss decreases well.

Microsoft org

Are you using a validation set during training? Maybe that is something we can track the performance on.

Since the training loss is going down while the final performance is also going down, there might be some inflection point where the model starts to overfit.
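The inflection point can be made visible by tracking validation loss at each evaluation and keeping the checkpoint where it bottoms out, even though training loss keeps falling. A minimal sketch with illustrative numbers:

```python
# Sketch of the suggestion above: track validation loss per evaluation and
# keep the checkpoint where it is lowest; past that point the model is
# likely overfitting. The loss values here are illustrative, not measured.
def best_checkpoint(eval_losses):
    """Index of the evaluation with the lowest validation loss."""
    return min(range(len(eval_losses)), key=lambda i: eval_losses[i])

train_loss = [2.1, 1.6, 1.2, 0.9, 0.7, 0.5]  # keeps decreasing
eval_loss = [2.0, 1.7, 1.5, 1.6, 1.8, 2.1]   # bottoms out, then rises

print(best_checkpoint(eval_loss))  # 2 -> overfitting begins after this eval
```

With `saves_per_epoch: 1` and `evals_per_epoch: 4` as in the config above, restoring an earlier saved checkpoint near that minimum would be the practical fix.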
