Is the SFT phase a full finetune or a LoRA?

#3 · opened by dividebythree

First, congrats on the excellent model! I've used it only a bit for RP, and it seems noticeably better than even Mixtral-Instruct.

I know there has been lots of discussion on how to finetune Mixtral, e.g. whether something is broken with the load-balancing loss in Transformers, or whether DPO is the secret sauce... I see that you uploaded a QLoRA adapter for the DPO phase. But what about SFT? Was that also a QLoRA, or was it a full finetune?
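Just to make sure we mean the same thing, here is a rough sketch of the distinction I'm asking about, written with stock Transformers/PEFT (the model id is just the base Mixtral and the LoRA settings are arbitrary placeholders, not your actual training code):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "mistralai/Mixtral-8x7B-v0.1"  # placeholder: base Mixtral, not necessarily your starting checkpoint

# (a) QLoRA-style SFT: base weights frozen in 4-bit, only the small adapter matrices are trained
bnb_cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
qlora_base = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_cfg)
lora_cfg = LoraConfig(
    r=16, lora_alpha=32,  # example values only
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
qlora_model = get_peft_model(qlora_base, lora_cfg)

# (b) Full-finetune SFT: all base weights loaded in bf16 and left trainable
full_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
```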

If this model ends up being as good as it seems at first glance, it would be very helpful for the community to know what makes it so good. Perhaps a full-finetune SFT phase explains everything (the other Mixtral finetunes all seem to be QLoRA). Any other non-obvious training details you could share would also be much appreciated.

NousResearch org

The SFT phase was a full finetune.

NousResearch org

Nothing non-standard.

Hi @teknium, great work!
May I ask a bit about the training settings? Did you freeze any part of the model (like the gating layers), or did you apply an auxiliary loss or any other load-balancing trick during training?
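For concreteness, this is the kind of thing I mean, just a sketch of what it could look like with the Mixtral class in Transformers (not assuming this is what you actually did; the coefficient is an example value):

```python
from transformers import MixtralForCausalLM

model = MixtralForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

# Freeze only the MoE gating (router) layers, leave everything else trainable
for name, param in model.named_parameters():
    if ".block_sparse_moe.gate." in name:
        param.requires_grad = False

# Have Transformers add the auxiliary load-balancing loss to the LM loss
model.config.output_router_logits = True
model.config.router_aux_loss_coef = 0.02  # example coefficient
```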
