Fine-tuned with the SFT trainer (base model is DPO)