Is the drop in so many metrics expected? Why do SFT first if it makes the model worse? Why not do DPO directly on the Mistral base model?

dball (Owner):

Comparison (QLoRA SFT vs. base model):

| Model | Average | GSM8K | Winogrande | TruthfulQA | MMLU | HellaSwag | ARC |
|---|---|---|---|---|---|---|---|
| dball/zephyr-7b-sft-qlora | 59.8 | 34.12 | 78.22 | 42.32 | 61.9 | 82.49 | 59.73 |
| mistralai/Mistral-7B-v0.1 | 60.97 | 37.83 | 78.37 | 42.15 | 64.16 | 83.31 | 59.98 |

Every metric drops slightly after SFT except TruthfulQA, which improves a little.

See also https://huggingface.co/datasets/open-llm-leaderboard/details_dball__zephyr-7b-sft-qlora

dball (Owner):

The drop is expected; with full finetuning it is even stronger (catastrophic forgetting is usually milder with QLoRA, since the base weights stay frozen and only the low-rank adapters are trained):

| Model | Average | GSM8K | Winogrande | TruthfulQA | MMLU | HellaSwag | ARC |
|---|---|---|---|---|---|---|---|
| alignment-handbook/zephyr-7b-sft-full | 57.56 | 28.73 | 76.09 | 41.71 | 60.31 | 80.82 | 57.68 |
| mistralai/Mistral-7B-v0.1 | 60.97 | 37.83 | 78.37 | 42.15 | 64.16 | 83.31 | 59.98 |
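For context, a minimal sketch of the two configs that define a QLoRA setup, assuming the `transformers` and `peft` libraries; the hyperparameters and target modules here are illustrative, not the exact recipe used for this model:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Only these small low-rank adapters receive gradients, which is one
# reason catastrophic forgetting tends to be milder than in full finetuning.
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```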

But is this SFT a necessary first step before DPO? If not, doing DPO directly on mistralai/Mistral-7B-v0.1 would surely be more promising.
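For reference, the per-pair DPO objective itself is simple regardless of which model it is applied to. A pure-Python sketch, where the log-probabilities (summed token log-likelihoods of the chosen/rejected completions under the policy and the frozen reference model) and the beta value are illustrative numbers:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin)."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # completion than the reference model does.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written as log1p(exp(-margin)) for stability.
    return math.log1p(math.exp(-margin))

# Positive margin (policy already prefers the chosen answer more than the
# reference does) gives a loss below log 2 ≈ 0.693.
print(dpo_loss(-10.0, -12.0, -11.0, -11.5, beta=0.1))  # → ≈ 0.621
```

Note that the reference model is whatever the policy is initialized from, so DPO on the base model is mechanically the same procedure, just with `mistralai/Mistral-7B-v0.1` as both policy init and reference.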
