Is the drop in so many metrics expected? Why do SFT first if it makes the model worse? Why not do DPO directly on the Mistral base model?

dball (Owner):

Comparison (QLoRA SFT vs. base model):

| Model | Average | GSM8K | Winogrande | TruthfulQA | MMLU | HellaSwag | ARC |
|---|---|---|---|---|---|---|---|
| dball/zephyr-7b-sft-qlora | 59.8 | 34.12 | 78.22 | 42.32 | 61.9 | 82.49 | 59.73 |
| mistralai/Mistral-7B-v0.1 | 60.97 | 37.83 | 78.37 | 42.15 | 64.16 | 83.31 | 59.98 |

Every metric drops slightly after SFT except TruthfulQA, which improves a little.

See also https://huggingface.co/datasets/open-llm-leaderboard/details_dball__zephyr-7b-sft-qlora

dball (Owner):

The drop is expected; with full finetuning it is even stronger (catastrophic forgetting is usually milder with QLoRA, since the base weights stay frozen and only the low-rank adapters are trained):

| Model | Average | GSM8K | Winogrande | TruthfulQA | MMLU | HellaSwag | ARC |
|---|---|---|---|---|---|---|---|
| alignment-handbook/zephyr-7b-sft-full | 57.56 | 28.73 | 76.09 | 41.71 | 60.31 | 80.82 | 57.68 |
| mistralai/Mistral-7B-v0.1 | 60.97 | 37.83 | 78.37 | 42.15 | 64.16 | 83.31 | 59.98 |
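For context, a minimal sketch of the two configs that define a QLoRA setup, assuming the `transformers` and `peft` libraries; the hyperparameters and target modules here are illustrative, not the exact recipe used for this model:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Only these small low-rank adapters receive gradients, which is one
# reason catastrophic forgetting tends to be milder than in full finetuning.
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```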

But is this SFT a necessary first step before DPO? If not, doing DPO directly on mistralai/Mistral-7B-v0.1 would surely be more promising.
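For reference, the per-pair DPO objective itself is simple regardless of which model it is applied to. A pure-Python sketch, where the log-probabilities (summed token log-likelihoods of the chosen/rejected completions under the policy and the frozen reference model) and the beta value are illustrative numbers:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin)."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # completion than the reference model does.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written as log1p(exp(-margin)) for stability.
    return math.log1p(math.exp(-margin))

# Positive margin (policy already prefers the chosen answer more than the
# reference does) gives a loss below log 2 ≈ 0.693.
print(dpo_loss(-10.0, -12.0, -11.0, -11.5, beta=0.1))  # → ≈ 0.621
```

Note that the reference model is whatever the policy is initialized from, so DPO on the base model is mechanically the same procedure, just with `mistralai/Mistral-7B-v0.1` as both policy init and reference.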
