GPT-2 small (124M) fine-tuned on the tulu2-sft mixture, then DPO 40k on UltraFeedback.
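
For reference, the DPO stage named above optimizes a pairwise preference loss against a frozen reference model (here presumably the SFT checkpoint). The following is a minimal sketch for a single (chosen, rejected) pair, assuming the summed sequence log-probabilities have already been computed. The function name `dpo_loss`, its argument names, and `beta=0.1` are illustrative assumptions, not settings taken from this run.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (hypothetical helper, beta is assumed).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin: positive when the policy favors the chosen response
    # more strongly than the reference does.
    x = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(x)), computed stably as softplus(-x).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy equals the reference, the margin is zero and the loss is log 2; it falls below that as the policy shifts probability toward the chosen response relative to the reference.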