Adapter trained with DPO on the GSM8K preference dataset with chain-of-thought (CoT) for 1 epoch
Commit 9c07365 (verified), committed by valerielucro on Jun 26