Adapter trained with DPO on the gsm8k preference dataset with CoT for 1 epoch
Commit 487d117 (verified), committed by valerielucro on Jun 26