phi-2-dpo-renew1

This model is a fine-tuned version of lole25/phi-2-sft-lora-ultrachat on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

Loss: 0.5780
Rewards/chosen: -0.8278
Rewards/rejected: -1.2811
Rewards/accuracies: 0.6305
Rewards/margins: 0.4532
Logps/rejected: -371.9221
Logps/chosen: -360.3287
Logits/rejected: -0.0200
Logits/chosen: -0.0541
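
The reward metrics above are related in a simple way if they follow the usual DPO logging convention. This is an assumption, since the card does not name the trainer. Under that convention, the margin is the chosen reward minus the rejected reward, and the loss for one preference pair is -log(sigmoid(margin)). A minimal sketch that checks the reported numbers against this relation:

```python
import math

# Values copied from the evaluation results above.
rewards_chosen = -0.8278
rewards_rejected = -1.2811
reported_margin = 0.4532

# Assumption: margin = chosen reward - rejected reward (usual DPO convention).
margin = rewards_chosen - rewards_rejected
print(f"margin from mean rewards: {margin:.4f} (reported: {reported_margin})")


def dpo_pair_loss(reward_margin: float) -> float:
    """Sigmoid DPO loss for one preference pair: -log(sigmoid(margin))."""
    return math.log1p(math.exp(-reward_margin))


# With a zero margin the loss is ln(2), close to the ~0.693 logged at the first steps.
loss_at_zero = dpo_pair_loss(0.0)

# Loss evaluated at the mean margin. The reported loss is a mean over per-pair
# losses, so it does not have to equal this value.
loss_at_mean_margin = dpo_pair_loss(margin)
print(f"loss at zero margin: {loss_at_zero:.4f}")
print(f"loss at mean margin: {loss_at_mean_margin:.4f}")
```

The small gap between the recomputed and reported margin is consistent with rounding of the logged means.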

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-06
train_batch_size: 4
eval_batch_size: 4
seed: 42
distributed_type: multi-GPU
gradient_accumulation_steps: 4
total_train_batch_size: 16
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1
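
As a rough illustration of how these settings combine, the sketch below computes the effective batch size and a cosine learning-rate schedule with linear warmup. The schedule formula is the common warmup-then-cosine-decay form and is assumed here, not taken from the training code. `TOTAL_STEPS` is a hypothetical value chosen near the last logged step. The listed total of 16 equals 4 × 4, so the device count is kept as a parameter rather than guessed.

```python
import math

# Hyperparameters copied from the list above.
learning_rate = 5e-06
train_batch_size = 4
gradient_accumulation_steps = 4
warmup_ratio = 0.1

TOTAL_STEPS = 3800  # hypothetical; the card does not state the exact step count


def effective_batch_size(num_devices: int = 1) -> int:
    """Per-device batch size * accumulation steps * number of devices."""
    return train_batch_size * gradient_accumulation_steps * num_devices


def lr_at(step: int, total_steps: int = TOTAL_STEPS) -> float:
    """Linear warmup to the peak rate, then cosine decay to zero (assumed form)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


print("effective batch size:", effective_batch_size())
for step in (0, 190, 380, 2000, 3800):
    print(step, f"{lr_at(step):.2e}")
```

Under this form the learning rate approaches zero near the end of the epoch, which fits the nearly flat evaluation metrics over the last few hundred steps.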

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6925	0.03	100	0.6928	0.0001	-0.0008	0.4950	0.0008	-243.8912	-277.5416	1.0654	0.9728
0.6903	0.05	200	0.6900	0.0049	-0.0015	0.5830	0.0064	-243.9661	-277.0526	1.0659	0.9732
0.682	0.08	300	0.6801	0.0215	-0.0064	0.6055	0.0280	-244.4588	-275.3941	1.0974	1.0023
0.6574	0.1	400	0.6623	-0.0453	-0.1180	0.6055	0.0727	-255.6189	-282.0750	1.0541	0.9585
0.6262	0.13	500	0.6407	-0.3256	-0.4857	0.6045	0.1601	-292.3858	-310.1027	0.7972	0.7187
0.6441	0.16	600	0.6310	-0.4984	-0.7357	0.6040	0.2373	-317.3828	-327.3852	0.5041	0.4434
0.6238	0.18	700	0.6180	-0.5136	-0.7730	0.6175	0.2594	-321.1137	-328.9063	0.4768	0.4140
0.6022	0.21	800	0.6146	-0.5608	-0.8568	0.6095	0.2960	-329.4937	-333.6271	0.3469	0.2920
0.5893	0.24	900	0.6059	-0.6665	-1.0014	0.6170	0.3349	-343.9540	-344.1970	0.3136	0.2576
0.6435	0.26	1000	0.6007	-0.5361	-0.8713	0.6295	0.3352	-330.9463	-331.1562	0.3378	0.2766
0.5626	0.29	1100	0.5971	-0.6841	-1.0299	0.6195	0.3458	-346.8068	-345.9583	0.3416	0.2879
0.5319	0.31	1200	0.5971	-0.8852	-1.2896	0.6280	0.4044	-372.7756	-366.0687	0.1914	0.1477
0.5818	0.34	1300	0.5949	-0.7178	-1.1027	0.6315	0.3849	-354.0860	-349.3257	0.2165	0.1688
0.5981	0.37	1400	0.5936	-0.6617	-1.0257	0.6290	0.3641	-346.3885	-343.7120	0.1974	0.1465
0.5843	0.39	1500	0.5905	-0.8861	-1.3031	0.6335	0.4171	-374.1299	-366.1545	0.1004	0.0587
0.6283	0.42	1600	0.5882	-0.7845	-1.1706	0.6305	0.3860	-360.8746	-356.0013	0.2242	0.1738
0.5892	0.44	1700	0.5891	-0.6741	-1.0616	0.6310	0.3875	-349.9719	-344.9546	0.1718	0.1259
0.5821	0.47	1800	0.5856	-0.8949	-1.3353	0.6315	0.4404	-377.3439	-367.0341	0.1199	0.0761
0.6072	0.5	1900	0.5861	-0.7180	-1.1339	0.6270	0.4159	-357.2063	-349.3515	0.1237	0.0773
0.6338	0.52	2000	0.5852	-0.7155	-1.1277	0.6340	0.4122	-356.5852	-349.0984	0.0087	-0.0301
0.5582	0.55	2100	0.5860	-0.7383	-1.1682	0.6340	0.4300	-360.6402	-351.3726	-0.0229	-0.0595
0.6103	0.58	2200	0.5821	-0.9235	-1.3855	0.6345	0.4620	-382.3635	-369.8921	-0.0714	-0.1065
0.5636	0.6	2300	0.5836	-0.7656	-1.2038	0.6335	0.4382	-364.1970	-354.1104	-0.0481	-0.0841
0.5846	0.63	2400	0.5804	-0.8773	-1.3343	0.6335	0.4570	-377.2508	-365.2781	-0.0871	-0.1200
0.5799	0.65	2500	0.5834	-0.8420	-1.3045	0.6340	0.4625	-374.2641	-361.7435	-0.0576	-0.0922
0.5565	0.68	2600	0.5810	-0.8009	-1.2549	0.6345	0.4540	-369.3044	-357.6355	-0.0285	-0.0643
0.5614	0.71	2700	0.5782	-0.9522	-1.4183	0.6325	0.4661	-385.6433	-372.7677	-0.0358	-0.0698
0.608	0.73	2800	0.5776	-0.9378	-1.3994	0.6360	0.4616	-383.7585	-371.3293	-0.0229	-0.0571
0.588	0.76	2900	0.5795	-0.8330	-1.2891	0.6345	0.4560	-372.7224	-360.8503	-0.0442	-0.0792
0.5324	0.79	3000	0.5807	-0.7714	-1.2134	0.6340	0.4420	-365.1566	-354.6904	-0.0298	-0.0648
0.6036	0.81	3100	0.5817	-0.7454	-1.1839	0.6360	0.4385	-362.2076	-352.0881	-0.0359	-0.0710
0.615	0.84	3200	0.5806	-0.7630	-1.2065	0.6330	0.4435	-364.4670	-353.8469	-0.0295	-0.0645
0.6211	0.86	3300	0.5794	-0.7767	-1.2207	0.6335	0.4439	-365.8820	-355.2186	-0.0240	-0.0585
0.535	0.89	3400	0.5777	-0.8399	-1.2929	0.6320	0.4530	-373.1028	-361.5366	-0.0225	-0.0558
0.5322	0.92	3500	0.5779	-0.8260	-1.2781	0.6335	0.4522	-371.6272	-360.1418	-0.0210	-0.0546
0.5527	0.94	3600	0.5780	-0.8254	-1.2779	0.6315	0.4525	-371.6083	-360.0847	-0.0229	-0.0565
0.5769	0.97	3700	0.5780	-0.8286	-1.2816	0.6315	0.4530	-371.9745	-360.4062	-0.0225	-0.0562
0.635	0.99	3800	0.5780	-0.8268	-1.2798	0.6300	0.4530	-371.7967	-360.2288	-0.0237	-0.0573
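
To pick a checkpoint from a log like this one, it can help to parse the table and look for the lowest validation loss. The sketch below uses only a handful of rows and columns copied from the table above; the variable names are illustrative. In the full table the lowest validation loss (0.5776) is at step 2800.

```python
import csv
import io

# A few rows copied from the table above (tab-separated), with columns trimmed.
ROWS = (
    "Step\tValidation Loss\tRewards/accuracies\tRewards/margins\n"
    "100\t0.6928\t0.4950\t0.0008\n"
    "1000\t0.6007\t0.6295\t0.3352\n"
    "2000\t0.5852\t0.6340\t0.4122\n"
    "2800\t0.5776\t0.6360\t0.4616\n"
    "3400\t0.5777\t0.6320\t0.4530\n"
    "3800\t0.5780\t0.6300\t0.4530\n"
)

records = list(csv.DictReader(io.StringIO(ROWS), delimiter="\t"))

# Row with the lowest validation loss among the rows included here.
best = min(records, key=lambda r: float(r["Validation Loss"]))
print("best step:", best["Step"], "validation loss:", best["Validation Loss"])
```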

Framework versions

PEFT 0.7.1
Transformers 4.36.2
Pytorch 2.1.2
Datasets 2.14.6
Tokenizers 0.15.2
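
When reproducing results, it can be useful to compare the local environment with the versions listed above. The helper below is a hypothetical convenience function, not part of any of these libraries. It assumes the PyPI distribution names `peft`, `transformers`, `torch`, `datasets`, and `tokenizers`.

```python
from importlib import metadata

# Versions listed above; "torch" is the distribution name for Pytorch.
EXPECTED = {
    "peft": "0.7.1",
    "transformers": "4.36.2",
    "torch": "2.1.2",
    "datasets": "2.14.6",
    "tokenizers": "0.15.2",
}


def version_mismatches(expected=EXPECTED, lookup=metadata.version):
    """Return {package: (expected, installed_or_None)} for packages that differ."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Local builds may append a suffix such as "+cu121"; compare the public part.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


for name, (wanted, installed) in version_mismatches().items():
    print(f"{name}: card lists {wanted}, installed {installed}")
```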

Model repository: DUAL-GPO/phi-2-dpo-renew1
