
phi-2-ipo-renew1

This model is a fine-tuned version of lole25/phi-2-sft-ultrachat-lora on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

  • Loss: 2028.0933
  • Rewards/chosen: -0.1243
  • Rewards/rejected: -0.2158
  • Rewards/accuracies: 0.6900
  • Rewards/margins: 0.0915
  • Logps/rejected: -255.1287
  • Logps/chosen: -269.0499
  • Logits/rejected: 0.5909
  • Logits/chosen: 0.5352
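
This repository holds a LoRA adapter (see the PEFT version under Framework versions). The following is a minimal, illustrative loading sketch only: the base checkpoint microsoft/phi-2 and the merge of the lole25/phi-2-sft-ultrachat-lora SFT adapter are assumptions inferred from the model name and lineage, not details confirmed by this card.

```python
# Hedged inference sketch (assumption: the adapter applies on top of microsoft/phi-2,
# with the SFT adapter merged in first; adjust if your setup differs).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", torch_dtype=torch.bfloat16, device_map="auto"
)
# Merge the SFT adapter the card lists as the starting point (assumed step).
base = PeftModel.from_pretrained(base, "lole25/phi-2-sft-ultrachat-lora").merge_and_unload()
# Apply this model's IPO adapter.
model = PeftModel.from_pretrained(base, "DUAL-GPO-2/phi-2-ipo-renew1")

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
prompt = "Explain LoRA fine-tuning in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```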

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (an illustrative configuration sketch follows the list):

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 2
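
The training script itself is not part of this card. The sketch below is a rough reconstruction that assumes TRL's DPOTrainer with loss_type="ipo" (inferred from the model name) and mirrors the hyperparameters listed above; the beta value, the LoRA config, and the dataset preprocessing are placeholders, not values taken from the card.

```python
# Hypothetical reconstruction of the training setup; beta, LoRA settings, and
# preprocessing below are illustrative assumptions, not values from this card.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("lole25/phi-2-sft-ultrachat-lora")
tokenizer = AutoTokenizer.from_pretrained("lole25/phi-2-sft-ultrachat-lora")

# The preference pairs must be flattened to string "prompt"/"chosen"/"rejected"
# columns before training; that preprocessing is omitted here.
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized")

args = TrainingArguments(
    output_dir="phi-2-ipo-renew1",
    learning_rate=5e-6,
    per_device_train_batch_size=4,   # x4 gradient accumulation = total batch size 16
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=2,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
)

trainer = DPOTrainer(
    model,
    ref_model=None,                   # with a PEFT adapter, TRL derives the reference implicitly
    args=args,
    beta=0.01,                        # placeholder: the card does not state beta
    loss_type="ipo",                  # IPO objective, inferred from the model name
    train_dataset=dataset["train_prefs"],
    eval_dataset=dataset["test_prefs"],
    tokenizer=tokenizer,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),  # placeholder
)
trainer.train()
```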

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:--------------|:------|:-----|:----------------|:---------------|:-----------------|:-------------------|:----------------|:---------------|:-------------|:----------------|:--------------|
| 2496.843 | 0.05 | 100 | 2502.2668 | -0.0003 | -0.0002 | 0.5005 | -0.0002 | -233.5649 | -256.6506 | 0.8888 | 0.8318 |
| 2499.2807 | 0.1 | 200 | 2494.8354 | 0.0001 | -0.0005 | 0.5190 | 0.0006 | -233.5995 | -256.6106 | 0.8882 | 0.8310 |
| 2477.7609 | 0.16 | 300 | 2481.5015 | -0.0011 | -0.0031 | 0.5595 | 0.0019 | -233.8548 | -256.7285 | 0.8892 | 0.8319 |
| 2428.4195 | 0.21 | 400 | 2419.1045 | -0.0068 | -0.0156 | 0.6495 | 0.0089 | -235.1127 | -257.2951 | 0.8983 | 0.8404 |
| 2296.8842 | 0.26 | 500 | 2349.4358 | -0.0240 | -0.0419 | 0.6565 | 0.0179 | -237.7379 | -259.0124 | 0.8806 | 0.8214 |
| 2254.5846 | 0.31 | 600 | 2273.4993 | -0.0525 | -0.0829 | 0.6570 | 0.0304 | -241.8383 | -261.8659 | 0.8478 | 0.7868 |
| 2330.7787 | 0.37 | 700 | 2224.3350 | -0.0819 | -0.1221 | 0.6630 | 0.0402 | -245.7631 | -264.8093 | 0.8128 | 0.7517 |
| 2223.6863 | 0.42 | 800 | 2196.0991 | -0.1009 | -0.1487 | 0.6675 | 0.0478 | -248.4222 | -266.7057 | 0.7611 | 0.6992 |
| 2066.7418 | 0.47 | 900 | 2166.0732 | -0.1112 | -0.1658 | 0.6700 | 0.0546 | -250.1319 | -267.7397 | 0.7518 | 0.6917 |
| 2119.2691 | 0.52 | 1000 | 2138.9312 | -0.1215 | -0.1821 | 0.6715 | 0.0606 | -251.7610 | -268.7693 | 0.7213 | 0.6619 |
| 2191.7109 | 0.58 | 1100 | 2121.8115 | -0.1257 | -0.1906 | 0.6695 | 0.0648 | -252.6059 | -269.1910 | 0.7176 | 0.6584 |
| 2308.1883 | 0.63 | 1200 | 2110.3069 | -0.1409 | -0.2123 | 0.6665 | 0.0715 | -254.7812 | -270.7044 | 0.6920 | 0.6330 |
| 1996.7178 | 0.68 | 1300 | 2095.3130 | -0.1314 | -0.2042 | 0.6755 | 0.0728 | -253.9726 | -269.7621 | 0.6722 | 0.6141 |
| 2038.3844 | 0.73 | 1400 | 2085.0852 | -0.1383 | -0.2140 | 0.6800 | 0.0756 | -254.9441 | -270.4488 | 0.6513 | 0.5933 |
| 2094.2182 | 0.79 | 1500 | 2076.3042 | -0.1390 | -0.2166 | 0.6790 | 0.0777 | -255.2133 | -270.5129 | 0.6474 | 0.5898 |
| 2171.3457 | 0.84 | 1600 | 2069.3757 | -0.1374 | -0.2166 | 0.6810 | 0.0792 | -255.2130 | -270.3595 | 0.6392 | 0.5818 |
| 2189.3863 | 0.89 | 1700 | 2062.1995 | -0.1386 | -0.2192 | 0.6780 | 0.0806 | -255.4675 | -270.4739 | 0.6291 | 0.5723 |
| 2292.8938 | 0.94 | 1800 | 2053.1299 | -0.1196 | -0.2005 | 0.6830 | 0.0809 | -253.6025 | -268.5789 | 0.6275 | 0.5703 |
| 2085.5805 | 0.99 | 1900 | 2052.3237 | -0.1086 | -0.1906 | 0.6900 | 0.0821 | -252.6131 | -267.4730 | 0.6319 | 0.5747 |
| 1847.759 | 1.05 | 2000 | 2050.4177 | -0.1118 | -0.1953 | 0.6850 | 0.0836 | -253.0827 | -267.7950 | 0.6333 | 0.5763 |
| 2024.9559 | 1.1 | 2100 | 2046.7593 | -0.1219 | -0.2083 | 0.6900 | 0.0864 | -254.3799 | -268.8073 | 0.6157 | 0.5590 |
| 2038.6354 | 1.15 | 2200 | 2043.5728 | -0.1205 | -0.2072 | 0.6880 | 0.0867 | -254.2731 | -268.6722 | 0.6083 | 0.5518 |
| 2022.9617 | 1.2 | 2300 | 2035.5857 | -0.1173 | -0.2041 | 0.6895 | 0.0868 | -253.9597 | -268.3491 | 0.6101 | 0.5535 |
| 1871.641 | 1.26 | 2400 | 2036.3373 | -0.1190 | -0.2073 | 0.6895 | 0.0884 | -254.2831 | -268.5161 | 0.6046 | 0.5482 |
| 1907.3463 | 1.31 | 2500 | 2034.7010 | -0.1216 | -0.2108 | 0.6880 | 0.0892 | -254.6297 | -268.7765 | 0.6022 | 0.5460 |
| 1884.6086 | 1.36 | 2600 | 2033.7977 | -0.1215 | -0.2105 | 0.6910 | 0.0890 | -254.6014 | -268.7708 | 0.6013 | 0.5451 |
| 2034.9129 | 1.41 | 2700 | 2032.5447 | -0.1235 | -0.2140 | 0.6900 | 0.0905 | -254.9471 | -268.9633 | 0.5987 | 0.5426 |
| 2068.2822 | 1.47 | 2800 | 2030.8698 | -0.1251 | -0.2162 | 0.6900 | 0.0911 | -255.1671 | -269.1270 | 0.5943 | 0.5383 |
| 1977.4029 | 1.52 | 2900 | 2030.6033 | -0.1251 | -0.2162 | 0.6895 | 0.0911 | -255.1690 | -269.1252 | 0.5941 | 0.5381 |
| 2110.2887 | 1.57 | 3000 | 2030.5707 | -0.1259 | -0.2173 | 0.6905 | 0.0915 | -255.2821 | -269.2050 | 0.5908 | 0.5348 |
| 2068.2863 | 1.62 | 3100 | 2029.4174 | -0.1242 | -0.2156 | 0.6935 | 0.0914 | -255.1087 | -269.0390 | 0.5913 | 0.5357 |
| 1977.8852 | 1.67 | 3200 | 2026.1289 | -0.1249 | -0.2165 | 0.6960 | 0.0916 | -255.2016 | -269.1071 | 0.5920 | 0.5364 |
| 2123.3787 | 1.73 | 3300 | 2027.3552 | -0.1248 | -0.2162 | 0.6930 | 0.0914 | -255.1666 | -269.0933 | 0.5926 | 0.5370 |
| 1945.4934 | 1.78 | 3400 | 2025.7804 | -0.1248 | -0.2164 | 0.6935 | 0.0916 | -255.1899 | -269.1010 | 0.5909 | 0.5353 |
| 1937.2627 | 1.83 | 3500 | 2027.8240 | -0.1247 | -0.2163 | 0.6930 | 0.0916 | -255.1750 | -269.0878 | 0.5903 | 0.5347 |
| 2007.2062 | 1.88 | 3600 | 2025.3228 | -0.1244 | -0.2164 | 0.6895 | 0.0919 | -255.1843 | -269.0623 | 0.5910 | 0.5352 |
| 2076.715 | 1.94 | 3700 | 2027.4857 | -0.1243 | -0.2159 | 0.6920 | 0.0916 | -255.1383 | -269.0487 | 0.5913 | 0.5358 |
| 2055.2201 | 1.99 | 3800 | 2027.8082 | -0.1244 | -0.2160 | 0.6920 | 0.0916 | -255.1455 | -269.0543 | 0.5902 | 0.5347 |

Framework versions

  • PEFT 0.7.1
  • Transformers 4.36.2
  • Pytorch 2.1.2
  • Datasets 2.14.6
  • Tokenizers 0.15.2
