
phi-2-gpo-renew2-b0.01-log-i0

This model is a fine-tuned version of lole25/phi-2-sft-lora-ultrachat on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

  • Loss: 0.6909
  • Rewards/chosen: -0.0288
  • Rewards/rejected: -0.0865
  • Rewards/accuracies: 0.6270
  • Rewards/margins: 0.0577
  • Logps/rejected: -252.4614
  • Logps/chosen: -280.4224
  • Logits/rejected: 1.0251
  • Logits/chosen: 0.9229
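
Since this is a PEFT (LoRA) adapter, it can be loaded directly with the peft library. The sketch below is a minimal usage example, assuming the Hub id DUAL-GPO/phi-2-gpo-renew2-b0.01-log-i0 for this adapter and that the repo's adapter_config.json points at its base model; adjust dtype and device placement for your hardware.

```python
# Minimal inference sketch for this adapter (assumed repo id; not an official snippet from the authors).
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

repo_id = "DUAL-GPO/phi-2-gpo-renew2-b0.01-log-i0"  # assumed Hub id for this adapter

# Loads the base model named in adapter_config.json and applies the LoRA weights on top.
model = AutoPeftModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16)

# If the adapter repo does not ship tokenizer files, load the tokenizer from the base model instead.
tokenizer = AutoTokenizer.from_pretrained(repo_id)

prompt = "Explain the difference between a list and a tuple in Python."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```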

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
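
As a reading aid, the values above map onto transformers.TrainingArguments roughly as sketched below. This is an assumption-laden reconstruction (the actual training script and trainer are not part of this card), with output_dir chosen purely for illustration.

```python
# Sketch only: the listed hyperparameters expressed as transformers.TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="phi-2-gpo-renew2-b0.01-log-i0",  # illustrative output path, not from the card
    learning_rate=5e-6,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,   # 4 per device x 4 accumulation steps = 16 total train batch size
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    # Adam with betas=(0.9, 0.999) and epsilon=1e-08 matches the defaults for
    # adam_beta1, adam_beta2, and adam_epsilon, so nothing extra is needed here.
)
```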

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6931 | 0.03 | 100 | 0.6931 | -0.0003 | -0.0006 | 0.4515 | 0.0003 | -243.8745 | -277.5758 | 1.0631 | 0.9710 |
| 0.693 | 0.05 | 200 | 0.6929 | 0.0028 | -0.0017 | 0.5885 | 0.0046 | -243.9904 | -277.2661 | 1.0632 | 0.9705 |
| 0.6926 | 0.08 | 300 | 0.6925 | 0.0100 | -0.0055 | 0.6260 | 0.0155 | -244.3642 | -276.5485 | 1.0488 | 0.9545 |
| 0.6916 | 0.1 | 400 | 0.6920 | 0.0057 | -0.0240 | 0.6340 | 0.0297 | -246.2157 | -276.9778 | 0.9930 | 0.8978 |
| 0.6913 | 0.13 | 500 | 0.6917 | -0.0320 | -0.0687 | 0.6310 | 0.0366 | -250.6851 | -280.7516 | 0.9188 | 0.8239 |
| 0.6916 | 0.16 | 600 | 0.6915 | -0.0605 | -0.1045 | 0.6215 | 0.0440 | -254.2614 | -283.5969 | 0.9507 | 0.8586 |
| 0.6911 | 0.18 | 700 | 0.6914 | -0.0360 | -0.0798 | 0.6260 | 0.0438 | -251.7944 | -281.1486 | 0.9765 | 0.8818 |
| 0.6915 | 0.21 | 800 | 0.6913 | -0.0433 | -0.0906 | 0.6240 | 0.0473 | -252.8779 | -281.8777 | 0.9965 | 0.9022 |
| 0.691 | 0.24 | 900 | 0.6912 | -0.0529 | -0.1055 | 0.6245 | 0.0526 | -254.3653 | -282.8321 | 1.0206 | 0.9266 |
| 0.6913 | 0.26 | 1000 | 0.6912 | -0.0397 | -0.0905 | 0.6290 | 0.0507 | -252.8640 | -281.5216 | 1.0170 | 0.9216 |
| 0.6912 | 0.29 | 1100 | 0.6912 | -0.0550 | -0.1016 | 0.6250 | 0.0466 | -253.9782 | -283.0510 | 1.0190 | 0.9244 |
| 0.6902 | 0.31 | 1200 | 0.6912 | -0.0570 | -0.1101 | 0.6230 | 0.0531 | -254.8289 | -283.2487 | 1.0101 | 0.9164 |
| 0.6912 | 0.34 | 1300 | 0.6911 | -0.0234 | -0.0732 | 0.6130 | 0.0498 | -251.1342 | -279.8864 | 1.0357 | 0.9401 |
| 0.6914 | 0.37 | 1400 | 0.6911 | -0.0157 | -0.0634 | 0.6295 | 0.0477 | -250.1540 | -279.1180 | 1.0311 | 0.9342 |
| 0.6919 | 0.39 | 1500 | 0.6910 | -0.0502 | -0.1023 | 0.6320 | 0.0521 | -254.0441 | -282.5649 | 1.0137 | 0.9161 |
| 0.6912 | 0.42 | 1600 | 0.6910 | -0.0349 | -0.0862 | 0.6320 | 0.0513 | -252.4398 | -281.0401 | 1.0315 | 0.9320 |
| 0.6905 | 0.44 | 1700 | 0.6910 | -0.0530 | -0.1089 | 0.6325 | 0.0559 | -254.7030 | -282.8433 | 1.0088 | 0.9100 |
| 0.6901 | 0.47 | 1800 | 0.6910 | -0.0409 | -0.0984 | 0.6225 | 0.0575 | -253.6523 | -281.6338 | 1.0314 | 0.9324 |
| 0.6902 | 0.5 | 1900 | 0.6910 | -0.0326 | -0.0895 | 0.6215 | 0.0569 | -252.7657 | -280.8078 | 1.0212 | 0.9226 |
| 0.6919 | 0.52 | 2000 | 0.6910 | -0.0239 | -0.0768 | 0.6275 | 0.0529 | -251.4911 | -279.9320 | 1.0252 | 0.9259 |
| 0.6919 | 0.55 | 2100 | 0.6909 | -0.0381 | -0.0926 | 0.6345 | 0.0545 | -253.0794 | -281.3606 | 1.0476 | 0.9477 |
| 0.6917 | 0.58 | 2200 | 0.6909 | -0.0421 | -0.0985 | 0.6325 | 0.0564 | -253.6693 | -281.7611 | 1.0407 | 0.9399 |
| 0.6909 | 0.6 | 2300 | 0.6909 | -0.0318 | -0.0861 | 0.6335 | 0.0543 | -252.4272 | -280.7285 | 1.0408 | 0.9399 |
| 0.6903 | 0.63 | 2400 | 0.6909 | -0.0296 | -0.0850 | 0.6360 | 0.0553 | -252.3121 | -280.5100 | 1.0219 | 0.9198 |
| 0.6908 | 0.65 | 2500 | 0.6909 | -0.0373 | -0.0959 | 0.6330 | 0.0586 | -253.4011 | -281.2754 | 1.0213 | 0.9196 |
| 0.6907 | 0.68 | 2600 | 0.6909 | -0.0424 | -0.1023 | 0.6295 | 0.0599 | -254.0473 | -281.7884 | 1.0173 | 0.9161 |
| 0.6905 | 0.71 | 2700 | 0.6909 | -0.0353 | -0.0938 | 0.6310 | 0.0585 | -253.1964 | -281.0736 | 1.0139 | 0.9119 |
| 0.692 | 0.73 | 2800 | 0.6909 | -0.0327 | -0.0894 | 0.6305 | 0.0567 | -252.7526 | -280.8156 | 1.0163 | 0.9141 |
| 0.6906 | 0.76 | 2900 | 0.6909 | -0.0334 | -0.0904 | 0.6295 | 0.0570 | -252.8527 | -280.8846 | 1.0123 | 0.9098 |
| 0.6904 | 0.79 | 3000 | 0.6909 | -0.0312 | -0.0890 | 0.6295 | 0.0579 | -252.7167 | -280.6625 | 1.0147 | 0.9123 |
| 0.6905 | 0.81 | 3100 | 0.6909 | -0.0301 | -0.0877 | 0.6330 | 0.0576 | -252.5846 | -280.5529 | 1.0175 | 0.9147 |
| 0.6919 | 0.84 | 3200 | 0.6909 | -0.0301 | -0.0878 | 0.6305 | 0.0577 | -252.6000 | -280.5576 | 1.0176 | 0.9154 |
| 0.69 | 0.86 | 3300 | 0.6909 | -0.0266 | -0.0839 | 0.6285 | 0.0573 | -252.2050 | -280.2096 | 1.0212 | 0.9186 |
| 0.689 | 0.89 | 3400 | 0.6909 | -0.0289 | -0.0867 | 0.6280 | 0.0578 | -252.4849 | -280.4384 | 1.0223 | 0.9202 |
| 0.6901 | 0.92 | 3500 | 0.6909 | -0.0290 | -0.0869 | 0.6260 | 0.0579 | -252.5046 | -280.4475 | 1.0239 | 0.9216 |
| 0.6914 | 0.94 | 3600 | 0.6909 | -0.0288 | -0.0865 | 0.6290 | 0.0577 | -252.4631 | -280.4258 | 1.0244 | 0.9221 |
| 0.6914 | 0.97 | 3700 | 0.6909 | -0.0289 | -0.0864 | 0.6320 | 0.0576 | -252.4591 | -280.4350 | 1.0240 | 0.9216 |
| 0.6917 | 0.99 | 3800 | 0.6909 | -0.0287 | -0.0866 | 0.6320 | 0.0579 | -252.4790 | -280.4204 | 1.0246 | 0.9221 |
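
The reward columns follow the naming convention used by DPO-style preference trainers. Assuming that convention (with β = 0.01 read from the model name), the reported metrics correspond to the implicit reward and log-sigmoid loss sketched below; this is an interpretation, not a statement from the authors.

```latex
% Assumed DPO-style metric definitions (beta = 0.01 read from the model name).
r_\theta(x, y) = \beta \left[ \log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x) \right]
\text{rewards/chosen}     = \mathbb{E}\left[ r_\theta(x, y_w) \right], \qquad
\text{rewards/rejected}   = \mathbb{E}\left[ r_\theta(x, y_l) \right]
\text{rewards/margins}    = \mathbb{E}\left[ r_\theta(x, y_w) - r_\theta(x, y_l) \right]
\text{rewards/accuracies} = \Pr\left[ r_\theta(x, y_w) > r_\theta(x, y_l) \right]
\mathcal{L} = -\,\mathbb{E}\left[ \log \sigma\!\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right]
```

Under this reading, the margin is zero at the start of training and the loss starts at -log σ(0) = log 2 ≈ 0.6931, which matches the first validation-loss entry in the table.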

Framework versions

  • PEFT 0.7.1
  • Transformers 4.36.2
  • Pytorch 2.1.2
  • Datasets 2.14.6
  • Tokenizers 0.15.2