
phi-2-ipo-test-iter-0

This model is a fine-tuned version of lole25/phi-2-sft-ultrachat-lora on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set (how these metrics are conventionally computed is sketched after the list):

  • Loss: 2546.4375
  • Rewards/chosen: -0.1591
  • Rewards/rejected: -0.1612
  • Rewards/accuracies: 0.5220
  • Rewards/margins: 0.0021
  • Logps/rejected: -249.6534
  • Logps/chosen: -272.5227
  • Logits/rejected: 0.4171
  • Logits/chosen: 0.3526
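
The card does not define these metrics. As a hedged reading, assuming they follow TRL's DPOTrainer conventions for the IPO loss (the regularisation strength β is not stated on this card), they would correspond to:

```latex
% Hedged sketch, assuming TRL DPOTrainer conventions; \beta is not stated on this card.
% The logged rewards are scaled log-probability ratios of the policy against the reference model:
\[
r_{\text{chosen}}   = \beta \bigl(\log \pi_\theta(y_w \mid x) - \log \pi_{\text{ref}}(y_w \mid x)\bigr), \qquad
r_{\text{rejected}} = \beta \bigl(\log \pi_\theta(y_l \mid x) - \log \pi_{\text{ref}}(y_l \mid x)\bigr)
\]
% The margin is their difference, accuracy is the fraction of pairs with a positive margin,
% and the IPO objective regresses the unscaled margin towards 1/(2\beta):
\[
\text{margin} = r_{\text{chosen}} - r_{\text{rejected}}, \qquad
\mathcal{L}_{\text{IPO}} = \Bigl(\tfrac{r_{\text{chosen}} - r_{\text{rejected}}}{\beta} - \tfrac{1}{2\beta}\Bigr)^{2}
\]
```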

Model description

More information needed

Intended uses & limitations

More information needed
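
Pending fuller documentation, the sketch below shows one plausible way to load the adapter for inference with PEFT. The repository ids are taken from this card; the dtype, device placement, and generation settings are illustrative assumptions.

```python
# Hedged sketch: load the SFT base model and attach this IPO LoRA adapter with PEFT.
# Repo ids come from this card; everything else is an assumption, not documented here.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "lole25/phi-2-sft-ultrachat-lora"
adapter_id = "DUAL-GPO-2/phi-2-ipo-test-iter-0"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,   # assumption; precision is not stated on the card
    device_map="auto",
    trust_remote_code=True,       # may be needed for Phi-2 on Transformers 4.36.x
)
# If the base repo is adapter-only, load microsoft/phi-2 first and apply it as an adapter instead.
model = PeftModel.from_pretrained(model, adapter_id)  # attach the fine-tuned LoRA weights
model.eval()

prompt = "Explain LoRA fine-tuning in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```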

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a hedged configuration sketch follows the list):

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 4
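
The exact training script is not included on this card. As a rough sketch of how these hyperparameters might map onto TRL's DPOTrainer with the IPO loss: β, sequence lengths, precision, the LoRA config, and the dataset preprocessing are illustrative assumptions; the remaining values come from the list above.

```python
# Hedged sketch only; the actual training code for this model is not published on the card.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_id = "lole25/phi-2-sft-ultrachat-lora"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Note: the raw split stores chat-format messages; DPOTrainer expects plain
# prompt/chosen/rejected text columns, so a preprocessing step (not shown) is assumed.
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

training_args = TrainingArguments(
    output_dir="phi-2-ipo-test-iter-0",
    learning_rate=5e-6,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,   # 4 x 4 = total train batch size 16
    num_train_epochs=4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    bf16=True,                       # assumption; precision is not listed on the card
)

trainer = DPOTrainer(
    model,
    ref_model=None,                  # with a PEFT adapter, the frozen base model serves as the reference
    args=training_args,
    beta=0.01,                       # assumption; beta is not listed on the card
    loss_type="ipo",
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),  # illustrative
    max_length=1024,                 # illustrative
    max_prompt_length=512,           # illustrative
)
trainer.train()
```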

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 2477.3281 | 0.32 | 100  | 2500.7156 | -0.0018 | -0.0018 | 0.4930 | -0.0000 | -233.7207 | -256.7978 | 0.8796 | 0.8221 |
| 2224.3488 | 0.64 | 200  | 2499.8904 | -0.0195 | -0.0198 | 0.5015 | 0.0003  | -235.5204 | -258.5673 | 0.8051 | 0.7462 |
| 1898.0719 | 0.96 | 300  | 2505.6912 | -0.0563 | -0.0571 | 0.5140 | 0.0008  | -239.2530 | -262.2491 | 0.6844 | 0.6233 |
| 1879.8852 | 1.28 | 400  | 2516.0835 | -0.0944 | -0.0957 | 0.5200 | 0.0013  | -243.1053 | -266.0533 | 0.5839 | 0.5215 |
| 1917.2811 | 1.6  | 500  | 2527.1995 | -0.1156 | -0.1170 | 0.5115 | 0.0014  | -245.2343 | -268.1747 | 0.5244 | 0.4611 |
| 1799.3824 | 1.92 | 600  | 2534.4292 | -0.1363 | -0.1381 | 0.5210 | 0.0018  | -247.3504 | -270.2482 | 0.4714 | 0.4075 |
| 1751.5762 | 2.24 | 700  | 2531.3550 | -0.1448 | -0.1474 | 0.5180 | 0.0026  | -248.2780 | -271.0988 | 0.4545 | 0.3906 |
| 1711.1711 | 2.56 | 800  | 2536.2451 | -0.1487 | -0.1511 | 0.5145 | 0.0024  | -248.6440 | -271.4834 | 0.4402 | 0.3759 |
| 1894.4447 | 2.88 | 900  | 2542.6299 | -0.1549 | -0.1570 | 0.5235 | 0.0022  | -249.2417 | -272.1000 | 0.4262 | 0.3618 |
| 1798.5389 | 3.2  | 1000 | 2542.7288 | -0.1581 | -0.1604 | 0.5205 | 0.0023  | -249.5780 | -272.4200 | 0.4202 | 0.3559 |
| 1834.9711 | 3.52 | 1100 | 2542.2373 | -0.1586 | -0.1610 | 0.5205 | 0.0024  | -249.6345 | -272.4703 | 0.4177 | 0.3532 |
| 1765.5148 | 3.84 | 1200 | 2546.1714 | -0.1589 | -0.1610 | 0.5220 | 0.0021  | -249.6357 | -272.5010 | 0.4160 | 0.3515 |

Framework versions

  • PEFT 0.7.1
  • Transformers 4.36.2
  • Pytorch 2.2.1+cu121
  • Datasets 2.14.6
  • Tokenizers 0.15.2