phi2-lora-quantized-distilabel-intel-orca-dpo-pairs

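This model is a fine-tuned version of microsoft/phi-2. The training dataset is not identified in this card, although the model name suggests a set of DPO preference pairs. At the last logged evaluation (step 140, epoch 0.95) it achieves a validation loss of 0.5173, a reward accuracy of 0.7816, and a reward margin of 0.7706 on the evaluation set.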

Model description

More information needed

The hyperparameters used during training are not recorded in this card. Training and evaluation metrics were logged every 20 steps:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6887	0.14	20	0.6767	0.0030	-0.0331	0.6341	0.0361	-226.1282	-214.0752	0.2238	0.1343
0.6472	0.27	40	0.6171	0.0141	-0.1710	0.7639	0.1852	-227.5079	-213.9642	0.2464	0.1508
0.5759	0.41	60	0.5584	0.0123	-0.4023	0.7808	0.4146	-229.8206	-213.9829	0.2774	0.1736
0.5260	0.54	80	0.5326	0.0036	-0.5790	0.7816	0.5826	-231.5877	-214.0700	0.2983	0.1884
0.4963	0.68	100	0.5225	0.0020	-0.6964	0.7825	0.6984	-232.7611	-214.0853	0.3131	0.1986
0.4977	0.81	120	0.5188	-0.0025	-0.7533	0.7816	0.7508	-233.3300	-214.1302	0.3162	0.2002
0.4818	0.95	140	0.5173	-0.0019	-0.7725	0.7816	0.7706	-233.5226	-214.1249	0.3181	0.2015
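
The reward columns can be sanity-checked against each other. The column names match those logged by DPO trainers such as TRL's DPOTrainer, where Rewards/margins is the mean of chosen minus rejected rewards. The card does not name the trainer, so this relationship is an assumption. The sketch below copies the three reward columns from the table and reports how far each row departs from that relationship. The helper name `margin_gaps` is illustrative only.

```python
# Rows copied from the results table:
# (step, rewards/chosen, rewards/rejected, rewards/margins)
ROWS = [
    (20, 0.0030, -0.0331, 0.0361),
    (40, 0.0141, -0.1710, 0.1852),
    (60, 0.0123, -0.4023, 0.4146),
    (80, 0.0036, -0.5790, 0.5826),
    (100, 0.0020, -0.6964, 0.6984),
    (120, -0.0025, -0.7533, 0.7508),
    (140, -0.0019, -0.7725, 0.7706),
]


def margin_gaps(rows):
    """Return {step: |(chosen - rejected) - reported margin|} for each row.

    Assumes margin = chosen - rejected, which the card itself does not state.
    """
    return {
        step: abs((chosen - rejected) - margin)
        for step, chosen, rejected, margin in rows
    }


gaps = margin_gaps(ROWS)
for step, gap in gaps.items():
    print(f"step {step:3d}: gap = {gap:.4f}")
```

Any gap at the level of the last printed digit is consistent with rounding in the logged values. Across the run the margin grows mainly because Rewards/rejected falls, while Rewards/chosen stays close to zero.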