zephyr-smol_llama-100m-dpo-full

This model is a fine-tuned version of amazingvince/zephyr-smol_llama-100m-sft-full. The training dataset is not specified. Evaluation-set results are listed per evaluation step in the table at the end of this card.

Model description

More information needed

The training hyperparameters are not listed. Training and evaluation metrics were logged every 1000 steps:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6549	0.26	1000	0.6037	-0.1205	-0.4850	0.6550	0.3644	-447.3903	-589.4750	-4.7410	-5.0341
0.5349	0.52	2000	0.5779	-0.0126	-0.5080	0.6770	0.4955	-447.6208	-588.3951	-4.8645	-5.1463
0.6029	0.77	3000	0.5657	0.0902	-0.4636	0.6900	0.5538	-447.1767	-587.3674	-5.0016	-5.2911
0.5273	1.03	4000	0.5596	0.0496	-0.5449	0.7040	0.5944	-447.9891	-587.7738	-4.9972	-5.2892
0.5	1.29	5000	0.5557	0.0585	-0.6110	0.7050	0.6695	-448.6505	-587.6843	-5.0108	-5.3047
0.5056	1.55	6000	0.5499	0.0054	-0.6719	0.7130	0.6773	-449.2598	-588.2154	-4.9988	-5.2907
0.4608	1.81	7000	0.5500	-0.0376	-0.7494	0.7030	0.7118	-450.0341	-588.6455	-5.0549	-5.3406
0.426	2.07	8000	0.5472	-0.0106	-0.7021	0.7100	0.6916	-449.5617	-588.3751	-4.9750	-5.2626
0.3875	2.32	9000	0.5464	-0.0011	-0.7171	0.7140	0.7159	-449.7113	-588.2810	-4.9935	-5.2796
0.397	2.58	10000	0.5462	-0.0391	-0.7566	0.7190	0.7175	-450.1064	-588.6602	-4.9737	-5.2618
0.4486	2.84	11000	0.5459	-0.0493	-0.7667	0.7110	0.7174	-450.2074	-588.7629	-4.9569	-5.2441
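
The reward columns in the table can be sanity-checked against each other. The sketch below assumes the metrics follow the usual DPO logging convention (as in TRL's DPOTrainer), where Rewards/margins is the mean of chosen minus rejected rewards; the card itself does not confirm which trainer was used. The names ROWS, margin_gap, and best_step are illustrative and not part of any library. The values are copied from the table above.

```python
# Each tuple: (step, validation_loss, rewards_chosen, rewards_rejected, rewards_margins),
# copied from the training results table.
ROWS = [
    (1000, 0.6037, -0.1205, -0.4850, 0.3644),
    (2000, 0.5779, -0.0126, -0.5080, 0.4955),
    (3000, 0.5657, 0.0902, -0.4636, 0.5538),
    (4000, 0.5596, 0.0496, -0.5449, 0.5944),
    (5000, 0.5557, 0.0585, -0.6110, 0.6695),
    (6000, 0.5499, 0.0054, -0.6719, 0.6773),
    (7000, 0.5500, -0.0376, -0.7494, 0.7118),
    (8000, 0.5472, -0.0106, -0.7021, 0.6916),
    (9000, 0.5464, -0.0011, -0.7171, 0.7159),
    (10000, 0.5462, -0.0391, -0.7566, 0.7175),
    (11000, 0.5459, -0.0493, -0.7667, 0.7174),
]


def margin_gap(row):
    """Absolute difference between (chosen - rejected) and the logged margin."""
    _, _, chosen, rejected, margin = row
    return abs((chosen - rejected) - margin)


def best_step(rows):
    """Step with the lowest validation loss."""
    return min(rows, key=lambda r: r[1])[0]


# Assumption being checked: margin is approximately chosen - rejected,
# up to rounding of the four-decimal values in the table.
max_gap = max(margin_gap(r) for r in ROWS)

if __name__ == "__main__":
    print(f"largest |chosen - rejected - margin|: {max_gap:.4f}")
    print(f"step with lowest validation loss: {best_step(ROWS)}")
```

Under that assumption, every row agrees to within rounding of the four-decimal values, and the lowest validation loss (0.5459) falls at the last logged step, 11000.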