zephyr-7b-dpo-lora

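This model is a fine-tuned version of HuggingFaceH4/mistral-7b-sft-beta; the training dataset is not specified. At the last logged evaluation step (step 900, epoch 0.9422), it achieves the following results on the evaluation set:

Loss: 0.5084
Rewards/chosen: -0.0008
Rewards/rejected: -0.9080
Rewards/accuracies: 0.7222
Rewards/margins: 0.9072
Logps/rejected: -276.7181
Logps/chosen: -271.8789
Logits/rejected: -2.7174
Logits/chosen: -2.7372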

Model description

More information needed

The training hyperparameters are not recorded here. The following metrics were logged at each evaluation step during training:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.5795	0.1047	100	0.5875	0.0265	-0.3721	0.6825	0.3986	-271.3593	-271.6063	-2.7688	-2.7900
0.5449	0.2094	200	0.5520	0.0601	-0.5726	0.7103	0.6327	-273.3645	-271.2704	-2.7792	-2.7981
0.545	0.3141	300	0.5320	-0.0197	-0.7637	0.7044	0.7439	-275.2751	-272.0686	-2.7616	-2.7803
0.4747	0.4187	400	0.5228	-0.1728	-0.9527	0.7004	0.7798	-277.1651	-273.5996	-2.7532	-2.7732
0.5367	0.5234	500	0.5175	-0.2142	-1.0435	0.7143	0.8293	-278.0737	-274.0135	-2.7339	-2.7540
0.5031	0.6281	600	0.5139	-0.2939	-1.1329	0.7024	0.8389	-278.9670	-274.8105	-2.7071	-2.7268
0.5057	0.7328	700	0.5084	-0.0108	-0.9049	0.7202	0.8941	-276.6876	-271.9794	-2.7207	-2.7404
0.5172	0.8375	800	0.5090	-0.0300	-0.9231	0.7183	0.8931	-276.8697	-272.1711	-2.7173	-2.7371
0.5173	0.9422	900	0.5084	-0.0008	-0.9080	0.7222	0.9072	-276.7181	-271.8789	-2.7174	-2.7372
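
The Rewards/margins column in the table above appears to be the difference between Rewards/chosen and Rewards/rejected, which is the usual convention in DPO logging; the card itself does not state this, so treat it as an assumption. The minimal sketch below copies the three reward columns from the table and checks that relationship row by row. The names `ROWS` and `max_margin_gap` are introduced here for illustration only.

```python
# Rows copied from the training results table above:
# (step, rewards/chosen, rewards/rejected, rewards/margins)
ROWS = [
    (100, 0.0265, -0.3721, 0.3986),
    (200, 0.0601, -0.5726, 0.6327),
    (300, -0.0197, -0.7637, 0.7439),
    (400, -0.1728, -0.9527, 0.7798),
    (500, -0.2142, -1.0435, 0.8293),
    (600, -0.2939, -1.1329, 0.8389),
    (700, -0.0108, -0.9049, 0.8941),
    (800, -0.0300, -0.9231, 0.8931),
    (900, -0.0008, -0.9080, 0.9072),
]


def max_margin_gap(rows):
    """Return the largest |margin - (chosen - rejected)| over all rows.

    Assumption: the margin is the mean chosen reward minus the mean
    rejected reward, so any gap should come only from rounding the
    logged values to four decimals.
    """
    return max(
        abs(margin - (chosen - rejected))
        for _step, chosen, rejected, margin in rows
    )


if __name__ == "__main__":
    gap = max_margin_gap(ROWS)
    # The table values are rounded to four decimals, so allow a small tolerance.
    print(f"largest discrepancy: {gap:.6f}")
    print("consistent within rounding:", gap < 2e-4)
```

Under that assumption, a margin that grows from 0.3986 to 0.9072 means the policy increasingly separates chosen from rejected responses, mostly because the rejected reward falls while the chosen reward stays near zero.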