# Llama-2-7b-hf-DPO-Filtered-0.2-version-3
This model is a fine-tuned version of [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) on an unspecified dataset. It achieves the following results on the evaluation set:
- Loss: 0.8860
- Rewards/chosen: -2.2683
- Rewards/rejected: -2.6831
- Rewards/accuracies: 0.5500
- Rewards/margins: 0.4148
- Logps/rejected: -83.1208
- Logps/chosen: -77.0439
- Logits/rejected: -1.3980
- Logits/chosen: -1.4004
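In DPO, each response's reward is the (beta-scaled) difference between the policy and reference log-probabilities, and the margin and accuracy metrics above follow directly from the per-pair chosen/rejected rewards. A minimal illustrative sketch of how those derived metrics relate (pure Python, not the trainer's actual code):

```python
import math

def dpo_eval_metrics(chosen_rewards, rejected_rewards):
    """Derive margin, accuracy, and DPO loss from per-pair rewards.

    Rewards are assumed to already include the beta scaling, as in the
    metrics reported above.
    """
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    # Rewards/margins: mean difference between chosen and rejected rewards.
    mean_margin = sum(margins) / len(margins)
    # Rewards/accuracies: fraction of pairs where chosen outscores rejected.
    accuracy = sum(m > 0 for m in margins) / len(margins)
    # DPO loss per pair: -log(sigmoid(margin)), averaged over pairs.
    losses = [-math.log(1.0 / (1.0 + math.exp(-m))) for m in margins]
    return mean_margin, accuracy, sum(losses) / len(losses)

# Final evaluation row: margin = -2.2683 - (-2.6831) = 0.4148.
margin, acc, loss = dpo_eval_metrics([-2.2683], [-2.6831])
```

Note that the reported evaluation loss is the mean of per-pair losses, so it does not in general equal `-log σ(mean margin)`.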
## Model description
More information needed
## Intended uses & limitations
More information needed
## Training and evaluation data
More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- num_epochs: 3
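Two of these values can be cross-checked: the effective batch size is `train_batch_size × gradient_accumulation_steps = 2 × 2 = 4`, matching `total_train_batch_size`, and the learning rate follows a linear warmup into a cosine decay. A sketch of that schedule shape (the 720 total steps come from the training log below; the exact `transformers` implementation may differ in detail):

```python
import math

LEARNING_RATE = 5e-05   # learning_rate
WARMUP_STEPS = 10       # lr_scheduler_warmup_steps
TOTAL_STEPS = 720       # final optimizer step in the training log below

def lr_at(step):
    """Linear warmup to the peak LR, then cosine decay toward zero."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return LEARNING_RATE * 0.5 * (1.0 + math.cos(math.pi * progress))

# Effective batch size: train_batch_size * gradient_accumulation_steps.
EFFECTIVE_BATCH_SIZE = 2 * 2  # = total_train_batch_size (4)
```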
### Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6488 | 0.2994 | 72 | 0.6678 | -0.0112 | -0.0724 | 0.6000 | 0.0613 | -57.0147 | -54.4726 | -0.5812 | -0.5741 |
| 0.6000 | 0.5988 | 144 | 0.6316 | -0.4870 | -0.6800 | 0.6000 | 0.1930 | -63.0904 | -59.2310 | -0.6746 | -0.6711 |
| 0.5876 | 0.8981 | 216 | 0.6931 | -0.4396 | -0.5539 | 0.5000 | 0.1143 | -61.8289 | -58.7568 | -0.5937 | -0.5907 |
| 0.4949 | 1.1975 | 288 | 0.7890 | -0.7079 | -0.9614 | 0.6500 | 0.2535 | -65.9037 | -61.4400 | -0.8747 | -0.8740 |
| 0.5650 | 1.4969 | 360 | 0.9088 | -1.6793 | -1.8869 | 0.5500 | 0.2077 | -75.1596 | -71.1538 | -1.2245 | -1.2255 |
| 0.2830 | 1.7963 | 432 | 0.8288 | -1.8095 | -2.1999 | 0.6000 | 0.3905 | -78.2897 | -72.4555 | -1.2749 | -1.2766 |
| 0.1794 | 2.0956 | 504 | 0.8811 | -1.8931 | -2.2411 | 0.5500 | 0.3480 | -78.7009 | -73.2920 | -1.3148 | -1.3161 |
| 0.3907 | 2.3950 | 576 | 0.8772 | -2.2014 | -2.6107 | 0.5500 | 0.4093 | -82.3973 | -76.3750 | -1.4219 | -1.4232 |
| 0.0225 | 2.6944 | 648 | 0.8843 | -2.2655 | -2.6784 | 0.5500 | 0.4129 | -83.0741 | -77.0161 | -1.3961 | -1.3981 |
| 0.2077 | 2.9938 | 720 | 0.8860 | -2.2683 | -2.6831 | 0.5500 | 0.4148 | -83.1208 | -77.0439 | -1.3980 | -1.4004 |
### Framework versions
- PEFT 0.10.0
- Transformers 4.40.2
- Pytorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1