Llama-2-7b-hf-DPO-PartialEval_ET0.1_MT1.2_1-5_V.1.0_Filtered0.1_V2.0

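This model is a DPO fine-tuned version of meta-llama/Llama-2-7b-hf. The training dataset is not named; the auto-generated card lists it as "None". Evaluation-set results for each logged checkpoint are given in the training results table below.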

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training hyperparameters

The following hyperparameters were used during training:

Training results

Training Loss   Epoch  Step  Validation Loss  Rewards/chosen  Rewards/rejected  Rewards/accuracies  Rewards/margins  Logps/rejected  Logps/chosen  Logits/rejected  Logits/chosen
       0.6731  0.3002    71           0.6597         -0.2850           -0.3581              0.7000           0.0730        -81.4183      -61.7075          -0.6048        -0.6395
       0.7247  0.6004   142           0.6736         -0.2337           -0.3068              0.6000           0.0731        -80.9054      -61.1937          -0.6225        -0.6508
       0.7904  0.9006   213           0.6429         -0.3998           -0.6509              0.4000           0.2511        -84.3463      -62.8548          -0.5841        -0.6111
       0.4605  1.2008   284           0.6441         -1.2776           -1.6264              0.6000           0.3488        -94.1014      -71.6333          -0.6811        -0.7046
       0.1414  1.5011   355           0.7397         -1.5081           -2.0627              0.5000           0.5546        -98.4643      -73.9382          -0.9076        -0.9267
       0.3163  1.8013   426           0.6853         -0.6284           -1.2097              0.5000           0.5813        -89.9342      -65.1407          -1.0111        -1.0272
       0.4302  2.1015   497           0.6807         -1.2134           -1.8550              0.6000           0.6416        -96.3875      -70.9912          -1.0568        -1.0718
       0.1069  2.4017   568           0.6842         -1.4697           -2.2823              0.6000           0.8126       -100.6605      -73.5542          -1.1094        -1.1223
       0.2267  2.7019   639           0.7144         -1.6306           -2.4361              0.6000           0.8055       -102.1983      -75.1633          -1.1455        -1.1589
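The sketch below assumes the usual Hugging Face TRL DPO bookkeeping with the default sigmoid loss. That convention is not confirmed by the card. Under it, the logged reward for a completion is beta * (policy log-prob - reference log-prob). Rewards/margins is the mean of the chosen reward minus the rejected reward. The validation loss is the mean of -log(sigmoid(margin)).

If those assumptions hold, the table above can be sanity-checked three ways:

- Rewards/margins should equal Rewards/chosen - Rewards/rejected, up to rounding.
- The reference model is frozen, so beta can be recovered from how the rewards and Logps columns move between two evaluations.
- By Jensen's inequality, the mean loss cannot fall below -log(sigmoid(mean margin)).

EvalRow, margin_residuals, implied_beta and loss_lower_bound are hypothetical helpers written for this illustration. They are not part of any library.

```python
import math
from typing import List, NamedTuple, Tuple


class EvalRow(NamedTuple):
    """One evaluation checkpoint from the training-results table (hypothetical helper type)."""
    step: int
    val_loss: float
    rewards_chosen: float
    rewards_rejected: float
    rewards_margins: float
    logps_rejected: float
    logps_chosen: float


# Values copied from the training-results table (evaluation-set means).
ROWS: List[EvalRow] = [
    EvalRow(71, 0.6597, -0.2850, -0.3581, 0.0730, -81.4183, -61.7075),
    EvalRow(142, 0.6736, -0.2337, -0.3068, 0.0731, -80.9054, -61.1937),
    EvalRow(213, 0.6429, -0.3998, -0.6509, 0.2511, -84.3463, -62.8548),
    EvalRow(284, 0.6441, -1.2776, -1.6264, 0.3488, -94.1014, -71.6333),
    EvalRow(355, 0.7397, -1.5081, -2.0627, 0.5546, -98.4643, -73.9382),
    EvalRow(426, 0.6853, -0.6284, -1.2097, 0.5813, -89.9342, -65.1407),
    EvalRow(497, 0.6807, -1.2134, -1.8550, 0.6416, -96.3875, -70.9912),
    EvalRow(568, 0.6842, -1.4697, -2.2823, 0.8126, -100.6605, -73.5542),
    EvalRow(639, 0.7144, -1.6306, -2.4361, 0.8055, -102.1983, -75.1633),
]


def margin_residuals(rows: List[EvalRow]) -> List[float]:
    """|(chosen - rejected) - margins| per row; should be at 4-decimal rounding level."""
    return [abs((r.rewards_chosen - r.rewards_rejected) - r.rewards_margins) for r in rows]


def implied_beta(rows: List[EvalRow], first: int = 0, last: int = -1) -> Tuple[float, float]:
    """Estimate beta from two checkpoints.

    Assumes reward = beta * (policy logp - reference logp) with a frozen reference
    model, so the reference term cancels when two checkpoints are differenced.
    """
    a, b = rows[first], rows[last]
    beta_chosen = (b.rewards_chosen - a.rewards_chosen) / (b.logps_chosen - a.logps_chosen)
    beta_rejected = (b.rewards_rejected - a.rewards_rejected) / (b.logps_rejected - a.logps_rejected)
    return beta_chosen, beta_rejected


def loss_lower_bound(mean_margin: float) -> float:
    """-log(sigmoid(m)) = log(1 + exp(-m)).

    This function is convex in m, so by Jensen's inequality the mean sigmoid DPO
    loss over the evaluation set is at least its value at the mean margin.
    """
    return math.log1p(math.exp(-mean_margin))


if __name__ == "__main__":
    print("max margin residual:", max(margin_residuals(ROWS)))
    print("implied beta (chosen, rejected):", implied_beta(ROWS))
    for r in ROWS:
        bound = loss_lower_bound(r.rewards_margins)
        print(f"step {r.step}: val loss {r.val_loss:.4f} >= bound {bound:.4f}")
```

With the logged rows, both beta estimates come out very close to 0.1. That points to beta = 0.1, but the value is inferred from the numbers and is not stated in the card. In later epochs the gap between the validation loss and the Jensen bound widens. This suggests the per-pair margins are spreading out rather than shifting up uniformly.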