# Llama-2-7b-hf-DPO-PartialEval_LookAhead5_ET0.1_MT1.2_1-5_Filtered0.1_V2.0
This model is a fine-tuned version of meta-llama/Llama-2-7b-hf on an unspecified dataset. It achieves the following results on the evaluation set:
- Loss: 1.0109
- Rewards/chosen: -2.4672
- Rewards/rejected: -2.3527
- Rewards/accuracies: 0.5
- Rewards/margins: -0.1144
- Logps/rejected: -101.7496
- Logps/chosen: -111.6802
- Logits/rejected: -1.3387
- Logits/chosen: -1.3182
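For context, the reward figures above follow the usual DPO convention: a response's implicit reward is β times the difference between the policy and reference log-probabilities, and the margin is the chosen reward minus the rejected reward. A minimal sketch of that bookkeeping (the β value and log-probabilities below are illustrative assumptions, not values from this run):

```python
def dpo_reward(logp_policy, logp_ref, beta=0.1):
    """Implicit DPO reward: beta * (policy log-prob - reference log-prob)."""
    return beta * (logp_policy - logp_ref)

def reward_margin(chosen_reward, rejected_reward):
    """Rewards/margins as reported above: chosen minus rejected."""
    return chosen_reward - rejected_reward

# Illustrative numbers only (not taken from this training run):
chosen = dpo_reward(logp_policy=-111.7, logp_ref=-87.0, beta=0.1)
rejected = dpo_reward(logp_policy=-101.7, logp_ref=-79.2, beta=0.1)
margin = reward_margin(chosen, rejected)
# A negative margin, as in the final evaluation above, means the rejected
# response is currently assigned a higher implicit reward than the chosen one.
```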
## Model description
More information needed
## Intended uses & limitations
More information needed
## Training and evaluation data
More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- num_epochs: 3
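The effective batch size and the learning-rate schedule above can be sketched as follows. This is a simplified re-implementation assuming linear warmup over 10 steps followed by cosine decay to zero; the exact schedule used by the trainer may differ in detail:

```python
import math

TRAIN_BATCH_SIZE = 2
GRAD_ACCUM_STEPS = 2
# total_train_batch_size = 4, as listed above
TOTAL_TRAIN_BATCH_SIZE = TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS

def cosine_lr_with_warmup(step, total_steps, peak_lr=5e-5, warmup_steps=10):
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# The learning rate ramps to 5e-5 at step 10, then decays to ~0 by the last step.
```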
### Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6944 | 0.3026 | 77 | 0.7061 | -0.0481 | -0.0149 | 0.3333 | -0.0332 | -78.3709 | -87.4894 | -0.4703 | -0.4454 |
| 0.7162 | 0.6051 | 154 | 0.7499 | -0.0798 | 0.0370 | 0.25 | -0.1168 | -77.8523 | -87.8063 | -0.5650 | -0.5389 |
| 0.7265 | 0.9077 | 231 | 0.7156 | -0.1175 | -0.0976 | 0.5833 | -0.0199 | -79.1982 | -88.1833 | -0.5730 | -0.5467 |
| 0.6934 | 1.2102 | 308 | 0.8015 | -0.6012 | -0.5127 | 0.5 | -0.0884 | -83.3497 | -93.0202 | -0.7454 | -0.7223 |
| 0.4346 | 1.5128 | 385 | 0.8278 | -0.8704 | -0.8319 | 0.5 | -0.0385 | -86.5415 | -95.7126 | -0.9246 | -0.9032 |
| 0.6773 | 1.8153 | 462 | 0.7807 | -0.8712 | -0.8972 | 0.5 | 0.0260 | -87.1945 | -95.7207 | -0.8954 | -0.8734 |
| 0.3446 | 2.1179 | 539 | 0.8424 | -1.5623 | -1.5383 | 0.5833 | -0.0240 | -93.6050 | -102.6317 | -1.0923 | -1.0707 |
| 0.1483 | 2.4204 | 616 | 0.9759 | -2.2720 | -2.1419 | 0.5833 | -0.1301 | -99.6412 | -109.7289 | -1.2875 | -1.2666 |
| 0.213 | 2.7230 | 693 | 1.0109 | -2.4672 | -2.3527 | 0.5 | -0.1144 | -101.7496 | -111.6802 | -1.3387 | -1.3182 |
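The validation loss above is the standard sigmoid-form DPO loss, which for a single example is -log σ(margin), with β already folded into the reported rewards. A minimal per-example sketch (the trainer averages this over batches, so plugging the aggregate margin in does not reproduce the reported loss exactly):

```python
import math

def dpo_loss(chosen_reward, rejected_reward):
    """Per-example sigmoid DPO loss: -log(sigmoid(chosen - rejected))."""
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A positive margin drives the loss below log(2) ~= 0.693;
# a negative margin (as in the final rows above) drives it above log(2).
```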
### Framework versions
- PEFT 0.12.0
- Transformers 4.43.3
- PyTorch 2.4.0+cu121
- Datasets 2.20.0
- Tokenizers 0.19.1