# Llama-2-7b-hf-DPO-LookAhead-0_TTree1.4_TT0.9_TP0.7_TE0.2_V1

This model is a fine-tuned version of meta-llama/Llama-2-7b-hf on an unspecified dataset. It achieves the following results on the evaluation set (a sketch of how these DPO metrics are derived follows the list):
- Loss: 1.9544
- Rewards/chosen: -4.7879
- Rewards/rejected: -3.3303
- Rewards/accuracies: 0.4167
- Rewards/margins: -1.4576
- Logps/rejected: -139.8422
- Logps/chosen: -221.9621
- Logits/rejected: -0.3063
- Logits/chosen: -0.3612
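
The reward figures above follow the usual DPO convention: they are implicit rewards derived from the gap between policy and reference-model log-probabilities. The snippet below is a minimal sketch of how such metrics are typically computed; the `beta` value and the log-prob tensors are assumptions, since the card does not state them.

```python
import torch
import torch.nn.functional as F

def dpo_eval_metrics(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Compute DPO-style reward metrics from per-example sequence log-probs.

    All inputs are 1-D tensors of summed token log-probabilities.
    `beta` is an assumed value; this card does not report the one used.
    """
    rewards_chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rewards_rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    margins = rewards_chosen - rewards_rejected
    # "Rewards/accuracies" is the fraction of pairs where the chosen response
    # gets a higher implicit reward than the rejected one.
    accuracy = (margins > 0).float().mean()
    loss = -F.logsigmoid(margins).mean()
    return {
        "rewards/chosen": rewards_chosen.mean().item(),
        "rewards/rejected": rewards_rejected.mean().item(),
        "rewards/margins": margins.mean().item(),
        "rewards/accuracies": accuracy.item(),
        "loss": loss.item(),
    }
```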
## Model description
More information needed
## Intended uses & limitations
More information needed
## Training and evaluation data
More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training (an illustrative trainer configuration follows the list):
- learning_rate: 5e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- num_epochs: 3
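
For readers who want to set up a comparable run, the sketch below maps these hyperparameters onto TRL's `DPOConfig`/`DPOTrainer` with a PEFT adapter. The LoRA settings and the preference dataset are placeholders not stated in this card; treat this as an assumed reconstruction, not the exact training script.

```python
# Sketch only: reproduces the reported hyperparameters with TRL's DPOTrainer.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Hypothetical preference dataset with "prompt"/"chosen"/"rejected" columns.
dataset = load_dataset("your/preference-dataset")

# Assumed adapter settings; the card only indicates that PEFT was used.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

training_args = DPOConfig(
    output_dir="Llama-2-7b-hf-DPO-LookAhead-0_TTree1.4_TT0.9_TP0.7_TE0.2_V1",
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=2,   # effective train batch size of 4
    lr_scheduler_type="cosine",
    warmup_steps=10,
    num_train_epochs=3,
    seed=42,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,  # older TRL releases take tokenizer= instead
    peft_config=peft_config,
)
trainer.train()
```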
### Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.666         | 0.3008 | 80   | 0.7114          | -0.0994        | -0.0692          | 0.5                | -0.0302         | -107.2313      | -175.0767    | 0.3992          | 0.3563        |
| 0.7848        | 0.6015 | 160  | 0.7871          | -0.5530        | -0.4147          | 0.4167             | -0.1383         | -110.6864      | -179.6128    | 0.4061          | 0.3637        |
| 0.7413        | 0.9023 | 240  | 0.8345          | -0.6343        | -0.4162          | 0.4167             | -0.2181         | -110.7009      | -180.4258    | 0.3814          | 0.3393        |
| 0.5906        | 1.2030 | 320  | 1.0830          | -1.4953        | -0.8871          | 0.4167             | -0.6082         | -115.4103      | -189.0355    | 0.2654          | 0.2219        |
| 0.3771        | 1.5038 | 400  | 1.1984          | -2.0768        | -1.2714          | 0.3333             | -0.8053         | -119.2534      | -194.8505    | 0.1371          | 0.0921        |
| 0.2132        | 1.8045 | 480  | 1.2438          | -2.4881        | -1.6813          | 0.3333             | -0.8068         | -123.3516      | -198.9633    | 0.0444          | -0.0027       |
| 0.0544        | 2.1053 | 560  | 1.6818          | -3.7464        | -2.5485          | 0.1667             | -1.1979         | -132.0241      | -211.5465    | -0.1111         | -0.1621       |
| 0.0452        | 2.4060 | 640  | 1.8619          | -4.4511        | -3.1120          | 0.4167             | -1.3391         | -137.6592      | -218.5939    | -0.2407         | -0.2942       |
| 0.023         | 2.7068 | 720  | 1.9544          | -4.7879        | -3.3303          | 0.4167             | -1.4576         | -139.8422      | -221.9621    | -0.3063         | -0.3612       |
### Framework versions
- PEFT 0.13.2
- Transformers 4.45.2
- Pytorch 2.4.0+cu121
- Datasets 3.0.1
- Tokenizers 0.20.1
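
The PEFT entry in the list above suggests this repository contains a PEFT adapter rather than full model weights. Assuming that is the case, a minimal loading and generation sketch (the repo id is taken from this card; everything else is standard Transformers/PEFT usage):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-2-7b-hf"
adapter_id = "LBK95/Llama-2-7b-hf-DPO-LookAhead-0_TTree1.4_TT0.9_TP0.7_TE0.2_V1"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
# Attach the DPO-tuned adapter on top of the base model.
model = PeftModel.from_pretrained(base_model, adapter_id)

prompt = "Explain direct preference optimization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```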