# Llama-2-7b-hf-DPO-LookAhead3_FullEval_TTree1.4_TLoop0.7_TEval0.2_Filter0.2_V3.0
This model is a fine-tuned version of [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) (the training dataset is not specified). It achieves the following results on the evaluation set:
- Loss: 0.5631
- Rewards/chosen: -2.3247
- Rewards/rejected: -3.0071
- Rewards/accuracies: 0.625
- Rewards/margins: 0.6824
- Logps/rejected: -155.5472
- Logps/chosen: -109.3707
- Logits/rejected: -1.3016
- Logits/chosen: -1.3197
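The reward metrics above follow the standard DPO convention: the implicit reward of a response is β·(log π(y|x) − log π_ref(y|x)), "Rewards/margins" is the chosen reward minus the rejected reward, and "Rewards/accuracies" is the fraction of pairs where the chosen reward is higher. A minimal sketch of how these metrics are derived from per-example log-probabilities (a hypothetical re-implementation, not code from this repo; β=0.1 is the common TRL default and is assumed here):

```python
# Hypothetical sketch of DPO reward-metric computation (not from this repo).
# Implicit DPO reward: beta * (logp_policy - logp_reference).
# beta=0.1 is an assumption (the common TRL default).

def dpo_reward_metrics(logp_chosen, logp_rejected,
                       ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Per-example implicit rewards for chosen and rejected responses.
    rewards_chosen = [beta * (p - r) for p, r in zip(logp_chosen, ref_logp_chosen)]
    rewards_rejected = [beta * (p - r) for p, r in zip(logp_rejected, ref_logp_rejected)]
    # Margin: chosen reward minus rejected reward, per pair.
    margins = [c - r for c, r in zip(rewards_chosen, rewards_rejected)]
    # Accuracy: fraction of pairs where the chosen response wins.
    accuracy = sum(m > 0 for m in margins) / len(margins)
    return {
        "rewards/chosen": sum(rewards_chosen) / len(rewards_chosen),
        "rewards/rejected": sum(rewards_rejected) / len(rewards_rejected),
        "rewards/margins": sum(margins) / len(margins),
        "rewards/accuracies": accuracy,
    }
```

So the final margin of 0.6824 with accuracy 0.625 means the chosen response received a higher implicit reward on 62.5% of evaluation pairs.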
## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- num_epochs: 3
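The effective batch size is train_batch_size × gradient_accumulation_steps = 2 × 2 = 4, matching the listed total. The learning-rate schedule (linear warmup over 10 steps, then cosine decay) can be sketched as follows; this is a hypothetical re-implementation mirroring the default behaviour of transformers' `get_cosine_schedule_with_warmup`, not code from this repo:

```python
import math

def lr_at_step(step, total_steps, base_lr=5e-5, warmup_steps=10):
    """Hypothetical sketch of the cosine schedule with warmup used here.

    Linear warmup from 0 to base_lr over the first `warmup_steps`,
    then cosine decay from base_lr down to 0 at `total_steps`.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With only 10 warmup steps against hundreds of training steps, the schedule is dominated by the cosine decay, so the learning rate falls smoothly from 5e-05 toward zero over the three epochs.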
### Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------|:------|:-----|:----------------|:---------------|:-----------------|:-------------------|:----------------|:---------------|:-------------|:----------------|:---------------|
| 0.6474 | 0.3018 | 51 | 0.6491 | -0.0362 | -0.1070 | 0.5 | 0.0707 | -126.5454 | -86.4856 | -0.6602 | -0.6833 |
| 0.6614 | 0.6036 | 102 | 0.5967 | -0.0764 | -0.2716 | 0.625 | 0.1951 | -128.1913 | -86.8877 | -0.6723 | -0.6955 |
| 0.736 | 0.9053 | 153 | 0.6105 | -0.3083 | -0.5178 | 0.625 | 0.2095 | -130.6541 | -89.2063 | -0.7358 | -0.7574 |
| 0.4273 | 1.2071 | 204 | 0.5950 | -0.5205 | -0.8235 | 0.75 | 0.3030 | -133.7103 | -91.3283 | -0.8108 | -0.8319 |
| 0.4513 | 1.5089 | 255 | 0.5775 | -0.8673 | -1.1891 | 0.5 | 0.3218 | -137.3667 | -94.7965 | -0.8911 | -0.9112 |
| 0.376 | 1.8107 | 306 | 0.5885 | -0.9856 | -1.2703 | 0.375 | 0.2848 | -138.1790 | -95.9789 | -0.8967 | -0.9161 |
| 0.3154 | 2.1124 | 357 | 0.5543 | -1.3571 | -1.8062 | 0.625 | 0.4491 | -143.5375 | -99.6945 | -1.0781 | -1.0970 |
| 0.0512 | 2.4142 | 408 | 0.5432 | -1.8765 | -2.4774 | 0.75 | 0.6009 | -150.2498 | -104.8879 | -1.2016 | -1.2198 |
| 0.0875 | 2.7160 | 459 | 0.5631 | -2.3247 | -3.0071 | 0.625 | 0.6824 | -155.5472 | -109.3707 | -1.3016 | -1.3197 |
### Framework versions
- PEFT 0.12.0
- Transformers 4.44.2
- Pytorch 2.4.0+cu121
- Datasets 3.0.0
- Tokenizers 0.19.1