Llama-2-7b-hf-DPO-LookAhead-5_Q2_TTree1.4_TT0.9_TP0.7_TE0.2_V2

This model is a fine-tuned version of meta-llama/Llama-2-7b-hf; the fine-tuning dataset is not specified. Evaluation-set results at each logged checkpoint are listed in the table below.

Model description

More information needed

Training hyperparameters

More information needed

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.7029	0.3026	77	0.6933	-0.0162	-0.0163	0.3333	0.0001	-95.3805	-117.5575	0.5079	0.4933
0.6605	0.6051	154	0.6804	0.0594	0.0318	0.6667	0.0276	-94.8997	-116.8017	0.4988	0.4837
0.6291	0.9077	231	0.6684	0.2040	0.1302	0.75	0.0738	-93.9156	-115.3556	0.4931	0.4757
0.3149	1.2102	308	0.6806	-0.2081	-0.3152	0.5833	0.1071	-98.3691	-119.4764	0.4810	0.4619
0.3251	1.5128	385	0.7502	-0.4333	-0.4100	0.5833	-0.0233	-99.3170	-121.7279	0.4258	0.4057
0.2002	1.8153	462	0.8816	-1.2398	-1.0499	0.5	-0.1899	-105.7162	-129.7932	0.3036	0.2813
0.0182	2.1179	539	0.9166	-1.4380	-1.2371	0.5	-0.2010	-107.5881	-131.7757	0.1946	0.1703
0.2002	2.4204	616	0.9190	-1.5677	-1.4004	0.5	-0.1673	-109.2209	-133.0719	0.1338	0.1085
0.1982	2.7230	693	0.9352	-1.7573	-1.5576	0.5	-0.1997	-110.7931	-134.9680	0.0983	0.0721
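
As a reading aid, the following minimal sketch checks two things against the table above: whether Rewards/margins matches Rewards/chosen minus Rewards/rejected, and which logged step has the lowest validation loss. It assumes that the margin is defined as chosen minus rejected, which the logged values appear consistent with up to rounding. The helper names `margin_mismatches` and `best_step` are hypothetical and not part of any training library.

```python
# Rows copied from the table above:
# (step, validation loss, rewards/chosen, rewards/rejected, rewards/margins)
ROWS = [
    (77, 0.6933, -0.0162, -0.0163, 0.0001),
    (154, 0.6804, 0.0594, 0.0318, 0.0276),
    (231, 0.6684, 0.2040, 0.1302, 0.0738),
    (308, 0.6806, -0.2081, -0.3152, 0.1071),
    (385, 0.7502, -0.4333, -0.4100, -0.0233),
    (462, 0.8816, -1.2398, -1.0499, -0.1899),
    (539, 0.9166, -1.4380, -1.2371, -0.2010),
    (616, 0.9190, -1.5677, -1.4004, -0.1673),
    (693, 0.9352, -1.7573, -1.5576, -0.1997),
]


def margin_mismatches(rows, tol=2e-4):
    """Return the steps where margins differs from chosen - rejected by more than tol.

    Assumption: margin = chosen - rejected. The tolerance allows for the
    four-decimal rounding of the logged values.
    """
    return [
        step
        for step, _, chosen, rejected, margin in rows
        if abs((chosen - rejected) - margin) > tol
    ]


def best_step(rows):
    """Return (step, validation loss) for the row with the lowest validation loss."""
    step, val_loss = min(rows, key=lambda row: row[1])[:2]
    return step, val_loss


if __name__ == "__main__":
    print("margin mismatches:", margin_mismatches(ROWS))
    print("lowest validation loss:", best_step(ROWS))
```

On these rows the lowest validation loss is 0.6684 at step 231. After that step the validation loss rises while the training loss continues to fall.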