# Llama-2-7b-hf-DPO-FullEval_LookAhead5_TTree1.2_TT0.7_TP0.7_TE0.1_Filtered0.1_V1.0

This model is a fine-tuned version of [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf), trained with DPO on an unspecified dataset. It achieves the following results on the evaluation set:
- Loss: 0.9993
- Rewards/chosen: -2.7308
- Rewards/rejected: -3.0957
- Rewards/accuracies: 0.6667
- Rewards/margins: 0.3649
- Logps/rejected: -93.3609
- Logps/chosen: -104.7227
- Logits/rejected: -1.6352
- Logits/chosen: -1.6231
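For reference, the reward and loss metrics above follow the standard DPO formulation: each reward is β times the log-probability ratio between the policy and the reference model on a response, and the loss is the negative log-sigmoid of the chosen-vs-rejected reward margin. A minimal plain-Python sketch (the β value and the example log-probs are illustrative assumptions, not taken from this card):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_metrics(policy_chosen_logp, policy_rejected_logp,
                ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Compute the DPO loss and the reward quantities logged above.

    Rewards are beta-scaled log-prob ratios of policy vs. reference;
    the loss is -log sigmoid of the chosen/rejected reward margin.
    beta=0.1 is an assumed illustrative value.
    """
    rewards_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    rewards_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = rewards_chosen - rewards_rejected
    loss = -math.log(sigmoid(margin))
    return {"loss": loss,
            "rewards/chosen": rewards_chosen,
            "rewards/rejected": rewards_rejected,
            "rewards/margins": margin}

# Example (hypothetical log-probs): the chosen response gains log-prob
# relative to the reference while the rejected one loses it, giving a
# positive margin and a loss below log(2) ~= 0.693.
m = dpo_metrics(-100.0, -95.0, -98.0, -90.0)
```

A positive `rewards/margins` means the model ranks the chosen response above the rejected one, which is what `rewards/accuracies` counts over the evaluation set.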
## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed
## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- num_epochs: 3
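Note that `total_train_batch_size` (4) is simply `train_batch_size × gradient_accumulation_steps` (2 × 2). The learning-rate shape implied by `cosine` with 10 warmup steps can be sketched in plain Python as follows; this approximates the linear-warmup-then-cosine-decay curve of the usual `transformers` cosine scheduler, with `total_steps=780` taken from the final step in the results table below:

```python
import math

def cosine_lr_with_warmup(step, total_steps=780, warmup_steps=10, base_lr=5e-5):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, the rate is 0 at step 0, peaks at 5e-05 at step 10, and decays smoothly toward 0 by step 780.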
### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.778 | 0.2994 | 78 | 0.6949 | -0.3298 | -0.3608 | 0.4167 | 0.0310 | -66.0123 | -80.7126 | -0.5458 | -0.5189 |
| 0.5895 | 0.5988 | 156 | 0.7221 | -0.3528 | -0.4053 | 0.5 | 0.0526 | -66.4578 | -80.9425 | -0.5269 | -0.4997 |
| 0.6423 | 0.8983 | 234 | 0.8029 | -0.5435 | -0.7369 | 0.5 | 0.1935 | -69.7736 | -82.8494 | -0.5770 | -0.5512 |
| 0.3985 | 1.1977 | 312 | 0.7639 | -0.4640 | -0.9647 | 0.5833 | 0.5008 | -72.0517 | -82.0546 | -0.7305 | -0.7083 |
| 0.4527 | 1.4971 | 390 | 0.8308 | -0.3638 | -0.5763 | 0.5 | 0.2125 | -68.1671 | -81.0529 | -0.9151 | -0.8936 |
| 0.3677 | 1.7965 | 468 | 0.7432 | -0.8212 | -1.2349 | 0.6667 | 0.4137 | -74.7538 | -85.6273 | -1.0459 | -1.0262 |
| 0.2591 | 2.0960 | 546 | 0.8634 | -1.6345 | -2.0043 | 0.5833 | 0.3699 | -82.4478 | -93.7598 | -1.2888 | -1.2721 |
| 0.1802 | 2.3954 | 624 | 1.1197 | -3.1423 | -3.3841 | 0.6667 | 0.2418 | -96.2452 | -108.8380 | -1.6577 | -1.6465 |
| 0.2054 | 2.6948 | 702 | 1.0008 | -2.7513 | -3.1213 | 0.6667 | 0.3700 | -93.6174 | -104.9277 | -1.6348 | -1.6227 |
| 0.2823 | 2.9942 | 780 | 0.9993 | -2.7308 | -3.0957 | 0.6667 | 0.3649 | -93.3609 | -104.7227 | -1.6352 | -1.6231 |
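The Rewards/accuracies column is the fraction of evaluation pairs for which the chosen response receives a higher implicit reward than the rejected one. A minimal sketch of that metric (the example reward values are hypothetical):

```python
def reward_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs where the chosen response out-scores the rejected one."""
    wins = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return wins / len(chosen_rewards)

# Hypothetical example: 2 of 3 pairs ranked correctly -> 0.6667
acc = reward_accuracy([-2.7, -1.0, -3.2], [-3.1, -0.5, -3.5])
```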
### Framework versions
- PEFT 0.12.0
- Transformers 4.44.0
- Pytorch 2.4.0+cu121
- Datasets 2.20.0
- Tokenizers 0.19.1