llama_DPO_model_e1

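This model is a fine-tuned version of meta-llama/Llama-2-7b-hf; the training dataset is not specified. At the last logged evaluation (step 250, epoch 0.99) it achieves the following results on the evaluation set:

Loss: 0.1779
Rewards/chosen: 0.3527
Rewards/rejected: -1.3764
Rewards/accuracies: 1.0
Rewards/margins: 1.7292
Logps/rejected: -198.5740
Logps/chosen: -157.1067
Logits/rejected: -1.0528
Logits/chosen: -0.8587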

Model description

More information needed

The hyperparameters used during training are not listed.

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6603	0.1	25	0.6253	0.0416	-0.1007	0.9633	0.1423	-185.8169	-160.2181	-1.0525	-0.8550
0.5342	0.2	50	0.5074	0.1130	-0.3090	1.0	0.4220	-187.8993	-159.5039	-1.0525	-0.8569
0.4382	0.3	75	0.4022	0.1798	-0.5442	1.0	0.7241	-190.2517	-158.8354	-1.0530	-0.8563
0.3592	0.4	100	0.3212	0.2338	-0.7752	1.0	1.0090	-192.5613	-158.2961	-1.0531	-0.8579
0.3035	0.5	125	0.2590	0.2824	-0.9912	1.0	1.2736	-194.7217	-157.8096	-1.0528	-0.8583
0.2374	0.6	150	0.2125	0.3190	-1.1966	1.0	1.5157	-196.7760	-157.4438	-1.0528	-0.8575
0.2094	0.7	175	0.1868	0.3455	-1.3260	1.0	1.6714	-198.0693	-157.1793	-1.0528	-0.8598
0.1886	0.79	200	0.1796	0.3491	-1.3639	1.0	1.7130	-198.4486	-157.1428	-1.0532	-0.8617
0.1805	0.89	225	0.1785	0.3523	-1.3731	1.0	1.7254	-198.5406	-157.1107	-1.0530	-0.8593
0.1821	0.99	250	0.1779	0.3527	-1.3764	1.0	1.7292	-198.5740	-157.1067	-1.0528	-0.8587
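
The reward columns in the table can be cross-checked against each other. The sketch below is a minimal sanity check, not the training code. It assumes the columns follow the usual DPO logging convention, where Rewards/margins is the mean of (chosen reward minus rejected reward), and that the loss is the standard sigmoid DPO loss without label smoothing. Under those assumptions, the validation loss should be no smaller than -log(sigmoid(mean margin)), because -log(sigmoid(x)) is convex (Jensen's inequality). The names `ROWS`, `margin_gap`, and `loss_at_mean_margin` are hypothetical helpers introduced for illustration; the numbers are copied from three rows of the table.

```python
import math

# (step, validation_loss, rewards_chosen, rewards_rejected, rewards_margins)
# Values copied from the rows for steps 25, 125, and 250 of the table above.
ROWS = [
    (25, 0.6253, 0.0416, -0.1007, 0.1423),
    (125, 0.2590, 0.2824, -0.9912, 1.2736),
    (250, 0.1779, 0.3527, -1.3764, 1.7292),
]


def margin_gap(chosen: float, rejected: float, margin: float) -> float:
    """Absolute difference between (chosen - rejected) and the logged margin."""
    return abs((chosen - rejected) - margin)


def loss_at_mean_margin(margin: float) -> float:
    """-log(sigmoid(margin)), written as log(1 + exp(-margin)).

    Assumption: standard sigmoid DPO loss with no label smoothing, where the
    margin already includes the beta scaling.
    """
    return math.log1p(math.exp(-margin))


if __name__ == "__main__":
    for step, val_loss, chosen, rejected, margin in ROWS:
        gap = margin_gap(chosen, rejected, margin)
        bound = loss_at_mean_margin(margin)
        print(
            f"step {step}: margin gap = {gap:.4f}, "
            f"lower bound on loss = {bound:.4f}, logged loss = {val_loss:.4f}"
        )
```

If the logged margin differs from chosen minus rejected by more than rounding error, or the logged loss falls below the bound, the run probably used a different loss variant or logging convention than the one assumed here.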