
llama_SFT_e1_DPO_e1

This model is a fine-tuned version of meta-llama/Llama-2-7b-hf, trained with DPO on an unspecified preference dataset. It achieves the following results on the evaluation set (the sketch after this list shows how the DPO reward metrics are computed):

  • Loss: 0.1876
  • Rewards/chosen: 0.3221
  • Rewards/rejected: -1.3485
  • Rewards/accuracies: 1.0
  • Rewards/margins: 1.6706
  • Logps/rejected: -199.1326
  • Logps/chosen: -156.6435
  • Logits/rejected: -1.0544
  • Logits/chosen: -0.8650
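
The reward metrics above follow the standard DPO bookkeeping: each reward is the beta-scaled log-probability ratio between the fine-tuned policy and the frozen reference model, the margin is the gap between the chosen and rejected rewards, and the accuracy is the fraction of pairs where that gap is positive. A minimal sketch of that computation (the beta value and the dummy log-probabilities are illustrative assumptions, not values from this run):

```python
import torch
import torch.nn.functional as F

def dpo_metrics(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Compute the DPO loss and the reward metrics reported in this card.

    Inputs are per-sequence sums of token log-probabilities.
    beta=0.1 is an assumption; the card does not state the value used.
    """
    # Rewards: beta-scaled log-prob ratios of the policy vs. the reference model.
    rewards_chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rewards_rejected = beta * (policy_rejected_logps - ref_rejected_logps)

    margins = rewards_chosen - rewards_rejected                 # Rewards/margins
    accuracies = (rewards_chosen > rewards_rejected).float()    # Rewards/accuracies

    # DPO loss: negative log-sigmoid of the reward margin.
    loss = -F.logsigmoid(margins).mean()
    return loss, rewards_chosen.mean(), rewards_rejected.mean(), accuracies.mean()

# Example with dummy log-probabilities for a batch of two preference pairs.
loss, r_c, r_r, acc = dpo_metrics(
    policy_chosen_logps=torch.tensor([-155.0, -158.0]),
    policy_rejected_logps=torch.tensor([-200.0, -198.0]),
    ref_chosen_logps=torch.tensor([-158.0, -160.0]),
    ref_rejected_logps=torch.tensor([-187.0, -186.0]),
)
print(loss.item(), r_c.item(), r_r.item(), acc.item())
```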

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a configuration sketch follows the list):

  • learning_rate: 1e-06
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 8
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 1
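
The metric names and hyperparameters above are consistent with TRL's DPOTrainer, although the card does not name the training library. The sketch below shows how such a run could be configured under that assumption; the dataset files, beta value, and LoRA settings are illustrative placeholders, and the argument names follow the TRL 0.7-era API that matches the Transformers 4.38 / PEFT 0.8 stack listed under "Framework versions".

```python
# A sketch, not the authors' script: TRL's DPOTrainer is assumed, and the
# dataset, beta, and LoRA settings below are illustrative placeholders.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Hyperparameters copied from the list above.
training_args = TrainingArguments(
    output_dir="llama_SFT_e1_DPO_e1",
    learning_rate=1e-6,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,   # total effective train batch size of 8
    num_train_epochs=1,
    lr_scheduler_type="linear",
    seed=42,
    evaluation_strategy="steps",
    eval_steps=25,                   # matches the 25-step eval cadence below
    logging_steps=25,
)

# Assumed: a preference dataset with "prompt"/"chosen"/"rejected" columns.
dataset = load_dataset("json", data_files={"train": "prefs_train.json",
                                           "eval": "prefs_eval.json"})

peft_config = LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32)  # illustrative

trainer = DPOTrainer(
    model,
    ref_model=None,            # with a PEFT adapter, the frozen base acts as reference
    beta=0.1,                  # assumption; beta is not stated in the card
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["eval"],
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```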

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6715        | 0.1   | 25   | 0.6332          | 0.0386         | -0.0871          | 0.9333             | 0.1257          | -186.5185      | -159.4784    | -1.0533         | -0.8570       |
| 0.5507        | 0.2   | 50   | 0.5213          | 0.1021         | -0.2851          | 1.0                | 0.3872          | -188.4984      | -158.8435    | -1.0540         | -0.8579       |
| 0.4521        | 0.3   | 75   | 0.4180          | 0.1622         | -0.5141          | 1.0                | 0.6763          | -190.7885      | -158.2424    | -1.0548         | -0.8606       |
| 0.3675        | 0.4   | 100  | 0.3332          | 0.2182         | -0.7466          | 1.0                | 0.9647          | -193.1132      | -157.6828    | -1.0545         | -0.8611       |
| 0.3149        | 0.5   | 125  | 0.2724          | 0.2574         | -0.9589          | 1.0                | 1.2164          | -195.2370      | -157.2902    | -1.0544         | -0.8631       |
| 0.2486        | 0.6   | 150  | 0.2247          | 0.2948         | -1.1593          | 1.0                | 1.4541          | -197.2406      | -156.9163    | -1.0550         | -0.8663       |
| 0.2173        | 0.7   | 175  | 0.1966          | 0.3176         | -1.2962          | 1.0                | 1.6138          | -198.6099      | -156.6887    | -1.0553         | -0.8673       |
| 0.1971        | 0.79  | 200  | 0.1878          | 0.3231         | -1.3461          | 1.0                | 1.6692          | -199.1087      | -156.6337    | -1.0542         | -0.8665       |
| 0.1869        | 0.89  | 225  | 0.1869          | 0.3210         | -1.3535          | 1.0                | 1.6745          | -199.1825      | -156.6541    | -1.0546         | -0.8626       |
| 0.1911        | 0.99  | 250  | 0.1876          | 0.3221         | -1.3485          | 1.0                | 1.6706          | -199.1326      | -156.6435    | -1.0544         | -0.8650       |

Framework versions

  • PEFT 0.8.2
  • Transformers 4.38.1
  • Pytorch 2.2.0+cu118
  • Datasets 2.17.1
  • Tokenizers 0.15.2
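
Because this checkpoint is a PEFT adapter rather than a full model, inference requires loading the gated meta-llama/Llama-2-7b-hf base weights and attaching the adapter on top. A minimal loading sketch, assuming the adapter is published as thorirhrafn/llama_SFT_e1_DPO_e1 and that you have access to the base model:

```python
# A minimal inference sketch; the adapter id below is assumed to be
# thorirhrafn/llama_SFT_e1_DPO_e1, applied on top of the gated Llama-2-7b base.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-2-7b-hf"
adapter_id = "thorirhrafn/llama_SFT_e1_DPO_e1"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, adapter_id)  # attach the DPO adapter
model.eval()

prompt = "Write a short greeting."  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```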