model_shp1_dpo1

This model is a fine-tuned version of meta-llama/Llama-2-7b-chat-hf on an unknown dataset. It achieves the following results on the evaluation set:

Model description

More information needed

More information needed

More information needed

The following hyperparameters were used during training:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.0741	2.67	100	1.1825	-3.1899	-3.0393	0.4800	-0.1506	-244.8796	-279.3668	-1.2034	-1.2620
0.0016	5.33	200	2.1179	-8.9597	-8.2224	0.4100	-0.7372	-296.7111	-337.0645	-1.1154	-1.1503
0.0001	8.0	300	1.9624	-9.5308	-9.0562	0.4500	-0.4746	-305.0487	-342.7763	-1.0878	-1.1168
0.0001	10.67	400	1.9799	-9.6041	-9.1296	0.4500	-0.4745	-305.7831	-343.5089	-1.0797	-1.1079
0.0001	13.33	500	1.9938	-9.6787	-9.2063	0.4500	-0.4724	-306.5495	-344.2545	-1.0746	-1.1031
0.0001	16.0	600	2.0046	-9.7222	-9.2446	0.4600	-0.4776	-306.9330	-344.6898	-1.0722	-1.0999
0.0001	18.67	700	2.0079	-9.7525	-9.2749	0.4500	-0.4776	-307.2361	-344.9933	-1.0706	-1.0984
0.0001	21.33	800	2.0091	-9.7588	-9.2867	0.4600	-0.4721	-307.3541	-345.0561	-1.0699	-1.0978
0.0001	24.0	900	2.0158	-9.7704	-9.2915	0.4500	-0.4789	-307.4015	-345.1719	-1.0694	-1.0975
0.0001	26.67	1000	2.0112	-9.7625	-9.2926	0.4700	-0.4699	-307.4124	-345.0927	-1.0692	-1.0975