dpo-llama-chat

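This model is a fine-tuned version of meta-llama/Llama-2-7b-chat-hf; the training dataset is not specified. At the last logged evaluation (step 1000, epoch 2.36) it reaches a validation loss of 0.1928, a reward accuracy of 0.9310, and a reward margin of 3.0321.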

Model description

More information needed

The hyperparameters used during training are not listed. Training and evaluation metrics were logged every 100 steps:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.5985	0.24	100	0.5908	-0.0098	-0.3706	0.6857	0.3608	-93.3248	-77.2335	-0.7818	-0.8133
0.5032	0.47	200	0.4768	-0.1589	-0.9349	0.8037	0.7760	-98.9677	-78.7246	-0.8669	-0.8774
0.4105	0.71	300	0.4056	-0.3303	-1.5893	0.8316	1.2589	-105.5115	-80.4384	-0.8423	-0.8361
0.3707	0.94	400	0.3501	-0.2376	-1.6094	0.8760	1.3718	-105.7129	-79.5110	-0.7540	-0.7564
0.2363	1.18	500	0.2939	-0.8615	-2.9614	0.8932	2.0999	-119.2329	-85.7499	-0.8983	-0.8797
0.1947	1.42	600	0.2463	-1.0709	-3.5879	0.9085	2.5170	-125.4976	-87.8440	-0.8982	-0.8717
0.1823	1.65	700	0.2242	-1.2056	-3.7965	0.9158	2.5909	-127.5844	-89.1917	-0.8272	-0.8112
0.1476	1.89	800	0.2042	-1.1764	-3.9644	0.9271	2.7881	-129.2632	-88.8989	-0.8622	-0.8415
0.1120	2.13	900	0.1936	-1.3373	-4.3265	0.9330	2.9891	-132.8835	-90.5088	-0.8608	-0.8338
0.0949	2.36	1000	0.1928	-1.3672	-4.3992	0.9310	3.0321	-133.6114	-90.8071	-0.8584	-0.8277
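
As a sanity check on how the reward columns in the table relate, the sketch below recomputes Rewards/margins as Rewards/chosen minus Rewards/rejected for each logged step. It assumes these columns follow the usual DPO logging convention, in which the margin is the mean difference between chosen and rejected rewards; the card itself does not state this. The values are copied from the table, and the helper name `margin_gaps` is hypothetical.

```python
# (step, Rewards/chosen, Rewards/rejected, Rewards/margins), copied from the table above.
rows = [
    (100, -0.0098, -0.3706, 0.3608),
    (200, -0.1589, -0.9349, 0.7760),
    (300, -0.3303, -1.5893, 1.2589),
    (400, -0.2376, -1.6094, 1.3718),
    (500, -0.8615, -2.9614, 2.0999),
    (600, -1.0709, -3.5879, 2.5170),
    (700, -1.2056, -3.7965, 2.5909),
    (800, -1.1764, -3.9644, 2.7881),
    (900, -1.3373, -4.3265, 2.9891),
    (1000, -1.3672, -4.3992, 3.0321),
]


def margin_gaps(rows):
    """Map each step to |(chosen - rejected) - reported margin|."""
    return {
        step: abs((chosen - rejected) - margin)
        for step, chosen, rejected, margin in rows
    }


gaps = margin_gaps(rows)
for step, gap in gaps.items():
    # Any gap should be explainable by the four-decimal rounding in the table.
    print(f"step {step:>4}: gap = {gap:.4f}")
```

Under that assumption, every gap should stay within the four-decimal rounding of the logged values, which suggests the margin column carries no information beyond the two reward columns.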