Cerebras-GPT-111M-instruction-sft-lora-merged-dpo-lora

This model is a fine-tuned version of SebastianSchramm/Cerebras-GPT-111M-instruction-sft-lora-merged on an unspecified dataset. Its results on the evaluation set are listed per checkpoint in the table below.

Model description

More information needed

The hyperparameters used during training are not listed. Training and evaluation metrics, logged every 300 steps:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6555	0.34	300	0.6536	0.5523	0.3662	0.6271	0.1862	-798.4653	-1066.8068	-2.7199	-2.9594
0.615	0.68	600	0.6352	0.7267	0.4534	0.6380	0.2732	-797.5925	-1065.0635	-2.7194	-2.9580
0.6313	1.02	900	0.6278	0.7792	0.4662	0.6440	0.3131	-797.4653	-1064.5378	-2.7117	-2.9469
0.6218	1.36	1200	0.6295	0.7738	0.4669	0.6457	0.3069	-797.4579	-1064.5920	-2.7035	-2.9401
0.6311	1.71	1500	0.6212	0.7817	0.4456	0.6654	0.3361	-797.6708	-1064.5128	-2.7073	-2.9437
0.6107	2.05	1800	0.6223	0.8065	0.4674	0.6572	0.3391	-797.4526	-1064.2653	-2.7009	-2.9373
0.6146	2.39	2100	0.6190	0.8141	0.4648	0.6698	0.3494	-797.4793	-1064.1887	-2.6988	-2.9353
0.6347	2.73	2400	0.6214	0.8118	0.4631	0.6654	0.3487	-797.4959	-1064.2124	-2.6962	-2.9342
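
The reward columns above can be cross-checked with a few lines of Python. The sketch below is illustrative only: it assumes that Rewards/margins is the difference between Rewards/chosen and Rewards/rejected, and that the loss is the standard sigmoid-form DPO loss, neither of which this card states explicitly. The names `ROWS`, `margin_gaps`, and `pairwise_dpo_loss` are hypothetical helpers, not part of any library.

```python
import math

# Selected columns copied from the table above:
# (step, validation loss, rewards/chosen, rewards/rejected, rewards/margins)
ROWS = [
    (300, 0.6536, 0.5523, 0.3662, 0.1862),
    (600, 0.6352, 0.7267, 0.4534, 0.2732),
    (900, 0.6278, 0.7792, 0.4662, 0.3131),
    (1200, 0.6295, 0.7738, 0.4669, 0.3069),
    (1500, 0.6212, 0.7817, 0.4456, 0.3361),
    (1800, 0.6223, 0.8065, 0.4674, 0.3391),
    (2100, 0.6190, 0.8141, 0.4648, 0.3494),
    (2400, 0.6214, 0.8118, 0.4631, 0.3487),
]


def margin_gaps(rows):
    """Return |(chosen - rejected) - reported margin| for each row."""
    return [abs((chosen - rejected) - margin)
            for _, _, chosen, rejected, margin in rows]


def pairwise_dpo_loss(reward_chosen, reward_rejected):
    """Assumed sigmoid-form DPO loss for one pair: -log(sigmoid(margin))."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))


# Checkpoint with the lowest validation loss among the logged steps.
best = min(ROWS, key=lambda row: row[1])

print(f"largest margin mismatch: {max(margin_gaps(ROWS)):.4f}")
print(f"lowest validation loss: {best[1]} at step {best[0]}")
print(f"loss at zero margin: {pairwise_dpo_loss(0.0, 0.0):.4f}")
```

Under these assumptions the reported margins agree with chosen minus rejected up to rounding in the fourth decimal. A zero margin gives a loss of ln 2 (about 0.693), which is the reference point for reading the loss columns. The reported validation loss is an average of per-example losses, so it need not equal the loss evaluated at the average margin.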