distilgpt2-dpo_test_run

This model is a fine-tuned version of gpt2; the training dataset is not specified. Evaluation-set results for each epoch are given in the results table at the end of this card.

Model description

More information needed

The training hyperparameters are not listed in this copy of the card. Training and evaluation results per epoch:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.8683	1.0	1337	0.9044	0.7444	0.2592	0.5817	0.4852	-429.5133	-506.8889	-50.2012	-45.4443
0.4795	2.0	2674	0.9425	0.1993	-0.4639	0.5959	0.6632	-436.7442	-512.3394	-54.4344	-49.5827
0.1485	3.0	4011	1.1159	-2.0134	-2.6798	0.5775	0.6664	-458.9030	-534.4666	-70.3363	-65.4014
0.0378	4.0	5348	1.3151	-3.6174	-4.7588	0.5927	1.1415	-479.6934	-550.5060	-70.8835	-65.6636
0.0127	5.0	6685	1.4381	-4.8640	-6.0585	0.5822	1.1945	-492.6903	-562.9730	-70.3612	-64.6966
0.0006	6.0	8022	1.5074	-5.3161	-6.4742	0.5837	1.1581	-496.8472	-567.4940	-70.7820	-64.9708
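
The reward columns above can be sanity-checked with a short script. This is a minimal sketch, assuming that Rewards/margins is the difference between Rewards/chosen and Rewards/rejected (the usual convention in DPO logging; the card itself does not state it). The names `ROWS`, `margin_errors`, and `best_epoch_by_val_loss` are hypothetical helpers introduced for illustration, and the numbers are copied from the table.

```python
# Rows copied from the results table above:
# (epoch, validation loss, rewards/chosen, rewards/rejected, rewards/margins)
ROWS = [
    (1.0, 0.9044, 0.7444, 0.2592, 0.4852),
    (2.0, 0.9425, 0.1993, -0.4639, 0.6632),
    (3.0, 1.1159, -2.0134, -2.6798, 0.6664),
    (4.0, 1.3151, -3.6174, -4.7588, 1.1415),
    (5.0, 1.4381, -4.8640, -6.0585, 1.1945),
    (6.0, 1.5074, -5.3161, -6.4742, 1.1581),
]


def margin_errors(rows):
    """Absolute gap between (chosen - rejected) and the reported margin, per row."""
    return [
        abs((chosen - rejected) - margin)
        for _epoch, _val_loss, chosen, rejected, margin in rows
    ]


def best_epoch_by_val_loss(rows):
    """Epoch whose validation loss is lowest."""
    return min(rows, key=lambda row: row[1])[0]


# Assumption being checked: margin == chosen - rejected, up to rounding.
max_margin_error = max(margin_errors(ROWS))
best_epoch = best_epoch_by_val_loss(ROWS)

print(f"largest margin mismatch: {max_margin_error:.4f}")
print(f"epoch with lowest validation loss: {best_epoch}")
```

Under that assumption the reported margins agree with the difference of the two reward columns to within rounding of the fourth decimal place. The same rows show validation loss at its lowest after epoch 1 and rising in every later epoch while training loss keeps falling, which may indicate overfitting; the card does not say which checkpoint was kept.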