
gpt-imdb-dpo_annealing

This model is a fine-tuned version of lvwerra/gpt2-imdb on an unknown dataset. It achieves the following results on the evaluation set (a usage sketch follows the metric list):

  • Loss: 0.3482
  • Rewards/chosen: -13.2925
  • Rewards/rejected: -37.2767
  • Rewards/accuracies: 0.9354
  • Rewards/margins: 23.9842
  • Logps/rejected: -302.0002
  • Logps/chosen: -248.9281
  • Logits/rejected: -38.9773
  • Logits/chosen: -40.1868
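
The model can be loaded like any other causal LM from the Hub. A minimal sketch, assuming the repository id matches the card title (replace it with the actual Hub path); the prompt and generation settings are illustrative, not part of the card:

```python
# Minimal generation sketch. The repo id is an assumption based on the card
# title; substitute the actual Hub path if it differs.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt-imdb-dpo_annealing"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "This movie was"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```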

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (see the configuration sketch after this list):

  • learning_rate: 1e-05
  • train_batch_size: 24
  • eval_batch_size: 24
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.99) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 150
  • training_steps: 7197
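
For reference, here is a sketch of how these values map onto transformers.TrainingArguments (the same class trl's DPOTrainer consumes). The output_dir and anything not listed above are assumptions:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gpt-imdb-dpo_annealing",  # hypothetical output path
    learning_rate=1e-5,
    per_device_train_batch_size=24,
    per_device_eval_batch_size=24,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.99,   # note: not Adam's common default of 0.999
    adam_epsilon=1e-8,
    lr_scheduler_type="cosine",
    warmup_steps=150,
    max_steps=7197,
)
```

The second Adam beta (0.99) differs from the common default of 0.999, so it is set explicitly here.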

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.2713        | 0.21  | 500  | 0.3576          | -0.9589        | -2.8806          | 0.8417             | 1.9217          | -300.2507      | -247.4370    | -34.9635        | -36.2514      |
| 0.2605        | 0.42  | 1000 | 0.2876          | -1.8668        | -5.2245          | 0.8708             | 3.3577          | -299.0920      | -247.9165    | -39.8673        | -41.1403      |
| 0.134         | 0.63  | 1500 | 0.2827          | -3.3220        | -8.2599          | 0.8833             | 4.9379          | -301.8662      | -250.6212    | -38.4289        | -39.6488      |
| 0.2246        | 0.83  | 2000 | 0.2412          | -3.0672        | -9.5366          | 0.9000             | 6.4694          | -297.1335      | -246.0230    | -36.9979        | -38.2478      |
| 0.0612        | 1.04  | 2500 | 0.2382          | -4.4276        | -12.4767         | 0.9062             | 8.0491          | -298.9408      | -247.7763    | -38.3549        | -39.5684      |
| 0.2336        | 1.25  | 3000 | 0.2628          | -5.5352        | -15.3372         | 0.9042             | 9.8020          | -299.9716      | -248.3611    | -39.0799        | -40.3999      |
| 0.1755        | 1.46  | 3500 | 0.2670          | -6.0750        | -18.0326         | 0.9229             | 11.9576         | -300.3778      | -247.6266    | -38.3635        | -39.7127      |
| 0.34          | 1.67  | 4000 | 0.2499          | -7.2657        | -20.1377         | 0.9208             | 12.8719         | -299.6307      | -248.2345    | -38.0993        | -39.2549      |
| 0.1822        | 1.88  | 4500 | 0.3000          | -7.9584        | -22.7421         | 0.9271             | 14.7838         | -299.8409      | -247.9176    | -38.7806        | -39.9153      |
| 0.153         | 2.08  | 5000 | 0.2972          | -9.4217        | -26.8046         | 0.9333             | 17.3829         | -302.0991      | -248.7675    | -38.2977        | -39.5006      |
| 0.0004        | 2.29  | 5500 | 0.2962          | -9.6704        | -28.5833         | 0.9354             | 18.9129         | -300.9727      | -247.8805    | -38.6801        | -39.9033      |
| 0.0584        | 2.5   | 6000 | 0.3113          | -11.3462       | -31.8850         | 0.9375             | 20.5388         | -301.8552      | -248.8479    | -38.5484        | -39.7563      |
| 0.0304        | 2.71  | 6500 | 0.3441          | -12.4687       | -34.7986         | 0.9354             | 22.3299         | -302.1741      | -249.0562    | -38.8388        | -40.0519      |
| 0.223         | 2.92  | 7000 | 0.3482          | -13.2925       | -37.2767         | 0.9354             | 23.9842         | -302.0002      | -248.9281    | -38.9773        | -40.1868      |
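
The reward columns follow the standard DPO definitions: each "reward" is the beta-scaled log-probability ratio of the policy against the frozen reference model, Rewards/margins is Rewards/chosen minus Rewards/rejected (e.g. -13.2925 - (-37.2767) = 23.9842 at step 7000), and the loss is the mean negative log-sigmoid of the per-example margin. A minimal sketch of these relationships, assuming the standard DPO objective (the beta value and the log-probs below are illustrative; the card does not report them):

```python
import torch
import torch.nn.functional as F

beta = 0.1  # assumed DPO temperature; not reported in the card

def dpo_metrics(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps):
    # Implicit DPO rewards: beta-scaled log-prob ratios vs. the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margins = chosen_rewards - rejected_rewards                    # Rewards/margins
    loss = -F.logsigmoid(margins).mean()                           # DPO loss
    accuracy = (chosen_rewards > rejected_rewards).float().mean()  # Rewards/accuracies
    return loss, chosen_rewards.mean(), rejected_rewards.mean(), accuracy

# Illustrative per-sequence log-probs (token-level sums), not values from this run.
pol_c = torch.tensor([-240.0, -250.0]); pol_r = torch.tensor([-320.0, -310.0])
ref_c = torch.tensor([-248.0, -249.0]); ref_r = torch.tensor([-300.0, -302.0])
print(dpo_metrics(pol_c, pol_r, ref_c, ref_r))
```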

Framework versions

  • Transformers 4.35.2
  • Pytorch 2.1.1
  • Datasets 2.15.0
  • Tokenizers 0.15.0