zephyr-7b-dpo-full

This model is a fine-tuned version of glimmerz/zephyr-7b-sft-full. The training dataset is not recorded; the auto-generated card lists it as None. It achieves the following results on the evaluation set:

Loss: 0.7385
Rewards/chosen: -4.7566
Rewards/rejected: -8.6166
Rewards/accuracies: 0.7560
Rewards/margins: 3.8601
Logps/rejected: -315.8341
Logps/chosen: -321.4129
Logits/rejected: -2.2590
Logits/chosen: -2.3620
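
The reward figures above can be cross-checked against each other. The following is a minimal sketch, assuming the margin is defined as the chosen reward minus the rejected reward, as is usual in DPO-style training logs; the card itself does not state the definition, and the variable names are illustrative.

```python
# Evaluation metrics copied from the list above.
rewards_chosen = -4.7566
rewards_rejected = -8.6166
reported_margin = 3.8601

# Assumption: margin = chosen reward - rejected reward, averaged over the eval set.
computed_margin = rewards_chosen - rewards_rejected

# The reported values are rounded to four decimals, so compare with a tolerance.
assert abs(computed_margin - reported_margin) < 1e-3
print(f"computed margin: {computed_margin:.4f}, reported: {reported_margin:.4f}")
```

Under that assumption the two numbers agree to within rounding, which suggests the reported metrics are internally consistent.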

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-07
train_batch_size: 8
eval_batch_size: 4
seed: 42
distributed_type: multi-GPU
num_devices: 4
gradient_accumulation_steps: 2
total_train_batch_size: 64
total_eval_batch_size: 16
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.1
num_epochs: 3
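
The total batch sizes listed above follow from the per-device sizes, the device count, and gradient accumulation. The sketch below reproduces that arithmetic and illustrates what a linear schedule with a 0.1 warmup ratio looks like. The function names and the `total_steps=1000` value are hypothetical, chosen only for illustration; the card does not list the exact number of optimizer steps, and the trainer's own schedule may round the warmup length differently.

```python
def effective_batch_size(per_device: int, num_devices: int, grad_accum_steps: int = 1) -> int:
    """Number of examples contributing to one optimizer (or evaluation) step."""
    return per_device * num_devices * grad_accum_steps


# Values from the hyperparameter list above.
total_train = effective_batch_size(per_device=8, num_devices=4, grad_accum_steps=2)
total_eval = effective_batch_size(per_device=4, num_devices=4)


def linear_warmup_decay_lr(step: int, total_steps: int,
                           peak_lr: float = 5e-7, warmup_ratio: float = 0.1) -> float:
    """Illustrative linear schedule: ramp up to peak_lr, then decay linearly to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


print(total_train, total_eval)
# total_steps=1000 is a made-up value for illustration only.
for s in (0, 50, 100, 550, 1000):
    print(s, linear_warmup_decay_lr(s, total_steps=1000))
```

The computed totals match the listed total_train_batch_size of 64 and total_eval_batch_size of 16.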

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.575	0.1	100	0.5309	-0.0101	-0.6034	0.7460	0.5933	-235.7018	-273.9487	-2.6525	-2.7458
0.4759	0.21	200	0.4943	-0.0642	-1.0829	0.75	1.0187	-240.4966	-274.4892	-2.7066	-2.8006
0.5022	0.31	300	0.4824	-0.1526	-1.2517	0.7620	1.0991	-242.1845	-275.3735	-2.7362	-2.8225
0.5282	0.41	400	0.4878	-0.6794	-1.9420	0.7840	1.2626	-249.0876	-280.6413	-2.7023	-2.7924
0.5179	0.52	500	0.4805	-0.2645	-1.4485	0.7760	1.1841	-244.1532	-276.4918	-2.6773	-2.7631
0.4705	0.62	600	0.4715	-0.3016	-1.5766	0.7560	1.2750	-245.4337	-276.8629	-2.7009	-2.7838
0.5038	0.72	700	0.4790	-0.3119	-1.5731	0.7680	1.2612	-245.3986	-276.9666	-2.5409	-2.6269
0.4418	0.83	800	0.4665	-0.4564	-2.0177	0.7800	1.5612	-249.8442	-278.4113	-2.4834	-2.5636
0.5155	0.93	900	0.4770	-0.3715	-1.7079	0.7740	1.3364	-246.7468	-277.5622	-2.5118	-2.5927
0.3463	1.03	1000	0.4755	-0.5305	-1.8263	0.7680	1.2958	-247.9306	-279.1520	-2.6282	-2.7083
0.1266	1.14	1100	0.4924	-1.0131	-2.8651	0.7740	1.8519	-258.3182	-283.9783	-2.5584	-2.6430
0.0751	1.24	1200	0.5208	-1.4508	-3.6646	0.7760	2.2138	-266.3139	-288.3549	-2.5574	-2.6450
0.0306	1.34	1300	0.5779	-2.1463	-4.7450	0.7580	2.5987	-277.1172	-295.3102	-2.4957	-2.5865
0.031	1.45	1400	0.5993	-2.6730	-5.3111	0.7580	2.6381	-282.7792	-300.5774	-2.5157	-2.6051
0.0535	1.55	1500	0.5731	-2.1627	-4.7943	0.75	2.6316	-277.6110	-295.4747	-2.5616	-2.6529
0.063	1.65	1600	0.5433	-1.9823	-4.5765	0.7580	2.5942	-275.4325	-293.6702	-2.5038	-2.5985
0.0423	1.76	1700	0.5821	-2.6553	-5.4183	0.7540	2.7630	-283.8502	-300.3999	-2.4636	-2.5654
0.0559	1.86	1800	0.5657	-2.5801	-5.2643	0.7520	2.6842	-282.3106	-299.6483	-2.4843	-2.5741
0.0468	1.96	1900	0.5759	-2.4597	-5.2907	0.7480	2.8309	-282.5742	-298.4443	-2.4491	-2.5392
0.0576	2.07	2000	0.5614	-2.5997	-5.3232	0.7620	2.7235	-282.8997	-299.8446	-2.4132	-2.5016
0.0135	2.17	2100	0.6182	-3.1988	-6.3849	0.7640	3.1861	-293.5166	-305.8354	-2.4052	-2.5040
0.0149	2.27	2200	0.7075	-4.5960	-8.1955	0.7420	3.5995	-311.6229	-319.8072	-2.3535	-2.4494
0.0095	2.37	2300	0.7117	-4.2102	-7.7788	0.7540	3.5686	-307.4559	-315.9493	-2.2943	-2.3972
0.0104	2.48	2400	0.7131	-4.3371	-7.9252	0.7540	3.5881	-308.9199	-317.2180	-2.3097	-2.4097
0.008	2.58	2500	0.7328	-4.4361	-8.1696	0.7520	3.7335	-311.3636	-318.2084	-2.2756	-2.3764
0.0051	2.68	2600	0.7193	-4.2884	-7.9892	0.7600	3.7009	-309.5601	-316.7311	-2.3138	-2.4185
0.0089	2.79	2700	0.7388	-4.8991	-8.6552	0.7660	3.7561	-316.2196	-322.8380	-2.2942	-2.3960
0.0082	2.89	2800	0.7342	-4.7984	-8.6596	0.7640	3.8612	-316.2638	-321.8309	-2.2620	-2.3649
0.0094	2.99	2900	0.7374	-4.7573	-8.6168	0.7580	3.8595	-315.8361	-321.4205	-2.2595	-2.3625
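
In the table above, validation loss bottoms out during the first epoch and then climbs while training loss falls toward 0.01, a pattern that is often a sign of overfitting. The sketch below locates the evaluation step with the lowest validation loss, using the Validation Loss column copied from the table. It assumes evaluations were logged every 100 steps, as the Step column shows; the variable names are illustrative.

```python
# Validation Loss column from the table above, in step order (100, 200, ..., 2900).
val_loss = [
    0.5309, 0.4943, 0.4824, 0.4878, 0.4805, 0.4715, 0.4790, 0.4665, 0.4770, 0.4755,
    0.4924, 0.5208, 0.5779, 0.5993, 0.5731, 0.5433, 0.5821, 0.5657, 0.5759, 0.5614,
    0.6182, 0.7075, 0.7117, 0.7131, 0.7328, 0.7193, 0.7388, 0.7342, 0.7374,
]
steps = [100 * (i + 1) for i in range(len(val_loss))]

# Pick the (step, loss) pair with the smallest validation loss.
best_step, best_loss = min(zip(steps, val_loss), key=lambda pair: pair[1])
final_step, final_loss = steps[-1], val_loss[-1]

print(f"best:  step {best_step}, validation loss {best_loss}")
print(f"final: step {final_step}, validation loss {final_loss}")
```

By this measure the lowest validation loss occurs at step 800, well before the final step. Rewards/accuracies stay in a narrow band (roughly 0.74 to 0.78) throughout, so validation loss alone may not capture which checkpoint is preferable for a given use.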

Framework versions

Transformers 4.35.2
PyTorch 2.1.0
Datasets 2.15.0
Tokenizers 0.15.0
