zephyr-7b-dpo-full

This model is a fine-tuned version of alignment-handbook/zephyr-7b-sft-full. The training dataset is not specified. It achieves the following results on the evaluation set:

Loss: 0.6874
Rewards/chosen: -4.3150
Rewards/rejected: -8.0704
Rewards/accuracies: 0.7857
Rewards/margins: 3.7554
Logps/rejected: -325.6119
Logps/chosen: -339.6828
Logits/rejected: -2.6781
Logits/chosen: -2.7397
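
The metric names above match those usually logged for DPO (direct preference optimization) training. As a rough sketch of how such numbers relate to each other, the snippet below computes them from per-example sequence log-probabilities, assuming the standard sigmoid DPO loss. The function name dpo_metrics and the default beta=0.1 are illustrative assumptions; this card does not state the beta that was used. The final lines only check that the reported margin equals the reported chosen reward minus the reported rejected reward.

```python
import math

def dpo_metrics(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Batch-level DPO statistics from sequence log-probabilities.

    Hypothetical helper for illustration; beta=0.1 is an assumed value,
    not one taken from this model card.
    """
    # Implicit reward: beta * (policy log-prob - reference log-prob)
    rewards_chosen = [beta * (p - r) for p, r in zip(pol_chosen, ref_chosen)]
    rewards_rejected = [beta * (p - r) for p, r in zip(pol_rejected, ref_rejected)]
    margins = [c - r for c, r in zip(rewards_chosen, rewards_rejected)]
    # Sigmoid DPO loss -log(sigmoid(m)), written as a numerically stable softplus(-m)
    losses = [max(-m, 0.0) + math.log1p(math.exp(-abs(m))) for m in margins]
    n = len(margins)
    return {
        "loss": sum(losses) / n,
        "rewards/chosen": sum(rewards_chosen) / n,
        "rewards/rejected": sum(rewards_rejected) / n,
        # Fraction of pairs where the chosen response gets the higher reward
        "rewards/accuracies": sum(m > 0 for m in margins) / n,
        "rewards/margins": sum(margins) / n,
    }

# Consistency check on the evaluation numbers reported above
reported_margin = -4.3150 - (-8.0704)
print(round(reported_margin, 4))
```

Note that the mean loss is averaged per example, so it cannot be recovered from the mean margin alone.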

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-07
train_batch_size: 8
eval_batch_size: 4
seed: 42
distributed_type: multi-GPU
num_devices: 8
total_train_batch_size: 64
total_eval_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.1
num_epochs: 3
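
The batch-size and scheduler settings above can be sanity-checked with a little arithmetic. The sketch below assumes gradient accumulation of 1 (the card does not list it, but 8 per device times 8 devices already gives 64) and models the usual linear-warmup, linear-decay shape. The helper names and the total step count of 2910 are assumptions: the step count is a rough estimate read off the training results table (step 2900 at epoch 2.99), not a value stated in this card.

```python
import math

def effective_batch_size(per_device, num_devices, grad_accum_steps=1):
    # grad_accum_steps=1 is an assumption; the card does not list it
    return per_device * num_devices * grad_accum_steps

def linear_lr(step, total_steps, peak_lr=5e-7, warmup_ratio=0.1):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up from 0 to peak_lr over the warmup steps
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

print(effective_batch_size(8, 8))  # train: 8 per device on 8 devices
print(effective_batch_size(4, 8))  # eval: 4 per device on 8 devices

ESTIMATED_TOTAL_STEPS = 2910  # assumption, estimated from the results table
for step in (0, 100, 1000, 2900):
    print(step, linear_lr(step, ESTIMATED_TOTAL_STEPS))
```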

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.5397	0.1	100	0.5211	0.1287	-0.6851	0.7579	0.8138	-251.7586	-295.2458	-2.9742	-3.0033
0.4919	0.21	200	0.4873	0.0278	-1.1599	0.7897	1.1876	-256.5061	-296.2552	-3.0688	-3.0898
0.4802	0.31	300	0.5027	-0.2234	-1.3257	0.7540	1.1023	-258.1646	-298.7669	-3.0494	-3.0828
0.5134	0.41	400	0.5098	-0.2878	-1.6709	0.7698	1.3832	-261.6169	-299.4102	-2.8843	-2.9179
0.4534	0.52	500	0.4905	-0.1808	-1.6336	0.7698	1.4528	-261.2433	-298.3406	-2.9804	-3.0182
0.4976	0.62	600	0.4872	-0.2273	-1.5386	0.7659	1.3112	-260.2931	-298.8059	-2.9266	-2.9730
0.5452	0.72	700	0.4888	-0.4813	-1.6851	0.7341	1.2039	-261.7586	-301.3452	-2.9377	-2.9686
0.5342	0.83	800	0.4774	-0.3705	-1.9222	0.7857	1.5517	-264.1292	-300.2377	-2.8434	-2.8821
0.5014	0.93	900	0.4814	-0.2397	-1.6794	0.7619	1.4397	-261.7013	-298.9296	-2.8339	-2.8781
0.0785	1.03	1000	0.4821	-0.6486	-2.5221	0.7659	1.8735	-270.1282	-303.0184	-2.7561	-2.8068
0.0883	1.14	1100	0.5074	-1.3177	-3.3355	0.7540	2.0178	-278.2621	-309.7097	-2.7831	-2.8337
0.086	1.24	1200	0.5001	-1.1250	-3.2622	0.7540	2.1372	-277.5298	-307.7827	-2.7876	-2.8347
0.0919	1.34	1300	0.5054	-1.3872	-3.5531	0.8016	2.1659	-280.4383	-310.4045	-2.7662	-2.8076
0.105	1.44	1400	0.5085	-1.5140	-3.6281	0.7817	2.1141	-281.1881	-311.6723	-2.7877	-2.8291
0.0714	1.55	1500	0.5216	-1.8642	-4.0538	0.7460	2.1896	-285.4451	-315.1745	-2.7888	-2.8331
0.0874	1.65	1600	0.5050	-1.5077	-3.7276	0.7421	2.2199	-282.1837	-311.6096	-2.7751	-2.8315
0.063	1.75	1700	0.5350	-1.9441	-4.4422	0.7857	2.4980	-289.3290	-315.9738	-2.7470	-2.8054
0.0786	1.86	1800	0.5376	-2.0344	-4.4236	0.7698	2.3892	-289.1434	-316.8769	-2.7544	-2.8120
0.1117	1.96	1900	0.5335	-1.9236	-4.0369	0.7817	2.1133	-285.2767	-315.7684	-2.8365	-2.8858
0.0175	2.06	2000	0.5882	-2.8256	-5.7651	0.7619	2.9396	-302.5587	-324.7882	-2.7736	-2.8336
0.0145	2.17	2100	0.6160	-3.1789	-6.2515	0.7659	3.0725	-307.4222	-328.3220	-2.7453	-2.8019
0.0109	2.27	2200	0.6675	-3.8634	-7.3412	0.7659	3.4777	-318.3191	-335.1671	-2.7136	-2.7758
0.0144	2.37	2300	0.6555	-3.6832	-7.0603	0.7738	3.3770	-315.5101	-333.3649	-2.6841	-2.7460
0.0103	2.48	2400	0.6598	-3.7543	-7.1773	0.7579	3.4230	-316.6805	-334.0755	-2.6255	-2.6922
0.0085	2.58	2500	0.7044	-4.5468	-8.3313	0.7659	3.7845	-328.2202	-342.0003	-2.6245	-2.6937
0.0077	2.68	2600	0.6755	-3.9908	-7.6767	0.7857	3.6859	-321.6741	-336.4403	-2.6716	-2.7350
0.0098	2.79	2700	0.6890	-4.1853	-7.8875	0.7778	3.7022	-323.7826	-338.3858	-2.6895	-2.7518
0.0126	2.89	2800	0.6889	-4.2792	-8.0158	0.7778	3.7366	-325.0659	-339.3250	-2.6752	-2.7376
0.0078	2.99	2900	0.6886	-4.3139	-8.0732	0.7738	3.7593	-325.6390	-339.6714	-2.6788	-2.7404

Framework versions

Transformers 4.35.0
Pytorch 2.1.0
Datasets 2.14.6
Tokenizers 0.14.1
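
With library versions close to the ones listed above, loading the checkpoint could look like the minimal sketch below. It assumes the weights are published under the repository ID dongwang218/zephyr-7b-dpo-full and that a GPU with bfloat16 support is available. The helper names load_model and generate, the prompt handling, and the generation settings are illustrative and are not taken from this card.

```python
REPO_ID = "dongwang218/zephyr-7b-dpo-full"  # assumed Hub repository ID

def load_model(repo_id=REPO_ID):
    """Load tokenizer and model; imports are local so this file parses without them."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        torch_dtype=torch.bfloat16,  # assumption: hardware supports bfloat16
    )
    return tokenizer, model

def generate(prompt, max_new_tokens=64):
    """Greedy generation from a raw text prompt (illustrative settings)."""
    tokenizer, model = load_model()
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling generate("...") downloads roughly 7B parameters of weights on first use, so it is best run on a machine with sufficient disk space and memory.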

Model repository: dongwang218/zephyr-7b-dpo-full