zephyr-7b-dpo-qlora

This model is a QLoRA (PEFT) adapter for alignment-handbook/zephyr-7b-sft-full, trained with DPO on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

  • Loss: 0.5299
  • Rewards/chosen: -3.0720
  • Rewards/rejected: -4.6492
  • Rewards/accuracies: 0.7275
  • Rewards/margins: 1.5772
  • Logps/rejected: -728.1719
  • Logps/chosen: -592.3389
  • Logits/rejected: -1.2212
  • Logits/chosen: -1.3455
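The Rewards/* metrics come from DPO training: each reward is the β-scaled log-probability ratio of the policy against the SFT reference for that completion, and the per-example sigmoid DPO loss is -log σ of the chosen-minus-rejected margin. A minimal illustrative sketch (not the training code; note that the reported eval loss of 0.5299 is a mean over per-example losses, which differs from the loss evaluated at the mean margin because -log σ is nonlinear):

```python
import math

def per_example_dpo_loss(reward_chosen, reward_rejected):
    """Sigmoid DPO loss for one preference pair: -log(sigmoid(margin)),
    where margin = reward_chosen - reward_rejected and each reward is
    beta * (policy log-prob - reference log-prob) for that completion."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Final-checkpoint mean rewards from above: margin = -3.0720 - (-4.6492) = 1.5772
print(round(per_example_dpo_loss(-3.0720, -4.6492), 4))  # ~0.1878
```

The value at the mean margin (~0.19) is well below the reported mean loss (0.5299), which is expected: per-example margins are spread around the mean, and by Jensen's inequality the average of the convex loss exceeds the loss at the average margin.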

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • distributed_type: multi-GPU
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 4
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
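The learning-rate schedule implied by the settings above (cosine decay with a 10% linear warmup) can be sketched as follows. This is a simplified reimplementation of what transformers' get_cosine_schedule_with_warmup computes, not the exact training code; total_steps here is a hypothetical value for illustration:

```python
import math

def learning_rate(step, total_steps, peak_lr=5e-06, warmup_ratio=0.1):
    # Linear warmup from 0 to peak_lr over the first warmup_ratio of
    # training, then cosine decay from peak_lr down to 0.
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With this shape, the rate reaches the 5e-06 peak exactly at the end of warmup and decays smoothly to zero by the final step.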

Training results

Training Loss Epoch Step Validation Loss Rewards/chosen Rewards/rejected Rewards/accuracies Rewards/margins Logps/rejected Logps/chosen Logits/rejected Logits/chosen
0.6917 0.0262 400 0.6917 0.0029 0.0001 0.6350 0.0028 -263.2442 -284.8504 -2.9734 -3.0250
0.6787 0.0523 800 0.6800 0.0242 -0.0033 0.6860 0.0276 -263.5826 -282.7138 -2.9532 -3.0046
0.6348 0.0785 1200 0.6376 -0.0096 -0.1486 0.6755 0.1390 -278.1083 -286.0981 -2.8858 -2.9319
0.629 0.1047 1600 0.6087 -0.3993 -0.6875 0.6785 0.2882 -331.9969 -325.0683 -2.7749 -2.8148
0.5602 0.1309 2000 0.5979 -0.5708 -0.9723 0.6855 0.4014 -360.4759 -342.2224 -2.7488 -2.7916
0.5783 0.1570 2400 0.5952 -0.7444 -1.2632 0.6910 0.5188 -389.5722 -359.5799 -2.6852 -2.7273
0.6364 0.1832 2800 0.6014 -2.0557 -2.8123 0.6970 0.7566 -544.4844 -490.7089 -2.0799 -2.1273
0.6807 0.2094 3200 0.5654 -2.1440 -3.0639 0.7030 0.9199 -569.6410 -499.5395 -1.6977 -1.7604
0.6616 0.2355 3600 0.5712 -2.9371 -3.9619 0.7165 1.0247 -659.4373 -578.8513 -1.2775 -1.3472
0.4475 0.2617 4000 0.5522 -2.1606 -3.0883 0.7250 0.9277 -572.0762 -501.1973 -1.6222 -1.6801
0.5934 0.2879 4400 0.5452 -2.0993 -3.0686 0.7150 0.9693 -570.1054 -495.0656 -1.5863 -1.6559
0.5422 0.3141 4800 0.5520 -2.7041 -3.8442 0.7220 1.1401 -647.6720 -555.5510 -1.5167 -1.5930
0.6307 0.3402 5200 0.5378 -2.2755 -3.3838 0.7285 1.1083 -601.6280 -512.6918 -1.6752 -1.7599
0.7039 0.3664 5600 0.5306 -1.7946 -2.8494 0.7250 1.0548 -548.1910 -464.5987 -1.6121 -1.6982
0.6561 0.3926 6000 0.5516 -2.6777 -4.0196 0.7205 1.3418 -665.2089 -552.9131 -1.6257 -1.7129
0.5698 0.4188 6400 0.5181 -2.1847 -3.1985 0.7365 1.0138 -583.0958 -503.6094 -1.6584 -1.7391
0.5919 0.4449 6800 0.5219 -1.9491 -3.1280 0.7195 1.1790 -576.0514 -480.0444 -1.6888 -1.7826
0.6161 0.4711 7200 0.5417 -2.7779 -4.2107 0.7335 1.4328 -684.3200 -562.9326 -1.4277 -1.5325
0.4585 0.4973 7600 0.5326 -2.4424 -3.8173 0.7355 1.3748 -644.9775 -529.3820 -1.5104 -1.6091
0.7168 0.5234 8000 0.5298 -2.7451 -4.1021 0.7390 1.3569 -673.4548 -559.6511 -1.3613 -1.4625
0.7179 0.5496 8400 0.5450 -3.1455 -4.6991 0.7330 1.5536 -733.1592 -599.6882 -1.2796 -1.3950
0.4405 0.5758 8800 0.5088 -1.9634 -3.1323 0.7425 1.1689 -576.4830 -481.4787 -1.5418 -1.6311
0.4464 0.6020 9200 0.5306 -2.5354 -3.9140 0.7325 1.3786 -654.6471 -538.6789 -1.3558 -1.4605
0.43 0.6281 9600 0.5292 -2.7495 -4.1617 0.7335 1.4122 -679.4191 -560.0843 -1.2192 -1.3258
0.48 0.6543 10000 0.5317 -2.5185 -3.9464 0.7245 1.4279 -657.8862 -536.9896 -1.3340 -1.4473
0.7352 0.6805 10400 0.5257 -2.7204 -4.1745 0.7315 1.4541 -680.6992 -557.1738 -1.3220 -1.4356
0.6986 0.7066 10800 0.5242 -2.8515 -4.3094 0.7300 1.4580 -694.1929 -570.2861 -1.2609 -1.3721
0.4944 0.7328 11200 0.5282 -2.8438 -4.3275 0.7320 1.4837 -695.9977 -569.5184 -1.2780 -1.3930
0.3577 0.7590 11600 0.5159 -2.7874 -4.1731 0.7345 1.3857 -680.5639 -563.8783 -1.3489 -1.4592
0.602 0.7852 12000 0.5213 -2.9605 -4.3944 0.7315 1.4339 -702.6897 -581.1863 -1.2926 -1.4077
0.4698 0.8113 12400 0.5320 -3.2528 -4.8286 0.7300 1.5759 -746.1134 -610.4158 -1.1834 -1.3076
0.4796 0.8375 12800 0.5180 -2.7532 -4.1875 0.7325 1.4343 -681.9944 -560.4576 -1.2848 -1.3996
0.4354 0.8637 13200 0.5226 -2.8473 -4.3400 0.7335 1.4927 -697.2530 -569.8687 -1.2477 -1.3671
0.4068 0.8898 13600 0.5262 -3.0065 -4.5462 0.7310 1.5397 -717.8715 -585.7884 -1.2316 -1.3538
0.5134 0.9160 14000 0.5281 -2.9950 -4.5567 0.7300 1.5617 -718.9149 -584.6379 -1.2311 -1.3549
0.7272 0.9422 14400 0.5305 -3.0852 -4.6701 0.7275 1.5849 -730.2634 -593.6614 -1.2166 -1.3417
0.3916 0.9684 14800 0.5299 -3.0770 -4.6548 0.7265 1.5778 -728.7334 -592.8383 -1.2201 -1.3446
0.4814 0.9945 15200 0.5296 -3.0725 -4.6501 0.7280 1.5776 -728.2595 -592.3885 -1.2210 -1.3453

Framework versions

  • PEFT 0.7.1
  • Transformers 4.44.2
  • Pytorch 2.2.2+cu121
  • Datasets 3.2.0
  • Tokenizers 0.19.0
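Because this repository contains a PEFT (QLoRA) adapter rather than full model weights, inference requires loading the base model first and then attaching the adapter. A minimal sketch, untested here (it assumes Hub network access, sufficient GPU memory, and the repo ids as listed on this card):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "alignment-handbook/zephyr-7b-sft-full"  # base model from this card
adapter_id = "daijiao/zephyr-7b-dpo-qlora"         # this adapter

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, adapter_id)  # attach the DPO LoRA weights

prompt = "Explain DPO in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

For a merged standalone model, `model.merge_and_unload()` folds the adapter weights into the base model after loading.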