
zephyr-7b-dpo-qlora

This model is a DPO fine-tuned version of ale-bay/zephyr-7b-sft-qlora on the HuggingFaceH4/ultrafeedback_binarized dataset (a hedged loading sketch follows the metrics below). It achieves the following results on the evaluation set:

  • Loss: 0.4975
  • Rewards/chosen: -2.4549
  • Rewards/rejected: -3.4757
  • Rewards/accuracies: 0.7490
  • Rewards/margins: 1.0207
  • Logps/rejected: -595.2866
  • Logps/chosen: -517.1966
  • Logits/rejected: -1.3432
  • Logits/chosen: -1.4358
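
As a rough sketch (not an official snippet from the authors), the adapter can presumably be loaded with PEFT on top of its base checkpoint in 4-bit, matching the QLoRA setup implied by the name. The repo id comes from this card; the quantization settings and the prompt are assumptions.

```python
import torch
from transformers import AutoTokenizer, BitsAndBytesConfig
from peft import AutoPeftModelForCausalLM

adapter_id = "ale-bay/zephyr-7b-dpo-qlora"

# 4-bit NF4 quantization, a typical QLoRA setup (assumed, not stated in the card).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# AutoPeftModelForCausalLM resolves the base model from the adapter config,
# then attaches the LoRA weights on top of it.
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(adapter_id)

# Zephyr-style models expect a chat template; apply it before generation.
messages = [{"role": "user", "content": "Explain Direct Preference Optimization in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```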

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a sketch mapping them onto `TrainingArguments` follows the list):

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 2
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 32
  • total_eval_batch_size: 16
  • optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
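
As a rough illustration (not the authors' training script), these settings map onto Hugging Face `TrainingArguments` along the lines below. Anything not listed above, such as the output directory and the precision, is a placeholder or an assumption.

```python
from transformers import TrainingArguments

# Per-device sizes combine with 2 GPUs and 4 accumulation steps to give the
# effective batch sizes listed above: 4 * 2 * 4 = 32 for training and
# 8 * 2 = 16 for evaluation.
training_args = TrainingArguments(
    output_dir="zephyr-7b-dpo-qlora",  # placeholder
    learning_rate=5e-6,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    seed=42,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    optim="adamw_torch",  # Adam defaults: betas=(0.9, 0.999), eps=1e-8
    bf16=True,            # assumption; precision is not stated in the card
)
```

In a typical DPO setup these arguments would then be passed to a preference-optimization trainer such as trl's `DPOTrainer`, though the card does not state which trainer was used.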

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---------------|-------|------|-----------------|----------------|------------------|--------------------|-----------------|----------------|--------------|-----------------|---------------|
| 0.6641        | 0.05  | 100  | 0.6636          | 0.0054         | -0.0681          | 0.6900             | 0.0735          | -254.5337      | -271.1659    | -2.0436         | -2.1368       |
| 0.6105        | 0.1   | 200  | 0.6075          | -0.3236        | -0.5938          | 0.6890             | 0.2702          | -307.0967      | -304.0613    | -2.0030         | -2.0919       |
| 0.5883        | 0.16  | 300  | 0.5817          | -0.7122        | -1.1286          | 0.7020             | 0.4164          | -360.5768      | -342.9188    | -1.9914         | -2.0761       |
| 0.5651        | 0.21  | 400  | 0.5665          | -0.7901        | -1.2897          | 0.7250             | 0.4996          | -376.6874      | -350.7093    | -1.9001         | -1.9820       |
| 0.5136        | 0.26  | 500  | 0.5520          | -1.0330        | -1.6646          | 0.7190             | 0.6316          | -414.1808      | -374.9992    | -1.8081         | -1.8880       |
| 0.5587        | 0.31  | 600  | 0.5327          | -1.3215        | -2.0089          | 0.7320             | 0.6874          | -448.6079      | -403.8534    | -1.4665         | -1.5609       |
| 0.5167        | 0.37  | 700  | 0.5299          | -1.2797        | -2.1992          | 0.7230             | 0.9196          | -467.6413      | -399.6684    | -1.3918         | -1.4903       |
| 0.5465        | 0.42  | 800  | 0.5189          | -1.6646        | -2.4686          | 0.7200             | 0.8041          | -494.5844      | -438.1617    | -1.3685         | -1.4642       |
| 0.5002        | 0.47  | 900  | 0.5142          | -1.7844        | -2.7217          | 0.7290             | 0.9373          | -519.8885      | -450.1383    | -1.4179         | -1.5054       |
| 0.5017        | 0.52  | 1000 | 0.5058          | -2.6175        | -3.6120          | 0.7360             | 0.9946          | -608.9218      | -533.4493    | -1.2973         | -1.3948       |
| 0.4966        | 0.58  | 1100 | 0.5043          | -2.0581        | -2.9819          | 0.7370             | 0.9239          | -545.9103      | -477.5080    | -1.3783         | -1.4740       |
| 0.5087        | 0.63  | 1200 | 0.5040          | -2.3715        | -3.3475          | 0.7450             | 0.9760          | -582.4712      | -508.8495    | -1.3331         | -1.4262       |
| 0.4799        | 0.68  | 1300 | 0.5011          | -2.3067        | -3.3444          | 0.7450             | 1.0377          | -582.1562      | -502.3687    | -1.3340         | -1.4277       |
| 0.4606        | 0.73  | 1400 | 0.4991          | -2.5016        | -3.5583          | 0.7430             | 1.0567          | -603.5469      | -521.8631    | -1.3291         | -1.4219       |
| 0.4763        | 0.79  | 1500 | 0.4985          | -2.4979        | -3.5204          | 0.7470             | 1.0225          | -599.7631      | -521.4944    | -1.3394         | -1.4325       |
| 0.5008        | 0.84  | 1600 | 0.4977          | -2.4555        | -3.4719          | 0.7480             | 1.0164          | -594.9102      | -517.2504    | -1.3492         | -1.4415       |
| 0.4654        | 0.89  | 1700 | 0.4976          | -2.4498        | -3.4672          | 0.7510             | 1.0174          | -594.4417      | -516.6852    | -1.3478         | -1.4402       |
| 0.4854        | 0.94  | 1800 | 0.4975          | -2.4526        | -3.4731          | 0.7480             | 1.0205          | -595.0339      | -516.9640    | -1.3441         | -1.4366       |
| 0.4879        | 0.99  | 1900 | 0.4974          | -2.4531        | -3.4740          | 0.7500             | 1.0209          | -595.1221      | -517.0148    | -1.3432         | -1.4359       |

Framework versions

  • PEFT 0.7.1
  • Transformers 4.39.3
  • Pytorch 2.3.0+cu121
  • Datasets 2.19.1
  • Tokenizers 0.15.2
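
For reproducibility, a pinned install matching these versions might look like the following; the CUDA 12.1 wheel index is an assumption based on the `+cu121` tag.

```shell
pip install peft==0.7.1 transformers==4.39.3 datasets==2.19.1 tokenizers==0.15.2
pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cu121
```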