zephyr-7b-dpo-qlora

This model is a fine-tuned version of alignment-handbook/zephyr-7b-sft-qlora on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

  • Loss: 0.5325
  • Rewards/chosen: -1.2325
  • Rewards/rejected: -2.0565
  • Rewards/accuracies: 0.7656
  • Rewards/margins: 0.8240
  • Logps/rejected: -457.4398
  • Logps/chosen: -373.4022
  • Logits/rejected: 0.7596
  • Logits/chosen: 0.5001
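
These reward metrics are not defined in the card itself. Assuming trl's standard DPO logging convention, they derive from the implicit DPO reward of a completion $y$ given a prompt $x$,

$$ r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}, \qquad \mathcal{L}_{\mathrm{DPO}} = -\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big), $$

where $\pi_{\mathrm{ref}}$ is the frozen SFT reference model and $\beta$ is the DPO temperature. Rewards/chosen and Rewards/rejected are the mean implicit rewards of the chosen ($y_w$) and rejected ($y_l$) completions, Rewards/margins is their difference, and Rewards/accuracies is the fraction of pairs whose chosen reward exceeds the rejected one.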

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • total_train_batch_size: 32
  • total_eval_batch_size: 64
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
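
The training script itself is not included in the card. As a rough sketch, the listed values map onto transformers.TrainingArguments as below; the output path is illustrative, the "Adam" entry corresponds to the default AdamW optimizer settings, and the DPO-specific wiring (e.g. trl's DPOTrainer and its beta) is not recorded in this card:

```python
from transformers import TrainingArguments

# Hedged sketch mapping the hyperparameters above onto TrainingArguments.
# With 8 devices and no gradient accumulation, per-device batch sizes of
# 4 (train) and 8 (eval) give the listed totals of 32 and 64.
training_args = TrainingArguments(
    output_dir="zephyr-7b-dpo-qlora",  # illustrative path
    learning_rate=5e-6,
    per_device_train_batch_size=4,     # x 8 GPUs = 32 total
    per_device_eval_batch_size=8,      # x 8 GPUs = 64 total
    seed=42,
    optim="adamw_torch",               # betas=(0.9, 0.999), epsilon=1e-08 by default
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
)
```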

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6916 | 0.05 | 100 | 0.6912 | 0.0059 | 0.0019 | 0.6484 | 0.0041 | -251.6075 | -249.5596 | -2.2040 | -2.2621 |
| 0.655 | 0.1 | 200 | 0.6498 | -0.0559 | -0.1762 | 0.7070 | 0.1203 | -269.4106 | -255.7421 | -2.1011 | -2.1614 |
| 0.6342 | 0.16 | 300 | 0.6146 | -0.3407 | -0.6269 | 0.7031 | 0.2862 | -314.4839 | -284.2224 | -1.9037 | -1.9793 |
| 0.6121 | 0.21 | 400 | 0.5946 | -0.4657 | -0.8916 | 0.7031 | 0.4259 | -340.9551 | -296.7203 | -1.8717 | -1.9543 |
| 0.5973 | 0.26 | 500 | 0.5938 | -0.3681 | -0.7766 | 0.7305 | 0.4085 | -329.4522 | -286.9666 | -1.8440 | -1.9282 |
| 0.5473 | 0.31 | 600 | 0.5774 | -0.6893 | -1.2264 | 0.7344 | 0.5371 | -374.4341 | -319.0812 | -1.6815 | -1.7726 |
| 0.5792 | 0.37 | 700 | 0.5709 | -0.6635 | -1.2100 | 0.7578 | 0.5465 | -372.7989 | -316.5072 | -1.4783 | -1.5775 |
| 0.5194 | 0.42 | 800 | 0.5590 | -1.0208 | -1.6453 | 0.7461 | 0.6245 | -416.3269 | -352.2357 | -0.3791 | -0.5486 |
| 0.5367 | 0.47 | 900 | 0.5492 | -1.1477 | -1.8521 | 0.7266 | 0.7044 | -437.0040 | -364.9276 | -0.0908 | -0.2899 |
| 0.5575 | 0.52 | 1000 | 0.5450 | -1.1704 | -1.9048 | 0.7344 | 0.7344 | -442.2755 | -367.1964 | 0.2761 | 0.0498 |
| 0.5507 | 0.58 | 1100 | 0.5429 | -1.1040 | -1.8671 | 0.7422 | 0.7631 | -438.5026 | -360.5551 | 0.5339 | 0.2877 |
| 0.5305 | 0.63 | 1200 | 0.5366 | -1.1557 | -1.9243 | 0.7578 | 0.7686 | -444.2217 | -365.7241 | 0.7350 | 0.4755 |
| 0.5171 | 0.68 | 1300 | 0.5304 | -1.3741 | -2.1678 | 0.7656 | 0.7937 | -468.5735 | -387.5681 | 0.7686 | 0.5029 |
| 0.4875 | 0.73 | 1400 | 0.5321 | -1.3228 | -2.1513 | 0.7578 | 0.8285 | -466.9267 | -382.4329 | 0.8566 | 0.5926 |
| 0.5216 | 0.78 | 1500 | 0.5326 | -1.2006 | -2.0034 | 0.7617 | 0.8028 | -452.1298 | -370.2103 | 0.7189 | 0.4630 |
| 0.4894 | 0.84 | 1600 | 0.5327 | -1.2300 | -2.0556 | 0.7656 | 0.8256 | -457.3565 | -373.1585 | 0.7405 | 0.4828 |
| 0.5179 | 0.89 | 1700 | 0.5326 | -1.2313 | -2.0558 | 0.7656 | 0.8245 | -457.3720 | -373.2860 | 0.7604 | 0.5012 |
| 0.5534 | 0.94 | 1800 | 0.5325 | -1.2309 | -2.0558 | 0.7656 | 0.8249 | -457.3779 | -373.2437 | 0.7550 | 0.4957 |
| 0.5539 | 0.99 | 1900 | 0.5325 | -1.2325 | -2.0565 | 0.7656 | 0.8240 | -457.4398 | -373.4022 | 0.7596 | 0.5001 |

Framework versions

  • PEFT 0.7.1
  • Transformers 4.36.2
  • PyTorch 2.1.2+cu121
  • Datasets 2.14.6
  • Tokenizers 0.15.0
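
For reference, a minimal loading sketch pinned to the versions above. The repo ids come from this card; the assumption that tokenizer files ship with the adapter repo, and the prompt itself, are illustrative:

```python
# pip install "peft==0.7.1" "transformers==4.36.2"  # versions from this card
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_id = "lewtun/zephyr-7b-dpo-qlora"

# AutoPeftModelForCausalLM reads adapter_config.json, downloads the base
# model it points to, and attaches this LoRA adapter on top of it.
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_id, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(adapter_id)

messages = [{"role": "user", "content": "Explain DPO in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```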