zephyr-7b-dpo-full

This model is a fine-tuned version of alignment-handbook/zephyr-7b-sft-full on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

Loss: 0.5590
Rewards/chosen: -0.7818
Rewards/rejected: -2.7115
Rewards/accuracies: 0.7857
Rewards/margins: 1.9297
Logps/rejected: -287.3273
Logps/chosen: -289.7805
Logits/rejected: -2.4561
Logits/chosen: -2.5007
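
The reported margin can be checked against the two reward values. This is a minimal sketch, assuming the usual convention that Rewards/margins is the mean chosen reward minus the mean rejected reward; the variable names are illustrative and not taken from the training code.

```python
# Final evaluation metrics copied from the list above.
rewards_chosen = -0.7818
rewards_rejected = -2.7115
reported_margin = 1.9297

# Assumption: the margin is the gap between chosen and rejected rewards.
margin = rewards_chosen - rewards_rejected

# The recomputed value should agree with the reported one to 4 decimals.
assert abs(margin - reported_margin) < 1e-4
print(f"margin = {margin:.4f}")
```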

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-07
train_batch_size: 8
eval_batch_size: 8
seed: 42
distributed_type: multi-GPU
num_devices: 4
gradient_accumulation_steps: 2
total_train_batch_size: 64
total_eval_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 2
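
The two total batch sizes follow from the per-device values, the device count, and gradient accumulation. The sketch below reproduces that arithmetic and also illustrates a cosine schedule with linear warmup at the listed peak learning rate and warmup ratio. The function `lr_at` and its `total_steps` argument are hypothetical helpers for illustration; the exact step count and warmup rounding used in training are not stated in this card.

```python
import math

# Values copied from the hyperparameter list above.
train_batch_size = 8            # per device
eval_batch_size = 8             # per device
num_devices = 4
gradient_accumulation_steps = 2

# Effective batch sizes: accumulation applies to training only.
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices
assert total_train_batch_size == 64
assert total_eval_batch_size == 32


def lr_at(step, total_steps, peak_lr=5e-07, warmup_ratio=0.1):
    """Illustrative cosine schedule with linear warmup (hypothetical helper).

    Assumption: warmup length is total_steps * warmup_ratio, rounded to the
    nearest step; the library used for training may round differently.
    """
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Example with an assumed run length of 1000 optimizer steps.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000))
```

With these settings the learning rate rises linearly for the first tenth of the run, peaks at 5e-07, and decays toward zero by the final step.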

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6075	0.1	100	0.5945	0.3241	-0.1206	0.7163	0.4447	-261.4175	-278.7209	-2.6324	-2.6651
0.5341	0.21	200	0.5471	-0.0734	-1.0103	0.7639	0.9369	-270.3152	-282.6963	-2.5394	-2.5779
0.5315	0.31	300	0.5258	0.1435	-0.9757	0.7619	1.1192	-269.9694	-280.5274	-2.5337	-2.5711
0.4978	0.42	400	0.5366	-0.2177	-1.2826	0.7579	1.0649	-273.0383	-284.1391	-2.5667	-2.6011
0.5134	0.52	500	0.5340	-0.4713	-1.5140	0.7460	1.0427	-275.3516	-286.6748	-2.4488	-2.4836
0.5404	0.63	600	0.5188	-0.0534	-1.2981	0.7480	1.2447	-273.1928	-282.4962	-2.3631	-2.4039
0.5256	0.73	700	0.5270	-0.2533	-1.5704	0.7639	1.3172	-275.9163	-284.4948	-2.3224	-2.3640
0.4991	0.84	800	0.5278	-0.2394	-1.5276	0.7639	1.2882	-275.4879	-284.3556	-2.3730	-2.4144
0.5084	0.94	900	0.5457	0.2664	-0.9546	0.7619	1.2210	-269.7581	-279.2981	-2.4875	-2.5254
0.1011	1.05	1000	0.5361	-0.5236	-2.1364	0.7877	1.6129	-281.5762	-287.1976	-2.4389	-2.4774
0.0942	1.15	1100	0.5454	-0.4356	-2.2047	0.7897	1.7691	-282.2592	-286.3182	-2.4515	-2.4926
0.0817	1.26	1200	0.5530	-0.7588	-2.5855	0.7857	1.8268	-286.0674	-289.5495	-2.4441	-2.4863
0.0697	1.36	1300	0.5549	-0.5919	-2.4690	0.7798	1.8771	-284.9021	-287.8810	-2.4474	-2.4910
0.0842	1.47	1400	0.5575	-0.7425	-2.6443	0.7917	1.9018	-286.6550	-289.3871	-2.4669	-2.5100
0.075	1.57	1500	0.5590	-0.5382	-2.4532	0.7956	1.9150	-284.7438	-287.3436	-2.4699	-2.5133
0.098	1.67	1600	0.5583	-0.7761	-2.6741	0.7877	1.8980	-286.9528	-289.7227	-2.4652	-2.5092
0.0718	1.78	1700	0.5593	-0.7532	-2.6704	0.7877	1.9172	-286.9160	-289.4940	-2.4592	-2.5036
0.0828	1.88	1800	0.5606	-0.7985	-2.7306	0.7897	1.9321	-287.5178	-289.9467	-2.4560	-2.5007
0.103	1.99	1900	0.5601	-0.7805	-2.7113	0.7857	1.9309	-287.3255	-289.7666	-2.4554	-2.5002
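
To read the table more easily, the sketch below picks out the checkpoints with the lowest validation loss, the highest reward accuracy, and the largest reward margin. The rows are copied from the table above (Step, Validation Loss, Rewards/accuracies, Rewards/margins only); the variable names are illustrative.

```python
# (step, validation_loss, rewards_accuracy, rewards_margin) from the table.
rows = [
    (100, 0.5945, 0.7163, 0.4447),
    (200, 0.5471, 0.7639, 0.9369),
    (300, 0.5258, 0.7619, 1.1192),
    (400, 0.5366, 0.7579, 1.0649),
    (500, 0.5340, 0.7460, 1.0427),
    (600, 0.5188, 0.7480, 1.2447),
    (700, 0.5270, 0.7639, 1.3172),
    (800, 0.5278, 0.7639, 1.2882),
    (900, 0.5457, 0.7619, 1.2210),
    (1000, 0.5361, 0.7877, 1.6129),
    (1100, 0.5454, 0.7897, 1.7691),
    (1200, 0.5530, 0.7857, 1.8268),
    (1300, 0.5549, 0.7798, 1.8771),
    (1400, 0.5575, 0.7917, 1.9018),
    (1500, 0.5590, 0.7956, 1.9150),
    (1600, 0.5583, 0.7877, 1.8980),
    (1700, 0.5593, 0.7877, 1.9172),
    (1800, 0.5606, 0.7897, 1.9321),
    (1900, 0.5601, 0.7857, 1.9309),
]

# Select the row that optimizes each metric.
best_loss = min(rows, key=lambda r: r[1])
best_accuracy = max(rows, key=lambda r: r[2])
best_margin = max(rows, key=lambda r: r[3])

print("lowest validation loss:", best_loss[1], "at step", best_loss[0])
print("highest accuracy:", best_accuracy[2], "at step", best_accuracy[0])
print("largest margin:", best_margin[3], "at step", best_margin[0])
```

In this table the lowest validation loss (0.5188) occurs at step 600, while accuracy and margin peak later, at steps 1500 and 1800 respectively.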

Framework versions

Transformers 4.36.2
Pytorch 2.1.2
Datasets 2.14.6
Tokenizers 0.15.2
