zephyr-7b-dpo-full

This model is a fine-tuned version of alignment-handbook/zephyr-7b-sft-full on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

Loss: 0.4893
Rewards/chosen: -1.9379
Rewards/rejected: -3.0213
Rewards/accuracies: 0.7718
Rewards/margins: 1.0835
Logps/rejected: -563.9073
Logps/chosen: -477.8896
Logits/rejected: 0.6827
Logits/chosen: -0.4606
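
The reward figures above follow the usual DPO bookkeeping: the implicit reward of a response is a scaling factor times the gap between the policy and reference log-probabilities, the margin is the chosen reward minus the rejected reward, and the accuracy is the fraction of pairs where the chosen reward is higher. The sketch below illustrates that bookkeeping in plain Python. The function name `dpo_metrics`, the default `beta=0.1`, and the sample log-probabilities are assumptions for illustration only; none of them come from this training run.

```python
import math


def dpo_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Compute DPO loss and reward statistics from per-example sequence log-probs.

    beta=0.1 is an illustrative default, not the value used for this model.
    """
    # Implicit rewards: beta * (policy log-prob - reference log-prob).
    chosen_rewards = [beta * (p - r) for p, r in zip(policy_chosen, ref_chosen)]
    rejected_rewards = [beta * (p - r) for p, r in zip(policy_rejected, ref_rejected)]
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]

    # Sigmoid DPO loss per pair: -log(sigmoid(margin)) == log(1 + exp(-margin)).
    losses = [math.log1p(math.exp(-m)) for m in margins]

    n = len(margins)
    return {
        "loss": sum(losses) / n,
        "rewards/chosen": sum(chosen_rewards) / n,
        "rewards/rejected": sum(rejected_rewards) / n,
        "rewards/margins": sum(margins) / n,
        "rewards/accuracies": sum(1 for m in margins if m > 0) / n,
    }


# Made-up log-probs for two preference pairs (hypothetical values).
metrics = dpo_metrics(
    policy_chosen=[-10.0, -20.0],
    policy_rejected=[-15.0, -18.0],
    ref_chosen=[-12.0, -19.0],
    ref_rejected=[-12.0, -19.0],
)
```

The reported numbers are consistent with this definition of the margin: -1.9379 - (-3.0213) = 1.0834, which matches the listed Rewards/margins of 1.0835 up to rounding. The reported loss cannot be recovered from the mean margin alone, because the loss is averaged per pair before aggregation.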

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-07
train_batch_size: 8
eval_batch_size: 8
seed: 42
distributed_type: multi-GPU
num_devices: 4
gradient_accumulation_steps: 2
total_train_batch_size: 64
total_eval_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1
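
The totals in the list above follow from the per-device settings: 8 samples per device × 4 devices × 2 gradient accumulation steps gives 64 for training, and 8 × 4 gives 32 for evaluation. The sketch below reproduces that arithmetic and shows one common shape for a cosine schedule with a 0.1 warmup ratio (linear warmup to the peak rate, then cosine decay to zero). The helper names and `total_steps=1000` are hypothetical; the card does not state the exact number of optimizer steps, and the real scheduler implementation may differ in detail.

```python
import math


def total_batch_size(per_device, num_devices, grad_accum_steps=1):
    """Effective batch size per optimizer step across all devices."""
    return per_device * num_devices * grad_accum_steps


def cosine_lr_with_warmup(step, total_steps, peak_lr=5e-07, warmup_ratio=0.1):
    """Assumed schedule shape: linear warmup from 0, then cosine decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


train_total = total_batch_size(per_device=8, num_devices=4, grad_accum_steps=2)
eval_total = total_batch_size(per_device=8, num_devices=4)

# total_steps=1000 is a placeholder, not the step count of this run.
lr_mid_warmup = cosine_lr_with_warmup(step=50, total_steps=1000)
lr_peak = cosine_lr_with_warmup(step=100, total_steps=1000)
lr_end = cosine_lr_with_warmup(step=1000, total_steps=1000)
```

With these placeholder settings the rate climbs linearly to 5e-07 over the first 10% of steps and decays smoothly to zero by the final step.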

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6338	0.1	100	0.6333	-0.4184	-0.6017	0.6865	0.1833	-321.9407	-325.9421	-2.4857	-2.5392
0.5643	0.21	200	0.5547	-1.1977	-1.8547	0.7480	0.6570	-447.2422	-403.8748	0.1190	-0.4672
0.5066	0.31	300	0.5214	-0.9561	-1.7858	0.7778	0.8297	-440.3582	-379.7161	-0.7390	-1.4155
0.4941	0.42	400	0.5082	-1.2581	-2.1325	0.7599	0.8744	-475.0238	-409.9142	0.1688	-0.7662
0.506	0.52	500	0.5090	-1.1067	-2.0712	0.7639	0.9645	-468.8966	-394.7739	1.3983	0.0857
0.4893	0.63	600	0.4953	-1.4696	-2.4963	0.7579	1.0267	-511.4048	-431.0652	0.9613	-0.4181
0.4558	0.73	700	0.4937	-1.8124	-2.8894	0.7698	1.0770	-550.7128	-465.3409	0.6946	-0.4445
0.4781	0.84	800	0.4898	-1.9968	-3.0983	0.7698	1.1015	-571.6086	-483.7863	0.7311	-0.4503
0.495	0.94	900	0.4894	-1.9365	-3.0176	0.7698	1.0812	-563.5378	-477.7505	0.6757	-0.4642

Framework versions

Transformers 4.36.2
PyTorch 2.1.2+cu118
Datasets 2.14.6
Tokenizers 0.15.0
