zephyr-7b-dpo-full-ultrabin-reward-scale-01

This model is a fine-tuned version of alignment-handbook/zephyr-7b-sft-full on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

Loss: 0.5340
Rewards/chosen: -1.6749
Rewards/rejected: -3.0354
Rewards/accuracies: 0.7852
Rewards/margins: 1.3605
Logps/rejected: -566.2047
Logps/chosen: -430.1231
Logits/rejected: 2.3565
Logits/chosen: 1.3978
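
The reported margin can be cross-checked against the two reward values. The sketch below assumes the metrics follow the usual DPO-style convention, in which Rewards/margins is the mean of (chosen reward minus rejected reward) and therefore equals the difference of the two reported means. The card itself does not state this definition, and the variable names are illustrative only.

```python
# Evaluation-set values copied from the list above.
rewards_chosen = -1.6749
rewards_rejected = -3.0354
reported_margin = 1.3605

# Assumption: margin = mean(chosen - rejected), which by linearity
# equals mean(chosen) - mean(rejected).
margin = rewards_chosen - rewards_rejected

print(f"computed margin: {margin:.4f}")
print(f"reported margin: {reported_margin:.4f}")
```

Under that assumption the two numbers agree to the four decimal places reported.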

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-07
train_batch_size: 8
eval_batch_size: 8
seed: 55
distributed_type: multi-GPU
num_devices: 8
gradient_accumulation_steps: 2
total_train_batch_size: 128
total_eval_batch_size: 64
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1
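
The two total batch sizes follow from the per-device sizes, the device count, and the accumulation steps. A small sketch of that arithmetic, using only the values listed above (the variable names mirror the hyperparameter names and are not part of any API):

```python
# Values copied from the hyperparameter list above.
train_batch_size = 8             # per device
eval_batch_size = 8              # per device
num_devices = 8
gradient_accumulation_steps = 2

# Examples consumed per optimizer step across all devices.
total_train_batch_size = (
    train_batch_size * num_devices * gradient_accumulation_steps
)

# Evaluation does not accumulate gradients, so only the device count applies.
total_eval_batch_size = eval_batch_size * num_devices

print(total_train_batch_size, total_eval_batch_size)
```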

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6687	0.1046	50	0.6493	0.0258	-0.0870	0.7070	0.1128	-271.3634	-260.0484	-2.5783	-2.6158
0.5614	0.2092	100	0.5807	-0.8058	-1.5920	0.7109	0.7862	-421.8647	-343.2120	-0.2227	-0.5290
0.5419	0.3138	150	0.5585	-1.0477	-2.0655	0.7461	1.0179	-469.2165	-367.3957	0.6415	-0.0014
0.526	0.4184	200	0.5562	-1.3989	-2.5435	0.7617	1.1446	-517.0156	-402.5200	1.7427	0.9802
0.5202	0.5230	250	0.5419	-1.1425	-2.3279	0.7891	1.1854	-495.4537	-376.8783	1.4380	0.6489
0.5054	0.6276	300	0.5450	-1.3981	-2.6883	0.7773	1.2901	-531.4894	-402.4424	2.2560	1.4771
0.497	0.7322	350	0.5302	-1.6005	-2.8675	0.7734	1.2670	-549.4120	-422.6754	2.2259	1.3704
0.5076	0.8368	400	0.5348	-1.6133	-2.9625	0.7891	1.3492	-558.9131	-423.9595	2.2785	1.3332
0.5092	0.9414	450	0.5341	-1.6701	-3.0297	0.7852	1.3596	-565.6297	-429.6380	2.3444	1.3858
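
For reading the table against the learning-rate settings, the following is a rough sketch of a cosine schedule with linear warmup, written in plain Python. Two things here are assumptions rather than facts from this card: the total step count is estimated from the table (step 50 at epoch 0.1046, one epoch in total), and the schedule shape (linear warmup from zero, then a half-cosine decay to zero) is the common form of a "cosine" scheduler, not a confirmed detail of this run.

```python
import math

PEAK_LR = 5e-07       # learning_rate from the card
WARMUP_RATIO = 0.1    # lr_scheduler_warmup_ratio from the card

# Assumption: estimate steps per epoch from the first table row
# (step 50 corresponds to epoch 0.1046); training ran for 1 epoch.
TOTAL_STEPS = round(50 / 0.1046)
WARMUP_STEPS = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Approximate learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to PEAK_LR.
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    # Half-cosine decay from PEAK_LR to 0 over the remaining steps.
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Evaluate the sketch at a few of the logged steps.
for step in (50, 250, 450):
    print(step, f"{lr_at(step):.3e}")
```

With these assumptions the warmup ends shortly before the first logged row, so every row in the table falls in the decay phase.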

Framework versions

Transformers 4.44.0.dev0
PyTorch 2.1.2
Datasets 2.20.0
Tokenizers 0.19.1
