zephyr-7b-dpo-full

This model is a DPO fine-tuned version of alignment-handbook/zephyr-7b-sft-full on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set (a quick sanity check on the reward metrics follows the list):

  • Loss: 1.3377
  • Rewards/chosen: -14.5626
  • Rewards/rejected: -18.1281
  • Rewards/accuracies: 0.6389
  • Rewards/margins: 3.5654
  • Logps/rejected: -2073.0146
  • Logps/chosen: -1738.2311
  • Logits/rejected: -0.6819
  • Logits/chosen: -1.0035
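
As logged by trl's DPO training, Rewards/chosen and Rewards/rejected are the implicit per-response rewards, Rewards/margins is their difference, and Rewards/accuracies is the fraction of evaluation pairs where the chosen response scored higher. A minimal sanity check, using only the values listed above:

```python
# Sanity check: in DPO logging, the reward margin is the chosen reward
# minus the rejected reward. Values copied from the evaluation results above.
rewards_chosen = -14.5626
rewards_rejected = -18.1281
margin = rewards_chosen - rewards_rejected
print(round(margin, 4))  # 3.5655 -- matches the reported 3.5654 up to rounding
```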

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a hedged reproduction sketch follows the list):

  • learning_rate: 5e-07
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
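
These settings map onto a standard trl DPOTrainer run in the style of the alignment handbook. The sketch below is an assumption for illustration, not the authors' script: the card does not state the trl version, the DPO beta, or sequence-length limits, so anything marked as a placeholder is guessed.

```python
# Hedged reproduction sketch using trl's DPOTrainer (trl 0.8-era API).
# beta is a PLACEHOLDER: the card does not state it.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_id = "alignment-handbook/zephyr-7b-sft-full"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized")

args = TrainingArguments(
    output_dir="zephyr-7b-dpo-full",
    learning_rate=5e-7,                # learning_rate
    per_device_train_batch_size=8,     # train_batch_size
    per_device_eval_batch_size=8,      # eval_batch_size
    gradient_accumulation_steps=2,     # with 4 GPUs -> total_train_batch_size 64
    num_train_epochs=1,                # num_epochs
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                  # lr_scheduler_warmup_ratio
    seed=42,
    bf16=True,                         # assumption, matching the BF16 weights
    # Adam betas=(0.9, 0.999) and epsilon=1e-08 are the defaults.
)

trainer = DPOTrainer(
    model,
    ref_model=None,        # trl builds the frozen reference copy internally
    args=args,
    beta=0.1,              # PLACEHOLDER: not stated in the card
    train_dataset=dataset["train_prefs"],
    eval_dataset=dataset["test_prefs"],
    tokenizer=tokenizer,
)
trainer.train()
```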

Training results

| Training Loss | Epoch  | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---------------|--------|------|-----------------|----------------|------------------|--------------------|-----------------|----------------|--------------|-----------------|---------------|
| 4.5854        | 0.1047 | 100  | 4.3811          | -0.2719        | -0.4992          | 0.6488             | 0.2273          | -310.1242      | -309.1552    | -2.1923         | -2.2813       |
| 2.6464        | 0.2093 | 200  | 2.6063          | -9.6247        | -11.6315         | 0.6250             | 2.0068          | -1423.3580     | -1244.4360   | 0.6982          | -0.3562       |
| 1.9069        | 0.3140 | 300  | 2.2624          | -9.8468        | -11.9256         | 0.6329             | 2.0788          | -1452.7675     | -1266.6490   | 1.5569          | 0.4590        |
| 1.6642        | 0.4186 | 400  | 1.6421          | -14.4918       | -17.8494         | 0.6250             | 3.3576          | -2045.1493     | -1731.1526   | -0.0875         | -0.7751       |
| 1.6328        | 0.5233 | 500  | 1.5120          | -13.0737       | -16.3036         | 0.6389             | 3.2299          | -1890.5623     | -1589.3370   | -0.0918         | -0.6590       |
| 1.6032        | 0.6279 | 600  | 1.4752          | -17.3374       | -21.4238         | 0.6230             | 4.0864          | -2402.5845     | -2015.7072   | 0.6402          | 0.0190        |
| 1.5039        | 0.7326 | 700  | 1.3853          | -14.1299       | -17.5624         | 0.6528             | 3.4325          | -2016.4491     | -1694.9624   | -0.4968         | -0.8898       |
| 1.3527        | 0.8373 | 800  | 1.3663          | -13.9016       | -17.2583         | 0.6448             | 3.3567          | -1986.0359     | -1672.1306   | -0.6750         | -1.0375       |
| 1.5137        | 0.9419 | 900  | 1.3374          | -14.5395       | -18.1313         | 0.6409             | 3.5918          | -2073.3389     | -1735.9152   | -0.6740         | -1.0018       |

Framework versions

  • Transformers 4.40.1
  • Pytorch 2.1.2+cu121
  • Datasets 2.19.0
  • Tokenizers 0.19.1
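
With these versions pinned, the checkpoint can be loaded as below. This is a minimal sketch, not an official usage example: it assumes the model is published as Beanpow/zephyr-7b-dpo-full in bfloat16 and that the tokenizer ships the Zephyr chat template, as the base SFT model's does.

```python
# Minimal loading/inference sketch; model id and bfloat16 dtype are taken
# from the Hub page, everything else is a standard transformers pattern.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Beanpow/zephyr-7b-dpo-full"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Zephyr models are chat-tuned; format the prompt with the chat template.
messages = [{"role": "user", "content": "What is DPO fine-tuning?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```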