zephyr-7b-dpo-full-prometheus-reward-scale-01

This model is a fine-tuned version of alignment-handbook/zephyr-7b-sft-full. The training dataset is not named in the card metadata. It achieves the following results on the evaluation set:

Loss: 0.5012
Rewards/chosen: -1.7553
Rewards/rejected: -2.9981
Rewards/accuracies: 0.7198
Rewards/margins: 1.2428
Logps/rejected: -548.0841
Logps/chosen: -435.4877
Logits/rejected: 3.0596
Logits/chosen: 1.9658
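
The reward metrics above are related to one another, and a short check makes the relationship explicit. The sketch below assumes the metrics follow the usual DPO logging convention, in which Rewards/margins is the mean of (chosen reward minus rejected reward) and the per-pair loss is the standard sigmoid form, -log(sigmoid(margin)). The card does not state which loss variant was used, so the loss comparison is illustrative only; the helper name `sigmoid_dpo_loss` is hypothetical.

```python
import math

# Final evaluation metrics copied from the list above.
rewards_chosen = -1.7553
rewards_rejected = -2.9981
reported_margin = 1.2428
reported_loss = 0.5012

# The margin is the mean chosen reward minus the mean rejected reward.
margin = rewards_chosen - rewards_rejected


def sigmoid_dpo_loss(m: float) -> float:
    """Per-pair sigmoid DPO loss, -log(sigmoid(m)), for a reward margin m.

    ASSUMPTION: the standard sigmoid loss; the card does not confirm this.
    """
    return math.log1p(math.exp(-m))


# Loss evaluated at the mean margin. Because -log(sigmoid(m)) is convex in m,
# the mean loss over pairs cannot be smaller than this value (Jensen's
# inequality), so under this assumption it is a lower bound.
loss_at_mean_margin = sigmoid_dpo_loss(margin)

print(f"margin = {margin:.4f} (reported {reported_margin})")
print(f"loss at mean margin = {loss_at_mean_margin:.4f} (reported mean loss {reported_loss})")
```

Both rewards are negative, which under this convention means the fine-tuned model assigns lower log-probability than the reference model to chosen and rejected responses alike. The positive margin shows that the rejected responses were pushed down further.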

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-07
train_batch_size: 8
eval_batch_size: 8
seed: 55
distributed_type: multi-GPU
num_devices: 8
gradient_accumulation_steps: 2
total_train_batch_size: 128
total_eval_batch_size: 64
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1
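
The total batch sizes follow from the per-device values, and the learning-rate schedule can be sketched from the listed settings. The example below is a minimal illustration rather than the training code itself. The function `lr_at` is a hypothetical helper that mimics the common linear-warmup-then-cosine-decay shape. The total step count of 437 is an assumption, estimated from the results table (step 400 at epoch 0.9143).

```python
import math

# Values copied from the hyperparameter list above.
train_batch_size = 8             # per device
eval_batch_size = 8              # per device
num_devices = 8
gradient_accumulation_steps = 2
learning_rate = 5e-07
warmup_ratio = 0.1

# Effective batch sizes across all devices.
total_train = train_batch_size * num_devices * gradient_accumulation_steps
total_eval = eval_batch_size * num_devices


def lr_at(step: int, total_steps: int) -> float:
    """Linear warmup from zero, then cosine decay to zero (illustrative)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


ASSUMED_TOTAL_STEPS = 437  # assumption: roughly 400 / 0.9143, not stated in the card

print(f"total_train_batch_size = {total_train}")
print(f"total_eval_batch_size = {total_eval}")
for step in (0, 22, 44, 240, 437):
    print(f"step {step:3d}: lr = {lr_at(step, ASSUMED_TOTAL_STEPS):.3e}")
```

Gradient accumulation applies only to training, which is why the evaluation total is 64 rather than 128.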

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6661	0.1143	50	0.6541	-0.0354	-0.1594	0.6552	0.1240	-264.2141	-263.4992	-2.5732	-2.6193
0.561	0.2286	100	0.5663	-1.0515	-1.8367	0.7069	0.7852	-431.9447	-365.1131	-0.0076	-0.4111
0.5324	0.3429	150	0.5437	-1.4518	-2.4385	0.6853	0.9868	-492.1300	-405.1368	2.0029	1.3258
0.5261	0.4571	200	0.5247	-1.5625	-2.5913	0.6853	1.0288	-507.4055	-416.2077	2.7389	1.7313
0.5274	0.5714	250	0.5148	-1.6815	-2.8054	0.7155	1.1239	-528.8192	-428.1107	2.1266	1.0144
0.5	0.6857	300	0.5078	-1.6879	-2.8754	0.7198	1.1875	-535.8170	-428.7552	2.7028	1.5160
0.4879	0.8	350	0.5050	-1.8872	-3.0745	0.7198	1.1873	-555.7252	-448.6785	3.2477	2.2065
0.5082	0.9143	400	0.5012	-1.7553	-2.9981	0.7198	1.2428	-548.0841	-435.4877	3.0596	1.9658
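
The Epoch and Step columns together give a rough idea of how long one epoch is. The sketch below estimates the number of optimizer steps per epoch from each row, then multiplies by the total train batch size of 128 to approximate how many preference pairs one epoch covers. The result is an estimate derived from rounded table values, not a figure reported by the card.

```python
# (epoch, step) pairs copied from the results table above.
checkpoints = [
    (0.1143, 50), (0.2286, 100), (0.3429, 150), (0.4571, 200),
    (0.5714, 250), (0.6857, 300), (0.8, 350), (0.9143, 400),
]
total_train_batch_size = 128  # from the hyperparameter list

# Each row gives an independent estimate of optimizer steps per epoch.
estimates = [step / epoch for epoch, step in checkpoints]
steps_per_epoch = sum(estimates) / len(estimates)

# Approximate number of preference pairs consumed in one epoch.
approx_pairs = steps_per_epoch * total_train_batch_size

print(f"steps per epoch ~ {steps_per_epoch:.1f}")
print(f"preference pairs per epoch ~ {approx_pairs:.0f}")
```

The last logged row is at step 400, so the table stops slightly before the end of the single training epoch.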

Framework versions

Transformers 4.44.0.dev0
Pytorch 2.1.2
Datasets 2.20.0
Tokenizers 0.19.1

Model repository: sfulay/zephyr-7b-dpo-full-prometheus-reward-scale-01
