zephyr-7b-gpo-new-v1-i0

This model is a fine-tuned version of DUAL-GPO-2/zephyr-7b-sft-new on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set (these values match the last logged checkpoint, step 800, in the training results table):

Logits/chosen: -1.6150
Logits/rejected: -1.4339
Logps/chosen: -400.8042
Logps/rejected: -417.9239
Loss: 0.0367
Rewards/accuracies: 0.5930
Rewards/chosen: -0.1716
Rewards/margins: 0.0499
Rewards/rejected: -0.2215
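
The reward figures above can be sanity-checked against each other. The sketch below assumes the metrics follow the usual convention of preference-optimization trainers, where Rewards/margins is the mean of (chosen reward minus rejected reward); the card itself does not state which trainer produced them, so treat this as an illustration rather than a description of the training code. The variable names are made up for the example.

```python
# Final evaluation metrics, copied from the list above.
rewards_chosen = -0.1716
rewards_rejected = -0.2215
reported_margin = 0.0499
logps_chosen = -400.8042
logps_rejected = -417.9239

# Assumed convention: margin = chosen reward - rejected reward.
margin = rewards_chosen - rewards_rejected

# Gap between the log-probabilities the policy assigns to chosen and rejected responses.
logp_gap = logps_chosen - logps_rejected

print(f"recomputed margin: {margin:.4f} (reported: {reported_margin:.4f})")
print(f"logp gap (chosen - rejected): {logp_gap:.4f}")
```

Both rewards are negative, but the rejected responses are penalized more than the chosen ones, which is what gives a positive margin.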

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-06
train_batch_size: 2
eval_batch_size: 2
seed: 42
distributed_type: multi-GPU
num_devices: 2
gradient_accumulation_steps: 2
total_train_batch_size: 8
total_eval_batch_size: 4
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1
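
As a rough illustration of how these settings fit together, the sketch below recomputes the total batch sizes and evaluates a cosine schedule with linear warmup. The schedule formula is the common warmup-then-half-cosine form and is an assumption about what lr_scheduler_type: cosine means here. The total step count of 8000 is also an assumption, estimated from the training results table (step 800 is logged at roughly epoch 0.1); the card does not state it.

```python
import math

# Values from the hyperparameter list above.
learning_rate = 5e-06
train_batch_size = 2            # per device
eval_batch_size = 2             # per device
num_devices = 2
gradient_accumulation_steps = 2
warmup_ratio = 0.1

# Total batch sizes follow from the per-device values.
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices

# ASSUMPTION: about 8000 optimizer steps in one epoch (estimated, not stated in the card).
assumed_total_steps = 8000
warmup_steps = math.ceil(warmup_ratio * assumed_total_steps)


def lr_at(step: int) -> float:
    """Learning rate at a step: linear warmup, then half-cosine decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, assumed_total_steps - warmup_steps)
    return learning_rate * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


print(total_train_batch_size, total_eval_batch_size)
for step in (0, 400, 800, 4400, 8000):
    print(step, lr_at(step))
```

Under these assumptions the warmup would end around step 800, which is also the last step logged in the training results.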

Training results

Training Loss	Epoch	Step	Logits/chosen	Logits/rejected	Logps/chosen	Logps/rejected	Validation Loss	Rewards/accuracies	Rewards/chosen	Rewards/margins	Rewards/rejected
0.0532	0.01	100	-2.3286	-2.1100	-227.7062	-195.3472	0.0535	0.5530	0.0015	0.0005	0.0011
0.0625	0.03	200	-2.3309	-2.1124	-225.1784	-194.7161	0.0527	0.6080	0.0040	0.0023	0.0017
0.0485	0.04	300	-2.3236	-2.1050	-237.7424	-214.5471	0.0496	0.5890	-0.0085	0.0096	-0.0181
0.0361	0.05	400	-2.3720	-2.1493	-251.4063	-239.9417	0.0447	0.5990	-0.0222	0.0213	-0.0435
0.0375	0.07	500	-2.1960	-1.9821	-282.6958	-281.3289	0.0417	0.5890	-0.0535	0.0314	-0.0849
0.0522	0.08	600	-1.5433	-1.3697	-398.6434	-395.0714	0.0432	0.5920	-0.1694	0.0292	-0.1987
0.0453	0.09	700	-1.9137	-1.7203	-295.2011	-297.8420	0.0367	0.5780	-0.0660	0.0355	-0.1014
0.0293	0.1	800	-1.6150	-1.4339	-400.8042	-417.9239	0.0367	0.5930	-0.1716	0.0499	-0.2215

Framework versions

PEFT 0.7.1
Transformers 4.36.2
Pytorch 2.1.2+cu121
Datasets 2.14.6
Tokenizers 0.15.2
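
Because PEFT appears in the version list, the checkpoint is probably a PEFT adapter rather than a full set of weights. The following is a minimal loading sketch under that assumption. The repository id DUAL-GPO/zephyr-7b-gpo-new-v1-i0 and the choice to take the tokenizer from the base model are assumptions, and the helper name load_model is hypothetical; none of this has been checked against the hosted files.

```python
# ASSUMPTION: the repo hosts a PEFT adapter whose config points at the base model.
REPO_ID = "DUAL-GPO/zephyr-7b-gpo-new-v1-i0"
# ASSUMPTION: the tokenizer is taken from the base model named in this card.
BASE_ID = "DUAL-GPO-2/zephyr-7b-sft-new"


def load_model(repo_id: str = REPO_ID, tokenizer_id: str = BASE_ID):
    """Load the adapter on top of its base model and return (model, tokenizer)."""
    # Imported lazily so the module can be inspected without the heavy dependencies.
    import torch
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    model = AutoPeftModelForCausalLM.from_pretrained(
        repo_id,
        torch_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
    return model, tokenizer
```

Calling load_model() downloads the base weights and the adapter, so it needs network access and enough memory for a 7B-parameter model.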
