zephyr-7b-gpo-u3-i1

This model is a fine-tuned version of DUAL-GPO/zephyr-7b-gpo-update3-i0 on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

Loss: 0.0976
Rewards/chosen: -0.2046
Rewards/rejected: -0.1684
Rewards/accuracies: 0.3440
Rewards/margins: -0.0362
Logps/rejected: -271.7846
Logps/chosen: -287.1580
Logits/rejected: -1.8253
Logits/chosen: -1.9851
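
The reported reward margin can be checked against the two reward values listed above. This is a minimal sketch, assuming the margin is the mean of (chosen reward minus rejected reward) per evaluation pair, so that it equals the difference of the two reported means. The variable names are illustrative and do not come from the training code.

```python
# Evaluation metrics copied from the list above.
rewards_chosen = -0.2046
rewards_rejected = -0.1684
reported_margin = -0.0362

# Assumption: the margin is the mean of (chosen - rejected) per example,
# so it should equal the difference of the two reported means.
margin = rewards_chosen - rewards_rejected

print(f"recomputed margin: {margin:.4f}")
print(f"matches reported:  {abs(margin - reported_margin) < 1e-4}")
```

If Rewards/accuracies is the fraction of pairs in which the chosen response gets the higher reward, which is again an assumption, then the value 0.3440 means that this happens in roughly a third of the evaluation pairs. That is consistent with the negative margin.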

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-06
train_batch_size: 2
eval_batch_size: 2
seed: 42
distributed_type: multi-GPU
gradient_accumulation_steps: 2
total_train_batch_size: 4
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 5
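
The batch size and learning-rate settings above can be illustrated with a short sketch. It assumes the common shape of a cosine schedule with warmup (linear ramp to the peak rate, then a half-cosine decay to zero), and it infers 250 optimizer steps per epoch from the results table, where step 100 corresponds to epoch 0.4. The function `lr_at` and the constants are hypothetical names, not part of the original training script.

```python
import math

BASE_LR = 5e-06         # learning_rate
WARMUP_RATIO = 0.1      # lr_scheduler_warmup_ratio
NUM_EPOCHS = 5          # num_epochs
TRAIN_BATCH_SIZE = 2    # train_batch_size
GRAD_ACCUM_STEPS = 2    # gradient_accumulation_steps
STEPS_PER_EPOCH = 250   # inferred: the results table logs step 100 at epoch 0.4

# Samples contributing to one optimizer update; matches total_train_batch_size: 4.
EFFECTIVE_BATCH = TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS

TOTAL_STEPS = NUM_EPOCHS * STEPS_PER_EPOCH
WARMUP_STEPS = round(TOTAL_STEPS * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Linear warmup to BASE_LR, then half-cosine decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print("effective batch size:", EFFECTIVE_BATCH)
print("warmup steps:", WARMUP_STEPS, "of", TOTAL_STEPS)
for step in (0, WARMUP_STEPS, 500, 1000, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the learning rate peaks at the end of warmup and then decreases for the remainder of the run.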

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.3803	0.4	100	0.0537	0.0	0.0	0.0	0.0	-254.9398	-266.6976	-1.8067	-1.9618
0.2732	0.8	200	0.0585	-0.0406	-0.0433	0.4405	0.0028	-259.2744	-270.7553	-1.8367	-1.9952
0.3013	1.2	300	0.0800	-0.3312	-0.3632	0.4645	0.0319	-291.2575	-299.8226	-1.8131	-1.9752
0.3433	1.6	400	0.0812	-0.3364	-0.3695	0.4675	0.0331	-291.8892	-300.3361	-1.8102	-1.9721
0.3606	2.0	500	0.1100	-0.3181	-0.2920	0.3735	-0.0262	-284.1371	-298.5123	-1.8348	-1.9970
0.3038	2.4	600	0.1092	-0.3233	-0.2979	0.3770	-0.0254	-284.7261	-299.0256	-1.8317	-1.9936
0.3161	2.8	700	0.1069	-0.3172	-0.2929	0.3800	-0.0243	-284.2322	-298.4158	-1.8345	-1.9966
0.3852	3.2	800	0.0918	-0.2304	-0.2057	0.3685	-0.0247	-275.5103	-289.7388	-1.8409	-2.0019
0.3359	3.6	900	0.0983	-0.2063	-0.1696	0.3430	-0.0368	-271.8958	-287.3323	-1.8240	-1.9838
0.3701	4.0	1000	0.0982	-0.2062	-0.1693	0.3455	-0.0368	-271.8734	-287.3159	-1.8241	-1.9838
0.4025	4.4	1100	0.0975	-0.2047	-0.1687	0.3455	-0.0359	-271.8127	-287.1649	-1.8260	-1.9858
0.3754	4.8	1200	0.0974	-0.2044	-0.1685	0.3440	-0.0359	-271.7890	-287.1331	-1.8256	-1.9853
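
To compare the logged checkpoints, the table above can be reduced to a few columns and searched for the best row by a chosen metric. The sketch below copies the Step, Validation Loss, Rewards/accuracies and Rewards/margins columns by hand. The `Row` type and the choice of selection criteria are illustrative assumptions and do not describe how the released checkpoint was chosen.

```python
from collections import namedtuple

Row = namedtuple("Row", "step val_loss accuracy margin")

# Step, Validation Loss, Rewards/accuracies, Rewards/margins from the table above.
ROWS = [
    Row(100, 0.0537, 0.0, 0.0),
    Row(200, 0.0585, 0.4405, 0.0028),
    Row(300, 0.0800, 0.4645, 0.0319),
    Row(400, 0.0812, 0.4675, 0.0331),
    Row(500, 0.1100, 0.3735, -0.0262),
    Row(600, 0.1092, 0.3770, -0.0254),
    Row(700, 0.1069, 0.3800, -0.0243),
    Row(800, 0.0918, 0.3685, -0.0247),
    Row(900, 0.0983, 0.3430, -0.0368),
    Row(1000, 0.0982, 0.3455, -0.0368),
    Row(1100, 0.0975, 0.3455, -0.0359),
    Row(1200, 0.0974, 0.3440, -0.0359),
]

lowest_loss = min(ROWS, key=lambda r: r.val_loss)
best_accuracy = max(ROWS, key=lambda r: r.accuracy)
best_margin = max(ROWS, key=lambda r: r.margin)

print("lowest validation loss:", lowest_loss)
print("highest reward accuracy:", best_accuracy)
print("highest reward margin:", best_margin)
```

The criteria disagree for this run: validation loss is lowest at the first logged step, while reward accuracy and margin both peak at step 400.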

Framework versions

PEFT 0.7.1
Transformers 4.36.2
PyTorch 2.1.2+cu121
Datasets 2.14.6
Tokenizers 0.15.2
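
When reproducing the environment, the installed packages can be compared against the versions listed above. This is a minimal sketch using the standard-library `importlib.metadata`. The distribution names in `EXPECTED` (for example `torch` for PyTorch) are assumptions about how the packages were installed, and the helper names are hypothetical.

```python
from importlib import metadata

# Versions listed above; keys are the assumed PyPI distribution names.
EXPECTED = {
    "peft": "0.7.1",
    "transformers": "4.36.2",
    "torch": "2.1.2",  # the card lists the CUDA 12.1 build, 2.1.2+cu121
    "datasets": "2.14.6",
    "tokenizers": "0.15.2",
}


def installed_version(name):
    """Return the installed version without a local suffix such as +cu121, or None."""
    try:
        return metadata.version(name).split("+")[0]
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected):
    """Map each package whose installed version differs to (expected, installed)."""
    mismatches = {}
    for name, wanted in expected.items():
        found = installed_version(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


for name, (wanted, found) in find_mismatches(EXPECTED).items():
    print(f"{name}: card lists {wanted}, environment has {found or 'nothing installed'}")
```

The script prints nothing when every listed package is installed at the listed version.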
