zephyr-7b-gpo-iter1

This model is a fine-tuned version of DUAL-GPO/zephyr-7b-gpo-iter0 on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

Loss: 0.0069
Rewards/chosen: 0.0025
Rewards/rejected: 0.0081
Rewards/accuracies: 0.4595
Rewards/margins: -0.0056
Logps/rejected: -272.5866
Logps/chosen: -298.8498
Logits/rejected: -2.1749
Logits/chosen: -2.3692
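
The reported margin matches the difference between the chosen and rejected rewards. The card does not define the metric, so the following is only a minimal check under the common convention that the margin is the chosen reward minus the rejected reward:

```python
# Values copied from the evaluation results above.
rewards_chosen = 0.0025
rewards_rejected = 0.0081
reported_margin = -0.0056

# Assumed convention (not stated in the card): margin = chosen - rejected.
margin = rewards_chosen - rewards_rejected

# The reported figures are rounded to four decimals, so compare with a tolerance.
assert abs(margin - reported_margin) < 1e-4
print(f"margin = {margin:.4f}")
```

Under that convention, a negative margin together with an accuracy below 0.5 indicates that rejected responses receive a slightly higher reward than chosen ones on this evaluation set.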

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-06
train_batch_size: 1
eval_batch_size: 2
seed: 42
distributed_type: multi-GPU
gradient_accumulation_steps: 2
total_train_batch_size: 2
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 2
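
To illustrate what the scheduler settings imply, here is a rough sketch of a cosine schedule with linear warmup. The card names only the scheduler type and the warmup ratio, so several things are assumptions: the total of 1000 steps is taken from the last row of the training results table, the warmup is linear, the decay goes to zero, and `lr_at_step` is a hypothetical helper written for this example rather than the code used in training.

```python
import math

def lr_at_step(step, base_lr=5e-06, total_steps=1000, warmup_ratio=0.1):
    """Learning rate under linear warmup followed by cosine decay to zero.

    total_steps=1000 is an assumption read off the training results table.
    """
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step))
```

With these assumptions the learning rate peaks at 5e-06 at step 100 and decays to zero by step 1000.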

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.0006	0.2	100	0.0031	-0.0541	-0.0467	0.4245	-0.0074	-278.0669	-304.5065	-2.1506	-2.3436
0.0025	0.4	200	0.0033	-0.0115	-0.0107	0.4910	-0.0008	-274.4619	-300.2420	-2.1684	-2.3612
0.0009	0.6	300	0.0030	-0.0220	-0.0216	0.4935	-0.0004	-275.5567	-301.2960	-2.1427	-2.3360
0.0013	0.8	400	0.0034	-0.0156	-0.0142	0.4935	-0.0014	-274.8156	-300.6561	-2.1462	-2.3405
0.0011	1.0	500	0.0037	-0.0565	-0.0502	0.4520	-0.0063	-278.4165	-304.7457	-2.1454	-2.3392
0.0116	1.2	600	0.0049	-0.0283	-0.0229	0.4435	-0.0054	-275.6791	-301.9266	-2.1527	-2.3449
0.015	1.4	700	0.0065	-0.0261	-0.0182	0.4450	-0.0078	-275.2170	-301.7041	-2.1650	-2.3586
0.0009	1.6	800	0.0069	0.0079	0.0124	0.4720	-0.0044	-272.1540	-298.3011	-2.1746	-2.3689
0.0109	1.8	900	0.0069	0.0024	0.0080	0.4570	-0.0057	-272.5880	-298.8583	-2.1739	-2.3682
0.0015	2.0	1000	0.0069	0.0025	0.0081	0.4595	-0.0056	-272.5866	-298.8498	-2.1749	-2.3692
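
The table can also be scanned programmatically, for example to find the evaluation step with the lowest validation loss. The sketch below uses only the Step and Validation Loss columns copied from the table above; the variable names are illustrative.

```python
# (step, validation loss) pairs copied from the training results table.
eval_history = [
    (100, 0.0031), (200, 0.0033), (300, 0.0030), (400, 0.0034), (500, 0.0037),
    (600, 0.0049), (700, 0.0065), (800, 0.0069), (900, 0.0069), (1000, 0.0069),
]

# Pick the row with the smallest validation loss.
best_step, best_loss = min(eval_history, key=lambda row: row[1])
final_step, final_loss = eval_history[-1]

print(f"lowest validation loss: {best_loss} at step {best_step}")
print(f"final validation loss:  {final_loss} at step {final_step}")
```

On these numbers the lowest validation loss occurs at step 300, within the first epoch, and the loss rises through the second epoch.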

Framework versions

PEFT 0.7.1
Transformers 4.36.2
PyTorch 2.1.2+cu118
Datasets 2.14.6
Tokenizers 0.15.2
