zephyr-7b-gpo-iter0

This model is a fine-tuned version of alignment-handbook/zephyr-7b-sft-qlora on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

Loss: 0.0258
Rewards/chosen: -0.0580
Rewards/rejected: -0.0061
Rewards/accuracies: 0.3380
Rewards/margins: -0.0519
Logps/rejected: -249.4468
Logps/chosen: -274.3866
Logits/rejected: -2.2108
Logits/chosen: -2.4070
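
As a quick consistency check on the figures above, the reward margin should equal the chosen reward minus the rejected reward. The sketch below redoes that arithmetic with the reported values; the variable names are illustrative and are not taken from the training code.

```python
# Final evaluation values copied from the list above.
rewards_chosen = -0.0580
rewards_rejected = -0.0061

# Margin = chosen reward minus rejected reward, rounded to the reported precision.
margin = round(rewards_chosen - rewards_rejected, 4)
print(margin)  # -0.0519
```

Assuming Rewards/accuracies follows the usual convention (the fraction of evaluation pairs in which the chosen response gets the higher reward), a value of 0.3380 with a negative margin means the chosen response was preferred in only about a third of the pairs.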

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-06
train_batch_size: 1
eval_batch_size: 2
seed: 42
distributed_type: multi-GPU
gradient_accumulation_steps: 2
total_train_batch_size: 2
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 2
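
To make the scheduler settings concrete, here is a minimal sketch of a cosine schedule with linear warmup using the values listed above. It assumes the common shape (linear ramp to the peak, then cosine decay to zero) and takes 1000 as the total number of optimizer steps, which is the last step logged in the training results; the trainer's exact implementation may differ, and `lr_at` is a hypothetical helper name.

```python
import math

PEAK_LR = 5e-06      # learning_rate from the list above
WARMUP_RATIO = 0.1   # lr_scheduler_warmup_ratio from the list above
TOTAL_STEPS = 1000   # assumption: last step logged in the training results


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (TOTAL_STEPS - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step))
```

Under these assumptions the rate peaks at step 100, is at half the peak around step 550, and reaches zero at step 1000.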

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.0008	0.2	100	0.0019	-0.0111	-0.0138	0.5300	0.0027	-250.2170	-269.6990	-2.2026	-2.4007
0.0006	0.4	200	0.0029	-0.0237	-0.0230	0.4910	-0.0007	-251.1392	-270.9541	-2.2051	-2.4034
0.001	0.6	300	0.0019	-0.0120	-0.0142	0.5310	0.0022	-250.2602	-269.7912	-2.2008	-2.3984
0.0011	0.8	400	0.0023	-0.0201	-0.0211	0.5010	0.0011	-250.9541	-270.5950	-2.1993	-2.3968
0.0008	1.0	500	0.0021	-0.0170	-0.0189	0.5065	0.0019	-250.7260	-270.2850	-2.1982	-2.3960
0.044	1.2	600	0.0091	-0.0053	0.0198	0.3600	-0.0252	-246.8548	-269.1194	-2.1940	-2.3899
0.0682	1.4	700	0.0191	-0.0345	0.0086	0.3450	-0.0431	-247.9818	-272.0423	-2.2035	-2.3992
0.0505	1.6	800	0.0237	-0.0497	-0.0001	0.3405	-0.0496	-248.8542	-273.5587	-2.2094	-2.4056
0.0243	1.8	900	0.0259	-0.0581	-0.0062	0.3340	-0.0519	-249.4570	-274.3967	-2.2117	-2.4081
0.0697	2.0	1000	0.0258	-0.0580	-0.0061	0.3380	-0.0519	-249.4468	-274.3866	-2.2108	-2.4070
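
The table can also be scanned programmatically to see where the evaluation metrics peak. The sketch below copies four columns from the table above; selecting by reward accuracy is only an illustration and is not necessarily how the authors picked a checkpoint.

```python
# (step, validation_loss, rewards_accuracy, rewards_margin) copied from the table above.
rows = [
    (100, 0.0019, 0.5300, 0.0027),
    (200, 0.0029, 0.4910, -0.0007),
    (300, 0.0019, 0.5310, 0.0022),
    (400, 0.0023, 0.5010, 0.0011),
    (500, 0.0021, 0.5065, 0.0019),
    (600, 0.0091, 0.3600, -0.0252),
    (700, 0.0191, 0.3450, -0.0431),
    (800, 0.0237, 0.3405, -0.0496),
    (900, 0.0259, 0.3340, -0.0519),
    (1000, 0.0258, 0.3380, -0.0519),
]

# Step with the highest reward accuracy.
best_step = max(rows, key=lambda row: row[2])[0]

# Steps at which the reward margin is negative.
negative_margin_steps = [step for step, _, _, margin in rows if margin < 0]

print(best_step)              # 300
print(negative_margin_steps)  # [200, 600, 700, 800, 900, 1000]
```

On these numbers, reward accuracy peaks at step 300 and the margin stays negative from step 600 onward, which coincides with the jump in training loss at the start of the second epoch.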

Framework versions

PEFT 0.7.1
Transformers 4.36.2
Pytorch 2.1.2+cu118
Datasets 2.14.6
Tokenizers 0.15.2
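
With the libraries listed above, loading the model might look like the following sketch. It assumes the repository stores a PEFT adapter (suggested by the QLoRA base model and the PEFT version, but not stated in this card) and that tokenizer files are included; `load_model_and_tokenizer` and `generate_reply` are hypothetical helper names.

```python
REPO_ID = "DUAL-GPO/zephyr-7b-gpo-iter0"  # repository name of this model card


def load_model_and_tokenizer(repo_id: str = REPO_ID):
    """Load the adapter together with its base model (downloads weights on first call)."""
    # Imported lazily so this file can be imported without the heavy dependencies.
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    # Assumption: the repository holds adapter weights that PEFT can resolve.
    model = AutoPeftModelForCausalLM.from_pretrained(repo_id)
    return model, tokenizer


def generate_reply(prompt: str, max_new_tokens: int = 64) -> str:
    """Generate a continuation for a plain-text prompt."""
    model, tokenizer = load_model_and_tokenizer()
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_reply("What is preference optimization?")` would download the base model and the adapter, so it needs network access and enough memory for a 7B model.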

