qwen_cpo_entropy_0_01

This model is a fine-tuned version of trl-lib/qwen1.5-0.5b-sft on the yakazimir/ultrafeedback_binarized dataset. At the final logged checkpoint (step 5600, epoch 2.9972), it achieves the following results on the evaluation set:

Loss: 0.5583
Sft Loss: 3.4705
Rewards/chosen: -3.3285
Rewards/rejected: -4.3810
Rewards/accuracies: 0.7226
Rewards/margins: 1.0525
Logps/rejected: -4.3810
Logps/chosen: -3.3285
Logits/rejected: 0.2811
Logits/chosen: 0.1563
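
As a quick sanity check on the numbers above, the reported margin is the chosen reward minus the rejected reward. The short sketch below reproduces it; the variable names are illustrative and the values are copied from the list above. In these results the reward values coincide with the corresponding Logps values.

```python
# Values copied from the evaluation results listed above.
rewards_chosen = -3.3285
rewards_rejected = -4.3810

# Rewards/margins is the chosen reward minus the rejected reward.
margin = rewards_chosen - rewards_rejected

print(f"margin = {margin:.4f}")  # prints "margin = 1.0525"
```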

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-06
train_batch_size: 2
eval_batch_size: 4
seed: 42
distributed_type: multi-GPU
gradient_accumulation_steps: 16
total_train_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 3.0
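
The batch-size and schedule settings above can be illustrated with a small pure-Python sketch. It is not the training code: `effective_batch_size` and `cosine_lr` are hypothetical helpers, the device count is not listed on this card (the listed total of 32 already equals 2 × 16), and `total_steps=5600` is an assumption taken from the last logged step, so the true total may be slightly higher. The schedule follows the common shape of linear warmup followed by cosine decay to zero.

```python
import math


def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Examples consumed per optimizer step.

    num_devices is an assumption; the card does not state it.
    """
    return per_device_batch * accumulation_steps * num_devices


def cosine_lr(step, total_steps, peak_lr=1e-6, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to zero (a sketch)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: progress runs from 0 to 1 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# train_batch_size=2 and gradient_accumulation_steps=16 from the list above.
print(effective_batch_size(2, 16))

# Learning rate at a few points, assuming 5600 total steps.
for s in (0, 280, 560, 3080, 5600):
    print(s, cosine_lr(s, total_steps=5600))
```

Under these assumptions the learning rate peaks at 1e-06 around step 560 and decays toward zero by the end of training.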

Training results

Training Loss	Epoch	Step	Validation Loss	Sft Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.7019	0.2141	400	0.6977	1.4219	-1.4375	-1.6032	0.5631	0.1657	-1.6032	-1.4375	0.2993	0.2138
0.6225	0.4282	800	0.6192	2.0573	-2.0770	-2.5396	0.6669	0.4626	-2.5396	-2.0770	0.3429	0.2570
0.6242	0.6422	1200	0.5882	2.6279	-2.4850	-3.1039	0.6973	0.6190	-3.1039	-2.4850	0.5237	0.4102
0.5405	0.8563	1600	0.5781	2.5442	-2.4160	-3.0202	0.7092	0.6042	-3.0202	-2.4160	0.4122	0.3042
0.6195	1.0704	2000	0.5673	2.7121	-2.5451	-3.2527	0.7129	0.7076	-3.2527	-2.5451	0.4573	0.3371
0.5895	1.2845	2400	0.5590	3.0631	-2.8962	-3.7486	0.7322	0.8524	-3.7486	-2.8962	0.3362	0.2174
0.5512	1.4986	2800	0.5563	2.9053	-2.7513	-3.5751	0.7203	0.8238	-3.5751	-2.7513	0.2892	0.1750
0.5766	1.7127	3200	0.5520	2.9643	-2.8134	-3.6655	0.7263	0.8522	-3.6655	-2.8134	0.2677	0.1562
0.5625	1.9267	3600	0.5478	3.0563	-2.8597	-3.7385	0.7255	0.8788	-3.7385	-2.8597	0.3670	0.2441
0.4702	2.1408	4000	0.5592	3.5119	-3.3071	-4.3285	0.7240	1.0214	-4.3285	-3.3071	0.2395	0.1198
0.4882	2.3549	4400	0.5601	3.5201	-3.3795	-4.4355	0.7270	1.0560	-4.4355	-3.3795	0.2852	0.1603
0.4952	2.5690	4800	0.5580	3.4402	-3.3065	-4.3570	0.7233	1.0505	-4.3570	-3.3065	0.3210	0.1936
0.4272	2.7831	5200	0.5579	3.4523	-3.3138	-4.3619	0.7233	1.0481	-4.3619	-3.3138	0.3592	0.2281
0.4590	2.9972	5600	0.5583	3.4705	-3.3285	-4.3810	0.7226	1.0525	-4.3810	-3.3285	0.2811	0.1563
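
The table can be scanned programmatically to see where each metric peaks. The sketch below copies the Step, Validation Loss, Rewards/accuracies, and Rewards/margins columns from the rows above; the tuple layout and variable names are illustrative only.

```python
# (step, validation_loss, reward_accuracy, reward_margin) copied from the table.
rows = [
    (400, 0.6977, 0.5631, 0.1657),
    (800, 0.6192, 0.6669, 0.4626),
    (1200, 0.5882, 0.6973, 0.6190),
    (1600, 0.5781, 0.7092, 0.6042),
    (2000, 0.5673, 0.7129, 0.7076),
    (2400, 0.5590, 0.7322, 0.8524),
    (2800, 0.5563, 0.7203, 0.8238),
    (3200, 0.5520, 0.7263, 0.8522),
    (3600, 0.5478, 0.7255, 0.8788),
    (4000, 0.5592, 0.7240, 1.0214),
    (4400, 0.5601, 0.7270, 1.0560),
    (4800, 0.5580, 0.7233, 1.0505),
    (5200, 0.5579, 0.7233, 1.0481),
    (5600, 0.5583, 0.7226, 1.0525),
]

lowest_loss = min(rows, key=lambda r: r[1])
best_accuracy = max(rows, key=lambda r: r[2])
largest_margin = max(rows, key=lambda r: r[3])

print("lowest validation loss at step", lowest_loss[0])    # 3600
print("highest reward accuracy at step", best_accuracy[0])  # 2400
print("largest reward margin at step", largest_margin[0])   # 4400
```

Validation loss is lowest at step 3600 (0.5478) and rises slightly in the third epoch, while the reward margin continues to grow.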

Framework versions

Transformers 4.44.2
Pytorch 2.2.2+cu121
Datasets 2.18.0
Tokenizers 0.19.1
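
To compare a local environment against the versions listed above, a small check along these lines may help. It is only a sketch: `version_mismatches` is a hypothetical helper, and the distribution names (`torch` for Pytorch, and so on) are the usual pip names rather than something stated on this card.

```python
from importlib import metadata

# Versions copied from the list above, keyed by pip distribution name.
EXPECTED = {
    "transformers": "4.44.2",
    "torch": "2.2.2+cu121",
    "datasets": "2.18.0",
    "tokenizers": "0.19.1",
}


def version_mismatches(expected=EXPECTED):
    """Return {name: (wanted, found)} for packages that differ or are missing."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


for name, (wanted, found) in version_mismatches().items():
    print(f"{name}: expected {wanted}, found {found}")
```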
