
OpenELM-1_1B-DPO-full-self-improve

This model is a DPO fine-tune of OpenELM-1_1B; the base checkpoint and preference dataset are not documented. It achieves the following results on the evaluation set:

  • Loss: 13.7610
  • Rewards/chosen: -51.0
  • Rewards/rejected: -46.75
  • Rewards/accuracies: 0.4570
  • Rewards/margins: -4.3438
  • Logps/rejected: -4960.0
  • Logps/chosen: -5440.0
  • Logits/rejected: 1.8125
  • Logits/chosen: 0.8477
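
For orientation, these metrics follow the usual TRL DPOTrainer conventions (assuming that trainer was used): the implicit reward is the DPO beta times the difference between policy and reference log-probabilities, the margin is the chosen reward minus the rejected reward, and accuracy is the fraction of pairs where the chosen reward exceeds the rejected one. A minimal sketch; the beta and the reference log-probabilities below are hypothetical, not values from this run:

```python
import torch

beta = 0.1  # assumed DPO beta; the actual value is not documented in this card

# Per-example sequence log-probabilities under the policy and a frozen reference model.
policy_chosen_logps = torch.tensor([-5440.0])    # Logps/chosen from the table above
policy_rejected_logps = torch.tensor([-4960.0])  # Logps/rejected from the table above
ref_chosen_logps = torch.tensor([-5000.0])       # hypothetical reference values
ref_rejected_logps = torch.tensor([-4800.0])     # hypothetical reference values

rewards_chosen = beta * (policy_chosen_logps - ref_chosen_logps)        # Rewards/chosen
rewards_rejected = beta * (policy_rejected_logps - ref_rejected_logps)  # Rewards/rejected
rewards_margins = rewards_chosen - rewards_rejected                     # Rewards/margins
rewards_accuracies = (rewards_chosen > rewards_rejected).float().mean() # Rewards/accuracies
```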

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 8
  • eval_batch_size: 16
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 64
  • total_eval_batch_size: 64
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 3
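
A minimal sketch of a TRL DPOConfig that mirrors the hyperparameters listed above. The base checkpoint, dataset, tokenizer, TRL version, and DPO beta are not documented in this card, so the placeholders below are assumptions rather than the actual training script:

```python
from trl import DPOConfig

training_args = DPOConfig(
    output_dir="OpenELM-1_1B-DPO-full-self-improve",
    learning_rate=5e-5,
    per_device_train_batch_size=8,   # train_batch_size
    per_device_eval_batch_size=16,   # eval_batch_size
    gradient_accumulation_steps=2,   # 8 per device x 4 GPUs x 2 steps = total_train_batch_size 64
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    bf16=True,                       # assumption, consistent with the BF16 released weights
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)

# Hypothetical wiring; the actual model, tokenizer, and dataset are not documented:
# from trl import DPOTrainer
# trainer = DPOTrainer(model, args=training_args, train_dataset=..., eval_dataset=..., tokenizer=tokenizer)
# trainer.train()
```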

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0.2459 | 0.1047 | 100 | 3.8620 | -12.0625 | -10.5 | 0.4531 | -1.5391 | -1344.0 | -1528.0 | -5.5625 | -5.9688 |
| 0.1787 | 0.2094 | 200 | 4.2236 | -13.0625 | -11.1875 | 0.4434 | -1.9141 | -1408.0 | -1624.0 | -1.0547 | -1.8828 |
| 0.1064 | 0.3141 | 300 | 5.5584 | -19.5 | -16.875 | 0.4336 | -2.5156 | -1984.0 | -2256.0 | 2.6406 | 1.8281 |
| 0.1114 | 0.4188 | 400 | 5.9626 | -21.625 | -19.5 | 0.4473 | -2.1094 | -2240.0 | -2480.0 | -2.3906 | -3.2969 |
| 0.0803 | 0.5236 | 500 | 6.1040 | -24.75 | -23.375 | 0.4922 | -1.4141 | -2624.0 | -2800.0 | 3.6562 | 2.4844 |
| 0.0999 | 0.6283 | 600 | 5.5224 | -22.5 | -20.375 | 0.4395 | -2.125 | -2336.0 | -2576.0 | 2.6719 | 1.2969 |
| 0.0767 | 0.7330 | 700 | 5.9968 | -24.25 | -22.5 | 0.4648 | -1.6953 | -2544.0 | -2736.0 | 0.5781 | -0.4414 |
| 0.0891 | 0.8377 | 800 | 4.9921 | -20.875 | -19.125 | 0.4570 | -1.7188 | -2208.0 | -2400.0 | -0.3652 | -1.375 |
| 0.0907 | 0.9424 | 900 | 3.9869 | -17.25 | -16.125 | 0.4785 | -1.1328 | -1896.0 | -2040.0 | -2.0781 | -3.0469 |
| 0.028 | 1.0471 | 1000 | 7.5994 | -27.75 | -26.0 | 0.4824 | -1.7422 | -2896.0 | -3104.0 | -1.6328 | -2.6094 |
| 0.0329 | 1.1518 | 1100 | 8.8766 | -34.5 | -33.0 | 0.4707 | -1.7344 | -3584.0 | -3776.0 | 0.8086 | -0.2539 |
| 0.0288 | 1.2565 | 1200 | 7.4045 | -30.25 | -27.875 | 0.4531 | -2.3438 | -3072.0 | -3344.0 | 0.7969 | -0.1514 |
| 0.0403 | 1.3613 | 1300 | 6.6099 | -27.75 | -25.75 | 0.4531 | -1.9844 | -2864.0 | -3088.0 | -2.9688 | -3.8125 |
| 0.0286 | 1.4660 | 1400 | 12.4327 | -43.75 | -39.75 | 0.4688 | -3.875 | -4288.0 | -4672.0 | 0.9492 | -0.0228 |
| 0.0237 | 1.5707 | 1500 | 9.6342 | -37.0 | -33.75 | 0.4414 | -3.25 | -3664.0 | -4016.0 | 1.4141 | 0.3789 |
| 0.0231 | 1.6754 | 1600 | 9.6624 | -38.25 | -34.75 | 0.4531 | -3.5156 | -3776.0 | -4160.0 | 1.1016 | 0.1680 |
| 0.0199 | 1.7801 | 1700 | 13.2106 | -48.5 | -43.75 | 0.4512 | -4.75 | -4672.0 | -5152.0 | 1.8438 | 0.9062 |
| 0.0202 | 1.8848 | 1800 | 10.3211 | -41.0 | -37.75 | 0.4492 | -3.2344 | -4080.0 | -4416.0 | 0.6641 | -0.2930 |
| 0.0305 | 1.9895 | 1900 | 9.0914 | -35.5 | -33.25 | 0.4609 | -2.0625 | -3616.0 | -3856.0 | -0.5703 | -1.5 |
| 0.0093 | 2.0942 | 2000 | 12.3840 | -45.75 | -42.0 | 0.4512 | -3.5938 | -4480.0 | -4864.0 | 0.7969 | -0.1797 |
| 0.006 | 2.1990 | 2100 | 13.6169 | -49.5 | -45.25 | 0.4531 | -4.2188 | -4832.0 | -5280.0 | 1.4062 | 0.4277 |
| 0.0119 | 2.3037 | 2200 | 12.2264 | -45.75 | -41.75 | 0.4531 | -3.9844 | -4480.0 | -4896.0 | 1.4453 | 0.4785 |
| 0.0105 | 2.4084 | 2300 | 12.7440 | -47.5 | -43.25 | 0.4531 | -4.125 | -4608.0 | -5056.0 | 1.4062 | 0.4570 |
| 0.0077 | 2.5131 | 2400 | 13.4844 | -50.25 | -45.75 | 0.4512 | -4.3125 | -4864.0 | -5344.0 | 1.7656 | 0.8125 |
| 0.0149 | 2.6178 | 2500 | 13.7760 | -51.0 | -46.75 | 0.4551 | -4.3438 | -4960.0 | -5408.0 | 1.6562 | 0.7031 |
| 0.0045 | 2.7225 | 2600 | 14.2584 | -52.75 | -48.25 | 0.4551 | -4.5 | -5120.0 | -5600.0 | 1.9766 | 1.0078 |
| 0.0105 | 2.8272 | 2700 | 13.8720 | -51.5 | -47.0 | 0.4551 | -4.375 | -4992.0 | -5472.0 | 1.8203 | 0.8516 |
| 0.0065 | 2.9319 | 2800 | 13.7610 | -51.0 | -46.75 | 0.4570 | -4.3438 | -4960.0 | -5440.0 | 1.8125 | 0.8477 |

Framework versions

  • Transformers 4.44.2
  • Pytorch 2.3.0
  • Datasets 3.0.0
  • Tokenizers 0.19.1
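
A minimal usage sketch, assuming the repository bundles a tokenizer alongside the usual OpenELM custom modeling code (hence trust_remote_code=True); the repo id below is a placeholder for the full Hub path:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "OpenELM-1_1B-DPO-full-self-improve"  # placeholder; replace with the full Hub id

# OpenELM repositories ship custom modeling code, so trust_remote_code is required.
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo)  # assumption: a tokenizer is included in the repo

inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```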