opt-125m-dpo-full

This model is a fine-tuned version of SebastianSchramm/opt-125m-sft-full on an unspecified dataset (the card metadata records the dataset as None). It achieves the following results on the evaluation set:

Loss: 0.6160
Rewards/chosen: -0.9541
Rewards/rejected: -2.0866
Rewards/accuracies: 0.6765
Rewards/margins: 1.1325
Logps/rejected: -421.7949
Logps/chosen: -541.3610
Logits/rejected: -3.0587
Logits/chosen: -3.1037
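
The reward figures above are related to one another. In preference-optimization training logs the reward margin is usually the chosen reward minus the rejected reward. The check below is a minimal sketch that assumes this convention applies to this card; the variable names are illustrative and not taken from the training code.

```python
# Values copied from the evaluation results listed above.
rewards_chosen = -0.9541
rewards_rejected = -2.0866
reported_margin = 1.1325

# Assumed convention: margin = chosen reward - rejected reward.
computed_margin = rewards_chosen - rewards_rejected

# Agreement to four decimal places is consistent with the assumed convention.
margin_matches = abs(computed_margin - reported_margin) < 5e-5
print(f"computed margin: {computed_margin:.4f}, matches report: {margin_matches}")
```

A positive margin means the model assigns a higher implicit reward to the chosen response than to the rejected one on average, even though both rewards are negative here.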

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-07
train_batch_size: 4
eval_batch_size: 2
seed: 42
distributed_type: multi-GPU
gradient_accumulation_steps: 2
total_train_batch_size: 8
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.1
num_epochs: 3
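
To illustrate how learning_rate, lr_scheduler_type and lr_scheduler_warmup_ratio interact, the following sketch computes a linear warmup followed by a linear decay in plain Python. It is an approximation under stated assumptions, not the training code: the function name, the rounding of the warmup length, and the total step count are all hypothetical.

```python
def linear_schedule_lr(step, total_steps, peak_lr=5e-07, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    # Assumption: the warmup length is the ratio times total steps, rounded down.
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to peak_lr over the warmup phase.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical step count, chosen for illustration only.
TOTAL_STEPS = 20_000
for step in (0, 1_000, 2_000, 11_000, 20_000):
    print(step, linear_schedule_lr(step, TOTAL_STEPS))
```

With a warmup ratio of 0.1, the rate reaches its peak of 5e-07 after the first tenth of training and then falls linearly to zero by the final step.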

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
1.0169	0.13	1000	0.6485	-0.0512	-0.1732	0.6145	0.1220	-402.6611	-532.3322	-3.1391	-3.1931
0.9048	0.26	2000	0.6264	-0.5889	-1.0870	0.6325	0.4981	-411.7990	-537.7092	-3.0871	-3.1417
0.8198	0.39	3000	0.6522	-0.8130	-1.5553	0.6365	0.7424	-416.4820	-539.9495	-2.9890	-3.0594
0.7973	0.52	4000	0.6435	-0.7772	-1.6280	0.6450	0.8509	-417.2088	-539.5912	-3.0365	-3.1002
0.7659	0.65	5000	0.6419	-0.8487	-1.7568	0.6480	0.9081	-418.4963	-540.3063	-3.0726	-3.1246
0.6425	0.77	6000	0.6379	-0.9374	-1.9026	0.6555	0.9652	-419.9547	-541.1942	-3.1294	-3.1712
0.709	0.9	7000	0.6275	-0.8907	-1.8643	0.6610	0.9735	-419.5712	-540.7272	-3.0433	-3.0959
0.5569	1.03	8000	0.6325	-0.9352	-1.9355	0.6625	1.0003	-420.2840	-541.1722	-3.0149	-3.0760
0.6507	1.16	9000	0.6215	-0.9145	-1.9276	0.6700	1.0132	-420.2049	-540.9644	-2.9981	-3.0595
0.5921	1.29	10000	0.6201	-0.9696	-2.0113	0.6695	1.0417	-421.0416	-541.5154	-2.9905	-3.0538
0.581	1.42	11000	0.6231	-0.8880	-1.9400	0.6685	1.0521	-420.3290	-540.6996	-2.9769	-3.0403
0.6955	1.55	12000	0.6200	-0.8521	-1.9201	0.6715	1.0680	-420.1295	-540.3407	-2.9294	-3.0003
0.6388	1.68	13000	0.6221	-0.9373	-2.0216	0.6735	1.0843	-421.1445	-541.1925	-2.9834	-3.0472
0.511	1.81	14000	0.6167	-0.8495	-1.9379	0.6715	1.0884	-420.3077	-540.3145	-3.0078	-3.0625
0.5239	1.94	15000	0.6158	-0.8967	-1.9849	0.6775	1.0882	-420.7780	-540.7867	-3.0404	-3.0908
0.5769	2.07	16000	0.6220	-0.9706	-2.0850	0.6695	1.1144	-421.7786	-541.5255	-3.0230	-3.0752
0.407	2.19	17000	0.6137	-0.9421	-2.0587	0.6755	1.1166	-421.5154	-541.2402	-3.0224	-3.0743
0.5732	2.32	18000	0.6119	-0.8997	-2.0121	0.6740	1.1124	-421.0493	-540.8169	-3.0294	-3.0811
0.6627	2.45	19000	0.6143	-0.9421	-2.0649	0.6755	1.1228	-421.5779	-541.2407	-3.0363	-3.0864
0.568	2.58	20000	0.6163	-0.9679	-2.0994	0.6780	1.1316	-421.9230	-541.4983	-3.0553	-3.1021
0.5467	2.71	21000	0.6156	-0.9578	-2.0832	0.6780	1.1254	-421.7610	-541.3981	-3.0488	-3.0957
0.4785	2.84	22000	0.6160	-0.9527	-2.0818	0.6755	1.1290	-421.7462	-541.3470	-3.0554	-3.1020
0.4905	2.97	23000	0.6161	-0.9537	-2.0835	0.6770	1.1298	-421.7638	-541.3571	-3.0583	-3.1056
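
The table can be scanned programmatically to find the evaluation step with the lowest validation loss. The snippet below is a small sketch that uses only the Step and Validation Loss columns copied from the table; the variable names are illustrative.

```python
# (step, validation loss) pairs copied from the training results table.
eval_log = [
    (1000, 0.6485), (2000, 0.6264), (3000, 0.6522), (4000, 0.6435),
    (5000, 0.6419), (6000, 0.6379), (7000, 0.6275), (8000, 0.6325),
    (9000, 0.6215), (10000, 0.6201), (11000, 0.6231), (12000, 0.6200),
    (13000, 0.6221), (14000, 0.6167), (15000, 0.6158), (16000, 0.6220),
    (17000, 0.6137), (18000, 0.6119), (19000, 0.6143), (20000, 0.6163),
    (21000, 0.6156), (22000, 0.6160), (23000, 0.6161),
]

# Pick the row whose validation loss is smallest.
best_step, best_loss = min(eval_log, key=lambda row: row[1])
print(f"lowest validation loss {best_loss:.4f} at step {best_step}")
```

The validation loss bottoms out partway through the third epoch and changes only in the third or fourth decimal place afterwards, while the reward margin keeps rising slowly.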

Framework versions

Transformers 4.35.0
Pytorch 2.1.0+cu121
Datasets 2.14.6
Tokenizers 0.14.1
