llama_SFT_e1_DPO_e2

This model is a fine-tuned version of meta-llama/Llama-2-7b-hf on an unknown dataset. It achieves the following results on the evaluation set:

  • Loss: 0.1258
  • Rewards/chosen: 0.3605
  • Rewards/rejected: -1.7770
  • Rewards/accuracies: 1.0
  • Rewards/margins: 2.1375
  • Logps/rejected: -203.4181
  • Logps/chosen: -156.2596
  • Logits/rejected: -1.0532
  • Logits/chosen: -0.8665
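
The framework list below includes PEFT, so the published weights are an adapter rather than full model weights. A minimal loading sketch, assuming the adapter lives at thorirhrafn/llama_SFT_e1_DPO_e2 (the repo path on this card) and that you have access to the gated Llama-2 base model:

```python
# Minimal loading sketch: applies the PEFT adapter from this repo on top of
# the meta-llama/Llama-2-7b-hf base model named in the card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-hf"
adapter_id = "thorirhrafn/llama_SFT_e1_DPO_e2"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_id)

prompt = "Hello"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```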

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 7e-07
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 8
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 2
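
The Rewards/* and Logps/* metrics above are the ones trl's DPOTrainer logs, so the run was presumably a DPO fine-tune on top of the SFT adapter. Below is a hedged sketch of a setup matching these hyperparameters; the use of trl, the LoRA settings, the DPO beta, and the dataset are assumptions not stated in the card.

```python
# Hedged reconstruction, not the author's script: trl's DPOTrainer, the LoRA
# config, beta, and the dataset are assumptions; learning rate, batch sizes,
# scheduler, epochs, and seed come from the hyperparameter list above.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base_id = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token

# Placeholder: the card does not name the preference dataset. DPOTrainer
# expects "prompt", "chosen", and "rejected" columns.
dataset = load_dataset("your/preference-dataset", split="train")

args = TrainingArguments(
    output_dir="llama_SFT_e1_DPO_e2",
    learning_rate=7e-7,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,  # 1 per device x 8 steps = total batch size 8
    num_train_epochs=2,
    lr_scheduler_type="linear",
    seed=42,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with a peft_config, trl uses the frozen base as reference
    args=args,
    beta=0.1,  # assumption: the DPO beta is not listed in the card
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),  # assumed values
)
trainer.train()
```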

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6825 | 0.1 | 25 | 0.6596 | 0.0243 | -0.0451 | 0.8667 | 0.0694 | -186.0986 | -159.6209 | -1.0534 | -0.8570 |
| 0.6018 | 0.2 | 50 | 0.5820 | 0.0671 | -0.1728 | 0.9800 | 0.2399 | -187.3757 | -159.1936 | -1.0531 | -0.8568 |
| 0.5333 | 0.3 | 75 | 0.5021 | 0.1133 | -0.3236 | 1.0 | 0.4369 | -188.8834 | -158.7311 | -1.0544 | -0.8586 |
| 0.4522 | 0.4 | 100 | 0.4213 | 0.1615 | -0.5029 | 1.0 | 0.6644 | -190.6768 | -158.2497 | -1.0547 | -0.8596 |
| 0.3962 | 0.5 | 125 | 0.3555 | 0.1988 | -0.6844 | 1.0 | 0.8832 | -192.4913 | -157.8759 | -1.0548 | -0.8608 |
| 0.3164 | 0.6 | 150 | 0.2920 | 0.2416 | -0.8872 | 1.0 | 1.1288 | -194.5195 | -157.4483 | -1.0550 | -0.8660 |
| 0.2673 | 0.7 | 175 | 0.2400 | 0.2789 | -1.0936 | 1.0 | 1.3725 | -196.5838 | -157.0758 | -1.0540 | -0.8656 |
| 0.217 | 0.79 | 200 | 0.2008 | 0.3028 | -1.2873 | 1.0 | 1.5900 | -198.5201 | -156.8367 | -1.0540 | -0.8668 |
| 0.1822 | 0.89 | 225 | 0.1694 | 0.3294 | -1.4600 | 1.0 | 1.7894 | -200.2475 | -156.5703 | -1.0541 | -0.8674 |
| 0.1578 | 0.99 | 250 | 0.1483 | 0.3436 | -1.6056 | 1.0 | 1.9492 | -201.7036 | -156.4280 | -1.0538 | -0.8668 |
| 0.1509 | 1.09 | 275 | 0.1364 | 0.3512 | -1.6903 | 1.0 | 2.0414 | -202.5503 | -156.3527 | -1.0534 | -0.8666 |
| 0.1273 | 1.19 | 300 | 0.1322 | 0.3561 | -1.7242 | 1.0 | 2.0804 | -202.8900 | -156.3031 | -1.0532 | -0.8657 |
| 0.1208 | 1.29 | 325 | 0.1284 | 0.3561 | -1.7546 | 1.0 | 2.1106 | -203.1934 | -156.3038 | -1.0534 | -0.8668 |
| 0.1325 | 1.39 | 350 | 0.1270 | 0.3598 | -1.7654 | 1.0 | 2.1252 | -203.3020 | -156.2663 | -1.0532 | -0.8665 |
| 0.1287 | 1.49 | 375 | 0.1263 | 0.3618 | -1.7718 | 1.0 | 2.1336 | -203.3654 | -156.2462 | -1.0534 | -0.8666 |
| 0.1203 | 1.59 | 400 | 0.1252 | 0.3624 | -1.7783 | 1.0 | 2.1407 | -203.4305 | -156.2402 | -1.0532 | -0.8666 |
| 0.1188 | 1.69 | 425 | 0.1254 | 0.3610 | -1.7767 | 1.0 | 2.1377 | -203.4145 | -156.2542 | -1.0530 | -0.8664 |
| 0.1331 | 1.79 | 450 | 0.1253 | 0.3640 | -1.7760 | 1.0 | 2.1400 | -203.4073 | -156.2242 | -1.0531 | -0.8662 |
| 0.1301 | 1.89 | 475 | 0.1252 | 0.3641 | -1.7772 | 1.0 | 2.1413 | -203.4194 | -156.2230 | -1.0531 | -0.8667 |
| 0.1289 | 1.99 | 500 | 0.1258 | 0.3605 | -1.7770 | 1.0 | 2.1375 | -203.4181 | -156.2596 | -1.0532 | -0.8665 |
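
In these logs, Rewards/margins is simply Rewards/chosen minus Rewards/rejected (at the final step, 0.3605 - (-1.7770) = 2.1375), and a Rewards/accuracies of 1.0 means every chosen response in the evaluation set is scored above its rejected counterpart.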

Framework versions

  • PEFT 0.8.2
  • Transformers 4.38.1
  • PyTorch 2.2.0+cu118
  • Datasets 2.17.1
  • Tokenizers 0.15.2
