
llama_SFT_e1_DPO_e3

This model is a fine-tuned version of meta-llama/Llama-2-7b-hf on an unknown dataset. It achieves the following results on the evaluation set (a short note on how these DPO metrics are typically computed follows the list):

  • Loss: 0.0745
  • Rewards/chosen: 0.4581
  • Rewards/rejected: -2.2850
  • Rewards/accuracies: 1.0
  • Rewards/margins: 2.7431
  • Logps/rejected: -231.3585
  • Logps/chosen: -178.1707
  • Logits/rejected: -1.0559
  • Logits/chosen: -0.8886
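
These are the quantities logged by TRL's DPO trainer. Assuming the standard DPO formulation (the implicit reward is the scaled log-probability ratio between the policy and the frozen reference model), they relate roughly as:

$$
r_\theta(x, y) = \beta \left( \log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x) \right),
\qquad
\text{margin} = r_\theta(x, y_{\text{chosen}}) - r_\theta(x, y_{\text{rejected}})
$$

Rewards/accuracies is the fraction of evaluation pairs with a positive margin, and Logps/chosen and Logps/rejected are the policy's total log-probabilities of the chosen and rejected completions. A margin of about 2.74 with accuracy 1.0 therefore means the adapter ranks every chosen response in the evaluation set above its rejected counterpart. The value of β is not recorded in this card, so the exact scale of the reward columns is an assumption.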

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a hedged training-setup sketch follows the list):

  • learning_rate: 7e-07
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 8
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 3
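
The card does not include the training script. The following is a minimal sketch of how this DPO stage could be reproduced with TRL's DPOTrainer and a PEFT LoRA config, using the hyperparameters listed above. The dataset path, the LoRA settings, and beta=0.1 are assumptions (none of them are recorded in this card); the DPOTrainer call shown matches the TRL 0.7.x releases that were current alongside Transformers 4.38.

```python
# Hedged sketch only: dataset path, LoRA settings, and beta are assumptions;
# only the TrainingArguments values below are taken from this card.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# Preference dataset with "prompt", "chosen", "rejected" columns (placeholder name).
dataset = load_dataset("your-org/your-preference-dataset", split="train")

peft_config = LoraConfig(  # LoRA settings are illustrative, not from the card
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)

args = TrainingArguments(           # values below mirror the card
    output_dir="llama_SFT_e1_DPO_e3",
    learning_rate=7e-7,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,  # total train batch size 8
    num_train_epochs=3,
    lr_scheduler_type="linear",
    seed=42,
)

trainer = DPOTrainer(
    model,
    ref_model=None,   # with a PEFT adapter, TRL uses the frozen base model as reference
    beta=0.1,         # assumed; not recorded in the card
    args=args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```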

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---------------|-------|------|-----------------|----------------|------------------|--------------------|-----------------|----------------|--------------|-----------------|---------------|
| 0.6769 | 0.1  | 25  | 0.6597 | 0.0250 | -0.0447 | 0.8267 | 0.0696 | -208.9552 | -182.5023 | -1.0564 | -0.8828 |
| 0.6046 | 0.2  | 50  | 0.5848 | 0.0656 | -0.1679 | 0.9767 | 0.2335 | -210.1874 | -182.0954 | -1.0566 | -0.8835 |
| 0.5278 | 0.3  | 75  | 0.4947 | 0.1260 | -0.3302 | 1.0    | 0.4561 | -211.8100 | -181.4921 | -1.0569 | -0.8842 |
| 0.4449 | 0.4  | 100 | 0.4127 | 0.1764 | -0.5163 | 1.0    | 0.6927 | -213.6715 | -180.9882 | -1.0574 | -0.8860 |
| 0.3778 | 0.5  | 125 | 0.3368 | 0.2273 | -0.7228 | 1.0    | 0.9502 | -215.7366 | -180.4786 | -1.0575 | -0.8860 |
| 0.2947 | 0.6  | 150 | 0.2704 | 0.2755 | -0.9495 | 1.0    | 1.2250 | -218.0034 | -179.9972 | -1.0567 | -0.8893 |
| 0.2451 | 0.7  | 175 | 0.2164 | 0.3191 | -1.1794 | 1.0    | 1.4985 | -220.3023 | -179.5607 | -1.0570 | -0.8899 |
| 0.1913 | 0.79 | 200 | 0.1737 | 0.3571 | -1.4023 | 1.0    | 1.7594 | -222.5310 | -179.1809 | -1.0573 | -0.8904 |
| 0.1553 | 0.89 | 225 | 0.1412 | 0.3875 | -1.6187 | 1.0    | 2.0062 | -224.6958 | -178.8769 | -1.0564 | -0.8903 |
| 0.1265 | 0.99 | 250 | 0.1173 | 0.4068 | -1.8122 | 1.0    | 2.2190 | -226.6304 | -178.6835 | -1.0564 | -0.8882 |
| 0.1174 | 1.09 | 275 | 0.1029 | 0.4201 | -1.9500 | 1.0    | 2.3701 | -228.0080 | -178.5508 | -1.0555 | -0.8881 |
| 0.0915 | 1.19 | 300 | 0.0938 | 0.4300 | -2.0411 | 1.0    | 2.4711 | -228.9190 | -178.4516 | -1.0555 | -0.8891 |
| 0.0819 | 1.29 | 325 | 0.0880 | 0.4386 | -2.1112 | 1.0    | 2.5497 | -229.6201 | -178.3662 | -1.0554 | -0.8884 |
| 0.09   | 1.39 | 350 | 0.0838 | 0.4485 | -2.1597 | 1.0    | 2.6082 | -230.1051 | -178.2668 | -1.0556 | -0.8886 |
| 0.0865 | 1.49 | 375 | 0.0803 | 0.4551 | -2.2012 | 1.0    | 2.6562 | -230.5201 | -178.2012 | -1.0556 | -0.8904 |
| 0.0731 | 1.59 | 400 | 0.0787 | 0.4544 | -2.2264 | 1.0    | 2.6807 | -230.7719 | -178.2080 | -1.0555 | -0.8884 |
| 0.0734 | 1.69 | 425 | 0.0769 | 0.4580 | -2.2470 | 1.0    | 2.7050 | -230.9783 | -178.1717 | -1.0552 | -0.8884 |
| 0.0827 | 1.79 | 450 | 0.0763 | 0.4591 | -2.2582 | 1.0    | 2.7173 | -231.0906 | -178.1612 | -1.0555 | -0.8884 |
| 0.0784 | 1.89 | 475 | 0.0756 | 0.4545 | -2.2709 | 1.0    | 2.7253 | -231.2172 | -178.2073 | -1.0556 | -0.8886 |
| 0.0768 | 1.99 | 500 | 0.0751 | 0.4564 | -2.2804 | 1.0    | 2.7368 | -231.3123 | -178.1877 | -1.0556 | -0.8883 |
| 0.0696 | 2.09 | 525 | 0.0746 | 0.4607 | -2.2824 | 1.0    | 2.7431 | -231.3322 | -178.1449 | -1.0554 | -0.8901 |
| 0.0691 | 2.19 | 550 | 0.0743 | 0.4597 | -2.2852 | 1.0    | 2.7449 | -231.3599 | -178.1548 | -1.0557 | -0.8886 |
| 0.07   | 2.29 | 575 | 0.0747 | 0.4597 | -2.2807 | 1.0    | 2.7404 | -231.3157 | -178.1549 | -1.0559 | -0.8889 |
| 0.0754 | 2.38 | 600 | 0.0744 | 0.4599 | -2.2865 | 1.0    | 2.7463 | -231.3729 | -178.1530 | -1.0555 | -0.8903 |
| 0.0833 | 2.48 | 625 | 0.0743 | 0.4588 | -2.2878 | 1.0    | 2.7466 | -231.3862 | -178.1637 | -1.0557 | -0.8887 |
| 0.0761 | 2.58 | 650 | 0.0746 | 0.4583 | -2.2856 | 1.0    | 2.7440 | -231.3646 | -178.1684 | -1.0557 | -0.8889 |
| 0.0716 | 2.68 | 675 | 0.0745 | 0.4597 | -2.2866 | 1.0    | 2.7463 | -231.3745 | -178.1546 | -1.0558 | -0.8905 |
| 0.0755 | 2.78 | 700 | 0.0746 | 0.4592 | -2.2835 | 1.0    | 2.7426 | -231.3430 | -178.1601 | -1.0560 | -0.8889 |
| 0.063  | 2.88 | 725 | 0.0740 | 0.4618 | -2.2897 | 1.0    | 2.7516 | -231.4058 | -178.1337 | -1.0556 | -0.8884 |
| 0.0743 | 2.98 | 750 | 0.0745 | 0.4581 | -2.2850 | 1.0    | 2.7431 | -231.3585 | -178.1707 | -1.0559 | -0.8886 |

Framework versions

  • PEFT 0.8.2
  • Transformers 4.38.1
  • PyTorch 2.2.0+cu118
  • Datasets 2.17.1
  • Tokenizers 0.15.2
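
As a usage example, a LoRA adapter produced with the framework versions above can typically be loaded on top of the base model as shown below. The adapter repository id is a placeholder, since the card does not state where the adapter is hosted.

```python
# Hedged sketch: the adapter repository id is a placeholder, not from this card.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_id = "your-username/llama_SFT_e1_DPO_e3"  # placeholder repo id

# Loads the base Llama-2-7b weights and applies the DPO-trained adapter.
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_id, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

prompt = "Explain what direct preference optimization does in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```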