Llama-2-7b-hf-eval_threapist-DPO-filtered-0.2-local-version-1

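This model is a fine-tuned version of meta-llama/Llama-2-7b-hf; the training dataset is not specified. The evaluation-set metrics logged during training are in the table at the end of this card. The last logged evaluation (step 470) shows a validation loss of 0.8124 and a reward accuracy of 0.5.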

Model description

More information needed

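The hyperparameters used during training are not listed in this card. The results below show that training ran for 2 epochs (470 steps), with an evaluation every 94 steps (0.4 epoch).

Training results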

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.4633	0.4	94	0.7283	-0.4834	-0.5276	0.5	0.0443	-51.3107	-51.7779	-0.9037	-0.8864
0.7207	0.8	188	0.7368	-0.2345	-0.2555	0.6000	0.0210	-48.5893	-49.2894	-0.7571	-0.7388
0.3511	1.2	282	0.7674	-0.4686	-0.6027	0.5500	0.1341	-52.0614	-51.6306	-1.0377	-1.0234
0.6543	1.6	376	0.8144	-0.4911	-0.6379	0.5	0.1468	-52.4135	-51.8557	-1.2458	-1.2342
0.123	2.0	470	0.8124	-0.4914	-0.6417	0.5	0.1503	-52.4510	-51.8579	-1.2607	-1.2496
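
The Rewards/margins column can be cross-checked against the other two reward columns. The sketch below assumes the usual DPO logging convention, in which the margin is the chosen reward minus the rejected reward; the card itself does not state which trainer produced these logs. The helper names `margin_gap` and `best_step` are illustrative only, and the rows are copied from the table above.

```python
# Rows copied from the training results table:
# (step, validation_loss, rewards_chosen, rewards_rejected, rewards_margin)
ROWS = [
    (94, 0.7283, -0.4834, -0.5276, 0.0443),
    (188, 0.7368, -0.2345, -0.2555, 0.0210),
    (282, 0.7674, -0.4686, -0.6027, 0.1341),
    (376, 0.8144, -0.4911, -0.6379, 0.1468),
    (470, 0.8124, -0.4914, -0.6417, 0.1503),
]


def margin_gap(row):
    """Absolute difference between the reported margin and chosen - rejected.

    Assumption: margin = chosen reward - rejected reward (common DPO convention).
    """
    _, _, chosen, rejected, margin = row
    return abs((chosen - rejected) - margin)


def best_step(rows):
    """Return the step whose logged validation loss is lowest."""
    return min(rows, key=lambda r: r[1])[0]


if __name__ == "__main__":
    for row in ROWS:
        # Gaps of about 1e-4 are expected, because each column was rounded separately.
        print(f"step {row[0]}: margin gap {margin_gap(row):.4f}")
    print("lowest validation loss at step", best_step(ROWS))
```

Under that assumption the reported margins agree with the reward columns up to rounding. The same rows show the validation loss rising from 0.7283 at step 94 to 0.8124 at step 470, so an earlier checkpoint may be worth comparing against the final one if intermediate checkpoints were saved.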