Llama-2-7b-hf-DPO-Filtered-0.2-version-4

This model is a fine-tuned version of meta-llama/Llama-2-7b-hf. The fine-tuning dataset is not specified. Its results on the evaluation set at each checkpoint are listed in the table at the end of this card.

Model description

More information needed

Training hyperparameters

More information needed

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6253	0.3007	289	0.4804	0.1896	-0.8278	0.7500	1.0174	-59.6087	-51.2020	-0.5776	-0.5728
0.7389	0.6015	578	0.6042	-0.7550	-1.2877	0.7500	0.5327	-64.2070	-60.6478	-0.5344	-0.5271
0.9409	0.9022	867	0.5284	-1.7710	-2.8304	0.8000	1.0594	-79.6340	-70.8077	-0.7636	-0.7588
0.1677	1.2029	1156	0.5894	-3.1943	-5.2391	0.8500	2.0448	-103.7217	-85.0409	-1.2480	-1.2513
0.7214	1.5036	1445	0.9016	-2.9280	-4.9877	0.7500	2.0597	-101.2072	-82.3778	-1.7146	-1.7203
0.0886	1.8044	1734	0.5467	-5.1233	-7.7654	0.8000	2.6421	-128.9842	-104.3309	-1.6795	-1.6833
0.0017	2.1051	2023	0.4601	-4.5145	-7.6515	0.9000	3.1370	-127.8453	-98.2430	-1.6638	-1.6679
0.0517	2.4058	2312	0.4787	-4.6994	-7.9152	0.8500	3.2159	-130.4826	-100.0915	-1.6608	-1.6651
0.0003	2.7066	2601	0.4722	-4.6553	-7.8727	0.8500	3.2174	-130.0573	-99.6505	-1.6599	-1.6641
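
The reward columns above can be sanity-checked with a short script. The sketch below assumes the usual DPO logging convention, in which Rewards/margins is the mean of (chosen reward minus rejected reward); the card itself does not state this. The names ROWS, max_margin_error, and best_step_by_val_loss are hypothetical helpers introduced for illustration, and the numbers are copied from the table.

```python
# Subset of columns copied from the results table:
# (step, validation loss, rewards/chosen, rewards/rejected,
#  rewards/accuracies, rewards/margins)
ROWS = [
    (289, 0.4804, 0.1896, -0.8278, 0.75, 1.0174),
    (578, 0.6042, -0.7550, -1.2877, 0.75, 0.5327),
    (867, 0.5284, -1.7710, -2.8304, 0.80, 1.0594),
    (1156, 0.5894, -3.1943, -5.2391, 0.85, 2.0448),
    (1445, 0.9016, -2.9280, -4.9877, 0.75, 2.0597),
    (1734, 0.5467, -5.1233, -7.7654, 0.80, 2.6421),
    (2023, 0.4601, -4.5145, -7.6515, 0.90, 3.1370),
    (2312, 0.4787, -4.6994, -7.9152, 0.85, 3.2159),
    (2601, 0.4722, -4.6553, -7.8727, 0.85, 3.2174),
]


def max_margin_error(rows):
    """Largest |margin - (chosen - rejected)| over all rows.

    Assumption: margins are logged as chosen minus rejected, so any
    difference should come only from four-decimal rounding.
    """
    return max(abs(m - (c - r)) for _, _, c, r, _, m in rows)


def best_step_by_val_loss(rows):
    """Step of the evaluation with the lowest validation loss."""
    return min(rows, key=lambda row: row[1])[0]


if __name__ == "__main__":
    print("max margin error:", max_margin_error(ROWS))
    print("lowest validation loss at step:", best_step_by_val_loss(ROWS))
```

Under that assumption, the margin column should agree with the difference of the two reward columns up to rounding. Selecting a checkpoint by validation loss alone is only one possible criterion; reward accuracy or margin could reasonably be used instead.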