llama3-wpo-lora

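This model is a fine-tuned version of princeton-nlp/Llama-3-Base-8B-SFT; the training dataset is not specified. At the final evaluation (step 900, epoch 0.9422) it reaches a validation loss of 0.5175, a reward accuracy of 0.7280, and a reward margin of 0.8575.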

Model description

More information needed

The training hyperparameters are not listed. The following metrics were logged every 100 steps during training:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logps/ref Response	Logits/rejected	Logits/chosen
0.6142	0.1047	100	0.5973	0.2024	-0.1309	0.7020	0.3333	-277.9861	-290.5232	-0.5364	-0.5487	-0.5543
0.5579	0.2094	200	0.5483	-0.0751	-0.7065	0.7120	0.6313	-283.7411	-293.2985	-0.5364	-0.4847	-0.5042
0.5402	0.3141	300	0.5354	-0.1318	-0.8578	0.7260	0.7260	-285.2545	-293.8653	-0.5364	-0.4387	-0.4637
0.5112	0.4187	400	0.5277	-0.1698	-0.9670	0.7220	0.7973	-286.3469	-294.2450	-0.5364	-0.3715	-0.4030
0.5319	0.5234	500	0.5212	-0.1546	-0.9783	0.7260	0.8237	-286.4595	-294.0932	-0.5364	-0.3377	-0.3727
0.5155	0.6281	600	0.5195	-0.0851	-0.9285	0.7360	0.8434	-285.9612	-293.3980	-0.5364	-0.3247	-0.3608
0.5113	0.7328	700	0.5173	-0.1941	-1.0489	0.7340	0.8547	-287.1652	-294.4885	-0.5364	-0.3036	-0.3411
0.5268	0.8375	800	0.5177	-0.0457	-0.9023	0.7220	0.8566	-285.7000	-293.0044	-0.5364	-0.3082	-0.3453
0.4923	0.9422	900	0.5175	-0.0517	-0.9092	0.7280	0.8575	-285.7691	-293.0645	-0.5364	-0.3072	-0.3443
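
The reward columns above can be cross-checked against each other. The sketch below assumes the usual convention in preference-optimization trainers, where Rewards/margins is the mean of (chosen reward minus rejected reward), so it should equal Rewards/chosen minus Rewards/rejected up to rounding. The values are copied from the table; the names ROWS and margin_gaps are illustrative and not part of any library.

```python
# (step, rewards/chosen, rewards/rejected, rewards/margins), copied from the table.
ROWS = [
    (100, 0.2024, -0.1309, 0.3333),
    (200, -0.0751, -0.7065, 0.6313),
    (300, -0.1318, -0.8578, 0.7260),
    (400, -0.1698, -0.9670, 0.7973),
    (500, -0.1546, -0.9783, 0.8237),
    (600, -0.0851, -0.9285, 0.8434),
    (700, -0.1941, -1.0489, 0.8547),
    (800, -0.0457, -0.9023, 0.8566),
    (900, -0.0517, -0.9092, 0.8575),
]


def margin_gaps(rows):
    """Return (step, |chosen - rejected - margin|) for each logged row."""
    return [
        (step, abs((chosen - rejected) - margin))
        for step, chosen, rejected, margin in rows
    ]


if __name__ == "__main__":
    for step, gap in margin_gaps(ROWS):
        # Gaps near 1e-4 are consistent with four-decimal rounding in the table.
        print(f"step {step}: gap = {gap:.4f}")
```

A gap much larger than the rounding error would suggest that the margin is computed differently from the assumption stated above.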