
mistral-llm-recipes-en-ja-continuous-pretrained-v1-dev-finetune-docs-dpo-lora-debug

This model is a fine-tuned version of skim-wmt24/mistral-llm-recipes-en-ja-continuous-pretrained-v1-dev-finetune-docs-hf on an unspecified dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0000
  • Rewards/chosen: 2.2801
  • Rewards/rejected: -39.1533
  • Rewards/accuracies: 1.0
  • Rewards/margins: 41.4334
  • Logps/rejected: -445.0770
  • Logps/chosen: -359.7921
  • Logits/rejected: -1.3840
  • Logits/chosen: -0.5887
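
This repository appears to contain a PEFT (LoRA) adapter rather than full model weights, so it is normally loaded on top of the base checkpoint named above. The snippet below is a minimal, illustrative sketch, assuming the base and adapter repository IDs shown on this page; it is not an official usage example, and the dtype/device settings may need adjusting for your setup.

```python
# Minimal sketch: load the base model and apply this LoRA adapter with PEFT.
# Repository IDs are taken from this page; the prompt and generation settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "skim-wmt24/mistral-llm-recipes-en-ja-continuous-pretrained-v1-dev-finetune-docs-hf"
adapter_id = "shirayukikun/mistral-llm-recipes-en-ja-continuous-pretrained-v1-dev-finetune-docs-dpo-lora-debug"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, adapter_id)  # attaches the LoRA weights

prompt = "Hello, how are you?"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```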

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 4
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • training_steps: 1000
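
The reward and margin metrics above, together with the "dpo-lora" naming, suggest these hyperparameters were passed to a TRL-style DPO trainer over a PEFT LoRA adapter, though the card does not say so explicitly. The sketch below shows how they could map onto transformers.TrainingArguments and trl.DPOTrainer (assuming a TRL version contemporaneous with the Transformers 4.41.2 listed below); the dataset path, LoRA settings, and DPO beta are placeholders, not values from this card.

```python
# Illustrative mapping of the listed hyperparameters onto a TRL-style DPO setup.
# Dataset path, LoRA settings, and beta are assumptions, not taken from this card.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base_id = "skim-wmt24/mistral-llm-recipes-en-ja-continuous-pretrained-v1-dev-finetune-docs-hf"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Preference data with "prompt"/"chosen"/"rejected" columns (placeholder file name).
dataset = load_dataset("json", data_files="preference_pairs.jsonl")["train"]

training_args = TrainingArguments(
    output_dir="dpo-lora-debug",
    learning_rate=5e-5,
    per_device_train_batch_size=2,   # train_batch_size: 2
    per_device_eval_batch_size=2,    # eval_batch_size: 2
    gradient_accumulation_steps=2,   # total_train_batch_size: 4
    seed=42,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                # lr_scheduler_warmup_ratio: 0.1
    max_steps=1000,                  # training_steps: 1000
    optim="adamw_torch",             # Adam with betas=(0.9, 0.999), eps=1e-8 (defaults)
)

peft_config = LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32)  # assumed LoRA settings

trainer = DPOTrainer(
    model=model,
    ref_model=None,            # with a PEFT adapter, the frozen base model acts as the reference
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,   # DPO beta left at the trainer default (not reported on this card)
)
trainer.train()
```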

Training results

| Training Loss | Epoch  | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.0           | 0.2107 | 100  | 0.0000          | 2.3145         | -32.8393         | 1.0                | 35.1539         | -381.9371      | -359.4480    | -1.3607         | -0.5569       |
| 0.0           | 0.4215 | 200  | 0.0000          | 2.3217         | -36.0030         | 1.0                | 38.3247         | -413.5737      | -359.3759    | -1.3774         | -0.5735       |
| 0.0           | 0.6322 | 300  | 0.0000          | 2.2793         | -38.2847         | 1.0                | 40.5641         | -436.3912      | -359.8001    | -1.3815         | -0.5881       |
| 0.0           | 0.8430 | 400  | 0.0000          | 2.2847         | -38.6597         | 1.0                | 40.9444         | -440.1409      | -359.7462    | -1.3831         | -0.5885       |
| 0.0           | 1.0537 | 500  | 0.0000          | 2.2864         | -38.7575         | 1.0                | 41.0438         | -441.1187      | -359.7298    | -1.3832         | -0.5886       |
| 0.0           | 1.2645 | 600  | 0.0000          | 2.2828         | -38.9276         | 1.0                | 41.2104         | -442.8197      | -359.7650    | -1.3838         | -0.5891       |
| 0.0           | 1.4752 | 700  | 0.0000          | 2.2807         | -38.9749         | 1.0                | 41.2556         | -443.2929      | -359.7865    | -1.3838         | -0.5890       |
| 0.0           | 1.6860 | 800  | 0.0000          | 2.2817         | -39.1524         | 1.0                | 41.4341         | -445.0674      | -359.7761    | -1.3833         | -0.5885       |
| 0.0           | 1.8967 | 900  | 0.0000          | 2.2829         | -39.1605         | 1.0                | 41.4434         | -445.1483      | -359.7638    | -1.3833         | -0.5887       |
| 0.0           | 2.1075 | 1000 | 0.0000          | 2.2801         | -39.1533         | 1.0                | 41.4334         | -445.0770      | -359.7921    | -1.3840         | -0.5887       |
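
For readers unfamiliar with these columns: in DPO, the reported rewards are the β-scaled log-probability ratios of the policy against the reference model, and the margin and accuracy columns follow from them. This is the standard DPO definition, not something stated on this card, and β itself is not reported here.

```latex
% Standard DPO implicit rewards (beta-scaled log-probability ratios); beta is not reported on this card.
r_{\mathrm{chosen}}   = \beta \bigl( \log \pi_\theta(y_w \mid x) - \log \pi_{\mathrm{ref}}(y_w \mid x) \bigr), \qquad
r_{\mathrm{rejected}} = \beta \bigl( \log \pi_\theta(y_l \mid x) - \log \pi_{\mathrm{ref}}(y_l \mid x) \bigr)

\text{margins}    = r_{\mathrm{chosen}} - r_{\mathrm{rejected}}, \qquad
\text{accuracies} = \mathbb{E}\bigl[ \mathbf{1}\{ r_{\mathrm{chosen}} > r_{\mathrm{rejected}} \} \bigr]
```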

Framework versions

  • PEFT 0.11.1
  • Transformers 4.41.2
  • Pytorch 2.3.0a0+ebedce2
  • Datasets 2.20.0
  • Tokenizers 0.19.1