metadata

tags:
  - trl
  - dpo
  - generated_from_trainer
model-index:
  - name: zephyr-7b-dpo-full
    results: []

zephyr-7b-dpo-full

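This model was trained with DPO using TRL, as indicated by the tags above. The training dataset was not recorded in the card metadata. Evaluation-set metrics are reported per step in the table of training results below.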

Model description

More information needed

Training hyperparameters

More information needed

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.111	0.21	100	0.1080	-0.3300	-0.6434	0.7148	0.3134	-375.9606	-336.5851	0.4520	0.3976
0.0697	0.42	200	0.0728	-0.5844	-1.2213	0.7422	0.6369	-433.7567	-362.0242	0.4101	0.3267
0.055	0.63	300	0.0610	-0.7945	-1.5421	0.7266	0.7476	-465.8376	-383.0369	0.2780	0.2451
0.0573	0.84	400	0.0566	-0.8305	-1.5952	0.7383	0.7647	-471.1477	-386.6394	0.2561	0.2348
0.0215	1.05	500	0.0327	-1.6150	-2.8668	0.7305	1.2517	-598.3008	-465.0880	0.2419	0.2221
0.0139	1.26	600	0.0260	-1.8080	-3.0895	0.7227	1.2815	-620.5768	-484.3871	0.2916	0.2601
0.0125	1.47	700	0.0247	-1.9121	-3.1886	0.7305	1.2765	-630.4850	-494.7950	0.2947	0.2614
0.0107	1.67	800	0.0226	-1.9947	-3.2951	0.7188	1.3004	-641.1344	-503.0576	0.3196	0.2841
0.0106	1.88	900	0.0224	-1.9945	-3.2919	0.7148	1.2974	-640.8138	-503.0325	0.3215	0.2841
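
The Rewards/margins column reads as the gap between Rewards/chosen and Rewards/rejected, which matches how TRL's DPO trainer reports its margin metric (the mean of chosen minus rejected rewards). The following is a minimal sketch that checks this against the rows above. The names `ROWS` and `margin_gaps` are illustrative and not part of TRL, and the values are copied by hand from the table, so small differences from rounding to four decimals are expected.

```python
# Rows copied from the training results table:
# (step, rewards_chosen, rewards_rejected, rewards_margins)
ROWS = [
    (100, -0.3300, -0.6434, 0.3134),
    (200, -0.5844, -1.2213, 0.6369),
    (300, -0.7945, -1.5421, 0.7476),
    (400, -0.8305, -1.5952, 0.7647),
    (500, -1.6150, -2.8668, 1.2517),
    (600, -1.8080, -3.0895, 1.2815),
    (700, -1.9121, -3.1886, 1.2765),
    (800, -1.9947, -3.2951, 1.3004),
    (900, -1.9945, -3.2919, 1.2974),
]


def margin_gaps(rows):
    """Return (step, gap) pairs, where gap is the absolute difference
    between (chosen - rejected) and the reported margin."""
    return [
        (step, abs((chosen - rejected) - margin))
        for step, chosen, rejected, margin in rows
    ]


if __name__ == "__main__":
    for step, gap in margin_gaps(ROWS):
        print(f"step {step}: |(chosen - rejected) - margin| = {gap:.4f}")
```

A gap within rounding error on every row suggests the margin column carries no information beyond the two reward columns. The growing margin across steps comes mostly from the rejected reward falling faster than the chosen reward, since both rewards decrease over training.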