metadata

tags:
  - trl
  - dpo
  - generated_from_trainer
model-index:
  - name: zephyr-7b-dpo-full
    results: []

zephyr-7b-dpo-full

This model was trained from scratch on an unknown dataset. It achieves the following results on the evaluation set:

Model description

More information needed

More information needed

More information needed

The following hyperparameters were used during training:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6512	0.1047	100	0.6511	-0.0190	-0.1266	0.6964	0.1076	-278.8090	-293.6257	-2.3851	-2.4490
0.5992	0.2093	200	0.5944	-0.2668	-0.6535	0.7103	0.3866	-331.5005	-318.4129	-1.7454	-1.8605
0.5469	0.3140	300	0.5530	-0.6557	-1.3199	0.7520	0.6642	-398.1460	-357.2993	-0.7401	-0.9693
0.5491	0.4186	400	0.5448	-1.0399	-1.6860	0.7282	0.6462	-434.7570	-395.7156	1.3254	0.9052
0.5351	0.5233	500	0.5296	-0.8199	-1.6144	0.7679	0.7945	-427.5919	-373.7142	2.7946	2.2107
0.4879	0.6279	600	0.5152	-1.1813	-2.0574	0.7619	0.8761	-471.8891	-409.8589	3.3049	2.6265
0.4963	0.7326	700	0.5121	-1.1447	-2.0602	0.7679	0.9156	-472.1772	-406.1937	3.7355	2.9642
0.5009	0.8373	800	0.5099	-1.1326	-2.0244	0.7679	0.8919	-468.5970	-404.9855	3.6202	2.8807
0.4926	0.9419	900	0.5090	-1.1007	-2.0002	0.7738	0.8995	-466.1724	-401.8018	3.6229	2.8669