Tags: Safetensors · llama · alignment-handbook · trl · dpo · Generated from Trainer

ds_chat_sppo_hard_iter0_2024-09-15-01.39

This model is a fine-tuned version of deepseek-ai/deepseek-llm-7b-chat on the self-generate/ds_chat_original_cn_mining_oj_iter0-binarized, the self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized and the self-generate/ds_chat_original_cn_rl_oj_iter0-binarized datasets. It achieves the following results on the evaluation set:

  • Loss: 4624.1011
  • Rewards/chosen: 0.0051
  • Rewards/rejected: -0.0370
  • Rewards/accuracies: 0.5789
  • Rewards/margins: 0.0421
  • Logps/rejected: -263.3607
  • Logps/chosen: -252.4096
  • Logits/rejected: 1.4404
  • Logits/chosen: 1.3959
  • Debug/policy Chosen Logits: 1.3959
  • Debug/policy Rejected Logits: 1.4404
  • Debug/policy Chosen Logps: -252.4096
  • Debug/policy Rejected Logps: -263.3607
  • Debug/reference Chosen Logps: -252.9185
  • Debug/reference Rejected Logps: -259.6586
  • Debug/sppo Chosen Reward In Loss: 0.5089
  • Debug/sppo Rej Reward In Loss: -3.7021
  • Debug/sppo Chosen Loss: 2526.5620
  • Debug/sppo Reject Loss: 2309.3242
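
The reward columns follow the usual DPO-style definition: each reward is β times the gap between the policy and reference log-probabilities of a completion, and the Debug/sppo * Reward In Loss values are the unscaled gaps. From the numbers above the implied β is roughly 0.01; this is inferred from the logs, not stated in the card. A minimal check:

```python
# Recompute the reported rewards from the logged log-probabilities.
# beta = 0.01 is inferred from the numbers below, not stated in this card.
beta = 0.01

policy_chosen_logps = -252.4096       # Debug/policy Chosen Logps
policy_rejected_logps = -263.3607     # Debug/policy Rejected Logps
reference_chosen_logps = -252.9185    # Debug/reference Chosen Logps
reference_rejected_logps = -259.6586  # Debug/reference Rejected Logps

chosen_gap = policy_chosen_logps - reference_chosen_logps        # 0.5089  (Debug/sppo Chosen Reward In Loss)
rejected_gap = policy_rejected_logps - reference_rejected_logps  # -3.7021 (Debug/sppo Rej Reward In Loss)

print(round(beta * chosen_gap, 4))                   # 0.0051  -> Rewards/chosen
print(round(beta * rejected_gap, 4))                 # -0.037  -> Rewards/rejected
print(round(beta * (chosen_gap - rejected_gap), 4))  # 0.0421  -> Rewards/margins
```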

Model description

More information needed

Intended uses & limitations

More information needed
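
A minimal loading sketch, assuming the checkpoint can be used like its deepseek-ai/deepseek-llm-7b-chat base (chat template and generation settings inherited from the base; untested for this specific checkpoint):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yiran-wang3/ds_chat_sppo_hard_iter0_nomask_linear_schedule"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the checkpoint is stored in BF16
    device_map="auto",
)

# Chat template is assumed to be the one shipped with the deepseek-llm-7b-chat base.
messages = [{"role": "user", "content": "Write a Python function that reverses a string."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```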

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-07
  • train_batch_size: 8
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • lr_scheduler_warmup_steps: 100
  • num_epochs: 8.0
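
Note that in the Hugging Face Trainer a non-zero warmup_steps takes precedence over warmup_ratio, so the effective warmup here is 100 steps. As a rough sketch (not the exact training script), the values above map onto a TRL DPOConfig as follows; the run itself used an SPPO-style objective via the alignment-handbook recipe, and the β of 0.01 is inferred from the reward logs rather than listed above:

```python
# Sketch only: the listed hyperparameters expressed as a trl DPOConfig
# (a TrainingArguments subclass). The actual run used an SPPO-style loss
# from the alignment-handbook recipe, which this sketch does not reproduce.
from trl import DPOConfig

training_args = DPOConfig(
    output_dir="ds_chat_sppo_hard_iter0",  # hypothetical output path
    learning_rate=1e-7,
    per_device_train_batch_size=8,   # 8 per device x 8 GPUs = 64 total
    per_device_eval_batch_size=4,    # 4 per device x 8 GPUs = 32 total
    seed=42,
    num_train_epochs=8.0,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    warmup_steps=100,                # overrides warmup_ratio when > 0
    bf16=True,                       # checkpoint is saved in BF16
    beta=0.01,                       # inferred from the reward logs, not listed above
)
```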

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen | Debug/policy Chosen Logits | Debug/policy Rejected Logits | Debug/policy Chosen Logps | Debug/policy Rejected Logps | Debug/reference Chosen Logps | Debug/reference Rejected Logps | Debug/sppo Chosen Reward In Loss | Debug/sppo Rej Reward In Loss | Debug/sppo Chosen Loss | Debug/sppo Reject Loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4975.3273 | 0.3623 | 100 | 4981.6489 | -0.0033 | -0.0038 | 0.4605 | 0.0004 | -260.0373 | -253.2532 | 1.7010 | 1.6372 | 1.6372 | 1.7010 | -253.2532 | -260.0373 | -252.9185 | -259.6586 | -0.3347 | -0.3786 | 2534.3679 | 2463.3860 |
| 4930.2141 | 0.7246 | 200 | 4924.0649 | -0.0013 | -0.0060 | 0.5789 | 0.0047 | -260.2596 | -253.0476 | 1.6680 | 1.6070 | 1.6070 | 1.6680 | -253.0476 | -260.2596 | -252.9185 | -259.6586 | -0.1291 | -0.6009 | 2514.6309 | 2444.3210 |
| 4841.2859 | 1.0870 | 300 | 4866.0864 | -0.0095 | -0.0185 | 0.5395 | 0.0089 | -261.5047 | -253.8716 | 1.6500 | 1.5926 | 1.5926 | 1.6500 | -253.8716 | -261.5047 | -252.9185 | -259.6586 | -0.9531 | -1.8460 | 2603.5461 | 2331.7520 |
| 4822.266 | 1.4493 | 400 | 4827.9761 | -0.0173 | -0.0295 | 0.5395 | 0.0122 | -262.6080 | -254.6497 | 1.6162 | 1.5603 | 1.5603 | 1.6162 | -254.6497 | -262.6080 | -252.9185 | -259.6586 | -1.7313 | -2.9494 | 2692.5408 | 2243.4092 |
| 4715.0469 | 1.8116 | 500 | 4771.2051 | -0.0007 | -0.0176 | 0.4868 | 0.0169 | -261.4219 | -252.9887 | 1.5898 | 1.5341 | 1.5341 | 1.5898 | -252.9887 | -261.4219 | -252.9185 | -259.6586 | -0.0703 | -1.7633 | 2529.2981 | 2376.3818 |
| 4665.2648 | 2.1739 | 600 | 4749.7798 | 0.0008 | -0.0212 | 0.5395 | 0.0220 | -261.7789 | -252.8382 | 1.5688 | 1.5147 | 1.5147 | 1.5688 | -252.8382 | -261.7789 | -252.9185 | -259.6586 | 0.0803 | -2.1202 | 2515.5928 | 2344.7095 |
| 4625.0359 | 2.5362 | 700 | 5035.4683 | 0.0876 | 0.0697 | 0.6447 | 0.0179 | -252.6841 | -244.1548 | 1.5685 | 1.5098 | 1.5098 | 1.5685 | -244.1548 | -252.6841 | -252.9185 | -259.6586 | 8.7637 | 6.9746 | 1714.2816 | 3259.7661 |
| 4637.3375 | 2.8986 | 800 | 4705.7749 | -0.0031 | -0.0319 | 0.5921 | 0.0287 | -262.8461 | -253.2311 | 1.5294 | 1.4773 | 1.4773 | 1.5294 | -253.2311 | -262.8461 | -252.9185 | -259.6586 | -0.3127 | -3.1874 | 2569.7046 | 2272.2061 |
| 4550.082 | 3.2609 | 900 | 4687.2900 | -0.0001 | -0.0318 | 0.5921 | 0.0317 | -262.8345 | -252.9287 | 1.5160 | 1.4652 | 1.4652 | 1.5160 | -252.9287 | -262.8345 | -252.9185 | -259.6586 | -0.0102 | -3.1759 | 2544.3586 | 2288.0042 |
| 4612.343 | 3.6232 | 1000 | 4670.3667 | 0.0005 | -0.0323 | 0.5658 | 0.0328 | -262.8906 | -252.8681 | 1.5061 | 1.4569 | 1.4569 | 1.5061 | -252.8681 | -262.8906 | -252.9185 | -259.6586 | 0.0504 | -3.2320 | 2546.7378 | 2296.4641 |
| 4579.3098 | 3.9855 | 1100 | 4676.5903 | -0.0058 | -0.0391 | 0.5263 | 0.0333 | -263.5656 | -253.4963 | 1.5062 | 1.4565 | 1.4565 | 1.5062 | -253.4963 | -263.5656 | -252.9185 | -259.6586 | -0.5778 | -3.9070 | 2616.4526 | 2253.1421 |
| 4461.193 | 4.3478 | 1200 | 4657.2646 | 0.0038 | -0.0339 | 0.6053 | 0.0377 | -263.0466 | -252.5387 | 1.4919 | 1.4449 | 1.4449 | 1.4919 | -252.5387 | -263.0466 | -252.9185 | -259.6586 | 0.3798 | -3.3879 | 2517.6655 | 2292.2590 |
| 4688.9563 | 4.7101 | 1300 | 4654.3955 | -0.0002 | -0.0373 | 0.5658 | 0.0371 | -263.3885 | -252.9360 | 1.4725 | 1.4244 | 1.4244 | 1.4725 | -252.9360 | -263.3885 | -252.9185 | -259.6586 | -0.0175 | -3.7298 | 2567.2290 | 2285.4812 |
| 4572.3969 | 5.0725 | 1400 | 4650.5352 | -0.0014 | -0.0398 | 0.5789 | 0.0384 | -263.6363 | -253.0607 | 1.4663 | 1.4206 | 1.4206 | 1.4663 | -253.0607 | -263.6363 | -252.9185 | -259.6586 | -0.1422 | -3.9776 | 2580.2542 | 2263.7637 |
| 4497.8313 | 5.4348 | 1500 | 4637.4077 | 0.0039 | -0.0371 | 0.5658 | 0.0410 | -263.3676 | -252.5313 | 1.4566 | 1.4118 | 1.4118 | 1.4566 | -252.5313 | -263.3676 | -252.9185 | -259.6586 | 0.3872 | -3.7090 | 2528.2339 | 2293.6980 |
| 4573.9879 | 5.7971 | 1600 | 4628.5752 | 0.0069 | -0.0333 | 0.5921 | 0.0402 | -262.9847 | -252.2267 | 1.4558 | 1.4099 | 1.4099 | 1.4558 | -252.2267 | -262.9847 | -252.9185 | -259.6586 | 0.6917 | -3.3261 | 2501.1956 | 2325.0657 |
| 4493.7113 | 6.1594 | 1700 | 4615.8252 | 0.0106 | -0.0325 | 0.5921 | 0.0431 | -262.9095 | -251.8597 | 1.4488 | 1.4028 | 1.4028 | 1.4488 | -251.8597 | -262.9095 | -252.9185 | -259.6586 | 1.0587 | -3.2509 | 2467.5171 | 2344.7961 |
| 4579.916 | 6.5217 | 1800 | 4618.2861 | 0.0059 | -0.0377 | 0.5789 | 0.0436 | -263.4273 | -252.3270 | 1.4455 | 1.4013 | 1.4013 | 1.4455 | -252.3270 | -263.4273 | -252.9185 | -259.6586 | 0.5915 | -3.7687 | 2516.5059 | 2301.5999 |
| 4682.2398 | 6.8841 | 1900 | 4613.9302 | 0.0060 | -0.0385 | 0.6184 | 0.0445 | -263.5052 | -252.3165 | 1.4429 | 1.3991 | 1.3991 | 1.4429 | -252.3165 | -263.5052 | -252.9185 | -259.6586 | 0.6019 | -3.8466 | 2513.9785 | 2293.4380 |
| 4497.943 | 7.2464 | 2000 | 4617.7402 | 0.0049 | -0.0368 | 0.6053 | 0.0417 | -263.3337 | -252.4285 | 1.4409 | 1.3966 | 1.3966 | 1.4409 | -252.4285 | -263.3337 | -252.9185 | -259.6586 | 0.4900 | -3.6751 | 2527.1399 | 2309.4104 |
| 4470.4805 | 7.6087 | 2100 | 4616.2676 | 0.0083 | -0.0372 | 0.6053 | 0.0455 | -263.3792 | -252.0898 | 1.4419 | 1.3983 | 1.3983 | 1.4419 | -252.0898 | -263.3792 | -252.9185 | -259.6586 | 0.8286 | -3.7205 | 2493.6099 | 2304.2241 |
| 4514.8016 | 7.9710 | 2200 | 4624.1011 | 0.0051 | -0.0370 | 0.5789 | 0.0421 | -263.3607 | -252.4096 | 1.4404 | 1.3959 | 1.3959 | 1.4404 | -252.4096 | -263.3607 | -252.9185 | -259.6586 | 0.5089 | -3.7021 | 2526.5620 | 2309.3242 |

Framework versions

  • Transformers 4.42.0
  • Pytorch 2.3.0+cu121
  • Datasets 2.14.6
  • Tokenizers 0.19.1
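
A quick way to check that a local environment matches the versions above (assumes the standard import names for these packages):

```python
# Compare installed package versions against the ones listed in this card.
import datasets, tokenizers, torch, transformers

expected = {
    "transformers": "4.42.0",
    "torch": "2.3.0+cu121",
    "datasets": "2.14.6",
    "tokenizers": "0.19.1",
}
installed = {
    "transformers": transformers.__version__,
    "torch": torch.__version__,
    "datasets": datasets.__version__,
    "tokenizers": tokenizers.__version__,
}
for name, want in expected.items():
    print(f"{name}: installed {installed[name]}, card lists {want}")
```
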
Model size: 6.91B params · Tensor type: BF16 (Safetensors)