---
license: other
base_model: deepseek-ai/deepseek-llm-7b-chat
tags:
  - alignment-handbook
  - trl
  - dpo
  - generated_from_trainer
datasets:
  - self-generate/ds_chat_original_cn_mining_oj_iter0-binarized
  - self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized
  - self-generate/ds_chat_original_cn_rl_oj_iter0-binarized
model-index:
  - name: ds_chat_sigmoid_iter0_2024-09-14-21.15
    results: []
---


# ds_chat_sigmoid_iter0_2024-09-14-21.15

This model is a fine-tuned version of [deepseek-ai/deepseek-llm-7b-chat](https://huggingface.co/deepseek-ai/deepseek-llm-7b-chat) on the self-generate/ds_chat_original_cn_mining_oj_iter0-binarized, self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized and self-generate/ds_chat_original_cn_rl_oj_iter0-binarized datasets. It achieves the following results on the evaluation set (a short sketch after the list shows how the reward figures relate to the logps):

- Loss: 0.7009
- Rewards/chosen: 0.3500
- Rewards/rejected: 0.0298
- Rewards/accuracies: 0.3289
- Rewards/margins: 0.3202
- Logps/rejected: -63.8274
- Logps/chosen: -122.4480
- Logits/rejected: 1.6952
- Logits/chosen: 1.6350
- Debug/policy Chosen Logits: 1.6350
- Debug/policy Rejected Logits: 1.6952
- Debug/policy Chosen Logps: -122.4480
- Debug/policy Rejected Logps: -63.8274
- Debug/reference Chosen Logps: -123.1481
- Debug/reference Rejected Logps: -63.8871
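
The reward columns follow TRL's DPO convention, reward = β · (policy logp − reference logp). The card does not state β, but plugging the logged logps into that formula reproduces the reported rewards (to rounding) with β = 0.5, so that value is inferred rather than documented. A minimal check:

```python
import math

# Final-eval log-probs logged above (policy vs. frozen reference model).
policy_chosen, policy_rejected = -122.4480, -63.8274
ref_chosen, ref_rejected = -123.1481, -63.8871

beta = 0.5  # inferred: the value that reproduces the reported rewards

rewards_chosen = beta * (policy_chosen - ref_chosen)        # ~0.3500 (Rewards/chosen)
rewards_rejected = beta * (policy_rejected - ref_rejected)  # ~0.0298 (Rewards/rejected)
margin = rewards_chosen - rewards_rejected                  # ~0.3202 (Rewards/margins)

# Per-example sigmoid DPO loss is -log(sigmoid(margin)). The reported eval
# loss (0.7009) is a mean over examples, so it is not recoverable from
# these batch-mean logps alone.
per_example_loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
print(rewards_chosen, rewards_rejected, margin, round(per_example_loss, 4))
```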

## Model description

More information needed
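
No description was provided with the card. As a stopgap, the checkpoint should load through the standard `transformers` chat API; the repo id below is assumed from the card's owner and run name and may differ, and the chat template is inherited from deepseek-llm-7b-chat:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id assumed from this card's owner and run name; adjust if it differs.
model_id = "yiran-wang3/ds_chat_sigmoid_iter0_2024-09-14-21.15"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The tokenizer carries the base model's chat template.
messages = [{"role": "user", "content": "Write a function that checks whether a number is prime."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```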

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a sketch mapping them onto a TRL `DPOConfig` follows the list):

- learning_rate: 1e-07
- train_batch_size: 8
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- total_train_batch_size: 64
- total_eval_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- lr_scheduler_warmup_steps: 100
- num_epochs: 8.0
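
For orientation, these settings map roughly onto a TRL `DPOConfig` as sketched below. This is not the exact alignment-handbook recipe: `beta=0.5` is inferred from the evaluation rewards above and `loss_type="sigmoid"` from the run name, neither is stated on the card.

```python
from trl import DPOConfig

config = DPOConfig(
    output_dir="ds_chat_sigmoid_iter0_2024-09-14-21.15",
    learning_rate=1e-7,
    per_device_train_batch_size=8,  # x 8 GPUs -> total train batch size 64
    per_device_eval_batch_size=4,   # x 8 GPUs -> total eval batch size 32
    seed=42,
    num_train_epochs=8.0,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    warmup_steps=100,  # when set, warmup_steps takes precedence over warmup_ratio
    loss_type="sigmoid",  # inferred from the run name, not stated on the card
    beta=0.5,             # inferred from the evaluation rewards, not stated
)
# The Adam betas/epsilon listed above are the TrainingArguments defaults.
```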

### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen | Debug/policy Chosen Logits | Debug/policy Rejected Logits | Debug/policy Chosen Logps | Debug/policy Rejected Logps | Debug/reference Chosen Logps | Debug/reference Rejected Logps |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0.6965 | 0.3623 | 100 | 0.6848 | 0.1614 | 0.0731 | 0.2895 | 0.0882 | -63.7408 | -122.8253 | 1.7215 | 1.6604 | 1.6604 | 1.7215 | -122.8253 | -63.7408 | -123.1481 | -63.8871 |
| 0.7398 | 0.7246 | 200 | 0.7128 | 0.4980 | 0.1123 | 0.3289 | 0.3857 | -63.6625 | -122.1521 | 1.7105 | 1.6513 | 1.6513 | 1.7105 | -122.1521 | -63.6625 | -123.1481 | -63.8871 |
| 0.7007 | 1.0870 | 300 | 0.6869 | 0.4063 | -0.0006 | 0.3158 | 0.4070 | -63.8883 | -122.3354 | 1.7138 | 1.6542 | 1.6542 | 1.7138 | -122.3354 | -63.8883 | -123.1481 | -63.8871 |
| 0.7084 | 1.4493 | 400 | 0.7388 | 0.4329 | 0.1275 | 0.3026 | 0.3054 | -63.6320 | -122.2823 | 1.7009 | 1.6406 | 1.6406 | 1.7009 | -122.2823 | -63.6320 | -123.1481 | -63.8871 |
| 0.693 | 1.8116 | 500 | 0.6927 | 0.1909 | -0.0563 | 0.3158 | 0.2472 | -63.9997 | -122.7663 | 1.7035 | 1.6431 | 1.6431 | 1.7035 | -122.7663 | -63.9997 | -123.1481 | -63.8871 |
| 0.6683 | 2.1739 | 600 | 0.6755 | 0.2946 | 0.0203 | 0.3421 | 0.2744 | -63.8465 | -122.5588 | 1.7045 | 1.6442 | 1.6442 | 1.7045 | -122.5588 | -63.8465 | -123.1481 | -63.8871 |
| 0.7035 | 2.5362 | 700 | 0.6899 | 0.1404 | -0.0287 | 0.3158 | 0.1691 | -63.9445 | -122.8673 | 1.7058 | 1.6448 | 1.6448 | 1.7058 | -122.8673 | -63.9445 | -123.1481 | -63.8871 |
| 0.685 | 2.8986 | 800 | 0.6978 | 0.4321 | 0.0759 | 0.3947 | 0.3562 | -63.7352 | -122.2839 | 1.7109 | 1.6500 | 1.6500 | 1.7109 | -122.2839 | -63.7352 | -123.1481 | -63.8871 |
| 0.6585 | 3.2609 | 900 | 0.7158 | 0.4197 | 0.1341 | 0.2763 | 0.2856 | -63.6189 | -122.3087 | 1.7148 | 1.6527 | 1.6527 | 1.7148 | -122.3087 | -63.6189 | -123.1481 | -63.8871 |
| 0.6654 | 3.6232 | 1000 | 0.6837 | 0.4128 | 0.0010 | 0.3947 | 0.4118 | -63.8851 | -122.3225 | 1.7064 | 1.6460 | 1.6460 | 1.7064 | -122.3225 | -63.8851 | -123.1481 | -63.8871 |
| 0.669 | 3.9855 | 1100 | 0.6801 | 0.2662 | -0.0151 | 0.3816 | 0.2813 | -63.9173 | -122.6156 | 1.7008 | 1.6413 | 1.6413 | 1.7008 | -122.6156 | -63.9173 | -123.1481 | -63.8871 |
| 0.6658 | 4.3478 | 1200 | 0.6950 | 0.2165 | -0.0405 | 0.3553 | 0.2570 | -63.9680 | -122.7150 | 1.6985 | 1.6382 | 1.6382 | 1.6985 | -122.7150 | -63.9680 | -123.1481 | -63.8871 |
| 0.6774 | 4.7101 | 1300 | 0.6833 | 0.3216 | 0.0373 | 0.3289 | 0.2843 | -63.8124 | -122.5048 | 1.6956 | 1.6371 | 1.6371 | 1.6956 | -122.5048 | -63.8124 | -123.1481 | -63.8871 |
| 0.6553 | 5.0725 | 1400 | 0.6871 | 0.4489 | 0.0096 | 0.3421 | 0.4393 | -63.8679 | -122.2503 | 1.6926 | 1.6324 | 1.6324 | 1.6926 | -122.2503 | -63.8679 | -123.1481 | -63.8871 |
| 0.655 | 5.4348 | 1500 | 0.6900 | 0.3867 | 0.0004 | 0.3553 | 0.3863 | -63.8863 | -122.3746 | 1.7037 | 1.6446 | 1.6446 | 1.7037 | -122.3746 | -63.8863 | -123.1481 | -63.8871 |
| 0.6552 | 5.7971 | 1600 | 0.6981 | 0.2816 | -0.0683 | 0.3158 | 0.3498 | -64.0236 | -122.5849 | 1.6935 | 1.6342 | 1.6342 | 1.6935 | -122.5849 | -64.0236 | -123.1481 | -63.8871 |
| 0.6471 | 6.1594 | 1700 | 0.7017 | 0.3683 | 0.0204 | 0.3553 | 0.3479 | -63.8463 | -122.4115 | 1.6992 | 1.6385 | 1.6385 | 1.6992 | -122.4115 | -63.8463 | -123.1481 | -63.8871 |
| 0.6557 | 6.5217 | 1800 | 0.6957 | 0.2688 | -0.0975 | 0.3026 | 0.3663 | -64.0820 | -122.6105 | 1.6947 | 1.6337 | 1.6337 | 1.6947 | -122.6105 | -64.0820 | -123.1481 | -63.8871 |
| 0.6516 | 6.8841 | 1900 | 0.6872 | 0.3905 | 0.0084 | 0.3553 | 0.3821 | -63.8704 | -122.3671 | 1.7002 | 1.6400 | 1.6400 | 1.7002 | -122.3671 | -63.8704 | -123.1481 | -63.8871 |
| 0.6542 | 7.2464 | 2000 | 0.6910 | 0.3410 | 0.0003 | 0.3289 | 0.3406 | -63.8864 | -122.4661 | 1.6915 | 1.6320 | 1.6320 | 1.6915 | -122.4661 | -63.8864 | -123.1481 | -63.8871 |
| 0.6629 | 7.6087 | 2100 | 0.6930 | 0.4245 | 0.0306 | 0.3026 | 0.3939 | -63.8259 | -122.2991 | 1.6968 | 1.6376 | 1.6376 | 1.6968 | -122.2991 | -63.8259 | -123.1481 | -63.8871 |
| 0.6427 | 7.9710 | 2200 | 0.7009 | 0.3500 | 0.0298 | 0.3289 | 0.3202 | -63.8274 | -122.4480 | 1.6952 | 1.6350 | 1.6350 | 1.6952 | -122.4480 | -63.8274 | -123.1481 | -63.8871 |

### Framework versions

- Transformers 4.42.0
- PyTorch 2.3.0+cu121
- Datasets 2.14.6
- Tokenizers 0.19.1