distilgpt2-dpo_test_run

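This model is a fine-tuned version of gpt2; the training dataset is not specified (the card records it as None). It achieves the following results on the evaluation set, which match the epoch 1 row of the training results: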

  • Loss: 0.9044
  • Rewards/chosen: 0.7444
  • Rewards/rejected: 0.2592
  • Rewards/accuracies: 0.5817
  • Rewards/margins: 0.4852
  • Logps/rejected: -429.5133
  • Logps/chosen: -506.8889
  • Logits/rejected: -50.2012
  • Logits/chosen: -45.4443
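
The reward figures above are related by simple arithmetic. The sketch below assumes the metric names follow the convention common to DPO trainers (for example TRL's), where the margin is the mean of chosen minus rejected rewards and the accuracy is the fraction of pairs whose chosen reward is higher; the card itself does not state which trainer produced them. The `toy_pairs` values are hypothetical and only illustrate how the batch-level metrics would be aggregated.

```python
# Headline evaluation rewards as reported in the card.
rewards_chosen = 0.7444
rewards_rejected = 0.2592
reported_margin = 0.4852

# The margin is the gap between chosen and rejected rewards.
margin = rewards_chosen - rewards_rejected
print(f"computed margin: {margin:.4f} (reported: {reported_margin})")


def summarize(pairs):
    """Aggregate (chosen, rejected) reward pairs into batch-level metrics.

    Assumed convention: accuracy counts pairs where chosen > rejected,
    and margin is the mean of chosen - rejected.
    """
    n = len(pairs)
    accuracy = sum(1 for c, r in pairs if c > r) / n
    mean_margin = sum(c - r for c, r in pairs) / n
    return accuracy, mean_margin


# Hypothetical per-pair rewards, not taken from the model's evaluation set.
toy_pairs = [(1.2, 0.3), (0.1, 0.4), (0.9, -0.2), (-0.5, -0.1), (0.6, 0.6)]
toy_accuracy, toy_margin = summarize(toy_pairs)
print(f"toy accuracy: {toy_accuracy:.2f}, toy margin: {toy_margin:.2f}")
```

Under this reading, an accuracy of 0.5817 means the chosen response received the higher reward in roughly 58% of evaluation pairs.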

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 6
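
As a worked check of how these settings combine, the sketch below derives the effective batch size and a linear warmup-and-decay schedule. It assumes a single training device (the card does not list a device count), takes the total step count of 8022 from the final row of the training results, and rounds the warmup step count up, which is how the Transformers Trainer is understood to handle a warmup ratio. The helper `linear_lr` is a hypothetical name written for illustration, not a library function.

```python
import math

# Values listed in the card.
train_batch_size = 8
gradient_accumulation_steps = 2
learning_rate = 5e-05
warmup_ratio = 0.1
total_steps = 8022  # last step in the training results (6 epochs x 1337 steps)

# Assumption: one device, so 8 * 2 * 1 reproduces total_train_batch_size = 16.
num_devices = 1
effective_batch = train_batch_size * gradient_accumulation_steps * num_devices

# Assumption: the warmup step count is rounded up from ratio * total steps.
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_lr(step):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    remaining = max(0, total_steps - step)
    return learning_rate * remaining / (total_steps - warmup_steps)


print(f"effective batch size: {effective_batch}")
print(f"warmup steps: {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>5}: lr = {linear_lr(step):.2e}")
```

With these assumptions the learning rate peaks at 5e-05 at the end of warmup and reaches zero at the final step.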

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.8683 | 1.0 | 1337 | 0.9044 | 0.7444 | 0.2592 | 0.5817 | 0.4852 | -429.5133 | -506.8889 | -50.2012 | -45.4443 |
| 0.4795 | 2.0 | 2674 | 0.9425 | 0.1993 | -0.4639 | 0.5959 | 0.6632 | -436.7442 | -512.3394 | -54.4344 | -49.5827 |
| 0.1485 | 3.0 | 4011 | 1.1159 | -2.0134 | -2.6798 | 0.5775 | 0.6664 | -458.9030 | -534.4666 | -70.3363 | -65.4014 |
| 0.0378 | 4.0 | 5348 | 1.3151 | -3.6174 | -4.7588 | 0.5927 | 1.1415 | -479.6934 | -550.5060 | -70.8835 | -65.6636 |
| 0.0127 | 5.0 | 6685 | 1.4381 | -4.8640 | -6.0585 | 0.5822 | 1.1945 | -492.6903 | -562.9730 | -70.3612 | -64.6966 |
| 0.0006 | 6.0 | 8022 | 1.5074 | -5.3161 | -6.4742 | 0.5837 | 1.1581 | -496.8472 | -567.4940 | -70.7820 | -64.9708 |
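
The training results can be read programmatically to find the checkpoint with the lowest validation loss. The sketch below copies the epoch, training loss, and validation loss columns from the results above and picks the minimum; the variable names are illustrative. In this run the training loss falls every epoch while the validation loss rises after epoch 1, a pattern that is often a sign of overfitting, although the card itself draws no such conclusion.

```python
# (epoch, training loss, validation loss) copied from the training results.
rows = [
    (1, 0.8683, 0.9044),
    (2, 0.4795, 0.9425),
    (3, 0.1485, 1.1159),
    (4, 0.0378, 1.3151),
    (5, 0.0127, 1.4381),
    (6, 0.0006, 1.5074),
]

# Select the epoch whose validation loss is lowest.
best_epoch, best_train_loss, best_val_loss = min(rows, key=lambda row: row[2])
print(f"lowest validation loss: {best_val_loss} at epoch {best_epoch}")

# Check the direction of each loss curve from one epoch to the next.
train_losses = [row[1] for row in rows]
val_losses = [row[2] for row in rows]
train_decreasing = all(a > b for a, b in zip(train_losses, train_losses[1:]))
val_increasing = all(a < b for a, b in zip(val_losses, val_losses[1:]))
print(f"training loss falls every epoch: {train_decreasing}")
print(f"validation loss rises every epoch: {val_increasing}")
```

The selected epoch carries the same validation loss (0.9044) as the headline evaluation results at the top of the card.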

Framework versions

  • Transformers 4.40.2
  • Pytorch 2.1.0+cu118
  • Datasets 2.19.1
  • Tokenizers 0.19.1

Safetensors

  • Model size: 124M params
  • Tensor type: F32