
dpo-lora

This model is a fine-tuned version of hatakeyama-llm-team/with_halcination_little_codes_ck5200 on an unknown dataset. It achieves the following results on the evaluation set (a minimal loading sketch follows the results below):

  • Loss: 0.2368
  • Rewards/chosen: -0.4748
  • Rewards/rejected: -7.9955
  • Rewards/accuracies: 0.7593
  • Rewards/margins: 7.5207
  • Logps/rejected: -1090.4104
  • Logps/chosen: -283.0233
  • Logits/rejected: -0.3147
  • Logits/chosen: 0.1504
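
For reference, here is a minimal loading sketch using PEFT, assuming the adapter is published as misdelivery/tk-dpo-test-lr5e-6-beta0.01; the dtype and prompt are illustrative choices, not details from this card:

```python
# Minimal loading sketch. Assumptions: the adapter repo id below is where this
# model lives, and the tokenizer is taken from the base model named in this card.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_id = "misdelivery/tk-dpo-test-lr5e-6-beta0.01"
base_id = "hatakeyama-llm-team/with_halcination_little_codes_ck5200"

# Loads the base model and applies the LoRA adapter on top of it.
model = AutoPeftModelForCausalLM.from_pretrained(adapter_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base_id)

inputs = tokenizer("Hello, ", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```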

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a hedged training sketch follows this list):

  • learning_rate: 5e-06
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • gradient_accumulation_steps: 64
  • total_train_batch_size: 128
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
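
For context, a hedged sketch of how these settings could map onto TRL's DPOTrainer; the LoRA config, the dataset, the TRL version, and beta=0.01 (suggested only by the repository name) are assumptions, not details stated on this card:

```python
# Sketch of the training setup under stated assumptions, not the authors' script.
# Assumptions: TRL's DPOTrainer API (~0.8.x), a hypothetical preference dataset,
# illustrative LoRA settings, and beta taken from the repo name (beta0.01).
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base_id = "hatakeyama-llm-team/with_halcination_little_codes_ck5200"
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Hypothetical LoRA settings; the card does not state rank or alpha.
peft_config = LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32)

args = TrainingArguments(
    output_dir="dpo-lora",
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=64,  # 2 x 64 = effective batch size of 128
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    seed=42,
)

# Hypothetical dataset id; the card lists the training data as unknown.
# DPOTrainer expects "prompt", "chosen", and "rejected" columns.
train_dataset = load_dataset("org/preference-dataset", split="train")

trainer = DPOTrainer(
    model,
    ref_model=None,  # with a PEFT adapter, the frozen base weights act as the reference
    beta=0.01,       # assumed from the repo name tk-dpo-test-lr5e-6-beta0.01
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```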

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6307        | 0.16  | 50   | 0.6366          | -0.0349        | -0.1583          | 0.7096             | 0.1234          | -306.6914      | -239.0365    | 0.4555          | 0.4696        |
| 0.4525        | 0.32  | 100  | 0.4497          | -0.2154        | -1.0594          | 0.7125             | 0.8440          | -396.8069      | -257.0910    | -1.0499         | -0.6249       |
| 0.2374        | 0.48  | 150  | 0.2572          | -0.5430        | -5.0859          | 0.7485             | 4.5429          | -799.4507      | -289.8462    | -0.3316         | 0.1056        |
| 0.2474        | 0.64  | 200  | 0.2402          | -0.4918        | -6.9270          | 0.7466             | 6.4352          | -983.5654      | -284.7264    | -0.2472         | 0.2108        |
| 0.2291        | 0.8   | 250  | 0.2368          | -0.5202        | -8.3251          | 0.7554             | 7.8048          | -1123.3704     | -287.5728    | -0.3350         | 0.1314        |
| 0.2523        | 0.96  | 300  | 0.2368          | -0.4748        | -7.9955          | 0.7593             | 7.5207          | -1090.4104     | -283.0233    | -0.3147         | 0.1504        |

Framework versions

  • PEFT 0.11.1
  • Transformers 4.41.0
  • Pytorch 2.3.0+cu121
  • Datasets 2.19.1
  • Tokenizers 0.19.1