# biomistral-7b-dpo-full-wo-live_qa-ep3
This model is a fine-tuned version of [BioMistral/BioMistral-7B](https://huggingface.co/BioMistral/BioMistral-7B) on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:
- Loss: 0.5347
- Rewards/chosen: -0.8760
- Rewards/rejected: -1.7973
- Rewards/accuracies: 0.6528
- Rewards/margins: 0.9213
- Logps/rejected: -374.4584
- Logps/chosen: -385.7039
- Logits/rejected: 1.0068
- Logits/chosen: -1.3925
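As a sanity check on the numbers above: in DPO-style evaluation, `Rewards/margins` is the difference between the chosen and rejected rewards. A minimal sketch, using the values reported above:

```python
# Evaluation rewards reported in this model card.
rewards_chosen = -0.8760
rewards_rejected = -1.7973

# In DPO metrics, the margin is the chosen reward minus the rejected reward.
margin = rewards_chosen - rewards_rejected
print(round(margin, 4))  # matches the reported Rewards/margins of 0.9213
```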
## Model description
More information needed
## Intended uses & limitations
More information needed
## Training and evaluation data
More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-07
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 2
- total_train_batch_size: 64
- total_eval_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
### Training results
| Training Loss | Epoch | Step | Logits/chosen | Logits/rejected | Logps/chosen | Logps/rejected | Validation Loss | Rewards/accuracies | Rewards/chosen | Rewards/margins | Rewards/rejected |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.3672 | 0.33 | 100 | -1.7892 | 0.0124 | -325.3138 | -259.1306 | 0.5652 | 0.75 | -0.2721 | 0.3720 | -0.6441 |
| 0.2495 | 0.65 | 200 | -1.4424 | 0.8731 | -366.8614 | -347.8152 | 0.5144 | 0.7153 | -0.6876 | 0.8433 | -1.5309 |
| 0.1708 | 0.98 | 300 | -1.3915 | 1.0056 | -385.5208 | -374.2370 | 0.5345 | 0.6528 | -0.8742 | 0.9209 | -1.7951 |
### Framework versions
- Transformers 4.39.0.dev0
- Pytorch 2.1.2
- Datasets 2.14.6
- Tokenizers 0.15.2