---
license: apache-2.0
base_model: mistralai/Mistral-7B-Instruct-v0.2
tags:
- alignment-handbook
- trl
- dpo
- generated_from_trainer
datasets:
- Dahoas/full-hh-rlhf
model-index:
- name: Mistral-7B-Instruct-v0.2-DPO
  results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# Mistral-7B-Instruct-v0.2-DPO

This model is a fine-tuned version of [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2), trained with DPO on the Dahoas/full-hh-rlhf dataset.
It achieves the following results on the evaluation set:
- Loss: 0.5782
- Rewards/chosen: -0.2120
- Rewards/rejected: -0.7002
- Rewards/accuracies: 0.6926
- Rewards/margins: 0.4883
- Logps/rejected: -296.2612
- Logps/chosen: -255.5737
- Logits/rejected: -2.4985
- Logits/chosen: -2.5472
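
For readers unfamiliar with the metric names above, the following is a minimal sketch of how DPO-style rewards, margins, and loss are commonly defined for a single preference pair. It is not taken from the training code for this model: the function name `dpo_metrics`, the value `beta=0.1`, and the reference log-probabilities in the usage lines are all assumptions chosen for illustration.

```python
import math


def dpo_metrics(policy_chosen_logp, policy_rejected_logp,
                ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Illustrative DPO quantities for one (chosen, rejected) pair.

    beta is an assumed value; this card does not state the one used.
    """
    # Implicit rewards: scaled log-probability ratio of policy vs. reference.
    reward_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    reward_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = reward_chosen - reward_rejected
    # Sigmoid DPO loss: -log(sigmoid(margin)) == log(1 + exp(-margin)).
    loss = math.log1p(math.exp(-margin))
    return {
        "reward_chosen": reward_chosen,
        "reward_rejected": reward_rejected,
        "margin": margin,
        "loss": loss,
        "correct": margin > 0,  # contributes to Rewards/accuracies
    }


# Hypothetical log-probabilities, not values from this training run.
example = dpo_metrics(-255.6, -296.3, -253.5, -289.3)
# When the policy equals the reference, the margin is zero.
untrained = dpo_metrics(-255.0, -296.0, -255.0, -296.0)
```

Under this definition a zero margin gives a loss of ln 2 (about 0.693), which is why a DPO run typically starts near that value. The reported metrics are averages over the evaluation set, so the mean loss is not simply the loss evaluated at the mean margin.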

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-07
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 2
- total_train_batch_size: 64
- total_eval_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
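
The batch-size totals above follow from the per-device sizes, the device count, and gradient accumulation. The sketch below reproduces that arithmetic and illustrates the general shape of a cosine schedule with linear warmup. The function `cosine_lr` is a hand-written approximation, not the scheduler code used for this run, and `TOTAL_STEPS` is a hypothetical value because the card does not state the exact number of optimizer steps.

```python
import math

PER_DEVICE_TRAIN_BATCH = 8
PER_DEVICE_EVAL_BATCH = 8
NUM_DEVICES = 4
GRAD_ACCUM_STEPS = 2

# Effective batch sizes: gradient accumulation applies to training only.
total_train_batch = PER_DEVICE_TRAIN_BATCH * NUM_DEVICES * GRAD_ACCUM_STEPS
total_eval_batch = PER_DEVICE_EVAL_BATCH * NUM_DEVICES

BASE_LR = 5e-7
WARMUP_RATIO = 0.1
TOTAL_STEPS = 1750  # hypothetical; not stated in this card


def cosine_lr(step, total_steps=TOTAL_STEPS, base_lr=BASE_LR,
              warmup_ratio=WARMUP_RATIO):
    """Linear warmup to base_lr, then a half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(total_train_batch, total_eval_batch)
```

With these assumptions the learning rate peaks at `5e-07` once warmup ends and decays toward zero by the final step.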

### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6628        | 0.06  | 100  | 0.6611          | 0.1337         | 0.0489           | 0.6317             | 0.0848          | -221.3471      | -221.0088    | -2.6721         | -2.7152       |
| 0.6203        | 0.11  | 200  | 0.6121          | -0.0960        | -0.4057          | 0.6609             | 0.3097          | -266.8084      | -243.9758    | -2.6213         | -2.6775       |
| 0.6134        | 0.17  | 300  | 0.6074          | -0.0623        | -0.3733          | 0.6702             | 0.3111          | -263.5724      | -240.6045    | -2.7988         | -2.8551       |
| 0.5967        | 0.23  | 400  | 0.5992          | -0.1315        | -0.5181          | 0.6782             | 0.3866          | -278.0497      | -247.5236    | -2.4576         | -2.5191       |
| 0.6216        | 0.29  | 500  | 0.5941          | -0.0370        | -0.4146          | 0.6721             | 0.3775          | -267.6940      | -238.0781    | -2.6879         | -2.7311       |
| 0.5919        | 0.34  | 600  | 0.5904          | -0.1509        | -0.5767          | 0.6865             | 0.4258          | -283.9072      | -249.4699    | -2.4044         | -2.4745       |
| 0.5769        | 0.4   | 700  | 0.5902          | -0.2407        | -0.6647          | 0.6772             | 0.4240          | -292.7129      | -258.4496    | -2.2190         | -2.2924       |
| 0.5725        | 0.46  | 800  | 0.5882          | -0.0462        | -0.4830          | 0.6837             | 0.4368          | -274.5383      | -238.9940    | -2.5276         | -2.5732       |
| 0.5814        | 0.51  | 900  | 0.5864          | -0.1178        | -0.5375          | 0.6811             | 0.4197          | -279.9914      | -246.1586    | -2.3355         | -2.4098       |
| 0.5514        | 0.57  | 1000 | 0.5839          | -0.1827        | -0.6505          | 0.6872             | 0.4678          | -291.2902      | -252.6515    | -2.4115         | -2.4855       |
| 0.5946        | 0.63  | 1100 | 0.5846          | -0.0669        | -0.5120          | 0.6846             | 0.4451          | -277.4430      | -241.0672    | -2.4475         | -2.5090       |
| 0.5988        | 0.69  | 1200 | 0.5829          | -0.2676        | -0.7315          | 0.6891             | 0.4638          | -299.3864      | -261.1408    | -2.4703         | -2.5293       |
| 0.5725        | 0.74  | 1300 | 0.5809          | -0.1107        | -0.5656          | 0.6878             | 0.4549          | -282.7961      | -245.4460    | -2.4590         | -2.5131       |
| 0.5719        | 0.8   | 1400 | 0.5793          | -0.2111        | -0.6982          | 0.6894             | 0.4871          | -296.0592      | -255.4868    | -2.4585         | -2.5096       |
| 0.5702        | 0.86  | 1500 | 0.5789          | -0.2663        | -0.7548          | 0.6888             | 0.4884          | -301.7152      | -261.0100    | -2.4746         | -2.5243       |
| 0.5854        | 0.91  | 1600 | 0.5783          | -0.2282        | -0.7193          | 0.6913             | 0.4911          | -298.1695      | -257.1977    | -2.5037         | -2.5523       |
| 0.578         | 0.97  | 1700 | 0.5782          | -0.2135        | -0.7018          | 0.6920             | 0.4884          | -296.4236      | -255.7232    | -2.4987         | -2.5475       |
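
As a quick sanity check on the table, the reward margin in each row should equal `Rewards/chosen` minus `Rewards/rejected`, up to rounding in the fourth decimal place. The sketch below verifies this for four rows copied from the table above and picks the row with the lowest validation loss. The tuple layout and variable names are illustrative choices.

```python
# (step, validation_loss, rewards_chosen, rewards_rejected, rewards_margin)
rows = [
    (100, 0.6611, 0.1337, 0.0489, 0.0848),
    (500, 0.5941, -0.0370, -0.4146, 0.3775),
    (1000, 0.5839, -0.1827, -0.6505, 0.4678),
    (1700, 0.5782, -0.2135, -0.7018, 0.4884),
]

# Largest gap between the logged margin and chosen - rejected.
max_deviation = max(
    abs((chosen - rejected) - margin)
    for _, _, chosen, rejected, margin in rows
)

# Row with the lowest validation loss among those listed here.
best_step = min(rows, key=lambda row: row[1])[0]

print(f"max deviation: {max_deviation:.4f}, best step: {best_step}")
```

Small deviations are expected because each logged value is rounded independently.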


### Framework versions

- Transformers 4.39.0.dev0
- PyTorch 2.3.0+cu121
- Datasets 2.14.6
- Tokenizers 0.15.2