RikkiXu committed on
Commit
ff9cd6d
1 Parent(s): 75b0ab6

Model save

README.md CHANGED
@@ -13,17 +13,17 @@ should probably proofread and complete it, then remove this comment. -->
13
 
14
  # zephyr-7b-dpo-full
15
 
16
- This model was trained from scratch on the None dataset.
17
  It achieves the following results on the evaluation set:
18
- - Loss: 0.3310
19
- - Rewards/chosen: -0.6343
20
- - Rewards/rejected: -2.0702
21
- - Rewards/accuracies: 0.8711
22
- - Rewards/margins: 1.4359
23
- - Logps/rejected: -579.6317
24
- - Logps/chosen: -442.8079
25
- - Logits/rejected: -4.3667
26
- - Logits/chosen: -3.9322
27
 
28
  ## Model description
29
 
@@ -42,7 +42,7 @@ More information needed
42
  ### Training hyperparameters
43
 
44
  The following hyperparameters were used during training:
45
- - learning_rate: 5e-07
46
  - train_batch_size: 8
47
  - eval_batch_size: 8
48
  - seed: 42
@@ -58,21 +58,16 @@ The following hyperparameters were used during training:
58
 
59
  ### Training results
60
 
61
- | Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
62
- |:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
63
- | 0.5557 | 0.12 | 100 | 0.6118 | -0.3644 | -0.7265 | 0.6719 | 0.3621 | -445.2614 | -415.8167 | -2.1569 | -2.1594 |
64
- | 0.4689 | 0.23 | 200 | 0.5068 | -0.4921 | -1.3016 | 0.75 | 0.8095 | -502.7699 | -428.5852 | -2.9279 | -2.8572 |
65
- | 0.4351 | 0.35 | 300 | 0.4574 | -0.5263 | -1.5234 | 0.7930 | 0.9971 | -524.9551 | -432.0108 | -3.6524 | -3.4654 |
66
- | 0.3978 | 0.46 | 400 | 0.4130 | -0.5219 | -1.7269 | 0.8359 | 1.2050 | -545.3044 | -431.5721 | -3.8428 | -3.5190 |
67
- | 0.422 | 0.58 | 500 | 0.3804 | -0.5284 | -1.7684 | 0.8516 | 1.2400 | -549.4502 | -432.2204 | -3.9749 | -3.6652 |
68
- | 0.3728 | 0.69 | 600 | 0.3498 | -0.6801 | -2.0888 | 0.8555 | 1.4087 | -581.4929 | -447.3842 | -4.3492 | -3.9204 |
69
- | 0.4072 | 0.81 | 700 | 0.3413 | -0.5876 | -1.9622 | 0.8711 | 1.3746 | -568.8267 | -438.1348 | -4.2357 | -3.8217 |
70
- | 0.388 | 0.92 | 800 | 0.3310 | -0.6343 | -2.0702 | 0.8711 | 1.4359 | -579.6317 | -442.8079 | -4.3667 | -3.9322 |
71
 
72
 
73
  ### Framework versions
74
 
75
- - Transformers 4.38.2
76
  - Pytorch 2.1.2+cu118
77
- - Datasets 2.16.1
78
- - Tokenizers 0.15.2
 
13
 
14
  # zephyr-7b-dpo-full
15
 
16
+ This model was trained from scratch on an unknown dataset.
17
  It achieves the following results on the evaluation set:
18
+ - Loss: 0.7111
19
+ - Rewards/chosen: -0.8296
20
+ - Rewards/rejected: -0.9542
21
+ - Rewards/accuracies: 0.5625
22
+ - Rewards/margins: 0.1246
23
+ - Logps/rejected: -613.8080
24
+ - Logps/chosen: -473.4393
25
+ - Logits/rejected: -5.2824
26
+ - Logits/chosen: -5.0285
27
 
28
  ## Model description
29
 
 
42
  ### Training hyperparameters
43
 
44
  The following hyperparameters were used during training:
45
+ - learning_rate: 5e-08
46
  - train_batch_size: 8
47
  - eval_batch_size: 8
48
  - seed: 42
 
58
 
59
  ### Training results
60
 
61
+ | Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
62
+ |:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
63
+ | 0.5044 | 0.2558 | 100 | 0.7105 | -0.1129 | -0.0877 | 0.4727 | -0.0252 | -527.1570 | -401.7669 | -4.8307 | -4.6554 |
64
+ | 0.3343 | 0.5115 | 200 | 0.6982 | -0.5200 | -0.6117 | 0.5586 | 0.0918 | -579.5609 | -442.4707 | -5.1101 | -4.8657 |
65
+ | 0.2972 | 0.7673 | 300 | 0.7111 | -0.8296 | -0.9542 | 0.5625 | 0.1246 | -613.8080 | -473.4393 | -5.2824 | -5.0285 |
 
 
 
 
 
66
 
67
 
68
  ### Framework versions
69
 
70
+ - Transformers 4.40.2
71
  - Pytorch 2.1.2+cu118
72
+ - Datasets 2.19.1
73
+ - Tokenizers 0.19.1
all_results.json CHANGED
@@ -1,8 +1,9 @@
1
  {
2
  "epoch": 1.0,
3
- "train_loss": 0.44703073270859256,
4
- "train_runtime": 13842.8667,
5
- "train_samples": 111134,
6
- "train_samples_per_second": 8.028,
 
7
  "train_steps_per_second": 0.063
8
  }
 
1
  {
2
  "epoch": 1.0,
3
+ "total_flos": 0.0,
4
+ "train_loss": 0.4007269041922391,
5
+ "train_runtime": 6210.4356,
6
+ "train_samples": 50000,
7
+ "train_samples_per_second": 8.051,
8
  "train_steps_per_second": 0.063
9
  }
config.json CHANGED
@@ -1,5 +1,5 @@
1
  {
2
- "_name_or_path": "/mnt/bn/xuruijie-llm/checkpoints/hh-rlhf/sft_0521/checkpoint-5500/",
3
  "architectures": [
4
  "MistralForCausalLM"
5
  ],
@@ -20,7 +20,7 @@
20
  "sliding_window": 4096,
21
  "tie_word_embeddings": false,
22
  "torch_dtype": "bfloat16",
23
- "transformers_version": "4.41.1",
24
  "use_cache": false,
25
  "vocab_size": 32002
26
  }
 
1
  {
2
+ "_name_or_path": "/mnt/bn/xuruijie-llm/checkpoints/new_world/v1-ultral",
3
  "architectures": [
4
  "MistralForCausalLM"
5
  ],
 
20
  "sliding_window": 4096,
21
  "tie_word_embeddings": false,
22
  "torch_dtype": "bfloat16",
23
+ "transformers_version": "4.40.2",
24
  "use_cache": false,
25
  "vocab_size": 32002
26
  }
generation_config.json CHANGED
@@ -2,5 +2,5 @@
2
  "_from_model_config": true,
3
  "bos_token_id": 1,
4
  "eos_token_id": 32000,
5
- "transformers_version": "4.38.2"
6
  }
 
2
  "_from_model_config": true,
3
  "bos_token_id": 1,
4
  "eos_token_id": 32000,
5
+ "transformers_version": "4.40.2"
6
  }
model-00001-of-00003.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:6be203077af844af5de32b657bbc2702f49a7f3828bee12a42b12c1a74280723
3
  size 4943178720
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a9f251e96294c9492756c37d895155f82d849066c2471fd0aca8729a7aede122
3
  size 4943178720
model-00002-of-00003.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:40ddbab26b4c540c98068ca370e93ec6d4c67b3e6238f0b4926eb9fc2f1596b5
3
  size 4999819336
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9e9fd416637a101fc45a516e2d108838af2259d6ce2b49a13aba516cf189b7f0
3
  size 4999819336
model-00003-of-00003.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:76160dd669972969580331cecaeb5b1949bff1c3100a65adafe8e5985f40204c
3
  size 4540532728
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:551b3715b02a97fd2af4d39745e82e183fc10809e578801af9fb5fdd39551e6f
3
  size 4540532728
runs/May28_00-07-25_n136-112-146/events.out.tfevents.1716826553.n136-112-146.284037.0 CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:d00d7a6f35edb4776ffcc0381341f6e0a886f5ba1c80665a4f893b2276a86a7c
3
- size 28347
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:96385d3b27773ea70ec55132d40a85c9739d029bd06e376948e2a8e29febfd37
3
+ size 34893
train_results.json CHANGED
@@ -1,8 +1,9 @@
1
  {
2
  "epoch": 1.0,
3
- "train_loss": 0.44703073270859256,
4
- "train_runtime": 13842.8667,
5
- "train_samples": 111134,
6
- "train_samples_per_second": 8.028,
 
7
  "train_steps_per_second": 0.063
8
  }
 
1
  {
2
  "epoch": 1.0,
3
+ "total_flos": 0.0,
4
+ "train_loss": 0.4007269041922391,
5
+ "train_runtime": 6210.4356,
6
+ "train_samples": 50000,
7
+ "train_samples_per_second": 8.051,
8
  "train_steps_per_second": 0.063
9
  }
trainer_state.json CHANGED
@@ -1,21 +1,21 @@
1
  {
2
  "best_metric": null,
3
  "best_model_checkpoint": null,
4
- "epoch": 0.9994242947610823,
5
  "eval_steps": 100,
6
- "global_step": 868,
7
  "is_hyper_param_search": false,
8
  "is_local_process_zero": true,
9
  "is_world_process_zero": true,
10
  "log_history": [
11
  {
12
- "epoch": 0.0,
13
- "grad_norm": 35.129907946655834,
14
- "learning_rate": 5.747126436781609e-09,
15
- "logits/chosen": -1.8631134033203125,
16
- "logits/rejected": -1.9713879823684692,
17
- "logps/chosen": -395.93560791015625,
18
- "logps/rejected": -290.8868408203125,
19
  "loss": 0.6931,
20
  "rewards/accuracies": 0.0,
21
  "rewards/chosen": 0.0,
@@ -24,1435 +24,650 @@
24
  "step": 1
25
  },
26
  {
27
- "epoch": 0.01,
28
- "grad_norm": 38.80010312606095,
29
- "learning_rate": 5.747126436781609e-08,
30
- "logits/chosen": -2.041348457336426,
31
- "logits/rejected": -1.9895708560943604,
32
- "logps/chosen": -276.6951599121094,
33
- "logps/rejected": -224.53475952148438,
34
- "loss": 0.6932,
35
- "rewards/accuracies": 0.5,
36
- "rewards/chosen": 0.000690977496560663,
37
- "rewards/margins": 0.00028596253832802176,
38
- "rewards/rejected": 0.00040501501644030213,
39
  "step": 10
40
  },
41
  {
42
- "epoch": 0.02,
43
- "grad_norm": 34.46172746061437,
44
- "learning_rate": 1.1494252873563217e-07,
45
- "logits/chosen": -2.1249823570251465,
46
- "logits/rejected": -2.0916128158569336,
47
- "logps/chosen": -274.7043151855469,
48
- "logps/rejected": -239.5059051513672,
49
- "loss": 0.692,
50
- "rewards/accuracies": 0.5562499761581421,
51
- "rewards/chosen": 0.004372459836304188,
52
- "rewards/margins": 0.002485192846506834,
53
- "rewards/rejected": 0.0018872671062126756,
54
  "step": 20
55
  },
56
  {
57
- "epoch": 0.03,
58
- "grad_norm": 33.511897886478806,
59
- "learning_rate": 1.7241379310344828e-07,
60
- "logits/chosen": -2.1329550743103027,
61
- "logits/rejected": -2.1377687454223633,
62
- "logps/chosen": -245.65478515625,
63
- "logps/rejected": -219.1451873779297,
64
- "loss": 0.6874,
65
- "rewards/accuracies": 0.612500011920929,
66
- "rewards/chosen": 0.025536498054862022,
67
- "rewards/margins": 0.010243075899779797,
68
- "rewards/rejected": 0.015293421223759651,
69
  "step": 30
70
  },
71
  {
72
- "epoch": 0.05,
73
- "grad_norm": 30.025842330171532,
74
- "learning_rate": 2.2988505747126435e-07,
75
- "logits/chosen": -2.1390576362609863,
76
- "logits/rejected": -2.130236864089966,
77
- "logps/chosen": -259.17510986328125,
78
- "logps/rejected": -241.2788848876953,
79
- "loss": 0.676,
80
- "rewards/accuracies": 0.6937500238418579,
81
- "rewards/chosen": 0.08977474272251129,
82
- "rewards/margins": 0.036073412746191025,
83
- "rewards/rejected": 0.05370132252573967,
84
  "step": 40
85
  },
86
  {
87
- "epoch": 0.06,
88
- "grad_norm": 26.283273801474603,
89
- "learning_rate": 2.873563218390804e-07,
90
- "logits/chosen": -2.1333060264587402,
91
- "logits/rejected": -2.1513326168060303,
92
- "logps/chosen": -234.34951782226562,
93
- "logps/rejected": -225.25820922851562,
94
- "loss": 0.6593,
95
- "rewards/accuracies": 0.71875,
96
- "rewards/chosen": 0.14864465594291687,
97
- "rewards/margins": 0.07653863728046417,
98
- "rewards/rejected": 0.0721060186624527,
99
  "step": 50
100
  },
101
  {
102
- "epoch": 0.07,
103
- "grad_norm": 25.615307740380704,
104
- "learning_rate": 3.4482758620689656e-07,
105
- "logits/chosen": -2.084411859512329,
106
- "logits/rejected": -2.1329734325408936,
107
- "logps/chosen": -269.8699035644531,
108
- "logps/rejected": -241.92623901367188,
109
- "loss": 0.6364,
110
- "rewards/accuracies": 0.762499988079071,
111
- "rewards/chosen": 0.24832992255687714,
112
- "rewards/margins": 0.15597295761108398,
113
- "rewards/rejected": 0.09235697984695435,
114
  "step": 60
115
  },
116
  {
117
- "epoch": 0.08,
118
- "grad_norm": 24.88575351107564,
119
- "learning_rate": 4.0229885057471266e-07,
120
- "logits/chosen": -2.2113332748413086,
121
- "logits/rejected": -2.2210419178009033,
122
- "logps/chosen": -264.09210205078125,
123
- "logps/rejected": -247.2978515625,
124
- "loss": 0.6202,
125
- "rewards/accuracies": 0.7749999761581421,
126
- "rewards/chosen": 0.18283841013908386,
127
- "rewards/margins": 0.1584111750125885,
128
- "rewards/rejected": 0.024427231401205063,
129
  "step": 70
130
  },
131
  {
132
- "epoch": 0.09,
133
- "grad_norm": 27.297246578083513,
134
- "learning_rate": 4.597701149425287e-07,
135
- "logits/chosen": -2.150770425796509,
136
- "logits/rejected": -2.1732804775238037,
137
- "logps/chosen": -274.4136962890625,
138
- "logps/rejected": -255.2479248046875,
139
- "loss": 0.5908,
140
- "rewards/accuracies": 0.768750011920929,
141
- "rewards/chosen": 0.08161883056163788,
142
- "rewards/margins": 0.29927313327789307,
143
- "rewards/rejected": -0.2176542729139328,
144
  "step": 80
145
  },
146
  {
147
- "epoch": 0.1,
148
- "grad_norm": 28.084151245430267,
149
- "learning_rate": 4.999817969178237e-07,
150
- "logits/chosen": -2.112828254699707,
151
- "logits/rejected": -2.1538023948669434,
152
- "logps/chosen": -304.37359619140625,
153
- "logps/rejected": -301.63006591796875,
154
- "loss": 0.5668,
155
- "rewards/accuracies": 0.7749999761581421,
156
- "rewards/chosen": -0.11925234645605087,
157
- "rewards/margins": 0.42082518339157104,
158
- "rewards/rejected": -0.5400775074958801,
159
  "step": 90
160
  },
161
  {
162
- "epoch": 0.12,
163
- "grad_norm": 31.821457063829286,
164
- "learning_rate": 4.996582603056428e-07,
165
- "logits/chosen": -2.128788948059082,
166
- "logits/rejected": -2.174410581588745,
167
- "logps/chosen": -313.7573547363281,
168
- "logps/rejected": -319.2353820800781,
169
- "loss": 0.5557,
170
- "rewards/accuracies": 0.7250000238418579,
171
- "rewards/chosen": -0.30633050203323364,
172
- "rewards/margins": 0.45107221603393555,
173
- "rewards/rejected": -0.7574027180671692,
174
  "step": 100
175
  },
176
  {
177
- "epoch": 0.12,
178
- "eval_logits/chosen": -2.159374713897705,
179
- "eval_logits/rejected": -2.1569156646728516,
180
- "eval_logps/chosen": -415.816650390625,
181
- "eval_logps/rejected": -445.2613525390625,
182
- "eval_loss": 0.6118258833885193,
183
- "eval_rewards/accuracies": 0.671875,
184
- "eval_rewards/chosen": -0.36439263820648193,
185
- "eval_rewards/margins": 0.36210644245147705,
186
- "eval_rewards/rejected": -0.7264990210533142,
187
- "eval_runtime": 97.4725,
188
- "eval_samples_per_second": 20.519,
189
- "eval_steps_per_second": 0.328,
190
  "step": 100
191
  },
192
  {
193
- "epoch": 0.13,
194
- "grad_norm": 36.68008853886969,
195
- "learning_rate": 4.989308132738126e-07,
196
- "logits/chosen": -2.0658252239227295,
197
- "logits/rejected": -2.0949950218200684,
198
- "logps/chosen": -320.2489013671875,
199
- "logps/rejected": -331.5747375488281,
200
- "loss": 0.5429,
201
- "rewards/accuracies": 0.75,
202
- "rewards/chosen": -0.4602568745613098,
203
- "rewards/margins": 0.5536943674087524,
204
- "rewards/rejected": -1.013951301574707,
205
  "step": 110
206
  },
207
  {
208
- "epoch": 0.14,
209
- "grad_norm": 33.72936391730864,
210
- "learning_rate": 4.978006327248536e-07,
211
- "logits/chosen": -1.987633466720581,
212
- "logits/rejected": -1.984405755996704,
213
- "logps/chosen": -354.09942626953125,
214
- "logps/rejected": -387.8695068359375,
215
- "loss": 0.5007,
216
- "rewards/accuracies": 0.768750011920929,
217
- "rewards/chosen": -0.6619850993156433,
218
- "rewards/margins": 0.6314666271209717,
219
- "rewards/rejected": -1.2934516668319702,
220
  "step": 120
221
  },
222
  {
223
- "epoch": 0.15,
224
- "grad_norm": 56.97154479868482,
225
- "learning_rate": 4.962695471250032e-07,
226
- "logits/chosen": -1.9375616312026978,
227
- "logits/rejected": -1.9765644073486328,
228
- "logps/chosen": -340.1041564941406,
229
- "logps/rejected": -322.09417724609375,
230
- "loss": 0.5522,
231
- "rewards/accuracies": 0.7562500238418579,
232
- "rewards/chosen": -0.7063130140304565,
233
- "rewards/margins": 0.4115590453147888,
234
- "rewards/rejected": -1.1178721189498901,
235
  "step": 130
236
  },
237
  {
238
- "epoch": 0.16,
239
- "grad_norm": 37.7377259153511,
240
- "learning_rate": 4.94340033546025e-07,
241
- "logits/chosen": -2.1993556022644043,
242
- "logits/rejected": -2.185791492462158,
243
- "logps/chosen": -320.29559326171875,
244
- "logps/rejected": -332.7876281738281,
245
- "loss": 0.5222,
246
- "rewards/accuracies": 0.7749999761581421,
247
- "rewards/chosen": -0.33287590742111206,
248
- "rewards/margins": 0.6350983381271362,
249
- "rewards/rejected": -0.9679743051528931,
250
  "step": 140
251
  },
252
  {
253
- "epoch": 0.17,
254
- "grad_norm": 35.68760199609404,
255
- "learning_rate": 4.920152136576705e-07,
256
- "logits/chosen": -2.324455738067627,
257
- "logits/rejected": -2.3316075801849365,
258
- "logps/chosen": -342.5655517578125,
259
- "logps/rejected": -415.856689453125,
260
- "loss": 0.4726,
261
- "rewards/accuracies": 0.800000011920929,
262
- "rewards/chosen": -0.7312036156654358,
263
- "rewards/margins": 0.9272117614746094,
264
- "rewards/rejected": -1.65841543674469,
265
  "step": 150
266
  },
267
  {
268
- "epoch": 0.18,
269
- "grad_norm": 37.470403244655515,
270
- "learning_rate": 4.892988486772756e-07,
271
- "logits/chosen": -2.515658140182495,
272
- "logits/rejected": -2.526738405227661,
273
- "logps/chosen": -348.3170471191406,
274
- "logps/rejected": -359.30633544921875,
275
- "loss": 0.5029,
276
- "rewards/accuracies": 0.768750011920929,
277
- "rewards/chosen": -0.5424179434776306,
278
- "rewards/margins": 0.6865107417106628,
279
- "rewards/rejected": -1.228928804397583,
280
  "step": 160
281
  },
282
  {
283
- "epoch": 0.2,
284
- "grad_norm": 45.91976760863689,
285
- "learning_rate": 4.861953332846629e-07,
286
- "logits/chosen": -2.555476665496826,
287
- "logits/rejected": -2.5736944675445557,
288
- "logps/chosen": -318.83306884765625,
289
- "logps/rejected": -354.2410583496094,
290
- "loss": 0.4971,
291
- "rewards/accuracies": 0.762499988079071,
292
- "rewards/chosen": -0.5986114740371704,
293
- "rewards/margins": 0.7181976437568665,
294
- "rewards/rejected": -1.3168091773986816,
295
  "step": 170
296
  },
297
  {
298
- "epoch": 0.21,
299
- "grad_norm": 39.45147614462311,
300
- "learning_rate": 4.827096885121953e-07,
301
- "logits/chosen": -2.725339412689209,
302
- "logits/rejected": -2.790322780609131,
303
- "logps/chosen": -317.9709777832031,
304
- "logps/rejected": -369.8156433105469,
305
- "loss": 0.4663,
306
- "rewards/accuracies": 0.8062499761581421,
307
- "rewards/chosen": -0.6137017607688904,
308
- "rewards/margins": 0.9897038340568542,
309
- "rewards/rejected": -1.6034053564071655,
310
  "step": 180
311
  },
312
  {
313
- "epoch": 0.22,
314
- "grad_norm": 36.629506229246836,
315
- "learning_rate": 4.788475536214821e-07,
316
- "logits/chosen": -2.7514729499816895,
317
- "logits/rejected": -2.8462166786193848,
318
- "logps/chosen": -364.15350341796875,
319
- "logps/rejected": -426.7947692871094,
320
- "loss": 0.4836,
321
- "rewards/accuracies": 0.6937500238418579,
322
- "rewards/chosen": -1.0376867055892944,
323
- "rewards/margins": 0.9606195688247681,
324
- "rewards/rejected": -1.9983062744140625,
325
  "step": 190
326
  },
327
  {
328
- "epoch": 0.23,
329
- "grad_norm": 34.16905422902762,
330
- "learning_rate": 4.746151769798818e-07,
331
- "logits/chosen": -2.6336653232574463,
332
- "logits/rejected": -2.803025722503662,
333
- "logps/chosen": -316.927001953125,
334
- "logps/rejected": -380.2194519042969,
335
- "loss": 0.4689,
336
- "rewards/accuracies": 0.831250011920929,
337
- "rewards/chosen": -0.6853088140487671,
338
- "rewards/margins": 0.9961303472518921,
339
- "rewards/rejected": -1.6814391613006592,
340
  "step": 200
341
  },
342
  {
343
- "epoch": 0.23,
344
- "eval_logits/chosen": -2.857208251953125,
345
- "eval_logits/rejected": -2.9278781414031982,
346
- "eval_logps/chosen": -428.585205078125,
347
- "eval_logps/rejected": -502.7698669433594,
348
- "eval_loss": 0.5067635178565979,
349
- "eval_rewards/accuracies": 0.75,
350
- "eval_rewards/chosen": -0.49207818508148193,
351
- "eval_rewards/margins": 0.8095061779022217,
352
- "eval_rewards/rejected": -1.3015843629837036,
353
- "eval_runtime": 97.5856,
354
- "eval_samples_per_second": 20.495,
355
- "eval_steps_per_second": 0.328,
356
  "step": 200
357
  },
358
  {
359
- "epoch": 0.24,
360
- "grad_norm": 37.010315481275924,
361
- "learning_rate": 4.7001940595156055e-07,
362
- "logits/chosen": -2.86403226852417,
363
- "logits/rejected": -2.915546178817749,
364
- "logps/chosen": -384.78106689453125,
365
- "logps/rejected": -438.4271545410156,
366
- "loss": 0.4609,
367
- "rewards/accuracies": 0.699999988079071,
368
- "rewards/chosen": -0.7915019989013672,
369
- "rewards/margins": 0.7616507411003113,
370
- "rewards/rejected": -1.5531526803970337,
371
  "step": 210
372
  },
373
  {
374
- "epoch": 0.25,
375
- "grad_norm": 44.86769250632604,
376
- "learning_rate": 4.650676758194623e-07,
377
- "logits/chosen": -2.865612268447876,
378
- "logits/rejected": -2.9143929481506348,
379
- "logps/chosen": -381.2469177246094,
380
- "logps/rejected": -441.4414978027344,
381
- "loss": 0.4525,
382
- "rewards/accuracies": 0.7749999761581421,
383
- "rewards/chosen": -0.8893365859985352,
384
- "rewards/margins": 1.0212167501449585,
385
- "rewards/rejected": -1.910552978515625,
386
  "step": 220
387
  },
388
  {
389
- "epoch": 0.26,
390
- "grad_norm": 38.712466368217875,
391
- "learning_rate": 4.5976799775611215e-07,
392
- "logits/chosen": -3.0207266807556152,
393
- "logits/rejected": -3.1139583587646484,
394
- "logps/chosen": -360.5721130371094,
395
- "logps/rejected": -431.66455078125,
396
- "loss": 0.4648,
397
- "rewards/accuracies": 0.71875,
398
- "rewards/chosen": -0.9129144549369812,
399
- "rewards/margins": 0.9306351542472839,
400
- "rewards/rejected": -1.8435497283935547,
401
  "step": 230
402
  },
403
  {
404
- "epoch": 0.28,
405
- "grad_norm": 52.80138142568913,
406
- "learning_rate": 4.5412894586271543e-07,
407
- "logits/chosen": -3.160139322280884,
408
- "logits/rejected": -3.2236475944519043,
409
- "logps/chosen": -365.31378173828125,
410
- "logps/rejected": -439.4544372558594,
411
- "loss": 0.4615,
412
- "rewards/accuracies": 0.731249988079071,
413
- "rewards/chosen": -0.9612631797790527,
414
- "rewards/margins": 1.0881351232528687,
415
- "rewards/rejected": -2.049398183822632,
416
  "step": 240
417
  },
418
  {
419
- "epoch": 0.29,
420
- "grad_norm": 32.70960725306299,
421
- "learning_rate": 4.481596432975201e-07,
422
- "logits/chosen": -3.0666606426239014,
423
- "logits/rejected": -3.2191436290740967,
424
- "logps/chosen": -358.38006591796875,
425
- "logps/rejected": -420.057861328125,
426
- "loss": 0.4566,
427
- "rewards/accuracies": 0.762499988079071,
428
- "rewards/chosen": -0.7017436623573303,
429
- "rewards/margins": 0.9394323229789734,
430
- "rewards/rejected": -1.6411759853363037,
431
  "step": 250
432
  },
433
  {
434
- "epoch": 0.3,
435
- "grad_norm": 41.45445866458212,
436
- "learning_rate": 4.41869747515886e-07,
437
- "logits/chosen": -2.946607828140259,
438
- "logits/rejected": -3.0797600746154785,
439
- "logps/chosen": -370.9931335449219,
440
- "logps/rejected": -431.0896911621094,
441
- "loss": 0.4354,
442
- "rewards/accuracies": 0.8812500238418579,
443
- "rewards/chosen": -0.5950635075569153,
444
- "rewards/margins": 1.3376449346542358,
445
- "rewards/rejected": -1.932708740234375,
446
  "step": 260
447
  },
448
  {
449
- "epoch": 0.31,
450
- "grad_norm": 88.17109933362657,
451
- "learning_rate": 4.352694346459396e-07,
452
- "logits/chosen": -2.953822374343872,
453
- "logits/rejected": -3.0674452781677246,
454
- "logps/chosen": -338.36041259765625,
455
- "logps/rejected": -427.46630859375,
456
- "loss": 0.4567,
457
- "rewards/accuracies": 0.8187500238418579,
458
- "rewards/chosen": -0.6696688532829285,
459
- "rewards/margins": 1.309070348739624,
460
- "rewards/rejected": -1.9787391424179077,
461
  "step": 270
462
  },
463
  {
464
- "epoch": 0.32,
465
- "grad_norm": 34.8827153941344,
466
- "learning_rate": 4.2836938302509256e-07,
467
- "logits/chosen": -3.1520888805389404,
468
- "logits/rejected": -3.3024184703826904,
469
- "logps/chosen": -344.01055908203125,
470
- "logps/rejected": -420.909423828125,
471
- "loss": 0.4396,
472
- "rewards/accuracies": 0.8125,
473
- "rewards/chosen": -0.8860230445861816,
474
- "rewards/margins": 1.1038401126861572,
475
- "rewards/rejected": -1.989863395690918,
476
  "step": 280
477
  },
478
  {
479
- "epoch": 0.33,
480
- "grad_norm": 35.155375356854826,
481
- "learning_rate": 4.2118075592405874e-07,
482
- "logits/chosen": -3.3648910522460938,
483
- "logits/rejected": -3.546940565109253,
484
- "logps/chosen": -351.7215881347656,
485
- "logps/rejected": -464.04541015625,
486
- "loss": 0.4268,
487
- "rewards/accuracies": 0.7749999761581421,
488
- "rewards/chosen": -1.1209633350372314,
489
- "rewards/margins": 1.3907191753387451,
490
- "rewards/rejected": -2.5116827487945557,
491
  "step": 290
492
  },
493
  {
494
- "epoch": 0.35,
495
- "grad_norm": 34.14384163098325,
496
- "learning_rate": 4.137151834863213e-07,
497
- "logits/chosen": -3.596907377243042,
498
- "logits/rejected": -3.7265594005584717,
499
- "logps/chosen": -371.98834228515625,
500
- "logps/rejected": -488.18975830078125,
501
- "loss": 0.4351,
502
- "rewards/accuracies": 0.8125,
503
- "rewards/chosen": -0.8739571571350098,
504
- "rewards/margins": 1.3305376768112183,
505
- "rewards/rejected": -2.2044949531555176,
506
  "step": 300
507
  },
508
  {
509
- "epoch": 0.35,
510
- "eval_logits/chosen": -3.4653797149658203,
511
- "eval_logits/rejected": -3.652350425720215,
512
- "eval_logps/chosen": -432.01080322265625,
513
- "eval_logps/rejected": -524.9551391601562,
514
- "eval_loss": 0.4574473202228546,
515
- "eval_rewards/accuracies": 0.79296875,
516
- "eval_rewards/chosen": -0.5263344645500183,
517
- "eval_rewards/margins": 0.9971021413803101,
518
- "eval_rewards/rejected": -1.5234365463256836,
519
- "eval_runtime": 97.4167,
520
- "eval_samples_per_second": 20.53,
521
- "eval_steps_per_second": 0.328,
522
  "step": 300
523
  },
524
  {
525
- "epoch": 0.36,
526
- "grad_norm": 45.59368283329677,
527
- "learning_rate": 4.059847439122671e-07,
528
- "logits/chosen": -3.523773193359375,
529
- "logits/rejected": -3.7083168029785156,
530
- "logps/chosen": -374.57525634765625,
531
- "logps/rejected": -469.28533935546875,
532
- "loss": 0.4288,
533
- "rewards/accuracies": 0.8187500238418579,
534
- "rewards/chosen": -1.021723747253418,
535
- "rewards/margins": 1.3690807819366455,
536
- "rewards/rejected": -2.3908047676086426,
537
  "step": 310
538
  },
539
  {
540
- "epoch": 0.37,
541
- "grad_norm": 41.39688664580757,
542
- "learning_rate": 3.98001943918432e-07,
543
- "logits/chosen": -3.278153896331787,
544
- "logits/rejected": -3.5683510303497314,
545
- "logps/chosen": -419.21868896484375,
546
- "logps/rejected": -439.51837158203125,
547
- "loss": 0.4438,
548
- "rewards/accuracies": 0.78125,
549
- "rewards/chosen": -0.8592584729194641,
550
- "rewards/margins": 1.153443694114685,
551
- "rewards/rejected": -2.012701988220215,
552
  "step": 320
553
  },
554
  {
555
- "epoch": 0.38,
556
- "grad_norm": 48.21450790202707,
557
- "learning_rate": 3.8977969850346866e-07,
558
- "logits/chosen": -3.3799691200256348,
559
- "logits/rejected": -3.6289849281311035,
560
- "logps/chosen": -297.0557861328125,
561
- "logps/rejected": -381.7215270996094,
562
- "loss": 0.4422,
563
- "rewards/accuracies": 0.8187500238418579,
564
- "rewards/chosen": -0.5932060480117798,
565
- "rewards/margins": 1.224426507949829,
566
- "rewards/rejected": -1.8176324367523193,
567
  "step": 330
568
  },
569
  {
570
- "epoch": 0.39,
571
- "grad_norm": 43.46878894172161,
572
- "learning_rate": 3.8133131005357465e-07,
573
- "logits/chosen": -3.379418134689331,
574
- "logits/rejected": -3.6483395099639893,
575
- "logps/chosen": -331.81390380859375,
576
- "logps/rejected": -441.87957763671875,
577
- "loss": 0.4286,
578
- "rewards/accuracies": 0.731249988079071,
579
- "rewards/chosen": -0.8891918063163757,
580
- "rewards/margins": 1.391618013381958,
581
- "rewards/rejected": -2.2808098793029785,
582
  "step": 340
583
  },
584
  {
585
- "epoch": 0.4,
586
- "grad_norm": 45.9545965651521,
587
- "learning_rate": 3.7267044682118435e-07,
588
- "logits/chosen": -3.411963939666748,
589
- "logits/rejected": -3.587937593460083,
590
- "logps/chosen": -346.03521728515625,
591
- "logps/rejected": -445.9239807128906,
592
- "loss": 0.4628,
593
- "rewards/accuracies": 0.768750011920929,
594
- "rewards/chosen": -0.8978070020675659,
595
- "rewards/margins": 1.216806173324585,
596
- "rewards/rejected": -2.1146132946014404,
597
  "step": 350
598
  },
599
  {
600
- "epoch": 0.41,
601
- "grad_norm": 46.21389082842024,
602
- "learning_rate": 3.638111208117425e-07,
603
- "logits/chosen": -3.389965772628784,
604
- "logits/rejected": -3.6526336669921875,
605
- "logps/chosen": -344.18798828125,
606
- "logps/rejected": -436.8316955566406,
607
- "loss": 0.4428,
608
- "rewards/accuracies": 0.78125,
609
- "rewards/chosen": -0.7493476867675781,
610
- "rewards/margins": 1.355184555053711,
611
- "rewards/rejected": -2.10453200340271,
612
  "step": 360
613
  },
614
  {
615
- "epoch": 0.43,
616
- "grad_norm": 36.698900906534384,
617
- "learning_rate": 3.5476766511433605e-07,
618
- "logits/chosen": -3.333527088165283,
619
- "logits/rejected": -3.72214937210083,
620
- "logps/chosen": -369.4108581542969,
621
- "logps/rejected": -476.28759765625,
622
- "loss": 0.4406,
623
- "rewards/accuracies": 0.762499988079071,
624
- "rewards/chosen": -0.7729538679122925,
625
- "rewards/margins": 1.6176202297210693,
626
- "rewards/rejected": -2.3905742168426514,
627
  "step": 370
628
  },
629
  {
630
- "epoch": 0.44,
631
- "grad_norm": 36.19981517638812,
632
- "learning_rate": 3.455547107128602e-07,
633
- "logits/chosen": -3.6403069496154785,
634
- "logits/rejected": -3.8383822441101074,
635
- "logps/chosen": -354.4368591308594,
636
- "logps/rejected": -423.34710693359375,
637
- "loss": 0.4292,
638
- "rewards/accuracies": 0.824999988079071,
639
- "rewards/chosen": -0.5367406606674194,
640
- "rewards/margins": 1.240304946899414,
641
- "rewards/rejected": -1.777045488357544,
642
  "step": 380
643
  },
644
  {
645
- "epoch": 0.45,
646
- "grad_norm": 46.389830142880655,
647
- "learning_rate": 3.361871628152338e-07,
648
- "logits/chosen": -3.556725263595581,
649
- "logits/rejected": -3.9898459911346436,
650
- "logps/chosen": -356.79791259765625,
651
- "logps/rejected": -472.49267578125,
652
- "loss": 0.4147,
653
- "rewards/accuracies": 0.793749988079071,
654
- "rewards/chosen": -0.7947701811790466,
655
- "rewards/margins": 1.4320166110992432,
656
- "rewards/rejected": -2.2267866134643555,
657
  "step": 390
658
  },
659
- {
660
- "epoch": 0.46,
661
- "grad_norm": 44.843204651815924,
662
- "learning_rate": 3.2668017673896077e-07,
663
- "logits/chosen": -3.4979865550994873,
664
- "logits/rejected": -3.931288957595825,
665
- "logps/chosen": -354.89971923828125,
666
- "logps/rejected": -465.4832458496094,
667
- "loss": 0.3978,
668
- "rewards/accuracies": 0.762499988079071,
669
- "rewards/chosen": -0.9876173138618469,
670
- "rewards/margins": 1.556265115737915,
671
- "rewards/rejected": -2.543882369995117,
672
- "step": 400
673
- },
674
- {
675
- "epoch": 0.46,
676
- "eval_logits/chosen": -3.5189661979675293,
677
- "eval_logits/rejected": -3.8427517414093018,
678
- "eval_logps/chosen": -431.57208251953125,
679
- "eval_logps/rejected": -545.3043823242188,
680
- "eval_loss": 0.4129987061023712,
681
- "eval_rewards/accuracies": 0.8359375,
682
- "eval_rewards/chosen": -0.5219463109970093,
683
- "eval_rewards/margins": 1.204982876777649,
684
- "eval_rewards/rejected": -1.7269293069839478,
685
- "eval_runtime": 97.4385,
686
- "eval_samples_per_second": 20.526,
687
- "eval_steps_per_second": 0.328,
688
- "step": 400
689
- },
690
- {
691
- "epoch": 0.47,
692
- "grad_norm": 47.76955752807069,
693
- "learning_rate": 3.1704913339205103e-07,
694
- "logits/chosen": -3.461193799972534,
695
- "logits/rejected": -3.710247755050659,
696
- "logps/chosen": -387.45135498046875,
697
- "logps/rejected": -511.35479736328125,
698
- "loss": 0.4208,
699
- "rewards/accuracies": 0.762499988079071,
700
- "rewards/chosen": -0.8673557043075562,
701
- "rewards/margins": 1.5246984958648682,
702
- "rewards/rejected": -2.392054319381714,
703
- "step": 410
704
- },
705
- {
706
- "epoch": 0.48,
707
- "grad_norm": 39.876214448639374,
708
- "learning_rate": 3.0730961438896885e-07,
709
- "logits/chosen": -3.5806972980499268,
710
- "logits/rejected": -3.978752613067627,
711
- "logps/chosen": -361.4602966308594,
712
- "logps/rejected": -514.3541870117188,
713
- "loss": 0.4073,
714
- "rewards/accuracies": 0.8500000238418579,
715
- "rewards/chosen": -0.8498995900154114,
716
- "rewards/margins": 1.8191858530044556,
717
- "rewards/rejected": -2.6690852642059326,
718
- "step": 420
719
- },
720
- {
721
- "epoch": 0.5,
722
- "grad_norm": 38.614610181092246,
723
- "learning_rate": 2.9747737684186795e-07,
724
- "logits/chosen": -3.636990785598755,
725
- "logits/rejected": -3.9850502014160156,
726
- "logps/chosen": -363.47320556640625,
727
- "logps/rejected": -452.01422119140625,
728
- "loss": 0.428,
729
- "rewards/accuracies": 0.7749999761581421,
730
- "rewards/chosen": -0.8123588562011719,
731
- "rewards/margins": 1.3265635967254639,
732
- "rewards/rejected": -2.138922691345215,
733
- "step": 430
734
- },
735
- {
736
- "epoch": 0.51,
737
- "grad_norm": 42.82087860142725,
738
- "learning_rate": 2.8756832786789663e-07,
739
- "logits/chosen": -3.677544116973877,
740
- "logits/rejected": -4.333060264587402,
741
- "logps/chosen": -382.77850341796875,
742
- "logps/rejected": -544.0916748046875,
743
- "loss": 0.4003,
744
- "rewards/accuracies": 0.84375,
745
- "rewards/chosen": -0.8904153108596802,
746
- "rewards/margins": 2.0021920204162598,
747
- "rewards/rejected": -2.8926072120666504,
748
- "step": 440
749
- },
750
- {
751
- "epoch": 0.52,
752
- "grad_norm": 38.24345657824867,
753
- "learning_rate": 2.7759849885381747e-07,
754
- "logits/chosen": -3.5210156440734863,
755
- "logits/rejected": -4.05521297454834,
756
- "logps/chosen": -353.104736328125,
757
- "logps/rejected": -479.227294921875,
758
- "loss": 0.4046,
759
- "rewards/accuracies": 0.7875000238418579,
760
- "rewards/chosen": -0.8441875576972961,
761
- "rewards/margins": 1.6644071340560913,
762
- "rewards/rejected": -2.508594512939453,
763
- "step": 450
764
- },
765
- {
766
- "epoch": 0.53,
767
- "grad_norm": 45.50414470118614,
768
- "learning_rate": 2.675840195195762e-07,
769
- "logits/chosen": -3.977940320968628,
770
- "logits/rejected": -4.435048580169678,
771
- "logps/chosen": -329.555419921875,
772
- "logps/rejected": -462.91546630859375,
773
- "loss": 0.4234,
774
- "rewards/accuracies": 0.800000011920929,
775
- "rewards/chosen": -0.9291930198669434,
776
- "rewards/margins": 1.639059066772461,
777
- "rewards/rejected": -2.568251609802246,
778
- "step": 460
779
- },
780
- {
781
- "epoch": 0.54,
782
- "grad_norm": 35.15027402538317,
783
- "learning_rate": 2.575410918227829e-07,
784
- "logits/chosen": -3.773422956466675,
785
- "logits/rejected": -4.276736736297607,
786
- "logps/chosen": -353.8682556152344,
787
- "logps/rejected": -494.83465576171875,
788
- "loss": 0.4037,
789
- "rewards/accuracies": 0.8125,
790
- "rewards/chosen": -0.9828575253486633,
791
- "rewards/margins": 1.6659753322601318,
792
- "rewards/rejected": -2.6488327980041504,
793
- "step": 470
794
- },
795
- {
796
- "epoch": 0.55,
797
- "grad_norm": 39.80925602363646,
798
- "learning_rate": 2.474859637463226e-07,
799
- "logits/chosen": -3.872162342071533,
800
- "logits/rejected": -4.276012897491455,
801
- "logps/chosen": -351.5787048339844,
802
- "logps/rejected": -477.1171875,
803
- "loss": 0.4087,
804
- "rewards/accuracies": 0.8500000238418579,
805
- "rewards/chosen": -0.8465379476547241,
806
- "rewards/margins": 1.8354904651641846,
807
- "rewards/rejected": -2.682028293609619,
808
- "step": 480
809
- },
810
- {
811
- "epoch": 0.56,
812
- "grad_norm": 42.97505867432159,
813
- "learning_rate": 2.3743490301150355e-07,
814
- "logits/chosen": -3.802455186843872,
815
- "logits/rejected": -4.2998151779174805,
816
- "logps/chosen": -407.70086669921875,
817
- "logps/rejected": -548.3660888671875,
818
- "loss": 0.4037,
819
- "rewards/accuracies": 0.8062499761581421,
820
- "rewards/chosen": -1.167244553565979,
821
- "rewards/margins": 1.6480038166046143,
822
- "rewards/rejected": -2.8152482509613037,
823
- "step": 490
824
- },
825
- {
826
- "epoch": 0.58,
827
- "grad_norm": 42.20357617999531,
828
- "learning_rate": 2.274041707592724e-07,
829
- "logits/chosen": -3.8027241230010986,
830
- "logits/rejected": -4.319924831390381,
831
- "logps/chosen": -367.37005615234375,
832
- "logps/rejected": -488.4521484375,
833
- "loss": 0.422,
834
- "rewards/accuracies": 0.8187500238418579,
835
- "rewards/chosen": -1.0884382724761963,
836
- "rewards/margins": 1.6187102794647217,
837
- "rewards/rejected": -2.707148551940918,
838
- "step": 500
839
- },
840
- {
841
- "epoch": 0.58,
842
- "eval_logits/chosen": -3.665224552154541,
843
- "eval_logits/rejected": -3.9749138355255127,
844
- "eval_logps/chosen": -432.22039794921875,
845
- "eval_logps/rejected": -549.4501953125,
846
- "eval_loss": 0.3803901970386505,
847
- "eval_rewards/accuracies": 0.8515625,
848
- "eval_rewards/chosen": -0.5284299254417419,
849
- "eval_rewards/margins": 1.2399569749832153,
850
- "eval_rewards/rejected": -1.7683868408203125,
851
- "eval_runtime": 97.3437,
852
- "eval_samples_per_second": 20.546,
853
- "eval_steps_per_second": 0.329,
854
- "step": 500
855
- },
856
- {
857
- "epoch": 0.59,
858
- "grad_norm": 33.051939245132566,
859
- "learning_rate": 2.17409995242075e-07,
860
- "logits/chosen": -3.456348419189453,
861
- "logits/rejected": -3.8555312156677246,
862
- "logps/chosen": -373.09674072265625,
863
- "logps/rejected": -477.450927734375,
864
- "loss": 0.4091,
865
- "rewards/accuracies": 0.8812500238418579,
866
- "rewards/chosen": -0.6009220480918884,
867
- "rewards/margins": 1.515817403793335,
868
- "rewards/rejected": -2.116739511489868,
869
- "step": 510
870
- },
871
- {
872
- "epoch": 0.6,
873
- "grad_norm": 41.383478497964475,
874
- "learning_rate": 2.0746854556892544e-07,
875
- "logits/chosen": -3.676243305206299,
876
- "logits/rejected": -4.275219917297363,
877
- "logps/chosen": -366.2404479980469,
878
- "logps/rejected": -504.5061950683594,
879
- "loss": 0.4012,
880
- "rewards/accuracies": 0.800000011920929,
881
- "rewards/chosen": -0.9555729627609253,
882
- "rewards/margins": 1.7532062530517578,
883
- "rewards/rejected": -2.7087793350219727,
884
- "step": 520
885
- },
886
- {
887
- "epoch": 0.61,
888
- "grad_norm": 44.14308194161755,
889
- "learning_rate": 1.9759590554616173e-07,
890
- "logits/chosen": -3.820181369781494,
891
- "logits/rejected": -4.501372814178467,
892
- "logps/chosen": -359.88836669921875,
893
- "logps/rejected": -508.40576171875,
894
- "loss": 0.3817,
895
- "rewards/accuracies": 0.856249988079071,
896
- "rewards/chosen": -1.0667827129364014,
897
- "rewards/margins": 1.8904939889907837,
898
- "rewards/rejected": -2.9572768211364746,
899
- "step": 530
900
- },
901
- {
902
- "epoch": 0.62,
903
- "grad_norm": 39.56708257885465,
904
- "learning_rate": 1.8780804765620746e-07,
905
- "logits/chosen": -3.6856751441955566,
906
- "logits/rejected": -4.133795738220215,
907
- "logps/chosen": -404.65997314453125,
908
- "logps/rejected": -513.8917236328125,
909
- "loss": 0.3999,
910
- "rewards/accuracies": 0.78125,
911
- "rewards/chosen": -1.0721806287765503,
912
- "rewards/margins": 1.6528995037078857,
913
- "rewards/rejected": -2.7250800132751465,
914
- "step": 540
915
- },
916
- {
917
- "epoch": 0.63,
918
- "grad_norm": 44.047245390890936,
919
- "learning_rate": 1.7812080721643973e-07,
920
- "logits/chosen": -3.6747565269470215,
921
- "logits/rejected": -4.137899398803711,
922
- "logps/chosen": -376.62615966796875,
923
- "logps/rejected": -503.29266357421875,
924
- "loss": 0.4151,
925
- "rewards/accuracies": 0.800000011920929,
926
- "rewards/chosen": -0.8548510670661926,
927
- "rewards/margins": 1.795414686203003,
928
- "rewards/rejected": -2.650266170501709,
929
- "step": 550
930
- },
931
- {
932
- "epoch": 0.64,
933
- "grad_norm": 44.733238192011385,
934
- "learning_rate": 1.6854985675997063e-07,
935
- "logits/chosen": -3.412196397781372,
936
- "logits/rejected": -3.9336624145507812,
937
- "logps/chosen": -365.46795654296875,
938
- "logps/rejected": -478.50177001953125,
939
- "loss": 0.415,
940
- "rewards/accuracies": 0.793749988079071,
941
- "rewards/chosen": -0.848841667175293,
942
- "rewards/margins": 1.497734785079956,
943
- "rewards/rejected": -2.346576690673828,
944
- "step": 560
945
- },
946
- {
947
- "epoch": 0.66,
948
- "grad_norm": 50.78656478745754,
949
- "learning_rate": 1.5911068067978818e-07,
950
- "logits/chosen": -3.570014476776123,
951
- "logits/rejected": -3.9861984252929688,
952
- "logps/chosen": -386.4393005371094,
953
- "logps/rejected": -537.798095703125,
954
- "loss": 0.3867,
955
- "rewards/accuracies": 0.831250011920929,
956
- "rewards/chosen": -0.9028172492980957,
957
- "rewards/margins": 1.7015517950057983,
958
- "rewards/rejected": -2.6043689250946045,
959
- "step": 570
960
- },
961
- {
962
- "epoch": 0.67,
963
- "grad_norm": 37.91828473174147,
964
- "learning_rate": 1.4981855017728197e-07,
965
- "logits/chosen": -3.787682294845581,
966
- "logits/rejected": -4.240976333618164,
967
- "logps/chosen": -353.59393310546875,
968
- "logps/rejected": -502.26544189453125,
969
- "loss": 0.3785,
970
- "rewards/accuracies": 0.831250011920929,
971
- "rewards/chosen": -1.1089229583740234,
972
- "rewards/margins": 1.7554630041122437,
973
- "rewards/rejected": -2.8643860816955566,
974
- "step": 580
975
- },
976
- {
977
- "epoch": 0.68,
978
- "grad_norm": 41.30885920488077,
979
- "learning_rate": 1.406884985556804e-07,
980
- "logits/chosen": -3.8037681579589844,
981
- "logits/rejected": -4.2103681564331055,
982
- "logps/chosen": -368.0989685058594,
983
- "logps/rejected": -479.09893798828125,
984
- "loss": 0.4082,
985
- "rewards/accuracies": 0.800000011920929,
986
- "rewards/chosen": -0.9101566076278687,
987
- "rewards/margins": 1.4272234439849854,
988
- "rewards/rejected": -2.3373799324035645,
989
- "step": 590
990
- },
991
- {
992
- "epoch": 0.69,
993
- "grad_norm": 33.201388393145585,
994
- "learning_rate": 1.3173529689837354e-07,
995
- "logits/chosen": -3.958705186843872,
996
- "logits/rejected": -4.450201988220215,
997
- "logps/chosen": -338.07318115234375,
998
- "logps/rejected": -459.9598083496094,
999
- "loss": 0.3728,
1000
- "rewards/accuracies": 0.8187500238418579,
1001
- "rewards/chosen": -0.8850952386856079,
1002
- "rewards/margins": 1.6357240676879883,
1003
- "rewards/rejected": -2.5208191871643066,
1004
- "step": 600
1005
- },
1006
- {
1007
- "epoch": 0.69,
1008
- "eval_logits/chosen": -3.9204437732696533,
1009
- "eval_logits/rejected": -4.349194526672363,
1010
- "eval_logps/chosen": -447.3841857910156,
1011
- "eval_logps/rejected": -581.492919921875,
1012
- "eval_loss": 0.3498460352420807,
1013
- "eval_rewards/accuracies": 0.85546875,
1014
- "eval_rewards/chosen": -0.6800678968429565,
1015
- "eval_rewards/margins": 1.408746600151062,
1016
- "eval_rewards/rejected": -2.0888142585754395,
1017
- "eval_runtime": 97.3846,
1018
- "eval_samples_per_second": 20.537,
1019
- "eval_steps_per_second": 0.329,
1020
- "step": 600
1021
- },
1022
- {
1023
- "epoch": 0.7,
1024
- "grad_norm": 38.28289884444149,
1025
- "learning_rate": 1.2297343017146726e-07,
1026
- "logits/chosen": -3.8682377338409424,
1027
- "logits/rejected": -4.303684234619141,
1028
- "logps/chosen": -397.2498474121094,
1029
- "logps/rejected": -542.2492065429688,
1030
- "loss": 0.3926,
1031
- "rewards/accuracies": 0.824999988079071,
1032
- "rewards/chosen": -1.052696704864502,
1033
- "rewards/margins": 1.9617723226547241,
1034
- "rewards/rejected": -3.0144691467285156,
1035
- "step": 610
1036
- },
1037
- {
1038
- "epoch": 0.71,
1039
- "grad_norm": 44.09127513083321,
1040
- "learning_rate": 1.1441707378923474e-07,
1041
- "logits/chosen": -3.8735458850860596,
1042
- "logits/rejected": -4.310673713684082,
1043
- "logps/chosen": -406.3232727050781,
1044
- "logps/rejected": -514.9022216796875,
1045
- "loss": 0.3998,
1046
- "rewards/accuracies": 0.7875000238418579,
1047
- "rewards/chosen": -1.1163777112960815,
1048
- "rewards/margins": 1.5480899810791016,
1049
- "rewards/rejected": -2.664468288421631,
1050
- "step": 620
1051
- },
1052
- {
1053
- "epoch": 0.73,
1054
- "grad_norm": 55.66131693545821,
1055
- "learning_rate": 1.06080070680377e-07,
1056
- "logits/chosen": -3.904125690460205,
1057
- "logits/rejected": -4.380388259887695,
1058
- "logps/chosen": -392.47015380859375,
1059
- "logps/rejected": -493.78594970703125,
1060
- "loss": 0.413,
1061
- "rewards/accuracies": 0.75,
1062
- "rewards/chosen": -1.1083078384399414,
1063
- "rewards/margins": 1.3549706935882568,
1064
- "rewards/rejected": -2.4632785320281982,
1065
- "step": 630
1066
- },
1067
- {
1068
- "epoch": 0.74,
1069
- "grad_norm": 47.57857500564574,
1070
- "learning_rate": 9.797590889219587e-08,
1071
- "logits/chosen": -3.7111058235168457,
1072
- "logits/rejected": -4.241570472717285,
1073
- "logps/chosen": -390.62103271484375,
1074
- "logps/rejected": -555.2783813476562,
1075
- "loss": 0.3857,
1076
- "rewards/accuracies": 0.8187500238418579,
1077
- "rewards/chosen": -0.9739618301391602,
1078
- "rewards/margins": 1.9729865789413452,
1079
- "rewards/rejected": -2.946948289871216,
1080
- "step": 640
1081
- },
1082
- {
1083
- "epoch": 0.75,
1084
- "grad_norm": 54.021801628642386,
1085
- "learning_rate": 9.011769976891367e-08,
1086
- "logits/chosen": -3.948495388031006,
1087
- "logits/rejected": -4.6028947830200195,
1088
- "logps/chosen": -348.66473388671875,
1089
- "logps/rejected": -499.11309814453125,
1090
- "loss": 0.3966,
1091
- "rewards/accuracies": 0.862500011920929,
1092
- "rewards/chosen": -0.8676958084106445,
1093
- "rewards/margins": 1.9891424179077148,
1094
- "rewards/rejected": -2.8568382263183594,
1095
- "step": 650
1096
- },
1097
- {
1098
- "epoch": 0.76,
1099
- "grad_norm": 42.830226175531045,
1100
- "learning_rate": 8.251815673944218e-08,
1101
- "logits/chosen": -3.9013257026672363,
1102
- "logits/rejected": -4.469305992126465,
1103
- "logps/chosen": -356.7492370605469,
1104
- "logps/rejected": -473.14599609375,
1105
- "loss": 0.3908,
1106
- "rewards/accuracies": 0.7875000238418579,
1107
- "rewards/chosen": -1.1369212865829468,
1108
- "rewards/margins": 1.5198627710342407,
1109
- "rewards/rejected": -2.6567840576171875,
1110
- "step": 660
1111
- },
1112
- {
1113
- "epoch": 0.77,
1114
- "grad_norm": 35.97105604628807,
1115
- "learning_rate": 7.518957474892148e-08,
1116
- "logits/chosen": -3.7510364055633545,
1117
- "logits/rejected": -4.5982208251953125,
1118
- "logps/chosen": -359.37835693359375,
1119
- "logps/rejected": -531.6082153320312,
1120
- "loss": 0.3667,
1121
- "rewards/accuracies": 0.8374999761581421,
1122
- "rewards/chosen": -0.8581075668334961,
1123
- "rewards/margins": 2.218301773071289,
1124
- "rewards/rejected": -3.076409101486206,
1125
- "step": 670
1126
- },
1127
- {
1128
- "epoch": 0.78,
1129
- "grad_norm": 48.90631940415997,
1130
- "learning_rate": 6.814381036730274e-08,
1131
- "logits/chosen": -3.967960834503174,
1132
- "logits/rejected": -4.668499946594238,
1133
- "logps/chosen": -347.4683837890625,
1134
- "logps/rejected": -495.95947265625,
1135
- "loss": 0.4186,
1136
- "rewards/accuracies": 0.793749988079071,
1137
- "rewards/chosen": -1.1356189250946045,
1138
- "rewards/margins": 1.912573218345642,
1139
- "rewards/rejected": -3.048192262649536,
1140
- "step": 680
1141
- },
1142
- {
1143
- "epoch": 0.79,
1144
- "grad_norm": 44.70641719300435,
1145
- "learning_rate": 6.139226260715872e-08,
1146
- "logits/chosen": -3.9364871978759766,
1147
- "logits/rejected": -4.475349426269531,
1148
- "logps/chosen": -397.2518310546875,
1149
- "logps/rejected": -526.1500244140625,
1150
- "loss": 0.3962,
1151
- "rewards/accuracies": 0.8187500238418579,
1152
- "rewards/chosen": -1.067858099937439,
1153
- "rewards/margins": 1.8494539260864258,
1154
- "rewards/rejected": -2.9173121452331543,
1155
- "step": 690
1156
- },
1157
- {
1158
- "epoch": 0.81,
1159
- "grad_norm": 38.79291863118847,
1160
- "learning_rate": 5.4945854481754734e-08,
1161
- "logits/chosen": -3.8525912761688232,
1162
- "logits/rejected": -4.3074188232421875,
1163
- "logps/chosen": -403.70281982421875,
1164
- "logps/rejected": -516.3580322265625,
1165
- "loss": 0.4072,
1166
- "rewards/accuracies": 0.8062499761581421,
1167
- "rewards/chosen": -0.8484451174736023,
1168
- "rewards/margins": 1.8069216012954712,
1169
- "rewards/rejected": -2.6553666591644287,
1170
- "step": 700
1171
- },
1172
- {
1173
- "epoch": 0.81,
1174
- "eval_logits/chosen": -3.8217344284057617,
1175
- "eval_logits/rejected": -4.235738754272461,
1176
- "eval_logps/chosen": -438.1347961425781,
1177
- "eval_logps/rejected": -568.8267211914062,
1178
- "eval_loss": 0.34126752614974976,
1179
- "eval_rewards/accuracies": 0.87109375,
1180
- "eval_rewards/chosen": -0.587573766708374,
1181
- "eval_rewards/margins": 1.3745783567428589,
1182
- "eval_rewards/rejected": -1.9621522426605225,
1183
- "eval_runtime": 97.3651,
1184
- "eval_samples_per_second": 20.541,
1185
- "eval_steps_per_second": 0.329,
1186
- "step": 700
1187
- },
1188
- {
1189
- "epoch": 0.82,
1190
- "grad_norm": 42.4289491699843,
1191
- "learning_rate": 4.881501533321605e-08,
1192
- "logits/chosen": -3.841318130493164,
1193
- "logits/rejected": -4.349331378936768,
1194
- "logps/chosen": -369.50439453125,
1195
- "logps/rejected": -468.9639587402344,
1196
- "loss": 0.4047,
1197
- "rewards/accuracies": 0.7875000238418579,
1198
- "rewards/chosen": -0.9862189292907715,
1199
- "rewards/margins": 1.4564803838729858,
1200
- "rewards/rejected": -2.4426989555358887,
1201
- "step": 710
1202
- },
1203
- {
1204
- "epoch": 0.83,
1205
- "grad_norm": 34.598541000047966,
1206
- "learning_rate": 4.300966395938377e-08,
1207
- "logits/chosen": -3.7932441234588623,
1208
- "logits/rejected": -4.211984634399414,
1209
- "logps/chosen": -371.55584716796875,
1210
- "logps/rejected": -509.62091064453125,
1211
- "loss": 0.3922,
1212
- "rewards/accuracies": 0.8500000238418579,
1213
- "rewards/chosen": -1.0055599212646484,
1214
- "rewards/margins": 1.8010333776474,
1215
- "rewards/rejected": -2.806593179702759,
1216
- "step": 720
1217
- },
1218
- {
1219
- "epoch": 0.84,
1220
- "grad_norm": 43.981942186199504,
1221
- "learning_rate": 3.7539192566655246e-08,
1222
- "logits/chosen": -3.78422212600708,
1223
- "logits/rejected": -4.4000091552734375,
1224
- "logps/chosen": -382.2842102050781,
1225
- "logps/rejected": -524.8851928710938,
1226
- "loss": 0.3892,
1227
- "rewards/accuracies": 0.793749988079071,
1228
- "rewards/chosen": -0.9714164733886719,
1229
- "rewards/margins": 2.023409128189087,
1230
- "rewards/rejected": -2.9948253631591797,
1231
- "step": 730
1232
- },
1233
- {
1234
- "epoch": 0.85,
1235
- "grad_norm": 47.73160653198726,
1236
- "learning_rate": 3.24124515747731e-08,
1237
- "logits/chosen": -3.7954139709472656,
1238
- "logits/rejected": -4.441324234008789,
1239
- "logps/chosen": -391.67987060546875,
1240
- "logps/rejected": -511.4283752441406,
1241
- "loss": 0.4052,
1242
- "rewards/accuracies": 0.78125,
1243
- "rewards/chosen": -1.0643631219863892,
1244
- "rewards/margins": 1.8877410888671875,
1245
- "rewards/rejected": -2.952104330062866,
1246
- "step": 740
1247
- },
1248
- {
1249
- "epoch": 0.86,
1250
- "grad_norm": 32.946663985678285,
1251
- "learning_rate": 2.763773529814506e-08,
1252
- "logits/chosen": -3.753929853439331,
1253
- "logits/rejected": -4.232656478881836,
1254
- "logps/chosen": -393.05145263671875,
1255
- "logps/rejected": -542.4010009765625,
1256
- "loss": 0.3847,
1257
- "rewards/accuracies": 0.824999988079071,
1258
- "rewards/chosen": -1.14042067527771,
1259
- "rewards/margins": 1.8287441730499268,
1260
- "rewards/rejected": -2.9691648483276367,
1261
- "step": 750
1262
- },
1263
- {
1264
- "epoch": 0.88,
1265
- "grad_norm": 41.44563269413962,
1266
- "learning_rate": 2.3222768526860698e-08,
1267
- "logits/chosen": -3.9827494621276855,
1268
- "logits/rejected": -4.4769439697265625,
1269
- "logps/chosen": -360.85772705078125,
1270
- "logps/rejected": -520.6256103515625,
1271
- "loss": 0.4024,
1272
- "rewards/accuracies": 0.8374999761581421,
1273
- "rewards/chosen": -0.9786790013313293,
1274
- "rewards/margins": 1.8364450931549072,
1275
- "rewards/rejected": -2.815124034881592,
1276
- "step": 760
1277
- },
1278
- {
1279
- "epoch": 0.89,
1280
- "grad_norm": 41.56548534990344,
1281
- "learning_rate": 1.9174694029115146e-08,
1282
- "logits/chosen": -3.9031333923339844,
1283
- "logits/rejected": -4.248320579528809,
1284
- "logps/chosen": -380.02532958984375,
1285
- "logps/rejected": -533.628662109375,
1286
- "loss": 0.3734,
1287
- "rewards/accuracies": 0.7562500238418579,
1288
- "rewards/chosen": -1.0660566091537476,
1289
- "rewards/margins": 1.5903499126434326,
1290
- "rewards/rejected": -2.6564066410064697,
1291
- "step": 770
1292
- },
1293
- {
1294
- "epoch": 0.9,
1295
- "grad_norm": 44.05878651405709,
1296
- "learning_rate": 1.5500060995258134e-08,
1297
- "logits/chosen": -3.970170259475708,
1298
- "logits/rejected": -4.581896781921387,
1299
- "logps/chosen": -381.94769287109375,
1300
- "logps/rejected": -505.8392639160156,
1301
- "loss": 0.3764,
1302
- "rewards/accuracies": 0.8374999761581421,
1303
- "rewards/chosen": -1.0537292957305908,
1304
- "rewards/margins": 1.9894670248031616,
1305
- "rewards/rejected": -3.043196201324463,
1306
- "step": 780
1307
- },
1308
- {
1309
- "epoch": 0.91,
1310
- "grad_norm": 36.69938487477932,
1311
- "learning_rate": 1.2204814442165812e-08,
1312
- "logits/chosen": -3.7230095863342285,
1313
- "logits/rejected": -4.348383903503418,
1314
- "logps/chosen": -402.3287658691406,
1315
- "logps/rejected": -528.0650634765625,
1316
- "loss": 0.3848,
1317
- "rewards/accuracies": 0.8374999761581421,
1318
- "rewards/chosen": -0.8664236068725586,
1319
- "rewards/margins": 1.9163984060287476,
1320
- "rewards/rejected": -2.7828221321105957,
1321
- "step": 790
1322
- },
1323
- {
1324
- "epoch": 0.92,
1325
- "grad_norm": 77.77225540035298,
1326
- "learning_rate": 9.294285595075669e-09,
1327
- "logits/chosen": -3.828099012374878,
1328
- "logits/rejected": -4.454224586486816,
1329
- "logps/chosen": -401.0879211425781,
1330
- "logps/rejected": -550.5147705078125,
1331
- "loss": 0.388,
1332
- "rewards/accuracies": 0.862500011920929,
1333
- "rewards/chosen": -1.066781759262085,
1334
- "rewards/margins": 1.9501686096191406,
1335
- "rewards/rejected": -3.0169506072998047,
1336
- "step": 800
1337
- },
1338
- {
1339
- "epoch": 0.92,
1340
- "eval_logits/chosen": -3.932162046432495,
1341
- "eval_logits/rejected": -4.366703510284424,
1342
- "eval_logps/chosen": -442.8078918457031,
1343
- "eval_logps/rejected": -579.6317138671875,
1344
- "eval_loss": 0.3309651017189026,
1345
- "eval_rewards/accuracies": 0.87109375,
1346
- "eval_rewards/chosen": -0.6343047022819519,
1347
- "eval_rewards/margins": 1.435897707939148,
1348
- "eval_rewards/rejected": -2.070202350616455,
1349
- "eval_runtime": 97.3029,
1350
- "eval_samples_per_second": 20.554,
1351
- "eval_steps_per_second": 0.329,
1352
- "step": 800
1353
- },
1354
- {
1355
- "epoch": 0.93,
1356
- "grad_norm": 41.84703205105403,
1357
- "learning_rate": 6.773183262446914e-09,
1358
- "logits/chosen": -4.019244194030762,
1359
- "logits/rejected": -4.569401741027832,
1360
- "logps/chosen": -403.5866394042969,
1361
- "logps/rejected": -542.3660888671875,
1362
- "loss": 0.4038,
1363
- "rewards/accuracies": 0.831250011920929,
1364
- "rewards/chosen": -1.0097706317901611,
1365
- "rewards/margins": 1.7525224685668945,
1366
- "rewards/rejected": -2.7622933387756348,
1367
- "step": 810
1368
- },
1369
- {
1370
- "epoch": 0.94,
1371
- "grad_norm": 43.70341312986973,
1372
- "learning_rate": 4.645586217799452e-09,
1373
- "logits/chosen": -4.0331034660339355,
1374
- "logits/rejected": -4.535307884216309,
1375
- "logps/chosen": -368.22479248046875,
1376
- "logps/rejected": -495.12615966796875,
1377
- "loss": 0.4055,
1378
- "rewards/accuracies": 0.793749988079071,
1379
- "rewards/chosen": -0.9445031881332397,
1380
- "rewards/margins": 1.8414275646209717,
1381
- "rewards/rejected": -2.785930871963501,
1382
- "step": 820
1383
- },
1384
- {
1385
- "epoch": 0.96,
1386
- "grad_norm": 39.756368871146144,
1387
- "learning_rate": 2.9149366008568987e-09,
1388
- "logits/chosen": -3.8695130348205566,
1389
- "logits/rejected": -4.424549102783203,
1390
- "logps/chosen": -382.7538757324219,
1391
- "logps/rejected": -543.4766845703125,
1392
- "loss": 0.3869,
1393
- "rewards/accuracies": 0.831250011920929,
1394
- "rewards/chosen": -0.9384629130363464,
1395
- "rewards/margins": 2.0100338459014893,
1396
- "rewards/rejected": -2.9484963417053223,
1397
- "step": 830
1398
- },
1399
- {
1400
- "epoch": 0.97,
1401
- "grad_norm": 41.81245135848943,
1402
- "learning_rate": 1.5840343486700215e-09,
1403
- "logits/chosen": -3.9648585319519043,
1404
- "logits/rejected": -4.646535396575928,
1405
- "logps/chosen": -393.53857421875,
1406
- "logps/rejected": -504.774658203125,
1407
- "loss": 0.385,
1408
- "rewards/accuracies": 0.8125,
1409
- "rewards/chosen": -0.9394033551216125,
1410
- "rewards/margins": 1.7747853994369507,
1411
- "rewards/rejected": -2.714188814163208,
1412
- "step": 840
1413
- },
1414
- {
1415
- "epoch": 0.98,
1416
- "grad_norm": 43.5733624216965,
1417
- "learning_rate": 6.550326657293881e-10,
1418
- "logits/chosen": -4.054198265075684,
1419
- "logits/rejected": -4.520740509033203,
1420
- "logps/chosen": -346.19580078125,
1421
- "logps/rejected": -487.6908264160156,
1422
- "loss": 0.3841,
1423
- "rewards/accuracies": 0.7749999761581421,
1424
- "rewards/chosen": -1.0814578533172607,
1425
- "rewards/margins": 1.7187023162841797,
1426
- "rewards/rejected": -2.8001601696014404,
1427
- "step": 850
1428
- },
1429
- {
1430
- "epoch": 0.99,
1431
- "grad_norm": 41.788009788359076,
1432
- "learning_rate": 1.2943454039654467e-10,
1433
- "logits/chosen": -4.023499488830566,
1434
- "logits/rejected": -4.535801887512207,
1435
- "logps/chosen": -351.8265686035156,
1436
- "logps/rejected": -521.2282104492188,
1437
- "loss": 0.385,
1438
- "rewards/accuracies": 0.8187500238418579,
1439
- "rewards/chosen": -1.1059845685958862,
1440
- "rewards/margins": 1.947348952293396,
1441
- "rewards/rejected": -3.0533337593078613,
1442
- "step": 860
1443
- },
1444
  {
1445
  "epoch": 1.0,
1446
- "step": 868,
1447
  "total_flos": 0.0,
1448
- "train_loss": 0.44703073270859256,
1449
- "train_runtime": 13842.8667,
1450
- "train_samples_per_second": 8.028,
1451
  "train_steps_per_second": 0.063
1452
  }
1453
  ],
1454
  "logging_steps": 10,
1455
- "max_steps": 868,
1456
  "num_input_tokens_seen": 0,
1457
  "num_train_epochs": 1,
1458
  "save_steps": 100,
 
1
  {
2
  "best_metric": null,
3
  "best_model_checkpoint": null,
4
+ "epoch": 1.0,
5
  "eval_steps": 100,
6
+ "global_step": 391,
7
  "is_hyper_param_search": false,
8
  "is_local_process_zero": true,
9
  "is_world_process_zero": true,
10
  "log_history": [
11
  {
12
+ "epoch": 0.0025575447570332483,
13
+ "grad_norm": 42.05566856949971,
14
+ "learning_rate": 1.25e-09,
15
+ "logits/chosen": -4.623842239379883,
16
+ "logits/rejected": -4.85917854309082,
17
+ "logps/chosen": -239.31422424316406,
18
+ "logps/rejected": -207.56365966796875,
19
  "loss": 0.6931,
20
  "rewards/accuracies": 0.0,
21
  "rewards/chosen": 0.0,
 
24
  "step": 1
25
  },
26
  {
27
+ "epoch": 0.02557544757033248,
28
+ "grad_norm": 39.368501883614215,
29
+ "learning_rate": 1.25e-08,
30
+ "logits/chosen": -4.334437370300293,
31
+ "logits/rejected": -4.64446496963501,
32
+ "logps/chosen": -265.1294250488281,
33
+ "logps/rejected": -215.75079345703125,
34
+ "loss": 0.6929,
35
+ "rewards/accuracies": 0.4513888955116272,
36
+ "rewards/chosen": -0.00022488113609142601,
37
+ "rewards/margins": 0.0002015345817198977,
38
+ "rewards/rejected": -0.0004264157032594085,
39
  "step": 10
40
  },
41
  {
42
+ "epoch": 0.05115089514066496,
43
+ "grad_norm": 41.32160817525484,
44
+ "learning_rate": 2.5e-08,
45
+ "logits/chosen": -4.507131576538086,
46
+ "logits/rejected": -4.741620063781738,
47
+ "logps/chosen": -267.7641906738281,
48
+ "logps/rejected": -216.6431427001953,
49
+ "loss": 0.6928,
50
+ "rewards/accuracies": 0.574999988079071,
51
+ "rewards/chosen": 0.001757637015543878,
52
+ "rewards/margins": 0.0020956869702786207,
53
+ "rewards/rejected": -0.000338050042046234,
54
  "step": 20
55
  },
56
  {
57
+ "epoch": 0.07672634271099744,
58
+ "grad_norm": 43.9248767727728,
59
+ "learning_rate": 3.75e-08,
60
+ "logits/chosen": -4.58504581451416,
61
+ "logits/rejected": -4.763764381408691,
62
+ "logps/chosen": -258.29669189453125,
63
+ "logps/rejected": -214.74630737304688,
64
+ "loss": 0.6912,
65
+ "rewards/accuracies": 0.6312500238418579,
66
+ "rewards/chosen": 0.00236110738478601,
67
+ "rewards/margins": 0.0044965180568397045,
68
+ "rewards/rejected": -0.002135410439223051,
69
  "step": 30
70
  },
71
  {
72
+ "epoch": 0.10230179028132992,
73
+ "grad_norm": 43.260460439031476,
74
+ "learning_rate": 5e-08,
75
+ "logits/chosen": -4.622879981994629,
76
+ "logits/rejected": -4.708461284637451,
77
+ "logps/chosen": -252.55868530273438,
78
+ "logps/rejected": -220.5706329345703,
79
+ "loss": 0.6843,
80
+ "rewards/accuracies": 0.793749988079071,
81
+ "rewards/chosen": 0.010404979810118675,
82
+ "rewards/margins": 0.019400382414460182,
83
+ "rewards/rejected": -0.008995403535664082,
84
  "step": 40
85
  },
86
  {
87
+ "epoch": 0.1278772378516624,
88
+ "grad_norm": 42.12222343545266,
89
+ "learning_rate": 4.989992961303737e-08,
90
+ "logits/chosen": -4.523015022277832,
91
+ "logits/rejected": -4.722769737243652,
92
+ "logps/chosen": -269.7854919433594,
93
+ "logps/rejected": -228.0283660888672,
94
+ "loss": 0.6709,
95
+ "rewards/accuracies": 0.84375,
96
+ "rewards/chosen": 0.021838409826159477,
97
+ "rewards/margins": 0.04340698570013046,
98
+ "rewards/rejected": -0.021568577736616135,
99
  "step": 50
100
  },
101
  {
102
+ "epoch": 0.1534526854219949,
103
+ "grad_norm": 41.2743182340779,
104
+ "learning_rate": 4.960051957873725e-08,
105
+ "logits/chosen": -4.632551670074463,
106
+ "logits/rejected": -4.7638654708862305,
107
+ "logps/chosen": -237.92984008789062,
108
+ "logps/rejected": -220.9913787841797,
109
+ "loss": 0.647,
110
+ "rewards/accuracies": 0.862500011920929,
111
+ "rewards/chosen": 0.0319262370467186,
112
+ "rewards/margins": 0.09162561595439911,
113
+ "rewards/rejected": -0.05969937518239021,
114
  "step": 60
115
  },
116
  {
117
+ "epoch": 0.17902813299232737,
118
+ "grad_norm": 39.1856071564541,
119
+ "learning_rate": 4.910416686333906e-08,
120
+ "logits/chosen": -4.551320552825928,
121
+ "logits/rejected": -4.791552543640137,
122
+ "logps/chosen": -249.8735809326172,
123
+ "logps/rejected": -230.2266082763672,
124
+ "loss": 0.6219,
125
+ "rewards/accuracies": 0.862500011920929,
126
+ "rewards/chosen": 0.032683782279491425,
127
+ "rewards/margins": 0.1573253571987152,
128
+ "rewards/rejected": -0.12464158236980438,
129
  "step": 70
130
  },
131
  {
132
+ "epoch": 0.20460358056265984,
133
+ "grad_norm": 40.933579332132446,
134
+ "learning_rate": 4.841484508350678e-08,
135
+ "logits/chosen": -4.6027655601501465,
136
+ "logits/rejected": -4.830323696136475,
137
+ "logps/chosen": -258.7153625488281,
138
+ "logps/rejected": -251.62368774414062,
139
+ "loss": 0.5731,
140
+ "rewards/accuracies": 0.8500000238418579,
141
+ "rewards/chosen": 0.025046274065971375,
142
+ "rewards/margins": 0.2380959540605545,
143
+ "rewards/rejected": -0.21304969489574432,
144
  "step": 80
145
  },
146
  {
147
+ "epoch": 0.23017902813299232,
148
+ "grad_norm": 40.245773651041624,
149
+ "learning_rate": 4.7538072695020406e-08,
150
+ "logits/chosen": -4.726979732513428,
151
+ "logits/rejected": -4.926435470581055,
152
+ "logps/chosen": -248.79373168945312,
153
+ "logps/rejected": -241.45919799804688,
154
+ "loss": 0.5258,
155
+ "rewards/accuracies": 0.8500000238418579,
156
+ "rewards/chosen": 0.024190178140997887,
157
+ "rewards/margins": 0.41048234701156616,
158
+ "rewards/rejected": -0.38629215955734253,
159
  "step": 90
160
  },
161
  {
162
+ "epoch": 0.2557544757033248,
163
+ "grad_norm": 36.66878437086382,
164
+ "learning_rate": 4.6480868814055416e-08,
165
+ "logits/chosen": -4.612161636352539,
166
+ "logits/rejected": -4.8301801681518555,
167
+ "logps/chosen": -252.9623260498047,
168
+ "logps/rejected": -275.35980224609375,
169
+ "loss": 0.5044,
170
+ "rewards/accuracies": 0.84375,
171
+ "rewards/chosen": 0.03755917400121689,
172
+ "rewards/margins": 0.5200667977333069,
173
+ "rewards/rejected": -0.4825075566768646,
174
  "step": 100
175
  },
176
  {
177
+ "epoch": 0.2557544757033248,
178
+ "eval_logits/chosen": -4.655425071716309,
179
+ "eval_logits/rejected": -4.8306565284729,
180
+ "eval_logps/chosen": -401.76690673828125,
181
+ "eval_logps/rejected": -527.1570434570312,
182
+ "eval_loss": 0.7104954123497009,
183
+ "eval_rewards/accuracies": 0.47265625,
184
+ "eval_rewards/chosen": -0.11292455345392227,
185
+ "eval_rewards/margins": -0.02523117884993553,
186
+ "eval_rewards/rejected": -0.08769337832927704,
187
+ "eval_runtime": 98.6531,
188
+ "eval_samples_per_second": 20.273,
189
+ "eval_steps_per_second": 0.324,
190
  "step": 100
191
  },
192
  {
193
+ "epoch": 0.2813299232736573,
194
+ "grad_norm": 33.07145934836895,
195
+ "learning_rate": 4.525169702472916e-08,
196
+ "logits/chosen": -4.64766788482666,
197
+ "logits/rejected": -4.833477020263672,
198
+ "logps/chosen": -244.29296875,
199
+ "logps/rejected": -270.46356201171875,
200
+ "loss": 0.4684,
201
+ "rewards/accuracies": 0.856249988079071,
202
+ "rewards/chosen": 0.04892749339342117,
203
+ "rewards/margins": 0.586704432964325,
204
+ "rewards/rejected": -0.5377769470214844,
205
  "step": 110
206
  },
207
  {
208
+ "epoch": 0.3069053708439898,
209
+ "grad_norm": 33.5848438881061,
210
+ "learning_rate": 4.386039762276975e-08,
211
+ "logits/chosen": -4.567343235015869,
212
+ "logits/rejected": -4.763147830963135,
213
+ "logps/chosen": -260.3200988769531,
214
+ "logps/rejected": -289.777587890625,
215
+ "loss": 0.4413,
216
+ "rewards/accuracies": 0.887499988079071,
217
+ "rewards/chosen": 0.15669476985931396,
218
+ "rewards/margins": 0.7683843374252319,
219
+ "rewards/rejected": -0.611689567565918,
220
  "step": 120
221
  },
222
  {
223
+ "epoch": 0.33248081841432225,
224
+ "grad_norm": 33.705176699923506,
225
+ "learning_rate": 4.231810883773999e-08,
226
+ "logits/chosen": -4.627255916595459,
227
+ "logits/rejected": -4.849207878112793,
228
+ "logps/chosen": -243.1725616455078,
229
+ "logps/rejected": -288.8663635253906,
230
+ "loss": 0.4074,
231
+ "rewards/accuracies": 0.893750011920929,
232
+ "rewards/chosen": 0.15271326899528503,
233
+ "rewards/margins": 0.8880899548530579,
234
+ "rewards/rejected": -0.7353767156600952,
235
  "step": 130
236
  },
237
  {
238
+ "epoch": 0.35805626598465473,
239
+ "grad_norm": 28.886999424269227,
240
+ "learning_rate": 4.063717766448194e-08,
241
+ "logits/chosen": -4.684683322906494,
242
+ "logits/rejected": -4.887022972106934,
243
+ "logps/chosen": -271.5019226074219,
244
+ "logps/rejected": -315.28717041015625,
245
+ "loss": 0.3858,
246
+ "rewards/accuracies": 0.8500000238418579,
247
+ "rewards/chosen": 0.1269315481185913,
248
+ "rewards/margins": 0.9580682516098022,
249
+ "rewards/rejected": -0.8311365842819214,
250
  "step": 140
251
  },
252
  {
253
+ "epoch": 0.3836317135549872,
254
+ "grad_norm": 34.24616985646881,
255
+ "learning_rate": 3.8831061017632845e-08,
256
+ "logits/chosen": -4.733763694763184,
257
+ "logits/rejected": -4.927274703979492,
258
+ "logps/chosen": -237.88955688476562,
259
+ "logps/rejected": -312.8567199707031,
260
+ "loss": 0.3767,
261
+ "rewards/accuracies": 0.893750011920929,
262
+ "rewards/chosen": 0.14528730511665344,
263
+ "rewards/margins": 1.0826618671417236,
264
+ "rewards/rejected": -0.937374472618103,
265
  "step": 150
266
  },
267
  {
268
+ "epoch": 0.4092071611253197,
269
+ "grad_norm": 32.75808894341811,
270
+ "learning_rate": 3.691421800053269e-08,
271
+ "logits/chosen": -4.803037166595459,
272
+ "logits/rejected": -4.954171657562256,
273
+ "logps/chosen": -236.3014373779297,
274
+ "logps/rejected": -314.69329833984375,
275
+ "loss": 0.3428,
276
+ "rewards/accuracies": 0.8812500238418579,
277
+ "rewards/chosen": 0.1515505015850067,
278
+ "rewards/margins": 1.1423927545547485,
279
+ "rewards/rejected": -0.9908422231674194,
280
  "step": 160
281
  },
282
  {
283
+ "epoch": 0.43478260869565216,
284
+ "grad_norm": 27.946215188997492,
285
+ "learning_rate": 3.490199415097892e-08,
286
+ "logits/chosen": -4.722456932067871,
287
+ "logits/rejected": -4.946799278259277,
288
+ "logps/chosen": -246.8963623046875,
289
+ "logps/rejected": -339.89215087890625,
290
+ "loss": 0.3253,
291
+ "rewards/accuracies": 0.90625,
292
+ "rewards/chosen": 0.14177896082401276,
293
+ "rewards/margins": 1.3426704406738281,
294
+ "rewards/rejected": -1.2008916139602661,
295
  "step": 170
296
  },
297
  {
298
+ "epoch": 0.46035805626598464,
299
+ "grad_norm": 30.919214925357334,
300
+ "learning_rate": 3.2810498590513937e-08,
301
+ "logits/chosen": -4.84631872177124,
302
+ "logits/rejected": -5.061424255371094,
303
+ "logps/chosen": -226.3748016357422,
304
+ "logps/rejected": -312.2584533691406,
305
+ "loss": 0.3453,
306
+ "rewards/accuracies": 0.856249988079071,
307
+ "rewards/chosen": 0.10488839447498322,
308
+ "rewards/margins": 1.2672706842422485,
309
+ "rewards/rejected": -1.1623823642730713,
310
  "step": 180
311
  },
312
  {
313
+ "epoch": 0.4859335038363171,
314
+ "grad_norm": 35.218898553113945,
315
+ "learning_rate": 3.065647506074306e-08,
316
+ "logits/chosen": -4.800053596496582,
317
+ "logits/rejected": -4.945647716522217,
318
+ "logps/chosen": -250.15170288085938,
319
+ "logps/rejected": -353.6054382324219,
320
+ "loss": 0.3307,
321
+ "rewards/accuracies": 0.8687499761581421,
322
+ "rewards/chosen": 0.10741142183542252,
323
+ "rewards/margins": 1.2918050289154053,
324
+ "rewards/rejected": -1.1843936443328857,
325
  "step": 190
326
  },
327
  {
328
+ "epoch": 0.5115089514066496,
329
+ "grad_norm": 34.27582139337608,
330
+ "learning_rate": 2.8457167879118325e-08,
331
+ "logits/chosen": -4.8446269035339355,
332
+ "logits/rejected": -5.029626846313477,
333
+ "logps/chosen": -245.35513305664062,
334
+ "logps/rejected": -340.13250732421875,
335
+ "loss": 0.3343,
336
+ "rewards/accuracies": 0.824999988079071,
337
+ "rewards/chosen": 0.06784123182296753,
338
+ "rewards/margins": 1.3157011270523071,
339
+ "rewards/rejected": -1.2478597164154053,
340
  "step": 200
341
  },
342
  {
343
+ "epoch": 0.5115089514066496,
344
+ "eval_logits/chosen": -4.865734100341797,
345
+ "eval_logits/rejected": -5.110110282897949,
346
+ "eval_logps/chosen": -442.470703125,
347
+ "eval_logps/rejected": -579.5608520507812,
348
+ "eval_loss": 0.6981855630874634,
349
+ "eval_rewards/accuracies": 0.55859375,
350
+ "eval_rewards/chosen": -0.51996248960495,
351
+ "eval_rewards/margins": 0.09176936745643616,
352
+ "eval_rewards/rejected": -0.6117318868637085,
353
+ "eval_runtime": 98.533,
354
+ "eval_samples_per_second": 20.298,
355
+ "eval_steps_per_second": 0.325,
356
  "step": 200
357
  },
358
  {
359
+ "epoch": 0.5370843989769821,
360
+ "grad_norm": 27.49657850496402,
361
+ "learning_rate": 2.6230183887296952e-08,
362
+ "logits/chosen": -4.9495344161987305,
363
+ "logits/rejected": -5.170707702636719,
364
+ "logps/chosen": -253.7008514404297,
365
+ "logps/rejected": -388.9451904296875,
366
+ "loss": 0.3045,
367
+ "rewards/accuracies": 0.9312499761581421,
368
+ "rewards/chosen": 0.12184244394302368,
369
+ "rewards/margins": 1.7248340845108032,
370
+ "rewards/rejected": -1.6029917001724243,
371
  "step": 210
372
  },
373
  {
374
+ "epoch": 0.5626598465473146,
375
+ "grad_norm": 31.60560013082535,
376
+ "learning_rate": 2.3993351497264626e-08,
377
+ "logits/chosen": -4.796377658843994,
378
+ "logits/rejected": -5.133594989776611,
379
+ "logps/chosen": -251.61669921875,
380
+ "logps/rejected": -385.34405517578125,
381
+ "loss": 0.3118,
382
+ "rewards/accuracies": 0.9125000238418579,
383
+ "rewards/chosen": 0.11194615066051483,
384
+ "rewards/margins": 1.718654990196228,
385
+ "rewards/rejected": -1.6067088842391968,
386
  "step": 220
387
  },
388
  {
389
+ "epoch": 0.5882352941176471,
390
+ "grad_norm": 31.96350965289743,
391
+ "learning_rate": 2.1764577963648613e-08,
392
+ "logits/chosen": -4.8748345375061035,
393
+ "logits/rejected": -5.184638977050781,
394
+ "logps/chosen": -258.0860900878906,
395
+ "logps/rejected": -386.3064270019531,
396
+ "loss": 0.3152,
397
+ "rewards/accuracies": 0.90625,
398
+ "rewards/chosen": 0.028830066323280334,
399
+ "rewards/margins": 1.6547828912734985,
400
+ "rewards/rejected": -1.6259527206420898,
401
  "step": 230
402
  },
403
  {
404
+ "epoch": 0.6138107416879796,
405
+ "grad_norm": 34.58110281626622,
406
+ "learning_rate": 1.9561706024845818e-08,
407
+ "logits/chosen": -4.866055488586426,
408
+ "logits/rejected": -5.124758243560791,
409
+ "logps/chosen": -271.0610046386719,
410
+ "logps/rejected": -398.76727294921875,
411
+ "loss": 0.3015,
412
+ "rewards/accuracies": 0.9125000238418579,
413
+ "rewards/chosen": 0.017621681094169617,
414
+ "rewards/margins": 1.7959582805633545,
415
+ "rewards/rejected": -1.778336763381958,
416
  "step": 240
417
  },
418
  {
419
+ "epoch": 0.639386189258312,
420
+ "grad_norm": 34.427123510289576,
421
+ "learning_rate": 1.740237106064383e-08,
422
+ "logits/chosen": -5.0298542976379395,
423
+ "logits/rejected": -5.25708532333374,
424
+ "logps/chosen": -248.6929168701172,
425
+ "logps/rejected": -353.6767272949219,
426
+ "loss": 0.2999,
427
+ "rewards/accuracies": 0.856249988079071,
428
+ "rewards/chosen": 8.549987978767604e-05,
429
+ "rewards/margins": 1.4607607126235962,
430
+ "rewards/rejected": -1.4606752395629883,
431
  "step": 250
432
  },
433
  {
434
+ "epoch": 0.6649616368286445,
435
+ "grad_norm": 36.61434250379136,
436
+ "learning_rate": 1.530385990987863e-08,
437
+ "logits/chosen": -4.881124019622803,
438
+ "logits/rejected": -5.179129600524902,
439
+ "logps/chosen": -254.1592254638672,
440
+ "logps/rejected": -414.9335021972656,
441
+ "loss": 0.2863,
442
+ "rewards/accuracies": 0.925000011920929,
443
+ "rewards/chosen": -0.001106788171455264,
444
+ "rewards/margins": 1.950743317604065,
445
+ "rewards/rejected": -1.9518499374389648,
446
  "step": 260
447
  },
448
  {
449
+ "epoch": 0.690537084398977,
450
+ "grad_norm": 30.68695587447776,
451
+ "learning_rate": 1.3282972478382409e-08,
452
+ "logits/chosen": -5.040741920471191,
453
+ "logits/rejected": -5.266045570373535,
454
+ "logps/chosen": -260.43017578125,
455
+ "logps/rejected": -401.3609924316406,
456
+ "loss": 0.2997,
457
+ "rewards/accuracies": 0.918749988079071,
458
+ "rewards/chosen": 0.006030815653502941,
459
+ "rewards/margins": 1.8693902492523193,
460
+ "rewards/rejected": -1.8633596897125244,
461
  "step": 270
462
  },
463
  {
464
+ "epoch": 0.7161125319693095,
465
+ "grad_norm": 31.65072997235767,
466
+ "learning_rate": 1.1355887245137383e-08,
467
+ "logits/chosen": -4.975480556488037,
468
+ "logits/rejected": -5.191180229187012,
469
+ "logps/chosen": -262.6047668457031,
470
+ "logps/rejected": -436.56243896484375,
471
+ "loss": 0.2763,
472
+ "rewards/accuracies": 0.9125000238418579,
473
+ "rewards/chosen": 0.019021058455109596,
474
+ "rewards/margins": 1.9552510976791382,
475
+ "rewards/rejected": -1.9362300634384155,
476
  "step": 280
477
  },
478
  {
479
+ "epoch": 0.7416879795396419,
480
+ "grad_norm": 44.14600223942157,
481
+ "learning_rate": 9.538031743343628e-09,
482
+ "logits/chosen": -4.857875823974609,
483
+ "logits/rejected": -5.190948009490967,
484
+ "logps/chosen": -269.9219055175781,
485
+ "logps/rejected": -394.4927978515625,
486
+ "loss": 0.281,
487
+ "rewards/accuracies": 0.90625,
488
+ "rewards/chosen": 0.02686493471264839,
489
+ "rewards/margins": 1.815498948097229,
490
+ "rewards/rejected": -1.7886340618133545,
491
  "step": 290
492
  },
493
  {
494
+ "epoch": 0.7672634271099744,
495
+ "grad_norm": 39.91454735081478,
496
+ "learning_rate": 7.843959053281662e-09,
497
+ "logits/chosen": -4.9824910163879395,
498
+ "logits/rejected": -5.185948371887207,
499
+ "logps/chosen": -254.4317626953125,
500
+ "logps/rejected": -409.8275451660156,
501
+ "loss": 0.2972,
502
+ "rewards/accuracies": 0.862500011920929,
503
+ "rewards/chosen": -0.01023593544960022,
504
+ "rewards/margins": 1.928384780883789,
505
+ "rewards/rejected": -1.9386205673217773,
506
  "step": 300
507
  },
508
  {
509
+ "epoch": 0.7672634271099744,
510
+ "eval_logits/chosen": -5.028459072113037,
511
+ "eval_logits/rejected": -5.28243350982666,
512
+ "eval_logps/chosen": -473.4393005371094,
513
+ "eval_logps/rejected": -613.8079833984375,
514
+ "eval_loss": 0.7111232280731201,
515
+ "eval_rewards/accuracies": 0.5625,
516
+ "eval_rewards/chosen": -0.8296481966972351,
517
+ "eval_rewards/margins": 0.12455525994300842,
518
+ "eval_rewards/rejected": -0.9542034864425659,
519
+ "eval_runtime": 98.5847,
520
+ "eval_samples_per_second": 20.287,
521
+ "eval_steps_per_second": 0.325,
522
  "step": 300
523
  },
524
  {
525
+ "epoch": 0.7928388746803069,
526
+ "grad_norm": 28.907759477900694,
527
+ "learning_rate": 6.28723129572247e-09,
528
+ "logits/chosen": -4.889031887054443,
529
+ "logits/rejected": -5.217998504638672,
530
+ "logps/chosen": -259.29278564453125,
531
+ "logps/rejected": -429.45001220703125,
532
+ "loss": 0.2863,
533
+ "rewards/accuracies": 0.925000011920929,
534
+ "rewards/chosen": 0.01846831850707531,
535
+ "rewards/margins": 2.0909886360168457,
536
+ "rewards/rejected": -2.0725202560424805,
537
  "step": 310
538
  },
539
  {
540
+ "epoch": 0.8184143222506394,
541
+ "grad_norm": 35.34285789208096,
542
+ "learning_rate": 4.880311058593617e-09,
543
+ "logits/chosen": -5.006392002105713,
544
+ "logits/rejected": -5.3091888427734375,
545
+ "logps/chosen": -254.5725555419922,
546
+ "logps/rejected": -404.92706298828125,
547
+ "loss": 0.2987,
548
+ "rewards/accuracies": 0.875,
549
+ "rewards/chosen": -0.094082310795784,
550
+ "rewards/margins": 1.8906733989715576,
551
+ "rewards/rejected": -1.9847558736801147,
552
  "step": 320
553
  },
554
  {
555
+ "epoch": 0.8439897698209718,
556
+ "grad_norm": 31.108205637173466,
557
+ "learning_rate": 3.6344616260994942e-09,
558
+ "logits/chosen": -4.935029983520508,
559
+ "logits/rejected": -5.206645965576172,
560
+ "logps/chosen": -279.4919128417969,
561
+ "logps/rejected": -436.8589782714844,
562
+ "loss": 0.2806,
563
+ "rewards/accuracies": 0.8500000238418579,
564
+ "rewards/chosen": -0.11343076080083847,
565
+ "rewards/margins": 1.9281622171401978,
566
+ "rewards/rejected": -2.041593074798584,
567
  "step": 330
568
  },
569
  {
570
+ "epoch": 0.8695652173913043,
571
+ "grad_norm": 37.630889083981245,
572
+ "learning_rate": 2.5596568090246547e-09,
573
+ "logits/chosen": -4.956147193908691,
574
+ "logits/rejected": -5.322269439697266,
575
+ "logps/chosen": -274.47479248046875,
576
+ "logps/rejected": -390.50115966796875,
577
+ "loss": 0.2837,
578
+ "rewards/accuracies": 0.8812500238418579,
579
+ "rewards/chosen": -0.0032872497104108334,
580
+ "rewards/margins": 1.8339039087295532,
581
+ "rewards/rejected": -1.8371912240982056,
582
  "step": 340
583
  },
584
  {
585
+ "epoch": 0.8951406649616368,
586
+ "grad_norm": 33.50567544254766,
587
+ "learning_rate": 1.6645010980854079e-09,
588
+ "logits/chosen": -4.992954254150391,
589
+ "logits/rejected": -5.131080150604248,
590
+ "logps/chosen": -267.6521911621094,
591
+ "logps/rejected": -406.9331359863281,
592
+ "loss": 0.3087,
593
+ "rewards/accuracies": 0.90625,
594
+ "rewards/chosen": -0.12181039899587631,
595
+ "rewards/margins": 1.66982102394104,
596
+ "rewards/rejected": -1.7916314601898193,
597
  "step": 350
598
  },
599
  {
600
+ "epoch": 0.9207161125319693,
601
+ "grad_norm": 36.41873473091959,
602
+ "learning_rate": 9.561607795526006e-10,
603
+ "logits/chosen": -4.958062171936035,
604
+ "logits/rejected": -5.163455963134766,
605
+ "logps/chosen": -269.3998107910156,
606
+ "logps/rejected": -414.72039794921875,
607
+ "loss": 0.2917,
608
+ "rewards/accuracies": 0.8812500238418579,
609
+ "rewards/chosen": -0.07764892280101776,
610
+ "rewards/margins": 1.7949295043945312,
611
+ "rewards/rejected": -1.8725783824920654,
612
  "step": 360
613
  },
614
  {
615
+ "epoch": 0.9462915601023018,
616
+ "grad_norm": 29.734783261980905,
617
+ "learning_rate": 4.403065646083809e-10,
618
+ "logits/chosen": -4.9909467697143555,
619
+ "logits/rejected": -5.146109104156494,
620
+ "logps/chosen": -261.77947998046875,
621
+ "logps/rejected": -422.8021545410156,
622
+ "loss": 0.292,
623
+ "rewards/accuracies": 0.8500000238418579,
624
+ "rewards/chosen": -0.04958271235227585,
625
+ "rewards/margins": 1.8099679946899414,
626
+ "rewards/rejected": -1.8595508337020874,
627
  "step": 370
628
  },
629
  {
630
+ "epoch": 0.9718670076726342,
631
+ "grad_norm": 37.248731518669295,
632
+ "learning_rate": 1.2106819172520434e-10,
633
+ "logits/chosen": -5.114525318145752,
634
+ "logits/rejected": -5.3700361251831055,
635
+ "logps/chosen": -265.19122314453125,
636
+ "logps/rejected": -416.8501892089844,
637
+ "loss": 0.2878,
638
+ "rewards/accuracies": 0.887499988079071,
639
+ "rewards/chosen": -0.0310853011906147,
640
+ "rewards/margins": 1.9321701526641846,
641
+ "rewards/rejected": -1.9632552862167358,
642
  "step": 380
643
  },
644
  {
645
+ "epoch": 0.9974424552429667,
646
+ "grad_norm": 52.48837519556826,
647
+ "learning_rate": 1.0013655036916758e-12,
648
+ "logits/chosen": -5.090936660766602,
649
+ "logits/rejected": -5.337624549865723,
650
+ "logps/chosen": -265.92279052734375,
651
+ "logps/rejected": -421.610595703125,
652
+ "loss": 0.2755,
653
+ "rewards/accuracies": 0.8687499761581421,
654
+ "rewards/chosen": -0.06250263750553131,
655
+ "rewards/margins": 1.948452353477478,
656
+ "rewards/rejected": -2.0109550952911377,
657
  "step": 390
658
  },
659
  {
660
  "epoch": 1.0,
661
+ "step": 391,
662
  "total_flos": 0.0,
663
+ "train_loss": 0.4007269041922391,
664
+ "train_runtime": 6210.4356,
665
+ "train_samples_per_second": 8.051,
666
  "train_steps_per_second": 0.063
667
  }
668
  ],
669
  "logging_steps": 10,
670
+ "max_steps": 391,
671
  "num_input_tokens_seen": 0,
672
  "num_train_epochs": 1,
673
  "save_steps": 100,
training_args.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:cf525e556a4e72ad76dc3263558be495a00b73c02de0b6ea713d4bfeb6a07eb0
3
- size 6456
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0fa7e3e53a21d58800798c272a65a4f1f4bdea1a718fe29f85dc9ce2a41691db
3
+ size 6328