2024-03-09 17:05:59,774 INFO [train.py:1065] (3/4) Training started
2024-03-09 17:05:59,774 INFO [train.py:1075] (3/4) Device: cuda:3
2024-03-09 17:05:59,855 INFO [lexicon.py:168] (3/4) Loading pre-compiled data/lang_char/Linv.pt
2024-03-09 17:05:59,869 INFO [train.py:1086] (3/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '2989b0b1186fa6022932804f5b39fbb2781ebf42', 'k2-git-date': 'Fri Nov 24 11:34:10 2023', 'lhotse-version': '1.22.0.dev+git.d8ed1bbb.dirty', 'torch-version': '1.11.0+cu102', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.9', 'icefall-git-branch': 'dev/mdcc', 'icefall-git-sha1': '8b7ca604-clean', 'icefall-git-date': 'Sat Mar 9 14:09:58 2024', 'icefall-path': '/star-home/jinzengrui/lib/miniconda3/envs/dev39/lib/python3.9/site-packages/icefall-1.0-py3.9.egg', 'k2-path': '/star-home/jinzengrui/lib/miniconda3/envs/dev39/lib/python3.9/site-packages/k2-1.24.4.dev20231207+cuda10.2.torch1.11.0-py3.9-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/star-home/jinzengrui/lib/miniconda3/envs/dev39/lib/python3.9/site-packages/lhotse-1.22.0.dev0+git.d8ed1bbb.dirty-py3.9.egg/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-2-1207150844-f49d8c4f4-c49d5', 'IP address': '10.177.22.19'}, 'world_size': 4, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 50, 'start_epoch': 31, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp'), 'lang_dir': PosixPath('data/lang_char'), 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 1, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 1000, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'blank_id': 0, 'vocab_size': 4852}
2024-03-09 17:05:59,869 INFO [train.py:1088] (3/4) About to create model
2024-03-09 17:06:00,577 INFO [train.py:1092] (3/4) Number of model parameters: 74470867
2024-03-09 17:06:00,578 INFO [checkpoint.py:112] (3/4) Loading checkpoint from zipformer/exp/epoch-30.pt
2024-03-09 17:06:07,828 INFO [train.py:1107] (3/4) Using DDP
2024-03-09 17:06:08,429 INFO [train.py:1119] (3/4) Loading optimizer state dict
2024-03-09 17:06:09,483 INFO [train.py:1127] (3/4) Loading scheduler state dict
2024-03-09 17:06:09,484 INFO [asr_datamodule.py:368] (3/4) About to get train cuts
2024-03-09 17:06:09,530 INFO [asr_datamodule.py:376] (3/4) About to get valid cuts
2024-03-09 17:06:09,532 INFO [asr_datamodule.py:195] (3/4) About to get Musan cuts
2024-03-09 17:06:11,951 INFO [asr_datamodule.py:200] (3/4) Enable MUSAN
2024-03-09 17:06:11,951 INFO [asr_datamodule.py:223] (3/4) Enable SpecAugment
2024-03-09 17:06:11,951 INFO [asr_datamodule.py:224] (3/4) Time warp factor: 80
2024-03-09 17:06:11,952 INFO [asr_datamodule.py:234] (3/4) Num frame mask: 10
2024-03-09 17:06:11,952 INFO [asr_datamodule.py:247] (3/4) About to create train dataset
2024-03-09 17:06:11,952 INFO [asr_datamodule.py:273] (3/4) Using DynamicBucketingSampler.
2024-03-09 17:06:12,773 INFO [asr_datamodule.py:290] (3/4) About to create train dataloader
2024-03-09 17:06:12,773 INFO [asr_datamodule.py:315] (3/4) About to create dev dataset
2024-03-09 17:06:13,100 INFO [asr_datamodule.py:332] (3/4) About to create dev dataloader
2024-03-09 17:06:13,100 INFO [train.py:1205] (3/4) Loading grad scaler state dict
2024-03-09 17:06:53,813 INFO [train.py:997] (3/4) Epoch 31, batch 0, loss[loss=0.1304, simple_loss=0.2259, pruned_loss=0.0175, over 22859.00 frames. ], tot_loss[loss=0.1304, simple_loss=0.2259, pruned_loss=0.0175, over 22859.00 frames. ], batch size: 608, lr: 1.41e-02, grad_scale: 64.0
2024-03-09 17:06:53,813 INFO [train.py:1020] (3/4) Computing validation loss
2024-03-09 17:07:03,243 INFO [train.py:1029] (3/4) Epoch 31, validation: loss=0.2089, simple_loss=0.3019, pruned_loss=0.05794, over 452978.00 frames. 
2024-03-09 17:07:03,244 INFO [train.py:1030] (3/4) Maximum memory allocated so far is 24707MB
2024-03-09 17:07:04,460 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.87 vs. limit=15.0
2024-03-09 17:07:07,758 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.33 vs. limit=15.0
2024-03-09 17:07:17,162 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.03 vs. limit=15.0
2024-03-09 17:07:52,777 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=31800.0, ans=0.125
2024-03-09 17:07:54,416 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=31800.0, ans=0.05
2024-03-09 17:07:58,931 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=31800.0, ans=0.125
2024-03-09 17:08:06,679 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=31866.666666666668, ans=0.125
2024-03-09 17:08:12,843 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=31866.666666666668, ans=0.125
2024-03-09 17:08:17,439 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=31866.666666666668, ans=0.125
2024-03-09 17:08:21,814 INFO [train.py:997] (3/4) Epoch 31, batch 50, loss[loss=0.1526, simple_loss=0.2376, pruned_loss=0.03384, over 23894.00 frames. ], tot_loss[loss=0.1462, simple_loss=0.2352, pruned_loss=0.02855, over 1067497.66 frames. ], batch size: 153, lr: 1.41e-02, grad_scale: 64.0
2024-03-09 17:08:22,149 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=31933.333333333332, ans=0.1
2024-03-09 17:08:22,296 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=31933.333333333332, ans=0.125
2024-03-09 17:08:25,128 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=31933.333333333332, ans=0.035
2024-03-09 17:08:49,153 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.87 vs. limit=12.0
2024-03-09 17:08:54,736 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.909e+01 7.298e+01 7.941e+01 8.893e+01 1.039e+02, threshold=1.588e+02, percent-clipped=0.0
2024-03-09 17:09:14,617 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=32066.666666666668, ans=0.125
2024-03-09 17:09:20,494 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=32133.333333333332, ans=0.125
2024-03-09 17:09:22,097 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=32133.333333333332, ans=0.2
2024-03-09 17:09:35,993 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=32200.0, ans=0.0038695652173913048
2024-03-09 17:09:48,298 INFO [train.py:997] (3/4) Epoch 31, batch 100, loss[loss=0.1465, simple_loss=0.2404, pruned_loss=0.02631, over 24124.00 frames. ], tot_loss[loss=0.1466, simple_loss=0.2362, pruned_loss=0.02852, over 1889551.81 frames. ], batch size: 326, lr: 1.40e-02, grad_scale: 64.0
2024-03-09 17:10:07,962 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.69 vs. limit=15.0
2024-03-09 17:10:57,040 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=32533.333333333332, ans=0.003797101449275363
2024-03-09 17:11:07,523 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=32600.0, ans=0.95
2024-03-09 17:11:08,727 INFO [train.py:997] (3/4) Epoch 31, batch 150, loss[loss=0.144, simple_loss=0.2385, pruned_loss=0.02476, over 24218.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.236, pruned_loss=0.0283, over 2512128.61 frames. ], batch size: 295, lr: 1.40e-02, grad_scale: 64.0
2024-03-09 17:12:00,725 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=32653.333333333332, ans=0.125
2024-03-09 17:12:06,856 INFO [train.py:997] (3/4) Epoch 32, batch 0, loss[loss=0.1392, simple_loss=0.2259, pruned_loss=0.02618, over 24189.00 frames. ], tot_loss[loss=0.1392, simple_loss=0.2259, pruned_loss=0.02618, over 24189.00 frames. ], batch size: 188, lr: 1.38e-02, grad_scale: 64.0
2024-03-09 17:12:06,857 INFO [train.py:1020] (3/4) Computing validation loss
2024-03-09 17:12:16,508 INFO [train.py:1029] (3/4) Epoch 32, validation: loss=0.2101, simple_loss=0.3027, pruned_loss=0.0588, over 452978.00 frames. 
2024-03-09 17:12:16,509 INFO [train.py:1030] (3/4) Maximum memory allocated so far is 27673MB
2024-03-09 17:12:18,468 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=32653.333333333332, ans=0.125
2024-03-09 17:12:20,032 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=32653.333333333332, ans=0.125
2024-03-09 17:12:32,055 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.203e+01 7.071e+01 7.685e+01 8.593e+01 1.169e+02, threshold=1.537e+02, percent-clipped=0.0
2024-03-09 17:12:57,882 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=32786.666666666664, ans=0.0
2024-03-09 17:13:04,085 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=32853.333333333336, ans=0.125
2024-03-09 17:13:04,120 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=32853.333333333336, ans=0.125
2024-03-09 17:13:27,364 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=32920.0, ans=0.1
2024-03-09 17:13:30,334 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=32920.0, ans=0.2
2024-03-09 17:13:34,611 INFO [train.py:997] (3/4) Epoch 32, batch 50, loss[loss=0.1395, simple_loss=0.2327, pruned_loss=0.02316, over 24223.00 frames. ], tot_loss[loss=0.1444, simple_loss=0.2332, pruned_loss=0.02779, over 1062126.86 frames. ], batch size: 295, lr: 1.38e-02, grad_scale: 64.0
2024-03-09 17:14:03,108 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=33053.333333333336, ans=0.02
2024-03-09 17:14:19,668 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=33120.0, ans=0.003669565217391305
2024-03-09 17:14:28,865 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=33186.666666666664, ans=0.125
2024-03-09 17:14:59,393 INFO [train.py:997] (3/4) Epoch 32, batch 100, loss[loss=0.1445, simple_loss=0.228, pruned_loss=0.03048, over 23594.00 frames. ], tot_loss[loss=0.1437, simple_loss=0.2328, pruned_loss=0.02729, over 1881462.97 frames. ], batch size: 128, lr: 1.37e-02, grad_scale: 64.0
2024-03-09 17:15:15,496 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.885e+01 7.174e+01 7.568e+01 8.159e+01 1.038e+02, threshold=1.514e+02, percent-clipped=0.0
2024-03-09 17:15:16,785 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.84 vs. limit=15.0
2024-03-09 17:15:24,306 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.99 vs. limit=10.0
2024-03-09 17:15:29,949 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0
2024-03-09 17:15:32,381 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=33453.333333333336, ans=0.125
2024-03-09 17:15:38,638 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=33453.333333333336, ans=0.125
2024-03-09 17:15:44,194 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=33453.333333333336, ans=0.125
2024-03-09 17:15:55,127 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=33520.0, ans=0.125
2024-03-09 17:16:19,711 INFO [train.py:997] (3/4) Epoch 32, batch 150, loss[loss=0.1458, simple_loss=0.236, pruned_loss=0.02785, over 24173.00 frames. ], tot_loss[loss=0.1437, simple_loss=0.2326, pruned_loss=0.02735, over 2518627.96 frames. ], batch size: 217, lr: 1.37e-02, grad_scale: 64.0
2024-03-09 17:17:08,589 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=33706.666666666664, ans=0.1
2024-03-09 17:17:14,939 INFO [train.py:997] (3/4) Epoch 33, batch 0, loss[loss=0.1338, simple_loss=0.2188, pruned_loss=0.02447, over 24247.00 frames. ], tot_loss[loss=0.1338, simple_loss=0.2188, pruned_loss=0.02447, over 24247.00 frames. ], batch size: 217, lr: 1.35e-02, grad_scale: 64.0
2024-03-09 17:17:14,940 INFO [train.py:1020] (3/4) Computing validation loss
2024-03-09 17:17:24,826 INFO [train.py:1029] (3/4) Epoch 33, validation: loss=0.2104, simple_loss=0.3043, pruned_loss=0.05821, over 452978.00 frames. 
2024-03-09 17:17:24,826 INFO [train.py:1030] (3/4) Maximum memory allocated so far is 27673MB
2024-03-09 17:17:30,083 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=33706.666666666664, ans=0.2
2024-03-09 17:17:51,680 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=33773.333333333336, ans=0.125
2024-03-09 17:17:53,406 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=33773.333333333336, ans=0.125
2024-03-09 17:18:02,914 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=33840.0, ans=0.125
2024-03-09 17:18:15,959 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.76 vs. limit=15.0
2024-03-09 17:18:27,894 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=33973.333333333336, ans=10.0
2024-03-09 17:18:29,370 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=33973.333333333336, ans=0.125
2024-03-09 17:18:40,357 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=33973.333333333336, ans=0.1
2024-03-09 17:18:42,009 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=34040.0, ans=0.0
2024-03-09 17:18:43,168 INFO [train.py:997] (3/4) Epoch 33, batch 50, loss[loss=0.1443, simple_loss=0.2367, pruned_loss=0.02593, over 24153.00 frames. ], tot_loss[loss=0.1413, simple_loss=0.2307, pruned_loss=0.02598, over 1070736.99 frames. ], batch size: 345, lr: 1.35e-02, grad_scale: 64.0
2024-03-09 17:18:45,940 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.81 vs. limit=6.0
2024-03-09 17:18:46,184 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.045e+01 7.058e+01 7.697e+01 8.414e+01 1.529e+02, threshold=1.539e+02, percent-clipped=1.0
2024-03-09 17:18:49,642 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=34040.0, ans=0.125
2024-03-09 17:19:23,995 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=34173.333333333336, ans=0.125
2024-03-09 17:19:28,716 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=34173.333333333336, ans=0.125
2024-03-09 17:19:53,913 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.61 vs. limit=15.0
2024-03-09 17:20:08,479 INFO [train.py:997] (3/4) Epoch 33, batch 100, loss[loss=0.14, simple_loss=0.2332, pruned_loss=0.0234, over 24264.00 frames. ], tot_loss[loss=0.1423, simple_loss=0.2313, pruned_loss=0.02668, over 1886315.86 frames. ], batch size: 267, lr: 1.35e-02, grad_scale: 64.0
2024-03-09 17:20:19,576 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=34373.333333333336, ans=0.125
2024-03-09 17:20:21,030 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=34373.333333333336, ans=0.125
2024-03-09 17:20:22,653 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=34440.0, ans=0.1
2024-03-09 17:20:27,278 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=34440.0, ans=0.0033826086956521735
2024-03-09 17:20:59,234 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=34573.333333333336, ans=0.003353623188405797
2024-03-09 17:21:28,186 INFO [train.py:997] (3/4) Epoch 33, batch 150, loss[loss=0.1482, simple_loss=0.2403, pruned_loss=0.02803, over 24108.00 frames. ], tot_loss[loss=0.1436, simple_loss=0.2334, pruned_loss=0.02687, over 2524659.63 frames. ], batch size: 366, lr: 1.34e-02, grad_scale: 64.0
2024-03-09 17:21:31,130 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.628e+01 7.574e+01 8.231e+01 9.009e+01 1.365e+02, threshold=1.646e+02, percent-clipped=0.0
2024-03-09 17:21:37,145 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=34706.666666666664, ans=0.0033246376811594206
2024-03-09 17:22:22,790 INFO [train.py:997] (3/4) Epoch 34, batch 0, loss[loss=0.1447, simple_loss=0.2278, pruned_loss=0.03075, over 24317.00 frames. ], tot_loss[loss=0.1447, simple_loss=0.2278, pruned_loss=0.03075, over 24317.00 frames. ], batch size: 208, lr: 1.32e-02, grad_scale: 64.0
2024-03-09 17:22:22,791 INFO [train.py:1020] (3/4) Computing validation loss
2024-03-09 17:22:32,283 INFO [train.py:1029] (3/4) Epoch 34, validation: loss=0.2117, simple_loss=0.3053, pruned_loss=0.0591, over 452978.00 frames. 
2024-03-09 17:22:32,284 INFO [train.py:1030] (3/4) Maximum memory allocated so far is 27673MB
2024-03-09 17:22:37,367 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=34760.0, ans=0.125
2024-03-09 17:22:37,470 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=34760.0, ans=0.0
2024-03-09 17:23:03,789 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=34893.333333333336, ans=0.0
2024-03-09 17:23:23,858 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=34960.0, ans=0.125
2024-03-09 17:23:31,654 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=34960.0, ans=0.003269565217391304
2024-03-09 17:23:49,725 INFO [train.py:997] (3/4) Epoch 34, batch 50, loss[loss=0.1437, simple_loss=0.2395, pruned_loss=0.02396, over 23916.00 frames. ], tot_loss[loss=0.1412, simple_loss=0.2305, pruned_loss=0.02589, over 1068620.07 frames. ], batch size: 387, lr: 1.32e-02, grad_scale: 128.0
2024-03-09 17:24:15,972 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=35160.0, ans=0.0
2024-03-09 17:24:17,515 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=35160.0, ans=0.003226086956521739
2024-03-09 17:24:21,400 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0
2024-03-09 17:24:26,819 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=35226.666666666664, ans=0.09899494936611666
2024-03-09 17:24:31,974 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.96 vs. limit=22.5
2024-03-09 17:24:37,428 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=35226.666666666664, ans=0.125
2024-03-09 17:24:46,675 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=35293.333333333336, ans=0.1
2024-03-09 17:24:54,478 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=35293.333333333336, ans=0.0
2024-03-09 17:25:04,712 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.790e+01 6.987e+01 7.379e+01 8.041e+01 1.553e+02, threshold=1.476e+02, percent-clipped=0.0
2024-03-09 17:25:13,959 INFO [train.py:997] (3/4) Epoch 34, batch 100, loss[loss=0.1508, simple_loss=0.2387, pruned_loss=0.03142, over 24214.00 frames. ], tot_loss[loss=0.1421, simple_loss=0.2317, pruned_loss=0.02628, over 1886370.96 frames. ], batch size: 198, lr: 1.32e-02, grad_scale: 128.0
2024-03-09 17:25:31,143 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=35493.333333333336, ans=0.125
2024-03-09 17:25:34,292 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=35493.333333333336, ans=0.125
2024-03-09 17:25:44,874 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=35560.0, ans=0.1
2024-03-09 17:26:24,197 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=35693.333333333336, ans=0.125
2024-03-09 17:26:32,970 INFO [train.py:997] (3/4) Epoch 34, batch 150, loss[loss=0.1418, simple_loss=0.2355, pruned_loss=0.024, over 24099.00 frames. ], tot_loss[loss=0.1434, simple_loss=0.2326, pruned_loss=0.02708, over 2530717.88 frames. ], batch size: 345, lr: 1.32e-02, grad_scale: 128.0
2024-03-09 17:27:26,418 INFO [train.py:997] (3/4) Epoch 35, batch 0, loss[loss=0.1572, simple_loss=0.2507, pruned_loss=0.03183, over 23655.00 frames. ], tot_loss[loss=0.1572, simple_loss=0.2507, pruned_loss=0.03183, over 23655.00 frames. ], batch size: 485, lr: 1.30e-02, grad_scale: 128.0
2024-03-09 17:27:26,418 INFO [train.py:1020] (3/4) Computing validation loss
2024-03-09 17:27:38,585 INFO [train.py:1029] (3/4) Epoch 35, validation: loss=0.2098, simple_loss=0.3027, pruned_loss=0.05849, over 452978.00 frames. 
2024-03-09 17:27:38,586 INFO [train.py:1030] (3/4) Maximum memory allocated so far is 27673MB
2024-03-09 17:27:56,501 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.71 vs. limit=15.0
2024-03-09 17:28:08,890 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.73 vs. limit=15.0
2024-03-09 17:28:15,937 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=35946.666666666664, ans=0.2
2024-03-09 17:28:16,328 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.37 vs. limit=15.0
2024-03-09 17:28:25,599 INFO [scaling.py:1119] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-03-09 17:28:30,315 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=36013.333333333336, ans=0.5
2024-03-09 17:28:34,546 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.276e+01 7.140e+01 7.953e+01 8.912e+01 1.249e+02, threshold=1.591e+02, percent-clipped=0.0
2024-03-09 17:28:58,519 INFO [train.py:997] (3/4) Epoch 35, batch 50, loss[loss=0.1423, simple_loss=0.2305, pruned_loss=0.02702, over 24238.00 frames. ], tot_loss[loss=0.1403, simple_loss=0.2286, pruned_loss=0.02598, over 1073848.84 frames. ], batch size: 241, lr: 1.30e-02, grad_scale: 128.0
2024-03-09 17:28:58,774 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=36146.666666666664, ans=0.125
2024-03-09 17:29:24,711 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=36213.333333333336, ans=0.0
2024-03-09 17:29:26,246 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=36213.333333333336, ans=0.125
2024-03-09 17:29:48,176 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=36346.666666666664, ans=0.125
2024-03-09 17:30:05,057 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=36413.333333333336, ans=0.002953623188405796
2024-03-09 17:30:15,679 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=36413.333333333336, ans=0.002953623188405796
2024-03-09 17:30:18,437 INFO [train.py:997] (3/4) Epoch 35, batch 100, loss[loss=0.1327, simple_loss=0.219, pruned_loss=0.02318, over 23649.00 frames. ], tot_loss[loss=0.1402, simple_loss=0.2289, pruned_loss=0.02575, over 1901305.08 frames. ], batch size: 128, lr: 1.29e-02, grad_scale: 128.0
2024-03-09 17:30:36,909 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=36546.666666666664, ans=0.035
2024-03-09 17:30:37,534 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.78 vs. limit=15.0
2024-03-09 17:30:49,824 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.86 vs. limit=6.0
2024-03-09 17:31:05,526 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=36680.0, ans=0.125
2024-03-09 17:31:18,165 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.800e+01 7.204e+01 7.789e+01 8.601e+01 1.817e+02, threshold=1.558e+02, percent-clipped=1.0
2024-03-09 17:31:38,613 INFO [train.py:997] (3/4) Epoch 35, batch 150, loss[loss=0.1292, simple_loss=0.2144, pruned_loss=0.02196, over 19900.00 frames. ], tot_loss[loss=0.1413, simple_loss=0.2305, pruned_loss=0.02599, over 2517737.96 frames. ], batch size: 60, lr: 1.29e-02, grad_scale: 64.0
2024-03-09 17:32:32,816 INFO [train.py:997] (3/4) Epoch 36, batch 0, loss[loss=0.1595, simple_loss=0.2567, pruned_loss=0.03116, over 23607.00 frames. ], tot_loss[loss=0.1595, simple_loss=0.2567, pruned_loss=0.03116, over 23607.00 frames. ], batch size: 486, lr: 1.27e-02, grad_scale: 64.0
2024-03-09 17:32:32,817 INFO [train.py:1020] (3/4) Computing validation loss
2024-03-09 17:32:40,710 INFO [zipformer.py:1858] (3/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([0.9559, 2.3706, 2.5634, 2.5911], device='cuda:3')
2024-03-09 17:32:42,864 INFO [train.py:1029] (3/4) Epoch 36, validation: loss=0.212, simple_loss=0.307, pruned_loss=0.05847, over 452978.00 frames. 
2024-03-09 17:32:42,864 INFO [train.py:1030] (3/4) Maximum memory allocated so far is 27673MB
2024-03-09 17:32:52,454 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=36866.666666666664, ans=0.125
2024-03-09 17:32:54,046 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=36866.666666666664, ans=0.002855072463768117
2024-03-09 17:33:03,350 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=36933.333333333336, ans=0.2
2024-03-09 17:33:21,719 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=37000.0, ans=0.1
2024-03-09 17:33:30,096 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.01 vs. limit=10.0
2024-03-09 17:33:35,682 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=37066.666666666664, ans=0.125
2024-03-09 17:34:10,525 INFO [train.py:997] (3/4) Epoch 36, batch 50, loss[loss=0.1393, simple_loss=0.2257, pruned_loss=0.02643, over 24229.00 frames. ], tot_loss[loss=0.141, simple_loss=0.2306, pruned_loss=0.02566, over 1083429.72 frames. ], batch size: 229, lr: 1.27e-02, grad_scale: 64.0
2024-03-09 17:34:17,641 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.91 vs. limit=6.0
2024-03-09 17:34:20,042 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=37200.0, ans=0.0
2024-03-09 17:34:23,142 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=37200.0, ans=0.0027826086956521745
2024-03-09 17:34:30,771 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=37266.666666666664, ans=0.125
2024-03-09 17:34:32,424 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=37266.666666666664, ans=0.1
2024-03-09 17:34:37,822 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.18 vs. limit=15.0
2024-03-09 17:34:42,119 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=37333.333333333336, ans=0.125
2024-03-09 17:34:55,597 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.060e+01 6.975e+01 7.752e+01 8.346e+01 1.468e+02, threshold=1.550e+02, percent-clipped=0.0
2024-03-09 17:35:13,888 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.46 vs. limit=15.0
2024-03-09 17:35:16,339 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=37466.666666666664, ans=0.125
2024-03-09 17:35:16,439 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=37466.666666666664, ans=0.1
2024-03-09 17:35:19,597 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=37466.666666666664, ans=0.0
2024-03-09 17:35:28,455 INFO [train.py:997] (3/4) Epoch 36, batch 100, loss[loss=0.1411, simple_loss=0.2352, pruned_loss=0.02345, over 24173.00 frames. ], tot_loss[loss=0.1414, simple_loss=0.2317, pruned_loss=0.02551, over 1901434.56 frames. ], batch size: 327, lr: 1.27e-02, grad_scale: 64.0
2024-03-09 17:35:41,988 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.80 vs. limit=22.5
2024-03-09 17:35:49,914 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.40 vs. limit=15.0
2024-03-09 17:35:58,420 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=37600.0, ans=0.125
2024-03-09 17:36:17,688 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=37733.333333333336, ans=0.0
2024-03-09 17:36:20,160 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=37733.333333333336, ans=0.0
2024-03-09 17:36:34,088 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=37800.0, ans=0.1
2024-03-09 17:36:36,452 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.70 vs. limit=22.5
2024-03-09 17:36:40,133 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=37800.0, ans=0.2
2024-03-09 17:36:46,685 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=37800.0, ans=0.5
2024-03-09 17:36:50,960 INFO [train.py:997] (3/4) Epoch 36, batch 150, loss[loss=0.154, simple_loss=0.2485, pruned_loss=0.02981, over 23956.00 frames. ], tot_loss[loss=0.1407, simple_loss=0.2314, pruned_loss=0.025, over 2528549.44 frames. ], batch size: 416, lr: 1.27e-02, grad_scale: 64.0
2024-03-09 17:37:46,094 INFO [train.py:997] (3/4) Epoch 37, batch 0, loss[loss=0.1261, simple_loss=0.2107, pruned_loss=0.02078, over 23768.00 frames. ], tot_loss[loss=0.1261, simple_loss=0.2107, pruned_loss=0.02078, over 23768.00 frames. ], batch size: 117, lr: 1.25e-02, grad_scale: 64.0
2024-03-09 17:37:46,095 INFO [train.py:1020] (3/4) Computing validation loss
2024-03-09 17:37:55,593 INFO [train.py:1029] (3/4) Epoch 37, validation: loss=0.2112, simple_loss=0.3044, pruned_loss=0.05893, over 452978.00 frames. 
2024-03-09 17:37:55,594 INFO [train.py:1030] (3/4) Maximum memory allocated so far is 27673MB
2024-03-09 17:37:58,206 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.39 vs. limit=22.5
2024-03-09 17:38:02,052 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=37920.0, ans=0.125
2024-03-09 17:38:21,861 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=37986.666666666664, ans=0.0026115942028985505
2024-03-09 17:38:25,038 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=37986.666666666664, ans=0.0
2024-03-09 17:38:30,964 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.112e+01 7.137e+01 7.682e+01 8.524e+01 1.300e+02, threshold=1.536e+02, percent-clipped=0.0
2024-03-09 17:38:35,949 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=38053.333333333336, ans=0.0
2024-03-09 17:39:13,926 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=38186.666666666664, ans=0.125
2024-03-09 17:39:18,716 INFO [scaling.py:1119] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-03-09 17:39:20,073 INFO [train.py:997] (3/4) Epoch 37, batch 50, loss[loss=0.1372, simple_loss=0.2284, pruned_loss=0.02299, over 24264.00 frames. ], tot_loss[loss=0.1389, simple_loss=0.2299, pruned_loss=0.02396, over 1063994.42 frames. ], batch size: 254, lr: 1.25e-02, grad_scale: 64.0
2024-03-09 17:39:25,531 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.23 vs. limit=15.0
2024-03-09 17:39:27,193 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.11 vs. limit=12.0
2024-03-09 17:39:40,538 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=38320.0, ans=0.125
2024-03-09 17:39:42,034 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=38320.0, ans=0.125
2024-03-09 17:39:54,804 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=38386.666666666664, ans=0.95
2024-03-09 17:40:03,083 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.73 vs. limit=15.0
2024-03-09 17:40:16,218 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=38453.333333333336, ans=0.2
2024-03-09 17:40:34,755 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=38520.0, ans=0.125
2024-03-09 17:40:40,615 INFO [train.py:997] (3/4) Epoch 37, batch 100, loss[loss=0.1576, simple_loss=0.2546, pruned_loss=0.03027, over 23809.00 frames. ], tot_loss[loss=0.1395, simple_loss=0.2304, pruned_loss=0.02426, over 1881976.31 frames. ], batch size: 447, lr: 1.25e-02, grad_scale: 64.0
2024-03-09 17:41:13,701 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.59 vs. limit=15.0
2024-03-09 17:41:15,923 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.868e+01 6.991e+01 7.571e+01 8.226e+01 1.121e+02, threshold=1.514e+02, percent-clipped=0.0
2024-03-09 17:42:00,691 INFO [train.py:997] (3/4) Epoch 37, batch 150, loss[loss=0.1369, simple_loss=0.231, pruned_loss=0.02144, over 24177.00 frames. ], tot_loss[loss=0.1398, simple_loss=0.2302, pruned_loss=0.02471, over 2517978.69 frames. ], batch size: 345, lr: 1.24e-02, grad_scale: 64.0
2024-03-09 17:42:52,945 INFO [train.py:997] (3/4) Epoch 38, batch 0, loss[loss=0.14, simple_loss=0.2301, pruned_loss=0.02497, over 24294.00 frames. ], tot_loss[loss=0.14, simple_loss=0.2301, pruned_loss=0.02497, over 24294.00 frames. ], batch size: 281, lr: 1.23e-02, grad_scale: 64.0
2024-03-09 17:42:52,946 INFO [train.py:1020] (3/4) Computing validation loss
2024-03-09 17:43:02,281 INFO [train.py:1029] (3/4) Epoch 38, validation: loss=0.2136, simple_loss=0.3079, pruned_loss=0.05959, over 452978.00 frames. 
2024-03-09 17:43:02,281 INFO [train.py:1030] (3/4) Maximum memory allocated so far is 27673MB
2024-03-09 17:43:13,153 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=38973.333333333336, ans=0.125
2024-03-09 17:43:17,912 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=38973.333333333336, ans=0.0
2024-03-09 17:43:21,031 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=39040.0, ans=0.125
2024-03-09 17:43:21,986 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.71 vs. limit=15.0
2024-03-09 17:43:39,300 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=39106.666666666664, ans=0.125
2024-03-09 17:43:43,977 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=39106.666666666664, ans=0.2
2024-03-09 17:44:27,811 INFO [train.py:997] (3/4) Epoch 38, batch 50, loss[loss=0.1516, simple_loss=0.2365, pruned_loss=0.03341, over 24172.00 frames. ], tot_loss[loss=0.1376, simple_loss=0.2267, pruned_loss=0.02422, over 1065371.03 frames. ], batch size: 217, lr: 1.22e-02, grad_scale: 64.0
2024-03-09 17:44:31,341 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=39306.666666666664, ans=0.0023246376811594206
2024-03-09 17:44:46,022 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.38 vs. limit=15.0
2024-03-09 17:44:48,016 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.028e+01 7.170e+01 7.896e+01 8.779e+01 1.113e+02, threshold=1.579e+02, percent-clipped=0.0
2024-03-09 17:44:48,302 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=39373.333333333336, ans=0.125
2024-03-09 17:44:49,809 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=39373.333333333336, ans=0.1
2024-03-09 17:45:10,429 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.88 vs. limit=15.0
2024-03-09 17:45:18,573 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=39506.666666666664, ans=0.04949747468305833
2024-03-09 17:45:34,336 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=39573.333333333336, ans=0.125
2024-03-09 17:45:46,175 INFO [train.py:997] (3/4) Epoch 38, batch 100, loss[loss=0.1162, simple_loss=0.2123, pruned_loss=0.01007, over 21384.00 frames. ], tot_loss[loss=0.1405, simple_loss=0.2295, pruned_loss=0.02576, over 1882146.18 frames. ], batch size: 718, lr: 1.22e-02, grad_scale: 64.0
2024-03-09 17:46:30,015 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.67 vs. limit=22.5
2024-03-09 17:46:43,686 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.66 vs. limit=15.0
2024-03-09 17:46:47,730 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=39840.0, ans=0.0
2024-03-09 17:47:07,594 INFO [train.py:997] (3/4) Epoch 38, batch 150, loss[loss=0.1462, simple_loss=0.2428, pruned_loss=0.0248, over 23946.00 frames. ], tot_loss[loss=0.1399, simple_loss=0.2299, pruned_loss=0.02494, over 2501198.54 frames. ], batch size: 387, lr: 1.22e-02, grad_scale: 64.0
2024-03-09 17:48:03,475 INFO [train.py:997] (3/4) Epoch 39, batch 0, loss[loss=0.1398, simple_loss=0.2286, pruned_loss=0.02548, over 24199.00 frames. ], tot_loss[loss=0.1398, simple_loss=0.2286, pruned_loss=0.02548, over 24199.00 frames. ], batch size: 217, lr: 1.20e-02, grad_scale: 64.0
2024-03-09 17:48:03,476 INFO [train.py:1020] (3/4) Computing validation loss
2024-03-09 17:48:11,687 INFO [zipformer.py:1858] (3/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.7097, 5.2407, 5.6102, 5.2988], device='cuda:3')
2024-03-09 17:48:12,746 INFO [train.py:1029] (3/4) Epoch 39, validation: loss=0.2141, simple_loss=0.3082, pruned_loss=0.06004, over 452978.00 frames. 
2024-03-09 17:48:12,746 INFO [train.py:1030] (3/4) Maximum memory allocated so far is 27673MB
2024-03-09 17:48:26,644 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.993e+01 6.884e+01 7.356e+01 8.157e+01 1.068e+02, threshold=1.471e+02, percent-clipped=0.0
2024-03-09 17:48:42,213 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=40093.333333333336, ans=0.125
2024-03-09 17:49:03,034 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.78 vs. limit=6.0
2024-03-09 17:49:08,258 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=40226.666666666664, ans=0.1
2024-03-09 17:49:09,884 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=40226.666666666664, ans=0.0
2024-03-09 17:49:29,147 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.46 vs. limit=15.0
2024-03-09 17:49:41,667 INFO [train.py:997] (3/4) Epoch 39, batch 50, loss[loss=0.1371, simple_loss=0.2304, pruned_loss=0.02189, over 24247.00 frames. ], tot_loss[loss=0.1366, simple_loss=0.2274, pruned_loss=0.02289, over 1077777.57 frames. ], batch size: 281, lr: 1.20e-02, grad_scale: 64.0
2024-03-09 17:49:54,246 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=40360.0, ans=0.125
2024-03-09 17:50:14,995 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=12.0
2024-03-09 17:50:20,209 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=40493.333333333336, ans=0.035
2024-03-09 17:50:22,754 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.58 vs. limit=15.0
2024-03-09 17:50:26,893 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.23 vs. limit=22.5
2024-03-09 17:50:35,465 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=40560.0, ans=0.125
2024-03-09 17:50:59,997 INFO [train.py:997] (3/4) Epoch 39, batch 100, loss[loss=0.1445, simple_loss=0.2396, pruned_loss=0.02469, over 24066.00 frames. ], tot_loss[loss=0.1401, simple_loss=0.2304, pruned_loss=0.02493, over 1883949.65 frames. ], batch size: 365, lr: 1.20e-02, grad_scale: 64.0
2024-03-09 17:51:00,249 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=40693.333333333336, ans=0.125
2024-03-09 17:51:08,194 INFO [scaling.py:1119] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-03-09 17:51:09,406 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.940e+01 6.841e+01 7.461e+01 8.103e+01 1.250e+02, threshold=1.492e+02, percent-clipped=0.0
2024-03-09 17:51:20,281 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=40760.0, ans=0.0
2024-03-09 17:51:27,867 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=40760.0, ans=0.0020086956521739134
2024-03-09 17:52:18,380 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=40960.0, ans=0.2
2024-03-09 17:52:21,037 INFO [train.py:997] (3/4) Epoch 39, batch 150, loss[loss=0.1416, simple_loss=0.2332, pruned_loss=0.02503, over 24144.00 frames. ], tot_loss[loss=0.1388, simple_loss=0.2294, pruned_loss=0.02411, over 2518309.01 frames. ], batch size: 345, lr: 1.20e-02, grad_scale: 64.0
2024-03-09 17:52:27,404 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=41026.666666666664, ans=0.0
2024-03-09 17:53:16,194 INFO [train.py:997] (3/4) Epoch 40, batch 0, loss[loss=0.1332, simple_loss=0.2279, pruned_loss=0.01927, over 24126.00 frames. ], tot_loss[loss=0.1332, simple_loss=0.2279, pruned_loss=0.01927, over 24126.00 frames. ], batch size: 366, lr: 1.18e-02, grad_scale: 64.0
2024-03-09 17:53:16,194 INFO [train.py:1020] (3/4) Computing validation loss
2024-03-09 17:53:25,708 INFO [train.py:1029] (3/4) Epoch 40, validation: loss=0.2148, simple_loss=0.3085, pruned_loss=0.06058, over 452978.00 frames. 
2024-03-09 17:53:25,709 INFO [train.py:1030] (3/4) Maximum memory allocated so far is 27673MB
2024-03-09 17:53:59,429 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.98 vs. limit=10.0
2024-03-09 17:54:05,976 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=41213.333333333336, ans=0.125
2024-03-09 17:54:07,426 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=41213.333333333336, ans=0.1
2024-03-09 17:54:12,332 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=41213.333333333336, ans=0.125
2024-03-09 17:54:20,635 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.20 vs. limit=15.0
2024-03-09 17:54:47,012 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.979e+01 7.013e+01 7.603e+01 8.055e+01 1.247e+02, threshold=1.521e+02, percent-clipped=0.0
2024-03-09 17:54:51,547 INFO [train.py:997] (3/4) Epoch 40, batch 50, loss[loss=0.1389, simple_loss=0.2284, pruned_loss=0.02473, over 24190.00 frames. ], tot_loss[loss=0.1375, simple_loss=0.2284, pruned_loss=0.02328, over 1068897.50 frames. ], batch size: 280, lr: 1.18e-02, grad_scale: 64.0
2024-03-09 17:54:58,058 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=41413.333333333336, ans=0.04949747468305833
2024-03-09 17:55:10,278 INFO [scaling.py:1119] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-03-09 17:55:31,194 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.17 vs. limit=15.0
2024-03-09 17:55:34,853 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=41546.666666666664, ans=0.025
2024-03-09 17:55:58,723 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.44 vs. limit=15.0
2024-03-09 17:56:11,517 INFO [train.py:997] (3/4) Epoch 40, batch 100, loss[loss=0.1397, simple_loss=0.2378, pruned_loss=0.02077, over 24135.00 frames. ], tot_loss[loss=0.137, simple_loss=0.2274, pruned_loss=0.02327, over 1889769.59 frames. ], batch size: 366, lr: 1.18e-02, grad_scale: 64.0
2024-03-09 17:56:17,034 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.60 vs. limit=15.0
2024-03-09 17:56:23,776 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=41746.666666666664, ans=0.2
2024-03-09 17:56:37,121 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=41813.333333333336, ans=0.0017797101449275356
2024-03-09 17:56:47,820 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=41880.0, ans=0.0017652173913043478
2024-03-09 17:57:15,167 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=42013.333333333336, ans=0.2
2024-03-09 17:57:18,162 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=42013.333333333336, ans=0.2
2024-03-09 17:57:25,919 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.807e+01 6.999e+01 7.479e+01 8.341e+01 1.133e+02, threshold=1.496e+02, percent-clipped=0.0
2024-03-09 17:57:29,674 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=42080.0, ans=0.025
2024-03-09 17:57:30,894 INFO [train.py:997] (3/4) Epoch 40, batch 150, loss[loss=0.128, simple_loss=0.2208, pruned_loss=0.01763, over 22982.00 frames. ], tot_loss[loss=0.137, simple_loss=0.2273, pruned_loss=0.02336, over 2516987.52 frames. ], batch size: 609, lr: 1.18e-02, grad_scale: 64.0
2024-03-09 17:57:32,673 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=42080.0, ans=0.0
2024-03-09 17:57:34,271 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=42080.0, ans=0.1
2024-03-09 17:58:21,372 INFO [train.py:997] (3/4) Epoch 41, batch 0, loss[loss=0.131, simple_loss=0.2185, pruned_loss=0.02179, over 24135.00 frames. ], tot_loss[loss=0.131, simple_loss=0.2185, pruned_loss=0.02179, over 24135.00 frames. ], batch size: 240, lr: 1.16e-02, grad_scale: 64.0
2024-03-09 17:58:21,372 INFO [train.py:1020] (3/4) Computing validation loss
2024-03-09 17:58:30,940 INFO [train.py:1029] (3/4) Epoch 41, validation: loss=0.2136, simple_loss=0.3076, pruned_loss=0.05982, over 452978.00 frames. 
2024-03-09 17:58:30,941 INFO [train.py:1030] (3/4) Maximum memory allocated so far is 27673MB
2024-03-09 17:58:47,663 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=42200.0, ans=0.125
2024-03-09 17:58:52,538 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=42200.0, ans=0.0016956521739130443
2024-03-09 17:59:24,724 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=42333.333333333336, ans=0.125
2024-03-09 17:59:41,599 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=42400.0, ans=0.2
2024-03-09 17:59:53,550 INFO [train.py:997] (3/4) Epoch 41, batch 50, loss[loss=0.1266, simple_loss=0.2204, pruned_loss=0.01633, over 24085.00 frames. ], tot_loss[loss=0.1351, simple_loss=0.2258, pruned_loss=0.02225, over 1067454.82 frames. ], batch size: 344, lr: 1.16e-02, grad_scale: 64.0
2024-03-09 18:00:19,292 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.45 vs. limit=15.0
2024-03-09 18:00:37,651 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.37 vs. limit=15.0
2024-03-09 18:00:38,647 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=42600.0, ans=0.125
2024-03-09 18:00:55,611 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.788e+01 7.025e+01 7.943e+01 8.921e+01 1.202e+02, threshold=1.589e+02, percent-clipped=0.0
2024-03-09 18:01:00,430 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=42733.333333333336, ans=0.001579710144927535
2024-03-09 18:01:05,928 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.09 vs. limit=15.0
2024-03-09 18:01:11,216 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=42733.333333333336, ans=0.0
2024-03-09 18:01:14,024 INFO [train.py:997] (3/4) Epoch 41, batch 100, loss[loss=0.1442, simple_loss=0.2327, pruned_loss=0.02788, over 23047.00 frames. ], tot_loss[loss=0.1369, simple_loss=0.2276, pruned_loss=0.02308, over 1884862.24 frames. ], batch size: 102, lr: 1.16e-02, grad_scale: 64.0
2024-03-09 18:01:33,418 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.97 vs. limit=22.5
2024-03-09 18:01:38,854 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=42866.666666666664, ans=0.125
2024-03-09 18:01:39,583 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten.whitening_limit, batch_count=42866.666666666664, ans=22.5
2024-03-09 18:01:55,600 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=42933.333333333336, ans=0.125
2024-03-09 18:02:03,130 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=43000.0, ans=0.125
2024-03-09 18:02:05,560 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=43000.0, ans=0.125
2024-03-09 18:02:11,520 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=43000.0, ans=0.125
2024-03-09 18:02:34,745 INFO [train.py:997] (3/4) Epoch 41, batch 150, loss[loss=0.1369, simple_loss=0.2339, pruned_loss=0.01992, over 23932.00 frames. ], tot_loss[loss=0.1372, simple_loss=0.228, pruned_loss=0.02321, over 2527203.76 frames. ], batch size: 387, lr: 1.16e-02, grad_scale: 64.0
2024-03-09 18:02:35,558 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. limit=6.0
2024-03-09 18:02:37,864 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=43133.333333333336, ans=0.2
2024-03-09 18:02:39,559 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=43133.333333333336, ans=0.0014927536231884048
2024-03-09 18:03:28,793 INFO [train.py:997] (3/4) Epoch 42, batch 0, loss[loss=0.1431, simple_loss=0.2415, pruned_loss=0.02238, over 23966.00 frames. ], tot_loss[loss=0.1431, simple_loss=0.2415, pruned_loss=0.02238, over 23966.00 frames. ], batch size: 416, lr: 1.14e-02, grad_scale: 64.0
2024-03-09 18:03:28,793 INFO [train.py:1020] (3/4) Computing validation loss
2024-03-09 18:03:38,340 INFO [train.py:1029] (3/4) Epoch 42, validation: loss=0.2135, simple_loss=0.3075, pruned_loss=0.05972, over 452978.00 frames. 
2024-03-09 18:03:38,341 INFO [train.py:1030] (3/4) Maximum memory allocated so far is 27673MB
2024-03-09 18:04:03,041 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=43253.333333333336, ans=0.0
2024-03-09 18:04:04,592 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=43253.333333333336, ans=0.0014666666666666665
2024-03-09 18:04:07,114 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.65 vs. limit=12.0
2024-03-09 18:04:09,198 INFO [scaling.py:1119] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-03-09 18:04:12,193 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=43320.0, ans=0.0
2024-03-09 18:04:12,230 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=43320.0, ans=0.0
2024-03-09 18:04:29,006 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.865e+01 6.812e+01 7.244e+01 8.018e+01 1.063e+02, threshold=1.449e+02, percent-clipped=0.0
2024-03-09 18:04:50,271 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0
2024-03-09 18:04:58,767 INFO [train.py:997] (3/4) Epoch 42, batch 50, loss[loss=0.1467, simple_loss=0.2284, pruned_loss=0.03251, over 23906.00 frames. ], tot_loss[loss=0.1341, simple_loss=0.2251, pruned_loss=0.02157, over 1069473.35 frames. ], batch size: 153, lr: 1.14e-02, grad_scale: 64.0
2024-03-09 18:05:06,739 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=43520.0, ans=0.125
2024-03-09 18:05:19,865 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0
2024-03-09 18:05:49,991 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=43720.0, ans=0.125
2024-03-09 18:05:57,647 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=43720.0, ans=0.035
2024-03-09 18:06:20,948 INFO [train.py:997] (3/4) Epoch 42, batch 100, loss[loss=0.1437, simple_loss=0.2436, pruned_loss=0.02187, over 23829.00 frames. ], tot_loss[loss=0.1342, simple_loss=0.225, pruned_loss=0.02169, over 1881314.16 frames. ], batch size: 447, lr: 1.14e-02, grad_scale: 64.0
2024-03-09 18:06:33,487 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=43853.333333333336, ans=0.1
2024-03-09 18:07:07,098 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=44053.333333333336, ans=0.001292753623188406
2024-03-09 18:07:09,737 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.750e+01 6.712e+01 7.266e+01 7.977e+01 1.080e+02, threshold=1.453e+02, percent-clipped=0.0
2024-03-09 18:07:10,771 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.30 vs. limit=6.0
2024-03-09 18:07:24,536 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=44120.0, ans=0.2
2024-03-09 18:07:39,991 INFO [train.py:997] (3/4) Epoch 42, batch 150, loss[loss=0.1351, simple_loss=0.2215, pruned_loss=0.02432, over 20004.00 frames. ], tot_loss[loss=0.1355, simple_loss=0.2269, pruned_loss=0.0221, over 2516694.39 frames. ], batch size: 60, lr: 1.14e-02, grad_scale: 64.0
2024-03-09 18:07:40,197 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=44186.666666666664, ans=0.2
2024-03-09 18:08:31,601 INFO [train.py:997] (3/4) Epoch 43, batch 0, loss[loss=0.1485, simple_loss=0.2454, pruned_loss=0.02575, over 23705.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2454, pruned_loss=0.02575, over 23705.00 frames. ], batch size: 485, lr: 1.12e-02, grad_scale: 64.0
2024-03-09 18:08:31,602 INFO [train.py:1020] (3/4) Computing validation loss
2024-03-09 18:08:41,004 INFO [train.py:1029] (3/4) Epoch 43, validation: loss=0.2134, simple_loss=0.3077, pruned_loss=0.05952, over 452978.00 frames. 
2024-03-09 18:08:41,005 INFO [train.py:1030] (3/4) Maximum memory allocated so far is 27673MB
2024-03-09 18:08:53,631 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=44240.0, ans=0.125
2024-03-09 18:09:01,517 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=44306.666666666664, ans=0.0
2024-03-09 18:09:50,359 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.16 vs. limit=15.0
2024-03-09 18:09:50,990 INFO [scaling.py:1119] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-03-09 18:09:58,589 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=44506.666666666664, ans=0.2
2024-03-09 18:10:01,088 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.93 vs. limit=15.0
2024-03-09 18:10:01,381 INFO [train.py:997] (3/4) Epoch 43, batch 50, loss[loss=0.1285, simple_loss=0.2119, pruned_loss=0.02257, over 20430.00 frames. ], tot_loss[loss=0.1379, simple_loss=0.2283, pruned_loss=0.02375, over 1072487.93 frames. ], batch size: 62, lr: 1.12e-02, grad_scale: 64.0
2024-03-09 18:10:36,516 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.916e+01 6.864e+01 7.263e+01 8.155e+01 1.054e+02, threshold=1.453e+02, percent-clipped=0.0
2024-03-09 18:10:40,000 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=44706.666666666664, ans=0.125
2024-03-09 18:10:46,092 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=44773.333333333336, ans=0.125
2024-03-09 18:11:19,221 INFO [train.py:997] (3/4) Epoch 43, batch 100, loss[loss=0.1368, simple_loss=0.2324, pruned_loss=0.02057, over 24162.00 frames. ], tot_loss[loss=0.1356, simple_loss=0.2261, pruned_loss=0.02253, over 1889264.58 frames. ], batch size: 345, lr: 1.12e-02, grad_scale: 64.0
2024-03-09 18:12:10,231 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=45106.666666666664, ans=0.125
2024-03-09 18:12:33,060 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=45173.333333333336, ans=0.0
2024-03-09 18:12:34,565 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=45173.333333333336, ans=0.0
2024-03-09 18:12:40,863 INFO [train.py:997] (3/4) Epoch 43, batch 150, loss[loss=0.113, simple_loss=0.2075, pruned_loss=0.009225, over 21451.00 frames. ], tot_loss[loss=0.1355, simple_loss=0.2267, pruned_loss=0.02216, over 2516550.55 frames. ], batch size: 718, lr: 1.12e-02, grad_scale: 32.0
2024-03-09 18:12:49,305 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=45240.0, ans=0.125
2024-03-09 18:13:36,399 INFO [train.py:997] (3/4) Epoch 44, batch 0, loss[loss=0.1242, simple_loss=0.2121, pruned_loss=0.0181, over 23603.00 frames. ], tot_loss[loss=0.1242, simple_loss=0.2121, pruned_loss=0.0181, over 23603.00 frames. ], batch size: 128, lr: 1.10e-02, grad_scale: 32.0
2024-03-09 18:13:36,399 INFO [train.py:1020] (3/4) Computing validation loss
2024-03-09 18:13:45,433 INFO [train.py:1029] (3/4) Epoch 44, validation: loss=0.2121, simple_loss=0.3064, pruned_loss=0.05891, over 452978.00 frames. 
2024-03-09 18:13:45,434 INFO [train.py:1030] (3/4) Maximum memory allocated so far is 27673MB
2024-03-09 18:14:02,599 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=45293.333333333336, ans=0.125
2024-03-09 18:14:06,614 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.85 vs. limit=12.0
2024-03-09 18:14:07,943 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.92 vs. limit=12.0
2024-03-09 18:14:19,825 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.880e+01 6.918e+01 7.525e+01 8.097e+01 1.200e+02, threshold=1.505e+02, percent-clipped=0.0
2024-03-09 18:15:05,780 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0
2024-03-09 18:15:09,813 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=45560.0, ans=0.2
2024-03-09 18:15:12,599 INFO [train.py:997] (3/4) Epoch 44, batch 50, loss[loss=0.1368, simple_loss=0.2234, pruned_loss=0.02505, over 24053.00 frames. ], tot_loss[loss=0.1364, simple_loss=0.2266, pruned_loss=0.02309, over 1070767.10 frames. ], batch size: 165, lr: 1.10e-02, grad_scale: 32.0
2024-03-09 18:15:15,968 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=45626.666666666664, ans=0.125
2024-03-09 18:15:22,104 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=45626.666666666664, ans=0.125
2024-03-09 18:15:48,181 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=45760.0, ans=0.125
2024-03-09 18:15:52,857 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=45760.0, ans=0.125
2024-03-09 18:15:58,938 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=45826.666666666664, ans=0.2
2024-03-09 18:16:20,423 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=45893.333333333336, ans=0.125
2024-03-09 18:16:30,683 INFO [train.py:997] (3/4) Epoch 44, batch 100, loss[loss=0.1365, simple_loss=0.2287, pruned_loss=0.02215, over 24256.00 frames. ], tot_loss[loss=0.1377, simple_loss=0.2287, pruned_loss=0.0234, over 1887612.04 frames. ], batch size: 281, lr: 1.10e-02, grad_scale: 16.0
2024-03-09 18:16:34,054 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=45960.0, ans=0.95
2024-03-09 18:16:45,383 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.21 vs. limit=6.0
2024-03-09 18:16:49,106 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=46026.666666666664, ans=0.125
2024-03-09 18:16:56,910 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=46026.666666666664, ans=0.0008637681159420294
2024-03-09 18:17:01,035 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.671e+01 6.824e+01 7.356e+01 8.103e+01 1.148e+02, threshold=1.471e+02, percent-clipped=0.0
2024-03-09 18:17:09,518 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.68 vs. limit=22.5
2024-03-09 18:17:22,769 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=46160.0, ans=0.04949747468305833
2024-03-09 18:17:36,067 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=46226.666666666664, ans=0.0008202898550724643
2024-03-09 18:17:46,263 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=46226.666666666664, ans=0.0
2024-03-09 18:17:51,976 INFO [train.py:997] (3/4) Epoch 44, batch 150, loss[loss=0.1265, simple_loss=0.2256, pruned_loss=0.01369, over 24226.00 frames. ], tot_loss[loss=0.1366, simple_loss=0.2276, pruned_loss=0.02283, over 2517182.93 frames. ], batch size: 327, lr: 1.10e-02, grad_scale: 16.0
2024-03-09 18:18:43,508 INFO [train.py:997] (3/4) Epoch 45, batch 0, loss[loss=0.1409, simple_loss=0.2269, pruned_loss=0.02746, over 24064.00 frames. ], tot_loss[loss=0.1409, simple_loss=0.2269, pruned_loss=0.02746, over 24064.00 frames. ], batch size: 165, lr: 1.09e-02, grad_scale: 32.0
2024-03-09 18:18:43,509 INFO [train.py:1020] (3/4) Computing validation loss
2024-03-09 18:18:53,093 INFO [train.py:1029] (3/4) Epoch 45, validation: loss=0.2137, simple_loss=0.3089, pruned_loss=0.05927, over 452978.00 frames. 
2024-03-09 18:18:53,094 INFO [train.py:1030] (3/4) Maximum memory allocated so far is 27673MB
2024-03-09 18:19:05,803 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=46346.666666666664, ans=0.1
2024-03-09 18:19:38,295 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=46480.0, ans=0.125
2024-03-09 18:19:39,048 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0
2024-03-09 18:19:45,317 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=46546.666666666664, ans=0.1
2024-03-09 18:19:51,519 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=46546.666666666664, ans=0.125
2024-03-09 18:19:55,489 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.38 vs. limit=15.0
2024-03-09 18:20:07,000 INFO [scaling.py:1119] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-03-09 18:20:11,480 INFO [scaling.py:1119] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-03-09 18:20:16,257 INFO [train.py:997] (3/4) Epoch 45, batch 50, loss[loss=0.1298, simple_loss=0.2171, pruned_loss=0.02127, over 24318.00 frames. ], tot_loss[loss=0.135, simple_loss=0.2262, pruned_loss=0.02193, over 1073449.42 frames. ], batch size: 208, lr: 1.08e-02, grad_scale: 32.0
2024-03-09 18:20:22,705 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=46680.0, ans=0.1
2024-03-09 18:20:29,932 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.843e+01 6.817e+01 7.386e+01 8.152e+01 1.203e+02, threshold=1.477e+02, percent-clipped=0.0
2024-03-09 18:20:39,502 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=46746.666666666664, ans=0.5
2024-03-09 18:20:43,935 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=46746.666666666664, ans=0.05
2024-03-09 18:20:47,466 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.88 vs. limit=10.0
2024-03-09 18:20:51,442 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=46813.333333333336, ans=0.125
2024-03-09 18:20:51,498 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=46813.333333333336, ans=0.1
2024-03-09 18:21:25,208 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=46946.666666666664, ans=0.0006637681159420306
2024-03-09 18:21:35,473 INFO [train.py:997] (3/4) Epoch 45, batch 100, loss[loss=0.1357, simple_loss=0.2272, pruned_loss=0.0221, over 24266.00 frames. ], tot_loss[loss=0.135, simple_loss=0.226, pruned_loss=0.02198, over 1890623.48 frames. ], batch size: 311, lr: 1.08e-02, grad_scale: 32.0
2024-03-09 18:22:22,921 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=47213.333333333336, ans=0.2
2024-03-09 18:22:22,922 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=47213.333333333336, ans=0.0
2024-03-09 18:22:41,800 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=47280.0, ans=0.1
2024-03-09 18:22:45,466 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.92 vs. limit=15.0
2024-03-09 18:22:55,744 INFO [train.py:997] (3/4) Epoch 45, batch 150, loss[loss=0.139, simple_loss=0.2273, pruned_loss=0.02528, over 24223.00 frames. ], tot_loss[loss=0.1347, simple_loss=0.2259, pruned_loss=0.02176, over 2515683.97 frames. ], batch size: 229, lr: 1.08e-02, grad_scale: 16.0
2024-03-09 18:22:59,117 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=47346.666666666664, ans=0.2
2024-03-09 18:23:50,625 INFO [train.py:997] (3/4) Epoch 46, batch 0, loss[loss=0.1476, simple_loss=0.2425, pruned_loss=0.02637, over 23723.00 frames. ], tot_loss[loss=0.1476, simple_loss=0.2425, pruned_loss=0.02637, over 23723.00 frames. ], batch size: 486, lr: 1.07e-02, grad_scale: 16.0
2024-03-09 18:23:50,626 INFO [train.py:1020] (3/4) Computing validation loss
2024-03-09 18:24:00,487 INFO [train.py:1029] (3/4) Epoch 46, validation: loss=0.2142, simple_loss=0.3085, pruned_loss=0.05997, over 452978.00 frames. 
2024-03-09 18:24:00,488 INFO [train.py:1030] (3/4) Maximum memory allocated so far is 27673MB
2024-03-09 18:24:05,180 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.866e+01 6.849e+01 7.495e+01 7.996e+01 1.078e+02, threshold=1.499e+02, percent-clipped=0.0
2024-03-09 18:24:06,979 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=47400.0, ans=0.125
2024-03-09 18:24:14,515 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=47400.0, ans=0.2
2024-03-09 18:24:16,593 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.58 vs. limit=15.0
2024-03-09 18:24:20,786 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=47466.666666666664, ans=0.2
2024-03-09 18:24:26,435 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=47466.666666666664, ans=0.0005507246376811603
2024-03-09 18:24:27,930 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=47466.666666666664, ans=0.1
2024-03-09 18:24:28,774 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=47466.666666666664, ans=15.0
2024-03-09 18:24:37,977 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.34 vs. limit=15.0
2024-03-09 18:24:41,986 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=47533.333333333336, ans=0.0005362318840579708
2024-03-09 18:24:58,743 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=47600.0, ans=0.125
2024-03-09 18:25:25,827 INFO [train.py:997] (3/4) Epoch 46, batch 50, loss[loss=0.1312, simple_loss=0.2258, pruned_loss=0.01824, over 24210.00 frames. ], tot_loss[loss=0.1321, simple_loss=0.2231, pruned_loss=0.02053, over 1071835.54 frames. ], batch size: 295, lr: 1.07e-02, grad_scale: 16.0
2024-03-09 18:26:45,329 INFO [train.py:997] (3/4) Epoch 46, batch 100, loss[loss=0.1194, simple_loss=0.2152, pruned_loss=0.01181, over 22860.00 frames. ], tot_loss[loss=0.1329, simple_loss=0.2239, pruned_loss=0.02092, over 1888353.71 frames. ], batch size: 608, lr: 1.06e-02, grad_scale: 16.0
2024-03-09 18:26:49,977 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.653e+01 6.627e+01 7.164e+01 7.678e+01 1.012e+02, threshold=1.433e+02, percent-clipped=0.0
2024-03-09 18:27:05,755 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0
2024-03-09 18:27:55,964 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=48333.333333333336, ans=0.1
2024-03-09 18:28:06,136 INFO [train.py:997] (3/4) Epoch 46, batch 150, loss[loss=0.1635, simple_loss=0.2528, pruned_loss=0.0371, over 23209.00 frames. ], tot_loss[loss=0.1342, simple_loss=0.2259, pruned_loss=0.02125, over 2526246.99 frames. ], batch size: 534, lr: 1.06e-02, grad_scale: 16.0
2024-03-09 18:28:08,672 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.24 vs. limit=15.0
2024-03-09 18:29:00,564 INFO [train.py:997] (3/4) Epoch 47, batch 0, loss[loss=0.134, simple_loss=0.2278, pruned_loss=0.02012, over 24208.00 frames. ], tot_loss[loss=0.134, simple_loss=0.2278, pruned_loss=0.02012, over 24208.00 frames. ], batch size: 295, lr: 1.05e-02, grad_scale: 32.0
2024-03-09 18:29:00,565 INFO [train.py:1020] (3/4) Computing validation loss
2024-03-09 18:29:10,389 INFO [train.py:1029] (3/4) Epoch 47, validation: loss=0.2152, simple_loss=0.3095, pruned_loss=0.06041, over 452978.00 frames. 
2024-03-09 18:29:10,390 INFO [train.py:1030] (3/4) Maximum memory allocated so far is 27673MB
2024-03-09 18:29:11,567 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.05 vs. limit=22.5
2024-03-09 18:29:15,302 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=48453.333333333336, ans=0.125
2024-03-09 18:29:19,097 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.59 vs. limit=10.0
2024-03-09 18:29:42,405 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=48586.666666666664, ans=0.1
2024-03-09 18:30:28,053 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.832e+01 6.822e+01 7.253e+01 7.989e+01 1.051e+02, threshold=1.451e+02, percent-clipped=0.0
2024-03-09 18:30:34,258 INFO [train.py:997] (3/4) Epoch 47, batch 50, loss[loss=0.1498, simple_loss=0.2312, pruned_loss=0.0342, over 23977.00 frames. ], tot_loss[loss=0.1308, simple_loss=0.2213, pruned_loss=0.02016, over 1069116.76 frames. ], batch size: 153, lr: 1.05e-02, grad_scale: 16.0
2024-03-09 18:30:41,335 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=48786.666666666664, ans=0.125
2024-03-09 18:31:08,154 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.63 vs. limit=15.0
2024-03-09 18:31:09,126 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=48920.0, ans=0.00023478260869565226
2024-03-09 18:31:12,984 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.91 vs. limit=6.0
2024-03-09 18:31:25,895 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=48986.666666666664, ans=0.125
2024-03-09 18:31:28,788 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=48986.666666666664, ans=0.2
2024-03-09 18:31:28,828 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=48986.666666666664, ans=0.0
2024-03-09 18:31:44,546 INFO [scaling.py:1119] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-03-09 18:31:53,479 INFO [train.py:997] (3/4) Epoch 47, batch 100, loss[loss=0.1403, simple_loss=0.2377, pruned_loss=0.02142, over 23945.00 frames. ], tot_loss[loss=0.134, simple_loss=0.225, pruned_loss=0.02154, over 1881478.66 frames. ], batch size: 387, lr: 1.05e-02, grad_scale: 8.0
2024-03-09 18:32:02,423 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.81 vs. limit=15.0
2024-03-09 18:32:05,335 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.83 vs. limit=10.0
2024-03-09 18:32:43,260 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=49320.0, ans=0.125
2024-03-09 18:32:48,835 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.55 vs. limit=15.0
2024-03-09 18:32:54,294 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=49320.0, ans=0.0
2024-03-09 18:32:58,714 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=49386.666666666664, ans=0.0
2024-03-09 18:33:07,721 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=49386.666666666664, ans=0.0
2024-03-09 18:33:10,428 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.939e+01 7.123e+01 7.707e+01 8.583e+01 1.160e+02, threshold=1.541e+02, percent-clipped=0.0
2024-03-09 18:33:15,538 INFO [train.py:997] (3/4) Epoch 47, batch 150, loss[loss=0.1354, simple_loss=0.2301, pruned_loss=0.02033, over 24039.00 frames. ], tot_loss[loss=0.1345, simple_loss=0.2259, pruned_loss=0.02161, over 2511654.63 frames. ], batch size: 344, lr: 1.05e-02, grad_scale: 8.0
2024-03-09 18:33:15,814 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=49453.333333333336, ans=0.1
2024-03-09 18:34:03,513 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=49506.666666666664, ans=0.125
2024-03-09 18:34:05,686 INFO [train.py:997] (3/4) Epoch 48, batch 0, loss[loss=0.118, simple_loss=0.2089, pruned_loss=0.01354, over 23939.00 frames. ], tot_loss[loss=0.118, simple_loss=0.2089, pruned_loss=0.01354, over 23939.00 frames. ], batch size: 142, lr: 1.03e-02, grad_scale: 16.0
2024-03-09 18:34:05,687 INFO [train.py:1020] (3/4) Computing validation loss
2024-03-09 18:34:15,169 INFO [train.py:1029] (3/4) Epoch 48, validation: loss=0.2149, simple_loss=0.3083, pruned_loss=0.06081, over 452978.00 frames. 
2024-03-09 18:34:15,170 INFO [train.py:1030] (3/4) Maximum memory allocated so far is 27673MB
2024-03-09 18:34:33,977 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=49573.333333333336, ans=0.0
2024-03-09 18:34:44,808 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=49573.333333333336, ans=0.1
2024-03-09 18:34:45,546 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.30 vs. limit=10.0
2024-03-09 18:35:04,605 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=49640.0, ans=0.125
2024-03-09 18:35:24,460 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.74 vs. limit=12.0
2024-03-09 18:35:39,129 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=49840.0, ans=0.125
2024-03-09 18:35:40,447 INFO [train.py:997] (3/4) Epoch 48, batch 50, loss[loss=0.1303, simple_loss=0.2189, pruned_loss=0.02089, over 24226.00 frames. ], tot_loss[loss=0.1329, simple_loss=0.2236, pruned_loss=0.02108, over 1074716.06 frames. ], batch size: 217, lr: 1.03e-02, grad_scale: 16.0
2024-03-09 18:35:41,581 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0
2024-03-09 18:35:53,152 INFO [scaling.py:1119] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2024-03-09 18:36:03,370 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.57 vs. limit=15.0
2024-03-09 18:36:11,064 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. limit=10.0
2024-03-09 18:36:16,788 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=49973.333333333336, ans=0.125
2024-03-09 18:36:22,286 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.84 vs. limit=15.0
2024-03-09 18:36:32,171 INFO [scaling.py:1119] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-03-09 18:36:39,626 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=50040.0, ans=0.1
2024-03-09 18:36:39,671 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=50040.0, ans=0.125
2024-03-09 18:36:42,443 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.770e+01 6.729e+01 7.301e+01 8.005e+01 9.735e+01, threshold=1.460e+02, percent-clipped=0.0
2024-03-09 18:36:47,526 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=50106.666666666664, ans=0.2
2024-03-09 18:36:59,143 INFO [train.py:997] (3/4) Epoch 48, batch 100, loss[loss=0.1205, simple_loss=0.2024, pruned_loss=0.01929, over 23722.00 frames. ], tot_loss[loss=0.1341, simple_loss=0.2248, pruned_loss=0.02172, over 1889845.79 frames. ], batch size: 117, lr: 1.03e-02, grad_scale: 16.0
2024-03-09 18:37:25,188 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=50240.0, ans=0.07
2024-03-09 18:37:34,127 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=50306.666666666664, ans=0.2
2024-03-09 18:37:50,890 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=50373.333333333336, ans=0.0
2024-03-09 18:38:02,376 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=50373.333333333336, ans=0.0
2024-03-09 18:38:02,386 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=50373.333333333336, ans=0.125
2024-03-09 18:38:20,105 INFO [train.py:997] (3/4) Epoch 48, batch 150, loss[loss=0.1282, simple_loss=0.2205, pruned_loss=0.0179, over 24263.00 frames. ], tot_loss[loss=0.1335, simple_loss=0.2247, pruned_loss=0.02112, over 2507628.60 frames. ], batch size: 241, lr: 1.03e-02, grad_scale: 8.0
2024-03-09 18:38:29,869 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=50506.666666666664, ans=0.09899494936611666
2024-03-09 18:38:29,901 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=50506.666666666664, ans=0.0
2024-03-09 18:39:15,061 INFO [train.py:997] (3/4) Epoch 49, batch 0, loss[loss=0.1372, simple_loss=0.2326, pruned_loss=0.02094, over 23746.00 frames. ], tot_loss[loss=0.1372, simple_loss=0.2326, pruned_loss=0.02094, over 23746.00 frames. ], batch size: 486, lr: 1.02e-02, grad_scale: 16.0
2024-03-09 18:39:15,062 INFO [train.py:1020] (3/4) Computing validation loss
2024-03-09 18:39:24,778 INFO [train.py:1029] (3/4) Epoch 49, validation: loss=0.2171, simple_loss=0.31, pruned_loss=0.06203, over 452978.00 frames. 
2024-03-09 18:39:24,779 INFO [train.py:1030] (3/4) Maximum memory allocated so far is 27673MB
2024-03-09 18:39:45,085 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=50626.666666666664, ans=0.125
2024-03-09 18:39:51,071 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=50626.666666666664, ans=0.0
2024-03-09 18:40:11,811 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0
2024-03-09 18:40:12,827 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=50693.333333333336, ans=0.2
2024-03-09 18:40:23,493 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.861e+01 6.914e+01 7.599e+01 8.430e+01 1.205e+02, threshold=1.520e+02, percent-clipped=0.0
2024-03-09 18:40:37,111 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.78 vs. limit=15.0
2024-03-09 18:40:47,030 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=50826.666666666664, ans=0.125
2024-03-09 18:40:51,381 INFO [train.py:997] (3/4) Epoch 49, batch 50, loss[loss=0.1153, simple_loss=0.2104, pruned_loss=0.01008, over 21415.00 frames. ], tot_loss[loss=0.1328, simple_loss=0.2238, pruned_loss=0.02091, over 1064043.14 frames. ], batch size: 718, lr: 1.02e-02, grad_scale: 16.0
2024-03-09 18:41:22,009 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.91 vs. limit=15.0
2024-03-09 18:41:40,434 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.77 vs. limit=12.0
2024-03-09 18:41:51,339 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.20 vs. limit=15.0
2024-03-09 18:42:10,662 INFO [train.py:997] (3/4) Epoch 49, batch 100, loss[loss=0.1366, simple_loss=0.2256, pruned_loss=0.02384, over 24195.00 frames. ], tot_loss[loss=0.1321, simple_loss=0.2229, pruned_loss=0.02069, over 1879066.52 frames. ], batch size: 188, lr: 1.01e-02, grad_scale: 8.0
2024-03-09 18:42:12,475 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=51226.666666666664, ans=0.125
2024-03-09 18:42:18,841 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=51226.666666666664, ans=0.05
2024-03-09 18:42:18,866 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=51226.666666666664, ans=0.125
2024-03-09 18:42:25,546 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.80 vs. limit=22.5
2024-03-09 18:42:27,925 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=51293.333333333336, ans=0.1
2024-03-09 18:42:41,633 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=51293.333333333336, ans=0.0
2024-03-09 18:43:04,162 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.035e+01 6.804e+01 7.380e+01 7.884e+01 1.078e+02, threshold=1.476e+02, percent-clipped=0.0
2024-03-09 18:43:17,359 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=51493.333333333336, ans=0.0
2024-03-09 18:43:30,600 INFO [train.py:997] (3/4) Epoch 49, batch 150, loss[loss=0.1279, simple_loss=0.2197, pruned_loss=0.0181, over 24245.00 frames. ], tot_loss[loss=0.1335, simple_loss=0.2243, pruned_loss=0.02132, over 2507386.41 frames. ], batch size: 198, lr: 1.01e-02, grad_scale: 8.0
2024-03-09 18:43:39,347 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.74 vs. limit=6.0
2024-03-09 18:44:22,357 INFO [train.py:997] (3/4) Epoch 50, batch 0, loss[loss=0.134, simple_loss=0.2338, pruned_loss=0.0171, over 23883.00 frames. ], tot_loss[loss=0.134, simple_loss=0.2338, pruned_loss=0.0171, over 23883.00 frames. ], batch size: 447, lr: 1.00e-02, grad_scale: 16.0
2024-03-09 18:44:22,357 INFO [train.py:1020] (3/4) Computing validation loss
2024-03-09 18:44:30,928 INFO [zipformer.py:1858] (3/4) name=encoder.encoders.4.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([1.2012, 3.7216, 3.9667, 2.6843], device='cuda:3')
2024-03-09 18:44:31,920 INFO [train.py:1029] (3/4) Epoch 50, validation: loss=0.2164, simple_loss=0.3113, pruned_loss=0.06071, over 452978.00 frames. 
2024-03-09 18:44:31,920 INFO [train.py:1030] (3/4) Maximum memory allocated so far is 27673MB
2024-03-09 18:44:32,930 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.00 vs. limit=22.5
2024-03-09 18:44:42,405 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.02 vs. limit=15.0
2024-03-09 18:45:29,957 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=51813.333333333336, ans=0.09899494936611666
2024-03-09 18:45:40,690 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=51880.0, ans=0.0
2024-03-09 18:45:40,704 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=51880.0, ans=0.0
2024-03-09 18:45:42,269 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=51880.0, ans=0.125
2024-03-09 18:45:57,123 INFO [train.py:997] (3/4) Epoch 50, batch 50, loss[loss=0.1173, simple_loss=0.2122, pruned_loss=0.01116, over 22836.00 frames. ], tot_loss[loss=0.131, simple_loss=0.2224, pruned_loss=0.01983, over 1065670.29 frames. ], batch size: 609, lr: 1.00e-02, grad_scale: 8.0
2024-03-09 18:46:16,626 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. limit=10.0
2024-03-09 18:46:37,167 WARNING [optim.py:487] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.908e+01 6.888e+01 7.222e+01 7.907e+01 1.090e+02, threshold=1.444e+02, percent-clipped=0.0
2024-03-09 18:47:13,805 INFO [train.py:997] (3/4) Epoch 50, batch 100, loss[loss=0.1329, simple_loss=0.2276, pruned_loss=0.01912, over 24203.00 frames. ], tot_loss[loss=0.132, simple_loss=0.2232, pruned_loss=0.02042, over 1867371.10 frames. ], batch size: 295, lr: 9.99e-03, grad_scale: 8.0
2024-03-09 18:47:19,505 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=6.83 vs. limit=15.0
2024-03-09 18:47:22,821 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.78 vs. limit=15.0
2024-03-09 18:47:56,121 INFO [scaling.py:1023] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.89 vs. limit=10.0
2024-03-09 18:48:27,100 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=52546.666666666664, ans=0.0
2024-03-09 18:48:27,162 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=52546.666666666664, ans=0.0
2024-03-09 18:48:28,706 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=52546.666666666664, ans=0.0
2024-03-09 18:48:36,459 INFO [train.py:997] (3/4) Epoch 50, batch 150, loss[loss=0.1306, simple_loss=0.2205, pruned_loss=0.02031, over 24169.00 frames. ], tot_loss[loss=0.1331, simple_loss=0.2249, pruned_loss=0.02064, over 2500487.36 frames. ], batch size: 217, lr: 9.97e-03, grad_scale: 8.0
2024-03-09 18:48:48,483 INFO [train.py:1248] (3/4) Done!