2024-03-09 17:05:59,770 INFO [train.py:1065] (2/4) Training started
2024-03-09 17:05:59,770 INFO [train.py:1075] (2/4) Device: cuda:2
2024-03-09 17:05:59,856 INFO [lexicon.py:168] (2/4) Loading pre-compiled data/lang_char/Linv.pt
2024-03-09 17:05:59,871 INFO [train.py:1086] (2/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '2989b0b1186fa6022932804f5b39fbb2781ebf42', 'k2-git-date': 'Fri Nov 24 11:34:10 2023', 'lhotse-version': '1.22.0.dev+git.d8ed1bbb.dirty', 'torch-version': '1.11.0+cu102', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.9', 'icefall-git-branch': 'dev/mdcc', 'icefall-git-sha1': '8b7ca604-clean', 'icefall-git-date': 'Sat Mar 9 14:09:58 2024', 'icefall-path': '/star-home/jinzengrui/lib/miniconda3/envs/dev39/lib/python3.9/site-packages/icefall-1.0-py3.9.egg', 'k2-path': '/star-home/jinzengrui/lib/miniconda3/envs/dev39/lib/python3.9/site-packages/k2-1.24.4.dev20231207+cuda10.2.torch1.11.0-py3.9-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/star-home/jinzengrui/lib/miniconda3/envs/dev39/lib/python3.9/site-packages/lhotse-1.22.0.dev0+git.d8ed1bbb.dirty-py3.9.egg/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-2-1207150844-f49d8c4f4-c49d5', 'IP address': '10.177.22.19'}, 'world_size': 4, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 50, 'start_epoch': 31, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp'), 'lang_dir': PosixPath('data/lang_char'), 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 1, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 1000, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'blank_id': 0, 'vocab_size': 4852}
2024-03-09 17:05:59,871 INFO [train.py:1088] (2/4) About to create model
2024-03-09 17:06:00,576 INFO [train.py:1092] (2/4) Number of model parameters: 74470867
2024-03-09 17:06:00,577 INFO [checkpoint.py:112] (2/4) Loading checkpoint from zipformer/exp/epoch-30.pt
2024-03-09 17:06:07,913 INFO [train.py:1107] (2/4) Using DDP
2024-03-09 17:06:08,423 INFO [train.py:1119] (2/4) Loading optimizer state dict
2024-03-09 17:06:09,551 INFO [train.py:1127] (2/4) Loading scheduler state dict
2024-03-09 17:06:09,552 INFO [asr_datamodule.py:368] (2/4) About to get train cuts
2024-03-09 17:06:09,556 INFO [asr_datamodule.py:376] (2/4) About to get valid cuts
2024-03-09 17:06:09,558 INFO [asr_datamodule.py:195] (2/4) About to get Musan cuts
2024-03-09 17:06:12,022 INFO [asr_datamodule.py:200] (2/4) Enable MUSAN
2024-03-09 17:06:12,022 INFO [asr_datamodule.py:223] (2/4) Enable SpecAugment
2024-03-09 17:06:12,022 INFO [asr_datamodule.py:224] (2/4) Time warp factor: 80
2024-03-09 17:06:12,022 INFO [asr_datamodule.py:234] (2/4) Num frame mask: 10
2024-03-09 17:06:12,023 INFO [asr_datamodule.py:247] (2/4) About to create train dataset
2024-03-09 17:06:12,023 INFO [asr_datamodule.py:273] (2/4) Using DynamicBucketingSampler.
2024-03-09 17:06:12,782 INFO [asr_datamodule.py:290] (2/4) About to create train dataloader
2024-03-09 17:06:12,783 INFO [asr_datamodule.py:315] (2/4) About to create dev dataset
2024-03-09 17:06:13,095 INFO [asr_datamodule.py:332] (2/4) About to create dev dataloader
2024-03-09 17:06:13,095 INFO [train.py:1205] (2/4) Loading grad scaler state dict
2024-03-09 17:06:53,814 INFO [train.py:997] (2/4) Epoch 31, batch 0, loss[loss=0.1464, simple_loss=0.2317, pruned_loss=0.03049, over 22540.00 frames. ], tot_loss[loss=0.1464, simple_loss=0.2317, pruned_loss=0.03049, over 22540.00 frames. ], batch size: 85, lr: 1.41e-02, grad_scale: 64.0
2024-03-09 17:06:53,814 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 17:07:03,247 INFO [train.py:1029] (2/4) Epoch 31, validation: loss=0.2089, simple_loss=0.3019, pruned_loss=0.05794, over 452978.00 frames. 
2024-03-09 17:07:03,248 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 26094MB
2024-03-09 17:07:04,302 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0
2024-03-09 17:07:17,133 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.49 vs. limit=15.0
2024-03-09 17:07:21,695 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.82 vs. limit=15.0
2024-03-09 17:08:00,478 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=31800.0, ans=0.04949747468305833
2024-03-09 17:08:15,852 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=31866.666666666668, ans=0.125
2024-03-09 17:08:19,165 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=31866.666666666668, ans=0.2
2024-03-09 17:08:21,814 INFO [train.py:997] (2/4) Epoch 31, batch 50, loss[loss=0.1534, simple_loss=0.2489, pruned_loss=0.02897, over 23857.00 frames. ], tot_loss[loss=0.1441, simple_loss=0.2335, pruned_loss=0.02738, over 1071633.96 frames. ], batch size: 447, lr: 1.41e-02, grad_scale: 64.0
2024-03-09 17:08:23,704 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=31933.333333333332, ans=0.5
2024-03-09 17:08:26,706 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=31933.333333333332, ans=0.2
2024-03-09 17:08:54,735 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.909e+01 7.298e+01 7.941e+01 8.893e+01 1.039e+02, threshold=1.588e+02, percent-clipped=0.0
2024-03-09 17:09:16,043 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=32133.333333333332, ans=0.2
2024-03-09 17:09:17,579 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=32133.333333333332, ans=0.125
2024-03-09 17:09:39,306 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=32200.0, ans=0.125
2024-03-09 17:09:43,950 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=32200.0, ans=0.125
2024-03-09 17:09:48,294 INFO [train.py:997] (2/4) Epoch 31, batch 100, loss[loss=0.14, simple_loss=0.2268, pruned_loss=0.02655, over 24240.00 frames. ], tot_loss[loss=0.1442, simple_loss=0.2339, pruned_loss=0.0272, over 1879009.89 frames. ], batch size: 188, lr: 1.40e-02, grad_scale: 64.0
2024-03-09 17:10:01,774 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=15.0
2024-03-09 17:10:13,132 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=32333.333333333332, ans=0.1
2024-03-09 17:10:16,168 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=32333.333333333332, ans=0.0038405797101449283
2024-03-09 17:10:53,962 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=32533.333333333332, ans=0.125
2024-03-09 17:10:57,006 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=32533.333333333332, ans=0.125
2024-03-09 17:11:00,033 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=32533.333333333332, ans=0.07
2024-03-09 17:11:08,745 INFO [train.py:997] (2/4) Epoch 31, batch 150, loss[loss=0.1535, simple_loss=0.2408, pruned_loss=0.03313, over 24237.00 frames. ], tot_loss[loss=0.1454, simple_loss=0.2351, pruned_loss=0.02781, over 2517109.18 frames. ], batch size: 198, lr: 1.40e-02, grad_scale: 64.0
2024-03-09 17:11:10,504 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=32600.0, ans=0.1
2024-03-09 17:12:06,856 INFO [train.py:997] (2/4) Epoch 32, batch 0, loss[loss=0.1951, simple_loss=0.2726, pruned_loss=0.05876, over 23222.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2726, pruned_loss=0.05876, over 23222.00 frames. ], batch size: 534, lr: 1.38e-02, grad_scale: 64.0
2024-03-09 17:12:06,857 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 17:12:16,507 INFO [train.py:1029] (2/4) Epoch 32, validation: loss=0.2101, simple_loss=0.3027, pruned_loss=0.0588, over 452978.00 frames. 
2024-03-09 17:12:16,508 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27051MB
2024-03-09 17:12:18,465 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=32653.333333333332, ans=0.125
2024-03-09 17:12:19,936 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=32653.333333333332, ans=0.125
2024-03-09 17:12:26,195 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=32653.333333333332, ans=0.0
2024-03-09 17:12:32,050 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.203e+01 7.071e+01 7.685e+01 8.593e+01 1.169e+02, threshold=1.537e+02, percent-clipped=0.0
2024-03-09 17:12:33,909 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=32720.0, ans=0.125
2024-03-09 17:12:37,615 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.80 vs. limit=15.0
2024-03-09 17:12:57,829 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=32786.666666666664, ans=0.125
2024-03-09 17:12:58,574 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.96 vs. limit=22.5
2024-03-09 17:13:03,988 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=32853.333333333336, ans=0.0
2024-03-09 17:13:14,946 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=32853.333333333336, ans=0.125
2024-03-09 17:13:27,333 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=32920.0, ans=0.1
2024-03-09 17:13:28,081 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.88 vs. limit=15.0
2024-03-09 17:13:34,604 INFO [train.py:997] (2/4) Epoch 32, batch 50, loss[loss=0.1302, simple_loss=0.2187, pruned_loss=0.02087, over 23953.00 frames. ], tot_loss[loss=0.144, simple_loss=0.2325, pruned_loss=0.02775, over 1067019.37 frames. ], batch size: 142, lr: 1.38e-02, grad_scale: 64.0
2024-03-09 17:13:44,581 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.92 vs. limit=15.0
2024-03-09 17:14:01,597 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=33053.333333333336, ans=0.1
2024-03-09 17:14:16,281 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=33120.0, ans=0.0
2024-03-09 17:14:27,390 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=33186.666666666664, ans=0.0
2024-03-09 17:14:33,630 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=33186.666666666664, ans=0.125
2024-03-09 17:14:59,398 INFO [train.py:997] (2/4) Epoch 32, batch 100, loss[loss=0.1402, simple_loss=0.2245, pruned_loss=0.02798, over 19870.00 frames. ], tot_loss[loss=0.1436, simple_loss=0.2327, pruned_loss=0.02729, over 1886966.60 frames. ], batch size: 60, lr: 1.37e-02, grad_scale: 64.0
2024-03-09 17:15:15,496 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.885e+01 7.174e+01 7.568e+01 8.159e+01 1.038e+02, threshold=1.514e+02, percent-clipped=0.0
2024-03-09 17:15:15,861 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-03-09 17:15:33,196 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.59 vs. limit=10.0
2024-03-09 17:15:34,172 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=33453.333333333336, ans=0.003597101449275362
2024-03-09 17:15:51,257 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=33520.0, ans=0.1
2024-03-09 17:16:19,716 INFO [train.py:997] (2/4) Epoch 32, batch 150, loss[loss=0.1359, simple_loss=0.2226, pruned_loss=0.02461, over 23572.00 frames. ], tot_loss[loss=0.1435, simple_loss=0.2329, pruned_loss=0.02701, over 2520168.77 frames. ], batch size: 128, lr: 1.37e-02, grad_scale: 64.0
2024-03-09 17:17:09,038 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=33706.666666666664, ans=0.1
2024-03-09 17:17:14,940 INFO [train.py:997] (2/4) Epoch 33, batch 0, loss[loss=0.1378, simple_loss=0.2217, pruned_loss=0.02702, over 24242.00 frames. ], tot_loss[loss=0.1378, simple_loss=0.2217, pruned_loss=0.02702, over 24242.00 frames. ], batch size: 229, lr: 1.35e-02, grad_scale: 64.0
2024-03-09 17:17:14,941 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 17:17:24,825 INFO [train.py:1029] (2/4) Epoch 33, validation: loss=0.2104, simple_loss=0.3043, pruned_loss=0.05821, over 452978.00 frames. 
2024-03-09 17:17:24,826 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27051MB
2024-03-09 17:17:36,332 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=33706.666666666664, ans=0.0
2024-03-09 17:17:41,717 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=15.0
2024-03-09 17:18:20,064 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-03-09 17:18:43,172 INFO [train.py:997] (2/4) Epoch 33, batch 50, loss[loss=0.1854, simple_loss=0.2694, pruned_loss=0.0507, over 23231.00 frames. ], tot_loss[loss=0.1404, simple_loss=0.229, pruned_loss=0.02591, over 1049607.84 frames. ], batch size: 534, lr: 1.35e-02, grad_scale: 64.0
2024-03-09 17:18:43,483 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=34040.0, ans=0.2
2024-03-09 17:18:45,080 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=34040.0, ans=0.0
2024-03-09 17:18:46,184 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.045e+01 7.058e+01 7.697e+01 8.414e+01 1.529e+02, threshold=1.539e+02, percent-clipped=1.0
2024-03-09 17:18:51,724 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0
2024-03-09 17:18:54,167 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=34040.0, ans=0.0
2024-03-09 17:18:57,360 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=34106.666666666664, ans=0.0
2024-03-09 17:18:59,493 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.59 vs. limit=15.0
2024-03-09 17:19:11,925 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=34106.666666666664, ans=0.125
2024-03-09 17:19:18,571 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=34173.333333333336, ans=0.0
2024-03-09 17:19:18,653 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=34173.333333333336, ans=10.0
2024-03-09 17:19:27,118 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34173.333333333336, ans=0.1
2024-03-09 17:20:08,483 INFO [train.py:997] (2/4) Epoch 33, batch 100, loss[loss=0.1549, simple_loss=0.2513, pruned_loss=0.02926, over 23824.00 frames. ], tot_loss[loss=0.142, simple_loss=0.2312, pruned_loss=0.02642, over 1875811.14 frames. ], batch size: 447, lr: 1.35e-02, grad_scale: 64.0
2024-03-09 17:20:10,862 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.20 vs. limit=10.0
2024-03-09 17:20:22,600 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=34440.0, ans=0.2
2024-03-09 17:20:30,302 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=34440.0, ans=0.1
2024-03-09 17:20:31,757 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=34440.0, ans=0.125
2024-03-09 17:20:44,038 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=34506.666666666664, ans=0.125
2024-03-09 17:20:47,154 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=34506.666666666664, ans=0.04949747468305833
2024-03-09 17:20:53,222 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=34573.333333333336, ans=0.125
2024-03-09 17:21:06,331 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=34573.333333333336, ans=0.05
2024-03-09 17:21:18,745 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34640.0, ans=0.1
2024-03-09 17:21:28,183 INFO [train.py:997] (2/4) Epoch 33, batch 150, loss[loss=0.1461, simple_loss=0.2362, pruned_loss=0.02803, over 24280.00 frames. ], tot_loss[loss=0.1438, simple_loss=0.2338, pruned_loss=0.02691, over 2514156.33 frames. ], batch size: 267, lr: 1.34e-02, grad_scale: 64.0
2024-03-09 17:21:31,133 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.628e+01 7.574e+01 8.231e+01 9.009e+01 1.365e+02, threshold=1.646e+02, percent-clipped=0.0
2024-03-09 17:22:22,791 INFO [train.py:997] (2/4) Epoch 34, batch 0, loss[loss=0.1448, simple_loss=0.2345, pruned_loss=0.02754, over 24058.00 frames. ], tot_loss[loss=0.1448, simple_loss=0.2345, pruned_loss=0.02754, over 24058.00 frames. ], batch size: 176, lr: 1.32e-02, grad_scale: 64.0
2024-03-09 17:22:22,791 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 17:22:32,283 INFO [train.py:1029] (2/4) Epoch 34, validation: loss=0.2117, simple_loss=0.3053, pruned_loss=0.0591, over 452978.00 frames. 
2024-03-09 17:22:32,284 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 17:22:37,355 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=34760.0, ans=0.125
2024-03-09 17:22:37,447 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=34760.0, ans=0.0
2024-03-09 17:22:56,099 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=34826.666666666664, ans=0.0032985507246376814
2024-03-09 17:23:00,750 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=34826.666666666664, ans=0.0
2024-03-09 17:23:49,727 INFO [train.py:997] (2/4) Epoch 34, batch 50, loss[loss=0.129, simple_loss=0.2203, pruned_loss=0.01887, over 23885.00 frames. ], tot_loss[loss=0.1394, simple_loss=0.2283, pruned_loss=0.02526, over 1071868.33 frames. ], batch size: 142, lr: 1.32e-02, grad_scale: 128.0
2024-03-09 17:24:03,318 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=35093.333333333336, ans=0.0
2024-03-09 17:24:26,755 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=35226.666666666664, ans=0.2
2024-03-09 17:24:35,890 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=35226.666666666664, ans=0.95
2024-03-09 17:24:51,409 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=35293.333333333336, ans=0.125
2024-03-09 17:24:54,525 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=35293.333333333336, ans=0.0031971014492753625
2024-03-09 17:24:57,512 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=35360.0, ans=0.125
2024-03-09 17:25:04,709 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.790e+01 6.987e+01 7.379e+01 8.041e+01 1.553e+02, threshold=1.476e+02, percent-clipped=0.0
2024-03-09 17:25:13,956 INFO [train.py:997] (2/4) Epoch 34, batch 100, loss[loss=0.1515, simple_loss=0.249, pruned_loss=0.02695, over 23827.00 frames. ], tot_loss[loss=0.1396, simple_loss=0.2292, pruned_loss=0.025, over 1881309.78 frames. ], batch size: 447, lr: 1.32e-02, grad_scale: 128.0
2024-03-09 17:25:14,328 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=35426.666666666664, ans=0.0
2024-03-09 17:25:23,486 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=35426.666666666664, ans=0.125
2024-03-09 17:25:30,476 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.02 vs. limit=10.0
2024-03-09 17:25:34,248 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=35493.333333333336, ans=0.2
2024-03-09 17:26:22,831 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=35693.333333333336, ans=0.1
2024-03-09 17:26:32,942 INFO [train.py:997] (2/4) Epoch 34, batch 150, loss[loss=0.1508, simple_loss=0.2458, pruned_loss=0.02793, over 23968.00 frames. ], tot_loss[loss=0.1404, simple_loss=0.2306, pruned_loss=0.02507, over 2514436.88 frames. ], batch size: 416, lr: 1.32e-02, grad_scale: 128.0
2024-03-09 17:27:26,417 INFO [train.py:997] (2/4) Epoch 35, batch 0, loss[loss=0.1368, simple_loss=0.2253, pruned_loss=0.02416, over 24306.00 frames. ], tot_loss[loss=0.1368, simple_loss=0.2253, pruned_loss=0.02416, over 24306.00 frames. ], batch size: 254, lr: 1.30e-02, grad_scale: 128.0
2024-03-09 17:27:26,418 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 17:27:38,584 INFO [train.py:1029] (2/4) Epoch 35, validation: loss=0.2098, simple_loss=0.3027, pruned_loss=0.05849, over 452978.00 frames. 
2024-03-09 17:27:38,584 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 17:28:15,932 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=35946.666666666664, ans=0.2
2024-03-09 17:28:16,321 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.80 vs. limit=15.0
2024-03-09 17:28:25,533 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-03-09 17:28:30,263 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36013.333333333336, ans=0.1
2024-03-09 17:28:34,547 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.276e+01 7.140e+01 7.953e+01 8.912e+01 1.249e+02, threshold=1.591e+02, percent-clipped=0.0
2024-03-09 17:28:58,535 INFO [train.py:997] (2/4) Epoch 35, batch 50, loss[loss=0.1376, simple_loss=0.2266, pruned_loss=0.02431, over 23909.00 frames. ], tot_loss[loss=0.1429, simple_loss=0.2317, pruned_loss=0.02705, over 1052272.91 frames. ], batch size: 153, lr: 1.30e-02, grad_scale: 128.0
2024-03-09 17:28:58,803 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=36146.666666666664, ans=0.125
2024-03-09 17:29:18,678 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=36213.333333333336, ans=0.002997101449275361
2024-03-09 17:29:24,696 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=36213.333333333336, ans=0.125
2024-03-09 17:29:46,663 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=36346.666666666664, ans=0.125
2024-03-09 17:30:02,015 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=36413.333333333336, ans=0.125
2024-03-09 17:30:18,437 INFO [train.py:997] (2/4) Epoch 35, batch 100, loss[loss=0.1428, simple_loss=0.2342, pruned_loss=0.02573, over 24271.00 frames. ], tot_loss[loss=0.1441, simple_loss=0.2338, pruned_loss=0.02722, over 1876023.22 frames. ], batch size: 267, lr: 1.29e-02, grad_scale: 128.0
2024-03-09 17:31:18,163 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.800e+01 7.204e+01 7.789e+01 8.601e+01 1.817e+02, threshold=1.558e+02, percent-clipped=1.0
2024-03-09 17:31:19,313 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.35 vs. limit=22.5
2024-03-09 17:31:38,613 INFO [train.py:997] (2/4) Epoch 35, batch 150, loss[loss=0.145, simple_loss=0.2414, pruned_loss=0.02429, over 24050.00 frames. ], tot_loss[loss=0.1435, simple_loss=0.2337, pruned_loss=0.02665, over 2506960.03 frames. ], batch size: 365, lr: 1.29e-02, grad_scale: 64.0
2024-03-09 17:31:47,935 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=36813.333333333336, ans=0.125
2024-03-09 17:32:32,828 INFO [train.py:997] (2/4) Epoch 36, batch 0, loss[loss=0.1224, simple_loss=0.2015, pruned_loss=0.02166, over 23589.00 frames. ], tot_loss[loss=0.1224, simple_loss=0.2015, pruned_loss=0.02166, over 23589.00 frames. ], batch size: 116, lr: 1.27e-02, grad_scale: 64.0
2024-03-09 17:32:32,828 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 17:32:40,920 INFO [zipformer.py:1858] (2/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([0.9525, 2.3762, 2.5905, 2.6092], device='cuda:2')
2024-03-09 17:32:42,863 INFO [train.py:1029] (2/4) Epoch 36, validation: loss=0.212, simple_loss=0.307, pruned_loss=0.05847, over 452978.00 frames. 
2024-03-09 17:32:42,864 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 17:33:03,428 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=36933.333333333336, ans=0.09899494936611666
2024-03-09 17:33:10,444 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=15.0
2024-03-09 17:33:21,744 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=37000.0, ans=0.0
2024-03-09 17:33:29,963 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.04 vs. limit=15.0
2024-03-09 17:33:36,135 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.46 vs. limit=15.0
2024-03-09 17:34:10,522 INFO [train.py:997] (2/4) Epoch 36, batch 50, loss[loss=0.1477, simple_loss=0.2458, pruned_loss=0.02483, over 23977.00 frames. ], tot_loss[loss=0.1408, simple_loss=0.23, pruned_loss=0.02584, over 1069727.98 frames. ], batch size: 416, lr: 1.27e-02, grad_scale: 64.0
2024-03-09 17:34:23,010 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=37200.0, ans=0.125
2024-03-09 17:34:24,724 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=37266.666666666664, ans=0.125
2024-03-09 17:34:32,325 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-03-09 17:34:32,369 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=37266.666666666664, ans=0.2
2024-03-09 17:34:43,633 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=37333.333333333336, ans=0.125
2024-03-09 17:34:52,052 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0
2024-03-09 17:34:52,769 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=37333.333333333336, ans=0.125
2024-03-09 17:34:55,596 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.060e+01 6.975e+01 7.752e+01 8.346e+01 1.468e+02, threshold=1.550e+02, percent-clipped=0.0
2024-03-09 17:35:10,862 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. limit=10.0
2024-03-09 17:35:16,363 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=37466.666666666664, ans=0.2
2024-03-09 17:35:25,797 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-03-09 17:35:28,459 INFO [train.py:997] (2/4) Epoch 36, batch 100, loss[loss=0.1514, simple_loss=0.2453, pruned_loss=0.02874, over 23974.00 frames. ], tot_loss[loss=0.1397, simple_loss=0.2292, pruned_loss=0.02509, over 1882046.42 frames. ], batch size: 416, lr: 1.27e-02, grad_scale: 64.0
2024-03-09 17:35:56,861 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=37600.0, ans=0.0
2024-03-09 17:35:59,960 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=37600.0, ans=0.125
2024-03-09 17:36:01,376 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=37666.666666666664, ans=0.125
2024-03-09 17:36:20,334 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=37733.333333333336, ans=0.125
2024-03-09 17:36:24,067 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.17 vs. limit=10.0
2024-03-09 17:36:37,815 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.61 vs. limit=15.0
2024-03-09 17:36:38,550 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=37800.0, ans=0.0026521739130434784
2024-03-09 17:36:44,015 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.02 vs. limit=12.0
2024-03-09 17:36:50,951 INFO [train.py:997] (2/4) Epoch 36, batch 150, loss[loss=0.1439, simple_loss=0.2293, pruned_loss=0.02927, over 24098.00 frames. ], tot_loss[loss=0.1402, simple_loss=0.2299, pruned_loss=0.02527, over 2516283.42 frames. ], batch size: 165, lr: 1.27e-02, grad_scale: 64.0
2024-03-09 17:36:56,448 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.25 vs. limit=15.0
2024-03-09 17:37:46,101 INFO [train.py:997] (2/4) Epoch 37, batch 0, loss[loss=0.1375, simple_loss=0.2303, pruned_loss=0.02238, over 24194.00 frames. ], tot_loss[loss=0.1375, simple_loss=0.2303, pruned_loss=0.02238, over 24194.00 frames. ], batch size: 217, lr: 1.25e-02, grad_scale: 64.0
2024-03-09 17:37:46,102 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 17:37:55,592 INFO [train.py:1029] (2/4) Epoch 37, validation: loss=0.2112, simple_loss=0.3044, pruned_loss=0.05893, over 452978.00 frames. 
2024-03-09 17:37:55,593 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 17:37:58,936 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=37920.0, ans=0.125
2024-03-09 17:38:01,974 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=37920.0, ans=0.2
2024-03-09 17:38:02,083 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=37920.0, ans=0.125
2024-03-09 17:38:17,244 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=37986.666666666664, ans=0.0
2024-03-09 17:38:19,542 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.31 vs. limit=15.0
2024-03-09 17:38:20,419 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=37986.666666666664, ans=0.125
2024-03-09 17:38:21,878 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=37986.666666666664, ans=0.125
2024-03-09 17:38:25,027 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=37986.666666666664, ans=0.0026115942028985505
2024-03-09 17:38:30,964 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.112e+01 7.137e+01 7.682e+01 8.524e+01 1.300e+02, threshold=1.536e+02, percent-clipped=0.0
2024-03-09 17:38:47,184 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=38120.0, ans=0.125
2024-03-09 17:39:20,069 INFO [train.py:997] (2/4) Epoch 37, batch 50, loss[loss=0.1495, simple_loss=0.2329, pruned_loss=0.03309, over 24092.00 frames. ], tot_loss[loss=0.1381, simple_loss=0.2271, pruned_loss=0.02458, over 1071551.82 frames. ], batch size: 165, lr: 1.25e-02, grad_scale: 64.0
2024-03-09 17:39:42,150 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=38320.0, ans=0.125
2024-03-09 17:39:43,722 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=38320.0, ans=0.125
2024-03-09 17:39:56,386 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=38386.666666666664, ans=0.5
2024-03-09 17:40:19,358 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=38453.333333333336, ans=0.0025101449275362316
2024-03-09 17:40:36,327 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=38520.0, ans=0.0
2024-03-09 17:40:37,849 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=38520.0, ans=0.1
2024-03-09 17:40:40,615 INFO [train.py:997] (2/4) Epoch 37, batch 100, loss[loss=0.1281, simple_loss=0.2133, pruned_loss=0.02146, over 23630.00 frames. ], tot_loss[loss=0.139, simple_loss=0.2291, pruned_loss=0.02449, over 1890939.57 frames. ], batch size: 116, lr: 1.25e-02, grad_scale: 64.0
2024-03-09 17:40:42,503 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=38586.666666666664, ans=0.125
2024-03-09 17:41:08,793 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-03-09 17:41:13,211 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=38720.0, ans=0.0
2024-03-09 17:41:15,924 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.868e+01 6.991e+01 7.571e+01 8.226e+01 1.121e+02, threshold=1.514e+02, percent-clipped=0.0
2024-03-09 17:41:16,295 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=38720.0, ans=0.125
2024-03-09 17:41:16,933 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.77 vs. limit=6.0
2024-03-09 17:42:00,692 INFO [train.py:997] (2/4) Epoch 37, batch 150, loss[loss=0.1421, simple_loss=0.2347, pruned_loss=0.0247, over 24277.00 frames. ], tot_loss[loss=0.1395, simple_loss=0.2295, pruned_loss=0.02477, over 2520408.82 frames. ], batch size: 267, lr: 1.24e-02, grad_scale: 64.0
2024-03-09 17:42:09,048 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=38920.0, ans=0.125
2024-03-09 17:42:52,930 INFO [train.py:997] (2/4) Epoch 38, batch 0, loss[loss=0.1404, simple_loss=0.2289, pruned_loss=0.026, over 24195.00 frames. ], tot_loss[loss=0.1404, simple_loss=0.2289, pruned_loss=0.026, over 24195.00 frames. ], batch size: 217, lr: 1.23e-02, grad_scale: 64.0
2024-03-09 17:42:52,930 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 17:43:02,283 INFO [train.py:1029] (2/4) Epoch 38, validation: loss=0.2136, simple_loss=0.3079, pruned_loss=0.05959, over 452978.00 frames. 
2024-03-09 17:43:02,283 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 17:43:13,152 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=38973.333333333336, ans=0.125
2024-03-09 17:43:17,946 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=38973.333333333336, ans=0.0
2024-03-09 17:43:21,008 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=39040.0, ans=0.125
2024-03-09 17:43:22,017 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.47 vs. limit=15.0
2024-03-09 17:43:45,519 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=39106.666666666664, ans=0.1
2024-03-09 17:44:27,814 INFO [train.py:997] (2/4) Epoch 38, batch 50, loss[loss=0.1443, simple_loss=0.2397, pruned_loss=0.02444, over 24123.00 frames. ], tot_loss[loss=0.1379, simple_loss=0.2277, pruned_loss=0.02404, over 1071400.73 frames. ], batch size: 366, lr: 1.22e-02, grad_scale: 64.0
2024-03-09 17:44:48,017 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.028e+01 7.170e+01 7.896e+01 8.779e+01 1.113e+02, threshold=1.579e+02, percent-clipped=0.0
2024-03-09 17:44:51,469 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=39373.333333333336, ans=0.002310144927536232
2024-03-09 17:45:03,480 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=39440.0, ans=0.07
2024-03-09 17:45:21,592 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=39506.666666666664, ans=0.0
2024-03-09 17:45:33,572 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.06 vs. limit=15.0
2024-03-09 17:45:38,850 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=39573.333333333336, ans=0.125
2024-03-09 17:45:43,459 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=39573.333333333336, ans=0.05
2024-03-09 17:45:46,179 INFO [train.py:997] (2/4) Epoch 38, batch 100, loss[loss=0.1357, simple_loss=0.2239, pruned_loss=0.02378, over 24043.00 frames. ], tot_loss[loss=0.1393, simple_loss=0.2298, pruned_loss=0.0244, over 1893209.55 frames. ], batch size: 165, lr: 1.22e-02, grad_scale: 64.0
2024-03-09 17:46:09,701 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=39706.666666666664, ans=0.0
2024-03-09 17:46:16,562 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.53 vs. limit=15.0
2024-03-09 17:46:20,509 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=39773.333333333336, ans=0.09899494936611666
2024-03-09 17:47:07,593 INFO [train.py:997] (2/4) Epoch 38, batch 150, loss[loss=0.1358, simple_loss=0.2314, pruned_loss=0.0201, over 24093.00 frames. ], tot_loss[loss=0.1397, simple_loss=0.2303, pruned_loss=0.02457, over 2520913.76 frames. ], batch size: 344, lr: 1.22e-02, grad_scale: 64.0
2024-03-09 17:48:03,476 INFO [train.py:997] (2/4) Epoch 39, batch 0, loss[loss=0.1329, simple_loss=0.2262, pruned_loss=0.01986, over 24062.00 frames. ], tot_loss[loss=0.1329, simple_loss=0.2262, pruned_loss=0.01986, over 24062.00 frames. ], batch size: 344, lr: 1.20e-02, grad_scale: 64.0
2024-03-09 17:48:03,477 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 17:48:12,216 INFO [zipformer.py:1858] (2/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.6502, 5.3125, 5.6457, 5.3410], device='cuda:2')
2024-03-09 17:48:12,745 INFO [train.py:1029] (2/4) Epoch 39, validation: loss=0.2141, simple_loss=0.3082, pruned_loss=0.06004, over 452978.00 frames. 
2024-03-09 17:48:12,746 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 17:48:26,644 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.993e+01 6.884e+01 7.356e+01 8.157e+01 1.068e+02, threshold=1.471e+02, percent-clipped=0.0
2024-03-09 17:48:40,644 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=40093.333333333336, ans=0.1
2024-03-09 17:48:40,657 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=40093.333333333336, ans=0.0
2024-03-09 17:49:09,858 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=40226.666666666664, ans=0.002124637681159421
2024-03-09 17:49:16,101 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=40226.666666666664, ans=0.1
2024-03-09 17:49:41,668 INFO [train.py:997] (2/4) Epoch 39, batch 50, loss[loss=0.1416, simple_loss=0.2311, pruned_loss=0.02604, over 24191.00 frames. ], tot_loss[loss=0.138, simple_loss=0.2272, pruned_loss=0.02437, over 1068898.46 frames. ], batch size: 295, lr: 1.20e-02, grad_scale: 64.0
2024-03-09 17:49:44,373 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0
2024-03-09 17:50:20,257 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=40493.333333333336, ans=0.05
2024-03-09 17:50:59,996 INFO [train.py:997] (2/4) Epoch 39, batch 100, loss[loss=0.1243, simple_loss=0.2186, pruned_loss=0.01501, over 22863.00 frames. ], tot_loss[loss=0.1378, simple_loss=0.2281, pruned_loss=0.02373, over 1888872.37 frames. ], batch size: 609, lr: 1.20e-02, grad_scale: 64.0
2024-03-09 17:51:09,406 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.940e+01 6.841e+01 7.461e+01 8.103e+01 1.250e+02, threshold=1.492e+02, percent-clipped=0.0
2024-03-09 17:51:36,818 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=40826.666666666664, ans=0.0
2024-03-09 17:52:21,040 INFO [train.py:997] (2/4) Epoch 39, batch 150, loss[loss=0.1354, simple_loss=0.2311, pruned_loss=0.01985, over 24144.00 frames. ], tot_loss[loss=0.1382, simple_loss=0.229, pruned_loss=0.02371, over 2523197.24 frames. ], batch size: 366, lr: 1.20e-02, grad_scale: 64.0
2024-03-09 17:52:30,135 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=41026.666666666664, ans=0.125
2024-03-09 17:53:16,190 INFO [train.py:997] (2/4) Epoch 40, batch 0, loss[loss=0.1141, simple_loss=0.2117, pruned_loss=0.008274, over 21415.00 frames. ], tot_loss[loss=0.1141, simple_loss=0.2117, pruned_loss=0.008274, over 21415.00 frames. ], batch size: 718, lr: 1.18e-02, grad_scale: 64.0
2024-03-09 17:53:16,190 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 17:53:25,709 INFO [train.py:1029] (2/4) Epoch 40, validation: loss=0.2148, simple_loss=0.3085, pruned_loss=0.06058, over 452978.00 frames. 
2024-03-09 17:53:25,709 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 17:53:54,642 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.74 vs. limit=15.0
2024-03-09 17:54:00,191 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=41213.333333333336, ans=0.125
2024-03-09 17:54:07,647 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=41213.333333333336, ans=0.125
2024-03-09 17:54:13,866 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=41213.333333333336, ans=10.0
2024-03-09 17:54:23,974 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.02 vs. limit=10.0
2024-03-09 17:54:39,624 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=41346.666666666664, ans=0.125
2024-03-09 17:54:43,491 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.77 vs. limit=15.0
2024-03-09 17:54:47,011 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.979e+01 7.013e+01 7.603e+01 8.055e+01 1.247e+02, threshold=1.521e+02, percent-clipped=0.0
2024-03-09 17:54:51,545 INFO [train.py:997] (2/4) Epoch 40, batch 50, loss[loss=0.1435, simple_loss=0.2316, pruned_loss=0.02769, over 24076.00 frames. ], tot_loss[loss=0.1369, simple_loss=0.2275, pruned_loss=0.02311, over 1062746.76 frames. ], batch size: 176, lr: 1.18e-02, grad_scale: 64.0
2024-03-09 17:55:00,924 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=41413.333333333336, ans=0.1
2024-03-09 17:55:34,858 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=41546.666666666664, ans=0.125
2024-03-09 17:55:36,380 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=41613.333333333336, ans=0.025
2024-03-09 17:55:41,381 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.00 vs. limit=15.0
2024-03-09 17:56:11,514 INFO [train.py:997] (2/4) Epoch 40, batch 100, loss[loss=0.1421, simple_loss=0.2409, pruned_loss=0.02163, over 24057.00 frames. ], tot_loss[loss=0.1387, simple_loss=0.2298, pruned_loss=0.02385, over 1880352.11 frames. ], batch size: 389, lr: 1.18e-02, grad_scale: 64.0
2024-03-09 17:56:25,273 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=41813.333333333336, ans=0.125
2024-03-09 17:56:25,274 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=41813.333333333336, ans=0.125
2024-03-09 17:56:38,662 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=41813.333333333336, ans=0.0017797101449275356
2024-03-09 17:57:01,827 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.09 vs. limit=10.0
2024-03-09 17:57:15,104 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=42013.333333333336, ans=0.0
2024-03-09 17:57:25,919 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.807e+01 6.999e+01 7.479e+01 8.341e+01 1.133e+02, threshold=1.496e+02, percent-clipped=0.0
2024-03-09 17:57:29,664 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=42080.0, ans=0.0
2024-03-09 17:57:29,749 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=42080.0, ans=0.2
2024-03-09 17:57:30,900 INFO [train.py:997] (2/4) Epoch 40, batch 150, loss[loss=0.1284, simple_loss=0.2239, pruned_loss=0.01647, over 22919.00 frames. ], tot_loss[loss=0.1385, simple_loss=0.2292, pruned_loss=0.02386, over 2520675.51 frames. ], batch size: 609, lr: 1.18e-02, grad_scale: 64.0
2024-03-09 17:57:34,323 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=42080.0, ans=0.0
2024-03-09 17:58:21,383 INFO [train.py:997] (2/4) Epoch 41, batch 0, loss[loss=0.1449, simple_loss=0.2422, pruned_loss=0.0238, over 23837.00 frames. ], tot_loss[loss=0.1449, simple_loss=0.2422, pruned_loss=0.0238, over 23837.00 frames. ], batch size: 447, lr: 1.16e-02, grad_scale: 64.0
2024-03-09 17:58:21,383 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 17:58:30,942 INFO [train.py:1029] (2/4) Epoch 41, validation: loss=0.2136, simple_loss=0.3076, pruned_loss=0.05982, over 452978.00 frames. 
2024-03-09 17:58:30,942 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 17:58:52,553 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=42200.0, ans=0.0016956521739130443
2024-03-09 17:59:16,947 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=42266.666666666664, ans=0.04949747468305833
2024-03-09 17:59:26,230 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=42333.333333333336, ans=0.1
2024-03-09 17:59:41,645 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=42400.0, ans=0.1
2024-03-09 17:59:44,713 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=42400.0, ans=0.0016521739130434792
2024-03-09 17:59:45,314 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.49 vs. limit=12.0
2024-03-09 17:59:53,110 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.36 vs. limit=15.0
2024-03-09 17:59:53,553 INFO [train.py:997] (2/4) Epoch 41, batch 50, loss[loss=0.1425, simple_loss=0.2332, pruned_loss=0.02589, over 24080.00 frames. ], tot_loss[loss=0.1364, simple_loss=0.2274, pruned_loss=0.02268, over 1071951.54 frames. ], batch size: 176, lr: 1.16e-02, grad_scale: 64.0
2024-03-09 18:00:15,798 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=42533.333333333336, ans=0.2
2024-03-09 18:00:18,834 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=42533.333333333336, ans=0.125
2024-03-09 18:00:36,173 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.35 vs. limit=15.0
2024-03-09 18:00:55,611 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.788e+01 7.025e+01 7.943e+01 8.921e+01 1.202e+02, threshold=1.589e+02, percent-clipped=0.0
2024-03-09 18:00:56,005 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=42733.333333333336, ans=0.125
2024-03-09 18:01:09,779 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=42733.333333333336, ans=0.0
2024-03-09 18:01:14,009 INFO [train.py:997] (2/4) Epoch 41, batch 100, loss[loss=0.1255, simple_loss=0.2142, pruned_loss=0.01841, over 23967.00 frames. ], tot_loss[loss=0.1367, simple_loss=0.2276, pruned_loss=0.02288, over 1877208.30 frames. ], batch size: 142, lr: 1.16e-02, grad_scale: 64.0
2024-03-09 18:01:37,481 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=42866.666666666664, ans=0.125
2024-03-09 18:01:54,136 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=42933.333333333336, ans=0.125
2024-03-09 18:01:55,711 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=42933.333333333336, ans=0.1
2024-03-09 18:02:22,732 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.75 vs. limit=22.5
2024-03-09 18:02:33,531 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=43133.333333333336, ans=0.125
2024-03-09 18:02:34,744 INFO [train.py:997] (2/4) Epoch 41, batch 150, loss[loss=0.1384, simple_loss=0.23, pruned_loss=0.02337, over 24168.00 frames. ], tot_loss[loss=0.1376, simple_loss=0.2285, pruned_loss=0.02335, over 2507465.28 frames. ], batch size: 326, lr: 1.16e-02, grad_scale: 64.0
2024-03-09 18:03:28,794 INFO [train.py:997] (2/4) Epoch 42, batch 0, loss[loss=0.134, simple_loss=0.2254, pruned_loss=0.02131, over 24201.00 frames. ], tot_loss[loss=0.134, simple_loss=0.2254, pruned_loss=0.02131, over 24201.00 frames. ], batch size: 295, lr: 1.14e-02, grad_scale: 64.0
2024-03-09 18:03:28,794 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 18:03:38,341 INFO [train.py:1029] (2/4) Epoch 42, validation: loss=0.2135, simple_loss=0.3075, pruned_loss=0.05972, over 452978.00 frames. 
2024-03-09 18:03:38,342 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 18:04:03,016 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=43253.333333333336, ans=0.1
2024-03-09 18:04:23,937 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=43320.0, ans=15.0
2024-03-09 18:04:26,993 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.11 vs. limit=15.0
2024-03-09 18:04:29,006 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.865e+01 6.812e+01 7.244e+01 8.018e+01 1.063e+02, threshold=1.449e+02, percent-clipped=0.0
2024-03-09 18:04:33,174 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=5.04 vs. limit=12.0
2024-03-09 18:04:39,615 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.16 vs. limit=22.5
2024-03-09 18:04:58,778 INFO [train.py:997] (2/4) Epoch 42, batch 50, loss[loss=0.1374, simple_loss=0.2325, pruned_loss=0.02117, over 24209.00 frames. ], tot_loss[loss=0.1344, simple_loss=0.224, pruned_loss=0.0224, over 1063053.21 frames. ], batch size: 327, lr: 1.14e-02, grad_scale: 64.0
2024-03-09 18:05:06,716 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=43520.0, ans=0.0014086956521739136
2024-03-09 18:05:44,548 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.46 vs. limit=22.5
2024-03-09 18:05:57,777 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=43720.0, ans=0.125
2024-03-09 18:06:20,953 INFO [train.py:997] (2/4) Epoch 42, batch 100, loss[loss=0.1233, simple_loss=0.2109, pruned_loss=0.01781, over 23717.00 frames. ], tot_loss[loss=0.1346, simple_loss=0.2245, pruned_loss=0.02238, over 1878545.59 frames. ], batch size: 116, lr: 1.14e-02, grad_scale: 64.0
2024-03-09 18:06:23,450 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.95 vs. limit=10.0
2024-03-09 18:07:09,738 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.750e+01 6.712e+01 7.266e+01 7.977e+01 1.080e+02, threshold=1.453e+02, percent-clipped=0.0
2024-03-09 18:07:24,501 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=44120.0, ans=0.125
2024-03-09 18:07:39,996 INFO [train.py:997] (2/4) Epoch 42, batch 150, loss[loss=0.137, simple_loss=0.223, pruned_loss=0.02552, over 19881.00 frames. ], tot_loss[loss=0.1349, simple_loss=0.2254, pruned_loss=0.02216, over 2514217.23 frames. ], batch size: 59, lr: 1.14e-02, grad_scale: 64.0
2024-03-09 18:07:40,262 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=44186.666666666664, ans=0.125
2024-03-09 18:08:31,608 INFO [train.py:997] (2/4) Epoch 43, batch 0, loss[loss=0.1316, simple_loss=0.218, pruned_loss=0.02257, over 24324.00 frames. ], tot_loss[loss=0.1316, simple_loss=0.218, pruned_loss=0.02257, over 24324.00 frames. ], batch size: 208, lr: 1.12e-02, grad_scale: 64.0
2024-03-09 18:08:31,608 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 18:08:41,004 INFO [train.py:1029] (2/4) Epoch 43, validation: loss=0.2134, simple_loss=0.3077, pruned_loss=0.05952, over 452978.00 frames. 
2024-03-09 18:08:41,005 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 18:09:23,371 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=44373.333333333336, ans=0.125
2024-03-09 18:09:36,220 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.63 vs. limit=22.5
2024-03-09 18:09:46,538 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=44506.666666666664, ans=0.125
2024-03-09 18:10:01,379 INFO [train.py:997] (2/4) Epoch 43, batch 50, loss[loss=0.1368, simple_loss=0.2225, pruned_loss=0.02552, over 24062.00 frames. ], tot_loss[loss=0.1347, simple_loss=0.2255, pruned_loss=0.02198, over 1072218.91 frames. ], batch size: 176, lr: 1.12e-02, grad_scale: 64.0
2024-03-09 18:10:22,500 INFO [scaling.py:1023] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.55 vs. limit=5.0
2024-03-09 18:10:36,518 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.916e+01 6.864e+01 7.263e+01 8.155e+01 1.054e+02, threshold=1.453e+02, percent-clipped=0.0
2024-03-09 18:10:40,006 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=44706.666666666664, ans=0.2
2024-03-09 18:10:43,434 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0
2024-03-09 18:10:46,091 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=44773.333333333336, ans=0.125
2024-03-09 18:10:46,106 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=44773.333333333336, ans=0.0
2024-03-09 18:10:49,207 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=44773.333333333336, ans=0.05
2024-03-09 18:11:06,743 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.18 vs. limit=6.0
2024-03-09 18:11:19,226 INFO [train.py:997] (2/4) Epoch 43, batch 100, loss[loss=0.1264, simple_loss=0.2111, pruned_loss=0.02089, over 23618.00 frames. ], tot_loss[loss=0.1352, simple_loss=0.2259, pruned_loss=0.02223, over 1893635.62 frames. ], batch size: 128, lr: 1.12e-02, grad_scale: 64.0
2024-03-09 18:11:55,132 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=45040.0, ans=0.125
2024-03-09 18:12:38,760 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.08 vs. limit=12.0
2024-03-09 18:12:40,866 INFO [train.py:997] (2/4) Epoch 43, batch 150, loss[loss=0.1335, simple_loss=0.2295, pruned_loss=0.01875, over 24175.00 frames. ], tot_loss[loss=0.1369, simple_loss=0.2276, pruned_loss=0.02307, over 2526801.06 frames. ], batch size: 366, lr: 1.12e-02, grad_scale: 32.0
2024-03-09 18:13:36,397 INFO [train.py:997] (2/4) Epoch 44, batch 0, loss[loss=0.1362, simple_loss=0.2244, pruned_loss=0.02402, over 24256.00 frames. ], tot_loss[loss=0.1362, simple_loss=0.2244, pruned_loss=0.02402, over 24256.00 frames. ], batch size: 198, lr: 1.10e-02, grad_scale: 32.0
2024-03-09 18:13:36,397 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 18:13:45,433 INFO [train.py:1029] (2/4) Epoch 44, validation: loss=0.2121, simple_loss=0.3064, pruned_loss=0.05891, over 452978.00 frames. 
2024-03-09 18:13:45,433 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 18:14:19,829 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.880e+01 6.918e+01 7.525e+01 8.097e+01 1.200e+02, threshold=1.505e+02, percent-clipped=0.0
2024-03-09 18:14:20,284 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=45360.0, ans=0.125
2024-03-09 18:14:40,642 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=45493.333333333336, ans=0.125
2024-03-09 18:14:48,183 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=45493.333333333336, ans=0.1
2024-03-09 18:15:08,394 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=1.043e-02
2024-03-09 18:15:12,600 INFO [train.py:997] (2/4) Epoch 44, batch 50, loss[loss=0.1366, simple_loss=0.2257, pruned_loss=0.0237, over 24213.00 frames. ], tot_loss[loss=0.1363, simple_loss=0.2266, pruned_loss=0.02297, over 1070982.79 frames. ], batch size: 241, lr: 1.10e-02, grad_scale: 32.0
2024-03-09 18:15:22,049 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=45626.666666666664, ans=0.1
2024-03-09 18:15:31,433 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=45693.333333333336, ans=0.125
2024-03-09 18:15:48,257 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=45760.0, ans=0.125
2024-03-09 18:16:00,464 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=45826.666666666664, ans=0.125
2024-03-09 18:16:03,549 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=45826.666666666664, ans=0.0
2024-03-09 18:16:14,469 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=45893.333333333336, ans=0.125
2024-03-09 18:16:30,709 INFO [train.py:997] (2/4) Epoch 44, batch 100, loss[loss=0.129, simple_loss=0.2209, pruned_loss=0.01858, over 24261.00 frames. ], tot_loss[loss=0.1355, simple_loss=0.2263, pruned_loss=0.02232, over 1879870.47 frames. ], batch size: 198, lr: 1.10e-02, grad_scale: 16.0
2024-03-09 18:16:49,127 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=46026.666666666664, ans=0.2
2024-03-09 18:17:01,036 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.671e+01 6.824e+01 7.356e+01 8.103e+01 1.148e+02, threshold=1.471e+02, percent-clipped=0.0
2024-03-09 18:17:09,048 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=46093.333333333336, ans=0.0
2024-03-09 18:17:24,195 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=46160.0, ans=0.125
2024-03-09 18:17:34,646 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=46226.666666666664, ans=0.125
2024-03-09 18:17:41,283 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=46226.666666666664, ans=0.125
2024-03-09 18:17:44,868 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=46226.666666666664, ans=0.125
2024-03-09 18:17:51,570 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.40 vs. limit=15.0
2024-03-09 18:17:51,977 INFO [train.py:997] (2/4) Epoch 44, batch 150, loss[loss=0.1372, simple_loss=0.2233, pruned_loss=0.0256, over 24294.00 frames. ], tot_loss[loss=0.1353, simple_loss=0.227, pruned_loss=0.0218, over 2514145.75 frames. ], batch size: 188, lr: 1.10e-02, grad_scale: 16.0
2024-03-09 18:17:53,657 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=46293.333333333336, ans=0.125
2024-03-09 18:18:43,510 INFO [train.py:997] (2/4) Epoch 45, batch 0, loss[loss=0.1288, simple_loss=0.2152, pruned_loss=0.02118, over 24274.00 frames. ], tot_loss[loss=0.1288, simple_loss=0.2152, pruned_loss=0.02118, over 24274.00 frames. ], batch size: 229, lr: 1.09e-02, grad_scale: 32.0
2024-03-09 18:18:43,510 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 18:18:53,094 INFO [train.py:1029] (2/4) Epoch 45, validation: loss=0.2137, simple_loss=0.3089, pruned_loss=0.05927, over 452978.00 frames. 
2024-03-09 18:18:53,095 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 18:19:05,780 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=46346.666666666664, ans=0.1
2024-03-09 18:19:23,120 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=22.5
2024-03-09 18:19:32,732 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.85 vs. limit=15.0
2024-03-09 18:19:41,266 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=46546.666666666664, ans=0.1
2024-03-09 18:19:46,926 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=46546.666666666664, ans=0.2
2024-03-09 18:19:57,743 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=46546.666666666664, ans=0.05
2024-03-09 18:20:01,403 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.69 vs. limit=15.0
2024-03-09 18:20:02,439 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=46613.333333333336, ans=0.0007362318840579696
2024-03-09 18:20:16,257 INFO [train.py:997] (2/4) Epoch 45, batch 50, loss[loss=0.1174, simple_loss=0.2083, pruned_loss=0.01326, over 23907.00 frames. ], tot_loss[loss=0.1332, simple_loss=0.2231, pruned_loss=0.02163, over 1066649.49 frames. ], batch size: 142, lr: 1.08e-02, grad_scale: 32.0
2024-03-09 18:20:25,682 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=46680.0, ans=0.1
2024-03-09 18:20:29,931 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.843e+01 6.817e+01 7.386e+01 8.152e+01 1.203e+02, threshold=1.477e+02, percent-clipped=0.0
2024-03-09 18:20:42,431 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=46746.666666666664, ans=0.125
2024-03-09 18:20:45,564 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=46813.333333333336, ans=0.125
2024-03-09 18:20:53,107 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=46813.333333333336, ans=0.125
2024-03-09 18:21:02,448 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=46880.0, ans=0.0
2024-03-09 18:21:35,457 INFO [train.py:997] (2/4) Epoch 45, batch 100, loss[loss=0.1478, simple_loss=0.2458, pruned_loss=0.02492, over 23684.00 frames. ], tot_loss[loss=0.134, simple_loss=0.2252, pruned_loss=0.02144, over 1890075.48 frames. ], batch size: 486, lr: 1.08e-02, grad_scale: 32.0
2024-03-09 18:21:37,217 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=47013.333333333336, ans=0.0
2024-03-09 18:21:53,417 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.55 vs. limit=15.0
2024-03-09 18:22:03,386 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=47080.0, ans=0.2
2024-03-09 18:22:10,049 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.50 vs. limit=22.5
2024-03-09 18:22:16,801 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=47146.666666666664, ans=0.035
2024-03-09 18:22:24,506 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=47213.333333333336, ans=0.125
2024-03-09 18:22:42,566 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.26 vs. limit=15.0
2024-03-09 18:22:46,486 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=47280.0, ans=0.2
2024-03-09 18:22:55,744 INFO [train.py:997] (2/4) Epoch 45, batch 150, loss[loss=0.1664, simple_loss=0.254, pruned_loss=0.03941, over 23212.00 frames. ], tot_loss[loss=0.1359, simple_loss=0.2269, pruned_loss=0.02249, over 2518070.61 frames. ], batch size: 534, lr: 1.08e-02, grad_scale: 16.0
2024-03-09 18:23:43,806 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=47400.0, ans=0.0
2024-03-09 18:23:50,623 INFO [train.py:997] (2/4) Epoch 46, batch 0, loss[loss=0.1432, simple_loss=0.2268, pruned_loss=0.02982, over 24129.00 frames. ], tot_loss[loss=0.1432, simple_loss=0.2268, pruned_loss=0.02982, over 24129.00 frames. ], batch size: 165, lr: 1.07e-02, grad_scale: 16.0
2024-03-09 18:23:50,624 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 18:24:00,486 INFO [train.py:1029] (2/4) Epoch 46, validation: loss=0.2142, simple_loss=0.3085, pruned_loss=0.05997, over 452978.00 frames. 
2024-03-09 18:24:00,487 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 28212MB
2024-03-09 18:24:05,179 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.866e+01 6.849e+01 7.495e+01 7.996e+01 1.078e+02, threshold=1.499e+02, percent-clipped=0.0
2024-03-09 18:24:07,569 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=22.5
2024-03-09 18:24:16,641 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.59 vs. limit=15.0
2024-03-09 18:24:29,570 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=47466.666666666664, ans=0.2
2024-03-09 18:25:25,829 INFO [train.py:997] (2/4) Epoch 46, batch 50, loss[loss=0.1487, simple_loss=0.246, pruned_loss=0.02576, over 23774.00 frames. ], tot_loss[loss=0.1351, simple_loss=0.2257, pruned_loss=0.02229, over 1056332.04 frames. ], batch size: 447, lr: 1.07e-02, grad_scale: 16.0
2024-03-09 18:25:39,126 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.67 vs. limit=15.0
2024-03-09 18:26:24,958 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=47933.333333333336, ans=0.0
2024-03-09 18:26:37,848 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=48000.0, ans=0.125
2024-03-09 18:26:40,906 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=48000.0, ans=0.0004347826086956528
2024-03-09 18:26:45,324 INFO [train.py:997] (2/4) Epoch 46, batch 100, loss[loss=0.1348, simple_loss=0.2278, pruned_loss=0.02095, over 24272.00 frames. ], tot_loss[loss=0.1357, simple_loss=0.2265, pruned_loss=0.02243, over 1878666.24 frames. ], batch size: 281, lr: 1.06e-02, grad_scale: 16.0
2024-03-09 18:26:49,978 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.653e+01 6.627e+01 7.164e+01 7.678e+01 1.012e+02, threshold=1.433e+02, percent-clipped=0.0
2024-03-09 18:26:55,813 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.78 vs. limit=10.0
2024-03-09 18:27:27,831 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=48200.0, ans=0.0
2024-03-09 18:27:45,397 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=48266.666666666664, ans=0.125
2024-03-09 18:28:06,133 INFO [train.py:997] (2/4) Epoch 46, batch 150, loss[loss=0.1401, simple_loss=0.2273, pruned_loss=0.02641, over 24049.00 frames. ], tot_loss[loss=0.1358, simple_loss=0.2269, pruned_loss=0.02238, over 2512007.53 frames. ], batch size: 176, lr: 1.06e-02, grad_scale: 16.0
2024-03-09 18:28:07,823 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-03-09 18:29:00,570 INFO [train.py:997] (2/4) Epoch 47, batch 0, loss[loss=0.1299, simple_loss=0.2214, pruned_loss=0.0192, over 24297.00 frames. ], tot_loss[loss=0.1299, simple_loss=0.2214, pruned_loss=0.0192, over 24297.00 frames. ], batch size: 281, lr: 1.05e-02, grad_scale: 32.0
2024-03-09 18:29:00,571 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 18:29:10,390 INFO [train.py:1029] (2/4) Epoch 47, validation: loss=0.2152, simple_loss=0.3095, pruned_loss=0.06041, over 452978.00 frames. 
2024-03-09 18:29:10,391 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 28212MB
2024-03-09 18:29:11,564 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.94 vs. limit=22.5
2024-03-09 18:29:19,106 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.47 vs. limit=10.0
2024-03-09 18:29:33,277 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=48520.0, ans=0.125
2024-03-09 18:29:42,361 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=48586.666666666664, ans=0.1
2024-03-09 18:30:07,579 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=48653.333333333336, ans=0.0
2024-03-09 18:30:28,055 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.832e+01 6.822e+01 7.253e+01 7.989e+01 1.051e+02, threshold=1.451e+02, percent-clipped=0.0
2024-03-09 18:30:34,259 INFO [train.py:997] (2/4) Epoch 47, batch 50, loss[loss=0.1384, simple_loss=0.2347, pruned_loss=0.02108, over 24005.00 frames. ], tot_loss[loss=0.1334, simple_loss=0.2238, pruned_loss=0.02148, over 1081251.69 frames. ], batch size: 388, lr: 1.05e-02, grad_scale: 16.0
2024-03-09 18:30:36,114 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=48786.666666666664, ans=0.125
2024-03-09 18:30:46,476 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0
2024-03-09 18:30:47,614 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=48786.666666666664, ans=0.125
2024-03-09 18:30:56,769 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=48853.333333333336, ans=0.07
2024-03-09 18:31:02,013 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0
2024-03-09 18:31:18,828 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.03 vs. limit=22.5
2024-03-09 18:31:20,263 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.88 vs. limit=10.0
2024-03-09 18:31:25,803 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff3.min_abs, batch_count=48986.666666666664, ans=0.2
2024-03-09 18:31:53,476 INFO [train.py:997] (2/4) Epoch 47, batch 100, loss[loss=0.1376, simple_loss=0.2362, pruned_loss=0.01955, over 23933.00 frames. ], tot_loss[loss=0.1331, simple_loss=0.2243, pruned_loss=0.02098, over 1884190.11 frames. ], batch size: 387, lr: 1.05e-02, grad_scale: 8.0
2024-03-09 18:32:23,553 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0
2024-03-09 18:32:40,429 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=49253.333333333336, ans=0.125
2024-03-09 18:32:41,758 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=49253.333333333336, ans=0.125
2024-03-09 18:33:02,454 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.48 vs. limit=15.0
2024-03-09 18:33:10,429 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.939e+01 7.123e+01 7.707e+01 8.583e+01 1.160e+02, threshold=1.541e+02, percent-clipped=0.0
2024-03-09 18:33:12,794 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=49386.666666666664, ans=0.125
2024-03-09 18:33:15,548 INFO [train.py:997] (2/4) Epoch 47, batch 150, loss[loss=0.1442, simple_loss=0.2278, pruned_loss=0.03028, over 24055.00 frames. ], tot_loss[loss=0.1334, simple_loss=0.2249, pruned_loss=0.02099, over 2506885.08 frames. ], batch size: 165, lr: 1.05e-02, grad_scale: 8.0
2024-03-09 18:33:15,846 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=49453.333333333336, ans=0.07
2024-03-09 18:34:03,520 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=49506.666666666664, ans=0.125
2024-03-09 18:34:05,688 INFO [train.py:997] (2/4) Epoch 48, batch 0, loss[loss=0.1148, simple_loss=0.1943, pruned_loss=0.01763, over 23890.00 frames. ], tot_loss[loss=0.1148, simple_loss=0.1943, pruned_loss=0.01763, over 23890.00 frames. ], batch size: 117, lr: 1.03e-02, grad_scale: 16.0
2024-03-09 18:34:05,689 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 18:34:15,167 INFO [train.py:1029] (2/4) Epoch 48, validation: loss=0.2149, simple_loss=0.3083, pruned_loss=0.06081, over 452978.00 frames. 
2024-03-09 18:34:15,168 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 28212MB
2024-03-09 18:34:46,412 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=49573.333333333336, ans=9.275362318840637e-05
2024-03-09 18:35:08,324 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.65 vs. limit=12.0
2024-03-09 18:35:31,699 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=49773.333333333336, ans=0.0
2024-03-09 18:35:40,446 INFO [train.py:997] (2/4) Epoch 48, batch 50, loss[loss=0.1329, simple_loss=0.2198, pruned_loss=0.02301, over 24057.00 frames. ], tot_loss[loss=0.1334, simple_loss=0.2253, pruned_loss=0.02073, over 1074633.96 frames. ], batch size: 176, lr: 1.03e-02, grad_scale: 16.0
2024-03-09 18:35:52,080 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=22.5
2024-03-09 18:36:16,806 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=49973.333333333336, ans=0.125
2024-03-09 18:36:18,405 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=49973.333333333336, ans=0.07
2024-03-09 18:36:38,196 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=50040.0, ans=0.2
2024-03-09 18:36:39,624 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=50040.0, ans=0.0
2024-03-09 18:36:42,440 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.770e+01 6.729e+01 7.301e+01 8.005e+01 9.735e+01, threshold=1.460e+02, percent-clipped=0.0
2024-03-09 18:36:59,148 INFO [train.py:997] (2/4) Epoch 48, batch 100, loss[loss=0.1308, simple_loss=0.2182, pruned_loss=0.0217, over 24239.00 frames. ], tot_loss[loss=0.1342, simple_loss=0.226, pruned_loss=0.02118, over 1878205.65 frames. ], batch size: 229, lr: 1.03e-02, grad_scale: 16.0
2024-03-09 18:37:18,327 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.08 vs. limit=15.0
2024-03-09 18:37:25,132 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=50240.0, ans=0.125
2024-03-09 18:37:43,216 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=50306.666666666664, ans=0.5
2024-03-09 18:37:49,335 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=50373.333333333336, ans=0.1
2024-03-09 18:38:16,390 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=12.0
2024-03-09 18:38:17,397 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=50440.0, ans=0.0
2024-03-09 18:38:20,108 INFO [train.py:997] (2/4) Epoch 48, batch 150, loss[loss=0.1177, simple_loss=0.2043, pruned_loss=0.01552, over 23931.00 frames. ], tot_loss[loss=0.1333, simple_loss=0.2249, pruned_loss=0.02085, over 2510086.46 frames. ], batch size: 142, lr: 1.03e-02, grad_scale: 8.0
2024-03-09 18:38:28,107 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=50506.666666666664, ans=0.07
2024-03-09 18:38:28,138 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=50506.666666666664, ans=0.0
2024-03-09 18:39:15,082 INFO [train.py:997] (2/4) Epoch 49, batch 0, loss[loss=0.1342, simple_loss=0.2, pruned_loss=0.03417, over 23983.00 frames. ], tot_loss[loss=0.1342, simple_loss=0.2, pruned_loss=0.03417, over 23983.00 frames. ], batch size: 142, lr: 1.02e-02, grad_scale: 16.0
2024-03-09 18:39:15,083 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 18:39:24,777 INFO [train.py:1029] (2/4) Epoch 49, validation: loss=0.2171, simple_loss=0.31, pruned_loss=0.06203, over 452978.00 frames. 
2024-03-09 18:39:24,778 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 28212MB
2024-03-09 18:39:43,422 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=50560.0, ans=0.035
2024-03-09 18:39:45,477 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. limit=10.0
2024-03-09 18:39:49,595 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=50626.666666666664, ans=0.2
2024-03-09 18:39:50,271 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.68 vs. limit=22.5
2024-03-09 18:39:50,991 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=50626.666666666664, ans=0.125
2024-03-09 18:40:01,326 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.22 vs. limit=15.0
2024-03-09 18:40:22,423 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=50760.0, ans=0.1
2024-03-09 18:40:23,493 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.861e+01 6.914e+01 7.599e+01 8.430e+01 1.205e+02, threshold=1.520e+02, percent-clipped=0.0
2024-03-09 18:40:51,380 INFO [train.py:997] (2/4) Epoch 49, batch 50, loss[loss=0.1187, simple_loss=0.2148, pruned_loss=0.01126, over 22871.00 frames. ], tot_loss[loss=0.1338, simple_loss=0.2248, pruned_loss=0.0214, over 1065684.76 frames. ], batch size: 609, lr: 1.02e-02, grad_scale: 16.0
2024-03-09 18:40:58,397 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.69 vs. limit=6.0
2024-03-09 18:41:06,977 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=50960.0, ans=0.125
2024-03-09 18:41:11,068 INFO [scaling.py:1023] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.51 vs. limit=5.0
2024-03-09 18:41:19,399 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=50960.0, ans=0.0
2024-03-09 18:41:19,451 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff2.min_abs, batch_count=50960.0, ans=0.1
2024-03-09 18:41:20,993 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=51026.666666666664, ans=0.125
2024-03-09 18:41:23,461 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=15.0
2024-03-09 18:42:10,663 INFO [train.py:997] (2/4) Epoch 49, batch 100, loss[loss=0.1376, simple_loss=0.2243, pruned_loss=0.02545, over 24069.00 frames. ], tot_loss[loss=0.1337, simple_loss=0.2248, pruned_loss=0.02128, over 1887734.01 frames. ], batch size: 176, lr: 1.01e-02, grad_scale: 8.0
2024-03-09 18:42:14,001 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=51226.666666666664, ans=0.125
2024-03-09 18:42:19,178 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=12.0
2024-03-09 18:42:54,560 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.71 vs. limit=22.5
2024-03-09 18:42:58,424 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=51426.666666666664, ans=0.1
2024-03-09 18:43:04,158 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.035e+01 6.804e+01 7.380e+01 7.884e+01 1.078e+02, threshold=1.476e+02, percent-clipped=0.0
2024-03-09 18:43:12,040 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=51426.666666666664, ans=0.0
2024-03-09 18:43:30,601 INFO [train.py:997] (2/4) Epoch 49, batch 150, loss[loss=0.1402, simple_loss=0.2252, pruned_loss=0.02761, over 24089.00 frames. ], tot_loss[loss=0.1341, simple_loss=0.2252, pruned_loss=0.02148, over 2524757.47 frames. ], batch size: 176, lr: 1.01e-02, grad_scale: 8.0
2024-03-09 18:43:34,367 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=51560.0, ans=0.125
2024-03-09 18:44:22,349 INFO [train.py:997] (2/4) Epoch 50, batch 0, loss[loss=0.1177, simple_loss=0.2032, pruned_loss=0.01613, over 23802.00 frames. ], tot_loss[loss=0.1177, simple_loss=0.2032, pruned_loss=0.01613, over 23802.00 frames. ], batch size: 117, lr: 1.00e-02, grad_scale: 16.0
2024-03-09 18:44:22,349 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 18:44:31,468 INFO [zipformer.py:1858] (2/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.1928, 4.4109, 4.2274, 2.9856], device='cuda:2')
2024-03-09 18:44:31,920 INFO [train.py:1029] (2/4) Epoch 50, validation: loss=0.2164, simple_loss=0.3113, pruned_loss=0.06071, over 452978.00 frames. 
2024-03-09 18:44:31,921 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 28212MB
2024-03-09 18:44:41,739 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=51613.333333333336, ans=0.09899494936611666
2024-03-09 18:45:19,022 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=51746.666666666664, ans=0.0
2024-03-09 18:45:19,178 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=51746.666666666664, ans=0.2
2024-03-09 18:45:29,903 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=51813.333333333336, ans=0.0
2024-03-09 18:45:40,648 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=51880.0, ans=0.0
2024-03-09 18:45:40,665 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=51880.0, ans=0.1
2024-03-09 18:45:57,119 INFO [train.py:997] (2/4) Epoch 50, batch 50, loss[loss=0.1315, simple_loss=0.2219, pruned_loss=0.0205, over 24266.00 frames. ], tot_loss[loss=0.1328, simple_loss=0.2238, pruned_loss=0.02086, over 1076247.40 frames. ], batch size: 254, lr: 1.00e-02, grad_scale: 8.0
2024-03-09 18:46:37,165 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.908e+01 6.888e+01 7.222e+01 7.907e+01 1.090e+02, threshold=1.444e+02, percent-clipped=0.0
2024-03-09 18:47:12,660 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=52280.0, ans=0.2
2024-03-09 18:47:13,806 INFO [train.py:997] (2/4) Epoch 50, batch 100, loss[loss=0.1378, simple_loss=0.2215, pruned_loss=0.02701, over 23908.00 frames. ], tot_loss[loss=0.1339, simple_loss=0.2252, pruned_loss=0.02123, over 1891086.62 frames. ], batch size: 153, lr: 9.99e-03, grad_scale: 8.0
2024-03-09 18:47:18,834 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=52280.0, ans=0.125
2024-03-09 18:47:37,302 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=52346.666666666664, ans=0.0
2024-03-09 18:47:41,672 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=2.550e-03
2024-03-09 18:47:52,411 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=52413.333333333336, ans=0.2
2024-03-09 18:47:53,903 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=52413.333333333336, ans=0.125
2024-03-09 18:47:55,375 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=52413.333333333336, ans=0.1
2024-03-09 18:47:55,407 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=52413.333333333336, ans=0.2
2024-03-09 18:47:56,878 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=52413.333333333336, ans=0.125
2024-03-09 18:48:02,070 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.18 vs. limit=15.0
2024-03-09 18:48:10,607 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=52480.0, ans=0.125
2024-03-09 18:48:19,523 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=52546.666666666664, ans=0.0
2024-03-09 18:48:21,154 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=52546.666666666664, ans=0.125
2024-03-09 18:48:24,807 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.76 vs. limit=22.5
2024-03-09 18:48:27,106 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=52546.666666666664, ans=0.0
2024-03-09 18:48:36,460 INFO [train.py:997] (2/4) Epoch 50, batch 150, loss[loss=0.1327, simple_loss=0.2285, pruned_loss=0.01848, over 24183.00 frames. ], tot_loss[loss=0.1342, simple_loss=0.2258, pruned_loss=0.02127, over 2521109.74 frames. ], batch size: 345, lr: 9.97e-03, grad_scale: 8.0
2024-03-09 18:48:45,967 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=52613.333333333336, ans=0.0
2024-03-09 18:48:48,730 INFO [train.py:1248] (2/4) Done!