2024-03-09 12:56:09,179 INFO [train.py:1065] (1/4) Training started
2024-03-09 12:56:09,179 INFO [train.py:1075] (1/4) Device: cuda:1
2024-03-09 12:56:09,271 INFO [lexicon.py:168] (1/4) Loading pre-compiled data/lang_char/Linv.pt
2024-03-09 12:56:09,334 INFO [train.py:1086] (1/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '2989b0b1186fa6022932804f5b39fbb2781ebf42', 'k2-git-date': 'Fri Nov 24 11:34:10 2023', 'lhotse-version': '1.22.0.dev+git.d8ed1bbb.dirty', 'torch-version': '1.11.0+cu102', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.9', 'icefall-git-branch': 'dev/mdcc', 'icefall-git-sha1': 'f62fc7f0-clean', 'icefall-git-date': 'Sat Mar 9 12:55:42 2024', 'icefall-path': '/star-home/jinzengrui/lib/miniconda3/envs/dev39/lib/python3.9/site-packages/icefall-1.0-py3.9.egg', 'k2-path': '/star-home/jinzengrui/lib/miniconda3/envs/dev39/lib/python3.9/site-packages/k2-1.24.4.dev20231207+cuda10.2.torch1.11.0-py3.9-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/star-home/jinzengrui/lib/miniconda3/envs/dev39/lib/python3.9/site-packages/lhotse-1.22.0.dev0+git.d8ed1bbb.dirty-py3.9.egg/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-2-1207150844-f49d8c4f4-c49d5', 'IP address': '10.177.22.19'}, 'world_size': 4, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp'), 'lang_dir': PosixPath('data/lang_char'), 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 1, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 1000, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'blank_id': 0, 'vocab_size': 4852}
2024-03-09 12:56:09,334 INFO [train.py:1088] (1/4) About to create model
2024-03-09 12:56:09,982 INFO [train.py:1092] (1/4) Number of model parameters: 74470867
2024-03-09 12:56:15,130 INFO [train.py:1107] (1/4) Using DDP
2024-03-09 12:56:15,518 INFO [asr_datamodule.py:368] (1/4) About to get train cuts
2024-03-09 12:56:15,622 INFO [asr_datamodule.py:376] (1/4) About to get valid cuts
2024-03-09 12:56:15,640 INFO [asr_datamodule.py:195] (1/4) About to get Musan cuts
2024-03-09 12:56:18,711 INFO [asr_datamodule.py:200] (1/4) Enable MUSAN
2024-03-09 12:56:18,711 INFO [asr_datamodule.py:223] (1/4) Enable SpecAugment
2024-03-09 12:56:18,712 INFO [asr_datamodule.py:224] (1/4) Time warp factor: 80
2024-03-09 12:56:18,712 INFO [asr_datamodule.py:234] (1/4) Num frame mask: 10
2024-03-09 12:56:18,712 INFO [asr_datamodule.py:247] (1/4) About to create train dataset
2024-03-09 12:56:18,713 INFO [asr_datamodule.py:273] (1/4) Using DynamicBucketingSampler.
2024-03-09 12:56:19,505 INFO [asr_datamodule.py:290] (1/4) About to create train dataloader
2024-03-09 12:56:19,505 INFO [asr_datamodule.py:315] (1/4) About to create dev dataset
2024-03-09 12:56:19,801 INFO [asr_datamodule.py:332] (1/4) About to create dev dataloader
2024-03-09 12:57:18,476 INFO [train.py:997] (1/4) Epoch 1, batch 0, loss[loss=10.37, simple_loss=9.46, pruned_loss=9.124, over 23825.00 frames. ], tot_loss[loss=10.37, simple_loss=9.46, pruned_loss=9.124, over 23825.00 frames. ], batch size: 447, lr: 2.25e-02, grad_scale: 1.0
2024-03-09 12:57:18,476 INFO [train.py:1020] (1/4) Computing validation loss
2024-03-09 12:57:28,776 INFO [train.py:1029] (1/4) Epoch 1, validation: loss=10.41, simple_loss=9.49, pruned_loss=9.134, over 452978.00 frames.
2024-03-09 12:57:28,777 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 24846MB
2024-03-09 12:57:38,879 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.01 vs. limit=4.0
2024-03-09 12:57:52,244 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.247e+03 5.651e+03 5.908e+03 6.903e+03 6.981e+03, threshold=2.363e+04, percent-clipped=0.0
2024-03-09 12:57:55,249 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=28.75 vs. limit=5.016666666666667
2024-03-09 12:57:55,260 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=7.71 vs. limit=4.026666666666666
2024-03-09 12:58:04,358 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=257.97 vs. limit=7.55
2024-03-09 12:58:09,644 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=329.22 vs. limit=5.033333333333333
2024-03-09 12:58:10,358 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.724e+03 3.453e+03 5.651e+03 6.615e+03 7.215e+03, threshold=2.260e+04, percent-clipped=0.0
2024-03-09 12:58:10,725 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=133.33333333333334, ans=0.49375
2024-03-09 12:58:22,204 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=224.36 vs. limit=7.575
2024-03-09 12:58:31,336 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=375.06 vs. limit=5.1
2024-03-09 12:58:43,048 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=266.6666666666667, ans=0.4875
2024-03-09 12:58:46,108 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 9.817e+02 1.921e+03 2.306e+03 5.651e+03 7.215e+03, threshold=9.223e+03, percent-clipped=0.0
2024-03-09 12:58:54,494 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=309.74 vs. limit=7.7
2024-03-09 12:58:54,741 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=48.66 vs. limit=4.1066666666666665
2024-03-09 12:58:59,114 INFO [train.py:997] (1/4) Epoch 1, batch 50, loss[loss=1.197, simple_loss=1.073, pruned_loss=1.123, over 24219.00 frames. ], tot_loss[loss=3.851, simple_loss=3.544, pruned_loss=3.009, over 1066252.25 frames. ], batch size: 229, lr: 2.48e-02, grad_scale: 0.25
2024-03-09 12:59:00,264 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=12.84 vs. limit=3.05
2024-03-09 12:59:00,330 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.07 vs. limit=3.05
2024-03-09 12:59:18,289 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=309.93 vs. limit=7.8
2024-03-09 12:59:19,426 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=400.0, ans=5.25
2024-03-09 12:59:20,052 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten.whitening_limit, batch_count=400.0, ans=7.65
2024-03-09 12:59:36,814 INFO [scaling.py:1119] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=3.287e+00
2024-03-09 12:59:43,865 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=466.6666666666667, ans=0.478125
2024-03-09 12:59:44,324 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=219.28 vs. limit=7.675
2024-03-09 12:59:58,405 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=533.3333333333334, ans=0.475
2024-03-09 12:59:59,032 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=29.79 vs. limit=5.133333333333334
2024-03-09 12:59:59,177 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=3.08
2024-03-09 12:59:59,320 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.69 vs. limit=5.133333333333334
2024-03-09 13:00:01,916 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=533.3333333333334, ans=0.29466666666666663
2024-03-09 13:00:02,624 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=52.84 vs. limit=7.7
2024-03-09 13:00:04,666 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=142.78 vs. limit=7.9
2024-03-09 13:00:09,999 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=21.36 vs. limit=7.7
2024-03-09 13:00:13,032 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=220.54 vs. limit=7.725
2024-03-09 13:00:16,805 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=17.29 vs. limit=5.15
2024-03-09 13:00:17,876 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=600.0, ans=0.471875
2024-03-09 13:00:18,494 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=7.10 vs. limit=4.24
2024-03-09 13:00:19,516 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=600.0, ans=0.471875
2024-03-09 13:00:20,411 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=94.13 vs. limit=7.725
2024-03-09 13:00:25,692 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=90.72 vs. limit=7.95
2024-03-09 13:00:31,794 INFO [train.py:997] (1/4) Epoch 1, batch 100, loss[loss=1.008, simple_loss=0.8757, pruned_loss=1.059, over 24238.00 frames. ], tot_loss[loss=2.33, simple_loss=2.122, pruned_loss=1.945, over 1880725.02 frames. ], batch size: 229, lr: 2.70e-02, grad_scale: 0.5
2024-03-09 13:00:34,395 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=198.67 vs. limit=7.75
2024-03-09 13:00:37,049 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.046e+01 9.193e+01 2.011e+02 2.156e+03 7.215e+03, threshold=4.023e+02, percent-clipped=0.0
2024-03-09 13:00:48,465 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.11 vs. limit=4.293333333333333
2024-03-09 13:00:49,641 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=733.3333333333334, ans=5.458333333333333
2024-03-09 13:00:52,138 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=173.46 vs. limit=5.366666666666667
2024-03-09 13:00:53,243 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=733.3333333333334, ans=0.17250000000000001
2024-03-09 13:00:59,197 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=114.46 vs. limit=7.775
2024-03-09 13:01:15,054 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=146.21 vs. limit=7.8
2024-03-09 13:01:34,056 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=446.35 vs. limit=8.15
2024-03-09 13:01:39,202 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=406.36 vs. limit=7.825
2024-03-09 13:01:42,172 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.61 vs. limit=3.13
2024-03-09 13:01:42,755 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=866.6666666666666, ans=0.459375
2024-03-09 13:01:53,178 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=933.3333333333334, ans=0.3833333333333333
2024-03-09 13:01:59,299 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.80 vs. limit=5.233333333333333
2024-03-09 13:02:03,197 INFO [train.py:997] (1/4) Epoch 1, batch 150, loss[loss=0.9447, simple_loss=0.8101, pruned_loss=0.9832, over 24311.00 frames. ], tot_loss[loss=1.771, simple_loss=1.593, pruned_loss=1.563, over 2511097.42 frames. ], batch size: 281, lr: 2.93e-02, grad_scale: 0.5
2024-03-09 13:02:11,844 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1000.0, ans=0.453125
2024-03-09 13:02:11,993 INFO [scaling.py:1119] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-03-09 13:03:01,377 INFO [train.py:997] (1/4) Epoch 2, batch 0, loss[loss=0.8849, simple_loss=0.749, pruned_loss=0.966, over 23961.00 frames. ], tot_loss[loss=0.8849, simple_loss=0.749, pruned_loss=0.966, over 23961.00 frames. ], batch size: 142, lr: 2.91e-02, grad_scale: 1.0
2024-03-09 13:03:01,377 INFO [train.py:1020] (1/4) Computing validation loss
2024-03-09 13:03:11,804 INFO [train.py:1029] (1/4) Epoch 2, validation: loss=0.9516, simple_loss=0.8161, pruned_loss=0.9787, over 452978.00 frames.
2024-03-09 13:03:11,805 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 27929MB
2024-03-09 13:03:15,044 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.82 vs. limit=8.29
2024-03-09 13:03:32,295 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=60.60 vs. limit=7.92
2024-03-09 13:03:32,399 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=345.20 vs. limit=7.92
2024-03-09 13:03:48,306 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten.whitening_limit, batch_count=1186.6666666666667, ans=7.945
2024-03-09 13:03:49,347 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1186.6666666666667, ans=0.3516666666666667
2024-03-09 13:03:50,037 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=251.68 vs. limit=7.945
2024-03-09 13:03:57,157 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=176.72 vs. limit=5.593333333333334
2024-03-09 13:04:00,478 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=13.32 vs. limit=5.296666666666667
2024-03-09 13:04:03,334 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1253.3333333333333, ans=0.0718
2024-03-09 13:04:05,715 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=291.63 vs. limit=8.44
2024-03-09 13:04:08,666 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1253.3333333333333, ans=0.04608333333333334
2024-03-09 13:04:09,518 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=29.66 vs. limit=7.97
2024-03-09 13:04:14,487 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=50.84 vs. limit=7.97
2024-03-09 13:04:16,661 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=38.77 vs. limit=7.97
2024-03-09 13:04:29,156 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=13.61 vs. limit=4.5280000000000005
2024-03-09 13:04:31,668 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.992e+01 8.885e+01 1.035e+02 1.288e+02 2.193e+02, threshold=2.069e+02, percent-clipped=0.0
2024-03-09 13:04:32,525 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.69 vs. limit=5.66
2024-03-09 13:04:42,730 INFO [train.py:997] (1/4) Epoch 2, batch 50, loss[loss=0.9004, simple_loss=0.7693, pruned_loss=0.8825, over 24239.00 frames. ], tot_loss[loss=0.8917, simple_loss=0.7603, pruned_loss=0.9101, over 1065764.49 frames. ], batch size: 267, lr: 3.13e-02, grad_scale: 1.0
2024-03-09 13:04:44,748 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1386.6666666666667, ans=0.435
2024-03-09 13:04:44,864 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1386.6666666666667, ans=0.28613333333333335
2024-03-09 13:04:50,106 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1386.6666666666667, ans=0.0688
2024-03-09 13:04:54,152 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.95 vs. limit=8.54
2024-03-09 13:04:54,393 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=323.97 vs. limit=8.02
2024-03-09 13:05:01,234 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=29.83 vs. limit=8.59
2024-03-09 13:05:07,798 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1453.3333333333333, ans=0.8491333333333334
2024-03-09 13:05:08,946 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=161.42 vs. limit=8.045
2024-03-09 13:05:22,443 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1520.0, ans=0.14300000000000002
2024-03-09 13:05:30,276 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=125.83 vs. limit=8.07
2024-03-09 13:05:56,461 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.49 vs. limit=8.69
2024-03-09 13:06:06,083 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1653.3333333333333, ans=0.4225
2024-03-09 13:06:11,810 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.28 vs. limit=8.12
2024-03-09 13:06:16,174 INFO [train.py:997] (1/4) Epoch 2, batch 100, loss[loss=0.8043, simple_loss=0.6841, pruned_loss=0.7594, over 24353.00 frames. ], tot_loss[loss=0.874, simple_loss=0.7456, pruned_loss=0.8608, over 1877999.81 frames. ], batch size: 208, lr: 3.35e-02, grad_scale: 2.0
2024-03-09 13:06:16,427 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1720.0, ans=0.08925000000000001
2024-03-09 13:06:24,130 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=22.21 vs. limit=8.145
2024-03-09 13:06:26,976 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1720.0, ans=0.419375
2024-03-09 13:06:27,619 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=34.15 vs. limit=8.79
2024-03-09 13:06:29,245 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.84 vs. limit=8.79
2024-03-09 13:06:31,012 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=104.99 vs. limit=8.145
2024-03-09 13:06:31,157 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=18.97 vs. limit=5.86
2024-03-09 13:06:32,845 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.35 vs. limit=8.84
2024-03-09 13:06:38,562 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=28.39 vs. limit=8.17
2024-03-09 13:06:40,094 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.50 vs. limit=4.714666666666667
2024-03-09 13:06:45,304 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=1786.6666666666667, ans=8.84
2024-03-09 13:06:49,624 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1853.3333333333333, ans=0.413125
2024-03-09 13:06:56,328 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=1853.3333333333333, ans=0.5
2024-03-09 13:07:04,041 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=88.42 vs. limit=8.195
2024-03-09 13:07:16,781 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=33.47 vs. limit=8.22
2024-03-09 13:07:27,120 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.70 vs. limit=8.94
2024-03-09 13:07:29,925 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1986.6666666666667, ans=0.25166666666666665
2024-03-09 13:07:37,328 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=109.88 vs. limit=8.245
2024-03-09 13:07:37,966 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.386e+01 8.999e+01 1.029e+02 1.187e+02 2.200e+02, threshold=2.058e+02, percent-clipped=1.0
2024-03-09 13:07:38,319 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1986.6666666666667, ans=0.8304666666666667
2024-03-09 13:07:41,012 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.26 vs. limit=5.496666666666667
2024-03-09 13:07:46,572 INFO [train.py:997] (1/4) Epoch 2, batch 150, loss[loss=0.8377, simple_loss=0.7126, pruned_loss=0.7537, over 24138.00 frames. ], tot_loss[loss=0.8594, simple_loss=0.7321, pruned_loss=0.8241, over 2519427.22 frames. ], batch size: 345, lr: 3.57e-02, grad_scale: 2.0
2024-03-09 13:07:47,801 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.04 vs. limit=6.026666666666667
2024-03-09 13:08:38,014 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2106.6666666666665, ans=0.8262666666666667
2024-03-09 13:08:44,925 INFO [train.py:997] (1/4) Epoch 3, batch 0, loss[loss=0.7346, simple_loss=0.6204, pruned_loss=0.6731, over 23898.00 frames. ], tot_loss[loss=0.7346, simple_loss=0.6204, pruned_loss=0.6731, over 23898.00 frames. ], batch size: 117, lr: 3.42e-02, grad_scale: 4.0
2024-03-09 13:08:44,925 INFO [train.py:1020] (1/4) Computing validation loss
2024-03-09 13:08:54,189 INFO [train.py:1029] (1/4) Epoch 3, validation: loss=0.8556, simple_loss=0.7313, pruned_loss=0.7513, over 452978.00 frames.
2024-03-09 13:08:54,190 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB
2024-03-09 13:08:55,563 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.59 vs. limit=9.08
2024-03-09 13:08:58,114 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=2106.6666666666665, ans=0.12100000000000001
2024-03-09 13:09:00,517 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.27 vs. limit=9.08
2024-03-09 13:09:01,694 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2106.6666666666665, ans=0.40125
2024-03-09 13:09:07,344 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=54.60 vs. limit=9.08
2024-03-09 13:09:09,402 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.62 vs. limit=5.526666666666666
2024-03-09 13:09:17,986 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=25.71 vs. limit=8.315
2024-03-09 13:09:19,232 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2173.3333333333335, ans=0.0511
2024-03-09 13:09:23,394 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=24.67 vs. limit=8.315
2024-03-09 13:09:28,421 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=17.81 vs. limit=8.34
2024-03-09 13:09:37,284 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.48 vs. limit=4.896
2024-03-09 13:09:39,306 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=11.50 vs. limit=6.12
2024-03-09 13:09:45,326 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=2306.6666666666665, ans=0.04279166666666667
2024-03-09 13:09:50,533 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=2306.6666666666665, ans=0.08558333333333334
2024-03-09 13:10:02,421 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=2306.6666666666665, ans=0.8192666666666667
2024-03-09 13:10:26,721 INFO [train.py:997] (1/4) Epoch 3, batch 50, loss[loss=0.7942, simple_loss=0.6754, pruned_loss=0.6814, over 22499.00 frames. ], tot_loss[loss=0.8027, simple_loss=0.6816, pruned_loss=0.7063, over 1076932.44 frames. ], batch size: 85, lr: 3.63e-02, grad_scale: 4.0
2024-03-09 13:10:39,380 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2440.0, ans=0.042375
2024-03-09 13:10:42,807 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2506.6666666666665, ans=0.3825
2024-03-09 13:10:42,887 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2506.6666666666665, ans=0.3825
2024-03-09 13:10:45,645 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=2506.6666666666665, ans=8.44
2024-03-09 13:10:55,717 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.72 vs. limit=6.253333333333333
2024-03-09 13:10:55,872 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.31 vs. limit=9.379999999999999
2024-03-09 13:11:02,858 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.58 vs. limit=9.43
2024-03-09 13:11:18,968 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2640.0, ans=0.2736
2024-03-09 13:11:24,665 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=5.056
2024-03-09 13:11:29,750 INFO [scaling.py:1119] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=6.131e+00
2024-03-09 13:11:34,501 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 9.376e+01 1.355e+02 1.829e+02 2.456e+02 5.542e+02, threshold=3.657e+02, percent-clipped=39.0
2024-03-09 13:11:42,523 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.63 vs. limit=5.676666666666667
2024-03-09 13:11:55,487 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2773.3333333333335, ans=0.8029333333333334
2024-03-09 13:11:56,956 INFO [train.py:997] (1/4) Epoch 3, batch 100, loss[loss=0.7542, simple_loss=0.6496, pruned_loss=0.5993, over 23998.00 frames. ], tot_loss[loss=0.7724, simple_loss=0.6588, pruned_loss=0.656, over 1892246.34 frames. ], batch size: 388, lr: 3.84e-02, grad_scale: 8.0
2024-03-09 13:11:57,285 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2773.3333333333335, ans=0.37
2024-03-09 13:12:12,529 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=2840.0, ans=0.366875
2024-03-09 13:12:17,142 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=2840.0, ans=0.03609999999999999
2024-03-09 13:12:47,099 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2906.6666666666665, ans=0.08183333333333334
2024-03-09 13:13:25,255 INFO [train.py:997] (1/4) Epoch 3, batch 150, loss[loss=0.5973, simple_loss=0.5294, pruned_loss=0.4183, over 24191.00 frames. ], tot_loss[loss=0.72, simple_loss=0.6198, pruned_loss=0.5835, over 2520234.54 frames. ], batch size: 295, lr: 4.05e-02, grad_scale: 8.0
2024-03-09 13:14:27,473 INFO [train.py:997] (1/4) Epoch 4, batch 0, loss[loss=0.5233, simple_loss=0.4703, pruned_loss=0.3471, over 24085.00 frames. ], tot_loss[loss=0.5233, simple_loss=0.4703, pruned_loss=0.3471, over 24085.00 frames. ], batch size: 176, lr: 3.82e-02, grad_scale: 16.0
2024-03-09 13:14:27,473 INFO [train.py:1020] (1/4) Computing validation loss
2024-03-09 13:14:37,765 INFO [train.py:1029] (1/4) Epoch 4, validation: loss=0.515, simple_loss=0.4763, pruned_loss=0.3039, over 452978.00 frames.
2024-03-09 13:14:37,766 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB
2024-03-09 13:15:14,089 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=19.30 vs. limit=8.735
2024-03-09 13:15:19,316 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=3293.3333333333335, ans=5.823333333333333
2024-03-09 13:15:20,227 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=3293.3333333333335, ans=0.07649999999999998
2024-03-09 13:15:31,398 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 2.775e+02 3.449e+02 4.262e+02 1.233e+03, threshold=6.899e+02, percent-clipped=36.0
2024-03-09 13:15:36,805 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=3360.0, ans=0.08000000000000002
2024-03-09 13:15:40,835 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.43 vs. limit=3.504
2024-03-09 13:15:48,471 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=3426.6666666666665, ans=0.035
2024-03-09 13:16:04,437 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=8.81
2024-03-09 13:16:07,038 INFO [train.py:997] (1/4) Epoch 4, batch 50, loss[loss=0.5094, simple_loss=0.4648, pruned_loss=0.3139, over 21668.00 frames. ], tot_loss[loss=0.5222, simple_loss=0.4714, pruned_loss=0.3379, over 1071722.45 frames. ], batch size: 718, lr: 3.92e-02, grad_scale: 8.0
2024-03-09 13:16:10,660 INFO [scaling.py:1119] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=3.090e+00
2024-03-09 13:16:20,754 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=3493.3333333333335, ans=0.7777333333333334
2024-03-09 13:16:24,113 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=3560.0, ans=0.7856
2024-03-09 13:16:35,240 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=10.75 vs. limit=10.17
2024-03-09 13:16:41,068 INFO [scaling.py:1119] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=5.874e+00
2024-03-09 13:16:45,906 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=3626.6666666666665, ans=0.03866666666666667
2024-03-09 13:16:47,725 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=3626.6666666666665, ans=0.32999999999999996
2024-03-09 13:16:54,274 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=3626.6666666666665, ans=0.32999999999999996
2024-03-09 13:16:56,465 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.32 vs. limit=8.86
2024-03-09 13:16:58,402 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.79 vs. limit=5.923333333333334
2024-03-09 13:16:59,861 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=10.69 vs. limit=10.27
2024-03-09 13:17:03,751 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=3693.3333333333335, ans=0.03833333333333333
2024-03-09 13:17:18,190 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=3760.0, ans=0.32375
2024-03-09 13:17:23,242 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=3760.0, ans=0.26239999999999997
2024-03-09 13:17:32,776 INFO [train.py:997] (1/4) Epoch 4, batch 100, loss[loss=0.513, simple_loss=0.4723, pruned_loss=0.3027, over 23816.00 frames. ], tot_loss[loss=0.4906, simple_loss=0.4491, pruned_loss=0.2996, over 1888435.16 frames. ], batch size: 447, lr: 3.92e-02, grad_scale: 8.0
2024-03-09 13:17:33,171 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=3826.6666666666665, ans=0.021666666666666667
2024-03-09 13:17:39,705 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=3826.6666666666665, ans=0.013899999999999996
2024-03-09 13:18:04,870 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.98 vs. limit=5.557333333333333
2024-03-09 13:18:07,562 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=3960.0, ans=0.26039999999999996
2024-03-09 13:18:11,789 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.97 vs. limit=5.99
2024-03-09 13:18:26,598 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 2.209e+02 2.728e+02 3.814e+02 7.926e+02, threshold=5.455e+02, percent-clipped=1.0
2024-03-09 13:18:28,555 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=4026.6666666666665, ans=0.07483333333333334
2024-03-09 13:18:34,874 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=4026.6666666666665, ans=0.31125
2024-03-09 13:18:37,066 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.20 vs. limit=6.006666666666667
2024-03-09 13:18:48,278 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=4093.3333333333335, ans=0.25906666666666667
2024-03-09 13:18:57,711 INFO [train.py:997] (1/4) Epoch 4, batch 150, loss[loss=0.4069, simple_loss=0.3903, pruned_loss=0.2035, over 24298.00 frames. ], tot_loss[loss=0.4613, simple_loss=0.4275, pruned_loss=0.2677, over 2513145.67 frames. ], batch size: 281, lr: 3.91e-02, grad_scale: 8.0
2024-03-09 13:18:59,587 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=4160.0, ans=0.305
2024-03-09 13:19:56,253 INFO [train.py:997] (1/4) Epoch 5, batch 0, loss[loss=0.3781, simple_loss=0.3667, pruned_loss=0.1803, over 24287.00 frames. ], tot_loss[loss=0.3781, simple_loss=0.3667, pruned_loss=0.1803, over 24287.00 frames. ], batch size: 198, lr: 3.65e-02, grad_scale: 16.0
2024-03-09 13:19:56,254 INFO [train.py:1020] (1/4) Computing validation loss
2024-03-09 13:20:05,951 INFO [train.py:1029] (1/4) Epoch 5, validation: loss=0.3626, simple_loss=0.3682, pruned_loss=0.1368, over 452978.00 frames.
2024-03-09 13:20:05,952 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB
2024-03-09 13:20:23,632 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.61 vs. limit=9.105
2024-03-09 13:20:36,776 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.07 vs. limit=5.712
2024-03-09 13:20:45,181 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=10.76
2024-03-09 13:20:50,793 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=4346.666666666667, ans=0.29625
2024-03-09 13:20:57,408 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=4413.333333333333, ans=0.29312499999999997
2024-03-09 13:21:01,357 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.44 vs. limit=5.765333333333333
2024-03-09 13:21:29,187 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=4546.666666666667, ans=0.04772222222222222
2024-03-09 13:21:30,488 INFO [train.py:997] (1/4) Epoch 5, batch 50, loss[loss=0.3302, simple_loss=0.3291, pruned_loss=0.1405, over 23126.00 frames. ], tot_loss[loss=0.3692, simple_loss=0.3595, pruned_loss=0.1739, over 1069308.62 frames. ], batch size: 102, lr: 3.64e-02, grad_scale: 8.0
2024-03-09 13:22:09,208 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.206e+02 1.970e+02 2.387e+02 3.231e+02 6.932e+02, threshold=4.775e+02, percent-clipped=2.0
2024-03-09 13:22:17,657 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=4746.666666666667, ans=0.07033333333333333
2024-03-09 13:22:23,030 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=4746.666666666667, ans=0.27749999999999997
2024-03-09 13:22:36,650 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.82 vs. limit=11.11
2024-03-09 13:22:37,596 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=4813.333333333333, ans=0.27437500000000004
2024-03-09 13:22:55,079 INFO [train.py:997] (1/4) Epoch 5, batch 100, loss[loss=0.3387, simple_loss=0.338, pruned_loss=0.1456, over 24269.00 frames. ], tot_loss[loss=0.3642, simple_loss=0.3568, pruned_loss=0.1682, over 1889130.24 frames. ], batch size: 254, lr: 3.64e-02, grad_scale: 8.0
2024-03-09 13:23:09,039 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.27 vs. limit=9.33
2024-03-09 13:23:09,614 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=4880.0, ans=0.2512
2024-03-09 13:23:16,102 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=4946.666666666667, ans=0.268125
2024-03-09 13:23:20,950 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=4946.666666666667, ans=0.034541666666666665
2024-03-09 13:23:47,983 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=5080.0, ans=0.2492
2024-03-09 13:23:49,545 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=5080.0, ans=0.26187499999999997
2024-03-09 13:24:03,516 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.37 vs. limit=9.43
2024-03-09 13:24:08,326 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.94 vs. limit=9.43
2024-03-09 13:24:14,089 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.57 vs. limit=9.43
2024-03-09 13:24:19,167 INFO [train.py:997] (1/4) Epoch 5, batch 150, loss[loss=0.3277, simple_loss=0.334, pruned_loss=0.1306, over 24254.00 frames. ], tot_loss[loss=0.3589, simple_loss=0.3537, pruned_loss=0.1626, over 2521890.26 frames. ], batch size: 241, lr: 3.64e-02, grad_scale: 8.0
2024-03-09 13:24:19,438 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=5213.333333333333, ans=0.04494444444444445
2024-03-09 13:25:15,977 INFO [train.py:997] (1/4) Epoch 6, batch 0, loss[loss=0.3373, simple_loss=0.3407, pruned_loss=0.1404, over 24130.00 frames. ], tot_loss[loss=0.3373, simple_loss=0.3407, pruned_loss=0.1404, over 24130.00 frames. ], batch size: 345, lr: 3.40e-02, grad_scale: 16.0
2024-03-09 13:25:15,977 INFO [train.py:1020] (1/4) Computing validation loss
2024-03-09 13:25:26,276 INFO [train.py:1029] (1/4) Epoch 6, validation: loss=0.3173, simple_loss=0.3385, pruned_loss=0.1003, over 452978.00 frames.
2024-03-09 13:25:26,277 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB
2024-03-09 13:25:36,062 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=5266.666666666667, ans=0.253125
2024-03-09 13:25:42,122 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.34 vs. limit=6.316666666666666
2024-03-09 13:25:54,056 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=5333.333333333333, ans=0.25
2024-03-09 13:25:57,154 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=5333.333333333333, ans=0.25
2024-03-09 13:26:01,730 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.162e+02 1.753e+02 2.102e+02 2.732e+02 4.816e+02, threshold=4.205e+02, percent-clipped=1.0
2024-03-09 13:26:26,306 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=5466.666666666667, ans=0.009681159420289855
2024-03-09 13:26:40,383 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=5533.333333333333, ans=0.043611111111111114
2024-03-09 13:26:41,960 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=5533.333333333333, ans=0.24466666666666667
2024-03-09 13:26:56,052 INFO [train.py:997] (1/4) Epoch 6, batch 50, loss[loss=0.2893, simple_loss=0.3015, pruned_loss=0.107, over 22945.00 frames. ], tot_loss[loss=0.3146, simple_loss=0.3227, pruned_loss=0.1237, over 1076265.39 frames. ], batch size: 85, lr: 3.40e-02, grad_scale: 16.0
2024-03-09 13:26:59,523 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=5600.0, ans=0.244
2024-03-09 13:27:10,302 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.32 vs. limit=6.4
2024-03-09 13:27:10,992 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=5666.666666666667, ans=0.24333333333333332
2024-03-09 13:27:17,459 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=5666.666666666667, ans=0.00963768115942029
2024-03-09 13:27:28,584 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=5733.333333333333, ans=0.23125
2024-03-09 13:27:28,598 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=5733.333333333333, ans=0.23125
2024-03-09 13:27:36,617 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=5733.333333333333, ans=0.24266666666666667
2024-03-09 13:27:47,700 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=5800.0, ans=0.0425
2024-03-09 13:28:03,325 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=5866.666666666667, ans=0.28800000000000003
2024-03-09 13:28:17,537 INFO [train.py:997] (1/4) Epoch 6, batch 100, loss[loss=0.2961, simple_loss=0.3108, pruned_loss=0.1088, over 24037.00 frames. ], tot_loss[loss=0.3116, simple_loss=0.3214, pruned_loss=0.1211, over 1896674.99 frames. ], batch size: 344, lr: 3.40e-02, grad_scale: 8.0
2024-03-09 13:28:29,736 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.67 vs. limit=6.483333333333333
2024-03-09 13:28:45,018 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=6.4
2024-03-09 13:28:47,226 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.031e+02 1.395e+02 1.660e+02 2.447e+02 5.591e+02, threshold=3.319e+02, percent-clipped=4.0
2024-03-09 13:29:11,885 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.65 vs. limit=6.453333333333333
2024-03-09 13:29:16,785 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.63 vs. limit=9.8
2024-03-09 13:29:19,203 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=6133.333333333333, ans=0.04111111111111111
2024-03-09 13:29:20,668 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=6133.333333333333, ans=0.23866666666666667
2024-03-09 13:29:27,543 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.41 vs. limit=12.15
2024-03-09 13:29:35,836 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.63 vs. limit=12.15
2024-03-09 13:29:40,033 INFO [train.py:997] (1/4) Epoch 6, batch 150, loss[loss=0.2978, simple_loss=0.3164, pruned_loss=0.1066, over 24215.00 frames. ], tot_loss[loss=0.307, simple_loss=0.3187, pruned_loss=0.1175, over 2528047.44 frames. ], batch size: 295, lr: 3.39e-02, grad_scale: 8.0
2024-03-09 13:29:47,694 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.61 vs. limit=12.2
2024-03-09 13:29:48,684 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=6266.666666666667, ans=0.23733333333333334
2024-03-09 13:29:49,281 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=9.85
2024-03-09 13:30:37,232 INFO [train.py:997] (1/4) Epoch 7, batch 0, loss[loss=0.2736, simple_loss=0.2921, pruned_loss=0.09633, over 24264.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.2921, pruned_loss=0.09633, over 24264.00 frames. ], batch size: 229, lr: 3.18e-02, grad_scale: 16.0
2024-03-09 13:30:37,233 INFO [train.py:1020] (1/4) Computing validation loss
2024-03-09 13:30:47,281 INFO [train.py:1029] (1/4) Epoch 7, validation: loss=0.2933, simple_loss=0.3253, pruned_loss=0.08566, over 452978.00 frames.
2024-03-09 13:30:47,282 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB
2024-03-09 13:30:52,418 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=6320.0, ans=0.060500000000000005
2024-03-09 13:31:14,019 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=6386.666666666667, ans=0.04005555555555555
2024-03-09 13:31:33,910 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.55 vs. limit=9.92
2024-03-09 13:31:34,157 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=9.92
2024-03-09 13:31:41,383 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=6453.333333333333, ans=0.09899494936611666
2024-03-09 13:31:51,690 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.61 vs. limit=9.945
2024-03-09 13:32:16,194 INFO [train.py:997] (1/4) Epoch 7, batch 50, loss[loss=0.3277, simple_loss=0.3412, pruned_loss=0.1305, over 23796.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3088, pruned_loss=0.1062, over 1064790.52 frames. ], batch size: 447, lr: 3.18e-02, grad_scale: 16.0
2024-03-09 13:32:21,719 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=6653.333333333333, ans=0.188125
2024-03-09 13:32:30,830 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.025e+02 1.360e+02 1.605e+02 1.865e+02 3.683e+02, threshold=3.211e+02, percent-clipped=2.0
2024-03-09 13:32:39,465 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=12.54
2024-03-09 13:33:08,952 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=6853.333333333333, ans=0.17875000000000002
2024-03-09 13:33:34,909 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.58 vs. limit=10.095
2024-03-09 13:33:37,046 INFO [train.py:997] (1/4) Epoch 7, batch 100, loss[loss=0.2656, simple_loss=0.2928, pruned_loss=0.08672, over 24061.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.303, pruned_loss=0.1002, over 1882038.50 frames. ], batch size: 165, lr: 3.18e-02, grad_scale: 16.0
2024-03-09 13:34:01,389 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=10.145
2024-03-09 13:34:16,821 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.78 vs. limit=4.068
2024-03-09 13:34:21,469 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.41 vs. limit=10.17
2024-03-09 13:34:44,817 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=7253.333333333333, ans=0.036444444444444446
2024-03-09 13:34:52,059 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=4.088
2024-03-09 13:34:54,602 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=7253.333333333333, ans=0.15999999999999998
2024-03-09 13:34:58,867 INFO [train.py:997] (1/4) Epoch 7, batch 150, loss[loss=0.3202, simple_loss=0.3387, pruned_loss=0.1253, over 24015.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3032, pruned_loss=0.099, over 2524063.42 frames. ], batch size: 416, lr: 3.18e-02, grad_scale: 16.0
2024-03-09 13:34:59,097 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=7320.0, ans=0.156875
2024-03-09 13:35:57,459 INFO [train.py:997] (1/4) Epoch 8, batch 0, loss[loss=0.2642, simple_loss=0.2932, pruned_loss=0.08708, over 24274.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.2932, pruned_loss=0.08708, over 24274.00 frames. ], batch size: 241, lr: 2.99e-02, grad_scale: 32.0
2024-03-09 13:35:57,460 INFO [train.py:1020] (1/4) Computing validation loss
2024-03-09 13:36:07,342 INFO [train.py:1029] (1/4) Epoch 8, validation: loss=0.2797, simple_loss=0.3212, pruned_loss=0.07915, over 452978.00 frames.
2024-03-09 13:36:07,342 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB
2024-03-09 13:36:08,859 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.023e+02 1.314e+02 1.638e+02 1.955e+02 4.296e+02, threshold=3.277e+02, percent-clipped=3.0
2024-03-09 13:36:39,018 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=7440.0, ans=0.3116
2024-03-09 13:36:39,790 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.17 vs. limit=10.29
2024-03-09 13:37:16,004 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=7640.0, ans=0.009208695652173913
2024-03-09 13:37:20,505 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=7640.0, ans=0.14187499999999997
2024-03-09 13:37:29,704 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=7706.666666666667, ans=0.13874999999999998
2024-03-09 13:37:31,062 INFO [train.py:997] (1/4) Epoch 8, batch 50, loss[loss=0.2257, simple_loss=0.2618, pruned_loss=0.0639, over 23643.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.2911, pruned_loss=0.08472, over 1078912.58 frames. ], batch size: 116, lr: 2.99e-02, grad_scale: 32.0
2024-03-09 13:37:44,616 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.28 vs. limit=4.156
2024-03-09 13:37:55,289 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.78 vs. limit=4.166
2024-03-09 13:38:24,225 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=7906.666666666667, ans=0.03372222222222222
2024-03-09 13:38:51,055 INFO [train.py:997] (1/4) Epoch 8, batch 100, loss[loss=0.2466, simple_loss=0.2786, pruned_loss=0.08102, over 24276.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.2907, pruned_loss=0.08518, over 1885374.66 frames. ], batch size: 229, lr: 2.99e-02, grad_scale: 32.0
2024-03-09 13:38:52,574 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 8.761e+01 1.115e+02 1.336e+02 1.652e+02 2.844e+02, threshold=2.672e+02, percent-clipped=0.0
2024-03-09 13:38:52,916 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=8040.0, ans=0.009121739130434783
2024-03-09 13:39:08,596 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=8106.666666666667, ans=0.032888888888888884
2024-03-09 13:39:31,373 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=8173.333333333333, ans=0.125
2024-03-09 13:39:55,225 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.12 vs. limit=10.59
2024-03-09 13:39:55,320 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.87 vs. limit=9.120000000000001
2024-03-09 13:40:00,781 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8306.666666666666, ans=0.21693333333333334
2024-03-09 13:40:03,787 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=8306.666666666666, ans=0.125
2024-03-09 13:40:07,266 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.42 vs.
limit=13.73 2024-03-09 13:40:11,303 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=8373.333333333334, ans=0.00904927536231884 2024-03-09 13:40:12,947 INFO [train.py:997] (1/4) Epoch 8, batch 150, loss[loss=0.2434, simple_loss=0.2812, pruned_loss=0.07604, over 24216.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.2906, pruned_loss=0.08413, over 2516122.09 frames. ], batch size: 241, lr: 2.99e-02, grad_scale: 16.0 2024-03-09 13:41:11,726 INFO [train.py:997] (1/4) Epoch 9, batch 0, loss[loss=0.2348, simple_loss=0.2799, pruned_loss=0.06493, over 22947.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.2799, pruned_loss=0.06493, over 22947.00 frames. ], batch size: 609, lr: 2.83e-02, grad_scale: 32.0 2024-03-09 13:41:11,726 INFO [train.py:1020] (1/4) Computing validation loss 2024-03-09 13:41:21,825 INFO [train.py:1029] (1/4) Epoch 9, validation: loss=0.2624, simple_loss=0.312, pruned_loss=0.07326, over 452978.00 frames. 2024-03-09 13:41:21,826 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB 2024-03-09 13:41:56,193 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.09 vs. limit=10.685 2024-03-09 13:41:58,790 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=8493.333333333334, ans=0.125 2024-03-09 13:42:41,967 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 9.295e+01 1.084e+02 1.217e+02 1.477e+02 3.480e+02, threshold=2.433e+02, percent-clipped=5.0 2024-03-09 13:42:48,345 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=8693.333333333334, ans=0.05 2024-03-09 13:42:51,117 INFO [train.py:997] (1/4) Epoch 9, batch 50, loss[loss=0.2174, simple_loss=0.259, pruned_loss=0.06285, over 23807.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.2796, pruned_loss=0.0732, over 1065530.57 frames. ], batch size: 117, lr: 2.83e-02, grad_scale: 32.0 2024-03-09 13:43:03,839 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=8760.0, ans=0.05 2024-03-09 13:43:06,876 INFO [scaling.py:1119] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=5.419e-03 2024-03-09 13:43:27,054 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=8893.333333333334, ans=0.125 2024-03-09 13:43:30,810 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.28 vs. 
limit=9.446666666666667 2024-03-09 13:43:31,512 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=8893.333333333334, ans=0.125 2024-03-09 13:43:39,257 INFO [scaling.py:1119] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-03-09 13:43:54,719 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=9026.666666666666, ans=0.125 2024-03-09 13:43:54,751 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=9026.666666666666, ans=0.125 2024-03-09 13:44:01,069 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=9026.666666666666, ans=0.125 2024-03-09 13:44:08,283 INFO [train.py:997] (1/4) Epoch 9, batch 100, loss[loss=0.2747, simple_loss=0.3108, pruned_loss=0.09787, over 24027.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.2844, pruned_loss=0.07658, over 1870255.05 frames. ], batch size: 416, lr: 2.83e-02, grad_scale: 32.0 2024-03-09 13:44:27,254 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=9160.0, ans=0.09899494936611666 2024-03-09 13:44:31,293 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.47 vs. limit=10.935 2024-03-09 13:44:42,377 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9226.666666666666, ans=0.20773333333333333 2024-03-09 13:44:51,431 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=9226.666666666666, ans=0.028222222222222225 2024-03-09 13:44:54,586 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=9226.666666666666, ans=0.028222222222222225 2024-03-09 13:45:00,625 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=9293.333333333334, ans=0.027944444444444445 2024-03-09 13:45:01,112 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=10.985 2024-03-09 13:45:09,998 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=9293.333333333334, ans=0.027944444444444445 2024-03-09 13:45:20,414 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 8.522e+01 1.120e+02 1.341e+02 1.607e+02 2.660e+02, threshold=2.681e+02, percent-clipped=5.0 2024-03-09 13:45:20,827 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=9360.0, ans=0.02766666666666667 2024-03-09 13:45:30,102 INFO [train.py:997] (1/4) Epoch 9, batch 150, loss[loss=0.2975, simple_loss=0.3287, pruned_loss=0.1155, over 23250.00 frames. ], tot_loss[loss=0.241, simple_loss=0.2837, pruned_loss=0.07501, over 2500237.37 frames. ], batch size: 534, lr: 2.82e-02, grad_scale: 32.0 2024-03-09 13:46:27,247 INFO [train.py:997] (1/4) Epoch 10, batch 0, loss[loss=0.2369, simple_loss=0.284, pruned_loss=0.07279, over 24125.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.284, pruned_loss=0.07279, over 24125.00 frames. 
], batch size: 326, lr: 2.69e-02, grad_scale: 32.0 2024-03-09 13:46:27,248 INFO [train.py:1020] (1/4) Computing validation loss 2024-03-09 13:46:37,027 INFO [train.py:1029] (1/4) Epoch 10, validation: loss=0.2538, simple_loss=0.3122, pruned_loss=0.07122, over 452978.00 frames. 2024-03-09 13:46:37,028 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB 2024-03-09 13:46:45,116 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=9480.0, ans=0.125 2024-03-09 13:46:58,786 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=9546.666666666666, ans=0.026888888888888893 2024-03-09 13:47:05,052 INFO [scaling.py:1119] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-03-09 13:47:23,101 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=9613.333333333334, ans=0.125 2024-03-09 13:47:27,937 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=9680.0, ans=0.0 2024-03-09 13:47:50,551 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=9746.666666666666, ans=0.125 2024-03-09 13:47:52,036 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=9746.666666666666, ans=0.125 2024-03-09 13:48:02,512 INFO [train.py:997] (1/4) Epoch 10, batch 50, loss[loss=0.2059, simple_loss=0.261, pruned_loss=0.05376, over 23183.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2714, pruned_loss=0.06369, over 1073145.30 frames. ], batch size: 102, lr: 2.68e-02, grad_scale: 32.0 2024-03-09 13:48:23,125 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=9880.0, ans=0.125 2024-03-09 13:48:30,966 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=9880.0, ans=0.025500000000000002 2024-03-09 13:48:35,694 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=9946.666666666666, ans=0.125 2024-03-09 13:48:40,389 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=9946.666666666666, ans=0.125 2024-03-09 13:48:50,347 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.85 vs. limit=15.01 2024-03-09 13:48:50,369 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.32 vs. limit=8.005333333333333 2024-03-09 13:48:56,565 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.50 vs. 
limit=11.254999999999999 2024-03-09 13:48:58,431 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 8.812e+01 1.075e+02 1.246e+02 1.479e+02 2.668e+02, threshold=2.491e+02, percent-clipped=0.0 2024-03-09 13:49:03,852 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=10013.333333333334, ans=0.5495333333333334 2024-03-09 13:49:21,969 INFO [train.py:997] (1/4) Epoch 10, batch 100, loss[loss=0.2104, simple_loss=0.2628, pruned_loss=0.06071, over 24245.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2757, pruned_loss=0.06749, over 1887774.62 frames. ], batch size: 188, lr: 2.68e-02, grad_scale: 32.0 2024-03-09 13:49:43,921 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=10213.333333333334, ans=0.125 2024-03-09 13:49:51,190 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=10213.333333333334, ans=0.19786666666666666 2024-03-09 13:50:32,411 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=10413.333333333334, ans=0.0 2024-03-09 13:50:34,633 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.55 vs. limit=11.405 2024-03-09 13:50:43,575 INFO [train.py:997] (1/4) Epoch 10, batch 150, loss[loss=0.2116, simple_loss=0.269, pruned_loss=0.05994, over 24231.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2736, pruned_loss=0.06492, over 2516540.74 frames. ], batch size: 295, lr: 2.68e-02, grad_scale: 32.0 2024-03-09 13:51:41,264 INFO [train.py:997] (1/4) Epoch 11, batch 0, loss[loss=0.272, simple_loss=0.3165, pruned_loss=0.09935, over 23261.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3165, pruned_loss=0.09935, over 23261.00 frames. ], batch size: 534, lr: 2.56e-02, grad_scale: 32.0 2024-03-09 13:51:41,265 INFO [train.py:1020] (1/4) Computing validation loss 2024-03-09 13:51:51,067 INFO [train.py:1029] (1/4) Epoch 11, validation: loss=0.2397, simple_loss=0.3066, pruned_loss=0.06689, over 452978.00 frames. 
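The WARNING lines from optim.py report Clipping_scale, five grad-norm statistics (min / 25% / median / 75% / max over recent batches), a clipping threshold, and the fraction of recently clipped batches. Throughout this log the threshold equals Clipping_scale times the median (e.g. 2.0 x 1.246e+02 ~ 2.491e+02 in the entry just above), so the clipping step presumably behaves like the sketch below. This is a hypothetical reconstruction for readers of the log, not the actual icefall optim.py; the class name, buffer length, and bookkeeping are made up.

    import torch

    class MedianGradClipper:
        """Hypothetical sketch of the quartile-based clipping reported in
        the optim.py WARNING lines; not the actual icefall implementation."""

        def __init__(self, clipping_scale=2.0, history=1000):
            self.clipping_scale = clipping_scale  # "Clipping_scale=2.0" in the log
            self.history = history                # assumed buffer length
            self.norms = []                       # recent total grad norms
            self.seen = 0
            self.clipped = 0

        def clip_(self, params):
            params = [p for p in params if p.grad is not None]
            norm = torch.norm(torch.stack([p.grad.detach().norm() for p in params]))
            self.norms = (self.norms + [norm.item()])[-self.history:]
            # the five numbers printed as "grad-norm quartiles":
            q = torch.quantile(torch.tensor(self.norms),
                               torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
            threshold = self.clipping_scale * q[2].item()  # scale * median
            self.seen += 1
            if norm.item() > threshold:
                self.clipped += 1                 # feeds "percent-clipped"
                for p in params:
                    p.grad.mul_(threshold / norm.item())
            return threshold

Under this reading, "percent-clipped" would be 100 * clipped / seen over the interval since the previous WARNING.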
2024-03-09 13:51:51,068 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB 2024-03-09 13:51:57,902 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=10533.333333333334, ans=0.022777777777777775 2024-03-09 13:52:26,177 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=10666.666666666666, ans=0.022222222222222227 2024-03-09 13:52:38,411 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 8.688e+01 1.049e+02 1.183e+02 1.464e+02 2.170e+02, threshold=2.365e+02, percent-clipped=0.0 2024-03-09 13:52:40,252 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=10733.333333333334, ans=0.125 2024-03-09 13:52:53,745 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=10733.333333333334, ans=0.5243333333333333 2024-03-09 13:53:03,029 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=10800.0, ans=0.008521739130434783 2024-03-09 13:53:18,226 INFO [train.py:997] (1/4) Epoch 11, batch 50, loss[loss=0.2736, simple_loss=0.3181, pruned_loss=0.1022, over 23282.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2732, pruned_loss=0.06346, over 1069723.89 frames. ], batch size: 534, lr: 2.56e-02, grad_scale: 32.0 2024-03-09 13:53:24,988 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=10866.666666666666, ans=0.5196666666666667 2024-03-09 13:53:28,075 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=10866.666666666666, ans=0.021388888888888895 2024-03-09 13:53:29,593 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=10866.666666666666, ans=0.125 2024-03-09 13:53:35,764 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=10933.333333333334, ans=0.04949747468305833 2024-03-09 13:53:49,337 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=11000.0, ans=0.5150000000000001 2024-03-09 13:54:03,153 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=11066.666666666666, ans=0.02055555555555556 2024-03-09 13:54:22,922 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.61 vs. limit=11.675 2024-03-09 13:54:38,699 INFO [train.py:997] (1/4) Epoch 11, batch 100, loss[loss=0.193, simple_loss=0.2527, pruned_loss=0.05427, over 24239.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2701, pruned_loss=0.06122, over 1888991.39 frames. ], batch size: 188, lr: 2.55e-02, grad_scale: 32.0 2024-03-09 13:54:41,232 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.56 vs. 
limit=15.9 2024-03-09 13:54:45,238 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=11200.0, ans=0.020000000000000004 2024-03-09 13:55:23,737 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 8.186e+01 9.979e+01 1.131e+02 1.409e+02 2.515e+02, threshold=2.263e+02, percent-clipped=1.0 2024-03-09 13:55:34,867 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=11400.0, ans=0.008391304347826088 2024-03-09 13:55:40,348 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=11400.0, ans=0.01916666666666667 2024-03-09 13:55:50,869 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=11466.666666666666, ans=0.125 2024-03-09 13:55:58,235 INFO [train.py:997] (1/4) Epoch 11, batch 150, loss[loss=0.1917, simple_loss=0.2551, pruned_loss=0.05353, over 23980.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2709, pruned_loss=0.0611, over 2515892.05 frames. ], batch size: 142, lr: 2.55e-02, grad_scale: 32.0 2024-03-09 13:56:01,501 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=11533.333333333334, ans=0.49633333333333335 2024-03-09 13:56:55,632 INFO [train.py:997] (1/4) Epoch 12, batch 0, loss[loss=0.196, simple_loss=0.265, pruned_loss=0.05234, over 24281.00 frames. ], tot_loss[loss=0.196, simple_loss=0.265, pruned_loss=0.05234, over 24281.00 frames. ], batch size: 267, lr: 2.45e-02, grad_scale: 32.0 2024-03-09 13:56:55,633 INFO [train.py:1020] (1/4) Computing validation loss 2024-03-09 13:57:03,939 INFO [zipformer.py:1858] (1/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.5354, 4.5343, 4.6111, 3.9691], device='cuda:1') 2024-03-09 13:57:05,243 INFO [train.py:1029] (1/4) Epoch 12, validation: loss=0.2325, simple_loss=0.3061, pruned_loss=0.06737, over 452978.00 frames. 2024-03-09 13:57:05,243 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB 2024-03-09 13:57:26,785 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.07 vs. limit=11.870000000000001 2024-03-09 13:57:28,843 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=11653.333333333334, ans=0.125 2024-03-09 13:57:30,385 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=11653.333333333334, ans=0.125 2024-03-09 13:57:44,244 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=11720.0, ans=0.3758 2024-03-09 13:57:47,471 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=11720.0, ans=0.125 2024-03-09 13:57:49,698 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=11.895 2024-03-09 13:58:10,022 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=11786.666666666666, ans=0.01755555555555556 2024-03-09 13:58:28,233 INFO [train.py:997] (1/4) Epoch 12, batch 50, loss[loss=0.182, simple_loss=0.2514, pruned_loss=0.04753, over 23977.00 frames. 
], tot_loss[loss=0.1934, simple_loss=0.2601, pruned_loss=0.05373, over 1075353.46 frames. ], batch size: 142, lr: 2.44e-02, grad_scale: 32.0 2024-03-09 13:58:30,079 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=11920.0, ans=0.008278260869565218 2024-03-09 13:58:59,607 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 8.173e+01 9.982e+01 1.112e+02 1.363e+02 2.435e+02, threshold=2.224e+02, percent-clipped=1.0 2024-03-09 13:59:18,580 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=12120.0, ans=0.1788 2024-03-09 13:59:20,126 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=12120.0, ans=0.125 2024-03-09 13:59:43,193 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=12186.666666666666, ans=0.125 2024-03-09 13:59:49,525 INFO [train.py:997] (1/4) Epoch 12, batch 100, loss[loss=0.1956, simple_loss=0.2683, pruned_loss=0.05464, over 23946.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.265, pruned_loss=0.05564, over 1891777.75 frames. ], batch size: 153, lr: 2.44e-02, grad_scale: 32.0 2024-03-09 14:00:01,508 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.98 vs. limit=12.094999999999999 2024-03-09 14:00:07,019 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=12320.0, ans=0.125 2024-03-09 14:00:22,217 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=12386.666666666666, ans=0.015055555555555558 2024-03-09 14:00:25,431 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.75 vs. limit=12.145 2024-03-09 14:00:41,327 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=12453.333333333334, ans=0.02 2024-03-09 14:00:57,076 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=9.008 2024-03-09 14:00:59,589 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=12520.0, ans=0.46180000000000004 2024-03-09 14:01:09,128 INFO [train.py:997] (1/4) Epoch 12, batch 150, loss[loss=0.1911, simple_loss=0.2616, pruned_loss=0.05579, over 24243.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2667, pruned_loss=0.05604, over 2521167.42 frames. ], batch size: 188, lr: 2.44e-02, grad_scale: 32.0 2024-03-09 14:02:05,588 INFO [train.py:997] (1/4) Epoch 13, batch 0, loss[loss=0.1725, simple_loss=0.2522, pruned_loss=0.0419, over 22914.00 frames. ], tot_loss[loss=0.1725, simple_loss=0.2522, pruned_loss=0.0419, over 22914.00 frames. ], batch size: 608, lr: 2.34e-02, grad_scale: 32.0 2024-03-09 14:02:05,588 INFO [train.py:1020] (1/4) Computing validation loss 2024-03-09 14:02:18,485 INFO [train.py:1029] (1/4) Epoch 13, validation: loss=0.2245, simple_loss=0.307, pruned_loss=0.06618, over 452978.00 frames. 
2024-03-09 14:02:18,486 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB 2024-03-09 14:02:21,171 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=12640.0, ans=16.98 2024-03-09 14:02:24,054 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=12640.0, ans=9.056000000000001 2024-03-09 14:02:37,306 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 8.720e+01 1.064e+02 1.199e+02 1.343e+02 2.089e+02, threshold=2.398e+02, percent-clipped=0.0 2024-03-09 14:03:04,391 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=12773.333333333334, ans=0.125 2024-03-09 14:03:19,849 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=12840.0, ans=0.008078260869565217 2024-03-09 14:03:42,233 INFO [train.py:997] (1/4) Epoch 13, batch 50, loss[loss=0.1869, simple_loss=0.2708, pruned_loss=0.04913, over 24134.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2606, pruned_loss=0.05124, over 1066003.64 frames. ], batch size: 366, lr: 2.34e-02, grad_scale: 32.0 2024-03-09 14:03:44,150 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=12973.333333333334, ans=0.012611111111111108 2024-03-09 14:04:04,731 INFO [scaling.py:1119] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-03-09 14:04:14,175 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=13106.666666666666, ans=0.125 2024-03-09 14:04:17,340 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=13106.666666666666, ans=0.4412666666666667 2024-03-09 14:04:34,407 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=13173.333333333334, ans=0.16826666666666668 2024-03-09 14:05:01,304 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=13240.0, ans=0.1676 2024-03-09 14:05:04,149 INFO [train.py:997] (1/4) Epoch 13, batch 100, loss[loss=0.2096, simple_loss=0.2892, pruned_loss=0.06483, over 23880.00 frames. ], tot_loss[loss=0.185, simple_loss=0.2623, pruned_loss=0.05176, over 1880535.12 frames. ], batch size: 447, lr: 2.34e-02, grad_scale: 32.0 2024-03-09 14:05:04,459 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=13306.666666666666, ans=0.125 2024-03-09 14:05:05,375 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=9.04 vs. 
limit=11.653333333333332 2024-03-09 14:05:12,523 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=13306.666666666666, ans=0.125 2024-03-09 14:05:15,615 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=13306.666666666666, ans=0.39959999999999996 2024-03-09 14:05:17,290 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=13306.666666666666, ans=0.125 2024-03-09 14:05:24,891 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 8.077e+01 1.017e+02 1.138e+02 1.327e+02 1.773e+02, threshold=2.276e+02, percent-clipped=0.0 2024-03-09 14:05:25,187 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=13373.333333333334, ans=0.16626666666666667 2024-03-09 14:05:28,184 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=13373.333333333334, ans=0.125 2024-03-09 14:05:38,987 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=13440.0, ans=0.125 2024-03-09 14:05:40,387 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=13440.0, ans=0.125 2024-03-09 14:05:41,237 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.81 vs. limit=12.54 2024-03-09 14:05:47,203 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.91 vs. limit=17.58 2024-03-09 14:05:58,466 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=13506.666666666666, ans=0.16493333333333335 2024-03-09 14:06:06,040 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=13506.666666666666, ans=0.42726666666666674 2024-03-09 14:06:12,154 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=13573.333333333334, ans=0.16426666666666667 2024-03-09 14:06:25,348 INFO [train.py:997] (1/4) Epoch 13, batch 150, loss[loss=0.1767, simple_loss=0.2523, pruned_loss=0.05056, over 19620.00 frames. ], tot_loss[loss=0.1857, simple_loss=0.2641, pruned_loss=0.05246, over 2516891.82 frames. 
], batch size: 60, lr: 2.34e-02, grad_scale: 32.0 2024-03-09 14:06:25,715 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=13640.0, ans=0.04949747468305833 2024-03-09 14:06:27,153 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=13640.0, ans=0.007904347826086957 2024-03-09 14:06:28,643 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=13640.0, ans=0.125 2024-03-09 14:06:31,630 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=13640.0, ans=0.1636 2024-03-09 14:07:22,758 INFO [train.py:997] (1/4) Epoch 14, batch 0, loss[loss=0.1767, simple_loss=0.2581, pruned_loss=0.04766, over 24252.00 frames. ], tot_loss[loss=0.1767, simple_loss=0.2581, pruned_loss=0.04766, over 24252.00 frames. ], batch size: 267, lr: 2.25e-02, grad_scale: 32.0 2024-03-09 14:07:22,759 INFO [train.py:1020] (1/4) Computing validation loss 2024-03-09 14:07:32,051 INFO [train.py:1029] (1/4) Epoch 14, validation: loss=0.2172, simple_loss=0.3059, pruned_loss=0.06427, over 452978.00 frames. 2024-03-09 14:07:32,051 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB 2024-03-09 14:07:51,033 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=13760.0, ans=0.125 2024-03-09 14:08:21,348 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=13893.333333333334, ans=0.125 2024-03-09 14:08:53,267 INFO [train.py:997] (1/4) Epoch 14, batch 50, loss[loss=0.1844, simple_loss=0.2623, pruned_loss=0.05327, over 24091.00 frames. ], tot_loss[loss=0.1755, simple_loss=0.256, pruned_loss=0.04751, over 1072929.34 frames. ], batch size: 344, lr: 2.25e-02, grad_scale: 32.0 2024-03-09 14:08:59,478 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 8.009e+01 1.028e+02 1.152e+02 1.303e+02 2.373e+02, threshold=2.304e+02, percent-clipped=1.0 2024-03-09 14:09:04,358 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=14026.666666666666, ans=0.40906666666666675 2024-03-09 14:09:15,671 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=14093.333333333334, ans=0.125 2024-03-09 14:09:25,040 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=14160.0, ans=0.025 2024-03-09 14:09:28,187 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=14160.0, ans=0.40440000000000004 2024-03-09 14:09:34,125 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=14160.0, ans=0.125 2024-03-09 14:10:12,233 INFO [train.py:997] (1/4) Epoch 14, batch 100, loss[loss=0.162, simple_loss=0.2462, pruned_loss=0.03886, over 23056.00 frames. ], tot_loss[loss=0.1757, simple_loss=0.2565, pruned_loss=0.04744, over 1894402.86 frames. 
], batch size: 102, lr: 2.25e-02, grad_scale: 32.0 2024-03-09 14:10:38,126 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=14426.666666666666, ans=0.0065555555555555575 2024-03-09 14:10:50,405 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=14493.333333333334, ans=0.125 2024-03-09 14:11:26,650 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=14626.666666666666, ans=0.125 2024-03-09 14:11:34,685 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=14693.333333333334, ans=0.15306666666666666 2024-03-09 14:11:35,861 INFO [train.py:997] (1/4) Epoch 14, batch 150, loss[loss=0.178, simple_loss=0.2628, pruned_loss=0.04662, over 24136.00 frames. ], tot_loss[loss=0.1772, simple_loss=0.2583, pruned_loss=0.04808, over 2529748.18 frames. ], batch size: 345, lr: 2.25e-02, grad_scale: 32.0 2024-03-09 14:11:41,697 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 7.629e+01 9.664e+01 1.070e+02 1.194e+02 2.380e+02, threshold=2.140e+02, percent-clipped=1.0 2024-03-09 14:12:33,847 INFO [train.py:997] (1/4) Epoch 15, batch 0, loss[loss=0.154, simple_loss=0.2256, pruned_loss=0.04119, over 23725.00 frames. ], tot_loss[loss=0.154, simple_loss=0.2256, pruned_loss=0.04119, over 23725.00 frames. ], batch size: 116, lr: 2.17e-02, grad_scale: 32.0 2024-03-09 14:12:33,848 INFO [train.py:1020] (1/4) Computing validation loss 2024-03-09 14:12:40,136 INFO [zipformer.py:1858] (1/4) name=encoder.encoders.3.encoder.layers.3.self_attn_weights, attn_weights_entropy = tensor([1.3561, 2.4523, 2.2250, 2.2075, 2.3488, 2.2107, 2.2510, 2.4288], device='cuda:1') 2024-03-09 14:12:43,270 INFO [train.py:1029] (1/4) Epoch 15, validation: loss=0.2144, simple_loss=0.3029, pruned_loss=0.06295, over 452978.00 frames. 2024-03-09 14:12:43,271 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB 2024-03-09 14:12:45,212 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=14746.666666666666, ans=0.005222222222222218 2024-03-09 14:12:51,620 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=14746.666666666666, ans=0.15253333333333333 2024-03-09 14:13:00,049 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.77 vs. 
limit=12.373333333333333 2024-03-09 14:13:01,027 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=14813.333333333334, ans=0.125 2024-03-09 14:13:18,500 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=14880.0, ans=0.125 2024-03-09 14:13:44,983 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=14946.666666666666, ans=0.0 2024-03-09 14:14:02,147 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=15013.333333333334, ans=0.07 2024-03-09 14:14:04,892 INFO [train.py:997] (1/4) Epoch 15, batch 50, loss[loss=0.1528, simple_loss=0.2308, pruned_loss=0.03743, over 24311.00 frames. ], tot_loss[loss=0.1752, simple_loss=0.2568, pruned_loss=0.04683, over 1069987.40 frames. ], batch size: 208, lr: 2.17e-02, grad_scale: 32.0 2024-03-09 14:15:10,084 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=15346.666666666666, ans=0.125 2024-03-09 14:15:11,578 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=15346.666666666666, ans=0.14653333333333335 2024-03-09 14:15:17,216 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.89 vs. limit=13.254999999999999 2024-03-09 14:15:19,056 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 8.102e+01 1.026e+02 1.164e+02 1.400e+02 2.237e+02, threshold=2.327e+02, percent-clipped=1.0 2024-03-09 14:15:27,117 INFO [train.py:997] (1/4) Epoch 15, batch 100, loss[loss=0.171, simple_loss=0.2499, pruned_loss=0.04604, over 24286.00 frames. ], tot_loss[loss=0.1722, simple_loss=0.2552, pruned_loss=0.04464, over 1891505.42 frames. ], batch size: 198, lr: 2.17e-02, grad_scale: 32.0 2024-03-09 14:15:43,520 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=10.192 2024-03-09 14:15:46,381 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.15 vs. limit=12.74 2024-03-09 14:15:57,839 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=15546.666666666666, ans=0.0018888888888888913 2024-03-09 14:15:58,718 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.46 vs. limit=12.773333333333333 2024-03-09 14:16:09,540 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=15546.666666666666, ans=0.007489855072463768 2024-03-09 14:16:46,428 INFO [train.py:997] (1/4) Epoch 15, batch 150, loss[loss=0.164, simple_loss=0.2458, pruned_loss=0.04113, over 24250.00 frames. ], tot_loss[loss=0.1714, simple_loss=0.2542, pruned_loss=0.04426, over 2528929.90 frames. ], batch size: 229, lr: 2.16e-02, grad_scale: 32.0 2024-03-09 14:17:45,376 INFO [train.py:997] (1/4) Epoch 16, batch 0, loss[loss=0.1681, simple_loss=0.2542, pruned_loss=0.04098, over 24041.00 frames. ], tot_loss[loss=0.1681, simple_loss=0.2542, pruned_loss=0.04098, over 24041.00 frames. 
], batch size: 344, lr: 2.09e-02, grad_scale: 32.0 2024-03-09 14:17:45,376 INFO [train.py:1020] (1/4) Computing validation loss 2024-03-09 14:17:55,605 INFO [train.py:1029] (1/4) Epoch 16, validation: loss=0.2134, simple_loss=0.3039, pruned_loss=0.06146, over 452978.00 frames. 2024-03-09 14:17:55,606 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB 2024-03-09 14:17:56,575 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.52 vs. limit=13.425 2024-03-09 14:18:08,089 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=15800.0, ans=0.14200000000000002 2024-03-09 14:18:17,402 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=15866.666666666666, ans=0.125 2024-03-09 14:18:35,950 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=15933.333333333334, ans=0.00027777777777777263 2024-03-09 14:18:54,850 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.16 vs. limit=13.5 2024-03-09 14:19:03,235 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 7.463e+01 8.632e+01 1.007e+02 1.180e+02 1.868e+02, threshold=2.014e+02, percent-clipped=0.0 2024-03-09 14:19:15,293 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.31 vs. limit=19.55 2024-03-09 14:19:21,895 INFO [train.py:997] (1/4) Epoch 16, batch 50, loss[loss=0.1666, simple_loss=0.2546, pruned_loss=0.03931, over 24146.00 frames. ], tot_loss[loss=0.1671, simple_loss=0.2501, pruned_loss=0.04203, over 1073491.29 frames. ], batch size: 366, lr: 2.09e-02, grad_scale: 32.0 2024-03-09 14:19:44,428 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.68 vs. limit=13.575 2024-03-09 14:20:13,170 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=16333.333333333334, ans=0.13666666666666666 2024-03-09 14:20:15,433 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.68 vs. limit=19.75 2024-03-09 14:20:34,542 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=16400.0, ans=0.007304347826086957 2024-03-09 14:20:38,763 INFO [train.py:997] (1/4) Epoch 16, batch 100, loss[loss=0.1697, simple_loss=0.2556, pruned_loss=0.04194, over 24316.00 frames. ], tot_loss[loss=0.1693, simple_loss=0.2524, pruned_loss=0.04305, over 1887687.30 frames. ], batch size: 267, lr: 2.09e-02, grad_scale: 32.0 2024-03-09 14:21:04,515 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.76 vs. limit=13.266666666666666 2024-03-09 14:21:43,560 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 7.243e+01 8.931e+01 9.706e+01 1.091e+02 1.368e+02, threshold=1.941e+02, percent-clipped=0.0 2024-03-09 14:22:02,418 INFO [train.py:997] (1/4) Epoch 16, batch 150, loss[loss=0.177, simple_loss=0.2616, pruned_loss=0.04614, over 24078.00 frames. 
], tot_loss[loss=0.1704, simple_loss=0.2541, pruned_loss=0.04335, over 2515767.43 frames. ], batch size: 176, lr: 2.09e-02, grad_scale: 32.0 2024-03-09 14:22:06,874 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.37 vs. limit=13.8 2024-03-09 14:22:12,826 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.85 vs. limit=10.719999999999999 2024-03-09 14:23:00,939 INFO [train.py:997] (1/4) Epoch 17, batch 0, loss[loss=0.1637, simple_loss=0.246, pruned_loss=0.04069, over 24061.00 frames. ], tot_loss[loss=0.1637, simple_loss=0.246, pruned_loss=0.04069, over 24061.00 frames. ], batch size: 176, lr: 2.02e-02, grad_scale: 32.0 2024-03-09 14:23:00,940 INFO [train.py:1020] (1/4) Computing validation loss 2024-03-09 14:23:11,391 INFO [train.py:1029] (1/4) Epoch 17, validation: loss=0.215, simple_loss=0.3066, pruned_loss=0.06175, over 452978.00 frames. 2024-03-09 14:23:11,392 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB 2024-03-09 14:23:30,185 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=16920.0, ans=0.1308 2024-03-09 14:23:50,958 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=16986.666666666668, ans=0.13013333333333332 2024-03-09 14:23:51,089 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=16986.666666666668, ans=0.05 2024-03-09 14:23:52,538 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=16986.666666666668, ans=0.125 2024-03-09 14:23:56,334 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.63 vs. limit=5.548 2024-03-09 14:24:17,054 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=17053.333333333332, ans=20.29 2024-03-09 14:24:21,307 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=17120.0, ans=0.125 2024-03-09 14:24:36,336 INFO [train.py:997] (1/4) Epoch 17, batch 50, loss[loss=0.1483, simple_loss=0.2263, pruned_loss=0.03512, over 23963.00 frames. ], tot_loss[loss=0.1636, simple_loss=0.2491, pruned_loss=0.03904, over 1057614.93 frames. ], batch size: 142, lr: 2.02e-02, grad_scale: 32.0 2024-03-09 14:24:45,050 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.05 vs. 
limit=13.945 2024-03-09 14:25:01,343 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=17253.333333333332, ans=0.007118840579710146 2024-03-09 14:25:22,883 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 7.440e+01 9.326e+01 1.031e+02 1.175e+02 1.521e+02, threshold=2.062e+02, percent-clipped=0.0 2024-03-09 14:25:24,793 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=17386.666666666668, ans=0.125 2024-03-09 14:25:49,287 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=17453.333333333332, ans=0.28913333333333346 2024-03-09 14:25:57,095 INFO [train.py:997] (1/4) Epoch 17, batch 100, loss[loss=0.1461, simple_loss=0.231, pruned_loss=0.03062, over 23610.00 frames. ], tot_loss[loss=0.1667, simple_loss=0.2519, pruned_loss=0.04078, over 1879726.72 frames. ], batch size: 128, lr: 2.02e-02, grad_scale: 32.0 2024-03-09 14:26:24,787 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=17586.666666666668, ans=0.125 2024-03-09 14:26:41,750 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.64 vs. limit=14.120000000000001 2024-03-09 14:27:02,962 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.68 vs. limit=14.17 2024-03-09 14:27:09,072 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=14.17 2024-03-09 14:27:11,639 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=17786.666666666668, ans=0.0 2024-03-09 14:27:15,877 INFO [train.py:997] (1/4) Epoch 17, batch 150, loss[loss=0.1683, simple_loss=0.2543, pruned_loss=0.04115, over 24250.00 frames. ], tot_loss[loss=0.1672, simple_loss=0.2519, pruned_loss=0.04124, over 2514401.00 frames. ], batch size: 254, lr: 2.02e-02, grad_scale: 32.0 2024-03-09 14:28:12,292 INFO [train.py:997] (1/4) Epoch 18, batch 0, loss[loss=0.1704, simple_loss=0.2504, pruned_loss=0.04523, over 24059.00 frames. ], tot_loss[loss=0.1704, simple_loss=0.2504, pruned_loss=0.04523, over 24059.00 frames. ], batch size: 165, lr: 1.96e-02, grad_scale: 32.0 2024-03-09 14:28:12,293 INFO [train.py:1020] (1/4) Computing validation loss 2024-03-09 14:28:22,761 INFO [train.py:1029] (1/4) Epoch 18, validation: loss=0.213, simple_loss=0.3039, pruned_loss=0.06107, over 452978.00 frames. 
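The Whitening lines from scaling.py compare a whiteness metric of a layer's activations against a limit that itself ramps up over training, and the module only intervenes when the metric exceeds its limit (e.g. "metric=9.64 vs. limit=14.12..." above is still under its limit). One plausible metric matching the printed behaviour: split the channels into num_groups groups and, per group, measure how unevenly the covariance eigenvalues are spread, which is 1.0 for perfectly white (isotropic) activations and grows as a few directions dominate. The sketch below is written under that assumption; the exact formula in icefall's scaling.py may differ.

    import torch

    def whitening_metric(x, num_groups=1):
        """Hypothetical whiteness metric: channels_per_group times
        sum(eig^2) / sum(eig)^2 of the channel covariance, per group;
        equals 1.0 iff the covariance is isotropic (white)."""
        num_frames, num_channels = x.shape
        group = num_channels // num_groups
        worst = 0.0
        for g in range(num_groups):
            xg = x[:, g * group:(g + 1) * group]
            xg = xg - xg.mean(dim=0, keepdim=True)
            cov = (xg.t() @ xg) / num_frames
            # sum(eig^2) == ||cov||_F^2 and sum(eig) == trace(cov),
            # so no eigendecomposition is needed:
            metric = group * (cov ** 2).sum() / cov.diagonal().sum() ** 2
            worst = max(worst, metric.item())
        return worst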
2024-03-09 14:28:22,762 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB
2024-03-09 14:28:29,500 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=17906.666666666668, ans=0.12093333333333331
2024-03-09 14:28:29,584 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=17906.666666666668, ans=0.125
2024-03-09 14:28:47,863 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=17973.333333333332, ans=0.125
2024-03-09 14:29:02,779 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 7.169e+01 8.782e+01 9.645e+01 1.059e+02 1.496e+02, threshold=1.929e+02, percent-clipped=0.0
2024-03-09 14:29:28,476 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.32 vs. limit=21.08
2024-03-09 14:29:39,891 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=18173.333333333332, ans=0.0
2024-03-09 14:29:44,456 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=18240.0, ans=0.0
2024-03-09 14:29:44,479 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=18240.0, ans=0.0
2024-03-09 14:29:45,632 INFO [train.py:997] (1/4) Epoch 18, batch 50, loss[loss=0.1691, simple_loss=0.2518, pruned_loss=0.04319, over 24201.00 frames. ], tot_loss[loss=0.1609, simple_loss=0.2454, pruned_loss=0.03824, over 1072671.42 frames. ], batch size: 198, lr: 1.96e-02, grad_scale: 32.0
2024-03-09 14:29:58,028 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.02 vs. limit=9.559999999999999
2024-03-09 14:30:12,679 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=18306.666666666668, ans=0.125
2024-03-09 14:30:20,348 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=18373.333333333332, ans=0.00687536231884058
2024-03-09 14:30:29,473 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=18373.333333333332, ans=0.00687536231884058
2024-03-09 14:30:38,774 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=18440.0, ans=0.125
2024-03-09 14:31:06,276 INFO [train.py:997] (1/4) Epoch 18, batch 100, loss[loss=0.1557, simple_loss=0.2366, pruned_loss=0.03735, over 24249.00 frames. ], tot_loss[loss=0.1624, simple_loss=0.2471, pruned_loss=0.03879, over 1889100.54 frames. ], batch size: 229, lr: 1.96e-02, grad_scale: 32.0
2024-03-09 14:31:28,031 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=18640.0, ans=0.125
2024-03-09 14:31:41,829 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 7.054e+01 8.645e+01 9.593e+01 1.057e+02 1.559e+02, threshold=1.919e+02, percent-clipped=0.0
2024-03-09 14:32:26,161 INFO [train.py:997] (1/4) Epoch 18, batch 150, loss[loss=0.1594, simple_loss=0.244, pruned_loss=0.03738, over 24194.00 frames. ], tot_loss[loss=0.1648, simple_loss=0.25, pruned_loss=0.03986, over 2506233.46 frames. ], batch size: 217, lr: 1.95e-02, grad_scale: 32.0
2024-03-09 14:32:32,083 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=14.59
2024-03-09 14:32:34,086 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=18906.666666666668, ans=0.0
2024-03-09 14:33:23,372 INFO [train.py:997] (1/4) Epoch 19, batch 0, loss[loss=0.1604, simple_loss=0.2397, pruned_loss=0.04053, over 24241.00 frames. ], tot_loss[loss=0.1604, simple_loss=0.2397, pruned_loss=0.04053, over 24241.00 frames. ], batch size: 218, lr: 1.90e-02, grad_scale: 32.0
2024-03-09 14:33:23,373 INFO [train.py:1020] (1/4) Computing validation loss
2024-03-09 14:33:33,968 INFO [zipformer.py:1858] (1/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.8883, 3.6169, 3.8407, 3.5710], device='cuda:1')
2024-03-09 14:33:35,282 INFO [train.py:1029] (1/4) Epoch 19, validation: loss=0.2133, simple_loss=0.3046, pruned_loss=0.061, over 452978.00 frames.
2024-03-09 14:33:35,282 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB
2024-03-09 14:33:38,529 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=18960.0, ans=0.024620000000000003
2024-03-09 14:34:06,399 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=19026.666666666668, ans=0.125
2024-03-09 14:34:31,699 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. limit=5.8740000000000006
2024-03-09 14:34:37,314 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=19160.0, ans=0.0
2024-03-09 14:34:51,194 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=19226.666666666668, ans=0.22706666666666675
2024-03-09 14:34:55,502 INFO [train.py:997] (1/4) Epoch 19, batch 50, loss[loss=0.1665, simple_loss=0.2575, pruned_loss=0.03771, over 24037.00 frames. ], tot_loss[loss=0.1604, simple_loss=0.2457, pruned_loss=0.03756, over 1073837.29 frames. ], batch size: 388, lr: 1.90e-02, grad_scale: 32.0
2024-03-09 14:35:05,250 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=19293.333333333332, ans=0.0
2024-03-09 14:35:17,295 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 7.127e+01 8.675e+01 9.444e+01 1.046e+02 1.924e+02, threshold=1.889e+02, percent-clipped=1.0
2024-03-09 14:35:34,761 INFO [scaling.py:1119] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-03-09 14:35:42,563 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=19493.333333333332, ans=0.0
2024-03-09 14:35:47,083 INFO [scaling.py:1119] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-03-09 14:35:52,174 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.99 vs. limit=14.809999999999999
2024-03-09 14:36:16,167 INFO [train.py:997] (1/4) Epoch 19, batch 100, loss[loss=0.1616, simple_loss=0.2452, pruned_loss=0.039, over 24034.00 frames. ], tot_loss[loss=0.1598, simple_loss=0.2452, pruned_loss=0.03723, over 1890184.74 frames. ], batch size: 176, lr: 1.90e-02, grad_scale: 32.0
2024-03-09 14:36:16,430 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=19626.666666666668, ans=0.0
2024-03-09 14:36:19,714 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=19626.666666666668, ans=0.125
2024-03-09 14:36:25,888 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=19626.666666666668, ans=0.125
2024-03-09 14:36:53,042 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=19760.0, ans=0.125
2024-03-09 14:37:11,315 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=19826.666666666668, ans=0.10173333333333334
2024-03-09 14:37:12,886 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=19826.666666666668, ans=0.125
2024-03-09 14:37:19,745 INFO [scaling.py:1119] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-03-09 14:37:23,422 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.04 vs. limit=22.42
2024-03-09 14:37:25,787 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=19893.333333333332, ans=0.125
2024-03-09 14:37:36,532 INFO [train.py:997] (1/4) Epoch 19, batch 150, loss[loss=0.1599, simple_loss=0.2512, pruned_loss=0.03435, over 24267.00 frames. ], tot_loss[loss=0.1609, simple_loss=0.2464, pruned_loss=0.03773, over 2528268.60 frames. ], batch size: 267, lr: 1.89e-02, grad_scale: 32.0
2024-03-09 14:37:41,354 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=19960.0, ans=0.125
2024-03-09 14:38:31,429 INFO [train.py:997] (1/4) Epoch 20, batch 0, loss[loss=0.1634, simple_loss=0.2444, pruned_loss=0.0412, over 24271.00 frames. ], tot_loss[loss=0.1634, simple_loss=0.2444, pruned_loss=0.0412, over 24271.00 frames. ], batch size: 188, lr: 1.85e-02, grad_scale: 32.0
2024-03-09 14:38:31,430 INFO [train.py:1020] (1/4) Computing validation loss
2024-03-09 14:38:38,116 INFO [zipformer.py:1858] (1/4) name=encoder.encoders.4.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([1.6482, 3.5293, 3.4464, 2.9597], device='cuda:1')
2024-03-09 14:38:40,963 INFO [train.py:1029] (1/4) Epoch 20, validation: loss=0.2111, simple_loss=0.3031, pruned_loss=0.05952, over 452978.00 frames.
2024-03-09 14:38:40,964 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB
2024-03-09 14:38:53,199 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.462e+01 8.448e+01 9.307e+01 1.038e+02 2.078e+02, threshold=1.861e+02, percent-clipped=1.0
2024-03-09 14:39:37,888 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=20213.333333333332, ans=0.125
2024-03-09 14:39:45,510 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=20213.333333333332, ans=0.0
2024-03-09 14:39:51,629 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=20280.0, ans=0.1
2024-03-09 14:39:56,276 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=20280.0, ans=0.05
2024-03-09 14:40:03,554 INFO [train.py:997] (1/4) Epoch 20, batch 50, loss[loss=0.1501, simple_loss=0.2401, pruned_loss=0.0301, over 24247.00 frames. ], tot_loss[loss=0.161, simple_loss=0.2471, pruned_loss=0.03744, over 1078185.65 frames. ], batch size: 311, lr: 1.84e-02, grad_scale: 32.0
2024-03-09 14:40:16,089 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=20346.666666666668, ans=0.125
2024-03-09 14:40:22,268 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=20413.333333333332, ans=0.125
2024-03-09 14:40:52,250 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.37 vs. limit=15.0
2024-03-09 14:41:07,535 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=20613.333333333332, ans=0.125
2024-03-09 14:41:11,837 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=20613.333333333332, ans=0.125
2024-03-09 14:41:25,664 INFO [train.py:997] (1/4) Epoch 20, batch 100, loss[loss=0.1642, simple_loss=0.2514, pruned_loss=0.03849, over 24248.00 frames. ], tot_loss[loss=0.1601, simple_loss=0.2473, pruned_loss=0.03644, over 1887389.18 frames. ], batch size: 327, lr: 1.84e-02, grad_scale: 32.0
2024-03-09 14:41:34,817 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.448e+01 8.010e+01 8.832e+01 9.695e+01 1.353e+02, threshold=1.766e+02, percent-clipped=0.0
2024-03-09 14:42:01,170 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=20813.333333333332, ans=0.0
2024-03-09 14:42:05,632 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=20813.333333333332, ans=0.07
2024-03-09 14:42:22,169 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=20880.0, ans=0.0
2024-03-09 14:42:24,481 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.01 vs. limit=22.5
2024-03-09 14:42:28,070 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.98 vs. limit=10.0
2024-03-09 14:42:44,020 INFO [train.py:997] (1/4) Epoch 20, batch 150, loss[loss=0.1638, simple_loss=0.2437, pruned_loss=0.04201, over 24146.00 frames. ], tot_loss[loss=0.1589, simple_loss=0.2464, pruned_loss=0.03572, over 2517856.40 frames. ], batch size: 165, lr: 1.84e-02, grad_scale: 32.0
2024-03-09 14:42:53,410 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=21013.333333333332, ans=0.125
2024-03-09 14:43:39,749 INFO [train.py:997] (1/4) Epoch 21, batch 0, loss[loss=0.1518, simple_loss=0.2417, pruned_loss=0.031, over 24271.00 frames. ], tot_loss[loss=0.1518, simple_loss=0.2417, pruned_loss=0.031, over 24271.00 frames. ], batch size: 311, lr: 1.79e-02, grad_scale: 32.0
2024-03-09 14:43:39,749 INFO [train.py:1020] (1/4) Computing validation loss
2024-03-09 14:43:49,471 INFO [train.py:1029] (1/4) Epoch 21, validation: loss=0.2106, simple_loss=0.3015, pruned_loss=0.05984, over 452978.00 frames.
2024-03-09 14:43:49,472 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB
2024-03-09 14:44:14,282 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=21133.333333333332, ans=0.2
2024-03-09 14:44:41,840 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=21266.666666666668, ans=0.0
2024-03-09 14:44:44,846 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=21266.666666666668, ans=0.125
2024-03-09 14:45:04,429 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=21333.333333333332, ans=0.1
2024-03-09 14:45:10,281 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 7.134e+01 8.236e+01 9.284e+01 1.075e+02 1.651e+02, threshold=1.857e+02, percent-clipped=0.0
2024-03-09 14:45:13,756 INFO [train.py:997] (1/4) Epoch 21, batch 50, loss[loss=0.1677, simple_loss=0.2632, pruned_loss=0.03612, over 23946.00 frames. ], tot_loss[loss=0.157, simple_loss=0.2429, pruned_loss=0.03552, over 1073685.29 frames. ], batch size: 415, lr: 1.79e-02, grad_scale: 32.0
2024-03-09 14:45:18,742 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=21400.0, ans=0.2
2024-03-09 14:45:21,734 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=21400.0, ans=0.0
2024-03-09 14:45:26,645 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=21400.0, ans=0.125
2024-03-09 14:45:39,065 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=21466.666666666668, ans=0.006202898550724638
2024-03-09 14:45:46,868 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=21533.333333333332, ans=0.125
2024-03-09 14:46:02,215 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=21600.0, ans=0.125
2024-03-09 14:46:10,826 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.24 vs. limit=15.0
2024-03-09 14:46:26,514 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.77 vs. limit=6.0
2024-03-09 14:46:32,874 INFO [train.py:997] (1/4) Epoch 21, batch 100, loss[loss=0.1629, simple_loss=0.2466, pruned_loss=0.0396, over 24078.00 frames. ], tot_loss[loss=0.155, simple_loss=0.2415, pruned_loss=0.03431, over 1887899.06 frames. ], batch size: 176, lr: 1.79e-02, grad_scale: 64.0
2024-03-09 14:47:29,612 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21933.333333333332, ans=0.1
2024-03-09 14:47:31,291 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=21933.333333333332, ans=0.0
2024-03-09 14:47:32,745 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=21933.333333333332, ans=0.0
2024-03-09 14:47:38,774 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=22000.0, ans=0.125
2024-03-09 14:47:51,970 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.744e+01 8.144e+01 8.919e+01 1.026e+02 1.301e+02, threshold=1.784e+02, percent-clipped=0.0
2024-03-09 14:47:55,062 INFO [train.py:997] (1/4) Epoch 21, batch 150, loss[loss=0.1673, simple_loss=0.2482, pruned_loss=0.0432, over 23993.00 frames. ], tot_loss[loss=0.1561, simple_loss=0.243, pruned_loss=0.03461, over 2513775.58 frames. ], batch size: 153, lr: 1.79e-02, grad_scale: 64.0
2024-03-09 14:47:56,119 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.86 vs. limit=22.5
2024-03-09 14:48:00,731 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=22066.666666666668, ans=0.0
2024-03-09 14:48:51,238 INFO [train.py:997] (1/4) Epoch 22, batch 0, loss[loss=0.1484, simple_loss=0.2325, pruned_loss=0.03211, over 22914.00 frames. ], tot_loss[loss=0.1484, simple_loss=0.2325, pruned_loss=0.03211, over 22914.00 frames. ], batch size: 85, lr: 1.74e-02, grad_scale: 64.0
2024-03-09 14:48:51,238 INFO [train.py:1020] (1/4) Computing validation loss
2024-03-09 14:49:00,966 INFO [train.py:1029] (1/4) Epoch 22, validation: loss=0.2117, simple_loss=0.3028, pruned_loss=0.06033, over 452978.00 frames.
2024-03-09 14:49:00,967 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB
2024-03-09 14:49:01,389 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=22120.0, ans=10.0
2024-03-09 14:49:17,667 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.26 vs. limit=10.0
2024-03-09 14:49:39,427 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=22253.333333333332, ans=0.125
2024-03-09 14:49:52,658 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.31 vs. limit=15.0
2024-03-09 14:49:57,867 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=22320.0, ans=0.125
2024-03-09 14:50:16,999 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.40 vs. limit=22.5
2024-03-09 14:50:23,732 INFO [train.py:997] (1/4) Epoch 22, batch 50, loss[loss=0.1532, simple_loss=0.2456, pruned_loss=0.03037, over 24129.00 frames. ], tot_loss[loss=0.155, simple_loss=0.2422, pruned_loss=0.03385, over 1075865.88 frames. ], batch size: 366, lr: 1.74e-02, grad_scale: 64.0
2024-03-09 14:50:33,370 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=22453.333333333332, ans=0.125
2024-03-09 14:51:18,938 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=22653.333333333332, ans=0.125
2024-03-09 14:51:28,007 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.973e+01 8.132e+01 8.918e+01 9.986e+01 1.265e+02, threshold=1.784e+02, percent-clipped=0.0
2024-03-09 14:51:45,175 INFO [train.py:997] (1/4) Epoch 22, batch 100, loss[loss=0.1543, simple_loss=0.2422, pruned_loss=0.03322, over 24202.00 frames. ], tot_loss[loss=0.1555, simple_loss=0.2424, pruned_loss=0.03426, over 1886470.38 frames. ], batch size: 240, lr: 1.74e-02, grad_scale: 64.0
2024-03-09 14:52:06,842 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=22853.333333333332, ans=0.0
2024-03-09 14:52:16,731 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.84 vs. limit=15.0
2024-03-09 14:52:29,636 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=22920.0, ans=0.1
2024-03-09 14:52:33,709 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.23 vs. limit=22.5
2024-03-09 14:52:44,423 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.64 vs. limit=6.0
2024-03-09 14:52:51,200 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=23053.333333333332, ans=0.1
2024-03-09 14:53:03,130 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=23053.333333333332, ans=0.0
2024-03-09 14:53:03,812 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.44 vs. limit=15.0
2024-03-09 14:53:05,945 INFO [train.py:997] (1/4) Epoch 22, batch 150, loss[loss=0.1406, simple_loss=0.2324, pruned_loss=0.02439, over 22865.00 frames. ], tot_loss[loss=0.1561, simple_loss=0.2431, pruned_loss=0.03457, over 2515024.82 frames. ], batch size: 609, lr: 1.74e-02, grad_scale: 64.0
2024-03-09 14:54:00,140 INFO [train.py:997] (1/4) Epoch 23, batch 0, loss[loss=0.1577, simple_loss=0.2488, pruned_loss=0.03332, over 24210.00 frames. ], tot_loss[loss=0.1577, simple_loss=0.2488, pruned_loss=0.03332, over 24210.00 frames. ], batch size: 267, lr: 1.70e-02, grad_scale: 64.0
2024-03-09 14:54:00,140 INFO [train.py:1020] (1/4) Computing validation loss
2024-03-09 14:54:09,891 INFO [train.py:1029] (1/4) Epoch 23, validation: loss=0.2115, simple_loss=0.3036, pruned_loss=0.0597, over 452978.00 frames.
2024-03-09 14:54:09,892 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB
2024-03-09 14:54:13,408 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=23173.333333333332, ans=0.125
2024-03-09 14:54:14,995 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=23173.333333333332, ans=0.125
2024-03-09 14:54:29,516 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.67 vs. limit=22.5
2024-03-09 14:54:39,587 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=23240.0, ans=0.1
2024-03-09 14:55:02,388 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=23373.333333333332, ans=0.125
2024-03-09 14:55:05,175 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.526e+01 7.783e+01 8.704e+01 9.596e+01 1.275e+02, threshold=1.741e+02, percent-clipped=0.0
2024-03-09 14:55:08,698 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=23373.333333333332, ans=0.125
2024-03-09 14:55:33,119 INFO [train.py:997] (1/4) Epoch 23, batch 50, loss[loss=0.1521, simple_loss=0.2417, pruned_loss=0.03127, over 24067.00 frames. ], tot_loss[loss=0.1526, simple_loss=0.2394, pruned_loss=0.03291, over 1071009.25 frames. ], batch size: 344, lr: 1.70e-02, grad_scale: 64.0
2024-03-09 14:55:39,710 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=23506.666666666668, ans=0.09899494936611666
2024-03-09 14:55:58,356 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=23573.333333333332, ans=0.0
2024-03-09 14:56:48,449 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0
2024-03-09 14:56:53,726 INFO [train.py:997] (1/4) Epoch 23, batch 100, loss[loss=0.1528, simple_loss=0.2445, pruned_loss=0.03053, over 24112.00 frames. ], tot_loss[loss=0.1528, simple_loss=0.2398, pruned_loss=0.03287, over 1881783.74 frames. ], batch size: 345, lr: 1.69e-02, grad_scale: 64.0
2024-03-09 14:57:01,707 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=23840.0, ans=0.025
2024-03-09 14:57:42,083 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.72 vs. limit=10.0
2024-03-09 14:57:45,467 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.240e+01 7.813e+01 8.574e+01 9.589e+01 1.326e+02, threshold=1.715e+02, percent-clipped=0.0
2024-03-09 14:58:10,442 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=24106.666666666668, ans=0.1
2024-03-09 14:58:13,651 INFO [train.py:997] (1/4) Epoch 23, batch 150, loss[loss=0.1624, simple_loss=0.2549, pruned_loss=0.03496, over 23936.00 frames. ], tot_loss[loss=0.1526, simple_loss=0.241, pruned_loss=0.03212, over 2514340.50 frames. ], batch size: 416, lr: 1.69e-02, grad_scale: 64.0
2024-03-09 14:58:23,379 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=24173.333333333332, ans=0.125
2024-03-09 14:59:06,475 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.97 vs. limit=22.5
2024-03-09 14:59:07,183 INFO [train.py:997] (1/4) Epoch 24, batch 0, loss[loss=0.1271, simple_loss=0.2232, pruned_loss=0.0155, over 21518.00 frames. ], tot_loss[loss=0.1271, simple_loss=0.2232, pruned_loss=0.0155, over 21518.00 frames. ], batch size: 718, lr: 1.66e-02, grad_scale: 64.0
2024-03-09 14:59:07,183 INFO [train.py:1020] (1/4) Computing validation loss
2024-03-09 14:59:16,704 INFO [train.py:1029] (1/4) Epoch 24, validation: loss=0.2123, simple_loss=0.3043, pruned_loss=0.06014, over 452978.00 frames.
2024-03-09 14:59:16,705 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB
2024-03-09 14:59:46,333 INFO [scaling.py:1119] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-03-09 14:59:51,057 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24293.333333333332, ans=0.1
2024-03-09 15:00:23,425 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=24426.666666666668, ans=0.1
2024-03-09 15:00:29,679 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=24493.333333333332, ans=0.125
2024-03-09 15:00:35,099 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.63 vs. limit=6.0
2024-03-09 15:00:43,107 INFO [train.py:997] (1/4) Epoch 24, batch 50, loss[loss=0.1387, simple_loss=0.2251, pruned_loss=0.02613, over 23963.00 frames. ], tot_loss[loss=0.1492, simple_loss=0.2362, pruned_loss=0.03109, over 1060911.67 frames. ], batch size: 142, lr: 1.65e-02, grad_scale: 64.0
2024-03-09 15:00:55,750 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=24560.0, ans=0.125
2024-03-09 15:01:06,656 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=24626.666666666668, ans=0.2
2024-03-09 15:01:20,106 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.514e+01 7.866e+01 8.423e+01 9.105e+01 1.243e+02, threshold=1.685e+02, percent-clipped=0.0
2024-03-09 15:01:42,063 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=24760.0, ans=0.125
2024-03-09 15:02:03,700 INFO [train.py:997] (1/4) Epoch 24, batch 100, loss[loss=0.1456, simple_loss=0.2379, pruned_loss=0.02665, over 24251.00 frames. ], tot_loss[loss=0.1522, simple_loss=0.2399, pruned_loss=0.03229, over 1871011.64 frames. ], batch size: 281, lr: 1.65e-02, grad_scale: 64.0
2024-03-09 15:02:07,457 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.34 vs. limit=10.0
2024-03-09 15:02:33,880 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.30 vs. limit=15.0
2024-03-09 15:02:50,048 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=25026.666666666668, ans=0.125
2024-03-09 15:02:59,724 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.99 vs. limit=15.0
2024-03-09 15:03:24,737 INFO [train.py:997] (1/4) Epoch 24, batch 150, loss[loss=0.1519, simple_loss=0.2381, pruned_loss=0.03281, over 23184.00 frames. ], tot_loss[loss=0.1531, simple_loss=0.2414, pruned_loss=0.03247, over 2513651.39 frames. ], batch size: 102, lr: 1.65e-02, grad_scale: 64.0
2024-03-09 15:04:17,880 INFO [train.py:997] (1/4) Epoch 25, batch 0, loss[loss=0.1598, simple_loss=0.238, pruned_loss=0.04073, over 20207.00 frames. ], tot_loss[loss=0.1598, simple_loss=0.238, pruned_loss=0.04073, over 20207.00 frames. ], batch size: 62, lr: 1.61e-02, grad_scale: 64.0
2024-03-09 15:04:17,881 INFO [train.py:1020] (1/4) Computing validation loss
2024-03-09 15:04:27,735 INFO [train.py:1029] (1/4) Epoch 25, validation: loss=0.2123, simple_loss=0.3048, pruned_loss=0.05995, over 452978.00 frames.
2024-03-09 15:04:27,736 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB
2024-03-09 15:04:56,149 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.291e+01 7.825e+01 8.498e+01 9.317e+01 1.197e+02, threshold=1.700e+02, percent-clipped=0.0
2024-03-09 15:05:32,023 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=25546.666666666668, ans=0.125
2024-03-09 15:05:36,713 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=25546.666666666668, ans=0.2
2024-03-09 15:05:50,867 INFO [train.py:997] (1/4) Epoch 25, batch 50, loss[loss=0.1479, simple_loss=0.2436, pruned_loss=0.02613, over 23960.00 frames. ], tot_loss[loss=0.151, simple_loss=0.2397, pruned_loss=0.03115, over 1079173.03 frames. ], batch size: 387, lr: 1.61e-02, grad_scale: 64.0
2024-03-09 15:05:53,294 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. limit=6.0
2024-03-09 15:06:17,236 INFO [scaling.py:1119] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-03-09 15:06:22,408 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.37 vs. limit=10.0
2024-03-09 15:06:27,806 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=25746.666666666668, ans=0.0
2024-03-09 15:06:47,676 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=25813.333333333332, ans=0.2
2024-03-09 15:07:11,204 INFO [train.py:997] (1/4) Epoch 25, batch 100, loss[loss=0.1517, simple_loss=0.2301, pruned_loss=0.03666, over 24327.00 frames. ], tot_loss[loss=0.1507, simple_loss=0.2394, pruned_loss=0.031, over 1889729.14 frames. ], batch size: 208, lr: 1.61e-02, grad_scale: 64.0
2024-03-09 15:07:21,069 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=25946.666666666668, ans=0.0
2024-03-09 15:07:37,230 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.20 vs. limit=22.5
2024-03-09 15:07:37,665 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.239e+01 7.935e+01 8.679e+01 9.503e+01 1.168e+02, threshold=1.736e+02, percent-clipped=0.0
2024-03-09 15:07:48,719 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=26080.0, ans=0.0052
2024-03-09 15:08:31,503 INFO [train.py:997] (1/4) Epoch 25, batch 150, loss[loss=0.1486, simple_loss=0.2329, pruned_loss=0.03214, over 19692.00 frames. ], tot_loss[loss=0.1513, simple_loss=0.2401, pruned_loss=0.03131, over 2512195.76 frames. ], batch size: 59, lr: 1.61e-02, grad_scale: 64.0
2024-03-09 15:09:26,515 INFO [train.py:997] (1/4) Epoch 26, batch 0, loss[loss=0.1521, simple_loss=0.2347, pruned_loss=0.03482, over 19646.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.2347, pruned_loss=0.03482, over 19646.00 frames. ], batch size: 60, lr: 1.58e-02, grad_scale: 64.0
2024-03-09 15:09:26,515 INFO [train.py:1020] (1/4) Computing validation loss
2024-03-09 15:09:35,916 INFO [train.py:1029] (1/4) Epoch 26, validation: loss=0.2091, simple_loss=0.3013, pruned_loss=0.05842, over 452978.00 frames.
2024-03-09 15:09:35,917 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB
2024-03-09 15:09:44,132 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=26333.333333333332, ans=0.005144927536231885
2024-03-09 15:09:59,654 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=26400.0, ans=0.125
2024-03-09 15:10:05,851 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=26400.0, ans=0.0
2024-03-09 15:10:44,199 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=26600.0, ans=0.125
2024-03-09 15:10:52,940 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.76 vs. limit=22.5
2024-03-09 15:10:59,542 INFO [train.py:997] (1/4) Epoch 26, batch 50, loss[loss=0.173, simple_loss=0.2652, pruned_loss=0.04038, over 23651.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2368, pruned_loss=0.03014, over 1067992.57 frames. ], batch size: 485, lr: 1.57e-02, grad_scale: 64.0
2024-03-09 15:11:11,923 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.387e+01 7.632e+01 8.183e+01 8.952e+01 1.265e+02, threshold=1.637e+02, percent-clipped=0.0
2024-03-09 15:11:15,263 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=26733.333333333332, ans=0.0
2024-03-09 15:11:39,925 INFO [scaling.py:1119] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00
2024-03-09 15:11:40,717 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.39 vs. limit=15.0
2024-03-09 15:12:22,491 INFO [train.py:997] (1/4) Epoch 26, batch 100, loss[loss=0.1483, simple_loss=0.2318, pruned_loss=0.03234, over 24235.00 frames. ], tot_loss[loss=0.1496, simple_loss=0.2376, pruned_loss=0.03081, over 1886089.62 frames. ], batch size: 188, lr: 1.57e-02, grad_scale: 64.0
2024-03-09 15:12:22,829 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=27000.0, ans=0.0
2024-03-09 15:12:55,614 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=27133.333333333332, ans=0.125
2024-03-09 15:12:58,587 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=27133.333333333332, ans=0.0
2024-03-09 15:13:36,471 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.86 vs. limit=15.0
2024-03-09 15:13:42,261 INFO [train.py:997] (1/4) Epoch 26, batch 150, loss[loss=0.1536, simple_loss=0.2462, pruned_loss=0.03051, over 24293.00 frames. ], tot_loss[loss=0.15, simple_loss=0.2382, pruned_loss=0.03088, over 2521719.52 frames. ], batch size: 267, lr: 1.57e-02, grad_scale: 64.0
2024-03-09 15:14:38,736 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.880e+01 7.565e+01 8.210e+01 9.162e+01 1.256e+02, threshold=1.642e+02, percent-clipped=0.0
2024-03-09 15:14:38,769 INFO [train.py:997] (1/4) Epoch 27, batch 0, loss[loss=0.1496, simple_loss=0.2333, pruned_loss=0.03292, over 20536.00 frames. ], tot_loss[loss=0.1496, simple_loss=0.2333, pruned_loss=0.03292, over 20536.00 frames. ], batch size: 62, lr: 1.54e-02, grad_scale: 64.0
2024-03-09 15:14:38,770 INFO [train.py:1020] (1/4) Computing validation loss
2024-03-09 15:14:48,412 INFO [train.py:1029] (1/4) Epoch 27, validation: loss=0.2114, simple_loss=0.3031, pruned_loss=0.05987, over 452978.00 frames.
2024-03-09 15:14:48,413 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB
2024-03-09 15:15:40,921 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=27520.0, ans=0.07
2024-03-09 15:15:47,085 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=27586.666666666668, ans=0.125
2024-03-09 15:16:14,482 INFO [train.py:997] (1/4) Epoch 27, batch 50, loss[loss=0.1323, simple_loss=0.2173, pruned_loss=0.0236, over 23626.00 frames. ], tot_loss[loss=0.1503, simple_loss=0.2368, pruned_loss=0.03192, over 1066885.62 frames. ], batch size: 128, lr: 1.54e-02, grad_scale: 64.0
2024-03-09 15:16:17,983 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=27720.0, ans=0.1
2024-03-09 15:16:25,684 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=27720.0, ans=0.004843478260869565
2024-03-09 15:16:39,593 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=27786.666666666668, ans=0.125
2024-03-09 15:16:40,080 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.whiten.whitening_limit, batch_count=27786.666666666668, ans=12.0
2024-03-09 15:16:55,414 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=27853.333333333332, ans=0.2
2024-03-09 15:17:05,273 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.55 vs. limit=12.0
2024-03-09 15:17:33,756 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.709e+01 7.734e+01 8.550e+01 9.615e+01 1.355e+02, threshold=1.710e+02, percent-clipped=0.0
2024-03-09 15:17:33,785 INFO [train.py:997] (1/4) Epoch 27, batch 100, loss[loss=0.1542, simple_loss=0.25, pruned_loss=0.02917, over 24014.00 frames. ], tot_loss[loss=0.1504, simple_loss=0.2381, pruned_loss=0.03137, over 1882935.61 frames. ], batch size: 416, lr: 1.53e-02, grad_scale: 64.0
2024-03-09 15:17:39,324 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.17 vs. limit=15.0
2024-03-09 15:17:43,262 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=28053.333333333332, ans=0.1
2024-03-09 15:18:06,538 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=28120.0, ans=0.125
2024-03-09 15:18:24,802 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=28253.333333333332, ans=0.0
2024-03-09 15:18:55,714 INFO [train.py:997] (1/4) Epoch 27, batch 150, loss[loss=0.1337, simple_loss=0.2283, pruned_loss=0.01952, over 22917.00 frames. ], tot_loss[loss=0.1495, simple_loss=0.2382, pruned_loss=0.03038, over 2517897.76 frames. ], batch size: 609, lr: 1.53e-02, grad_scale: 64.0
2024-03-09 15:19:49,042 INFO [train.py:997] (1/4) Epoch 28, batch 0, loss[loss=0.1279, simple_loss=0.2179, pruned_loss=0.01895, over 22927.00 frames. ], tot_loss[loss=0.1279, simple_loss=0.2179, pruned_loss=0.01895, over 22927.00 frames. ], batch size: 609, lr: 1.50e-02, grad_scale: 64.0
2024-03-09 15:19:49,043 INFO [train.py:1020] (1/4) Computing validation loss
2024-03-09 15:19:59,336 INFO [train.py:1029] (1/4) Epoch 28, validation: loss=0.2107, simple_loss=0.3034, pruned_loss=0.05903, over 452978.00 frames.
2024-03-09 15:19:59,337 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB
2024-03-09 15:20:43,537 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=28573.333333333332, ans=0.125
2024-03-09 15:20:44,160 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.39 vs. limit=15.0
2024-03-09 15:21:11,007 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.458e+01 7.529e+01 8.136e+01 8.999e+01 1.198e+02, threshold=1.627e+02, percent-clipped=0.0
2024-03-09 15:21:23,148 INFO [train.py:997] (1/4) Epoch 28, batch 50, loss[loss=0.1199, simple_loss=0.203, pruned_loss=0.01836, over 23652.00 frames. ], tot_loss[loss=0.1474, simple_loss=0.2355, pruned_loss=0.02971, over 1063856.34 frames. ], batch size: 116, lr: 1.50e-02, grad_scale: 64.0
2024-03-09 15:21:31,120 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=28773.333333333332, ans=0.125
2024-03-09 15:21:55,464 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=28906.666666666668, ans=0.0
2024-03-09 15:21:55,473 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=28906.666666666668, ans=0.0
2024-03-09 15:22:11,580 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.03 vs. limit=15.0
2024-03-09 15:22:35,164 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.12 vs. limit=12.0
2024-03-09 15:22:40,323 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=29040.0, ans=0.125
2024-03-09 15:22:41,812 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=29106.666666666668, ans=0.2
2024-03-09 15:22:43,075 INFO [train.py:997] (1/4) Epoch 28, batch 100, loss[loss=0.1426, simple_loss=0.227, pruned_loss=0.02907, over 23955.00 frames. ], tot_loss[loss=0.1472, simple_loss=0.2359, pruned_loss=0.02927, over 1885254.75 frames. ], batch size: 142, lr: 1.50e-02, grad_scale: 64.0
2024-03-09 15:23:20,249 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.46 vs. limit=15.0
2024-03-09 15:23:21,053 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=29240.0, ans=0.125
2024-03-09 15:23:36,355 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=29306.666666666668, ans=0.2
2024-03-09 15:23:44,342 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=29306.666666666668, ans=0.04949747468305833
2024-03-09 15:23:50,078 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.379e+01 7.431e+01 8.104e+01 8.725e+01 1.109e+02, threshold=1.621e+02, percent-clipped=0.0
2024-03-09 15:24:02,913 INFO [train.py:997] (1/4) Epoch 28, batch 150, loss[loss=0.1437, simple_loss=0.2348, pruned_loss=0.02628, over 24244.00 frames. ], tot_loss[loss=0.1472, simple_loss=0.2359, pruned_loss=0.02923, over 2520077.75 frames. ], batch size: 241, lr: 1.50e-02, grad_scale: 64.0
2024-03-09 15:24:10,040 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=29440.0, ans=0.125
2024-03-09 15:24:57,645 INFO [train.py:997] (1/4) Epoch 29, batch 0, loss[loss=0.1573, simple_loss=0.2491, pruned_loss=0.03279, over 23964.00 frames. ], tot_loss[loss=0.1573, simple_loss=0.2491, pruned_loss=0.03279, over 23964.00 frames. ], batch size: 416, lr: 1.47e-02, grad_scale: 64.0
2024-03-09 15:24:57,646 INFO [train.py:1020] (1/4) Computing validation loss
2024-03-09 15:25:06,830 INFO [train.py:1029] (1/4) Epoch 29, validation: loss=0.2094, simple_loss=0.3019, pruned_loss=0.05844, over 452978.00 frames.
2024-03-09 15:25:06,831 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB
2024-03-09 15:25:11,630 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=29493.333333333332, ans=0.2
2024-03-09 15:25:27,408 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=29560.0, ans=0.125
2024-03-09 15:26:25,786 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.02 vs. limit=15.0
2024-03-09 15:26:31,895 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.23 vs. limit=10.0
2024-03-09 15:26:32,416 INFO [train.py:997] (1/4) Epoch 29, batch 50, loss[loss=0.1434, simple_loss=0.2261, pruned_loss=0.03036, over 22618.00 frames. ], tot_loss[loss=0.1447, simple_loss=0.2332, pruned_loss=0.02803, over 1070585.32 frames. ], batch size: 85, lr: 1.47e-02, grad_scale: 64.0
2024-03-09 15:26:33,237 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.08 vs. limit=15.0
2024-03-09 15:26:40,487 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=29826.666666666668, ans=0.125
2024-03-09 15:26:54,317 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=29893.333333333332, ans=0.004371014492753623
2024-03-09 15:27:00,676 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=29893.333333333332, ans=0.2
2024-03-09 15:27:13,838 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.64 vs. limit=15.0
2024-03-09 15:27:14,783 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=29960.0, ans=0.125
2024-03-09 15:27:27,042 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.485e+01 7.617e+01 8.419e+01 9.074e+01 1.218e+02, threshold=1.684e+02, percent-clipped=0.0
2024-03-09 15:27:50,430 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=30093.333333333332, ans=0.0
2024-03-09 15:27:55,000 INFO [train.py:997] (1/4) Epoch 29, batch 100, loss[loss=0.1562, simple_loss=0.252, pruned_loss=0.03023, over 23961.00 frames. ], tot_loss[loss=0.1464, simple_loss=0.2357, pruned_loss=0.02858, over 1872465.45 frames. ], batch size: 415, lr: 1.47e-02, grad_scale: 64.0
2024-03-09 15:28:28,627 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=30293.333333333332, ans=0.125
2024-03-09 15:28:43,638 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=30360.0, ans=0.1
2024-03-09 15:28:47,205 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=22.5
2024-03-09 15:28:48,838 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.68 vs. limit=12.0
2024-03-09 15:29:00,842 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=30426.666666666668, ans=0.2
2024-03-09 15:29:12,925 INFO [train.py:997] (1/4) Epoch 29, batch 150, loss[loss=0.185, simple_loss=0.2716, pruned_loss=0.04923, over 23246.00 frames. ], tot_loss[loss=0.1471, simple_loss=0.2364, pruned_loss=0.02893, over 2507383.47 frames. ], batch size: 534, lr: 1.46e-02, grad_scale: 64.0
2024-03-09 15:30:06,217 INFO [train.py:997] (1/4) Epoch 30, batch 0, loss[loss=0.1262, simple_loss=0.2115, pruned_loss=0.02046, over 23951.00 frames. ], tot_loss[loss=0.1262, simple_loss=0.2115, pruned_loss=0.02046, over 23951.00 frames. ], batch size: 142, lr: 1.44e-02, grad_scale: 64.0
2024-03-09 15:30:06,217 INFO [train.py:1020] (1/4) Computing validation loss
2024-03-09 15:30:13,305 INFO [zipformer.py:1858] (1/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([6.0229, 5.3271, 5.9799, 5.1682], device='cuda:1')
2024-03-09 15:30:18,506 INFO [train.py:1029] (1/4) Epoch 30, validation: loss=0.2105, simple_loss=0.3027, pruned_loss=0.05915, over 452978.00 frames.
2024-03-09 15:30:18,507 INFO [train.py:1030] (1/4) Maximum memory allocated so far is 28192MB
2024-03-09 15:30:54,233 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=30680.0, ans=0.0
2024-03-09 15:30:59,366 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.50 vs. limit=15.0
2024-03-09 15:31:01,625 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.221e+01 6.992e+01 7.523e+01 8.232e+01 1.586e+02, threshold=1.505e+02, percent-clipped=0.0
2024-03-09 15:31:14,460 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=30746.666666666668, ans=0.125
2024-03-09 15:31:40,985 INFO [train.py:997] (1/4) Epoch 30, batch 50, loss[loss=0.1548, simple_loss=0.2376, pruned_loss=0.03602, over 24125.00 frames. ], tot_loss[loss=0.1443, simple_loss=0.2326, pruned_loss=0.02797, over 1073821.76 frames. ], batch size: 176, lr: 1.44e-02, grad_scale: 64.0
2024-03-09 15:32:17,020 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=31013.333333333332, ans=0.2
2024-03-09 15:32:28,338 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.17 vs. limit=15.0
2024-03-09 15:32:33,904 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31080.0, ans=0.1
2024-03-09 15:32:39,345 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=8.43 vs. limit=12.0
2024-03-09 15:32:40,224 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=31080.0, ans=0.125
2024-03-09 15:32:48,769 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.36 vs. limit=15.0
2024-03-09 15:32:53,034 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.89 vs. limit=15.0
2024-03-09 15:33:01,487 INFO [train.py:997] (1/4) Epoch 30, batch 100, loss[loss=0.1501, simple_loss=0.2459, pruned_loss=0.02712, over 23983.00 frames. ], tot_loss[loss=0.1444, simple_loss=0.2342, pruned_loss=0.02731, over 1880523.54 frames. ], batch size: 388, lr: 1.43e-02, grad_scale: 64.0
2024-03-09 15:33:11,004 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=31213.333333333332, ans=0.125
2024-03-09 15:33:34,527 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.61 vs. limit=15.0
2024-03-09 15:33:43,907 WARNING [optim.py:487] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.080e+01 7.332e+01 7.826e+01 8.661e+01 1.231e+02, threshold=1.565e+02, percent-clipped=0.0
2024-03-09 15:34:06,317 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=31480.0, ans=0.004026086956521739
2024-03-09 15:34:20,983 INFO [train.py:997] (1/4) Epoch 30, batch 150, loss[loss=0.1474, simple_loss=0.2341, pruned_loss=0.03036, over 23204.00 frames. ], tot_loss[loss=0.1447, simple_loss=0.2341, pruned_loss=0.02768, over 2514734.26 frames. ], batch size: 102, lr: 1.43e-02, grad_scale: 64.0
2024-03-09 15:34:28,247 INFO [scaling.py:1023] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.32 vs. limit=15.0
2024-03-09 15:34:30,538 INFO [scaling.py:214] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=31546.666666666668, ans=0.2
2024-03-09 15:34:33,731 INFO [train.py:1248] (1/4) Done!