icefall-asr-mdcc-zipformer-2024-03-11/exp/log/log-train-2024-03-09-17-05-59-2
2024-03-09 17:05:59,770 INFO [train.py:1065] (2/4) Training started
2024-03-09 17:05:59,770 INFO [train.py:1075] (2/4) Device: cuda:2
2024-03-09 17:05:59,856 INFO [lexicon.py:168] (2/4) Loading pre-compiled data/lang_char/Linv.pt
2024-03-09 17:05:59,871 INFO [train.py:1086] (2/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '2989b0b1186fa6022932804f5b39fbb2781ebf42', 'k2-git-date': 'Fri Nov 24 11:34:10 2023', 'lhotse-version': '1.22.0.dev+git.d8ed1bbb.dirty', 'torch-version': '1.11.0+cu102', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.9', 'icefall-git-branch': 'dev/mdcc', 'icefall-git-sha1': '8b7ca604-clean', 'icefall-git-date': 'Sat Mar 9 14:09:58 2024', 'icefall-path': '/star-home/jinzengrui/lib/miniconda3/envs/dev39/lib/python3.9/site-packages/icefall-1.0-py3.9.egg', 'k2-path': '/star-home/jinzengrui/lib/miniconda3/envs/dev39/lib/python3.9/site-packages/k2-1.24.4.dev20231207+cuda10.2.torch1.11.0-py3.9-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/star-home/jinzengrui/lib/miniconda3/envs/dev39/lib/python3.9/site-packages/lhotse-1.22.0.dev0+git.d8ed1bbb.dirty-py3.9.egg/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-2-1207150844-f49d8c4f4-c49d5', 'IP address': '10.177.22.19'}, 'world_size': 4, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 50, 'start_epoch': 31, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp'), 'lang_dir': PosixPath('data/lang_char'), 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 1, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 1000, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'blank_id': 0, 'vocab_size': 4852}
2024-03-09 17:05:59,871 INFO [train.py:1088] (2/4) About to create model
2024-03-09 17:06:00,576 INFO [train.py:1092] (2/4) Number of model parameters: 74470867
2024-03-09 17:06:00,577 INFO [checkpoint.py:112] (2/4) Loading checkpoint from zipformer/exp/epoch-30.pt
2024-03-09 17:06:07,913 INFO [train.py:1107] (2/4) Using DDP
2024-03-09 17:06:08,423 INFO [train.py:1119] (2/4) Loading optimizer state dict
2024-03-09 17:06:09,551 INFO [train.py:1127] (2/4) Loading scheduler state dict
2024-03-09 17:06:09,552 INFO [asr_datamodule.py:368] (2/4) About to get train cuts
2024-03-09 17:06:09,556 INFO [asr_datamodule.py:376] (2/4) About to get valid cuts
2024-03-09 17:06:09,558 INFO [asr_datamodule.py:195] (2/4) About to get Musan cuts
2024-03-09 17:06:12,022 INFO [asr_datamodule.py:200] (2/4) Enable MUSAN
2024-03-09 17:06:12,022 INFO [asr_datamodule.py:223] (2/4) Enable SpecAugment
2024-03-09 17:06:12,022 INFO [asr_datamodule.py:224] (2/4) Time warp factor: 80
2024-03-09 17:06:12,022 INFO [asr_datamodule.py:234] (2/4) Num frame mask: 10
2024-03-09 17:06:12,023 INFO [asr_datamodule.py:247] (2/4) About to create train dataset
2024-03-09 17:06:12,023 INFO [asr_datamodule.py:273] (2/4) Using DynamicBucketingSampler.
2024-03-09 17:06:12,782 INFO [asr_datamodule.py:290] (2/4) About to create train dataloader
2024-03-09 17:06:12,783 INFO [asr_datamodule.py:315] (2/4) About to create dev dataset
2024-03-09 17:06:13,095 INFO [asr_datamodule.py:332] (2/4) About to create dev dataloader
2024-03-09 17:06:13,095 INFO [train.py:1205] (2/4) Loading grad scaler state dict
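Every line in this log carries a `(2/4)` prefix: it is the output of DDP rank 2 in a `world_size=4` run (see the config dump above). A minimal sketch of how such a run is launched with one process per GPU, assuming the usual torch.multiprocessing pattern; `run` and its arguments are illustrative, not the actual train.py entry point:

```python
import torch.multiprocessing as mp

def run(rank: int, world_size: int):
    # Each spawned process drives one GPU and prefixes its log lines
    # with "(rank/world_size)", e.g. "(2/4)" for this file.
    device = f"cuda:{rank}"
    print(f"({rank}/{world_size}) Device: {device}")
    # ... build the model, wrap it in DistributedDataParallel, train ...

if __name__ == "__main__":
    world_size = 4  # matches 'world_size': 4 in the config dump above
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```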
2024-03-09 17:06:53,814 INFO [train.py:997] (2/4) Epoch 31, batch 0, loss[loss=0.1464, simple_loss=0.2317, pruned_loss=0.03049, over 22540.00 frames. ], tot_loss[loss=0.1464, simple_loss=0.2317, pruned_loss=0.03049, over 22540.00 frames. ], batch size: 85, lr: 1.41e-02, grad_scale: 64.0
2024-03-09 17:06:53,814 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 17:07:03,247 INFO [train.py:1029] (2/4) Epoch 31, validation: loss=0.2089, simple_loss=0.3019, pruned_loss=0.05794, over 452978.00 frames.
2024-03-09 17:07:03,248 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 26094MB
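The three loss fields logged per batch are related in the standard pruned-transducer way used by icefall recipes: once training is past `warm_step` (2000 batches, long passed in this resumed run), the reported `loss` is `simple_loss_scale * simple_loss + pruned_loss`, with `simple_loss_scale=0.5` from the config dump above. A small check against the numbers just logged (the warm-up ramp in train.py is omitted here):

```python
def combine_losses(simple_loss: float, pruned_loss: float,
                   simple_loss_scale: float = 0.5) -> float:
    # Past warm-up, icefall-style pruned-transducer training reports
    # loss = simple_loss_scale * simple_loss + pruned_loss.
    return simple_loss_scale * simple_loss + pruned_loss

# Batch 0 of epoch 31 above: simple_loss=0.2317, pruned_loss=0.03049
print(combine_losses(0.2317, 0.03049))   # ~0.1464, matching loss=0.1464
# Validation above: simple_loss=0.3019, pruned_loss=0.05794
print(combine_losses(0.3019, 0.05794))   # ~0.2089, matching loss=0.2089
```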
2024-03-09 17:07:04,302 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0
2024-03-09 17:07:17,133 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.49 vs. limit=15.0
2024-03-09 17:07:21,695 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.82 vs. limit=15.0
2024-03-09 17:08:00,478 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=31800.0, ans=0.04949747468305833
2024-03-09 17:08:15,852 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=31866.666666666668, ans=0.125
2024-03-09 17:08:19,165 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=31866.666666666668, ans=0.2
2024-03-09 17:08:21,814 INFO [train.py:997] (2/4) Epoch 31, batch 50, loss[loss=0.1534, simple_loss=0.2489, pruned_loss=0.02897, over 23857.00 frames. ], tot_loss[loss=0.1441, simple_loss=0.2335, pruned_loss=0.02738, over 1071633.96 frames. ], batch size: 447, lr: 1.41e-02, grad_scale: 64.0
2024-03-09 17:08:23,704 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=31933.333333333332, ans=0.5
2024-03-09 17:08:26,706 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=31933.333333333332, ans=0.2
2024-03-09 17:08:54,735 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.909e+01 7.298e+01 7.941e+01 8.893e+01 1.039e+02, threshold=1.588e+02, percent-clipped=0.0
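The WARNING lines from optim.py summarize ScaledAdam's gradient-norm statistics: the five numbers are quartiles (min, 25%, median, 75%, max) of recently observed gradient norms, and the clipping threshold tracks `Clipping_scale` times the median. A quick check against the warning above; variable names are descriptive, not taken from optim.py:

```python
# Quartiles from the warning above: min, 25%, median, 75%, max
quartiles = [5.909e+01, 7.298e+01, 7.941e+01, 8.893e+01, 1.039e+02]
clipping_scale = 2.0

threshold = clipping_scale * quartiles[2]  # 2.0 * median grad norm
print(threshold)  # 158.82, matching threshold=1.588e+02 in the log
```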
2024-03-09 17:09:16,043 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=32133.333333333332, ans=0.2
2024-03-09 17:09:17,579 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=32133.333333333332, ans=0.125
2024-03-09 17:09:39,306 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=32200.0, ans=0.125
2024-03-09 17:09:43,950 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=32200.0, ans=0.125
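The `ScheduledFloat` values from scaling.py (dropout probabilities, skip rates, bypass scales, and similar) are not constants: each is a piecewise-linear function of `batch_count`, which is why every such line reports the current batch_count alongside the resolved value `ans`. A minimal sketch of that kind of schedule, with illustrative breakpoints rather than the ones actually configured in scaling.py:

```python
from bisect import bisect_right

def scheduled_float(batch_count: float,
                    schedule: list[tuple[float, float]]) -> float:
    """Piecewise-linear value in batch_count, clamped outside the breakpoints."""
    xs = [x for x, _ in schedule]
    ys = [y for _, y in schedule]
    if batch_count <= xs[0]:
        return ys[0]
    if batch_count >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, batch_count) - 1
    frac = (batch_count - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + frac * (ys[i + 1] - ys[i])

# Illustrative only: a skip rate decaying from 0.1 to 0.0 over training.
print(scheduled_float(32133.3, [(0.0, 0.1), (20000.0, 0.05), (50000.0, 0.0)]))
```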
2024-03-09 17:09:48,294 INFO [train.py:997] (2/4) Epoch 31, batch 100, loss[loss=0.14, simple_loss=0.2268, pruned_loss=0.02655, over 24240.00 frames. ], tot_loss[loss=0.1442, simple_loss=0.2339, pruned_loss=0.0272, over 1879009.89 frames. ], batch size: 188, lr: 1.40e-02, grad_scale: 64.0
2024-03-09 17:10:01,774 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=15.0
2024-03-09 17:10:13,132 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=32333.333333333332, ans=0.1
2024-03-09 17:10:16,168 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=32333.333333333332, ans=0.0038405797101449283
2024-03-09 17:10:53,962 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=32533.333333333332, ans=0.125
2024-03-09 17:10:57,006 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=32533.333333333332, ans=0.125
2024-03-09 17:11:00,033 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=32533.333333333332, ans=0.07
2024-03-09 17:11:08,745 INFO [train.py:997] (2/4) Epoch 31, batch 150, loss[loss=0.1535, simple_loss=0.2408, pruned_loss=0.03313, over 24237.00 frames. ], tot_loss[loss=0.1454, simple_loss=0.2351, pruned_loss=0.02781, over 2517109.18 frames. ], batch size: 198, lr: 1.40e-02, grad_scale: 64.0
2024-03-09 17:11:10,504 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=32600.0, ans=0.1
2024-03-09 17:12:06,856 INFO [train.py:997] (2/4) Epoch 32, batch 0, loss[loss=0.1951, simple_loss=0.2726, pruned_loss=0.05876, over 23222.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2726, pruned_loss=0.05876, over 23222.00 frames. ], batch size: 534, lr: 1.38e-02, grad_scale: 64.0
2024-03-09 17:12:06,857 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 17:12:16,507 INFO [train.py:1029] (2/4) Epoch 32, validation: loss=0.2101, simple_loss=0.3027, pruned_loss=0.0588, over 452978.00 frames.
2024-03-09 17:12:16,508 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27051MB
2024-03-09 17:12:18,465 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=32653.333333333332, ans=0.125
2024-03-09 17:12:19,936 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=32653.333333333332, ans=0.125
2024-03-09 17:12:26,195 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=32653.333333333332, ans=0.0
2024-03-09 17:12:32,050 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.203e+01 7.071e+01 7.685e+01 8.593e+01 1.169e+02, threshold=1.537e+02, percent-clipped=0.0
2024-03-09 17:12:33,909 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=32720.0, ans=0.125
2024-03-09 17:12:37,615 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.80 vs. limit=15.0
2024-03-09 17:12:57,829 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=32786.666666666664, ans=0.125
2024-03-09 17:12:58,574 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.96 vs. limit=22.5
2024-03-09 17:13:03,988 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=32853.333333333336, ans=0.0
2024-03-09 17:13:14,946 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=32853.333333333336, ans=0.125
2024-03-09 17:13:27,333 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=32920.0, ans=0.1
2024-03-09 17:13:28,081 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.88 vs. limit=15.0
2024-03-09 17:13:34,604 INFO [train.py:997] (2/4) Epoch 32, batch 50, loss[loss=0.1302, simple_loss=0.2187, pruned_loss=0.02087, over 23953.00 frames. ], tot_loss[loss=0.144, simple_loss=0.2325, pruned_loss=0.02775, over 1067019.37 frames. ], batch size: 142, lr: 1.38e-02, grad_scale: 64.0
2024-03-09 17:13:44,581 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.92 vs. limit=15.0
2024-03-09 17:14:01,597 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=33053.333333333336, ans=0.1
2024-03-09 17:14:16,281 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=33120.0, ans=0.0
2024-03-09 17:14:27,390 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=33186.666666666664, ans=0.0
2024-03-09 17:14:33,630 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=33186.666666666664, ans=0.125
2024-03-09 17:14:59,398 INFO [train.py:997] (2/4) Epoch 32, batch 100, loss[loss=0.1402, simple_loss=0.2245, pruned_loss=0.02798, over 19870.00 frames. ], tot_loss[loss=0.1436, simple_loss=0.2327, pruned_loss=0.02729, over 1886966.60 frames. ], batch size: 60, lr: 1.37e-02, grad_scale: 64.0
2024-03-09 17:15:15,496 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.885e+01 7.174e+01 7.568e+01 8.159e+01 1.038e+02, threshold=1.514e+02, percent-clipped=0.0
2024-03-09 17:15:15,861 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-03-09 17:15:33,196 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.59 vs. limit=10.0
2024-03-09 17:15:34,172 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=33453.333333333336, ans=0.003597101449275362
2024-03-09 17:15:51,257 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=33520.0, ans=0.1
2024-03-09 17:16:19,716 INFO [train.py:997] (2/4) Epoch 32, batch 150, loss[loss=0.1359, simple_loss=0.2226, pruned_loss=0.02461, over 23572.00 frames. ], tot_loss[loss=0.1435, simple_loss=0.2329, pruned_loss=0.02701, over 2520168.77 frames. ], batch size: 128, lr: 1.37e-02, grad_scale: 64.0
2024-03-09 17:17:09,038 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=33706.666666666664, ans=0.1
2024-03-09 17:17:14,940 INFO [train.py:997] (2/4) Epoch 33, batch 0, loss[loss=0.1378, simple_loss=0.2217, pruned_loss=0.02702, over 24242.00 frames. ], tot_loss[loss=0.1378, simple_loss=0.2217, pruned_loss=0.02702, over 24242.00 frames. ], batch size: 229, lr: 1.35e-02, grad_scale: 64.0
2024-03-09 17:17:14,941 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 17:17:24,825 INFO [train.py:1029] (2/4) Epoch 33, validation: loss=0.2104, simple_loss=0.3043, pruned_loss=0.05821, over 452978.00 frames.
2024-03-09 17:17:24,826 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27051MB
2024-03-09 17:17:36,332 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=33706.666666666664, ans=0.0
2024-03-09 17:17:41,717 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=15.0
2024-03-09 17:18:20,064 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-03-09 17:18:43,172 INFO [train.py:997] (2/4) Epoch 33, batch 50, loss[loss=0.1854, simple_loss=0.2694, pruned_loss=0.0507, over 23231.00 frames. ], tot_loss[loss=0.1404, simple_loss=0.229, pruned_loss=0.02591, over 1049607.84 frames. ], batch size: 534, lr: 1.35e-02, grad_scale: 64.0
2024-03-09 17:18:43,483 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=34040.0, ans=0.2
2024-03-09 17:18:45,080 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=34040.0, ans=0.0
2024-03-09 17:18:46,184 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.045e+01 7.058e+01 7.697e+01 8.414e+01 1.529e+02, threshold=1.539e+02, percent-clipped=1.0
2024-03-09 17:18:51,724 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0
2024-03-09 17:18:54,167 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=34040.0, ans=0.0
2024-03-09 17:18:57,360 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=34106.666666666664, ans=0.0
2024-03-09 17:18:59,493 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.59 vs. limit=15.0
2024-03-09 17:19:11,925 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=34106.666666666664, ans=0.125
2024-03-09 17:19:18,571 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=34173.333333333336, ans=0.0
2024-03-09 17:19:18,653 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=34173.333333333336, ans=10.0
2024-03-09 17:19:27,118 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34173.333333333336, ans=0.1
2024-03-09 17:20:08,483 INFO [train.py:997] (2/4) Epoch 33, batch 100, loss[loss=0.1549, simple_loss=0.2513, pruned_loss=0.02926, over 23824.00 frames. ], tot_loss[loss=0.142, simple_loss=0.2312, pruned_loss=0.02642, over 1875811.14 frames. ], batch size: 447, lr: 1.35e-02, grad_scale: 64.0
2024-03-09 17:20:10,862 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.20 vs. limit=10.0
2024-03-09 17:20:22,600 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=34440.0, ans=0.2
2024-03-09 17:20:30,302 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=34440.0, ans=0.1
2024-03-09 17:20:31,757 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=34440.0, ans=0.125
2024-03-09 17:20:44,038 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=34506.666666666664, ans=0.125
2024-03-09 17:20:47,154 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=34506.666666666664, ans=0.04949747468305833
2024-03-09 17:20:53,222 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=34573.333333333336, ans=0.125
2024-03-09 17:21:06,331 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=34573.333333333336, ans=0.05
2024-03-09 17:21:18,745 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34640.0, ans=0.1
2024-03-09 17:21:28,183 INFO [train.py:997] (2/4) Epoch 33, batch 150, loss[loss=0.1461, simple_loss=0.2362, pruned_loss=0.02803, over 24280.00 frames. ], tot_loss[loss=0.1438, simple_loss=0.2338, pruned_loss=0.02691, over 2514156.33 frames. ], batch size: 267, lr: 1.34e-02, grad_scale: 64.0
2024-03-09 17:21:31,133 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.628e+01 7.574e+01 8.231e+01 9.009e+01 1.365e+02, threshold=1.646e+02, percent-clipped=0.0
2024-03-09 17:22:22,791 INFO [train.py:997] (2/4) Epoch 34, batch 0, loss[loss=0.1448, simple_loss=0.2345, pruned_loss=0.02754, over 24058.00 frames. ], tot_loss[loss=0.1448, simple_loss=0.2345, pruned_loss=0.02754, over 24058.00 frames. ], batch size: 176, lr: 1.32e-02, grad_scale: 64.0
2024-03-09 17:22:22,791 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 17:22:32,283 INFO [train.py:1029] (2/4) Epoch 34, validation: loss=0.2117, simple_loss=0.3053, pruned_loss=0.0591, over 452978.00 frames.
2024-03-09 17:22:32,284 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 17:22:37,355 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=34760.0, ans=0.125
2024-03-09 17:22:37,447 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=34760.0, ans=0.0
2024-03-09 17:22:56,099 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=34826.666666666664, ans=0.0032985507246376814
2024-03-09 17:23:00,750 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=34826.666666666664, ans=0.0
2024-03-09 17:23:49,727 INFO [train.py:997] (2/4) Epoch 34, batch 50, loss[loss=0.129, simple_loss=0.2203, pruned_loss=0.01887, over 23885.00 frames. ], tot_loss[loss=0.1394, simple_loss=0.2283, pruned_loss=0.02526, over 1071868.33 frames. ], batch size: 142, lr: 1.32e-02, grad_scale: 128.0
2024-03-09 17:24:03,318 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=35093.333333333336, ans=0.0
2024-03-09 17:24:26,755 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=35226.666666666664, ans=0.2
2024-03-09 17:24:35,890 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=35226.666666666664, ans=0.95
2024-03-09 17:24:51,409 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=35293.333333333336, ans=0.125
2024-03-09 17:24:54,525 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=35293.333333333336, ans=0.0031971014492753625
2024-03-09 17:24:57,512 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=35360.0, ans=0.125
2024-03-09 17:25:04,709 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.790e+01 6.987e+01 7.379e+01 8.041e+01 1.553e+02, threshold=1.476e+02, percent-clipped=0.0
2024-03-09 17:25:13,956 INFO [train.py:997] (2/4) Epoch 34, batch 100, loss[loss=0.1515, simple_loss=0.249, pruned_loss=0.02695, over 23827.00 frames. ], tot_loss[loss=0.1396, simple_loss=0.2292, pruned_loss=0.025, over 1881309.78 frames. ], batch size: 447, lr: 1.32e-02, grad_scale: 128.0
2024-03-09 17:25:14,328 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=35426.666666666664, ans=0.0
2024-03-09 17:25:23,486 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=35426.666666666664, ans=0.125
2024-03-09 17:25:30,476 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.02 vs. limit=10.0
2024-03-09 17:25:34,248 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=35493.333333333336, ans=0.2
2024-03-09 17:26:22,831 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=35693.333333333336, ans=0.1
2024-03-09 17:26:32,942 INFO [train.py:997] (2/4) Epoch 34, batch 150, loss[loss=0.1508, simple_loss=0.2458, pruned_loss=0.02793, over 23968.00 frames. ], tot_loss[loss=0.1404, simple_loss=0.2306, pruned_loss=0.02507, over 2514436.88 frames. ], batch size: 416, lr: 1.32e-02, grad_scale: 128.0
2024-03-09 17:27:26,417 INFO [train.py:997] (2/4) Epoch 35, batch 0, loss[loss=0.1368, simple_loss=0.2253, pruned_loss=0.02416, over 24306.00 frames. ], tot_loss[loss=0.1368, simple_loss=0.2253, pruned_loss=0.02416, over 24306.00 frames. ], batch size: 254, lr: 1.30e-02, grad_scale: 128.0
2024-03-09 17:27:26,418 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 17:27:38,584 INFO [train.py:1029] (2/4) Epoch 35, validation: loss=0.2098, simple_loss=0.3027, pruned_loss=0.05849, over 452978.00 frames.
2024-03-09 17:27:38,584 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 17:28:15,932 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=35946.666666666664, ans=0.2
2024-03-09 17:28:16,321 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.80 vs. limit=15.0
2024-03-09 17:28:25,533 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-03-09 17:28:30,263 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36013.333333333336, ans=0.1
2024-03-09 17:28:34,547 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.276e+01 7.140e+01 7.953e+01 8.912e+01 1.249e+02, threshold=1.591e+02, percent-clipped=0.0
2024-03-09 17:28:58,535 INFO [train.py:997] (2/4) Epoch 35, batch 50, loss[loss=0.1376, simple_loss=0.2266, pruned_loss=0.02431, over 23909.00 frames. ], tot_loss[loss=0.1429, simple_loss=0.2317, pruned_loss=0.02705, over 1052272.91 frames. ], batch size: 153, lr: 1.30e-02, grad_scale: 128.0
2024-03-09 17:28:58,803 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=36146.666666666664, ans=0.125
2024-03-09 17:29:18,678 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=36213.333333333336, ans=0.002997101449275361
2024-03-09 17:29:24,696 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=36213.333333333336, ans=0.125
2024-03-09 17:29:46,663 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=36346.666666666664, ans=0.125
2024-03-09 17:30:02,015 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=36413.333333333336, ans=0.125
2024-03-09 17:30:18,437 INFO [train.py:997] (2/4) Epoch 35, batch 100, loss[loss=0.1428, simple_loss=0.2342, pruned_loss=0.02573, over 24271.00 frames. ], tot_loss[loss=0.1441, simple_loss=0.2338, pruned_loss=0.02722, over 1876023.22 frames. ], batch size: 267, lr: 1.29e-02, grad_scale: 128.0
2024-03-09 17:31:18,163 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.800e+01 7.204e+01 7.789e+01 8.601e+01 1.817e+02, threshold=1.558e+02, percent-clipped=1.0
2024-03-09 17:31:19,313 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.35 vs. limit=22.5
2024-03-09 17:31:38,613 INFO [train.py:997] (2/4) Epoch 35, batch 150, loss[loss=0.145, simple_loss=0.2414, pruned_loss=0.02429, over 24050.00 frames. ], tot_loss[loss=0.1435, simple_loss=0.2337, pruned_loss=0.02665, over 2506960.03 frames. ], batch size: 365, lr: 1.29e-02, grad_scale: 64.0
2024-03-09 17:31:47,935 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=36813.333333333336, ans=0.125
2024-03-09 17:32:32,828 INFO [train.py:997] (2/4) Epoch 36, batch 0, loss[loss=0.1224, simple_loss=0.2015, pruned_loss=0.02166, over 23589.00 frames. ], tot_loss[loss=0.1224, simple_loss=0.2015, pruned_loss=0.02166, over 23589.00 frames. ], batch size: 116, lr: 1.27e-02, grad_scale: 64.0
2024-03-09 17:32:32,828 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 17:32:40,920 INFO [zipformer.py:1858] (2/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([0.9525, 2.3762, 2.5905, 2.6092], device='cuda:2')
2024-03-09 17:32:42,863 INFO [train.py:1029] (2/4) Epoch 36, validation: loss=0.212, simple_loss=0.307, pruned_loss=0.05847, over 452978.00 frames.
2024-03-09 17:32:42,864 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 17:33:03,428 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=36933.333333333336, ans=0.09899494936611666
2024-03-09 17:33:10,444 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=15.0
2024-03-09 17:33:21,744 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=37000.0, ans=0.0
2024-03-09 17:33:29,963 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.04 vs. limit=15.0
2024-03-09 17:33:36,135 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.46 vs. limit=15.0
2024-03-09 17:34:10,522 INFO [train.py:997] (2/4) Epoch 36, batch 50, loss[loss=0.1477, simple_loss=0.2458, pruned_loss=0.02483, over 23977.00 frames. ], tot_loss[loss=0.1408, simple_loss=0.23, pruned_loss=0.02584, over 1069727.98 frames. ], batch size: 416, lr: 1.27e-02, grad_scale: 64.0
2024-03-09 17:34:23,010 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=37200.0, ans=0.125
2024-03-09 17:34:24,724 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=37266.666666666664, ans=0.125
2024-03-09 17:34:32,325 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-03-09 17:34:32,369 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=37266.666666666664, ans=0.2
2024-03-09 17:34:43,633 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=37333.333333333336, ans=0.125
2024-03-09 17:34:52,052 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0
2024-03-09 17:34:52,769 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=37333.333333333336, ans=0.125
2024-03-09 17:34:55,596 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.060e+01 6.975e+01 7.752e+01 8.346e+01 1.468e+02, threshold=1.550e+02, percent-clipped=0.0
2024-03-09 17:35:10,862 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. limit=10.0
2024-03-09 17:35:16,363 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=37466.666666666664, ans=0.2
2024-03-09 17:35:25,797 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-03-09 17:35:28,459 INFO [train.py:997] (2/4) Epoch 36, batch 100, loss[loss=0.1514, simple_loss=0.2453, pruned_loss=0.02874, over 23974.00 frames. ], tot_loss[loss=0.1397, simple_loss=0.2292, pruned_loss=0.02509, over 1882046.42 frames. ], batch size: 416, lr: 1.27e-02, grad_scale: 64.0
2024-03-09 17:35:56,861 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=37600.0, ans=0.0
2024-03-09 17:35:59,960 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=37600.0, ans=0.125
2024-03-09 17:36:01,376 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=37666.666666666664, ans=0.125
2024-03-09 17:36:20,334 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=37733.333333333336, ans=0.125
2024-03-09 17:36:24,067 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.17 vs. limit=10.0
2024-03-09 17:36:37,815 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.61 vs. limit=15.0
2024-03-09 17:36:38,550 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=37800.0, ans=0.0026521739130434784
2024-03-09 17:36:44,015 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.02 vs. limit=12.0
2024-03-09 17:36:50,951 INFO [train.py:997] (2/4) Epoch 36, batch 150, loss[loss=0.1439, simple_loss=0.2293, pruned_loss=0.02927, over 24098.00 frames. ], tot_loss[loss=0.1402, simple_loss=0.2299, pruned_loss=0.02527, over 2516283.42 frames. ], batch size: 165, lr: 1.27e-02, grad_scale: 64.0
2024-03-09 17:36:56,448 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.25 vs. limit=15.0
2024-03-09 17:37:46,101 INFO [train.py:997] (2/4) Epoch 37, batch 0, loss[loss=0.1375, simple_loss=0.2303, pruned_loss=0.02238, over 24194.00 frames. ], tot_loss[loss=0.1375, simple_loss=0.2303, pruned_loss=0.02238, over 24194.00 frames. ], batch size: 217, lr: 1.25e-02, grad_scale: 64.0
2024-03-09 17:37:46,102 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 17:37:55,592 INFO [train.py:1029] (2/4) Epoch 37, validation: loss=0.2112, simple_loss=0.3044, pruned_loss=0.05893, over 452978.00 frames.
2024-03-09 17:37:55,593 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 17:37:58,936 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=37920.0, ans=0.125
2024-03-09 17:38:01,974 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=37920.0, ans=0.2
2024-03-09 17:38:02,083 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=37920.0, ans=0.125
2024-03-09 17:38:17,244 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=37986.666666666664, ans=0.0
2024-03-09 17:38:19,542 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.31 vs. limit=15.0
2024-03-09 17:38:20,419 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=37986.666666666664, ans=0.125
2024-03-09 17:38:21,878 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=37986.666666666664, ans=0.125
2024-03-09 17:38:25,027 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=37986.666666666664, ans=0.0026115942028985505
2024-03-09 17:38:30,964 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.112e+01 7.137e+01 7.682e+01 8.524e+01 1.300e+02, threshold=1.536e+02, percent-clipped=0.0
2024-03-09 17:38:47,184 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=38120.0, ans=0.125
2024-03-09 17:39:20,069 INFO [train.py:997] (2/4) Epoch 37, batch 50, loss[loss=0.1495, simple_loss=0.2329, pruned_loss=0.03309, over 24092.00 frames. ], tot_loss[loss=0.1381, simple_loss=0.2271, pruned_loss=0.02458, over 1071551.82 frames. ], batch size: 165, lr: 1.25e-02, grad_scale: 64.0
2024-03-09 17:39:42,150 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=38320.0, ans=0.125
2024-03-09 17:39:43,722 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=38320.0, ans=0.125
2024-03-09 17:39:56,386 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=38386.666666666664, ans=0.5
2024-03-09 17:40:19,358 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=38453.333333333336, ans=0.0025101449275362316
2024-03-09 17:40:36,327 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=38520.0, ans=0.0
2024-03-09 17:40:37,849 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=38520.0, ans=0.1
2024-03-09 17:40:40,615 INFO [train.py:997] (2/4) Epoch 37, batch 100, loss[loss=0.1281, simple_loss=0.2133, pruned_loss=0.02146, over 23630.00 frames. ], tot_loss[loss=0.139, simple_loss=0.2291, pruned_loss=0.02449, over 1890939.57 frames. ], batch size: 116, lr: 1.25e-02, grad_scale: 64.0
2024-03-09 17:40:42,503 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=38586.666666666664, ans=0.125
2024-03-09 17:41:08,793 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2024-03-09 17:41:13,211 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=38720.0, ans=0.0
2024-03-09 17:41:15,924 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.868e+01 6.991e+01 7.571e+01 8.226e+01 1.121e+02, threshold=1.514e+02, percent-clipped=0.0
2024-03-09 17:41:16,295 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=38720.0, ans=0.125
2024-03-09 17:41:16,933 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.77 vs. limit=6.0
2024-03-09 17:42:00,692 INFO [train.py:997] (2/4) Epoch 37, batch 150, loss[loss=0.1421, simple_loss=0.2347, pruned_loss=0.0247, over 24277.00 frames. ], tot_loss[loss=0.1395, simple_loss=0.2295, pruned_loss=0.02477, over 2520408.82 frames. ], batch size: 267, lr: 1.24e-02, grad_scale: 64.0
2024-03-09 17:42:09,048 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=38920.0, ans=0.125
2024-03-09 17:42:52,930 INFO [train.py:997] (2/4) Epoch 38, batch 0, loss[loss=0.1404, simple_loss=0.2289, pruned_loss=0.026, over 24195.00 frames. ], tot_loss[loss=0.1404, simple_loss=0.2289, pruned_loss=0.026, over 24195.00 frames. ], batch size: 217, lr: 1.23e-02, grad_scale: 64.0
2024-03-09 17:42:52,930 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 17:43:02,283 INFO [train.py:1029] (2/4) Epoch 38, validation: loss=0.2136, simple_loss=0.3079, pruned_loss=0.05959, over 452978.00 frames.
2024-03-09 17:43:02,283 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 17:43:13,152 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=38973.333333333336, ans=0.125
2024-03-09 17:43:17,946 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=38973.333333333336, ans=0.0
2024-03-09 17:43:21,008 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=39040.0, ans=0.125
2024-03-09 17:43:22,017 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.47 vs. limit=15.0
2024-03-09 17:43:45,519 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=39106.666666666664, ans=0.1
2024-03-09 17:44:27,814 INFO [train.py:997] (2/4) Epoch 38, batch 50, loss[loss=0.1443, simple_loss=0.2397, pruned_loss=0.02444, over 24123.00 frames. ], tot_loss[loss=0.1379, simple_loss=0.2277, pruned_loss=0.02404, over 1071400.73 frames. ], batch size: 366, lr: 1.22e-02, grad_scale: 64.0
2024-03-09 17:44:48,017 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.028e+01 7.170e+01 7.896e+01 8.779e+01 1.113e+02, threshold=1.579e+02, percent-clipped=0.0
2024-03-09 17:44:51,469 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=39373.333333333336, ans=0.002310144927536232
2024-03-09 17:45:03,480 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=39440.0, ans=0.07
2024-03-09 17:45:21,592 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=39506.666666666664, ans=0.0
2024-03-09 17:45:33,572 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.06 vs. limit=15.0
2024-03-09 17:45:38,850 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=39573.333333333336, ans=0.125
2024-03-09 17:45:43,459 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=39573.333333333336, ans=0.05
2024-03-09 17:45:46,179 INFO [train.py:997] (2/4) Epoch 38, batch 100, loss[loss=0.1357, simple_loss=0.2239, pruned_loss=0.02378, over 24043.00 frames. ], tot_loss[loss=0.1393, simple_loss=0.2298, pruned_loss=0.0244, over 1893209.55 frames. ], batch size: 165, lr: 1.22e-02, grad_scale: 64.0
2024-03-09 17:46:09,701 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=39706.666666666664, ans=0.0
2024-03-09 17:46:16,562 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.53 vs. limit=15.0
2024-03-09 17:46:20,509 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=39773.333333333336, ans=0.09899494936611666
2024-03-09 17:47:07,593 INFO [train.py:997] (2/4) Epoch 38, batch 150, loss[loss=0.1358, simple_loss=0.2314, pruned_loss=0.0201, over 24093.00 frames. ], tot_loss[loss=0.1397, simple_loss=0.2303, pruned_loss=0.02457, over 2520913.76 frames. ], batch size: 344, lr: 1.22e-02, grad_scale: 64.0
2024-03-09 17:48:03,476 INFO [train.py:997] (2/4) Epoch 39, batch 0, loss[loss=0.1329, simple_loss=0.2262, pruned_loss=0.01986, over 24062.00 frames. ], tot_loss[loss=0.1329, simple_loss=0.2262, pruned_loss=0.01986, over 24062.00 frames. ], batch size: 344, lr: 1.20e-02, grad_scale: 64.0
2024-03-09 17:48:03,477 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 17:48:12,216 INFO [zipformer.py:1858] (2/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.6502, 5.3125, 5.6457, 5.3410], device='cuda:2')
2024-03-09 17:48:12,745 INFO [train.py:1029] (2/4) Epoch 39, validation: loss=0.2141, simple_loss=0.3082, pruned_loss=0.06004, over 452978.00 frames.
2024-03-09 17:48:12,746 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 17:48:26,644 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.993e+01 6.884e+01 7.356e+01 8.157e+01 1.068e+02, threshold=1.471e+02, percent-clipped=0.0
2024-03-09 17:48:40,644 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=40093.333333333336, ans=0.1
2024-03-09 17:48:40,657 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=40093.333333333336, ans=0.0
2024-03-09 17:49:09,858 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=40226.666666666664, ans=0.002124637681159421
2024-03-09 17:49:16,101 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=40226.666666666664, ans=0.1
2024-03-09 17:49:41,668 INFO [train.py:997] (2/4) Epoch 39, batch 50, loss[loss=0.1416, simple_loss=0.2311, pruned_loss=0.02604, over 24191.00 frames. ], tot_loss[loss=0.138, simple_loss=0.2272, pruned_loss=0.02437, over 1068898.46 frames. ], batch size: 295, lr: 1.20e-02, grad_scale: 64.0
2024-03-09 17:49:44,373 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0
2024-03-09 17:50:20,257 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=40493.333333333336, ans=0.05
2024-03-09 17:50:59,996 INFO [train.py:997] (2/4) Epoch 39, batch 100, loss[loss=0.1243, simple_loss=0.2186, pruned_loss=0.01501, over 22863.00 frames. ], tot_loss[loss=0.1378, simple_loss=0.2281, pruned_loss=0.02373, over 1888872.37 frames. ], batch size: 609, lr: 1.20e-02, grad_scale: 64.0
2024-03-09 17:51:09,406 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.940e+01 6.841e+01 7.461e+01 8.103e+01 1.250e+02, threshold=1.492e+02, percent-clipped=0.0
2024-03-09 17:51:36,818 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=40826.666666666664, ans=0.0
2024-03-09 17:52:21,040 INFO [train.py:997] (2/4) Epoch 39, batch 150, loss[loss=0.1354, simple_loss=0.2311, pruned_loss=0.01985, over 24144.00 frames. ], tot_loss[loss=0.1382, simple_loss=0.229, pruned_loss=0.02371, over 2523197.24 frames. ], batch size: 366, lr: 1.20e-02, grad_scale: 64.0
2024-03-09 17:52:30,135 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=41026.666666666664, ans=0.125
2024-03-09 17:53:16,190 INFO [train.py:997] (2/4) Epoch 40, batch 0, loss[loss=0.1141, simple_loss=0.2117, pruned_loss=0.008274, over 21415.00 frames. ], tot_loss[loss=0.1141, simple_loss=0.2117, pruned_loss=0.008274, over 21415.00 frames. ], batch size: 718, lr: 1.18e-02, grad_scale: 64.0
2024-03-09 17:53:16,190 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 17:53:25,709 INFO [train.py:1029] (2/4) Epoch 40, validation: loss=0.2148, simple_loss=0.3085, pruned_loss=0.06058, over 452978.00 frames.
2024-03-09 17:53:25,709 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 17:53:54,642 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.74 vs. limit=15.0
2024-03-09 17:54:00,191 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=41213.333333333336, ans=0.125
2024-03-09 17:54:07,647 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=41213.333333333336, ans=0.125
2024-03-09 17:54:13,866 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=41213.333333333336, ans=10.0
2024-03-09 17:54:23,974 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.02 vs. limit=10.0
2024-03-09 17:54:39,624 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=41346.666666666664, ans=0.125
2024-03-09 17:54:43,491 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.77 vs. limit=15.0
2024-03-09 17:54:47,011 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.979e+01 7.013e+01 7.603e+01 8.055e+01 1.247e+02, threshold=1.521e+02, percent-clipped=0.0
2024-03-09 17:54:51,545 INFO [train.py:997] (2/4) Epoch 40, batch 50, loss[loss=0.1435, simple_loss=0.2316, pruned_loss=0.02769, over 24076.00 frames. ], tot_loss[loss=0.1369, simple_loss=0.2275, pruned_loss=0.02311, over 1062746.76 frames. ], batch size: 176, lr: 1.18e-02, grad_scale: 64.0
2024-03-09 17:55:00,924 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=41413.333333333336, ans=0.1
2024-03-09 17:55:34,858 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=41546.666666666664, ans=0.125
2024-03-09 17:55:36,380 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=41613.333333333336, ans=0.025
2024-03-09 17:55:41,381 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.00 vs. limit=15.0
2024-03-09 17:56:11,514 INFO [train.py:997] (2/4) Epoch 40, batch 100, loss[loss=0.1421, simple_loss=0.2409, pruned_loss=0.02163, over 24057.00 frames. ], tot_loss[loss=0.1387, simple_loss=0.2298, pruned_loss=0.02385, over 1880352.11 frames. ], batch size: 389, lr: 1.18e-02, grad_scale: 64.0
2024-03-09 17:56:25,273 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=41813.333333333336, ans=0.125
2024-03-09 17:56:25,274 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=41813.333333333336, ans=0.125
2024-03-09 17:56:38,662 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=41813.333333333336, ans=0.0017797101449275356
2024-03-09 17:57:01,827 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.09 vs. limit=10.0
2024-03-09 17:57:15,104 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=42013.333333333336, ans=0.0
2024-03-09 17:57:25,919 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.807e+01 6.999e+01 7.479e+01 8.341e+01 1.133e+02, threshold=1.496e+02, percent-clipped=0.0
2024-03-09 17:57:29,664 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=42080.0, ans=0.0
2024-03-09 17:57:29,749 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=42080.0, ans=0.2
2024-03-09 17:57:30,900 INFO [train.py:997] (2/4) Epoch 40, batch 150, loss[loss=0.1284, simple_loss=0.2239, pruned_loss=0.01647, over 22919.00 frames. ], tot_loss[loss=0.1385, simple_loss=0.2292, pruned_loss=0.02386, over 2520675.51 frames. ], batch size: 609, lr: 1.18e-02, grad_scale: 64.0
2024-03-09 17:57:34,323 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=42080.0, ans=0.0
2024-03-09 17:58:21,383 INFO [train.py:997] (2/4) Epoch 41, batch 0, loss[loss=0.1449, simple_loss=0.2422, pruned_loss=0.0238, over 23837.00 frames. ], tot_loss[loss=0.1449, simple_loss=0.2422, pruned_loss=0.0238, over 23837.00 frames. ], batch size: 447, lr: 1.16e-02, grad_scale: 64.0
2024-03-09 17:58:21,383 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 17:58:30,942 INFO [train.py:1029] (2/4) Epoch 41, validation: loss=0.2136, simple_loss=0.3076, pruned_loss=0.05982, over 452978.00 frames.
2024-03-09 17:58:30,942 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 17:58:52,553 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=42200.0, ans=0.0016956521739130443
2024-03-09 17:59:16,947 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=42266.666666666664, ans=0.04949747468305833
2024-03-09 17:59:26,230 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=42333.333333333336, ans=0.1
2024-03-09 17:59:41,645 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=42400.0, ans=0.1
2024-03-09 17:59:44,713 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=42400.0, ans=0.0016521739130434792
2024-03-09 17:59:45,314 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.49 vs. limit=12.0
2024-03-09 17:59:53,110 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.36 vs. limit=15.0
2024-03-09 17:59:53,553 INFO [train.py:997] (2/4) Epoch 41, batch 50, loss[loss=0.1425, simple_loss=0.2332, pruned_loss=0.02589, over 24080.00 frames. ], tot_loss[loss=0.1364, simple_loss=0.2274, pruned_loss=0.02268, over 1071951.54 frames. ], batch size: 176, lr: 1.16e-02, grad_scale: 64.0
2024-03-09 18:00:15,798 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=42533.333333333336, ans=0.2
2024-03-09 18:00:18,834 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=42533.333333333336, ans=0.125
2024-03-09 18:00:36,173 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.35 vs. limit=15.0
2024-03-09 18:00:55,611 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.788e+01 7.025e+01 7.943e+01 8.921e+01 1.202e+02, threshold=1.589e+02, percent-clipped=0.0
2024-03-09 18:00:56,005 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=42733.333333333336, ans=0.125
2024-03-09 18:01:09,779 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=42733.333333333336, ans=0.0
2024-03-09 18:01:14,009 INFO [train.py:997] (2/4) Epoch 41, batch 100, loss[loss=0.1255, simple_loss=0.2142, pruned_loss=0.01841, over 23967.00 frames. ], tot_loss[loss=0.1367, simple_loss=0.2276, pruned_loss=0.02288, over 1877208.30 frames. ], batch size: 142, lr: 1.16e-02, grad_scale: 64.0
2024-03-09 18:01:37,481 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=42866.666666666664, ans=0.125
2024-03-09 18:01:54,136 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=42933.333333333336, ans=0.125
2024-03-09 18:01:55,711 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=42933.333333333336, ans=0.1
2024-03-09 18:02:22,732 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.75 vs. limit=22.5
2024-03-09 18:02:33,531 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=43133.333333333336, ans=0.125
2024-03-09 18:02:34,744 INFO [train.py:997] (2/4) Epoch 41, batch 150, loss[loss=0.1384, simple_loss=0.23, pruned_loss=0.02337, over 24168.00 frames. ], tot_loss[loss=0.1376, simple_loss=0.2285, pruned_loss=0.02335, over 2507465.28 frames. ], batch size: 326, lr: 1.16e-02, grad_scale: 64.0
2024-03-09 18:03:28,794 INFO [train.py:997] (2/4) Epoch 42, batch 0, loss[loss=0.134, simple_loss=0.2254, pruned_loss=0.02131, over 24201.00 frames. ], tot_loss[loss=0.134, simple_loss=0.2254, pruned_loss=0.02131, over 24201.00 frames. ], batch size: 295, lr: 1.14e-02, grad_scale: 64.0
2024-03-09 18:03:28,794 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 18:03:38,341 INFO [train.py:1029] (2/4) Epoch 42, validation: loss=0.2135, simple_loss=0.3075, pruned_loss=0.05972, over 452978.00 frames.
2024-03-09 18:03:38,342 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 18:04:03,016 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=43253.333333333336, ans=0.1
2024-03-09 18:04:23,937 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=43320.0, ans=15.0
2024-03-09 18:04:26,993 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.11 vs. limit=15.0
2024-03-09 18:04:29,006 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.865e+01 6.812e+01 7.244e+01 8.018e+01 1.063e+02, threshold=1.449e+02, percent-clipped=0.0
2024-03-09 18:04:33,174 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=5.04 vs. limit=12.0
2024-03-09 18:04:39,615 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.16 vs. limit=22.5
2024-03-09 18:04:58,778 INFO [train.py:997] (2/4) Epoch 42, batch 50, loss[loss=0.1374, simple_loss=0.2325, pruned_loss=0.02117, over 24209.00 frames. ], tot_loss[loss=0.1344, simple_loss=0.224, pruned_loss=0.0224, over 1063053.21 frames. ], batch size: 327, lr: 1.14e-02, grad_scale: 64.0
2024-03-09 18:05:06,716 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=43520.0, ans=0.0014086956521739136
2024-03-09 18:05:44,548 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.46 vs. limit=22.5
2024-03-09 18:05:57,777 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=43720.0, ans=0.125
2024-03-09 18:06:20,953 INFO [train.py:997] (2/4) Epoch 42, batch 100, loss[loss=0.1233, simple_loss=0.2109, pruned_loss=0.01781, over 23717.00 frames. ], tot_loss[loss=0.1346, simple_loss=0.2245, pruned_loss=0.02238, over 1878545.59 frames. ], batch size: 116, lr: 1.14e-02, grad_scale: 64.0
2024-03-09 18:06:23,450 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.95 vs. limit=10.0
2024-03-09 18:07:09,738 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.750e+01 6.712e+01 7.266e+01 7.977e+01 1.080e+02, threshold=1.453e+02, percent-clipped=0.0
2024-03-09 18:07:24,501 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=44120.0, ans=0.125
2024-03-09 18:07:39,996 INFO [train.py:997] (2/4) Epoch 42, batch 150, loss[loss=0.137, simple_loss=0.223, pruned_loss=0.02552, over 19881.00 frames. ], tot_loss[loss=0.1349, simple_loss=0.2254, pruned_loss=0.02216, over 2514217.23 frames. ], batch size: 59, lr: 1.14e-02, grad_scale: 64.0
2024-03-09 18:07:40,262 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=44186.666666666664, ans=0.125
2024-03-09 18:08:31,608 INFO [train.py:997] (2/4) Epoch 43, batch 0, loss[loss=0.1316, simple_loss=0.218, pruned_loss=0.02257, over 24324.00 frames. ], tot_loss[loss=0.1316, simple_loss=0.218, pruned_loss=0.02257, over 24324.00 frames. ], batch size: 208, lr: 1.12e-02, grad_scale: 64.0
2024-03-09 18:08:31,608 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 18:08:41,004 INFO [train.py:1029] (2/4) Epoch 43, validation: loss=0.2134, simple_loss=0.3077, pruned_loss=0.05952, over 452978.00 frames.
2024-03-09 18:08:41,005 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 18:09:23,371 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=44373.333333333336, ans=0.125
2024-03-09 18:09:36,220 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.63 vs. limit=22.5
2024-03-09 18:09:46,538 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=44506.666666666664, ans=0.125
2024-03-09 18:10:01,379 INFO [train.py:997] (2/4) Epoch 43, batch 50, loss[loss=0.1368, simple_loss=0.2225, pruned_loss=0.02552, over 24062.00 frames. ], tot_loss[loss=0.1347, simple_loss=0.2255, pruned_loss=0.02198, over 1072218.91 frames. ], batch size: 176, lr: 1.12e-02, grad_scale: 64.0
2024-03-09 18:10:22,500 INFO [scaling.py:1023] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.55 vs. limit=5.0
2024-03-09 18:10:36,518 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.916e+01 6.864e+01 7.263e+01 8.155e+01 1.054e+02, threshold=1.453e+02, percent-clipped=0.0
2024-03-09 18:10:40,006 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=44706.666666666664, ans=0.2
2024-03-09 18:10:43,434 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0
2024-03-09 18:10:46,091 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=44773.333333333336, ans=0.125
2024-03-09 18:10:46,106 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=44773.333333333336, ans=0.0
2024-03-09 18:10:49,207 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=44773.333333333336, ans=0.05
2024-03-09 18:11:06,743 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.18 vs. limit=6.0
2024-03-09 18:11:19,226 INFO [train.py:997] (2/4) Epoch 43, batch 100, loss[loss=0.1264, simple_loss=0.2111, pruned_loss=0.02089, over 23618.00 frames. ], tot_loss[loss=0.1352, simple_loss=0.2259, pruned_loss=0.02223, over 1893635.62 frames. ], batch size: 128, lr: 1.12e-02, grad_scale: 64.0
2024-03-09 18:11:55,132 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=45040.0, ans=0.125
2024-03-09 18:12:38,760 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.08 vs. limit=12.0
2024-03-09 18:12:40,866 INFO [train.py:997] (2/4) Epoch 43, batch 150, loss[loss=0.1335, simple_loss=0.2295, pruned_loss=0.01875, over 24175.00 frames. ], tot_loss[loss=0.1369, simple_loss=0.2276, pruned_loss=0.02307, over 2526801.06 frames. ], batch size: 366, lr: 1.12e-02, grad_scale: 32.0
2024-03-09 18:13:36,397 INFO [train.py:997] (2/4) Epoch 44, batch 0, loss[loss=0.1362, simple_loss=0.2244, pruned_loss=0.02402, over 24256.00 frames. ], tot_loss[loss=0.1362, simple_loss=0.2244, pruned_loss=0.02402, over 24256.00 frames. ], batch size: 198, lr: 1.10e-02, grad_scale: 32.0
2024-03-09 18:13:36,397 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 18:13:45,433 INFO [train.py:1029] (2/4) Epoch 44, validation: loss=0.2121, simple_loss=0.3064, pruned_loss=0.05891, over 452978.00 frames.
2024-03-09 18:13:45,433 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 18:14:19,829 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.880e+01 6.918e+01 7.525e+01 8.097e+01 1.200e+02, threshold=1.505e+02, percent-clipped=0.0
2024-03-09 18:14:20,284 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=45360.0, ans=0.125
2024-03-09 18:14:40,642 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=45493.333333333336, ans=0.125
2024-03-09 18:14:48,183 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=45493.333333333336, ans=0.1
2024-03-09 18:15:08,394 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=1.043e-02
2024-03-09 18:15:12,600 INFO [train.py:997] (2/4) Epoch 44, batch 50, loss[loss=0.1366, simple_loss=0.2257, pruned_loss=0.0237, over 24213.00 frames. ], tot_loss[loss=0.1363, simple_loss=0.2266, pruned_loss=0.02297, over 1070982.79 frames. ], batch size: 241, lr: 1.10e-02, grad_scale: 32.0
2024-03-09 18:15:22,049 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=45626.666666666664, ans=0.1
2024-03-09 18:15:31,433 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=45693.333333333336, ans=0.125
2024-03-09 18:15:48,257 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=45760.0, ans=0.125
2024-03-09 18:16:00,464 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=45826.666666666664, ans=0.125
2024-03-09 18:16:03,549 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=45826.666666666664, ans=0.0
2024-03-09 18:16:14,469 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=45893.333333333336, ans=0.125
2024-03-09 18:16:30,709 INFO [train.py:997] (2/4) Epoch 44, batch 100, loss[loss=0.129, simple_loss=0.2209, pruned_loss=0.01858, over 24261.00 frames. ], tot_loss[loss=0.1355, simple_loss=0.2263, pruned_loss=0.02232, over 1879870.47 frames. ], batch size: 198, lr: 1.10e-02, grad_scale: 16.0
2024-03-09 18:16:49,127 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=46026.666666666664, ans=0.2
2024-03-09 18:17:01,036 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.671e+01 6.824e+01 7.356e+01 8.103e+01 1.148e+02, threshold=1.471e+02, percent-clipped=0.0
2024-03-09 18:17:09,048 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=46093.333333333336, ans=0.0
2024-03-09 18:17:24,195 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=46160.0, ans=0.125
2024-03-09 18:17:34,646 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=46226.666666666664, ans=0.125
2024-03-09 18:17:41,283 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=46226.666666666664, ans=0.125
2024-03-09 18:17:44,868 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=46226.666666666664, ans=0.125
2024-03-09 18:17:51,570 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.40 vs. limit=15.0
2024-03-09 18:17:51,977 INFO [train.py:997] (2/4) Epoch 44, batch 150, loss[loss=0.1372, simple_loss=0.2233, pruned_loss=0.0256, over 24294.00 frames. ], tot_loss[loss=0.1353, simple_loss=0.227, pruned_loss=0.0218, over 2514145.75 frames. ], batch size: 188, lr: 1.10e-02, grad_scale: 16.0
2024-03-09 18:17:53,657 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=46293.333333333336, ans=0.125
2024-03-09 18:18:43,510 INFO [train.py:997] (2/4) Epoch 45, batch 0, loss[loss=0.1288, simple_loss=0.2152, pruned_loss=0.02118, over 24274.00 frames. ], tot_loss[loss=0.1288, simple_loss=0.2152, pruned_loss=0.02118, over 24274.00 frames. ], batch size: 229, lr: 1.09e-02, grad_scale: 32.0
2024-03-09 18:18:43,510 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 18:18:53,094 INFO [train.py:1029] (2/4) Epoch 45, validation: loss=0.2137, simple_loss=0.3089, pruned_loss=0.05927, over 452978.00 frames.
2024-03-09 18:18:53,095 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27338MB
2024-03-09 18:19:05,780 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=46346.666666666664, ans=0.1
2024-03-09 18:19:23,120 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=22.5
2024-03-09 18:19:32,732 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.85 vs. limit=15.0
2024-03-09 18:19:41,266 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=46546.666666666664, ans=0.1
2024-03-09 18:19:46,926 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=46546.666666666664, ans=0.2
2024-03-09 18:19:57,743 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=46546.666666666664, ans=0.05
2024-03-09 18:20:01,403 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.69 vs. limit=15.0
2024-03-09 18:20:02,439 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=46613.333333333336, ans=0.0007362318840579696
2024-03-09 18:20:16,257 INFO [train.py:997] (2/4) Epoch 45, batch 50, loss[loss=0.1174, simple_loss=0.2083, pruned_loss=0.01326, over 23907.00 frames. ], tot_loss[loss=0.1332, simple_loss=0.2231, pruned_loss=0.02163, over 1066649.49 frames. ], batch size: 142, lr: 1.08e-02, grad_scale: 32.0
2024-03-09 18:20:25,682 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=46680.0, ans=0.1
2024-03-09 18:20:29,931 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.843e+01 6.817e+01 7.386e+01 8.152e+01 1.203e+02, threshold=1.477e+02, percent-clipped=0.0
2024-03-09 18:20:42,431 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=46746.666666666664, ans=0.125
2024-03-09 18:20:45,564 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=46813.333333333336, ans=0.125
2024-03-09 18:20:53,107 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=46813.333333333336, ans=0.125
2024-03-09 18:21:02,448 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=46880.0, ans=0.0
2024-03-09 18:21:35,457 INFO [train.py:997] (2/4) Epoch 45, batch 100, loss[loss=0.1478, simple_loss=0.2458, pruned_loss=0.02492, over 23684.00 frames. ], tot_loss[loss=0.134, simple_loss=0.2252, pruned_loss=0.02144, over 1890075.48 frames. ], batch size: 486, lr: 1.08e-02, grad_scale: 32.0
2024-03-09 18:21:37,217 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=47013.333333333336, ans=0.0
2024-03-09 18:21:53,417 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.55 vs. limit=15.0
2024-03-09 18:22:03,386 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=47080.0, ans=0.2
2024-03-09 18:22:10,049 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.50 vs. limit=22.5
2024-03-09 18:22:16,801 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=47146.666666666664, ans=0.035
2024-03-09 18:22:24,506 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=47213.333333333336, ans=0.125
2024-03-09 18:22:42,566 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.26 vs. limit=15.0
2024-03-09 18:22:46,486 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=47280.0, ans=0.2
2024-03-09 18:22:55,744 INFO [train.py:997] (2/4) Epoch 45, batch 150, loss[loss=0.1664, simple_loss=0.254, pruned_loss=0.03941, over 23212.00 frames. ], tot_loss[loss=0.1359, simple_loss=0.2269, pruned_loss=0.02249, over 2518070.61 frames. ], batch size: 534, lr: 1.08e-02, grad_scale: 16.0
2024-03-09 18:23:43,806 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=47400.0, ans=0.0
2024-03-09 18:23:50,623 INFO [train.py:997] (2/4) Epoch 46, batch 0, loss[loss=0.1432, simple_loss=0.2268, pruned_loss=0.02982, over 24129.00 frames. ], tot_loss[loss=0.1432, simple_loss=0.2268, pruned_loss=0.02982, over 24129.00 frames. ], batch size: 165, lr: 1.07e-02, grad_scale: 16.0
2024-03-09 18:23:50,624 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 18:24:00,486 INFO [train.py:1029] (2/4) Epoch 46, validation: loss=0.2142, simple_loss=0.3085, pruned_loss=0.05997, over 452978.00 frames.
2024-03-09 18:24:00,487 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 28212MB
2024-03-09 18:24:05,179 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.866e+01 6.849e+01 7.495e+01 7.996e+01 1.078e+02, threshold=1.499e+02, percent-clipped=0.0
2024-03-09 18:24:07,569 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=22.5
2024-03-09 18:24:16,641 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.59 vs. limit=15.0
2024-03-09 18:24:29,570 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=47466.666666666664, ans=0.2
2024-03-09 18:25:25,829 INFO [train.py:997] (2/4) Epoch 46, batch 50, loss[loss=0.1487, simple_loss=0.246, pruned_loss=0.02576, over 23774.00 frames. ], tot_loss[loss=0.1351, simple_loss=0.2257, pruned_loss=0.02229, over 1056332.04 frames. ], batch size: 447, lr: 1.07e-02, grad_scale: 16.0
2024-03-09 18:25:39,126 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.67 vs. limit=15.0
2024-03-09 18:26:24,958 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=47933.333333333336, ans=0.0
2024-03-09 18:26:37,848 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=48000.0, ans=0.125
2024-03-09 18:26:40,906 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=48000.0, ans=0.0004347826086956528
2024-03-09 18:26:45,324 INFO [train.py:997] (2/4) Epoch 46, batch 100, loss[loss=0.1348, simple_loss=0.2278, pruned_loss=0.02095, over 24272.00 frames. ], tot_loss[loss=0.1357, simple_loss=0.2265, pruned_loss=0.02243, over 1878666.24 frames. ], batch size: 281, lr: 1.06e-02, grad_scale: 16.0
2024-03-09 18:26:49,978 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.653e+01 6.627e+01 7.164e+01 7.678e+01 1.012e+02, threshold=1.433e+02, percent-clipped=0.0
2024-03-09 18:26:55,813 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.78 vs. limit=10.0
2024-03-09 18:27:27,831 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=48200.0, ans=0.0
2024-03-09 18:27:45,397 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=48266.666666666664, ans=0.125
2024-03-09 18:28:06,133 INFO [train.py:997] (2/4) Epoch 46, batch 150, loss[loss=0.1401, simple_loss=0.2273, pruned_loss=0.02641, over 24049.00 frames. ], tot_loss[loss=0.1358, simple_loss=0.2269, pruned_loss=0.02238, over 2512007.53 frames. ], batch size: 176, lr: 1.06e-02, grad_scale: 16.0
2024-03-09 18:28:07,823 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2024-03-09 18:29:00,570 INFO [train.py:997] (2/4) Epoch 47, batch 0, loss[loss=0.1299, simple_loss=0.2214, pruned_loss=0.0192, over 24297.00 frames. ], tot_loss[loss=0.1299, simple_loss=0.2214, pruned_loss=0.0192, over 24297.00 frames. ], batch size: 281, lr: 1.05e-02, grad_scale: 32.0
2024-03-09 18:29:00,571 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 18:29:10,390 INFO [train.py:1029] (2/4) Epoch 47, validation: loss=0.2152, simple_loss=0.3095, pruned_loss=0.06041, over 452978.00 frames.
2024-03-09 18:29:10,391 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 28212MB
2024-03-09 18:29:11,564 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.94 vs. limit=22.5
2024-03-09 18:29:19,106 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.47 vs. limit=10.0
2024-03-09 18:29:33,277 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=48520.0, ans=0.125
2024-03-09 18:29:42,361 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=48586.666666666664, ans=0.1
2024-03-09 18:30:07,579 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=48653.333333333336, ans=0.0
2024-03-09 18:30:28,055 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.832e+01 6.822e+01 7.253e+01 7.989e+01 1.051e+02, threshold=1.451e+02, percent-clipped=0.0
2024-03-09 18:30:34,259 INFO [train.py:997] (2/4) Epoch 47, batch 50, loss[loss=0.1384, simple_loss=0.2347, pruned_loss=0.02108, over 24005.00 frames. ], tot_loss[loss=0.1334, simple_loss=0.2238, pruned_loss=0.02148, over 1081251.69 frames. ], batch size: 388, lr: 1.05e-02, grad_scale: 16.0
2024-03-09 18:30:36,114 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=48786.666666666664, ans=0.125
2024-03-09 18:30:46,476 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0
2024-03-09 18:30:47,614 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=48786.666666666664, ans=0.125
2024-03-09 18:30:56,769 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=48853.333333333336, ans=0.07
2024-03-09 18:31:02,013 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0
2024-03-09 18:31:18,828 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.03 vs. limit=22.5
2024-03-09 18:31:20,263 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.88 vs. limit=10.0
2024-03-09 18:31:25,803 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff3.min_abs, batch_count=48986.666666666664, ans=0.2
2024-03-09 18:31:53,476 INFO [train.py:997] (2/4) Epoch 47, batch 100, loss[loss=0.1376, simple_loss=0.2362, pruned_loss=0.01955, over 23933.00 frames. ], tot_loss[loss=0.1331, simple_loss=0.2243, pruned_loss=0.02098, over 1884190.11 frames. ], batch size: 387, lr: 1.05e-02, grad_scale: 8.0
2024-03-09 18:32:23,553 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0
2024-03-09 18:32:40,429 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=49253.333333333336, ans=0.125
2024-03-09 18:32:41,758 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=49253.333333333336, ans=0.125
2024-03-09 18:33:02,454 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.48 vs. limit=15.0
2024-03-09 18:33:10,429 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.939e+01 7.123e+01 7.707e+01 8.583e+01 1.160e+02, threshold=1.541e+02, percent-clipped=0.0
2024-03-09 18:33:12,794 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=49386.666666666664, ans=0.125
2024-03-09 18:33:15,548 INFO [train.py:997] (2/4) Epoch 47, batch 150, loss[loss=0.1442, simple_loss=0.2278, pruned_loss=0.03028, over 24055.00 frames. ], tot_loss[loss=0.1334, simple_loss=0.2249, pruned_loss=0.02099, over 2506885.08 frames. ], batch size: 165, lr: 1.05e-02, grad_scale: 8.0
2024-03-09 18:33:15,846 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=49453.333333333336, ans=0.07
2024-03-09 18:34:03,520 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=49506.666666666664, ans=0.125
2024-03-09 18:34:05,688 INFO [train.py:997] (2/4) Epoch 48, batch 0, loss[loss=0.1148, simple_loss=0.1943, pruned_loss=0.01763, over 23890.00 frames. ], tot_loss[loss=0.1148, simple_loss=0.1943, pruned_loss=0.01763, over 23890.00 frames. ], batch size: 117, lr: 1.03e-02, grad_scale: 16.0
2024-03-09 18:34:05,689 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 18:34:15,167 INFO [train.py:1029] (2/4) Epoch 48, validation: loss=0.2149, simple_loss=0.3083, pruned_loss=0.06081, over 452978.00 frames.
2024-03-09 18:34:15,168 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 28212MB
2024-03-09 18:34:46,412 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=49573.333333333336, ans=9.275362318840637e-05
2024-03-09 18:35:08,324 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.65 vs. limit=12.0
2024-03-09 18:35:31,699 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=49773.333333333336, ans=0.0
2024-03-09 18:35:40,446 INFO [train.py:997] (2/4) Epoch 48, batch 50, loss[loss=0.1329, simple_loss=0.2198, pruned_loss=0.02301, over 24057.00 frames. ], tot_loss[loss=0.1334, simple_loss=0.2253, pruned_loss=0.02073, over 1074633.96 frames. ], batch size: 176, lr: 1.03e-02, grad_scale: 16.0
2024-03-09 18:35:52,080 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=22.5
2024-03-09 18:36:16,806 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=49973.333333333336, ans=0.125
2024-03-09 18:36:18,405 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=49973.333333333336, ans=0.07
2024-03-09 18:36:38,196 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=50040.0, ans=0.2
2024-03-09 18:36:39,624 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=50040.0, ans=0.0
2024-03-09 18:36:42,440 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.770e+01 6.729e+01 7.301e+01 8.005e+01 9.735e+01, threshold=1.460e+02, percent-clipped=0.0
2024-03-09 18:36:59,148 INFO [train.py:997] (2/4) Epoch 48, batch 100, loss[loss=0.1308, simple_loss=0.2182, pruned_loss=0.0217, over 24239.00 frames. ], tot_loss[loss=0.1342, simple_loss=0.226, pruned_loss=0.02118, over 1878205.65 frames. ], batch size: 229, lr: 1.03e-02, grad_scale: 16.0
2024-03-09 18:37:18,327 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.08 vs. limit=15.0
2024-03-09 18:37:25,132 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=50240.0, ans=0.125
2024-03-09 18:37:43,216 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=50306.666666666664, ans=0.5
2024-03-09 18:37:49,335 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=50373.333333333336, ans=0.1
2024-03-09 18:38:16,390 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=12.0
2024-03-09 18:38:17,397 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=50440.0, ans=0.0
2024-03-09 18:38:20,108 INFO [train.py:997] (2/4) Epoch 48, batch 150, loss[loss=0.1177, simple_loss=0.2043, pruned_loss=0.01552, over 23931.00 frames. ], tot_loss[loss=0.1333, simple_loss=0.2249, pruned_loss=0.02085, over 2510086.46 frames. ], batch size: 142, lr: 1.03e-02, grad_scale: 8.0
2024-03-09 18:38:28,107 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=50506.666666666664, ans=0.07
2024-03-09 18:38:28,138 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=50506.666666666664, ans=0.0
2024-03-09 18:39:15,082 INFO [train.py:997] (2/4) Epoch 49, batch 0, loss[loss=0.1342, simple_loss=0.2, pruned_loss=0.03417, over 23983.00 frames. ], tot_loss[loss=0.1342, simple_loss=0.2, pruned_loss=0.03417, over 23983.00 frames. ], batch size: 142, lr: 1.02e-02, grad_scale: 16.0
2024-03-09 18:39:15,083 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 18:39:24,777 INFO [train.py:1029] (2/4) Epoch 49, validation: loss=0.2171, simple_loss=0.31, pruned_loss=0.06203, over 452978.00 frames.
2024-03-09 18:39:24,778 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 28212MB
2024-03-09 18:39:43,422 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=50560.0, ans=0.035
2024-03-09 18:39:45,477 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. limit=10.0
2024-03-09 18:39:49,595 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=50626.666666666664, ans=0.2
2024-03-09 18:39:50,271 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.68 vs. limit=22.5
2024-03-09 18:39:50,991 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=50626.666666666664, ans=0.125
2024-03-09 18:40:01,326 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.22 vs. limit=15.0
2024-03-09 18:40:22,423 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=50760.0, ans=0.1
2024-03-09 18:40:23,493 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.861e+01 6.914e+01 7.599e+01 8.430e+01 1.205e+02, threshold=1.520e+02, percent-clipped=0.0
2024-03-09 18:40:51,380 INFO [train.py:997] (2/4) Epoch 49, batch 50, loss[loss=0.1187, simple_loss=0.2148, pruned_loss=0.01126, over 22871.00 frames. ], tot_loss[loss=0.1338, simple_loss=0.2248, pruned_loss=0.0214, over 1065684.76 frames. ], batch size: 609, lr: 1.02e-02, grad_scale: 16.0
2024-03-09 18:40:58,397 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.69 vs. limit=6.0
2024-03-09 18:41:06,977 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=50960.0, ans=0.125
2024-03-09 18:41:11,068 INFO [scaling.py:1023] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.51 vs. limit=5.0
2024-03-09 18:41:19,399 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=50960.0, ans=0.0
2024-03-09 18:41:19,451 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff2.min_abs, batch_count=50960.0, ans=0.1
2024-03-09 18:41:20,993 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=51026.666666666664, ans=0.125
2024-03-09 18:41:23,461 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=15.0
2024-03-09 18:42:10,663 INFO [train.py:997] (2/4) Epoch 49, batch 100, loss[loss=0.1376, simple_loss=0.2243, pruned_loss=0.02545, over 24069.00 frames. ], tot_loss[loss=0.1337, simple_loss=0.2248, pruned_loss=0.02128, over 1887734.01 frames. ], batch size: 176, lr: 1.01e-02, grad_scale: 8.0
2024-03-09 18:42:14,001 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=51226.666666666664, ans=0.125
2024-03-09 18:42:19,178 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=12.0
2024-03-09 18:42:54,560 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.71 vs. limit=22.5
2024-03-09 18:42:58,424 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=51426.666666666664, ans=0.1
2024-03-09 18:43:04,158 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.035e+01 6.804e+01 7.380e+01 7.884e+01 1.078e+02, threshold=1.476e+02, percent-clipped=0.0
2024-03-09 18:43:12,040 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=51426.666666666664, ans=0.0
2024-03-09 18:43:30,601 INFO [train.py:997] (2/4) Epoch 49, batch 150, loss[loss=0.1402, simple_loss=0.2252, pruned_loss=0.02761, over 24089.00 frames. ], tot_loss[loss=0.1341, simple_loss=0.2252, pruned_loss=0.02148, over 2524757.47 frames. ], batch size: 176, lr: 1.01e-02, grad_scale: 8.0
2024-03-09 18:43:34,367 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=51560.0, ans=0.125
2024-03-09 18:44:22,349 INFO [train.py:997] (2/4) Epoch 50, batch 0, loss[loss=0.1177, simple_loss=0.2032, pruned_loss=0.01613, over 23802.00 frames. ], tot_loss[loss=0.1177, simple_loss=0.2032, pruned_loss=0.01613, over 23802.00 frames. ], batch size: 117, lr: 1.00e-02, grad_scale: 16.0
2024-03-09 18:44:22,349 INFO [train.py:1020] (2/4) Computing validation loss
2024-03-09 18:44:31,468 INFO [zipformer.py:1858] (2/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.1928, 4.4109, 4.2274, 2.9856], device='cuda:2')
2024-03-09 18:44:31,920 INFO [train.py:1029] (2/4) Epoch 50, validation: loss=0.2164, simple_loss=0.3113, pruned_loss=0.06071, over 452978.00 frames.
2024-03-09 18:44:31,921 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 28212MB
2024-03-09 18:44:41,739 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=51613.333333333336, ans=0.09899494936611666
2024-03-09 18:45:19,022 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=51746.666666666664, ans=0.0
2024-03-09 18:45:19,178 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=51746.666666666664, ans=0.2
2024-03-09 18:45:29,903 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=51813.333333333336, ans=0.0
2024-03-09 18:45:40,648 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=51880.0, ans=0.0
2024-03-09 18:45:40,665 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=51880.0, ans=0.1
2024-03-09 18:45:57,119 INFO [train.py:997] (2/4) Epoch 50, batch 50, loss[loss=0.1315, simple_loss=0.2219, pruned_loss=0.0205, over 24266.00 frames. ], tot_loss[loss=0.1328, simple_loss=0.2238, pruned_loss=0.02086, over 1076247.40 frames. ], batch size: 254, lr: 1.00e-02, grad_scale: 8.0
2024-03-09 18:46:37,165 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.908e+01 6.888e+01 7.222e+01 7.907e+01 1.090e+02, threshold=1.444e+02, percent-clipped=0.0
2024-03-09 18:47:12,660 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=52280.0, ans=0.2
2024-03-09 18:47:13,806 INFO [train.py:997] (2/4) Epoch 50, batch 100, loss[loss=0.1378, simple_loss=0.2215, pruned_loss=0.02701, over 23908.00 frames. ], tot_loss[loss=0.1339, simple_loss=0.2252, pruned_loss=0.02123, over 1891086.62 frames. ], batch size: 153, lr: 9.99e-03, grad_scale: 8.0
2024-03-09 18:47:18,834 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=52280.0, ans=0.125
2024-03-09 18:47:37,302 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=52346.666666666664, ans=0.0
2024-03-09 18:47:41,672 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=2.550e-03
2024-03-09 18:47:52,411 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=52413.333333333336, ans=0.2
2024-03-09 18:47:53,903 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=52413.333333333336, ans=0.125
2024-03-09 18:47:55,375 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=52413.333333333336, ans=0.1
2024-03-09 18:47:55,407 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=52413.333333333336, ans=0.2
2024-03-09 18:47:56,878 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=52413.333333333336, ans=0.125
2024-03-09 18:48:02,070 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.18 vs. limit=15.0
2024-03-09 18:48:10,607 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=52480.0, ans=0.125
2024-03-09 18:48:19,523 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=52546.666666666664, ans=0.0
2024-03-09 18:48:21,154 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=52546.666666666664, ans=0.125
2024-03-09 18:48:24,807 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.76 vs. limit=22.5
2024-03-09 18:48:27,106 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=52546.666666666664, ans=0.0
2024-03-09 18:48:36,460 INFO [train.py:997] (2/4) Epoch 50, batch 150, loss[loss=0.1327, simple_loss=0.2285, pruned_loss=0.01848, over 24183.00 frames. ], tot_loss[loss=0.1342, simple_loss=0.2258, pruned_loss=0.02127, over 2521109.74 frames. ], batch size: 345, lr: 9.97e-03, grad_scale: 8.0
2024-03-09 18:48:45,967 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=52613.333333333336, ans=0.0
2024-03-09 18:48:48,730 INFO [train.py:1248] (2/4) Done!