2024-03-09 12:56:09,174 INFO [train.py:1065] (2/4) Training started 2024-03-09 12:56:09,175 INFO [train.py:1075] (2/4) Device: cuda:2 2024-03-09 12:56:09,272 INFO [lexicon.py:168] (2/4) Loading pre-compiled data/lang_char/Linv.pt 2024-03-09 12:56:09,334 INFO [train.py:1086] (2/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '2989b0b1186fa6022932804f5b39fbb2781ebf42', 'k2-git-date': 'Fri Nov 24 11:34:10 2023', 'lhotse-version': '1.22.0.dev+git.d8ed1bbb.dirty', 'torch-version': '1.11.0+cu102', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.9', 'icefall-git-branch': 'dev/mdcc', 'icefall-git-sha1': 'f62fc7f0-clean', 'icefall-git-date': 'Sat Mar 9 12:55:42 2024', 'icefall-path': '/star-home/jinzengrui/lib/miniconda3/envs/dev39/lib/python3.9/site-packages/icefall-1.0-py3.9.egg', 'k2-path': '/star-home/jinzengrui/lib/miniconda3/envs/dev39/lib/python3.9/site-packages/k2-1.24.4.dev20231207+cuda10.2.torch1.11.0-py3.9-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/star-home/jinzengrui/lib/miniconda3/envs/dev39/lib/python3.9/site-packages/lhotse-1.22.0.dev0+git.d8ed1bbb.dirty-py3.9.egg/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-2-1207150844-f49d8c4f4-c49d5', 'IP address': '10.177.22.19'}, 'world_size': 4, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp'), 'lang_dir': PosixPath('data/lang_char'), 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 1, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 1000, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'blank_id': 0, 'vocab_size': 4852} 2024-03-09 12:56:09,334 INFO [train.py:1088] (2/4) About to create model 2024-03-09 12:56:10,020 INFO [train.py:1092] (2/4) Number of model parameters: 74470867 2024-03-09 12:56:14,943 INFO [train.py:1107] (2/4) Using DDP 2024-03-09 12:56:15,513 INFO [asr_datamodule.py:368] (2/4) About to get train cuts 2024-03-09 12:56:15,622 INFO [asr_datamodule.py:376] (2/4) About to get valid cuts 2024-03-09 12:56:15,640 INFO [asr_datamodule.py:195] (2/4) About to get Musan cuts 2024-03-09 12:56:18,246 INFO [asr_datamodule.py:200] (2/4) Enable MUSAN 2024-03-09 12:56:18,246 INFO [asr_datamodule.py:223] (2/4) Enable 
SpecAugment 2024-03-09 12:56:18,247 INFO [asr_datamodule.py:224] (2/4) Time warp factor: 80 2024-03-09 12:56:18,247 INFO [asr_datamodule.py:234] (2/4) Num frame mask: 10 2024-03-09 12:56:18,247 INFO [asr_datamodule.py:247] (2/4) About to create train dataset 2024-03-09 12:56:18,247 INFO [asr_datamodule.py:273] (2/4) Using DynamicBucketingSampler. 2024-03-09 12:56:19,042 INFO [asr_datamodule.py:290] (2/4) About to create train dataloader 2024-03-09 12:56:19,042 INFO [asr_datamodule.py:315] (2/4) About to create dev dataset 2024-03-09 12:56:19,376 INFO [asr_datamodule.py:332] (2/4) About to create dev dataloader 2024-03-09 12:57:18,478 INFO [train.py:997] (2/4) Epoch 1, batch 0, loss[loss=10.4, simple_loss=9.484, pruned_loss=9.14, over 24046.00 frames. ], tot_loss[loss=10.4, simple_loss=9.484, pruned_loss=9.14, over 24046.00 frames. ], batch size: 388, lr: 2.25e-02, grad_scale: 1.0 2024-03-09 12:57:18,478 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 12:57:28,776 INFO [train.py:1029] (2/4) Epoch 1, validation: loss=10.41, simple_loss=9.49, pruned_loss=9.134, over 452978.00 frames. 2024-03-09 12:57:28,777 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 25142MB 2024-03-09 12:57:36,668 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=0.0, ans=0.5 2024-03-09 12:57:38,369 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=0.0, ans=0.2 2024-03-09 12:57:40,682 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=0.0, ans=7.5 2024-03-09 12:57:45,949 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.68 vs. limit=7.525 2024-03-09 12:57:49,078 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=66.66666666666667, ans=0.496875 2024-03-09 12:57:52,241 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.247e+03 5.651e+03 5.908e+03 6.903e+03 6.981e+03, threshold=2.363e+04, percent-clipped=0.0 2024-03-09 12:57:57,029 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=462.90 vs. limit=5.033333333333333 2024-03-09 12:57:57,038 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.39 vs. limit=3.01 2024-03-09 12:57:59,978 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=66.66666666666667, ans=0.29933333333333334 2024-03-09 12:58:10,357 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.724e+03 3.453e+03 5.651e+03 6.615e+03 7.215e+03, threshold=2.260e+04, percent-clipped=0.0 2024-03-09 12:58:25,770 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=64.19 vs. limit=4.08 2024-03-09 12:58:26,076 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=370.61 vs. 
limit=7.65 2024-03-09 12:58:38,413 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=497.65 vs. limit=7.575 2024-03-09 12:58:43,622 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=347.69 vs. limit=7.6 2024-03-09 12:58:46,100 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 9.817e+02 1.921e+03 2.306e+03 5.651e+03 7.215e+03, threshold=9.223e+03, percent-clipped=0.0 2024-03-09 12:58:52,979 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=216.05 vs. limit=7.6 2024-03-09 12:58:53,830 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=266.6666666666667, ans=0.20400000000000001 2024-03-09 12:58:54,614 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=79.29 vs. limit=4.1066666666666665 2024-03-09 12:58:54,978 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=64.16 vs. limit=7.7 2024-03-09 12:58:59,106 INFO [train.py:997] (2/4) Epoch 1, batch 50, loss[loss=1.18, simple_loss=1.056, pruned_loss=1.115, over 24079.00 frames. ], tot_loss[loss=3.852, simple_loss=3.546, pruned_loss=3.001, over 1077413.50 frames. ], batch size: 176, lr: 2.48e-02, grad_scale: 0.25 2024-03-09 12:59:05,089 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=333.3333333333333, ans=0.484375 2024-03-09 12:59:05,990 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=202.94 vs. limit=7.625 2024-03-09 12:59:16,332 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=345.47 vs. limit=7.65 2024-03-09 12:59:20,979 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=400.0, ans=0.48125 2024-03-09 12:59:27,793 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=400.0, ans=0.5 2024-03-09 12:59:42,679 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=151.93 vs. limit=7.675 2024-03-09 12:59:42,989 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=83.97 vs. limit=7.675 2024-03-09 12:59:44,783 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=30.12 vs. limit=7.675 2024-03-09 12:59:49,252 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=466.6666666666667, ans=0.8836666666666667 2024-03-09 12:59:51,690 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=468.54 vs. 
limit=7.675 2024-03-09 12:59:54,004 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.57 vs. limit=7.85 2024-03-09 12:59:55,556 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=487.69 vs. limit=7.9 2024-03-09 13:00:07,344 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=533.3333333333334, ans=0.29466666666666663 2024-03-09 13:00:10,822 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=533.3333333333334, ans=0.29466666666666663 2024-03-09 13:00:29,154 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=600.0, ans=7.725 2024-03-09 13:00:31,787 INFO [train.py:997] (2/4) Epoch 1, batch 100, loss[loss=0.9652, simple_loss=0.833, pruned_loss=1.053, over 23607.00 frames. ], tot_loss[loss=2.326, simple_loss=2.118, pruned_loss=1.941, over 1893932.56 frames. ], batch size: 128, lr: 2.70e-02, grad_scale: 0.5 2024-03-09 13:00:36,109 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.96 vs. limit=4.266666666666667 2024-03-09 13:00:37,048 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.046e+01 9.193e+01 2.011e+02 2.156e+03 7.215e+03, threshold=4.023e+02, percent-clipped=0.0 2024-03-09 13:00:38,088 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=336.58 vs. limit=8.0 2024-03-09 13:00:38,410 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=55.78 vs. limit=7.75 2024-03-09 13:00:48,762 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=46.73 vs. limit=7.775 2024-03-09 13:00:59,190 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=7.24 vs. limit=4.293333333333333 2024-03-09 13:01:00,284 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=733.3333333333334, ans=5.458333333333333 2024-03-09 13:01:04,510 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=29.96 vs. limit=5.183333333333334 2024-03-09 13:01:20,618 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=196.82 vs. limit=7.8 2024-03-09 13:01:23,434 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=22.86 vs. limit=7.8 2024-03-09 13:01:35,713 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.56 vs. 
limit=3.13 2024-03-09 13:01:52,148 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=933.3333333333334, ans=7.85 2024-03-09 13:01:55,704 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=237.97 vs. limit=5.466666666666667 2024-03-09 13:02:01,014 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=29.91 vs. limit=7.85 2024-03-09 13:02:01,713 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1000.0, ans=0.046875 2024-03-09 13:02:02,803 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.63 vs. limit=8.25 2024-03-09 13:02:03,199 INFO [train.py:997] (2/4) Epoch 1, batch 150, loss[loss=0.8802, simple_loss=0.7437, pruned_loss=0.9803, over 24058.00 frames. ], tot_loss[loss=1.764, simple_loss=1.587, pruned_loss=1.559, over 2533491.52 frames. ], batch size: 142, lr: 2.93e-02, grad_scale: 0.5 2024-03-09 13:02:12,787 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=358.19 vs. limit=7.875 2024-03-09 13:03:01,368 INFO [train.py:997] (2/4) Epoch 2, batch 0, loss[loss=0.8932, simple_loss=0.7547, pruned_loss=0.9826, over 20610.00 frames. ], tot_loss[loss=0.8932, simple_loss=0.7547, pruned_loss=0.9826, over 20610.00 frames. ], batch size: 62, lr: 2.91e-02, grad_scale: 1.0 2024-03-09 13:03:01,369 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 13:03:11,802 INFO [train.py:1029] (2/4) Epoch 2, validation: loss=0.9516, simple_loss=0.8161, pruned_loss=0.9787, over 452978.00 frames. 2024-03-09 13:03:11,802 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27666MB 2024-03-09 13:03:14,969 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=139.96 vs. limit=7.895 2024-03-09 13:03:30,772 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=38.78 vs. limit=5.5600000000000005 2024-03-09 13:03:32,120 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=88.73 vs. limit=7.92 2024-03-09 13:03:32,608 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=14.89 vs. limit=7.92 2024-03-09 13:03:36,069 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=20.34 vs. limit=7.92 2024-03-09 13:03:36,090 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=186.49 vs. limit=7.92 2024-03-09 13:03:41,093 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=6.04 vs. 
limit=4.448 2024-03-09 13:03:45,858 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1186.6666666666667, ans=0.444375 2024-03-09 13:03:46,492 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=23.45 vs. limit=5.593333333333334 2024-03-09 13:03:54,513 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1186.6666666666667, ans=0.3516666666666667 2024-03-09 13:04:03,187 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1253.3333333333333, ans=0.153 2024-03-09 13:04:04,277 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=23.58 vs. limit=7.97 2024-03-09 13:04:09,167 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=91.40 vs. limit=7.97 2024-03-09 13:04:11,041 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.97 vs. limit=8.44 2024-03-09 13:04:13,161 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.07 vs. limit=3.188 2024-03-09 13:04:20,014 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=89.16 vs. limit=8.44 2024-03-09 13:04:24,621 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1320.0, ans=0.438125 2024-03-09 13:04:31,668 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.992e+01 8.885e+01 1.035e+02 1.288e+02 2.193e+02, threshold=2.069e+02, percent-clipped=0.0 2024-03-09 13:04:41,297 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=1386.6666666666667, ans=0.14800000000000002 2024-03-09 13:04:42,728 INFO [train.py:997] (2/4) Epoch 2, batch 50, loss[loss=0.9425, simple_loss=0.8092, pruned_loss=0.9044, over 23983.00 frames. ], tot_loss[loss=0.9012, simple_loss=0.77, pruned_loss=0.9107, over 1065250.05 frames. ], batch size: 416, lr: 3.13e-02, grad_scale: 1.0 2024-03-09 13:04:42,910 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1386.6666666666667, ans=0.435 2024-03-09 13:04:45,597 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=164.54 vs. limit=8.54 2024-03-09 13:04:46,569 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1386.6666666666667, ans=0.0688 2024-03-09 13:04:50,703 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=14.22 vs. 
limit=8.02 2024-03-09 13:05:04,270 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=1453.3333333333333, ans=0.8491333333333334 2024-03-09 13:05:19,034 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1520.0, ans=0.0658 2024-03-09 13:05:30,111 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.67 vs. limit=5.38 2024-03-09 13:05:33,489 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=15.41 vs. limit=8.07 2024-03-09 13:05:41,665 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=1586.6666666666667, ans=0.0643 2024-03-09 13:05:42,496 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=13.51 vs. limit=5.3966666666666665 2024-03-09 13:05:57,804 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.21 vs. limit=8.12 2024-03-09 13:05:58,013 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=25.74 vs. limit=5.826666666666666 2024-03-09 13:06:14,150 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=63.88 vs. limit=8.12 2024-03-09 13:06:16,170 INFO [train.py:997] (2/4) Epoch 2, batch 100, loss[loss=0.7818, simple_loss=0.6622, pruned_loss=0.7504, over 23947.00 frames. ], tot_loss[loss=0.8733, simple_loss=0.7452, pruned_loss=0.8588, over 1885914.07 frames. ], batch size: 142, lr: 3.35e-02, grad_scale: 2.0 2024-03-09 13:06:23,390 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1720.0, ans=0.0613 2024-03-09 13:06:31,446 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.76 vs. limit=8.145 2024-03-09 13:06:34,736 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=40.18 vs. limit=8.84 2024-03-09 13:06:36,931 INFO [scaling.py:1023] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=9.16 vs. limit=4.357333333333333 2024-03-09 13:06:42,721 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1786.6666666666667, ans=0.0598 2024-03-09 13:06:54,526 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1853.3333333333333, ans=0.2683333333333333 2024-03-09 13:07:03,924 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.58 vs. limit=8.89 2024-03-09 13:07:25,565 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1920.0, ans=0.41000000000000003 2024-03-09 13:07:28,699 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=15.58 vs. 
limit=8.245 2024-03-09 13:07:28,951 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=8.73 vs. limit=4.794666666666667 2024-03-09 13:07:34,361 INFO [scaling.py:1023] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.95 vs. limit=4.397333333333333 2024-03-09 13:07:37,186 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=33.25 vs. limit=8.245 2024-03-09 13:07:37,966 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.386e+01 8.999e+01 1.029e+02 1.187e+02 2.200e+02, threshold=2.058e+02, percent-clipped=1.0 2024-03-09 13:07:44,328 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.35 vs. limit=8.99 2024-03-09 13:07:46,563 INFO [train.py:997] (2/4) Epoch 2, batch 150, loss[loss=0.8352, simple_loss=0.7115, pruned_loss=0.7474, over 24162.00 frames. ], tot_loss[loss=0.8593, simple_loss=0.7321, pruned_loss=0.8235, over 2518296.24 frames. ], batch size: 345, lr: 3.57e-02, grad_scale: 2.0 2024-03-09 13:07:47,589 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=98.20 vs. limit=8.27 2024-03-09 13:07:51,015 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.30 vs. limit=9.040000000000001 2024-03-09 13:07:57,405 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=31.42 vs. limit=8.27 2024-03-09 13:08:38,025 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2106.6666666666665, ans=0.8262666666666667 2024-03-09 13:08:44,925 INFO [train.py:997] (2/4) Epoch 3, batch 0, loss[loss=0.7551, simple_loss=0.6376, pruned_loss=0.6924, over 23767.00 frames. ], tot_loss[loss=0.7551, simple_loss=0.6376, pruned_loss=0.6924, over 23767.00 frames. ], batch size: 117, lr: 3.42e-02, grad_scale: 4.0 2024-03-09 13:08:44,925 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 13:08:54,189 INFO [train.py:1029] (2/4) Epoch 3, validation: loss=0.8556, simple_loss=0.7313, pruned_loss=0.7513, over 452978.00 frames. 2024-03-09 13:08:54,190 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27741MB 2024-03-09 13:08:55,405 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=14.02 vs. limit=8.29 2024-03-09 13:09:00,490 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=31.51 vs. limit=8.29 2024-03-09 13:09:01,688 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2106.6666666666665, ans=0.2366666666666667 2024-03-09 13:09:19,691 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=2173.3333333333335, ans=6.086666666666667 2024-03-09 13:09:21,442 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=22.17 vs. 
limit=8.315 2024-03-09 13:09:35,494 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.49 vs. limit=6.12 2024-03-09 13:09:45,268 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=2306.6666666666665, ans=0.04279166666666667 2024-03-09 13:09:50,502 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=2306.6666666666665, ans=0.391875 2024-03-09 13:09:51,391 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.62 vs. limit=9.23 2024-03-09 13:09:54,788 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2306.6666666666665, ans=0.391875 2024-03-09 13:09:55,942 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=26.05 vs. limit=6.153333333333333 2024-03-09 13:10:02,415 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2306.6666666666665, ans=0.391875 2024-03-09 13:10:14,675 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2373.3333333333335, ans=0.38875 2024-03-09 13:10:17,163 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=77.79 vs. limit=8.39 2024-03-09 13:10:19,014 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.39 vs. limit=9.28 2024-03-09 13:10:25,315 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2440.0, ans=0.1085 2024-03-09 13:10:26,721 INFO [train.py:997] (2/4) Epoch 3, batch 50, loss[loss=0.8558, simple_loss=0.7284, pruned_loss=0.7325, over 23624.00 frames. ], tot_loss[loss=0.7935, simple_loss=0.6736, pruned_loss=0.6989, over 1070423.23 frames. ], batch size: 485, lr: 3.63e-02, grad_scale: 4.0 2024-03-09 13:10:39,389 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2440.0, ans=0.042375 2024-03-09 13:10:39,965 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=25.40 vs. limit=8.415 2024-03-09 13:10:40,378 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.35 vs. limit=8.415 2024-03-09 13:10:42,777 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2506.6666666666665, ans=0.8122666666666667 2024-03-09 13:10:45,207 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.54 vs. limit=5.626666666666667 2024-03-09 13:10:45,343 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=26.22 vs. 
limit=8.44 2024-03-09 13:10:48,794 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=81.34 vs. limit=8.44 2024-03-09 13:10:50,470 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=51.80 vs. limit=8.44 2024-03-09 13:10:54,083 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=7.63 vs. limit=5.002666666666666 2024-03-09 13:11:06,065 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.84 vs. limit=9.43 2024-03-09 13:11:06,357 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.80 vs. limit=5.6433333333333335 2024-03-09 13:11:11,598 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=12.28 vs. limit=8.465 2024-03-09 13:11:16,189 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.81 vs. limit=9.43 2024-03-09 13:11:17,536 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2573.3333333333335, ans=0.5 2024-03-09 13:11:26,677 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=53.02 vs. limit=8.49 2024-03-09 13:11:26,702 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=11.77 vs. limit=5.66 2024-03-09 13:11:34,502 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 9.376e+01 1.355e+02 1.829e+02 2.456e+02 5.542e+02, threshold=3.657e+02, percent-clipped=39.0 2024-03-09 13:11:46,117 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=18.82 vs. limit=8.515 2024-03-09 13:11:49,074 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.49 vs. limit=5.676666666666667 2024-03-09 13:11:54,281 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.31 vs. limit=5.082666666666666 2024-03-09 13:11:54,739 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.04 vs. limit=9.53 2024-03-09 13:11:56,949 INFO [train.py:997] (2/4) Epoch 3, batch 100, loss[loss=0.7081, simple_loss=0.6114, pruned_loss=0.5578, over 22668.00 frames. ], tot_loss[loss=0.7698, simple_loss=0.6566, pruned_loss=0.6538, over 1884258.44 frames. 
], batch size: 85, lr: 3.84e-02, grad_scale: 8.0 2024-03-09 13:11:57,837 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=2773.3333333333335, ans=8.54 2024-03-09 13:12:02,442 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2773.3333333333335, ans=0.37 2024-03-09 13:12:12,596 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=2840.0, ans=0.366875 2024-03-09 13:12:17,198 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2840.0, ans=0.03609999999999999 2024-03-09 13:12:26,130 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.95 vs. limit=9.629999999999999 2024-03-09 13:12:46,284 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.74 vs. limit=8.59 2024-03-09 13:12:50,667 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.59 vs. limit=5.743333333333333 2024-03-09 13:12:52,082 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.75 vs. limit=9.73 2024-03-09 13:13:12,134 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=3040.0, ans=0.35750000000000004 2024-03-09 13:13:17,880 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=17.20 vs. limit=8.64 2024-03-09 13:13:19,581 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.58 vs. limit=8.64 2024-03-09 13:13:25,268 INFO [train.py:997] (2/4) Epoch 3, batch 150, loss[loss=0.582, simple_loss=0.5184, pruned_loss=0.4002, over 22884.00 frames. ], tot_loss[loss=0.721, simple_loss=0.6207, pruned_loss=0.5842, over 2514364.60 frames. ], batch size: 609, lr: 4.05e-02, grad_scale: 8.0 2024-03-09 13:13:28,622 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=3106.6666666666665, ans=0.035 2024-03-09 13:14:18,419 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=10.25 vs. limit=6.58 2024-03-09 13:14:27,474 INFO [train.py:997] (2/4) Epoch 4, batch 0, loss[loss=0.6524, simple_loss=0.5752, pruned_loss=0.4639, over 23792.00 frames. ], tot_loss[loss=0.6524, simple_loss=0.5752, pruned_loss=0.4639, over 23792.00 frames. ], batch size: 447, lr: 3.82e-02, grad_scale: 16.0 2024-03-09 13:14:27,475 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 13:14:37,770 INFO [train.py:1029] (2/4) Epoch 4, validation: loss=0.515, simple_loss=0.4763, pruned_loss=0.3039, over 452978.00 frames. 2024-03-09 13:14:37,770 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27741MB 2024-03-09 13:14:58,613 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.52 vs. 
limit=8.71 2024-03-09 13:15:20,250 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=3293.3333333333335, ans=0.2670666666666667 2024-03-09 13:15:25,207 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=3293.3333333333335, ans=0.07649999999999998 2024-03-09 13:15:31,397 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 2.775e+02 3.449e+02 4.262e+02 1.233e+03, threshold=6.899e+02, percent-clipped=36.0 2024-03-09 13:15:34,099 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=1.92 vs. limit=3.504 2024-03-09 13:15:34,120 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=3360.0, ans=8.76 2024-03-09 13:15:38,530 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=3360.0, ans=0.26639999999999997 2024-03-09 13:15:44,220 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.76 vs. limit=6.68 2024-03-09 13:15:50,846 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.79 vs. limit=8.785 2024-03-09 13:15:56,796 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=3426.6666666666665, ans=0.022900000000000004 2024-03-09 13:16:06,508 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.64 vs. limit=10.120000000000001 2024-03-09 13:16:07,034 INFO [train.py:997] (2/4) Epoch 4, batch 50, loss[loss=0.5105, simple_loss=0.4645, pruned_loss=0.318, over 21338.00 frames. ], tot_loss[loss=0.5297, simple_loss=0.4777, pruned_loss=0.3439, over 1064411.35 frames. ], batch size: 714, lr: 3.92e-02, grad_scale: 8.0 2024-03-09 13:16:22,479 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=3560.0, ans=0.26439999999999997 2024-03-09 13:16:29,411 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=3560.0, ans=0.333125 2024-03-09 13:16:44,287 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-03-09 13:16:45,054 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.12 vs. limit=8.86 2024-03-09 13:16:45,278 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.88 vs. 
limit=5.906666666666666 2024-03-09 13:16:47,711 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=3626.6666666666665, ans=0.21373333333333333 2024-03-09 13:16:50,962 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=3626.6666666666665, ans=0.32999999999999996 2024-03-09 13:16:57,421 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=3693.3333333333335, ans=0.326875 2024-03-09 13:17:12,172 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.62 vs. limit=5.477333333333333 2024-03-09 13:17:15,516 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=10.75 vs. limit=10.32 2024-03-09 13:17:29,705 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=3760.0, ans=0.32375 2024-03-09 13:17:32,789 INFO [train.py:997] (2/4) Epoch 4, batch 100, loss[loss=0.5181, simple_loss=0.4756, pruned_loss=0.3092, over 23706.00 frames. ], tot_loss[loss=0.4924, simple_loss=0.4509, pruned_loss=0.3004, over 1885774.29 frames. ], batch size: 486, lr: 3.92e-02, grad_scale: 8.0 2024-03-09 13:17:47,987 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=3893.3333333333335, ans=0.05399999999999999 2024-03-09 13:18:10,242 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.99 vs. limit=6.98 2024-03-09 13:18:26,598 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.478e+02 2.209e+02 2.728e+02 3.814e+02 7.926e+02, threshold=5.455e+02, percent-clipped=1.0 2024-03-09 13:18:49,075 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.29 vs. limit=9.035 2024-03-09 13:18:57,700 INFO [train.py:997] (2/4) Epoch 4, batch 150, loss[loss=0.4517, simple_loss=0.4256, pruned_loss=0.2429, over 24103.00 frames. ], tot_loss[loss=0.4583, simple_loss=0.4252, pruned_loss=0.2651, over 2520663.07 frames. ], batch size: 366, lr: 3.91e-02, grad_scale: 8.0 2024-03-09 13:19:02,779 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=4160.0, ans=0.0 2024-03-09 13:19:02,877 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=4160.0, ans=0.009965217391304348 2024-03-09 13:19:56,259 INFO [train.py:997] (2/4) Epoch 5, batch 0, loss[loss=0.3738, simple_loss=0.3634, pruned_loss=0.1765, over 24265.00 frames. ], tot_loss[loss=0.3738, simple_loss=0.3634, pruned_loss=0.1765, over 24265.00 frames. ], batch size: 254, lr: 3.65e-02, grad_scale: 16.0 2024-03-09 13:19:56,259 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 13:20:05,952 INFO [train.py:1029] (2/4) Epoch 5, validation: loss=0.3626, simple_loss=0.3682, pruned_loss=0.1368, over 452978.00 frames. 
2024-03-09 13:20:05,953 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27741MB 2024-03-09 13:20:36,806 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.13 vs. limit=6.07 2024-03-09 13:21:01,363 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.66 vs. limit=9.155 2024-03-09 13:21:25,406 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.94 vs. limit=6.12 2024-03-09 13:21:30,487 INFO [train.py:997] (2/4) Epoch 5, batch 50, loss[loss=0.4358, simple_loss=0.4151, pruned_loss=0.2249, over 23806.00 frames. ], tot_loss[loss=0.3699, simple_loss=0.3599, pruned_loss=0.1748, over 1054742.96 frames. ], batch size: 447, lr: 3.64e-02, grad_scale: 8.0 2024-03-09 13:22:05,756 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.99 vs. limit=7.34 2024-03-09 13:22:09,206 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.206e+02 1.970e+02 2.387e+02 3.231e+02 6.932e+02, threshold=4.775e+02, percent-clipped=2.0 2024-03-09 13:22:16,113 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=4680.0, ans=0.035375000000000004 2024-03-09 13:22:22,924 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=4746.666666666667, ans=0.27749999999999997 2024-03-09 13:22:22,994 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=4746.666666666667, ans=0.27749999999999997 2024-03-09 13:22:23,459 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=11.06 2024-03-09 13:22:24,646 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=4746.666666666667, ans=0.00983768115942029 2024-03-09 13:22:29,373 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=4746.666666666667, ans=0.035166666666666666 2024-03-09 13:22:37,534 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=4813.333333333333, ans=0.27437500000000004 2024-03-09 13:22:39,121 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=1.080e+01 2024-03-09 13:22:55,077 INFO [train.py:997] (2/4) Epoch 5, batch 100, loss[loss=0.3469, simple_loss=0.3465, pruned_loss=0.1486, over 24166.00 frames. ], tot_loss[loss=0.3621, simple_loss=0.3549, pruned_loss=0.1668, over 1864646.60 frames. ], batch size: 345, lr: 3.64e-02, grad_scale: 8.0 2024-03-09 13:23:07,445 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.50 vs. 
limit=6.22 2024-03-09 13:23:14,425 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=4946.666666666667, ans=0.25053333333333333 2024-03-09 13:23:16,703 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.61 vs. limit=6.236666666666666 2024-03-09 13:23:17,811 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=4946.666666666667, ans=0.25053333333333333 2024-03-09 13:23:25,742 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=4946.666666666667, ans=0.268125 2024-03-09 13:23:38,291 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=5013.333333333333, ans=0.7245333333333334 2024-03-09 13:23:44,887 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=5080.0, ans=0.2492 2024-03-09 13:24:19,172 INFO [train.py:997] (2/4) Epoch 5, batch 150, loss[loss=0.334, simple_loss=0.337, pruned_loss=0.1392, over 21518.00 frames. ], tot_loss[loss=0.357, simple_loss=0.3521, pruned_loss=0.1613, over 2483490.76 frames. ], batch size: 718, lr: 3.64e-02, grad_scale: 8.0 2024-03-09 13:25:15,972 INFO [train.py:997] (2/4) Epoch 6, batch 0, loss[loss=0.3153, simple_loss=0.324, pruned_loss=0.1214, over 24146.00 frames. ], tot_loss[loss=0.3153, simple_loss=0.324, pruned_loss=0.1214, over 24146.00 frames. ], batch size: 240, lr: 3.40e-02, grad_scale: 16.0 2024-03-09 13:25:15,973 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 13:25:26,276 INFO [train.py:1029] (2/4) Epoch 6, validation: loss=0.3173, simple_loss=0.3385, pruned_loss=0.1003, over 452978.00 frames. 2024-03-09 13:25:26,277 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27741MB 2024-03-09 13:26:01,729 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.162e+02 1.753e+02 2.102e+02 2.732e+02 4.816e+02, threshold=4.205e+02, percent-clipped=1.0 2024-03-09 13:26:02,009 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=5333.333333333333, ans=0.25 2024-03-09 13:26:13,969 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.42 vs. limit=6.35 2024-03-09 13:26:24,640 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=5466.666666666667, ans=0.24375000000000002 2024-03-09 13:26:37,313 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=5466.666666666667, ans=0.24375000000000002 2024-03-09 13:26:38,902 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=5533.333333333333, ans=0.24062499999999998 2024-03-09 13:26:50,712 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.95 vs. limit=9.575 2024-03-09 13:26:56,051 INFO [train.py:997] (2/4) Epoch 6, batch 50, loss[loss=0.3083, simple_loss=0.3195, pruned_loss=0.1171, over 24237.00 frames. 
], tot_loss[loss=0.3186, simple_loss=0.3257, pruned_loss=0.127, over 1070587.86 frames. ], batch size: 267, lr: 3.40e-02, grad_scale: 16.0 2024-03-09 13:27:12,589 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=5666.666666666667, ans=0.00963768115942029 2024-03-09 13:27:22,883 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.21 vs. limit=9.625 2024-03-09 13:27:23,742 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=5666.666666666667, ans=0.234375 2024-03-09 13:27:36,541 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=5733.333333333333, ans=0.23125 2024-03-09 13:27:52,400 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=5800.0, ans=0.22812500000000002 2024-03-09 13:27:58,588 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=5866.666666666667, ans=0.28800000000000003 2024-03-09 13:28:14,534 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=5866.666666666667, ans=0.0 2024-03-09 13:28:17,535 INFO [train.py:997] (2/4) Epoch 6, batch 100, loss[loss=0.3112, simple_loss=0.3219, pruned_loss=0.1217, over 24103.00 frames. ], tot_loss[loss=0.3111, simple_loss=0.321, pruned_loss=0.1207, over 1883765.31 frames. ], batch size: 344, lr: 3.40e-02, grad_scale: 8.0 2024-03-09 13:28:47,229 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.031e+02 1.395e+02 1.660e+02 2.447e+02 5.591e+02, threshold=3.319e+02, percent-clipped=4.0 2024-03-09 13:29:09,534 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=6133.333333333333, ans=0.21250000000000002 2024-03-09 13:29:12,754 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=6133.333333333333, ans=0.21250000000000002 2024-03-09 13:29:40,034 INFO [train.py:997] (2/4) Epoch 6, batch 150, loss[loss=0.2993, simple_loss=0.3126, pruned_loss=0.115, over 22609.00 frames. ], tot_loss[loss=0.3083, simple_loss=0.3197, pruned_loss=0.1187, over 2497902.59 frames. ], batch size: 85, lr: 3.39e-02, grad_scale: 8.0 2024-03-09 13:30:37,228 INFO [train.py:997] (2/4) Epoch 7, batch 0, loss[loss=0.2721, simple_loss=0.2933, pruned_loss=0.09182, over 23126.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.2933, pruned_loss=0.09182, over 23126.00 frames. ], batch size: 102, lr: 3.18e-02, grad_scale: 16.0 2024-03-09 13:30:37,229 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 13:30:47,279 INFO [train.py:1029] (2/4) Epoch 7, validation: loss=0.2933, simple_loss=0.3253, pruned_loss=0.08566, over 452978.00 frames. 
2024-03-09 13:30:47,280 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27741MB 2024-03-09 13:30:52,386 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=6320.0, ans=0.060500000000000005 2024-03-09 13:31:12,516 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=6386.666666666667, ans=8.991666666666667 2024-03-09 13:31:18,193 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.14 vs. limit=9.895 2024-03-09 13:32:15,711 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.01 vs. limit=6.661333333333333 2024-03-09 13:32:16,202 INFO [train.py:997] (2/4) Epoch 7, batch 50, loss[loss=0.3491, simple_loss=0.356, pruned_loss=0.1493, over 23720.00 frames. ], tot_loss[loss=0.2839, simple_loss=0.3037, pruned_loss=0.1005, over 1066470.45 frames. ], batch size: 486, lr: 3.18e-02, grad_scale: 16.0 2024-03-09 13:32:18,152 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=6653.333333333333, ans=0.6671333333333334 2024-03-09 13:32:26,413 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=6653.333333333333, ans=0.188125 2024-03-09 13:32:30,830 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.025e+02 1.360e+02 1.605e+02 1.865e+02 3.683e+02, threshold=3.211e+02, percent-clipped=2.0 2024-03-09 13:32:41,518 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.40 vs. limit=10.02 2024-03-09 13:33:05,732 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=6853.333333333333, ans=0.009379710144927536 2024-03-09 13:33:06,398 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.51 vs. limit=12.64 2024-03-09 13:33:11,979 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=6853.333333333333, ans=0.17875000000000002 2024-03-09 13:33:23,878 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.86 vs. limit=6.73 2024-03-09 13:33:37,025 INFO [train.py:997] (2/4) Epoch 7, batch 100, loss[loss=0.2848, simple_loss=0.31, pruned_loss=0.09813, over 22871.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3025, pruned_loss=0.09813, over 1877186.27 frames. ], batch size: 609, lr: 3.18e-02, grad_scale: 16.0 2024-03-09 13:33:52,658 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=6986.666666666667, ans=0.23013333333333333 2024-03-09 13:34:05,079 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=7053.333333333333, ans=0.0 2024-03-09 13:34:05,607 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.38 vs. 
limit=10.145 2024-03-09 13:34:12,780 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-03-09 13:34:33,121 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=7186.666666666667, ans=0.16312500000000002 2024-03-09 13:34:58,865 INFO [train.py:997] (2/4) Epoch 7, batch 150, loss[loss=0.2438, simple_loss=0.2698, pruned_loss=0.08091, over 23960.00 frames. ], tot_loss[loss=0.2763, simple_loss=0.3001, pruned_loss=0.09514, over 2516167.23 frames. ], batch size: 142, lr: 3.18e-02, grad_scale: 16.0 2024-03-09 13:35:57,461 INFO [train.py:997] (2/4) Epoch 8, batch 0, loss[loss=0.2485, simple_loss=0.2816, pruned_loss=0.07489, over 23738.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.2816, pruned_loss=0.07489, over 23738.00 frames. ], batch size: 128, lr: 2.99e-02, grad_scale: 32.0 2024-03-09 13:35:57,461 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 13:36:07,343 INFO [train.py:1029] (2/4) Epoch 8, validation: loss=0.2797, simple_loss=0.3212, pruned_loss=0.07915, over 452978.00 frames. 2024-03-09 13:36:07,344 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27741MB 2024-03-09 13:36:08,858 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.023e+02 1.314e+02 1.638e+02 1.955e+02 4.296e+02, threshold=3.277e+02, percent-clipped=3.0 2024-03-09 13:36:09,735 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=6.949333333333334 2024-03-09 13:36:40,623 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=7506.666666666667, ans=0.148125 2024-03-09 13:36:50,860 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.96 vs. limit=10.315 2024-03-09 13:37:03,473 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=7573.333333333333, ans=0.14500000000000002 2024-03-09 13:37:20,987 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=10.365 2024-03-09 13:37:22,078 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=7640.0, ans=0.034833333333333334 2024-03-09 13:37:31,064 INFO [train.py:997] (2/4) Epoch 8, batch 50, loss[loss=0.2473, simple_loss=0.2822, pruned_loss=0.07526, over 24037.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.291, pruned_loss=0.08554, over 1066333.17 frames. ], batch size: 165, lr: 2.99e-02, grad_scale: 32.0 2024-03-09 13:37:46,988 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=7773.333333333333, ans=0.03427777777777778 2024-03-09 13:38:01,667 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.81 vs. limit=6.96 2024-03-09 13:38:32,029 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=7973.333333333333, ans=0.6209333333333333 2024-03-09 13:38:51,042 INFO [train.py:997] (2/4) Epoch 8, batch 100, loss[loss=0.2407, simple_loss=0.282, pruned_loss=0.06815, over 21572.00 frames. 
], tot_loss[loss=0.2578, simple_loss=0.29, pruned_loss=0.08363, over 1877418.19 frames. ], batch size: 718, lr: 2.99e-02, grad_scale: 32.0 2024-03-09 13:38:52,572 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 8.761e+01 1.115e+02 1.336e+02 1.652e+02 2.844e+02, threshold=2.672e+02, percent-clipped=0.0 2024-03-09 13:39:58,376 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.38 vs. limit=13.73 2024-03-09 13:40:11,308 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=8373.333333333334, ans=0.125 2024-03-09 13:40:12,948 INFO [train.py:997] (2/4) Epoch 8, batch 150, loss[loss=0.2504, simple_loss=0.288, pruned_loss=0.0795, over 24186.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.289, pruned_loss=0.08197, over 2512627.10 frames. ], batch size: 217, lr: 2.99e-02, grad_scale: 16.0 2024-03-09 13:40:19,538 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=8373.333333333334, ans=10.0 2024-03-09 13:41:11,727 INFO [train.py:997] (2/4) Epoch 9, batch 0, loss[loss=0.2417, simple_loss=0.28, pruned_loss=0.07502, over 24241.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.28, pruned_loss=0.07502, over 24241.00 frames. ], batch size: 311, lr: 2.83e-02, grad_scale: 32.0 2024-03-09 13:41:11,727 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 13:41:21,825 INFO [train.py:1029] (2/4) Epoch 9, validation: loss=0.2624, simple_loss=0.312, pruned_loss=0.07326, over 452978.00 frames. 2024-03-09 13:41:21,826 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27741MB 2024-03-09 13:42:00,284 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=8560.0, ans=0.21439999999999998 2024-03-09 13:42:15,251 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.90 vs. limit=10.71 2024-03-09 13:42:19,111 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=8626.666666666666, ans=0.125 2024-03-09 13:42:36,633 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=8693.333333333334, ans=10.76 2024-03-09 13:42:41,969 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 9.295e+01 1.084e+02 1.217e+02 1.477e+02 3.480e+02, threshold=2.433e+02, percent-clipped=5.0 2024-03-09 13:42:51,113 INFO [train.py:997] (2/4) Epoch 9, batch 50, loss[loss=0.24, simple_loss=0.2823, pruned_loss=0.07305, over 24278.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.2837, pruned_loss=0.0751, over 1076636.12 frames. 
], batch size: 267, lr: 2.83e-02, grad_scale: 32.0 2024-03-09 13:43:02,248 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=8760.0, ans=0.030166666666666668 2024-03-09 13:43:06,770 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8826.666666666666, ans=0.21173333333333333 2024-03-09 13:43:35,168 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=8893.333333333334, ans=10.835 2024-03-09 13:43:36,163 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=8960.0, ans=0.008921739130434782 2024-03-09 13:43:40,840 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=8960.0, ans=0.125 2024-03-09 13:43:47,087 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=8960.0, ans=0.125 2024-03-09 13:43:59,358 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=9026.666666666666, ans=0.02905555555555556 2024-03-09 13:44:05,456 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=9026.666666666666, ans=0.02905555555555556 2024-03-09 13:44:05,565 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=9026.666666666666, ans=0.125 2024-03-09 13:44:08,282 INFO [train.py:997] (2/4) Epoch 9, batch 100, loss[loss=0.2396, simple_loss=0.2868, pruned_loss=0.07101, over 23197.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.2831, pruned_loss=0.07467, over 1881942.97 frames. ], batch size: 102, lr: 2.83e-02, grad_scale: 32.0 2024-03-09 13:44:30,282 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=9160.0, ans=0.05 2024-03-09 13:44:49,903 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=9226.666666666666, ans=0.125 2024-03-09 13:44:50,558 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.38 vs. 
limit=14.42 2024-03-09 13:44:59,141 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=9293.333333333334, ans=0.20706666666666665 2024-03-09 13:45:02,055 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=9293.333333333334, ans=0.125 2024-03-09 13:45:03,649 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=9293.333333333334, ans=0.20706666666666665 2024-03-09 13:45:08,554 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=9293.333333333334, ans=0.125 2024-03-09 13:45:10,085 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=9293.333333333334, ans=0.125 2024-03-09 13:45:20,413 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 8.522e+01 1.120e+02 1.341e+02 1.607e+02 2.660e+02, threshold=2.681e+02, percent-clipped=5.0 2024-03-09 13:45:28,942 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=9426.666666666666, ans=10.0 2024-03-09 13:45:30,105 INFO [train.py:997] (2/4) Epoch 9, batch 150, loss[loss=0.2463, simple_loss=0.2946, pruned_loss=0.07586, over 23994.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.2829, pruned_loss=0.07397, over 2511688.82 frames. ], batch size: 388, lr: 2.82e-02, grad_scale: 32.0 2024-03-09 13:45:37,431 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=4.414 2024-03-09 13:46:27,249 INFO [train.py:997] (2/4) Epoch 10, batch 0, loss[loss=0.2246, simple_loss=0.2758, pruned_loss=0.06344, over 24267.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2758, pruned_loss=0.06344, over 24267.00 frames. ], batch size: 281, lr: 2.69e-02, grad_scale: 32.0 2024-03-09 13:46:27,249 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 13:46:37,027 INFO [train.py:1029] (2/4) Epoch 10, validation: loss=0.2538, simple_loss=0.3122, pruned_loss=0.07122, over 452978.00 frames. 2024-03-09 13:46:37,028 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27741MB 2024-03-09 13:46:58,787 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=9546.666666666666, ans=0.15453333333333336 2024-03-09 13:47:06,542 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=9546.666666666666, ans=0.125 2024-03-09 13:47:31,818 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=9680.0, ans=0.125 2024-03-09 13:47:50,474 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=9746.666666666666, ans=0.5588666666666667 2024-03-09 13:48:02,512 INFO [train.py:997] (2/4) Epoch 10, batch 50, loss[loss=0.2193, simple_loss=0.2715, pruned_loss=0.06266, over 23901.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2746, pruned_loss=0.06561, over 1077376.33 frames. 
], batch size: 153, lr: 2.68e-02, grad_scale: 32.0 2024-03-09 13:48:09,941 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.79 vs. limit=14.86 2024-03-09 13:48:23,239 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=9880.0, ans=0.125 2024-03-09 13:48:31,455 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.94 vs. limit=11.205 2024-03-09 13:48:32,541 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=9946.666666666666, ans=0.20053333333333334 2024-03-09 13:48:32,560 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=9946.666666666666, ans=0.125 2024-03-09 13:48:32,569 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=9946.666666666666, ans=0.5518666666666667 2024-03-09 13:48:37,222 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=9946.666666666666, ans=0.125 2024-03-09 13:48:41,863 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=9946.666666666666, ans=0.125 2024-03-09 13:48:58,431 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 8.812e+01 1.075e+02 1.246e+02 1.479e+02 2.668e+02, threshold=2.491e+02, percent-clipped=0.0 2024-03-09 13:48:58,845 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=10013.333333333334, ans=0.125 2024-03-09 13:49:21,969 INFO [train.py:997] (2/4) Epoch 10, batch 100, loss[loss=0.2098, simple_loss=0.2608, pruned_loss=0.0615, over 24268.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.277, pruned_loss=0.06728, over 1896580.95 frames. ], batch size: 229, lr: 2.68e-02, grad_scale: 32.0 2024-03-09 13:49:52,744 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=10213.333333333334, ans=0.19786666666666666 2024-03-09 13:49:55,210 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.79 vs. limit=15.16 2024-03-09 13:50:05,182 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=10280.0, ans=0.125 2024-03-09 13:50:08,785 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.85 vs. limit=15.21 2024-03-09 13:50:29,471 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=10413.333333333334, ans=10.0 2024-03-09 13:50:34,704 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=7.52 vs. limit=8.165333333333333 2024-03-09 13:50:43,580 INFO [train.py:997] (2/4) Epoch 10, batch 150, loss[loss=0.2098, simple_loss=0.2669, pruned_loss=0.05921, over 24250.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2761, pruned_loss=0.06652, over 2523729.39 frames. 
], batch size: 241, lr: 2.68e-02, grad_scale: 32.0 2024-03-09 13:50:52,575 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=11.43 2024-03-09 13:50:53,053 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=10480.0, ans=0.2 2024-03-09 13:51:41,262 INFO [train.py:997] (2/4) Epoch 11, batch 0, loss[loss=0.2089, simple_loss=0.2665, pruned_loss=0.05877, over 23186.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2665, pruned_loss=0.05877, over 23186.00 frames. ], batch size: 102, lr: 2.56e-02, grad_scale: 32.0 2024-03-09 13:51:41,262 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 13:51:51,065 INFO [train.py:1029] (2/4) Epoch 11, validation: loss=0.2397, simple_loss=0.3066, pruned_loss=0.06689, over 452978.00 frames. 2024-03-09 13:51:51,066 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27741MB 2024-03-09 13:52:00,485 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=10533.333333333334, ans=0.125 2024-03-09 13:52:27,658 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=10666.666666666666, ans=0.125 2024-03-09 13:52:38,411 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 8.688e+01 1.049e+02 1.183e+02 1.464e+02 2.170e+02, threshold=2.365e+02, percent-clipped=0.0 2024-03-09 13:52:45,933 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=10733.333333333334, ans=0.361 2024-03-09 13:52:53,703 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=10733.333333333334, ans=0.02194444444444444 2024-03-09 13:53:18,217 INFO [train.py:997] (2/4) Epoch 11, batch 50, loss[loss=0.2747, simple_loss=0.316, pruned_loss=0.105, over 23230.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2693, pruned_loss=0.06002, over 1066360.00 frames. ], batch size: 534, lr: 2.56e-02, grad_scale: 32.0 2024-03-09 13:53:30,530 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.51 vs. limit=7.716666666666667 2024-03-09 13:53:31,149 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-03-09 13:53:32,705 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=10933.333333333334, ans=0.07 2024-03-09 13:53:33,244 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.67 vs. limit=8.373333333333335 2024-03-09 13:53:52,319 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=11000.0, ans=0.125 2024-03-09 13:54:00,983 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.92 vs. 
limit=11.625 2024-03-09 13:54:06,227 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=11066.666666666666, ans=0.5126666666666667 2024-03-09 13:54:25,326 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=11133.333333333334, ans=0.05 2024-03-09 13:54:38,150 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.67 vs. limit=11.7 2024-03-09 13:54:38,693 INFO [train.py:997] (2/4) Epoch 11, batch 100, loss[loss=0.2101, simple_loss=0.2691, pruned_loss=0.06318, over 24232.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2699, pruned_loss=0.06027, over 1882714.71 frames. ], batch size: 188, lr: 2.55e-02, grad_scale: 32.0 2024-03-09 13:54:48,186 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=11200.0, ans=0.188 2024-03-09 13:54:55,313 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.76 vs. limit=10.633333333333333 2024-03-09 13:55:12,352 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=11333.333333333334, ans=0.008405797101449276 2024-03-09 13:55:20,288 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=9.34 vs. limit=10.666666666666668 2024-03-09 13:55:23,197 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.81 vs. limit=16.0 2024-03-09 13:55:23,737 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 8.186e+01 9.979e+01 1.131e+02 1.409e+02 2.515e+02, threshold=2.263e+02, percent-clipped=1.0 2024-03-09 13:55:43,366 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=11466.666666666666, ans=0.125 2024-03-09 13:55:58,239 INFO [train.py:997] (2/4) Epoch 11, batch 150, loss[loss=0.185, simple_loss=0.2496, pruned_loss=0.04936, over 23988.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.27, pruned_loss=0.05909, over 2511139.77 frames. ], batch size: 142, lr: 2.55e-02, grad_scale: 32.0 2024-03-09 13:56:55,633 INFO [train.py:997] (2/4) Epoch 12, batch 0, loss[loss=0.1729, simple_loss=0.2344, pruned_loss=0.04574, over 23883.00 frames. ], tot_loss[loss=0.1729, simple_loss=0.2344, pruned_loss=0.04574, over 23883.00 frames. ], batch size: 117, lr: 2.45e-02, grad_scale: 32.0 2024-03-09 13:56:55,634 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 13:57:05,243 INFO [train.py:1029] (2/4) Epoch 12, validation: loss=0.2325, simple_loss=0.3061, pruned_loss=0.06737, over 452978.00 frames. 
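The ScheduledFloat entries record hyperparameters that are re-evaluated as a function of batch_count. The dropout_p values in this stretch are consistent with a linear ramp, e.g. ans=0.188 at batch_count=11200 just above matches 0.3 - 1e-5 * batch_count, and entries of the same kind sit at a 0.1 floor once batch_count passes 20000 further down the log. A minimal sketch of such a piecewise-linear schedule; the class name and the (0, 0.3) -> (20000, 0.1) breakpoints are inferred from the logged values, not taken from the recipe source:

```python
import bisect

class PiecewiseLinearSchedule:
    """Sketch of a ScheduledFloat-style value: linear between (batch_count, value)
    breakpoints, clamped to the end values outside the given range."""

    def __init__(self, *points):
        self.xs = [x for x, _ in points]
        self.ys = [y for _, y in points]

    def __call__(self, batch_count):
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# Breakpoints inferred from the log, not from the recipe:
dropout_p = PiecewiseLinearSchedule((0.0, 0.3), (20000.0, 0.1))
print(dropout_p(11200.0))  # 0.188, matching the logged value above
print(dropout_p(21013.0))  # 0.1, the clamped floor seen later in the log
```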
2024-03-09 13:57:05,244 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27741MB 2024-03-09 13:57:05,637 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=11586.666666666666, ans=0.125 2024-03-09 13:57:27,388 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=11653.333333333334, ans=0.018111111111111106 2024-03-09 13:57:45,805 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=11720.0, ans=0.008321739130434783 2024-03-09 13:57:47,488 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=11720.0, ans=0.4898 2024-03-09 13:58:28,235 INFO [train.py:997] (2/4) Epoch 12, batch 50, loss[loss=0.2646, simple_loss=0.3149, pruned_loss=0.09997, over 23253.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2695, pruned_loss=0.05896, over 1075415.22 frames. ], batch size: 534, lr: 2.44e-02, grad_scale: 32.0 2024-03-09 13:58:28,610 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=11920.0, ans=0.1808 2024-03-09 13:58:41,002 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=11920.0, ans=0.008278260869565218 2024-03-09 13:58:52,200 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=11986.666666666666, ans=0.125 2024-03-09 13:58:55,243 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=11986.666666666666, ans=0.125 2024-03-09 13:58:59,608 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 8.173e+01 9.982e+01 1.112e+02 1.363e+02 2.435e+02, threshold=2.224e+02, percent-clipped=1.0 2024-03-09 13:59:15,991 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=8.847999999999999 2024-03-09 13:59:28,350 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.61 vs. limit=12.045 2024-03-09 13:59:28,676 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=12.045 2024-03-09 13:59:41,593 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=12186.666666666666, ans=0.01588888888888889 2024-03-09 13:59:49,534 INFO [train.py:997] (2/4) Epoch 12, batch 100, loss[loss=0.2017, simple_loss=0.276, pruned_loss=0.05663, over 24115.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2669, pruned_loss=0.05603, over 1895115.22 frames. ], batch size: 366, lr: 2.44e-02, grad_scale: 32.0 2024-03-09 13:59:57,822 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=12253.333333333334, ans=0.125 2024-03-09 14:00:05,457 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=12320.0, ans=0.125 2024-03-09 14:00:15,378 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.64 vs. 
limit=8.08 2024-03-09 14:00:59,636 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=12520.0, ans=0.17479999999999998 2024-03-09 14:01:09,133 INFO [train.py:997] (2/4) Epoch 12, batch 150, loss[loss=0.2138, simple_loss=0.2874, pruned_loss=0.06543, over 23832.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2672, pruned_loss=0.05651, over 2524385.02 frames. ], batch size: 447, lr: 2.44e-02, grad_scale: 32.0 2024-03-09 14:02:05,582 INFO [train.py:997] (2/4) Epoch 13, batch 0, loss[loss=0.163, simple_loss=0.2399, pruned_loss=0.03865, over 23539.00 frames. ], tot_loss[loss=0.163, simple_loss=0.2399, pruned_loss=0.03865, over 23539.00 frames. ], batch size: 128, lr: 2.34e-02, grad_scale: 32.0 2024-03-09 14:02:05,582 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 14:02:18,480 INFO [train.py:1029] (2/4) Epoch 13, validation: loss=0.2245, simple_loss=0.307, pruned_loss=0.06618, over 452978.00 frames. 2024-03-09 14:02:18,481 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27741MB 2024-03-09 14:02:21,963 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=12640.0, ans=0.17359999999999998 2024-03-09 14:02:21,973 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=12640.0, ans=0.05 2024-03-09 14:02:23,696 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=12640.0, ans=0.125 2024-03-09 14:02:29,254 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.96 vs. limit=8.16 2024-03-09 14:02:37,306 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 8.720e+01 1.064e+02 1.199e+02 1.343e+02 2.089e+02, threshold=2.398e+02, percent-clipped=0.0 2024-03-09 14:02:50,260 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=12706.666666666666, ans=0.125 2024-03-09 14:03:04,330 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=12773.333333333334, ans=0.0 2024-03-09 14:03:42,229 INFO [train.py:997] (2/4) Epoch 13, batch 50, loss[loss=0.165, simple_loss=0.2439, pruned_loss=0.04081, over 22940.00 frames. ], tot_loss[loss=0.1824, simple_loss=0.2585, pruned_loss=0.04996, over 1080463.11 frames. ], batch size: 609, lr: 2.34e-02, grad_scale: 32.0 2024-03-09 14:04:18,088 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.04 vs. limit=12.415 2024-03-09 14:04:23,455 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=13106.666666666666, ans=0.125 2024-03-09 14:05:00,485 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.83 vs. limit=12.465 2024-03-09 14:05:04,142 INFO [train.py:997] (2/4) Epoch 13, batch 100, loss[loss=0.1865, simple_loss=0.2624, pruned_loss=0.05523, over 22684.00 frames. ], tot_loss[loss=0.1818, simple_loss=0.2598, pruned_loss=0.04988, over 1886542.74 frames. 
], batch size: 85, lr: 2.34e-02, grad_scale: 32.0 2024-03-09 14:05:05,844 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=13306.666666666666, ans=0.125 2024-03-09 14:05:08,343 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=12.49 2024-03-09 14:05:12,574 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=13306.666666666666, ans=0.0 2024-03-09 14:05:17,918 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.55 vs. limit=17.48 2024-03-09 14:05:24,893 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 8.077e+01 1.017e+02 1.138e+02 1.327e+02 1.773e+02, threshold=2.276e+02, percent-clipped=0.0 2024-03-09 14:05:25,221 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=13373.333333333334, ans=0.16626666666666667 2024-03-09 14:05:28,219 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=13373.333333333334, ans=0.00796231884057971 2024-03-09 14:05:34,346 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=13373.333333333334, ans=0.125 2024-03-09 14:05:40,423 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=13440.0, ans=0.125 2024-03-09 14:05:58,476 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=13506.666666666666, ans=0.16493333333333335 2024-03-09 14:06:06,119 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=13506.666666666666, ans=0.42726666666666674 2024-03-09 14:06:09,064 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=13573.333333333334, ans=0.125 2024-03-09 14:06:22,602 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=13573.333333333334, ans=0.16426666666666667 2024-03-09 14:06:25,349 INFO [train.py:997] (2/4) Epoch 13, batch 150, loss[loss=0.2237, simple_loss=0.2986, pruned_loss=0.07434, over 23714.00 frames. ], tot_loss[loss=0.1822, simple_loss=0.2611, pruned_loss=0.05045, over 2523499.69 frames. ], batch size: 486, lr: 2.34e-02, grad_scale: 32.0 2024-03-09 14:06:27,165 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=13640.0, ans=0.0 2024-03-09 14:06:31,627 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=13640.0, ans=0.4046 2024-03-09 14:07:22,757 INFO [train.py:997] (2/4) Epoch 14, batch 0, loss[loss=0.1712, simple_loss=0.2541, pruned_loss=0.04408, over 24183.00 frames. ], tot_loss[loss=0.1712, simple_loss=0.2541, pruned_loss=0.04408, over 24183.00 frames. 
], batch size: 217, lr: 2.25e-02, grad_scale: 32.0 2024-03-09 14:07:22,758 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 14:07:32,050 INFO [train.py:1029] (2/4) Epoch 14, validation: loss=0.2172, simple_loss=0.3059, pruned_loss=0.06427, over 452978.00 frames. 2024-03-09 14:07:32,051 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27741MB 2024-03-09 14:07:46,490 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=13760.0, ans=0.00933333333333334 2024-03-09 14:07:59,306 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=13760.0, ans=0.0 2024-03-09 14:08:31,948 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=13893.333333333334, ans=0.125 2024-03-09 14:08:50,524 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=13960.0, ans=0.125 2024-03-09 14:08:53,266 INFO [train.py:997] (2/4) Epoch 14, batch 50, loss[loss=0.1685, simple_loss=0.2499, pruned_loss=0.04349, over 24262.00 frames. ], tot_loss[loss=0.1732, simple_loss=0.2547, pruned_loss=0.04587, over 1060121.13 frames. ], batch size: 254, lr: 2.25e-02, grad_scale: 32.0 2024-03-09 14:08:54,305 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.32 vs. limit=18.02 2024-03-09 14:08:59,478 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 8.009e+01 1.028e+02 1.152e+02 1.303e+02 2.373e+02, threshold=2.304e+02, percent-clipped=1.0 2024-03-09 14:09:04,351 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=14026.666666666666, ans=0.125 2024-03-09 14:09:12,665 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=14093.333333333334, ans=0.15906666666666666 2024-03-09 14:09:14,315 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=14093.333333333334, ans=0.007805797101449276 2024-03-09 14:09:20,478 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-03-09 14:09:40,732 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2.whitening_limit, batch_count=14226.666666666666, ans=12.113333333333333 2024-03-09 14:09:41,625 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=14226.666666666666, ans=0.125 2024-03-09 14:10:01,470 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=14293.333333333334, ans=0.07 2024-03-09 14:10:10,148 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.26 vs. limit=18.22 2024-03-09 14:10:12,226 INFO [train.py:997] (2/4) Epoch 14, batch 100, loss[loss=0.1564, simple_loss=0.2482, pruned_loss=0.03233, over 21440.00 frames. ], tot_loss[loss=0.1738, simple_loss=0.2557, pruned_loss=0.04601, over 1869860.31 frames. 
], batch size: 718, lr: 2.25e-02, grad_scale: 32.0 2024-03-09 14:10:12,598 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2024-03-09 14:10:38,138 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=14426.666666666666, ans=0.0065555555555555575 2024-03-09 14:11:10,196 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=14560.0, ans=0.1544 2024-03-09 14:11:26,606 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=14626.666666666666, ans=0.007689855072463768 2024-03-09 14:11:34,643 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=14693.333333333334, ans=0.15306666666666666 2024-03-09 14:11:35,855 INFO [train.py:997] (2/4) Epoch 14, batch 150, loss[loss=0.1697, simple_loss=0.2547, pruned_loss=0.04239, over 24268.00 frames. ], tot_loss[loss=0.1759, simple_loss=0.2578, pruned_loss=0.04699, over 2498936.47 frames. ], batch size: 241, lr: 2.25e-02, grad_scale: 32.0 2024-03-09 14:11:41,695 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 7.629e+01 9.664e+01 1.070e+02 1.194e+02 2.380e+02, threshold=2.140e+02, percent-clipped=1.0 2024-03-09 14:12:33,828 INFO [train.py:997] (2/4) Epoch 15, batch 0, loss[loss=0.1713, simple_loss=0.2526, pruned_loss=0.04504, over 23227.00 frames. ], tot_loss[loss=0.1713, simple_loss=0.2526, pruned_loss=0.04504, over 23227.00 frames. ], batch size: 102, lr: 2.17e-02, grad_scale: 32.0 2024-03-09 14:12:33,829 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 14:12:40,405 INFO [zipformer.py:1858] (2/4) name=encoder.encoders.3.encoder.layers.3.self_attn_weights, attn_weights_entropy = tensor([1.0329, 1.4877, 1.3723, 1.3578, 1.3659, 1.3121, 1.3058, 1.3389], device='cuda:2') 2024-03-09 14:12:43,268 INFO [train.py:1029] (2/4) Epoch 15, validation: loss=0.2144, simple_loss=0.3029, pruned_loss=0.06295, over 452978.00 frames. 2024-03-09 14:12:43,269 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27741MB 2024-03-09 14:13:09,712 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.98 vs. limit=13.055 2024-03-09 14:13:52,947 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=15013.333333333334, ans=0.3745333333333333 2024-03-09 14:14:02,107 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=15013.333333333334, ans=0.125 2024-03-09 14:14:04,892 INFO [train.py:997] (2/4) Epoch 15, batch 50, loss[loss=0.1694, simple_loss=0.2501, pruned_loss=0.04434, over 22921.00 frames. ], tot_loss[loss=0.1749, simple_loss=0.2565, pruned_loss=0.04665, over 1066873.64 frames. ], batch size: 101, lr: 2.17e-02, grad_scale: 32.0 2024-03-09 14:14:08,311 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=15080.0, ans=0.09899494936611666 2024-03-09 14:14:13,864 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.69 vs. 
limit=13.155000000000001 2024-03-09 14:14:15,855 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=15080.0, ans=0.1492 2024-03-09 14:15:06,063 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.10 vs. limit=13.23 2024-03-09 14:15:09,998 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=15346.666666666666, ans=0.125 2024-03-09 14:15:19,052 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 8.102e+01 1.026e+02 1.164e+02 1.400e+02 2.237e+02, threshold=2.327e+02, percent-clipped=1.0 2024-03-09 14:15:27,121 INFO [train.py:997] (2/4) Epoch 15, batch 100, loss[loss=0.2066, simple_loss=0.2897, pruned_loss=0.06171, over 23701.00 frames. ], tot_loss[loss=0.1762, simple_loss=0.2584, pruned_loss=0.04699, over 1883018.28 frames. ], batch size: 486, lr: 2.17e-02, grad_scale: 32.0 2024-03-09 14:15:30,523 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=15413.333333333334, ans=0.125 2024-03-09 14:15:32,531 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.35 vs. limit=13.280000000000001 2024-03-09 14:15:42,651 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=15480.0, ans=0.125 2024-03-09 14:15:56,208 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=15480.0, ans=0.002166666666666671 2024-03-09 14:15:56,286 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=15480.0, ans=0.007504347826086957 2024-03-09 14:16:01,347 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.24 vs. limit=13.33 2024-03-09 14:16:46,429 INFO [train.py:997] (2/4) Epoch 15, batch 150, loss[loss=0.179, simple_loss=0.268, pruned_loss=0.04502, over 23998.00 frames. ], tot_loss[loss=0.1759, simple_loss=0.2588, pruned_loss=0.04652, over 2518265.87 frames. ], batch size: 388, lr: 2.16e-02, grad_scale: 32.0 2024-03-09 14:17:45,388 INFO [train.py:997] (2/4) Epoch 16, batch 0, loss[loss=0.1539, simple_loss=0.2371, pruned_loss=0.03537, over 23089.00 frames. ], tot_loss[loss=0.1539, simple_loss=0.2371, pruned_loss=0.03537, over 23089.00 frames. ], batch size: 102, lr: 2.09e-02, grad_scale: 32.0 2024-03-09 14:17:45,389 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 14:17:55,603 INFO [train.py:1029] (2/4) Epoch 16, validation: loss=0.2134, simple_loss=0.3039, pruned_loss=0.06146, over 452978.00 frames. 
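Each "Epoch N, batch 0" block recomputes the validation loss over the same 452978.00 validation frames, so those records are the most direct way to track convergence across this log (0.2933 at epoch 7 down to 0.2134 at epoch 16). A small sketch that pulls those numbers out of a log file with a regex matching the line format above; the file name is only a placeholder:

```python
import re

# Matches e.g. "... INFO [train.py:1029] (2/4) Epoch 16, validation: loss=0.2134, ..."
PATTERN = re.compile(
    r"Epoch (\d+), validation: loss=([\d.]+), simple_loss=([\d.]+), pruned_loss=([\d.]+)"
)

def validation_losses(log_path):
    """Return {epoch: (loss, simple_loss, pruned_loss)} parsed from a training log."""
    results = {}
    with open(log_path) as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                results[int(m.group(1))] = tuple(
                    float(m.group(i)) for i in (2, 3, 4)
                )
    return results

if __name__ == "__main__":
    # "train.log" is a placeholder path, not a file named in this run.
    for epoch, (loss, simple, pruned) in sorted(validation_losses("train.log").items()):
        print(f"epoch {epoch:3d}  loss {loss:.4f}  simple {simple:.4f}  pruned {pruned:.4f}")
```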
2024-03-09 14:17:55,604 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27741MB 2024-03-09 14:17:55,908 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=15800.0, ans=0.14200000000000002 2024-03-09 14:18:15,862 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=15866.666666666666, ans=0.125 2024-03-09 14:19:00,575 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=16000.0, ans=0.33999999999999997 2024-03-09 14:19:03,235 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 7.463e+01 8.632e+01 1.007e+02 1.180e+02 1.868e+02, threshold=2.014e+02, percent-clipped=0.0 2024-03-09 14:19:17,502 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=16066.666666666666, ans=0.125 2024-03-09 14:19:21,895 INFO [train.py:997] (2/4) Epoch 16, batch 50, loss[loss=0.1527, simple_loss=0.2292, pruned_loss=0.03805, over 23862.00 frames. ], tot_loss[loss=0.1688, simple_loss=0.2517, pruned_loss=0.0429, over 1070636.12 frames. ], batch size: 117, lr: 2.09e-02, grad_scale: 32.0 2024-03-09 14:20:26,766 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=16400.0, ans=0.136 2024-03-09 14:20:38,767 INFO [train.py:997] (2/4) Epoch 16, batch 100, loss[loss=0.1658, simple_loss=0.248, pruned_loss=0.04176, over 24253.00 frames. ], tot_loss[loss=0.1679, simple_loss=0.251, pruned_loss=0.04237, over 1883908.14 frames. ], batch size: 198, lr: 2.09e-02, grad_scale: 32.0 2024-03-09 14:20:45,754 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.76 vs. limit=13.675 2024-03-09 14:21:15,483 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.57 vs. limit=13.3 2024-03-09 14:21:20,802 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=16600.0, ans=0.0 2024-03-09 14:21:43,562 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 7.243e+01 8.931e+01 9.706e+01 1.091e+02 1.368e+02, threshold=1.941e+02, percent-clipped=0.0 2024-03-09 14:22:02,414 INFO [train.py:997] (2/4) Epoch 16, batch 150, loss[loss=0.2137, simple_loss=0.2871, pruned_loss=0.07012, over 23274.00 frames. ], tot_loss[loss=0.171, simple_loss=0.2543, pruned_loss=0.04385, over 2507530.10 frames. ], batch size: 534, lr: 2.09e-02, grad_scale: 32.0 2024-03-09 14:23:00,942 INFO [train.py:997] (2/4) Epoch 17, batch 0, loss[loss=0.1595, simple_loss=0.244, pruned_loss=0.03751, over 24262.00 frames. ], tot_loss[loss=0.1595, simple_loss=0.244, pruned_loss=0.03751, over 24262.00 frames. ], batch size: 198, lr: 2.02e-02, grad_scale: 32.0 2024-03-09 14:23:00,942 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 14:23:11,390 INFO [train.py:1029] (2/4) Epoch 17, validation: loss=0.215, simple_loss=0.3066, pruned_loss=0.06175, over 452978.00 frames. 
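The Whitening lines periodically sample how close a module's output covariance is to a multiple of the identity: metric is a scale-invariant measure of how unevenly the covariance eigenvalues are spread (1.0 for a perfectly whitened output) and limit is the currently scheduled ceiling, so "metric=4.76 vs. limit=13.675" above is well inside bounds, while readings above the limit (as seen earlier in the log) trigger a small corrective gradient. A rough eigenvalue-based sketch of that measure, which simplifies the actual scaling.py computation:

```python
import torch

def whitening_metric(x: torch.Tensor) -> float:
    """Rough sketch: how far the covariance of x (frames x channels) is from
    a multiple of the identity. 1.0 means perfectly 'white'; larger values mean
    the energy is concentrated in a few directions."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)
    return float((eigs ** 2).mean() / eigs.mean() ** 2)

x = torch.randn(1000, 512)
print(whitening_metric(x))          # nearly white -> metric a bit above 1
low_rank = x @ torch.randn(512, 32) @ torch.randn(32, 512)  # rank-32 mixing
print(whitening_metric(low_rank))   # >> 1: energy concentrated in few directions
```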
2024-03-09 14:23:11,390 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27741MB 2024-03-09 14:23:48,072 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=16986.666666666668, ans=0.125 2024-03-09 14:24:00,827 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.98 vs. limit=10.794666666666668 2024-03-09 14:24:19,686 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=17120.0, ans=0.007147826086956522 2024-03-09 14:24:36,332 INFO [train.py:997] (2/4) Epoch 17, batch 50, loss[loss=0.1713, simple_loss=0.2578, pruned_loss=0.04239, over 24295.00 frames. ], tot_loss[loss=0.1645, simple_loss=0.2494, pruned_loss=0.0398, over 1071655.44 frames. ], batch size: 267, lr: 2.02e-02, grad_scale: 32.0 2024-03-09 14:24:56,749 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=17253.333333333332, ans=0.0 2024-03-09 14:25:20,050 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=17320.0, ans=0.125 2024-03-09 14:25:22,878 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 7.440e+01 9.326e+01 1.031e+02 1.175e+02 1.521e+02, threshold=2.062e+02, percent-clipped=0.0 2024-03-09 14:25:33,880 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=17386.666666666668, ans=0.125 2024-03-09 14:25:46,175 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=17453.333333333332, ans=0.125 2024-03-09 14:25:57,094 INFO [train.py:997] (2/4) Epoch 17, batch 100, loss[loss=0.1489, simple_loss=0.234, pruned_loss=0.03189, over 23621.00 frames. ], tot_loss[loss=0.1643, simple_loss=0.2493, pruned_loss=0.03964, over 1886151.25 frames. ], batch size: 128, lr: 2.02e-02, grad_scale: 32.0 2024-03-09 14:26:15,750 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=17586.666666666668, ans=0.07 2024-03-09 14:27:02,376 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=17786.666666666668, ans=0.125 2024-03-09 14:27:04,282 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=14.17 2024-03-09 14:27:15,877 INFO [train.py:997] (2/4) Epoch 17, batch 150, loss[loss=0.1669, simple_loss=0.2544, pruned_loss=0.03967, over 24263.00 frames. ], tot_loss[loss=0.1636, simple_loss=0.2485, pruned_loss=0.03938, over 2517586.06 frames. ], batch size: 254, lr: 2.02e-02, grad_scale: 32.0 2024-03-09 14:28:12,293 INFO [train.py:997] (2/4) Epoch 18, batch 0, loss[loss=0.16, simple_loss=0.2442, pruned_loss=0.03786, over 23180.00 frames. ], tot_loss[loss=0.16, simple_loss=0.2442, pruned_loss=0.03786, over 23180.00 frames. ], batch size: 102, lr: 1.96e-02, grad_scale: 32.0 2024-03-09 14:28:12,293 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 14:28:22,755 INFO [train.py:1029] (2/4) Epoch 18, validation: loss=0.213, simple_loss=0.3039, pruned_loss=0.06107, over 452978.00 frames. 
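The grad_scale field on the training lines reflects mixed-precision training: the loss is multiplied by a dynamic scale before backward, and the scale is halved when overflowing gradients are detected and periodically doubled again, which is why it moves between 8.0, 16.0 and 32.0 in this stretch of the log (and higher later). A generic sketch of that loop using torch.cuda.amp rather than the recipe's own training code; the model, optimizer and shapes below are placeholders for illustration only:

```python
import torch

def train_step(model, optimizer, scaler, features, targets, loss_fn):
    """Generic mixed-precision step: scaler.get_scale() is the quantity the
    log reports as grad_scale."""
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(features), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # skipped internally if inf/nan gradients were found
    scaler.update()          # halves the scale on overflow, grows it after enough good steps
    return loss.detach(), scaler.get_scale()

# Toy setup (placeholders, not the Zipformer recipe):
model = torch.nn.Linear(16, 4).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(init_scale=16.0)
```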
2024-03-09 14:28:22,757 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27741MB 2024-03-09 14:28:38,135 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.49 vs. limit=13.953333333333335 2024-03-09 14:28:46,792 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.10 vs. limit=14.24 2024-03-09 14:28:52,939 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.67 vs. limit=13.986666666666666 2024-03-09 14:29:02,778 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 7.169e+01 8.782e+01 9.645e+01 1.059e+02 1.496e+02, threshold=1.929e+02, percent-clipped=0.0 2024-03-09 14:29:36,893 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=18173.333333333332, ans=0.26393333333333346 2024-03-09 14:29:44,458 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=18240.0, ans=0.006904347826086957 2024-03-09 14:29:45,626 INFO [train.py:997] (2/4) Epoch 18, batch 50, loss[loss=0.1683, simple_loss=0.2509, pruned_loss=0.0428, over 24049.00 frames. ], tot_loss[loss=0.1634, simple_loss=0.2476, pruned_loss=0.03966, over 1068772.95 frames. ], batch size: 176, lr: 1.96e-02, grad_scale: 32.0 2024-03-09 14:30:12,535 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=18306.666666666668, ans=0.0 2024-03-09 14:30:27,167 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.00 vs. limit=21.28 2024-03-09 14:30:27,937 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=18373.333333333332, ans=0.125 2024-03-09 14:31:06,271 INFO [train.py:997] (2/4) Epoch 18, batch 100, loss[loss=0.2164, simple_loss=0.2919, pruned_loss=0.07039, over 23227.00 frames. ], tot_loss[loss=0.1633, simple_loss=0.2486, pruned_loss=0.03897, over 1889676.72 frames. ], batch size: 534, lr: 1.96e-02, grad_scale: 32.0 2024-03-09 14:31:41,830 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 7.054e+01 8.645e+01 9.593e+01 1.057e+02 1.559e+02, threshold=1.919e+02, percent-clipped=0.0 2024-03-09 14:31:46,741 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=18706.666666666668, ans=0.0 2024-03-09 14:31:55,434 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=18773.333333333332, ans=0.24293333333333345 2024-03-09 14:32:11,620 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.89 vs. limit=14.565000000000001 2024-03-09 14:32:26,160 INFO [train.py:997] (2/4) Epoch 18, batch 150, loss[loss=0.182, simple_loss=0.2705, pruned_loss=0.04677, over 23978.00 frames. ], tot_loss[loss=0.1639, simple_loss=0.2494, pruned_loss=0.03919, over 2526521.77 frames. 
], batch size: 416, lr: 1.95e-02, grad_scale: 32.0 2024-03-09 14:32:32,633 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=18906.666666666668, ans=0.23826666666666674 2024-03-09 14:33:23,375 INFO [train.py:997] (2/4) Epoch 19, batch 0, loss[loss=0.1568, simple_loss=0.2396, pruned_loss=0.03704, over 24163.00 frames. ], tot_loss[loss=0.1568, simple_loss=0.2396, pruned_loss=0.03704, over 24163.00 frames. ], batch size: 240, lr: 1.90e-02, grad_scale: 32.0 2024-03-09 14:33:23,375 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 14:33:31,033 INFO [zipformer.py:1858] (2/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.6644, 5.3118, 5.6000, 5.4018], device='cuda:2') 2024-03-09 14:33:35,284 INFO [train.py:1029] (2/4) Epoch 19, validation: loss=0.2133, simple_loss=0.3046, pruned_loss=0.061, over 452978.00 frames. 2024-03-09 14:33:35,286 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27741MB 2024-03-09 14:34:03,871 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.99 vs. limit=14.635 2024-03-09 14:34:04,813 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=19026.666666666668, ans=0.9402666666666666 2024-03-09 14:34:05,715 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.93 vs. limit=14.635 2024-03-09 14:34:06,457 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=19026.666666666668, ans=0.125 2024-03-09 14:34:54,317 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=19293.333333333332, ans=0.125 2024-03-09 14:34:55,502 INFO [train.py:997] (2/4) Epoch 19, batch 50, loss[loss=0.1635, simple_loss=0.2537, pruned_loss=0.03661, over 24256.00 frames. ], tot_loss[loss=0.161, simple_loss=0.2473, pruned_loss=0.03737, over 1074096.48 frames. ], batch size: 327, lr: 1.90e-02, grad_scale: 32.0 2024-03-09 14:35:05,218 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=19293.333333333332, ans=0.125 2024-03-09 14:35:08,253 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=19293.333333333332, ans=0.125 2024-03-09 14:35:08,259 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=19293.333333333332, ans=0.125 2024-03-09 14:35:17,296 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 7.127e+01 8.675e+01 9.444e+01 1.046e+02 1.924e+02, threshold=1.889e+02, percent-clipped=1.0 2024-03-09 14:35:24,779 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten.whitening_limit, batch_count=19360.0, ans=14.76 2024-03-09 14:35:40,980 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=19426.666666666668, ans=0.22006666666666674 2024-03-09 14:36:16,163 INFO [train.py:997] (2/4) Epoch 19, batch 100, loss[loss=0.1656, simple_loss=0.2498, pruned_loss=0.04069, over 23174.00 frames. 
], tot_loss[loss=0.1591, simple_loss=0.2454, pruned_loss=0.03643, over 1874065.93 frames. ], batch size: 102, lr: 1.90e-02, grad_scale: 32.0 2024-03-09 14:36:36,457 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=19693.333333333332, ans=0.125 2024-03-09 14:36:45,390 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=19693.333333333332, ans=0.125 2024-03-09 14:36:53,068 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=19760.0, ans=0.0 2024-03-09 14:36:54,569 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=19760.0, ans=0.0 2024-03-09 14:37:11,304 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=19826.666666666668, ans=0.125 2024-03-09 14:37:12,843 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=19826.666666666668, ans=0.20606666666666673 2024-03-09 14:37:12,890 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=19826.666666666668, ans=0.04949747468305833 2024-03-09 14:37:25,772 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=19893.333333333332, ans=0.125 2024-03-09 14:37:36,532 INFO [train.py:997] (2/4) Epoch 19, batch 150, loss[loss=0.1652, simple_loss=0.2576, pruned_loss=0.0364, over 24010.00 frames. ], tot_loss[loss=0.1605, simple_loss=0.2464, pruned_loss=0.03728, over 2507258.33 frames. ], batch size: 416, lr: 1.89e-02, grad_scale: 32.0 2024-03-09 14:37:41,338 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=19960.0, ans=0.125 2024-03-09 14:38:31,429 INFO [train.py:997] (2/4) Epoch 20, batch 0, loss[loss=0.1356, simple_loss=0.2301, pruned_loss=0.02058, over 21374.00 frames. ], tot_loss[loss=0.1356, simple_loss=0.2301, pruned_loss=0.02058, over 21374.00 frames. ], batch size: 714, lr: 1.85e-02, grad_scale: 32.0 2024-03-09 14:38:31,430 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 14:38:38,368 INFO [zipformer.py:1858] (2/4) name=encoder.encoders.4.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([1.2765, 3.0589, 2.9398, 2.3960], device='cuda:2') 2024-03-09 14:38:40,965 INFO [train.py:1029] (2/4) Epoch 20, validation: loss=0.2111, simple_loss=0.3031, pruned_loss=0.05952, over 452978.00 frames. 2024-03-09 14:38:40,966 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27741MB 2024-03-09 14:38:53,197 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.462e+01 8.448e+01 9.307e+01 1.038e+02 2.078e+02, threshold=1.861e+02, percent-clipped=1.0 2024-03-09 14:39:18,462 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.42 vs. limit=15.0 2024-03-09 14:39:18,566 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.47 vs. 
limit=15.0 2024-03-09 14:39:45,524 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=20213.333333333332, ans=0.0 2024-03-09 14:39:59,320 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=20280.0, ans=0.006460869565217391 2024-03-09 14:40:03,544 INFO [train.py:997] (2/4) Epoch 20, batch 50, loss[loss=0.145, simple_loss=0.2316, pruned_loss=0.02915, over 23840.00 frames. ], tot_loss[loss=0.1588, simple_loss=0.2449, pruned_loss=0.03638, over 1068315.05 frames. ], batch size: 117, lr: 1.84e-02, grad_scale: 32.0 2024-03-09 14:41:25,658 INFO [train.py:997] (2/4) Epoch 20, batch 100, loss[loss=0.1549, simple_loss=0.2473, pruned_loss=0.03125, over 24192.00 frames. ], tot_loss[loss=0.1602, simple_loss=0.2465, pruned_loss=0.03693, over 1880618.97 frames. ], batch size: 327, lr: 1.84e-02, grad_scale: 32.0 2024-03-09 14:41:31,263 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=20680.0, ans=15.0 2024-03-09 14:41:34,815 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.448e+01 8.010e+01 8.832e+01 9.695e+01 1.353e+02, threshold=1.766e+02, percent-clipped=0.0 2024-03-09 14:41:50,515 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=20746.666666666668, ans=0.2 2024-03-09 14:41:59,551 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=20813.333333333332, ans=0.0 2024-03-09 14:42:08,615 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=20813.333333333332, ans=0.125 2024-03-09 14:42:16,332 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=20880.0, ans=0.2 2024-03-09 14:42:23,730 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=20880.0, ans=0.006330434782608696 2024-03-09 14:42:28,916 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=20946.666666666668, ans=10.0 2024-03-09 14:42:31,880 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=20946.666666666668, ans=0.125 2024-03-09 14:42:44,016 INFO [train.py:997] (2/4) Epoch 20, batch 150, loss[loss=0.1501, simple_loss=0.2423, pruned_loss=0.029, over 24228.00 frames. ], tot_loss[loss=0.1579, simple_loss=0.2449, pruned_loss=0.03551, over 2510452.97 frames. ], batch size: 281, lr: 1.84e-02, grad_scale: 32.0 2024-03-09 14:42:45,843 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=21013.333333333332, ans=0.0 2024-03-09 14:42:53,453 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21013.333333333332, ans=0.1 2024-03-09 14:43:39,751 INFO [train.py:997] (2/4) Epoch 21, batch 0, loss[loss=0.1919, simple_loss=0.2694, pruned_loss=0.05723, over 23279.00 frames. ], tot_loss[loss=0.1919, simple_loss=0.2694, pruned_loss=0.05723, over 23279.00 frames. 
], batch size: 533, lr: 1.79e-02, grad_scale: 32.0 2024-03-09 14:43:39,752 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 14:43:49,465 INFO [train.py:1029] (2/4) Epoch 21, validation: loss=0.2106, simple_loss=0.3015, pruned_loss=0.05984, over 452978.00 frames. 2024-03-09 14:43:49,465 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 27741MB 2024-03-09 14:44:14,236 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=21133.333333333332, ans=0.2 2024-03-09 14:44:43,375 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=21266.666666666668, ans=0.125 2024-03-09 14:45:02,856 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=21333.333333333332, ans=0.125 2024-03-09 14:45:10,282 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 7.134e+01 8.236e+01 9.284e+01 1.075e+02 1.651e+02, threshold=1.857e+02, percent-clipped=0.0 2024-03-09 14:45:13,755 INFO [train.py:997] (2/4) Epoch 21, batch 50, loss[loss=0.1721, simple_loss=0.2607, pruned_loss=0.04172, over 24038.00 frames. ], tot_loss[loss=0.1585, simple_loss=0.244, pruned_loss=0.03657, over 1072739.44 frames. ], batch size: 416, lr: 1.79e-02, grad_scale: 32.0 2024-03-09 14:45:18,594 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=21400.0, ans=0.125 2024-03-09 14:45:20,213 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=21400.0, ans=0.07 2024-03-09 14:45:20,222 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=21400.0, ans=0.1 2024-03-09 14:45:23,303 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=21400.0, ans=0.07 2024-03-09 14:45:37,476 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=21466.666666666668, ans=0.125 2024-03-09 14:45:54,551 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=21533.333333333332, ans=0.125 2024-03-09 14:46:00,688 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=21600.0, ans=0.05 2024-03-09 14:46:32,868 INFO [train.py:997] (2/4) Epoch 21, batch 100, loss[loss=0.1653, simple_loss=0.2449, pruned_loss=0.04288, over 23947.00 frames. ], tot_loss[loss=0.158, simple_loss=0.244, pruned_loss=0.03598, over 1881723.46 frames. ], batch size: 153, lr: 1.79e-02, grad_scale: 64.0 2024-03-09 14:46:53,384 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-03-09 14:47:10,728 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.52 vs. 
limit=15.0 2024-03-09 14:47:28,122 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=21933.333333333332, ans=0.07 2024-03-09 14:47:28,144 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=21933.333333333332, ans=0.125 2024-03-09 14:47:29,689 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=21933.333333333332, ans=0.125 2024-03-09 14:47:31,262 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=21933.333333333332, ans=0.125 2024-03-09 14:47:50,686 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=22000.0, ans=0.0 2024-03-09 14:47:51,969 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.744e+01 8.144e+01 8.919e+01 1.026e+02 1.301e+02, threshold=1.784e+02, percent-clipped=0.0 2024-03-09 14:47:55,056 INFO [train.py:997] (2/4) Epoch 21, batch 150, loss[loss=0.1772, simple_loss=0.2698, pruned_loss=0.04233, over 23722.00 frames. ], tot_loss[loss=0.1574, simple_loss=0.2443, pruned_loss=0.03523, over 2515180.17 frames. ], batch size: 486, lr: 1.79e-02, grad_scale: 64.0 2024-03-09 14:47:58,827 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=22066.666666666668, ans=0.0 2024-03-09 14:48:42,994 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.70 vs. limit=12.0 2024-03-09 14:48:51,237 INFO [train.py:997] (2/4) Epoch 22, batch 0, loss[loss=0.1561, simple_loss=0.2402, pruned_loss=0.03602, over 24228.00 frames. ], tot_loss[loss=0.1561, simple_loss=0.2402, pruned_loss=0.03602, over 24228.00 frames. ], batch size: 241, lr: 1.74e-02, grad_scale: 64.0 2024-03-09 14:48:51,237 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 14:49:00,963 INFO [train.py:1029] (2/4) Epoch 22, validation: loss=0.2117, simple_loss=0.3028, pruned_loss=0.06033, over 452978.00 frames. 
2024-03-09 14:49:00,963 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 28196MB 2024-03-09 14:49:01,295 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=22120.0, ans=0.1 2024-03-09 14:49:10,609 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=22120.0, ans=0.1 2024-03-09 14:49:15,471 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=22186.666666666668, ans=0.125 2024-03-09 14:49:39,355 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff2.min_abs, batch_count=22253.333333333332, ans=0.1 2024-03-09 14:49:57,902 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=22320.0, ans=0.1 2024-03-09 14:49:57,975 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=22320.0, ans=0.125 2024-03-09 14:50:16,331 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=22386.666666666668, ans=0.125 2024-03-09 14:50:23,729 INFO [train.py:997] (2/4) Epoch 22, batch 50, loss[loss=0.1556, simple_loss=0.2445, pruned_loss=0.03331, over 24196.00 frames. ], tot_loss[loss=0.156, simple_loss=0.2425, pruned_loss=0.03481, over 1067414.37 frames. ], batch size: 295, lr: 1.74e-02, grad_scale: 64.0 2024-03-09 14:50:31,840 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=22453.333333333332, ans=0.1 2024-03-09 14:50:37,962 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=22520.0, ans=0.125 2024-03-09 14:51:28,007 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.973e+01 8.132e+01 8.918e+01 9.986e+01 1.265e+02, threshold=1.784e+02, percent-clipped=0.0 2024-03-09 14:51:45,186 INFO [train.py:997] (2/4) Epoch 22, batch 100, loss[loss=0.1423, simple_loss=0.2369, pruned_loss=0.0239, over 21294.00 frames. ], tot_loss[loss=0.1562, simple_loss=0.2437, pruned_loss=0.03435, over 1884131.84 frames. ], batch size: 718, lr: 1.74e-02, grad_scale: 64.0 2024-03-09 14:51:58,605 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.52 vs. limit=22.5 2024-03-09 14:52:02,795 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=22853.333333333332, ans=15.0 2024-03-09 14:52:08,484 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=22853.333333333332, ans=0.125 2024-03-09 14:52:20,565 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=22920.0, ans=0.1 2024-03-09 14:52:21,184 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.69 vs. 
limit=15.0 2024-03-09 14:52:36,033 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=22986.666666666668, ans=0.125 2024-03-09 14:52:52,650 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=23053.333333333332, ans=0.125 2024-03-09 14:53:05,945 INFO [train.py:997] (2/4) Epoch 22, batch 150, loss[loss=0.1434, simple_loss=0.2267, pruned_loss=0.02998, over 24295.00 frames. ], tot_loss[loss=0.1568, simple_loss=0.244, pruned_loss=0.03483, over 2505328.34 frames. ], batch size: 208, lr: 1.74e-02, grad_scale: 64.0 2024-03-09 14:53:13,982 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=23120.0, ans=0.125 2024-03-09 14:54:00,138 INFO [train.py:997] (2/4) Epoch 23, batch 0, loss[loss=0.1241, simple_loss=0.2186, pruned_loss=0.0148, over 21442.00 frames. ], tot_loss[loss=0.1241, simple_loss=0.2186, pruned_loss=0.0148, over 21442.00 frames. ], batch size: 714, lr: 1.70e-02, grad_scale: 64.0 2024-03-09 14:54:00,139 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 14:54:09,891 INFO [train.py:1029] (2/4) Epoch 23, validation: loss=0.2115, simple_loss=0.3036, pruned_loss=0.0597, over 452978.00 frames. 2024-03-09 14:54:09,892 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 28196MB 2024-03-09 14:54:13,373 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=23173.333333333332, ans=0.125 2024-03-09 14:54:14,999 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=23173.333333333332, ans=0.125 2024-03-09 14:54:21,755 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.09 vs. limit=15.0 2024-03-09 14:54:39,516 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=23240.0, ans=0.035 2024-03-09 14:55:00,860 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=23373.333333333332, ans=0.0 2024-03-09 14:55:05,174 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.526e+01 7.783e+01 8.704e+01 9.596e+01 1.275e+02, threshold=1.741e+02, percent-clipped=0.0 2024-03-09 14:55:33,121 INFO [train.py:997] (2/4) Epoch 23, batch 50, loss[loss=0.14, simple_loss=0.2224, pruned_loss=0.02878, over 23636.00 frames. ], tot_loss[loss=0.1552, simple_loss=0.2431, pruned_loss=0.03361, over 1066872.35 frames. ], batch size: 116, lr: 1.70e-02, grad_scale: 64.0 2024-03-09 14:55:47,524 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=23573.333333333332, ans=0.0 2024-03-09 14:55:52,144 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=23573.333333333332, ans=0.125 2024-03-09 14:56:06,079 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=23640.0, ans=0.1 2024-03-09 14:56:53,727 INFO [train.py:997] (2/4) Epoch 23, batch 100, loss[loss=0.1644, simple_loss=0.2564, pruned_loss=0.03618, over 24125.00 frames. ], tot_loss[loss=0.1544, simple_loss=0.2422, pruned_loss=0.03332, over 1878060.18 frames. 
], batch size: 366, lr: 1.69e-02, grad_scale: 64.0 2024-03-09 14:56:57,181 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=23840.0, ans=0.025 2024-03-09 14:57:00,249 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=2.541e-03 2024-03-09 14:57:23,762 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.58 vs. limit=12.0 2024-03-09 14:57:45,467 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.240e+01 7.813e+01 8.574e+01 9.589e+01 1.326e+02, threshold=1.715e+02, percent-clipped=0.0 2024-03-09 14:58:02,224 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=24106.666666666668, ans=0.125 2024-03-09 14:58:07,484 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=24106.666666666668, ans=0.005628985507246376 2024-03-09 14:58:13,649 INFO [train.py:997] (2/4) Epoch 23, batch 150, loss[loss=0.1685, simple_loss=0.2634, pruned_loss=0.03678, over 23985.00 frames. ], tot_loss[loss=0.1538, simple_loss=0.242, pruned_loss=0.0328, over 2516444.51 frames. ], batch size: 416, lr: 1.69e-02, grad_scale: 64.0 2024-03-09 14:58:21,934 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=24173.333333333332, ans=0.2 2024-03-09 14:59:02,379 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.67 vs. limit=22.5 2024-03-09 14:59:07,188 INFO [train.py:997] (2/4) Epoch 24, batch 0, loss[loss=0.1487, simple_loss=0.2404, pruned_loss=0.02848, over 24226.00 frames. ], tot_loss[loss=0.1487, simple_loss=0.2404, pruned_loss=0.02848, over 24226.00 frames. ], batch size: 311, lr: 1.66e-02, grad_scale: 64.0 2024-03-09 14:59:07,189 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 14:59:16,704 INFO [train.py:1029] (2/4) Epoch 24, validation: loss=0.2123, simple_loss=0.3043, pruned_loss=0.06014, over 452978.00 frames. 2024-03-09 14:59:16,705 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 28196MB 2024-03-09 14:59:46,349 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2024-03-09 14:59:48,027 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=24293.333333333332, ans=0.125 2024-03-09 14:59:52,465 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=24293.333333333332, ans=0.125 2024-03-09 15:00:23,367 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=24426.666666666668, ans=0.2 2024-03-09 15:00:33,217 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.17 vs. limit=15.0 2024-03-09 15:00:43,098 INFO [train.py:997] (2/4) Epoch 24, batch 50, loss[loss=0.1252, simple_loss=0.2049, pruned_loss=0.02276, over 23638.00 frames. ], tot_loss[loss=0.151, simple_loss=0.2378, pruned_loss=0.03211, over 1069137.85 frames. 
], batch size: 128, lr: 1.65e-02, grad_scale: 64.0 2024-03-09 15:00:45,625 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.20 vs. limit=15.0 2024-03-09 15:00:48,211 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=24560.0, ans=0.0 2024-03-09 15:00:57,860 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.25 vs. limit=22.5 2024-03-09 15:01:20,107 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.514e+01 7.866e+01 8.423e+01 9.105e+01 1.243e+02, threshold=1.685e+02, percent-clipped=0.0 2024-03-09 15:02:03,701 INFO [train.py:997] (2/4) Epoch 24, batch 100, loss[loss=0.1371, simple_loss=0.2247, pruned_loss=0.02476, over 23981.00 frames. ], tot_loss[loss=0.1501, simple_loss=0.2371, pruned_loss=0.03161, over 1881039.84 frames. ], batch size: 142, lr: 1.65e-02, grad_scale: 64.0 2024-03-09 15:02:12,943 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=24893.333333333332, ans=0.125 2024-03-09 15:02:23,854 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=24960.0, ans=0.125 2024-03-09 15:02:27,660 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.74 vs. limit=15.0 2024-03-09 15:02:48,523 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=25026.666666666668, ans=0.2 2024-03-09 15:02:56,154 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=25093.333333333332, ans=0.125 2024-03-09 15:03:24,742 INFO [train.py:997] (2/4) Epoch 24, batch 150, loss[loss=0.2018, simple_loss=0.2792, pruned_loss=0.06225, over 23258.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.2392, pruned_loss=0.03245, over 2513449.26 frames. ], batch size: 534, lr: 1.65e-02, grad_scale: 64.0 2024-03-09 15:04:17,871 INFO [train.py:997] (2/4) Epoch 25, batch 0, loss[loss=0.1497, simple_loss=0.241, pruned_loss=0.02919, over 24163.00 frames. ], tot_loss[loss=0.1497, simple_loss=0.241, pruned_loss=0.02919, over 24163.00 frames. ], batch size: 345, lr: 1.61e-02, grad_scale: 64.0 2024-03-09 15:04:17,871 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 15:04:27,728 INFO [train.py:1029] (2/4) Epoch 25, validation: loss=0.2123, simple_loss=0.3048, pruned_loss=0.05995, over 452978.00 frames. 2024-03-09 15:04:27,729 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 28196MB 2024-03-09 15:04:56,149 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.291e+01 7.825e+01 8.498e+01 9.317e+01 1.197e+02, threshold=1.700e+02, percent-clipped=0.0 2024-03-09 15:05:25,022 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.96 vs. limit=15.0 2024-03-09 15:05:50,862 INFO [train.py:997] (2/4) Epoch 25, batch 50, loss[loss=0.1657, simple_loss=0.2606, pruned_loss=0.03536, over 23653.00 frames. ], tot_loss[loss=0.151, simple_loss=0.239, pruned_loss=0.0315, over 1063490.93 frames. 
], batch size: 485, lr: 1.61e-02, grad_scale: 64.0 2024-03-09 15:05:58,159 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.71 vs. limit=6.0 2024-03-09 15:06:27,702 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=25746.666666666668, ans=0.2 2024-03-09 15:07:11,201 INFO [train.py:997] (2/4) Epoch 25, batch 100, loss[loss=0.1492, simple_loss=0.2369, pruned_loss=0.0307, over 24148.00 frames. ], tot_loss[loss=0.1508, simple_loss=0.2392, pruned_loss=0.03115, over 1882751.97 frames. ], batch size: 240, lr: 1.61e-02, grad_scale: 64.0 2024-03-09 15:07:37,666 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.239e+01 7.935e+01 8.679e+01 9.503e+01 1.168e+02, threshold=1.736e+02, percent-clipped=0.0 2024-03-09 15:07:48,582 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=26080.0, ans=0.0052 2024-03-09 15:08:13,123 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.44 vs. limit=15.0 2024-03-09 15:08:23,247 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.77 vs. limit=6.0 2024-03-09 15:08:31,503 INFO [train.py:997] (2/4) Epoch 25, batch 150, loss[loss=0.1535, simple_loss=0.2436, pruned_loss=0.03168, over 24111.00 frames. ], tot_loss[loss=0.1502, simple_loss=0.2384, pruned_loss=0.03103, over 2511677.42 frames. ], batch size: 345, lr: 1.61e-02, grad_scale: 64.0 2024-03-09 15:09:26,511 INFO [train.py:997] (2/4) Epoch 26, batch 0, loss[loss=0.1454, simple_loss=0.2356, pruned_loss=0.02759, over 24290.00 frames. ], tot_loss[loss=0.1454, simple_loss=0.2356, pruned_loss=0.02759, over 24290.00 frames. ], batch size: 281, lr: 1.58e-02, grad_scale: 64.0 2024-03-09 15:09:26,511 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 15:09:35,914 INFO [train.py:1029] (2/4) Epoch 26, validation: loss=0.2091, simple_loss=0.3013, pruned_loss=0.05842, over 452978.00 frames. 2024-03-09 15:09:35,915 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 28196MB 2024-03-09 15:09:51,816 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=26333.333333333332, ans=0.0 2024-03-09 15:09:56,440 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=26400.0, ans=0.0 2024-03-09 15:09:59,558 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=26400.0, ans=0.125 2024-03-09 15:10:04,432 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=26400.0, ans=0.125 2024-03-09 15:10:44,742 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.39 vs. limit=15.0 2024-03-09 15:10:48,135 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.88 vs. limit=15.0 2024-03-09 15:10:59,540 INFO [train.py:997] (2/4) Epoch 26, batch 50, loss[loss=0.1483, simple_loss=0.2352, pruned_loss=0.0307, over 24201.00 frames. 
], tot_loss[loss=0.1476, simple_loss=0.2356, pruned_loss=0.02983, over 1068310.64 frames. ], batch size: 217, lr: 1.57e-02, grad_scale: 64.0 2024-03-09 15:11:02,956 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=26666.666666666668, ans=0.125 2024-03-09 15:11:10,632 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=26666.666666666668, ans=0.0 2024-03-09 15:11:11,922 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.387e+01 7.632e+01 8.183e+01 8.952e+01 1.265e+02, threshold=1.637e+02, percent-clipped=0.0 2024-03-09 15:11:12,986 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.71 vs. limit=22.5 2024-03-09 15:11:52,270 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=26866.666666666668, ans=0.0 2024-03-09 15:12:11,594 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=26933.333333333332, ans=0.0 2024-03-09 15:12:15,422 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.52 vs. limit=22.5 2024-03-09 15:12:22,500 INFO [train.py:997] (2/4) Epoch 26, batch 100, loss[loss=0.1557, simple_loss=0.2405, pruned_loss=0.03548, over 24117.00 frames. ], tot_loss[loss=0.1483, simple_loss=0.2361, pruned_loss=0.03023, over 1882738.61 frames. ], batch size: 176, lr: 1.57e-02, grad_scale: 64.0 2024-03-09 15:12:37,272 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.43 vs. limit=10.0 2024-03-09 15:12:48,781 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.37 vs. limit=15.0 2024-03-09 15:12:51,128 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=27066.666666666668, ans=0.2 2024-03-09 15:12:54,146 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=27133.333333333332, ans=0.125 2024-03-09 15:12:57,168 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=27133.333333333332, ans=0.0 2024-03-09 15:13:18,636 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=27200.0, ans=0.125 2024-03-09 15:13:42,267 INFO [train.py:997] (2/4) Epoch 26, batch 150, loss[loss=0.1376, simple_loss=0.2313, pruned_loss=0.02193, over 22840.00 frames. ], tot_loss[loss=0.1479, simple_loss=0.2366, pruned_loss=0.02962, over 2517729.39 frames. ], batch size: 609, lr: 1.57e-02, grad_scale: 64.0 2024-03-09 15:14:38,730 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.880e+01 7.565e+01 8.210e+01 9.162e+01 1.256e+02, threshold=1.642e+02, percent-clipped=0.0 2024-03-09 15:14:38,760 INFO [train.py:997] (2/4) Epoch 27, batch 0, loss[loss=0.1457, simple_loss=0.2385, pruned_loss=0.02641, over 24279.00 frames. ], tot_loss[loss=0.1457, simple_loss=0.2385, pruned_loss=0.02641, over 24279.00 frames. 
], batch size: 311, lr: 1.54e-02, grad_scale: 64.0 2024-03-09 15:14:38,760 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 15:14:48,407 INFO [train.py:1029] (2/4) Epoch 27, validation: loss=0.2114, simple_loss=0.3031, pruned_loss=0.05987, over 452978.00 frames. 2024-03-09 15:14:48,407 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 28196MB 2024-03-09 15:15:40,964 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=27520.0, ans=0.125 2024-03-09 15:15:42,472 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=27586.666666666668, ans=0.125 2024-03-09 15:15:47,181 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=27586.666666666668, ans=0.0 2024-03-09 15:16:14,482 INFO [train.py:997] (2/4) Epoch 27, batch 50, loss[loss=0.1352, simple_loss=0.2232, pruned_loss=0.02359, over 23697.00 frames. ], tot_loss[loss=0.1486, simple_loss=0.2359, pruned_loss=0.03061, over 1080089.85 frames. ], batch size: 128, lr: 1.54e-02, grad_scale: 64.0 2024-03-09 15:16:18,026 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=27720.0, ans=0.125 2024-03-09 15:16:24,184 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=27720.0, ans=0.125 2024-03-09 15:16:25,689 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=27720.0, ans=0.1 2024-03-09 15:16:27,166 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=27720.0, ans=0.004843478260869565 2024-03-09 15:16:42,109 INFO [scaling.py:1023] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.15 vs. limit=5.0 2024-03-09 15:16:56,880 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=27853.333333333332, ans=0.1 2024-03-09 15:17:04,613 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-03-09 15:17:24,367 INFO [scaling.py:1023] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.42 vs. limit=5.0 2024-03-09 15:17:33,756 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.709e+01 7.734e+01 8.550e+01 9.615e+01 1.355e+02, threshold=1.710e+02, percent-clipped=0.0 2024-03-09 15:17:33,785 INFO [train.py:997] (2/4) Epoch 27, batch 100, loss[loss=0.1545, simple_loss=0.2488, pruned_loss=0.03013, over 24136.00 frames. ], tot_loss[loss=0.1487, simple_loss=0.2369, pruned_loss=0.03023, over 1889198.97 frames. 
], batch size: 366, lr: 1.53e-02, grad_scale: 64.0 2024-03-09 15:17:43,278 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=28053.333333333332, ans=0.125 2024-03-09 15:17:43,291 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=28053.333333333332, ans=0.1 2024-03-09 15:17:44,947 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=28053.333333333332, ans=0.125 2024-03-09 15:18:00,287 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28120.0, ans=0.1 2024-03-09 15:18:24,760 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=28253.333333333332, ans=0.0 2024-03-09 15:18:53,831 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=12.0 2024-03-09 15:18:55,710 INFO [train.py:997] (2/4) Epoch 27, batch 150, loss[loss=0.1281, simple_loss=0.2252, pruned_loss=0.01546, over 21505.00 frames. ], tot_loss[loss=0.1475, simple_loss=0.2363, pruned_loss=0.02939, over 2517513.83 frames. ], batch size: 718, lr: 1.53e-02, grad_scale: 64.0 2024-03-09 15:19:49,048 INFO [train.py:997] (2/4) Epoch 28, batch 0, loss[loss=0.1811, simple_loss=0.2599, pruned_loss=0.05118, over 23229.00 frames. ], tot_loss[loss=0.1811, simple_loss=0.2599, pruned_loss=0.05118, over 23229.00 frames. ], batch size: 534, lr: 1.50e-02, grad_scale: 64.0 2024-03-09 15:19:49,049 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 15:19:59,332 INFO [train.py:1029] (2/4) Epoch 28, validation: loss=0.2107, simple_loss=0.3034, pruned_loss=0.05903, over 452978.00 frames. 2024-03-09 15:19:59,332 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 28196MB 2024-03-09 15:20:43,541 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=28573.333333333332, ans=0.0046579710144927546 2024-03-09 15:21:11,007 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.458e+01 7.529e+01 8.136e+01 8.999e+01 1.198e+02, threshold=1.627e+02, percent-clipped=0.0 2024-03-09 15:21:23,149 INFO [train.py:997] (2/4) Epoch 28, batch 50, loss[loss=0.1474, simple_loss=0.2406, pruned_loss=0.0271, over 24261.00 frames. ], tot_loss[loss=0.1486, simple_loss=0.2376, pruned_loss=0.02981, over 1077172.58 frames. ], batch size: 267, lr: 1.50e-02, grad_scale: 64.0 2024-03-09 15:21:31,160 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=28773.333333333332, ans=0.1 2024-03-09 15:21:56,861 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=28906.666666666668, ans=0.0 2024-03-09 15:22:37,940 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.92 vs. 
limit=15.0 2024-03-09 15:22:39,386 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=29040.0, ans=15.0 2024-03-09 15:22:40,256 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=29040.0, ans=0.2 2024-03-09 15:22:41,861 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=29106.666666666668, ans=0.2 2024-03-09 15:22:43,091 INFO [train.py:997] (2/4) Epoch 28, batch 100, loss[loss=0.1407, simple_loss=0.2329, pruned_loss=0.02421, over 24203.00 frames. ], tot_loss[loss=0.1496, simple_loss=0.2387, pruned_loss=0.0302, over 1900825.66 frames. ], batch size: 295, lr: 1.50e-02, grad_scale: 64.0 2024-03-09 15:23:20,953 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=29240.0, ans=0.125 2024-03-09 15:23:34,774 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=29306.666666666668, ans=0.1 2024-03-09 15:23:36,325 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=29306.666666666668, ans=0.125 2024-03-09 15:23:44,413 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=29306.666666666668, ans=0.5 2024-03-09 15:23:44,433 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=29306.666666666668, ans=0.1 2024-03-09 15:23:47,397 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=29373.333333333332, ans=0.2 2024-03-09 15:23:50,078 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.379e+01 7.431e+01 8.104e+01 8.725e+01 1.109e+02, threshold=1.621e+02, percent-clipped=0.0 2024-03-09 15:24:02,917 INFO [train.py:997] (2/4) Epoch 28, batch 150, loss[loss=0.1422, simple_loss=0.2253, pruned_loss=0.02953, over 23689.00 frames. ], tot_loss[loss=0.1483, simple_loss=0.237, pruned_loss=0.02978, over 2516736.16 frames. ], batch size: 128, lr: 1.50e-02, grad_scale: 64.0 2024-03-09 15:24:57,643 INFO [train.py:997] (2/4) Epoch 29, batch 0, loss[loss=0.147, simple_loss=0.233, pruned_loss=0.03047, over 24237.00 frames. ], tot_loss[loss=0.147, simple_loss=0.233, pruned_loss=0.03047, over 24237.00 frames. ], batch size: 241, lr: 1.47e-02, grad_scale: 64.0 2024-03-09 15:24:57,643 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 15:25:06,828 INFO [train.py:1029] (2/4) Epoch 29, validation: loss=0.2094, simple_loss=0.3019, pruned_loss=0.05844, over 452978.00 frames. 2024-03-09 15:25:06,829 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 28196MB 2024-03-09 15:25:08,707 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=29493.333333333332, ans=0.2 2024-03-09 15:25:48,701 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.46 vs. 
limit=12.0 2024-03-09 15:26:00,512 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=29693.333333333332, ans=0.0 2024-03-09 15:26:28,181 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2024-03-09 15:26:32,413 INFO [train.py:997] (2/4) Epoch 29, batch 50, loss[loss=0.1492, simple_loss=0.2405, pruned_loss=0.02891, over 24148.00 frames. ], tot_loss[loss=0.1452, simple_loss=0.2342, pruned_loss=0.0281, over 1061071.50 frames. ], batch size: 326, lr: 1.47e-02, grad_scale: 64.0 2024-03-09 15:26:57,465 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=29893.333333333332, ans=0.125 2024-03-09 15:27:14,671 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=29960.0, ans=0.125 2024-03-09 15:27:16,146 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=29960.0, ans=0.2 2024-03-09 15:27:27,042 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.485e+01 7.617e+01 8.419e+01 9.074e+01 1.218e+02, threshold=1.684e+02, percent-clipped=0.0 2024-03-09 15:27:51,941 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=30160.0, ans=0.125 2024-03-09 15:27:55,005 INFO [train.py:997] (2/4) Epoch 29, batch 100, loss[loss=0.1746, simple_loss=0.2605, pruned_loss=0.04441, over 23256.00 frames. ], tot_loss[loss=0.1471, simple_loss=0.2355, pruned_loss=0.02933, over 1878284.00 frames. ], batch size: 534, lr: 1.47e-02, grad_scale: 64.0 2024-03-09 15:28:01,453 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=30160.0, ans=0.125 2024-03-09 15:28:28,553 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=30293.333333333332, ans=0.0 2024-03-09 15:28:42,075 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=30360.0, ans=0.125 2024-03-09 15:28:46,787 INFO [scaling.py:1119] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=1.027e-02 2024-03-09 15:29:12,926 INFO [train.py:997] (2/4) Epoch 29, batch 150, loss[loss=0.1477, simple_loss=0.2339, pruned_loss=0.03078, over 24268.00 frames. ], tot_loss[loss=0.1451, simple_loss=0.234, pruned_loss=0.02816, over 2519546.37 frames. ], batch size: 254, lr: 1.46e-02, grad_scale: 64.0 2024-03-09 15:29:20,865 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=30493.333333333332, ans=0.125 2024-03-09 15:30:06,217 INFO [train.py:997] (2/4) Epoch 30, batch 0, loss[loss=0.1439, simple_loss=0.2391, pruned_loss=0.02435, over 24102.00 frames. ], tot_loss[loss=0.1439, simple_loss=0.2391, pruned_loss=0.02435, over 24102.00 frames. ], batch size: 366, lr: 1.44e-02, grad_scale: 64.0 2024-03-09 15:30:06,217 INFO [train.py:1020] (2/4) Computing validation loss 2024-03-09 15:30:18,506 INFO [train.py:1029] (2/4) Epoch 30, validation: loss=0.2105, simple_loss=0.3027, pruned_loss=0.05915, over 452978.00 frames. 
2024-03-09 15:30:18,506 INFO [train.py:1030] (2/4) Maximum memory allocated so far is 28196MB 2024-03-09 15:31:01,625 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.221e+01 6.992e+01 7.523e+01 8.232e+01 1.586e+02, threshold=1.505e+02, percent-clipped=0.0 2024-03-09 15:31:02,543 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.33 vs. limit=15.0 2024-03-09 15:31:10,255 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2024-03-09 15:31:15,869 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=30746.666666666668, ans=0.2 2024-03-09 15:31:35,741 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.00 vs. limit=15.0 2024-03-09 15:31:35,871 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.53 vs. limit=15.0 2024-03-09 15:31:40,985 INFO [train.py:997] (2/4) Epoch 30, batch 50, loss[loss=0.1428, simple_loss=0.2276, pruned_loss=0.02904, over 22582.00 frames. ], tot_loss[loss=0.1427, simple_loss=0.231, pruned_loss=0.02722, over 1068010.28 frames. ], batch size: 84, lr: 1.44e-02, grad_scale: 64.0 2024-03-09 15:31:41,286 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=30880.0, ans=0.125 2024-03-09 15:31:56,915 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=30946.666666666668, ans=0.0 2024-03-09 15:32:18,540 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=31013.333333333332, ans=0.004127536231884058 2024-03-09 15:32:18,549 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31013.333333333332, ans=0.1 2024-03-09 15:32:30,797 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=31080.0, ans=0.125 2024-03-09 15:32:44,735 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=31146.666666666668, ans=0.0 2024-03-09 15:33:01,487 INFO [train.py:997] (2/4) Epoch 30, batch 100, loss[loss=0.1415, simple_loss=0.2324, pruned_loss=0.0253, over 24249.00 frames. ], tot_loss[loss=0.1439, simple_loss=0.2333, pruned_loss=0.0273, over 1886597.58 frames. ], batch size: 241, lr: 1.43e-02, grad_scale: 64.0 2024-03-09 15:33:12,542 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=31213.333333333332, ans=0.5 2024-03-09 15:33:22,022 INFO [scaling.py:214] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=31280.0, ans=0.125 2024-03-09 15:33:43,906 WARNING [optim.py:487] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.080e+01 7.332e+01 7.826e+01 8.661e+01 1.231e+02, threshold=1.565e+02, percent-clipped=0.0 2024-03-09 15:34:20,984 INFO [train.py:997] (2/4) Epoch 30, batch 150, loss[loss=0.1511, simple_loss=0.2429, pruned_loss=0.02965, over 24250.00 frames. 
], tot_loss[loss=0.145, simple_loss=0.2348, pruned_loss=0.02754, over 2516428.34 frames. ], batch size: 281, lr: 1.43e-02, grad_scale: 64.0 2024-03-09 15:34:23,606 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.58 vs. limit=22.5 2024-03-09 15:34:30,021 INFO [scaling.py:1023] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.76 vs. limit=10.0 2024-03-09 15:34:33,231 INFO [train.py:1248] (2/4) Done!