2023-12-03 22:41:20,222 INFO [train.py:1155] (2/4) Training started
2023-12-03 22:41:20,222 INFO [train.py:1172] (2/4) Device: cuda:2
2023-12-03 22:41:20,226 INFO [train.py:1184] (2/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '2b2ac14b326d61d79d04e53fbd69b1ff6d630411', 'k2-git-date': 'Thu Aug 24 05:58:26 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.1', 'icefall-git-branch': 'zipformer_whisper_mvq', 'icefall-git-sha1': '0d26e9c4-dirty', 'icefall-git-date': 'Tue Oct 24 09:46:03 2023', 'icefall-path': '/star-xy/softwares/icefall_development/icefall_mvq', 'k2-path': '/star-xy/softwares/k2_development/k2/k2/python/k2/__init__.py', 'lhotse-path': '/star-xy/softwares/anaconda3/envs/multi_KD/lib/python3.10/site-packages/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-1-1220091118-57c4d55446-mvd6x', 'IP address': '10.177.22.19'}, 'world_size': 4, 'master_port': 18130, 'tensorboard': True, 'num_epochs': 90, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp_baseline_960h_no_sp_enable_musan0'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'codebook_loss_scale': 0.1, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'enable_distillation': False, 'num_codebooks': 16, 'distillation_layer': 4, 'full_libri': True, 'mini_libri': False, 'manifest_dir': PosixPath('data/fbank_with_whisper_embeddings'), 'max_duration': 1000, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': False, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'vocab_size': 500}
2023-12-03 22:41:20,227 INFO [train.py:1186] (2/4) About to create model
2023-12-03 22:41:20,844 INFO [train.py:1190] (2/4) Number of model parameters: 65549011
2023-12-03 22:41:23,522 INFO [train.py:1205] (2/4) Using DDP
2023-12-03 22:41:24,020 INFO [asr_datamodule.py:434] (2/4) About to get the shuffled train-clean-100, train-clean-360 and train-other-500 cuts
2023-12-03 22:41:24,153 INFO [asr_datamodule.py:239] (2/4) Disable MUSAN
2023-12-03 22:41:24,153 INFO [asr_datamodule.py:257] (2/4) Enable SpecAugment
2023-12-03 22:41:24,153 INFO [asr_datamodule.py:258] (2/4) Time warp factor: 80
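Note on reading the per-batch entries that follow: each train.py line prints loss, simple_loss, pruned_loss and lr, and each optim.py line prints grad-norm quartiles with a clipping threshold. Below is a minimal sketch, assuming the usual icefall pruned-transducer warm-up weighting (warm_step=2000 and simple_loss_scale=0.5 from the config above), of how these printed numbers appear to relate; it is checked only against the epoch-1 values in this log, not taken from the actual training script, and totals after warm-up include additional terms not reconstructed here.

```python
# Sketch only: assumed icefall-style warm-up blending of the pruned-transducer
# losses; the exact recipe code may differ.

def combined_loss(simple_loss: float, pruned_loss: float, batch_idx: int,
                  warm_step: int = 2000, simple_loss_scale: float = 0.5) -> float:
    """Blend the simple and pruned RNN-T losses during warm-up (assumed form)."""
    if batch_idx >= warm_step:
        s, p = simple_loss_scale, 1.0
    else:
        frac = batch_idx / warm_step
        s = 1.0 - frac * (1.0 - simple_loss_scale)  # weight on simple_loss: 1.0 -> 0.5
        p = 0.1 + 0.9 * frac                        # weight on pruned_loss: 0.1 -> 1.0
    return s * simple_loss + p * pruned_loss

# Epoch 1, batch 0:   1.0*6.923  + 0.1*6.849   ~= 7.61   (logged loss=7.609)
print(combined_loss(6.923, 6.849, batch_idx=0))
# Epoch 1, batch 800: 0.8*0.4677 + 0.46*0.2592 ~= 0.4934 (logged loss=0.4934)
print(combined_loss(0.4677, 0.2592, batch_idx=800))

# The optim.py entries print grad-norm quartiles; the reported clipping threshold
# equals clipping_scale * median, e.g. 2.0 * 5.889e+02 ~= 1.178e+03 in the first entry.
clipping_scale, median_grad_norm = 2.0, 5.889e+02
print(clipping_scale * median_grad_norm)
```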
2023-12-03 22:41:24,153 INFO [asr_datamodule.py:268] (2/4) Num frame mask: 10 2023-12-03 22:41:24,153 INFO [asr_datamodule.py:281] (2/4) About to create train dataset 2023-12-03 22:41:24,154 INFO [asr_datamodule.py:308] (2/4) Using DynamicBucketingSampler. 2023-12-03 22:41:28,673 INFO [asr_datamodule.py:323] (2/4) About to create train dataloader 2023-12-03 22:41:28,674 INFO [asr_datamodule.py:451] (2/4) About to get dev-clean cuts 2023-12-03 22:41:28,679 INFO [asr_datamodule.py:458] (2/4) About to get dev-other cuts 2023-12-03 22:41:28,682 INFO [asr_datamodule.py:354] (2/4) About to create dev dataset 2023-12-03 22:41:28,974 INFO [asr_datamodule.py:371] (2/4) About to create dev dataloader 2023-12-03 22:41:41,219 INFO [train.py:1087] (2/4) Epoch 1, batch 0, loss[loss=7.609, simple_loss=6.923, pruned_loss=6.849, over 24553.00 frames. ], tot_loss[loss=7.609, simple_loss=6.923, pruned_loss=6.849, over 24553.00 frames. ], batch size: 63, lr: 2.25e-02, grad_scale: 1.0 2023-12-03 22:41:41,220 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-03 22:41:53,591 INFO [train.py:1119] (2/4) Epoch 1, validation: loss=7.57, simple_loss=6.893, pruned_loss=6.755, over 944034.00 frames. 2023-12-03 22:41:53,592 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 14378MB 2023-12-03 22:41:58,475 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=0.0, ans=0.2 2023-12-03 22:41:59,934 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.37 vs. limit=5.0 2023-12-03 22:42:03,485 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.34 vs. limit=7.5 2023-12-03 22:42:10,557 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=66.66666666666667, ans=0.49166666666666664 2023-12-03 22:42:12,513 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=38.24 vs. limit=7.525 2023-12-03 22:42:13,319 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=66.66666666666667, ans=0.1975 2023-12-03 22:42:13,454 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=4.026666666666666 2023-12-03 22:42:15,634 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.04 vs. limit=7.55 2023-12-03 22:42:23,049 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=133.33333333333334, ans=7.55 2023-12-03 22:42:23,328 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=4.053333333333334 2023-12-03 22:42:37,488 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=58.75 vs. 
limit=7.575 2023-12-03 22:42:48,640 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=266.6666666666667, ans=0.4875 2023-12-03 22:42:49,144 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=75.69 vs. limit=7.6 2023-12-03 22:42:53,086 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=33.45 vs. limit=5.133333333333334 2023-12-03 22:42:58,034 INFO [train.py:1087] (2/4) Epoch 1, batch 50, loss[loss=1.334, simple_loss=1.185, pruned_loss=1.331, over 24556.00 frames. ], tot_loss[loss=3.201, simple_loss=2.938, pruned_loss=2.561, over 1081445.51 frames. ], batch size: 64, lr: 2.48e-02, grad_scale: 0.25 2023-12-03 22:43:00,866 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=333.3333333333333, ans=0.484375 2023-12-03 22:43:05,977 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=248.68 vs. limit=7.75 2023-12-03 22:43:10,787 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=400.0, ans=0.0975 2023-12-03 22:43:25,115 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=211.96 vs. limit=7.85 2023-12-03 22:43:26,643 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=107.62 vs. limit=7.85 2023-12-03 22:43:27,680 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=466.6666666666667, ans=0.0895 2023-12-03 22:43:30,625 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=466.6666666666667, ans=0.478125 2023-12-03 22:43:30,929 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=82.59 vs. limit=5.116666666666666 2023-12-03 22:43:33,461 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=56.59 vs. limit=7.85 2023-12-03 22:43:43,197 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=533.3333333333334, ans=0.475 2023-12-03 22:43:45,017 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=153.95 vs. limit=7.7 2023-12-03 22:43:57,927 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=44.72 vs. limit=7.725 2023-12-03 22:43:58,736 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=600.0, ans=0.471875 2023-12-03 22:43:59,127 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=34.22 vs. 
limit=7.95 2023-12-03 22:44:00,053 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=600.0, ans=0.879 2023-12-03 22:44:02,184 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=99.15 vs. limit=7.75 2023-12-03 22:44:02,727 INFO [train.py:1087] (2/4) Epoch 1, batch 100, loss[loss=1.19, simple_loss=1.03, pruned_loss=1.281, over 24739.00 frames. ], tot_loss[loss=2.136, simple_loss=1.933, pruned_loss=1.87, over 1897954.58 frames. ], batch size: 61, lr: 2.70e-02, grad_scale: 0.5 2023-12-03 22:44:06,917 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 9.552e+01 1.661e+02 5.889e+02 5.635e+03 1.802e+05, threshold=1.178e+03, percent-clipped=0.0 2023-12-03 22:44:29,126 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=64.85 vs. limit=7.8 2023-12-03 22:44:35,009 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=61.40 vs. limit=7.8 2023-12-03 22:44:35,865 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.04 vs. limit=4.32 2023-12-03 22:44:37,613 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.51 vs. limit=5.2 2023-12-03 22:44:40,249 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=83.87 vs. limit=8.1 2023-12-03 22:44:42,961 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.09 vs. limit=8.15 2023-12-03 22:44:50,676 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=35.38 vs. limit=8.15 2023-12-03 22:44:50,825 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=55.15 vs. limit=7.825 2023-12-03 22:44:50,974 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=170.97 vs. limit=7.825 2023-12-03 22:44:51,834 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=9.53 vs. limit=4.346666666666667 2023-12-03 22:44:52,970 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=15.25 vs. limit=7.825 2023-12-03 22:44:59,701 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=226.44 vs. limit=5.466666666666667 2023-12-03 22:45:01,602 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=933.3333333333334, ans=0.04666666666666667 2023-12-03 22:45:03,074 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=20.33 vs. 
limit=7.85 2023-12-03 22:45:03,196 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=213.89 vs. limit=7.85 2023-12-03 22:45:04,125 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=933.3333333333334, ans=0.09416666666666668 2023-12-03 22:45:07,803 INFO [train.py:1087] (2/4) Epoch 1, batch 150, loss[loss=1.047, simple_loss=0.8928, pruned_loss=1.121, over 24564.00 frames. ], tot_loss[loss=1.703, simple_loss=1.519, pruned_loss=1.583, over 2551948.64 frames. ], batch size: 66, lr: 2.93e-02, grad_scale: 0.5 2023-12-03 22:45:08,285 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.47 vs. limit=8.25 2023-12-03 22:45:08,776 INFO [scaling.py:1022] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=168.25 vs. limit=4.2 2023-12-03 22:45:10,025 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=240.59 vs. limit=7.875 2023-12-03 22:45:13,859 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1000.0, ans=0.09375 2023-12-03 22:45:17,153 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.10 vs. limit=8.25 2023-12-03 22:45:19,824 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=184.47 vs. limit=7.875 2023-12-03 22:45:23,856 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=3.16 2023-12-03 22:45:25,128 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=21.35 vs. limit=7.9 2023-12-03 22:45:43,035 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=12.38 vs. limit=5.283333333333333 2023-12-03 22:45:52,437 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1200.0, ans=0.0925 2023-12-03 22:45:57,714 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.33 vs. limit=5.3 2023-12-03 22:45:57,970 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=21.37 vs. limit=7.95 2023-12-03 22:46:07,215 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=37.18 vs. limit=7.975 2023-12-03 22:46:14,756 INFO [train.py:1087] (2/4) Epoch 1, batch 200, loss[loss=1.008, simple_loss=0.8573, pruned_loss=1.016, over 24296.00 frames. ], tot_loss[loss=1.458, simple_loss=1.286, pruned_loss=1.394, over 3057372.24 frames. 
], batch size: 79, lr: 3.15e-02, grad_scale: 1.0 2023-12-03 22:46:16,262 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1333.3333333333333, ans=0.15 2023-12-03 22:46:18,513 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 7.195e+01 8.879e+01 1.077e+02 1.337e+02 3.280e+02, threshold=2.153e+02, percent-clipped=0.0 2023-12-03 22:46:22,183 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=3.2 2023-12-03 22:46:27,781 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.25 vs. limit=8.55 2023-12-03 22:46:33,082 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=59.55 vs. limit=8.025 2023-12-03 22:46:45,872 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.71 vs. limit=4.586666666666667 2023-12-03 22:46:52,674 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=56.15 vs. limit=8.05 2023-12-03 22:46:55,215 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=48.78 vs. limit=8.075 2023-12-03 22:46:59,417 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=53.89 vs. limit=8.65 2023-12-03 22:47:02,096 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=195.46 vs. limit=8.075 2023-12-03 22:47:03,708 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=9.57 vs. limit=4.613333333333333 2023-12-03 22:47:04,836 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=36.96 vs. limit=8.075 2023-12-03 22:47:18,322 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=29.76 vs. limit=8.1 2023-12-03 22:47:21,311 INFO [train.py:1087] (2/4) Epoch 1, batch 250, loss[loss=0.9936, simple_loss=0.8401, pruned_loss=0.9683, over 23446.00 frames. ], tot_loss[loss=1.307, simple_loss=1.143, pruned_loss=1.262, over 3439397.46 frames. ], batch size: 94, lr: 3.38e-02, grad_scale: 1.0 2023-12-03 22:47:23,582 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=125.06 vs. limit=8.125 2023-12-03 22:47:30,470 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1666.6666666666667, ans=0.421875 2023-12-03 22:47:43,142 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=90.95 vs. limit=8.15 2023-12-03 22:47:43,258 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=21.03 vs. 
limit=8.15 2023-12-03 22:47:45,833 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1733.3333333333333, ans=5.433333333333334 2023-12-03 22:47:46,089 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=118.34 vs. limit=8.15 2023-12-03 22:47:49,350 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1800.0, ans=0.275 2023-12-03 22:47:50,954 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.13 vs. limit=4.72 2023-12-03 22:48:00,191 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=12.94 vs. limit=8.2 2023-12-03 22:48:03,480 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1866.6666666666667, ans=0.8346666666666667 2023-12-03 22:48:05,051 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=38.12 vs. limit=8.2 2023-12-03 22:48:05,155 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=21.33 vs. limit=8.2 2023-12-03 22:48:06,198 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1866.6666666666667, ans=0.8346666666666667 2023-12-03 22:48:06,204 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=1866.6666666666667, ans=0.4125 2023-12-03 22:48:18,636 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=45.41 vs. limit=8.225 2023-12-03 22:48:21,255 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.54 vs. limit=8.95 2023-12-03 22:48:26,258 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=93.28 vs. limit=8.25 2023-12-03 22:48:26,883 INFO [train.py:1087] (2/4) Epoch 1, batch 300, loss[loss=0.9193, simple_loss=0.7644, pruned_loss=0.902, over 24704.00 frames. ], tot_loss[loss=1.205, simple_loss=1.043, pruned_loss=1.167, over 3754223.94 frames. ], batch size: 74, lr: 3.60e-02, grad_scale: 2.0 2023-12-03 22:48:27,597 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.88 vs. limit=8.25 2023-12-03 22:48:31,065 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 7.988e+01 1.210e+02 1.529e+02 2.202e+02 3.785e+02, threshold=3.059e+02, percent-clipped=26.0 2023-12-03 22:48:33,907 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=2000.0, ans=0.23 2023-12-03 22:48:34,390 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=16.41 vs. 
limit=5.5 2023-12-03 22:48:39,222 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2066.6666666666665, ans=0.053500000000000006 2023-12-03 22:48:44,112 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2066.6666666666665, ans=0.04949747468305833 2023-12-03 22:48:44,589 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=52.80 vs. limit=8.275 2023-12-03 22:48:54,460 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=25.74 vs. limit=8.3 2023-12-03 22:49:04,049 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2133.3333333333335, ans=0.4 2023-12-03 22:49:09,630 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.76 vs. limit=6.1 2023-12-03 22:49:11,176 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=2200.0, ans=8.325 2023-12-03 22:49:13,252 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=2200.0, ans=0.050499999999999996 2023-12-03 22:49:14,706 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.02 vs. limit=6.1 2023-12-03 22:49:31,434 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=70.31 vs. limit=8.35 2023-12-03 22:49:32,667 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=53.53 vs. limit=8.375 2023-12-03 22:49:33,355 INFO [train.py:1087] (2/4) Epoch 1, batch 350, loss[loss=0.939, simple_loss=0.7772, pruned_loss=0.8917, over 24765.00 frames. ], tot_loss[loss=1.136, simple_loss=0.9747, pruned_loss=1.097, over 3984181.74 frames. ], batch size: 64, lr: 3.83e-02, grad_scale: 2.0 2023-12-03 22:49:39,932 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=55.58 vs. limit=9.25 2023-12-03 22:49:40,143 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=77.20 vs. limit=8.375 2023-12-03 22:49:46,389 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.90 vs. limit=9.3 2023-12-03 22:49:48,965 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=2400.0, ans=0.085 2023-12-03 22:49:55,606 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.35 vs. 
limit=4.96 2023-12-03 22:49:59,322 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2466.6666666666665, ans=0.384375 2023-12-03 22:50:14,883 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.37 vs. limit=5.013333333333334 2023-12-03 22:50:20,975 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.64 vs. limit=8.45 2023-12-03 22:50:29,236 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=21.75 vs. limit=8.475 2023-12-03 22:50:32,911 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2600.0, ans=0.27399999999999997 2023-12-03 22:50:32,914 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2600.0, ans=0.378125 2023-12-03 22:50:34,433 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.58 vs. limit=5.04 2023-12-03 22:50:36,110 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.33 vs. limit=5.65 2023-12-03 22:50:37,548 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=27.05 vs. limit=9.5 2023-12-03 22:50:38,484 INFO [train.py:1087] (2/4) Epoch 1, batch 400, loss[loss=0.9533, simple_loss=0.7898, pruned_loss=0.8655, over 24739.00 frames. ], tot_loss[loss=1.084, simple_loss=0.9236, pruned_loss=1.038, over 4147874.19 frames. ], batch size: 63, lr: 4.05e-02, grad_scale: 4.0 2023-12-03 22:50:42,219 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 9.675e+01 1.297e+02 1.639e+02 1.987e+02 4.198e+02, threshold=3.278e+02, percent-clipped=1.0 2023-12-03 22:50:44,927 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2666.6666666666665, ans=0.375 2023-12-03 22:50:57,878 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.12 vs. limit=6.366666666666667 2023-12-03 22:51:00,333 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.99 vs. limit=5.093333333333334 2023-12-03 22:51:00,406 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.60 vs. limit=9.55 2023-12-03 22:51:00,640 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=26.48 vs. limit=6.366666666666667 2023-12-03 22:51:12,727 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2800.0, ans=0.36875 2023-12-03 22:51:19,108 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=29.26 vs. 
limit=8.575 2023-12-03 22:51:19,207 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.39 vs. limit=8.575 2023-12-03 22:51:19,350 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.91 vs. limit=8.575 2023-12-03 22:51:26,951 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=30.47 vs. limit=8.575 2023-12-03 22:51:28,160 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=2866.6666666666665, ans=8.575 2023-12-03 22:51:32,870 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=18.94 vs. limit=8.6 2023-12-03 22:51:38,830 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2933.3333333333335, ans=0.3625 2023-12-03 22:51:42,939 INFO [train.py:1087] (2/4) Epoch 1, batch 450, loss[loss=0.947, simple_loss=0.7919, pruned_loss=0.8067, over 17050.00 frames. ], tot_loss[loss=1.043, simple_loss=0.8837, pruned_loss=0.9823, over 4291074.31 frames. ], batch size: 177, lr: 4.28e-02, grad_scale: 4.0 2023-12-03 22:51:47,757 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=9.75 2023-12-03 22:51:48,025 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=46.10 vs. limit=8.625 2023-12-03 22:51:50,532 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.87 vs. limit=8.625 2023-12-03 22:51:51,260 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=3000.0, ans=0.0875 2023-12-03 22:51:59,987 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=3066.6666666666665, ans=0.7926666666666667 2023-12-03 22:52:08,824 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=3133.3333333333335, ans=0.353125 2023-12-03 22:52:09,092 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=25.98 vs. limit=8.675 2023-12-03 22:52:09,188 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=24.14 vs. limit=9.85 2023-12-03 22:52:10,461 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=15.79 vs. limit=8.675 2023-12-03 22:52:11,304 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. 
limit=3.4699999999999998 2023-12-03 22:52:18,069 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=3133.3333333333335, ans=0.247 2023-12-03 22:52:19,971 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.86 vs. limit=5.253333333333334 2023-12-03 22:52:23,871 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.79 vs. limit=9.9 2023-12-03 22:52:28,512 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=3200.0, ans=0.08 2023-12-03 22:52:32,139 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=3200.0, ans=0.09999999999999998 2023-12-03 22:52:33,956 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.02 vs. limit=8.725 2023-12-03 22:52:46,850 INFO [train.py:1087] (2/4) Epoch 1, batch 500, loss[loss=0.8718, simple_loss=0.7439, pruned_loss=0.6793, over 24757.00 frames. ], tot_loss[loss=1.004, simple_loss=0.8496, pruned_loss=0.9191, over 4384203.11 frames. ], batch size: 66, lr: 4.49e-02, grad_scale: 8.0 2023-12-03 22:52:47,120 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=3333.3333333333335, ans=0.024999999999999994 2023-12-03 22:52:50,981 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.289e+02 2.115e+02 2.895e+02 4.210e+02 7.644e+02, threshold=5.790e+02, percent-clipped=45.0 2023-12-03 22:53:17,020 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=3466.6666666666665, ans=7.166666666666666 2023-12-03 22:53:21,119 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=9.58 vs. limit=8.8 2023-12-03 22:53:21,156 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.05 vs. limit=8.8 2023-12-03 22:53:29,065 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.65 vs. limit=5.883333333333334 2023-12-03 22:53:29,256 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.36 vs. limit=8.825 2023-12-03 22:53:31,739 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.55 vs. limit=8.825 2023-12-03 22:53:37,733 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.71 vs. limit=10.2 2023-12-03 22:53:44,951 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=17.42 vs. limit=8.85 2023-12-03 22:53:50,639 INFO [train.py:1087] (2/4) Epoch 1, batch 550, loss[loss=0.7145, simple_loss=0.6224, pruned_loss=0.511, over 24551.00 frames. ], tot_loss[loss=0.9525, simple_loss=0.8086, pruned_loss=0.8422, over 4452947.61 frames. 
], batch size: 62, lr: 4.49e-02, grad_scale: 8.0 2023-12-03 22:53:57,451 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=3666.6666666666665, ans=0.7866666666666666 2023-12-03 22:54:04,139 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=8.9 2023-12-03 22:54:17,605 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.29 vs. limit=5.95 2023-12-03 22:54:18,294 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=3800.0, ans=0.025000000000000022 2023-12-03 22:54:19,854 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.09 vs. limit=8.925 2023-12-03 22:54:53,068 INFO [train.py:1087] (2/4) Epoch 1, batch 600, loss[loss=0.6377, simple_loss=0.5685, pruned_loss=0.4174, over 24841.00 frames. ], tot_loss[loss=0.891, simple_loss=0.7614, pruned_loss=0.7558, over 4533165.12 frames. ], batch size: 68, lr: 4.49e-02, grad_scale: 8.0 2023-12-03 22:54:53,826 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=9.81 vs. limit=6.0 2023-12-03 22:54:55,751 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=4000.0, ans=0.04999999999999999 2023-12-03 22:54:56,651 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.444e+02 4.670e+02 6.467e+02 1.067e+03 1.868e+03, threshold=1.293e+03, percent-clipped=55.0 2023-12-03 22:55:03,984 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=6.521e-01 2023-12-03 22:55:06,339 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=4066.6666666666665, ans=0.2593333333333333 2023-12-03 22:55:11,415 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=10.55 2023-12-03 22:55:15,191 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.82 vs. limit=10.55 2023-12-03 22:55:25,190 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=4133.333333333333, ans=0.7553333333333334 2023-12-03 22:55:35,422 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.01 vs. limit=10.65 2023-12-03 22:55:39,373 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.27 vs. limit=6.05 2023-12-03 22:55:40,664 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.79 vs. limit=5.68 2023-12-03 22:55:40,796 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.15 vs. 
limit=9.075 2023-12-03 22:55:48,800 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=4266.666666666667, ans=0.7926666666666666 2023-12-03 22:55:50,082 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=4266.666666666667, ans=0.3 2023-12-03 22:55:53,070 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.30 vs. limit=6.066666666666666 2023-12-03 22:55:56,253 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=11.16 vs. limit=10.7 2023-12-03 22:55:58,045 INFO [train.py:1087] (2/4) Epoch 1, batch 650, loss[loss=0.585, simple_loss=0.5323, pruned_loss=0.3546, over 24864.00 frames. ], tot_loss[loss=0.8322, simple_loss=0.717, pruned_loss=0.6756, over 4594127.27 frames. ], batch size: 68, lr: 4.49e-02, grad_scale: 8.0 2023-12-03 22:56:10,467 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=4400.0, ans=0.29375 2023-12-03 22:56:34,668 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.16 vs. limit=9.2 2023-12-03 22:56:35,783 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=4533.333333333333, ans=0.2875 2023-12-03 22:56:37,068 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=4533.333333333333, ans=0.04777777777777778 2023-12-03 22:56:39,719 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=7.67 vs. limit=9.2 2023-12-03 22:56:44,999 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=4600.0, ans=0.009869565217391305 2023-12-03 22:56:58,310 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.38 vs. limit=11.0 2023-12-03 22:56:58,860 INFO [train.py:1087] (2/4) Epoch 1, batch 700, loss[loss=0.5968, simple_loss=0.527, pruned_loss=0.3904, over 24855.00 frames. ], tot_loss[loss=0.7743, simple_loss=0.674, pruned_loss=0.6, over 4642784.05 frames. ], batch size: 68, lr: 4.49e-02, grad_scale: 8.0 2023-12-03 22:57:02,388 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.204e+02 4.179e+02 6.821e+02 1.321e+03 5.191e+03, threshold=1.364e+03, percent-clipped=27.0 2023-12-03 22:57:19,352 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=4733.333333333333, ans=0.04694444444444445 2023-12-03 22:57:26,620 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=4800.0, ans=0.275 2023-12-03 22:57:28,509 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.36 vs. limit=6.2 2023-12-03 22:57:31,234 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.13 vs. 
limit=5.92 2023-12-03 22:57:31,875 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=4800.0, ans=0.272 2023-12-03 22:57:56,678 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=4933.333333333333, ans=8.083333333333332 2023-12-03 22:57:59,225 INFO [train.py:1087] (2/4) Epoch 1, batch 750, loss[loss=0.5267, simple_loss=0.4882, pruned_loss=0.2972, over 24279.00 frames. ], tot_loss[loss=0.7216, simple_loss=0.6351, pruned_loss=0.5332, over 4689097.49 frames. ], batch size: 79, lr: 4.49e-02, grad_scale: 8.0 2023-12-03 22:58:18,075 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.41 vs. limit=11.3 2023-12-03 22:58:39,804 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=5200.0, ans=0.25625 2023-12-03 22:58:39,810 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=5200.0, ans=0.802 2023-12-03 22:58:50,021 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=5266.666666666667, ans=0.253125 2023-12-03 22:58:52,594 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.59 vs. limit=11.45 2023-12-03 22:58:59,892 INFO [train.py:1087] (2/4) Epoch 1, batch 800, loss[loss=0.4934, simple_loss=0.4677, pruned_loss=0.2592, over 24790.00 frames. ], tot_loss[loss=0.6757, simple_loss=0.6015, pruned_loss=0.4763, over 4713613.39 frames. ], batch size: 72, lr: 4.49e-02, grad_scale: 16.0 2023-12-03 22:59:03,287 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 4.513e+02 8.468e+02 1.309e+03 3.907e+03, threshold=1.694e+03, percent-clipped=23.0 2023-12-03 22:59:05,158 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=9.5 2023-12-03 22:59:08,693 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.67 vs. limit=6.333333333333333 2023-12-03 22:59:18,403 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=5400.0, ans=0.246 2023-12-03 22:59:24,978 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=5466.666666666667, ans=0.7086666666666667 2023-12-03 22:59:29,269 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=5466.666666666667, ans=0.009681159420289855 2023-12-03 22:59:37,993 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=5533.333333333333, ans=0.09899494936611666 2023-12-03 22:59:38,154 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.87 vs. limit=11.65 2023-12-03 22:59:47,081 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.77 vs. 
limit=11.7 2023-12-03 22:59:50,220 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.76 vs. limit=11.7 2023-12-03 22:59:54,875 INFO [train.py:1087] (2/4) Epoch 1, batch 850, loss[loss=0.4585, simple_loss=0.4421, pruned_loss=0.2286, over 24766.00 frames. ], tot_loss[loss=0.6354, simple_loss=0.5722, pruned_loss=0.4276, over 4721805.32 frames. ], batch size: 65, lr: 4.49e-02, grad_scale: 16.0 2023-12-03 23:00:05,066 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.31 vs. limit=11.8 2023-12-03 23:00:07,042 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=5733.333333333333, ans=0.23125 2023-12-03 23:00:07,416 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.27 vs. limit=9.65 2023-12-03 23:00:14,688 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=5733.333333333333, ans=0.24266666666666667 2023-12-03 23:00:34,246 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.65 vs. limit=9.7 2023-12-03 23:00:39,367 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=5933.333333333333, ans=0.221875 2023-12-03 23:00:59,322 INFO [train.py:1087] (2/4) Epoch 2, batch 0, loss[loss=0.4373, simple_loss=0.428, pruned_loss=0.2089, over 24792.00 frames. ], tot_loss[loss=0.4373, simple_loss=0.428, pruned_loss=0.2089, over 24792.00 frames. ], batch size: 62, lr: 4.40e-02, grad_scale: 32.0 2023-12-03 23:00:59,323 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-03 23:01:11,640 INFO [train.py:1119] (2/4) Epoch 2, validation: loss=0.4023, simple_loss=0.4118, pruned_loss=0.1645, over 944034.00 frames. 2023-12-03 23:01:11,641 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-03 23:01:11,808 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=5966.666666666667, ans=0.22031250000000002 2023-12-03 23:01:18,590 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-03 23:01:21,659 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 3.803e+02 6.586e+02 1.143e+03 2.069e+03, threshold=1.317e+03, percent-clipped=6.0 2023-12-03 23:01:22,299 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.81 vs. 
limit=6.508333333333333 2023-12-03 23:01:36,016 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=6100.0, ans=0.2140625 2023-12-03 23:01:37,089 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=6100.0, ans=0.2140625 2023-12-03 23:01:47,523 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=6166.666666666667, ans=0.2109375 2023-12-03 23:02:09,421 INFO [train.py:1087] (2/4) Epoch 2, batch 50, loss[loss=0.4231, simple_loss=0.4183, pruned_loss=0.1974, over 24151.00 frames. ], tot_loss[loss=0.4628, simple_loss=0.449, pruned_loss=0.2279, over 1066694.37 frames. ], batch size: 82, lr: 4.40e-02, grad_scale: 16.0 2023-12-03 23:02:15,306 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-03 23:02:20,169 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=4.686e-02 2023-12-03 23:02:30,110 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=6366.666666666667, ans=0.04013888888888889 2023-12-03 23:02:41,470 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=6433.333333333333, ans=0.07 2023-12-03 23:03:04,796 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.58 vs. limit=8.283333333333333 2023-12-03 23:03:08,039 INFO [train.py:1087] (2/4) Epoch 2, batch 100, loss[loss=0.413, simple_loss=0.4128, pruned_loss=0.1879, over 24742.00 frames. ], tot_loss[loss=0.4465, simple_loss=0.4379, pruned_loss=0.2138, over 1910862.55 frames. ], batch size: 63, lr: 4.39e-02, grad_scale: 8.0 2023-12-03 23:03:10,085 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=6633.333333333333, ans=0.009427536231884057 2023-12-03 23:03:17,207 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=6633.333333333333, ans=0.23366666666666666 2023-12-03 23:03:20,239 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.839e+02 4.752e+02 7.477e+02 1.066e+03 3.809e+03, threshold=1.495e+03, percent-clipped=14.0 2023-12-03 23:03:23,892 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=6700.0, ans=0.0290625 2023-12-03 23:03:34,886 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.54 vs. 
limit=10.0375 2023-12-03 23:03:38,964 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=6766.666666666667, ans=0.03847222222222223 2023-12-03 23:03:42,084 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=6766.666666666667, ans=0.1828125 2023-12-03 23:03:46,918 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=6833.333333333333, ans=0.03819444444444445 2023-12-03 23:04:00,307 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.69 vs. limit=8.45 2023-12-03 23:04:07,461 INFO [train.py:1087] (2/4) Epoch 2, batch 150, loss[loss=0.4256, simple_loss=0.4266, pruned_loss=0.1938, over 24584.00 frames. ], tot_loss[loss=0.4403, simple_loss=0.4337, pruned_loss=0.2088, over 2550052.58 frames. ], batch size: 65, lr: 4.39e-02, grad_scale: 8.0 2023-12-03 23:04:09,010 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=6966.666666666667, ans=0.009355072463768117 2023-12-03 23:04:16,656 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=6966.666666666667, ans=0.17343750000000002 2023-12-03 23:04:33,061 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=7100.0, ans=0.1671875 2023-12-03 23:04:34,772 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.38 vs. limit=10.1625 2023-12-03 23:04:38,812 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=7100.0, ans=0.6515 2023-12-03 23:04:42,191 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=7166.666666666667, ans=0.1640625 2023-12-03 23:04:42,447 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.62 vs. limit=6.791666666666667 2023-12-03 23:04:45,485 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-03 23:05:04,337 INFO [train.py:1087] (2/4) Epoch 2, batch 200, loss[loss=0.3848, simple_loss=0.3944, pruned_loss=0.166, over 24575.00 frames. ], tot_loss[loss=0.4316, simple_loss=0.4279, pruned_loss=0.2019, over 3044563.57 frames. ], batch size: 65, lr: 4.39e-02, grad_scale: 8.0 2023-12-03 23:05:15,603 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.700e+02 4.144e+02 5.908e+02 9.300e+02 2.212e+03, threshold=1.182e+03, percent-clipped=6.0 2023-12-03 23:05:52,998 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=7566.666666666667, ans=0.22433333333333333 2023-12-03 23:06:01,447 INFO [train.py:1087] (2/4) Epoch 2, batch 250, loss[loss=0.3947, simple_loss=0.4007, pruned_loss=0.1768, over 24551.00 frames. ], tot_loss[loss=0.4244, simple_loss=0.4234, pruned_loss=0.1962, over 3441639.25 frames. 
], batch size: 66, lr: 4.39e-02, grad_scale: 8.0 2023-12-03 23:06:06,159 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=7633.333333333333, ans=0.22366666666666668 2023-12-03 23:06:08,765 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=10.3625 2023-12-03 23:06:26,925 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=7766.666666666667, ans=0.1359375 2023-12-03 23:06:36,580 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=7833.333333333333, ans=0.6258333333333334 2023-12-03 23:06:47,764 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=7900.0, ans=0.009152173913043479 2023-12-03 23:06:50,950 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=7900.0, ans=0.221 2023-12-03 23:06:57,912 INFO [train.py:1087] (2/4) Epoch 2, batch 300, loss[loss=0.4293, simple_loss=0.4269, pruned_loss=0.2037, over 21515.00 frames. ], tot_loss[loss=0.4205, simple_loss=0.4212, pruned_loss=0.1933, over 3720076.00 frames. ], batch size: 128, lr: 4.38e-02, grad_scale: 8.0 2023-12-03 23:07:06,186 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=7966.666666666667, ans=0.0 2023-12-03 23:07:09,119 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.824e+02 2.778e+02 5.285e+02 1.000e+03 3.398e+03, threshold=1.057e+03, percent-clipped=16.0 2023-12-03 23:07:11,579 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=8033.333333333333, ans=0.6188333333333333 2023-12-03 23:07:34,129 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=8166.666666666667, ans=0.009094202898550724 2023-12-03 23:07:44,929 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=8233.333333333334, ans=0.21766666666666665 2023-12-03 23:07:47,117 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-03 23:07:53,742 INFO [train.py:1087] (2/4) Epoch 2, batch 350, loss[loss=0.3719, simple_loss=0.3899, pruned_loss=0.1566, over 24750.00 frames. ], tot_loss[loss=0.4122, simple_loss=0.4157, pruned_loss=0.1873, over 3957790.58 frames. 
], batch size: 63, lr: 4.38e-02, grad_scale: 8.0 2023-12-03 23:08:03,384 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=8300.0, ans=0.217 2023-12-03 23:08:03,408 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=8300.0, ans=0.009065217391304349 2023-12-03 23:08:12,160 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=8366.666666666666, ans=0.00905072463768116 2023-12-03 23:08:15,415 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=8433.333333333334, ans=0.125 2023-12-03 23:08:18,746 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=10.662500000000001 2023-12-03 23:08:29,933 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=8500.0, ans=0.125 2023-12-03 23:08:41,918 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-03 23:08:46,235 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=8566.666666666666, ans=0.125 2023-12-03 23:08:49,247 INFO [train.py:1087] (2/4) Epoch 2, batch 400, loss[loss=0.36, simple_loss=0.3822, pruned_loss=0.1488, over 24572.00 frames. ], tot_loss[loss=0.4043, simple_loss=0.4111, pruned_loss=0.1813, over 4144815.22 frames. ], batch size: 65, lr: 4.38e-02, grad_scale: 16.0 2023-12-03 23:08:54,736 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=8633.333333333334, ans=0.125 2023-12-03 23:09:00,210 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 4.190e+02 6.859e+02 9.810e+02 1.648e+03, threshold=1.372e+03, percent-clipped=19.0 2023-12-03 23:09:06,516 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=7.23 vs. limit=7.48 2023-12-03 23:09:19,127 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=8766.666666666666, ans=0.030138888888888892 2023-12-03 23:09:21,635 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.30 vs. limit=7.208333333333334 2023-12-03 23:09:23,431 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8833.333333333334, ans=0.21166666666666667 2023-12-03 23:09:27,584 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=8833.333333333334, ans=0.5908333333333333 2023-12-03 23:09:43,865 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=8966.666666666666, ans=10.8625 2023-12-03 23:09:44,534 INFO [train.py:1087] (2/4) Epoch 2, batch 450, loss[loss=0.385, simple_loss=0.3982, pruned_loss=0.1712, over 23452.00 frames. ], tot_loss[loss=0.3973, simple_loss=0.407, pruned_loss=0.1762, over 4299429.56 frames. 
], batch size: 94, lr: 4.38e-02, grad_scale: 16.0 2023-12-03 23:10:07,721 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=9100.0, ans=0.2 2023-12-03 23:10:09,798 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=9100.0, ans=0.5815 2023-12-03 23:10:18,636 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=9166.666666666666, ans=0.0 2023-12-03 23:10:21,024 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.06 vs. limit=14.375 2023-12-03 23:10:28,230 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=9233.333333333334, ans=0.035 2023-12-03 23:10:28,284 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=9233.333333333334, ans=0.125 2023-12-03 23:10:41,087 INFO [train.py:1087] (2/4) Epoch 2, batch 500, loss[loss=0.374, simple_loss=0.4002, pruned_loss=0.1558, over 24722.00 frames. ], tot_loss[loss=0.3896, simple_loss=0.4025, pruned_loss=0.171, over 4422375.26 frames. ], batch size: 63, lr: 4.38e-02, grad_scale: 16.0 2023-12-03 23:10:44,032 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.91 vs. limit=14.475 2023-12-03 23:10:45,686 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=9300.0, ans=0.0 2023-12-03 23:10:50,305 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=9300.0, ans=0.02791666666666667 2023-12-03 23:10:52,175 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 3.311e+02 5.496e+02 8.640e+02 1.749e+03, threshold=1.099e+03, percent-clipped=4.0 2023-12-03 23:10:54,697 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=9366.666666666666, ans=0.008833333333333334 2023-12-03 23:11:32,715 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.13 vs. limit=14.675 2023-12-03 23:11:36,804 INFO [train.py:1087] (2/4) Epoch 2, batch 550, loss[loss=0.4073, simple_loss=0.4158, pruned_loss=0.1889, over 21229.00 frames. ], tot_loss[loss=0.3837, simple_loss=0.3994, pruned_loss=0.1668, over 4514877.67 frames. ], batch size: 127, lr: 4.37e-02, grad_scale: 16.0 2023-12-03 23:11:49,592 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=9700.0, ans=0.008760869565217391 2023-12-03 23:12:00,763 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.16 vs. limit=11.1625 2023-12-03 23:12:02,986 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.65 vs. 
limit=11.1625 2023-12-03 23:12:04,600 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=9766.666666666666, ans=0.125 2023-12-03 23:12:32,795 INFO [train.py:1087] (2/4) Epoch 2, batch 600, loss[loss=0.4048, simple_loss=0.4179, pruned_loss=0.1852, over 21455.00 frames. ], tot_loss[loss=0.378, simple_loss=0.3964, pruned_loss=0.163, over 4577513.97 frames. ], batch size: 128, lr: 4.37e-02, grad_scale: 16.0 2023-12-03 23:12:44,643 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 3.326e+02 4.968e+02 8.215e+02 1.747e+03, threshold=9.937e+02, percent-clipped=13.0 2023-12-03 23:12:47,009 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=10033.333333333334, ans=0.008688405797101449 2023-12-03 23:12:50,238 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=10033.333333333334, ans=0.5488333333333334 2023-12-03 23:12:59,064 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=10100.0, ans=0.125 2023-12-03 23:13:02,594 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=10100.0, ans=0.125 2023-12-03 23:13:12,208 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=10166.666666666666, ans=0.02430555555555556 2023-12-03 23:13:23,155 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.95 vs. limit=10.116666666666667 2023-12-03 23:13:29,360 INFO [train.py:1087] (2/4) Epoch 2, batch 650, loss[loss=0.3427, simple_loss=0.3787, pruned_loss=0.1387, over 24492.00 frames. ], tot_loss[loss=0.3715, simple_loss=0.3928, pruned_loss=0.1589, over 4628877.08 frames. ], batch size: 75, lr: 4.37e-02, grad_scale: 16.0 2023-12-03 23:13:39,872 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=10366.666666666666, ans=0.125 2023-12-03 23:13:52,991 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=10433.333333333334, ans=0.5348333333333334 2023-12-03 23:14:07,377 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.37 vs. limit=10.25 2023-12-03 23:14:24,815 INFO [train.py:1087] (2/4) Epoch 2, batch 700, loss[loss=0.3433, simple_loss=0.3822, pruned_loss=0.1389, over 24213.00 frames. ], tot_loss[loss=0.3649, simple_loss=0.3894, pruned_loss=0.1547, over 4678301.79 frames. 
], batch size: 82, lr: 4.36e-02, grad_scale: 16.0 2023-12-03 23:14:25,119 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=10633.333333333334, ans=0.125 2023-12-03 23:14:36,324 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 3.790e+02 5.279e+02 7.116e+02 1.447e+03, threshold=1.056e+03, percent-clipped=14.0 2023-12-03 23:14:36,590 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=10700.0, ans=0.125 2023-12-03 23:14:45,081 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=10700.0, ans=0.125 2023-12-03 23:15:02,604 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=10833.333333333334, ans=0.125 2023-12-03 23:15:03,821 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=10833.333333333334, ans=0.5208333333333334 2023-12-03 23:15:21,821 INFO [train.py:1087] (2/4) Epoch 2, batch 750, loss[loss=0.334, simple_loss=0.3718, pruned_loss=0.137, over 24727.00 frames. ], tot_loss[loss=0.3594, simple_loss=0.3866, pruned_loss=0.1515, over 4700692.95 frames. ], batch size: 67, lr: 4.36e-02, grad_scale: 16.0 2023-12-03 23:15:22,085 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=10966.666666666666, ans=0.020972222222222225 2023-12-03 23:15:23,540 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.83 vs. limit=10.483333333333333 2023-12-03 23:15:41,482 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.42 vs. limit=11.6375 2023-12-03 23:15:50,925 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.72 vs. limit=11.6625 2023-12-03 23:16:00,029 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=11166.666666666666, ans=0.05 2023-12-03 23:16:17,669 INFO [train.py:1087] (2/4) Epoch 2, batch 800, loss[loss=0.3272, simple_loss=0.3729, pruned_loss=0.1302, over 24709.00 frames. ], tot_loss[loss=0.3519, simple_loss=0.3825, pruned_loss=0.1468, over 4737474.00 frames. ], batch size: 69, lr: 4.36e-02, grad_scale: 32.0 2023-12-03 23:16:29,209 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.842e+02 3.331e+02 5.421e+02 7.994e+02 1.422e+03, threshold=1.084e+03, percent-clipped=13.0 2023-12-03 23:16:33,602 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=11366.666666666666, ans=0.01930555555555556 2023-12-03 23:16:39,690 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=11433.333333333334, ans=0.008384057971014493 2023-12-03 23:16:55,253 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.67 vs. 
limit=11.8125 2023-12-03 23:17:00,993 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=11566.666666666666, ans=0.4951666666666667 2023-12-03 23:17:02,064 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=11566.666666666666, ans=0.125 2023-12-03 23:17:09,950 INFO [train.py:1087] (2/4) Epoch 2, batch 850, loss[loss=0.3475, simple_loss=0.3862, pruned_loss=0.1466, over 21459.00 frames. ], tot_loss[loss=0.3468, simple_loss=0.3802, pruned_loss=0.144, over 4764708.46 frames. ], batch size: 128, lr: 4.35e-02, grad_scale: 32.0 2023-12-03 23:17:12,429 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=11633.333333333334, ans=0.008340579710144928 2023-12-03 23:17:15,566 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=11633.333333333334, ans=0.008340579710144928 2023-12-03 23:17:21,526 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-03 23:17:42,864 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.54 vs. limit=16.375 2023-12-03 23:17:49,752 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.98 vs. limit=16.375 2023-12-03 23:18:12,792 INFO [train.py:1087] (2/4) Epoch 3, batch 0, loss[loss=0.3362, simple_loss=0.3777, pruned_loss=0.1409, over 21066.00 frames. ], tot_loss[loss=0.3362, simple_loss=0.3777, pruned_loss=0.1409, over 21066.00 frames. ], batch size: 127, lr: 4.14e-02, grad_scale: 32.0 2023-12-03 23:18:12,793 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-03 23:18:25,044 INFO [train.py:1119] (2/4) Epoch 3, validation: loss=0.2657, simple_loss=0.3425, pruned_loss=0.08453, over 944034.00 frames. 2023-12-03 23:18:25,045 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-03 23:18:29,801 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.16 vs. limit=7.983333333333333 2023-12-03 23:18:35,368 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.20 vs. limit=12.0 2023-12-03 23:18:42,342 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 3.023e+02 4.473e+02 6.097e+02 1.021e+03, threshold=8.947e+02, percent-clipped=0.0 2023-12-03 23:18:44,725 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=12000.0, ans=0.48000000000000004 2023-12-03 23:18:52,143 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=12066.666666666666, ans=0.008246376811594203 2023-12-03 23:19:04,204 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.10 vs. limit=12.05 2023-12-03 23:19:22,056 INFO [train.py:1087] (2/4) Epoch 3, batch 50, loss[loss=0.3319, simple_loss=0.3791, pruned_loss=0.1371, over 24797.00 frames. ], tot_loss[loss=0.3215, simple_loss=0.3696, pruned_loss=0.1304, over 1084582.98 frames. 
], batch size: 73, lr: 4.13e-02, grad_scale: 32.0 2023-12-03 23:19:40,890 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=12333.333333333334, ans=0.015277777777777772 2023-12-03 23:19:44,443 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=12400.0, ans=0.176 2023-12-03 23:19:46,124 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=12400.0, ans=0.125 2023-12-03 23:20:18,446 INFO [train.py:1087] (2/4) Epoch 3, batch 100, loss[loss=0.3129, simple_loss=0.365, pruned_loss=0.1266, over 24439.00 frames. ], tot_loss[loss=0.3178, simple_loss=0.3678, pruned_loss=0.1286, over 1910619.18 frames. ], batch size: 77, lr: 4.13e-02, grad_scale: 32.0 2023-12-03 23:20:35,463 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.822e+02 2.710e+02 4.012e+02 6.179e+02 1.254e+03, threshold=8.023e+02, percent-clipped=3.0 2023-12-03 23:20:36,998 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.52 vs. limit=17.0 2023-12-03 23:20:39,238 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.35 vs. limit=17.0 2023-12-03 23:20:39,331 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.66 vs. limit=8.166666666666666 2023-12-03 23:20:49,084 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=12733.333333333334, ans=0.008101449275362318 2023-12-03 23:20:51,321 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=12800.0, ans=0.013333333333333336 2023-12-03 23:21:14,484 INFO [train.py:1087] (2/4) Epoch 3, batch 150, loss[loss=0.3189, simple_loss=0.3687, pruned_loss=0.1326, over 24499.00 frames. ], tot_loss[loss=0.3156, simple_loss=0.3673, pruned_loss=0.1277, over 2556337.26 frames. ], batch size: 75, lr: 4.13e-02, grad_scale: 32.0 2023-12-03 23:21:19,769 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.15 vs. limit=17.2 2023-12-03 23:21:28,083 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.86 vs. limit=17.25 2023-12-03 23:21:29,886 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=13000.0, ans=0.16999999999999998 2023-12-03 23:21:37,993 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=16.44 vs. 
limit=12.4 2023-12-03 23:21:47,452 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=13133.333333333334, ans=0.4403333333333333 2023-12-03 23:21:56,445 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-03 23:22:02,147 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=13200.0, ans=0.0 2023-12-03 23:22:04,533 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=13200.0, ans=0.125 2023-12-03 23:22:11,029 INFO [train.py:1087] (2/4) Epoch 3, batch 200, loss[loss=0.2953, simple_loss=0.3637, pruned_loss=0.1131, over 24703.00 frames. ], tot_loss[loss=0.3128, simple_loss=0.366, pruned_loss=0.1266, over 3055229.32 frames. ], batch size: 69, lr: 4.12e-02, grad_scale: 16.0 2023-12-03 23:22:12,362 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=13266.666666666666, ans=0.125 2023-12-03 23:22:29,350 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.597e+02 2.686e+02 4.511e+02 7.045e+02 2.264e+03, threshold=9.022e+02, percent-clipped=20.0 2023-12-03 23:22:33,956 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=13400.0, ans=0.007956521739130435 2023-12-03 23:22:52,001 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=13466.666666666666, ans=0.4286666666666667 2023-12-03 23:23:07,705 INFO [train.py:1087] (2/4) Epoch 3, batch 250, loss[loss=0.3528, simple_loss=0.3996, pruned_loss=0.1531, over 21442.00 frames. ], tot_loss[loss=0.3118, simple_loss=0.3662, pruned_loss=0.1265, over 3436223.05 frames. ], batch size: 127, lr: 4.12e-02, grad_scale: 16.0 2023-12-03 23:23:29,198 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=13733.333333333334, ans=0.009444444444444443 2023-12-03 23:23:30,519 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.41 vs. limit=17.8 2023-12-03 23:23:33,032 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.10 vs. limit=17.8 2023-12-03 23:23:41,382 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=13800.0, ans=0.125 2023-12-03 23:24:00,452 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=13866.666666666666, ans=0.125 2023-12-03 23:24:03,050 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=13933.333333333334, ans=0.125 2023-12-03 23:24:03,783 INFO [train.py:1087] (2/4) Epoch 3, batch 300, loss[loss=0.2809, simple_loss=0.3487, pruned_loss=0.1065, over 24602.00 frames. ], tot_loss[loss=0.3087, simple_loss=0.3647, pruned_loss=0.1248, over 3748342.74 frames. 
], batch size: 68, lr: 4.12e-02, grad_scale: 16.0 2023-12-03 23:24:08,688 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=13933.333333333334, ans=0.125 2023-12-03 23:24:09,701 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=13933.333333333334, ans=0.007840579710144928 2023-12-03 23:24:09,838 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-03 23:24:15,314 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=14000.0, ans=0.125 2023-12-03 23:24:16,761 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.69 vs. limit=18.0 2023-12-03 23:24:21,873 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.699e+02 3.552e+02 4.986e+02 1.435e+03, threshold=7.104e+02, percent-clipped=8.0 2023-12-03 23:24:25,424 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=14066.666666666666, ans=0.40766666666666673 2023-12-03 23:24:25,518 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=14066.666666666666, ans=0.008055555555555559 2023-12-03 23:24:36,251 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=14066.666666666666, ans=0.125 2023-12-03 23:24:45,809 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=14133.333333333334, ans=0.007797101449275362 2023-12-03 23:24:59,186 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=14266.666666666666, ans=0.125 2023-12-03 23:24:59,234 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=14266.666666666666, ans=0.00722222222222222 2023-12-03 23:24:59,321 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=14266.666666666666, ans=0.00722222222222222 2023-12-03 23:25:00,008 INFO [train.py:1087] (2/4) Epoch 3, batch 350, loss[loss=0.2688, simple_loss=0.3371, pruned_loss=0.1002, over 24557.00 frames. ], tot_loss[loss=0.3057, simple_loss=0.363, pruned_loss=0.123, over 3990940.98 frames. ], batch size: 64, lr: 4.11e-02, grad_scale: 16.0 2023-12-03 23:25:21,809 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-03 23:25:24,010 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=14400.0, ans=0.125 2023-12-03 23:25:24,981 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=14400.0, ans=0.125 2023-12-03 23:25:51,615 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=14533.333333333334, ans=0.3913333333333333 2023-12-03 23:25:54,220 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.23 vs. 
limit=12.95 2023-12-03 23:25:56,775 INFO [train.py:1087] (2/4) Epoch 3, batch 400, loss[loss=0.2681, simple_loss=0.3342, pruned_loss=0.101, over 24578.00 frames. ], tot_loss[loss=0.3026, simple_loss=0.361, pruned_loss=0.1212, over 4182705.10 frames. ], batch size: 64, lr: 4.11e-02, grad_scale: 32.0 2023-12-03 23:25:58,182 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=14600.0, ans=0.125 2023-12-03 23:26:01,930 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=14600.0, ans=0.389 2023-12-03 23:26:05,528 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=14600.0, ans=0.154 2023-12-03 23:26:16,108 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 2.839e+02 4.294e+02 6.758e+02 1.838e+03, threshold=8.588e+02, percent-clipped=21.0 2023-12-03 23:26:16,796 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.51 vs. limit=18.5 2023-12-03 23:26:22,335 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.00 vs. limit=13.025 2023-12-03 23:26:42,253 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.50 vs. limit=18.65 2023-12-03 23:26:43,408 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=14866.666666666666, ans=0.15133333333333335 2023-12-03 23:26:53,945 INFO [train.py:1087] (2/4) Epoch 3, batch 450, loss[loss=0.2631, simple_loss=0.3328, pruned_loss=0.09669, over 24779.00 frames. ], tot_loss[loss=0.2997, simple_loss=0.3596, pruned_loss=0.1193, over 4335975.74 frames. ], batch size: 71, lr: 4.10e-02, grad_scale: 32.0 2023-12-03 23:26:56,319 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=14933.333333333334, ans=0.125 2023-12-03 23:27:00,112 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.15 vs. limit=8.733333333333334 2023-12-03 23:27:09,674 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=15000.0, ans=0.125 2023-12-03 23:27:12,529 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=15000.0, ans=0.15000000000000002 2023-12-03 23:27:28,441 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.60 vs. limit=13.175 2023-12-03 23:27:46,212 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=15200.0, ans=0.0 2023-12-03 23:27:50,284 INFO [train.py:1087] (2/4) Epoch 3, batch 500, loss[loss=0.4055, simple_loss=0.4184, pruned_loss=0.1963, over 16985.00 frames. ], tot_loss[loss=0.2993, simple_loss=0.3596, pruned_loss=0.119, over 4433501.34 frames. 
], batch size: 177, lr: 4.10e-02, grad_scale: 16.0 2023-12-03 23:27:52,616 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=15266.666666666666, ans=0.125 2023-12-03 23:28:08,697 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.666e+02 2.776e+02 4.021e+02 5.372e+02 1.015e+03, threshold=8.041e+02, percent-clipped=4.0 2023-12-03 23:28:16,715 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=15400.0, ans=0.125 2023-12-03 23:28:20,254 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.90 vs. limit=13.275 2023-12-03 23:28:34,264 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=16.61 vs. limit=13.325 2023-12-03 23:28:37,313 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=15533.333333333334, ans=0.14466666666666667 2023-12-03 23:28:41,169 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.71 vs. limit=19.15 2023-12-03 23:28:47,159 INFO [train.py:1087] (2/4) Epoch 3, batch 550, loss[loss=0.2869, simple_loss=0.3501, pruned_loss=0.1118, over 24482.00 frames. ], tot_loss[loss=0.2988, simple_loss=0.3593, pruned_loss=0.1188, over 4489109.27 frames. ], batch size: 75, lr: 4.10e-02, grad_scale: 16.0 2023-12-03 23:29:22,338 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.81 vs. limit=19.35 2023-12-03 23:29:42,465 INFO [train.py:1087] (2/4) Epoch 3, batch 600, loss[loss=0.2672, simple_loss=0.3381, pruned_loss=0.09821, over 24813.00 frames. ], tot_loss[loss=0.2967, simple_loss=0.358, pruned_loss=0.1174, over 4563724.93 frames. ], batch size: 72, lr: 4.09e-02, grad_scale: 16.0 2023-12-03 23:30:02,675 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.514e+02 3.492e+02 5.338e+02 1.061e+03, threshold=6.983e+02, percent-clipped=7.0 2023-12-03 23:30:09,494 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=16066.666666666666, ans=0.125 2023-12-03 23:30:39,889 INFO [train.py:1087] (2/4) Epoch 3, batch 650, loss[loss=0.2702, simple_loss=0.3465, pruned_loss=0.09692, over 24802.00 frames. ], tot_loss[loss=0.2955, simple_loss=0.3574, pruned_loss=0.1166, over 4617685.19 frames. ], batch size: 62, lr: 4.09e-02, grad_scale: 16.0 2023-12-03 23:30:54,449 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.62 vs. 
limit=13.625 2023-12-03 23:31:09,535 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=16400.0, ans=0.0 2023-12-03 23:31:17,338 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=16466.666666666668, ans=0.125 2023-12-03 23:31:19,816 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=16466.666666666668, ans=0.0 2023-12-03 23:31:20,922 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=16466.666666666668, ans=0.125 2023-12-03 23:31:24,451 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=16533.333333333332, ans=0.125 2023-12-03 23:31:36,836 INFO [train.py:1087] (2/4) Epoch 3, batch 700, loss[loss=0.2777, simple_loss=0.3519, pruned_loss=0.1018, over 24550.00 frames. ], tot_loss[loss=0.2946, simple_loss=0.3568, pruned_loss=0.116, over 4653079.70 frames. ], batch size: 66, lr: 4.08e-02, grad_scale: 16.0 2023-12-03 23:31:43,578 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=16600.0, ans=0.125 2023-12-03 23:31:47,934 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=13.333333333333334 2023-12-03 23:31:55,284 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.843e+02 2.863e+02 3.533e+02 4.418e+02 1.002e+03, threshold=7.066e+02, percent-clipped=8.0 2023-12-03 23:32:18,862 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=16800.0, ans=0.0 2023-12-03 23:32:32,785 INFO [train.py:1087] (2/4) Epoch 3, batch 750, loss[loss=0.2815, simple_loss=0.3487, pruned_loss=0.1072, over 24707.00 frames. ], tot_loss[loss=0.2919, simple_loss=0.355, pruned_loss=0.1143, over 4706961.08 frames. ], batch size: 74, lr: 4.08e-02, grad_scale: 16.0 2023-12-03 23:32:43,119 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=17000.0, ans=0.125 2023-12-03 23:32:57,580 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=17066.666666666668, ans=0.12933333333333333 2023-12-03 23:32:59,712 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=17066.666666666668, ans=0.125 2023-12-03 23:33:06,666 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=17133.333333333332, ans=0.30033333333333345 2023-12-03 23:33:28,567 INFO [train.py:1087] (2/4) Epoch 3, batch 800, loss[loss=0.3364, simple_loss=0.3832, pruned_loss=0.1448, over 21531.00 frames. ], tot_loss[loss=0.2901, simple_loss=0.354, pruned_loss=0.113, over 4735222.62 frames. 
], batch size: 127, lr: 4.08e-02, grad_scale: 32.0 2023-12-03 23:33:46,739 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.831e+02 2.911e+02 4.063e+02 5.816e+02 1.939e+03, threshold=8.126e+02, percent-clipped=11.0 2023-12-03 23:34:01,039 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=17466.666666666668, ans=0.007072463768115942 2023-12-03 23:34:11,453 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=17533.333333333332, ans=0.28633333333333344 2023-12-03 23:34:12,357 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-03 23:34:12,600 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.52 vs. limit=20.65 2023-12-03 23:34:20,175 INFO [train.py:1087] (2/4) Epoch 3, batch 850, loss[loss=0.2795, simple_loss=0.3435, pruned_loss=0.1077, over 24508.00 frames. ], tot_loss[loss=0.2886, simple_loss=0.3529, pruned_loss=0.1121, over 4757833.71 frames. ], batch size: 75, lr: 4.07e-02, grad_scale: 32.0 2023-12-03 23:34:55,079 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.36 vs. limit=13.9 2023-12-03 23:34:56,788 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-03 23:35:22,592 INFO [train.py:1087] (2/4) Epoch 4, batch 0, loss[loss=0.2673, simple_loss=0.3397, pruned_loss=0.09749, over 24709.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3397, pruned_loss=0.09749, over 24709.00 frames. ], batch size: 67, lr: 3.80e-02, grad_scale: 32.0 2023-12-03 23:35:22,593 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-03 23:35:34,670 INFO [train.py:1119] (2/4) Epoch 4, validation: loss=0.2286, simple_loss=0.324, pruned_loss=0.06665, over 944034.00 frames. 2023-12-03 23:35:34,671 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-03 23:35:55,609 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.86 vs. limit=14.2375 2023-12-03 23:35:59,282 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.702e+02 2.897e+02 3.931e+02 5.307e+02 8.427e+02, threshold=7.861e+02, percent-clipped=1.0 2023-12-03 23:36:02,661 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=18033.333333333332, ans=0.035 2023-12-03 23:36:11,373 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=18100.0, ans=0.125 2023-12-03 23:36:14,436 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=18100.0, ans=10.0 2023-12-03 23:36:22,756 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=18166.666666666668, ans=0.0 2023-12-03 23:36:29,870 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.21 vs. limit=9.558333333333334 2023-12-03 23:36:30,365 INFO [train.py:1087] (2/4) Epoch 4, batch 50, loss[loss=0.2694, simple_loss=0.3355, pruned_loss=0.1017, over 24805.00 frames. 
], tot_loss[loss=0.2802, simple_loss=0.3481, pruned_loss=0.1062, over 1084595.37 frames. ], batch size: 73, lr: 3.80e-02, grad_scale: 32.0 2023-12-03 23:36:36,299 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=14.3375 2023-12-03 23:37:04,630 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.86 vs. limit=14.4125 2023-12-03 23:37:17,924 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=18500.0, ans=0.125 2023-12-03 23:37:25,212 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=18566.666666666668, ans=0.0 2023-12-03 23:37:26,360 INFO [train.py:1087] (2/4) Epoch 4, batch 100, loss[loss=0.2768, simple_loss=0.3459, pruned_loss=0.1038, over 24483.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3459, pruned_loss=0.1051, over 1914370.99 frames. ], batch size: 75, lr: 3.80e-02, grad_scale: 32.0 2023-12-03 23:37:30,838 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18566.666666666668, ans=0.11433333333333331 2023-12-03 23:37:30,945 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=18566.666666666668, ans=0.125 2023-12-03 23:37:33,173 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18566.666666666668, ans=0.11433333333333331 2023-12-03 23:37:39,647 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=18633.333333333332, ans=0.0 2023-12-03 23:37:44,194 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=18633.333333333332, ans=0.006818840579710146 2023-12-03 23:37:50,377 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 2.445e+02 3.359e+02 4.351e+02 7.950e+02, threshold=6.718e+02, percent-clipped=1.0 2023-12-03 23:37:56,372 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=18700.0, ans=0.025 2023-12-03 23:38:04,630 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.65 vs. limit=14.5375 2023-12-03 23:38:21,118 INFO [train.py:1087] (2/4) Epoch 4, batch 150, loss[loss=0.2697, simple_loss=0.3375, pruned_loss=0.1009, over 24144.00 frames. ], tot_loss[loss=0.2763, simple_loss=0.3447, pruned_loss=0.104, over 2567390.78 frames. 
], batch size: 82, lr: 3.79e-02, grad_scale: 32.0 2023-12-03 23:38:39,477 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=18966.666666666668, ans=0.125 2023-12-03 23:38:39,602 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=18966.666666666668, ans=0.4845 2023-12-03 23:38:46,028 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=19033.333333333332, ans=0.09899494936611666 2023-12-03 23:38:50,442 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=19033.333333333332, ans=0.125 2023-12-03 23:38:54,604 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=19100.0, ans=0.23150000000000004 2023-12-03 23:38:59,856 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=14.6625 2023-12-03 23:39:02,375 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.60 vs. limit=9.775 2023-12-03 23:39:04,596 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.08 vs. limit=14.6625 2023-12-03 23:39:10,689 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=19166.666666666668, ans=0.125 2023-12-03 23:39:14,913 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=19166.666666666668, ans=0.125 2023-12-03 23:39:16,829 INFO [train.py:1087] (2/4) Epoch 4, batch 200, loss[loss=0.2397, simple_loss=0.3166, pruned_loss=0.08137, over 24757.00 frames. ], tot_loss[loss=0.275, simple_loss=0.3441, pruned_loss=0.103, over 3080593.10 frames. ], batch size: 65, lr: 3.79e-02, grad_scale: 32.0 2023-12-03 23:39:23,523 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-03 23:39:23,632 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=19233.333333333332, ans=0.125 2023-12-03 23:39:27,961 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=19300.0, ans=0.025 2023-12-03 23:39:36,200 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=19300.0, ans=0.125 2023-12-03 23:39:41,566 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 2.310e+02 2.925e+02 3.881e+02 7.060e+02, threshold=5.850e+02, percent-clipped=2.0 2023-12-03 23:39:45,133 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=19366.666666666668, ans=0.22216666666666673 2023-12-03 23:39:50,320 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=19433.333333333332, ans=0.125 2023-12-03 23:39:57,231 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=10.38 vs. 
limit=14.7875 2023-12-03 23:40:00,004 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=19500.0, ans=0.0 2023-12-03 23:40:00,389 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.32 vs. limit=22.125 2023-12-03 23:40:04,258 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=19500.0, ans=0.006630434782608696 2023-12-03 23:40:10,142 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=19500.0, ans=0.006630434782608696 2023-12-03 23:40:13,194 INFO [train.py:1087] (2/4) Epoch 4, batch 250, loss[loss=0.2657, simple_loss=0.3406, pruned_loss=0.09542, over 24461.00 frames. ], tot_loss[loss=0.275, simple_loss=0.344, pruned_loss=0.103, over 3471260.51 frames. ], batch size: 77, lr: 3.78e-02, grad_scale: 32.0 2023-12-03 23:40:28,555 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.62 vs. limit=14.8625 2023-12-03 23:40:52,234 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=19766.666666666668, ans=0.0 2023-12-03 23:41:07,343 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=14.9375 2023-12-03 23:41:08,725 INFO [train.py:1087] (2/4) Epoch 4, batch 300, loss[loss=0.2856, simple_loss=0.3485, pruned_loss=0.1113, over 23515.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.3436, pruned_loss=0.1031, over 3755799.10 frames. ], batch size: 94, lr: 3.78e-02, grad_scale: 32.0 2023-12-03 23:41:13,401 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=19900.0, ans=0.101 2023-12-03 23:41:32,927 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.490e+02 3.248e+02 4.505e+02 1.296e+03, threshold=6.496e+02, percent-clipped=13.0 2023-12-03 23:41:45,928 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=20100.0, ans=0.07 2023-12-03 23:42:03,849 INFO [train.py:1087] (2/4) Epoch 4, batch 350, loss[loss=0.2677, simple_loss=0.3388, pruned_loss=0.09835, over 24748.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3437, pruned_loss=0.1033, over 3976305.55 frames. ], batch size: 65, lr: 3.78e-02, grad_scale: 32.0 2023-12-03 23:42:08,106 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=20233.333333333332, ans=0.1 2023-12-03 23:42:09,198 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=20233.333333333332, ans=0.07 2023-12-03 23:42:18,753 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=20300.0, ans=0.07 2023-12-03 23:42:51,830 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=20500.0, ans=0.2 2023-12-03 23:42:59,463 INFO [train.py:1087] (2/4) Epoch 4, batch 400, loss[loss=0.2658, simple_loss=0.337, pruned_loss=0.09733, over 23765.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3431, pruned_loss=0.1029, over 4146907.49 frames. 
], batch size: 57, lr: 3.77e-02, grad_scale: 32.0 2023-12-03 23:43:23,887 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 2.420e+02 3.127e+02 4.015e+02 7.262e+02, threshold=6.254e+02, percent-clipped=2.0 2023-12-03 23:43:27,384 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=20700.0, ans=0.2 2023-12-03 23:43:34,288 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=20766.666666666668, ans=0.125 2023-12-03 23:43:47,979 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=20833.333333333332, ans=0.0 2023-12-03 23:43:55,281 INFO [train.py:1087] (2/4) Epoch 4, batch 450, loss[loss=0.2615, simple_loss=0.335, pruned_loss=0.09402, over 24560.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.3415, pruned_loss=0.1015, over 4299447.13 frames. ], batch size: 63, lr: 3.77e-02, grad_scale: 32.0 2023-12-03 23:43:55,443 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=20900.0, ans=0.125 2023-12-03 23:44:01,620 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.50 vs. limit=22.5 2023-12-03 23:44:24,440 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=21033.333333333332, ans=0.006297101449275363 2023-12-03 23:44:33,215 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.59 vs. limit=22.5 2023-12-03 23:44:37,760 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2023-12-03 23:44:38,566 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=21166.666666666668, ans=0.125 2023-12-03 23:44:40,748 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=21166.666666666668, ans=0.1 2023-12-03 23:44:47,100 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.33 vs. limit=15.0 2023-12-03 23:44:50,929 INFO [train.py:1087] (2/4) Epoch 4, batch 500, loss[loss=0.265, simple_loss=0.3424, pruned_loss=0.0938, over 21713.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3407, pruned_loss=0.1011, over 4412837.65 frames. 
], batch size: 52, lr: 3.76e-02, grad_scale: 32.0 2023-12-03 23:44:52,353 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=21233.333333333332, ans=0.1 2023-12-03 23:44:55,264 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=21233.333333333332, ans=0.125 2023-12-03 23:44:55,468 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=21233.333333333332, ans=0.95 2023-12-03 23:44:59,713 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=21233.333333333332, ans=0.125 2023-12-03 23:45:14,560 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.84 vs. limit=15.0 2023-12-03 23:45:15,021 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.744e+02 2.343e+02 3.151e+02 5.447e+02 1.233e+03, threshold=6.302e+02, percent-clipped=16.0 2023-12-03 23:45:16,596 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0 2023-12-03 23:45:45,602 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=10.96 vs. limit=12.0 2023-12-03 23:45:46,093 INFO [train.py:1087] (2/4) Epoch 4, batch 550, loss[loss=0.3503, simple_loss=0.3912, pruned_loss=0.1547, over 16917.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3406, pruned_loss=0.1008, over 4490071.89 frames. ], batch size: 177, lr: 3.76e-02, grad_scale: 32.0 2023-12-03 23:45:49,624 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.24 vs. limit=12.0 2023-12-03 23:45:57,711 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=21633.333333333332, ans=0.0 2023-12-03 23:46:06,749 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.28 vs. limit=22.5 2023-12-03 23:46:11,737 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=21700.0, ans=0.125 2023-12-03 23:46:13,179 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.53 vs. limit=15.0 2023-12-03 23:46:16,238 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21700.0, ans=0.1 2023-12-03 23:46:41,160 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=5.96 vs. limit=12.0 2023-12-03 23:46:41,668 INFO [train.py:1087] (2/4) Epoch 4, batch 600, loss[loss=0.2772, simple_loss=0.3394, pruned_loss=0.1075, over 24046.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3394, pruned_loss=0.1002, over 4559248.78 frames. 
], batch size: 87, lr: 3.75e-02, grad_scale: 16.0 2023-12-03 23:46:58,899 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21966.666666666668, ans=0.1 2023-12-03 23:47:07,296 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.361e+02 2.942e+02 4.171e+02 8.666e+02, threshold=5.884e+02, percent-clipped=10.0 2023-12-03 23:47:10,811 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=22033.333333333332, ans=0.125 2023-12-03 23:47:11,813 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=22033.333333333332, ans=0.0 2023-12-03 23:47:14,985 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=22100.0, ans=0.125 2023-12-03 23:47:27,722 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=22166.666666666668, ans=0.0 2023-12-03 23:47:37,235 INFO [train.py:1087] (2/4) Epoch 4, batch 650, loss[loss=0.2628, simple_loss=0.3339, pruned_loss=0.09586, over 24496.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3389, pruned_loss=0.09957, over 4626631.40 frames. ], batch size: 77, lr: 3.75e-02, grad_scale: 16.0 2023-12-03 23:47:43,993 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=22233.333333333332, ans=0.006036231884057971 2023-12-03 23:47:45,367 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.12 vs. limit=6.0 2023-12-03 23:48:17,016 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=22433.333333333332, ans=0.125 2023-12-03 23:48:30,741 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=22500.0, ans=0.07 2023-12-03 23:48:32,503 INFO [train.py:1087] (2/4) Epoch 4, batch 700, loss[loss=0.2777, simple_loss=0.3497, pruned_loss=0.1029, over 24494.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.339, pruned_loss=0.09984, over 4646339.57 frames. ], batch size: 77, lr: 3.74e-02, grad_scale: 16.0 2023-12-03 23:48:41,473 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.70 vs. limit=22.5 2023-12-03 23:48:42,130 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=22633.333333333332, ans=0.125 2023-12-03 23:48:44,227 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=22633.333333333332, ans=0.2 2023-12-03 23:48:52,424 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=22633.333333333332, ans=0.1 2023-12-03 23:48:58,105 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.52 vs. 
limit=15.0 2023-12-03 23:48:58,806 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.819e+02 2.583e+02 3.364e+02 4.437e+02 7.652e+02, threshold=6.728e+02, percent-clipped=5.0 2023-12-03 23:49:08,747 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=22766.666666666668, ans=0.125 2023-12-03 23:49:28,655 INFO [train.py:1087] (2/4) Epoch 4, batch 750, loss[loss=0.2665, simple_loss=0.3359, pruned_loss=0.09858, over 24684.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3381, pruned_loss=0.09908, over 4683176.27 frames. ], batch size: 74, lr: 3.74e-02, grad_scale: 16.0 2023-12-03 23:49:29,855 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=22900.0, ans=0.125 2023-12-03 23:49:29,936 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=22900.0, ans=0.1 2023-12-03 23:50:21,975 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=23166.666666666668, ans=0.125 2023-12-03 23:50:23,863 INFO [train.py:1087] (2/4) Epoch 4, batch 800, loss[loss=0.2609, simple_loss=0.3375, pruned_loss=0.09213, over 24750.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3372, pruned_loss=0.0983, over 4724285.58 frames. ], batch size: 63, lr: 3.73e-02, grad_scale: 32.0 2023-12-03 23:50:30,462 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=23233.333333333332, ans=0.0 2023-12-03 23:50:32,784 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=23233.333333333332, ans=0.125 2023-12-03 23:50:43,170 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.91 vs. limit=22.5 2023-12-03 23:50:49,074 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.661e+02 2.403e+02 2.805e+02 3.728e+02 6.833e+02, threshold=5.611e+02, percent-clipped=1.0 2023-12-03 23:51:10,893 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.78 vs. limit=15.0 2023-12-03 23:51:12,569 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=23500.0, ans=0.125 2023-12-03 23:51:16,401 INFO [train.py:1087] (2/4) Epoch 4, batch 850, loss[loss=0.2731, simple_loss=0.3487, pruned_loss=0.09878, over 24753.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3374, pruned_loss=0.09833, over 4743001.25 frames. ], batch size: 61, lr: 3.73e-02, grad_scale: 32.0 2023-12-03 23:51:16,981 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.08 vs. limit=15.0 2023-12-03 23:51:27,508 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=23633.333333333332, ans=0.125 2023-12-03 23:51:38,648 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=23700.0, ans=0.07 2023-12-03 23:51:46,549 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=12.36 vs. 
limit=10.0 2023-12-03 23:51:59,495 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=23833.333333333332, ans=0.125 2023-12-03 23:52:18,339 INFO [train.py:1087] (2/4) Epoch 5, batch 0, loss[loss=0.2787, simple_loss=0.3448, pruned_loss=0.1063, over 23438.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3448, pruned_loss=0.1063, over 23438.00 frames. ], batch size: 94, lr: 3.47e-02, grad_scale: 32.0 2023-12-03 23:52:18,339 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-03 23:52:30,469 INFO [train.py:1119] (2/4) Epoch 5, validation: loss=0.2159, simple_loss=0.3139, pruned_loss=0.05896, over 944034.00 frames. 2023-12-03 23:52:30,470 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-03 23:52:41,507 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.83 vs. limit=10.0 2023-12-03 23:52:43,893 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=12.0 2023-12-03 23:52:51,983 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=24000.0, ans=15.0 2023-12-03 23:53:01,281 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.538e+02 3.142e+02 3.972e+02 7.221e+02, threshold=6.284e+02, percent-clipped=5.0 2023-12-03 23:53:02,630 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=24066.666666666668, ans=0.025 2023-12-03 23:53:07,469 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=24066.666666666668, ans=0.125 2023-12-03 23:53:26,116 INFO [train.py:1087] (2/4) Epoch 5, batch 50, loss[loss=0.2589, simple_loss=0.334, pruned_loss=0.09187, over 24455.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3383, pruned_loss=0.09857, over 1080814.22 frames. ], batch size: 77, lr: 3.46e-02, grad_scale: 32.0 2023-12-03 23:53:27,521 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=24200.0, ans=0.2 2023-12-03 23:54:02,502 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=24400.0, ans=0.125 2023-12-03 23:54:04,020 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.86 vs. limit=15.0 2023-12-03 23:54:04,776 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24400.0, ans=0.1 2023-12-03 23:54:05,164 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.16 vs. limit=22.5 2023-12-03 23:54:09,536 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=24466.666666666668, ans=0.2 2023-12-03 23:54:21,771 INFO [train.py:1087] (2/4) Epoch 5, batch 100, loss[loss=0.2477, simple_loss=0.3197, pruned_loss=0.08785, over 24865.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3332, pruned_loss=0.0944, over 1919240.94 frames. 
], batch size: 68, lr: 3.46e-02, grad_scale: 32.0 2023-12-03 23:54:35,186 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=24600.0, ans=0.2 2023-12-03 23:54:35,257 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=24600.0, ans=0.125 2023-12-03 23:54:40,519 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=24600.0, ans=0.125 2023-12-03 23:54:44,098 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=24666.666666666668, ans=0.05 2023-12-03 23:54:48,481 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=24666.666666666668, ans=0.125 2023-12-03 23:54:52,470 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.614e+02 2.253e+02 2.574e+02 3.314e+02 5.061e+02, threshold=5.148e+02, percent-clipped=0.0 2023-12-03 23:54:57,938 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=24733.333333333332, ans=0.0 2023-12-03 23:55:00,181 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-03 23:55:02,279 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=24733.333333333332, ans=0.0 2023-12-03 23:55:17,315 INFO [train.py:1087] (2/4) Epoch 5, batch 150, loss[loss=0.3387, simple_loss=0.387, pruned_loss=0.1452, over 17169.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3333, pruned_loss=0.09461, over 2550777.80 frames. ], batch size: 177, lr: 3.46e-02, grad_scale: 32.0 2023-12-03 23:55:23,956 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=24866.666666666668, ans=0.0 2023-12-03 23:55:28,790 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=24933.333333333332, ans=0.025 2023-12-03 23:55:36,387 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=24933.333333333332, ans=0.125 2023-12-03 23:55:37,442 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=24933.333333333332, ans=0.125 2023-12-03 23:55:46,306 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.00 vs. limit=15.0 2023-12-03 23:55:55,502 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=25066.666666666668, ans=0.125 2023-12-03 23:56:12,425 INFO [train.py:1087] (2/4) Epoch 5, batch 200, loss[loss=0.234, simple_loss=0.3093, pruned_loss=0.07934, over 24553.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3323, pruned_loss=0.09393, over 3042207.67 frames. 
], batch size: 63, lr: 3.45e-02, grad_scale: 32.0 2023-12-03 23:56:34,186 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=25333.333333333332, ans=0.1 2023-12-03 23:56:43,923 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 2.107e+02 2.550e+02 3.488e+02 6.968e+02, threshold=5.101e+02, percent-clipped=3.0 2023-12-03 23:56:44,651 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.49 vs. limit=15.0 2023-12-03 23:56:49,595 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=25400.0, ans=0.0 2023-12-03 23:57:08,974 INFO [train.py:1087] (2/4) Epoch 5, batch 250, loss[loss=0.2579, simple_loss=0.3298, pruned_loss=0.09298, over 24033.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3312, pruned_loss=0.09307, over 3440184.37 frames. ], batch size: 87, lr: 3.45e-02, grad_scale: 32.0 2023-12-03 23:57:53,158 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=25800.0, ans=0.125 2023-12-03 23:58:04,572 INFO [train.py:1087] (2/4) Epoch 5, batch 300, loss[loss=0.2724, simple_loss=0.3449, pruned_loss=0.09995, over 23065.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3309, pruned_loss=0.0924, over 3750858.89 frames. ], batch size: 106, lr: 3.44e-02, grad_scale: 32.0 2023-12-03 23:58:24,307 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=25933.333333333332, ans=0.125 2023-12-03 23:58:35,989 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 2.281e+02 3.318e+02 4.636e+02 8.422e+02, threshold=6.635e+02, percent-clipped=20.0 2023-12-03 23:58:59,493 INFO [train.py:1087] (2/4) Epoch 5, batch 350, loss[loss=0.2521, simple_loss=0.3216, pruned_loss=0.09125, over 24523.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3304, pruned_loss=0.09227, over 3985109.73 frames. 
], batch size: 75, lr: 3.44e-02, grad_scale: 32.0 2023-12-03 23:58:59,817 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=26200.0, ans=0.125 2023-12-03 23:59:12,547 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=26266.666666666668, ans=0.125 2023-12-03 23:59:13,467 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=26266.666666666668, ans=0.025 2023-12-03 23:59:19,905 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=26266.666666666668, ans=0.125 2023-12-03 23:59:19,954 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=26266.666666666668, ans=0.2 2023-12-03 23:59:26,447 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=26333.333333333332, ans=0.125 2023-12-03 23:59:27,459 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=26333.333333333332, ans=0.125 2023-12-03 23:59:54,410 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.61 vs. limit=22.5 2023-12-03 23:59:54,876 INFO [train.py:1087] (2/4) Epoch 5, batch 400, loss[loss=0.2746, simple_loss=0.3437, pruned_loss=0.1028, over 21692.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.33, pruned_loss=0.09173, over 4184776.47 frames. ], batch size: 127, lr: 3.43e-02, grad_scale: 32.0 2023-12-03 23:59:58,363 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 00:00:04,958 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=26600.0, ans=0.125 2023-12-04 00:00:30,023 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.670e+02 2.259e+02 2.965e+02 3.727e+02 6.747e+02, threshold=5.929e+02, percent-clipped=2.0 2023-12-04 00:00:48,794 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=26800.0, ans=0.005043478260869565 2023-12-04 00:00:53,792 INFO [train.py:1087] (2/4) Epoch 5, batch 450, loss[loss=0.2532, simple_loss=0.3271, pruned_loss=0.08962, over 23613.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3295, pruned_loss=0.09118, over 4325261.33 frames. ], batch size: 94, lr: 3.43e-02, grad_scale: 32.0 2023-12-04 00:01:07,931 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=26933.333333333332, ans=0.0 2023-12-04 00:01:27,529 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=27066.666666666668, ans=0.125 2023-12-04 00:01:49,302 INFO [train.py:1087] (2/4) Epoch 5, batch 500, loss[loss=0.2403, simple_loss=0.3166, pruned_loss=0.08205, over 24762.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3284, pruned_loss=0.09029, over 4450594.88 frames. 
], batch size: 70, lr: 3.42e-02, grad_scale: 32.0 2023-12-04 00:01:50,688 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=27200.0, ans=0.004956521739130435 2023-12-04 00:01:56,405 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=27200.0, ans=0.125 2023-12-04 00:02:01,575 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=27266.666666666668, ans=0.0 2023-12-04 00:02:09,078 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=27266.666666666668, ans=0.004942028985507246 2023-12-04 00:02:18,865 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-12-04 00:02:21,286 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.548e+02 2.293e+02 2.878e+02 3.785e+02 5.915e+02, threshold=5.756e+02, percent-clipped=0.0 2023-12-04 00:02:25,966 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=27400.0, ans=0.07 2023-12-04 00:02:28,521 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=27400.0, ans=0.00491304347826087 2023-12-04 00:02:33,718 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=27466.666666666668, ans=0.0 2023-12-04 00:02:35,767 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=27466.666666666668, ans=0.2 2023-12-04 00:02:41,152 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=27466.666666666668, ans=0.2 2023-12-04 00:02:42,165 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=27466.666666666668, ans=0.04949747468305833 2023-12-04 00:02:43,381 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=27533.333333333332, ans=0.0 2023-12-04 00:02:44,174 INFO [train.py:1087] (2/4) Epoch 5, batch 550, loss[loss=0.2312, simple_loss=0.3061, pruned_loss=0.07815, over 24753.00 frames. ], tot_loss[loss=0.254, simple_loss=0.328, pruned_loss=0.08997, over 4543391.36 frames. ], batch size: 66, lr: 3.42e-02, grad_scale: 32.0 2023-12-04 00:02:49,499 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=27533.333333333332, ans=0.1 2023-12-04 00:02:59,482 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=27600.0, ans=0.125 2023-12-04 00:03:08,583 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=27666.666666666668, ans=0.125 2023-12-04 00:03:15,562 INFO [scaling.py:1022] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.74 vs. limit=5.0 2023-12-04 00:03:41,068 INFO [train.py:1087] (2/4) Epoch 5, batch 600, loss[loss=0.2286, simple_loss=0.3118, pruned_loss=0.07269, over 24845.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3282, pruned_loss=0.08991, over 4597795.94 frames. 
], batch size: 68, lr: 3.41e-02, grad_scale: 32.0 2023-12-04 00:03:56,379 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=27933.333333333332, ans=0.125 2023-12-04 00:04:09,774 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=28000.0, ans=0.05 2023-12-04 00:04:13,739 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 2.232e+02 2.803e+02 3.888e+02 7.159e+02, threshold=5.606e+02, percent-clipped=4.0 2023-12-04 00:04:26,995 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=28133.333333333332, ans=0.0 2023-12-04 00:04:29,147 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=28133.333333333332, ans=0.125 2023-12-04 00:04:33,979 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.88 vs. limit=6.0 2023-12-04 00:04:37,502 INFO [train.py:1087] (2/4) Epoch 5, batch 650, loss[loss=0.2663, simple_loss=0.3372, pruned_loss=0.09766, over 24300.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3282, pruned_loss=0.08986, over 4653828.85 frames. ], batch size: 79, lr: 3.41e-02, grad_scale: 32.0 2023-12-04 00:04:41,355 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=28200.0, ans=0.07 2023-12-04 00:04:50,736 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=28266.666666666668, ans=0.0 2023-12-04 00:05:03,881 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.96 vs. limit=22.5 2023-12-04 00:05:07,872 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=28333.333333333332, ans=0.2 2023-12-04 00:05:30,761 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=28466.666666666668, ans=0.1 2023-12-04 00:05:33,778 INFO [train.py:1087] (2/4) Epoch 5, batch 700, loss[loss=0.2561, simple_loss=0.3313, pruned_loss=0.09049, over 24777.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3279, pruned_loss=0.08996, over 4678432.92 frames. 
], batch size: 62, lr: 3.40e-02, grad_scale: 32.0 2023-12-04 00:05:43,759 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=28600.0, ans=0.125 2023-12-04 00:05:47,247 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=28600.0, ans=0.125 2023-12-04 00:06:06,254 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 2.233e+02 2.774e+02 3.473e+02 7.233e+02, threshold=5.549e+02, percent-clipped=3.0 2023-12-04 00:06:15,091 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=28733.333333333332, ans=0.05 2023-12-04 00:06:23,052 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=28800.0, ans=0.004608695652173913 2023-12-04 00:06:29,838 INFO [train.py:1087] (2/4) Epoch 5, batch 750, loss[loss=0.2343, simple_loss=0.3124, pruned_loss=0.07807, over 24790.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3278, pruned_loss=0.09, over 4710289.96 frames. ], batch size: 62, lr: 3.40e-02, grad_scale: 32.0 2023-12-04 00:06:43,201 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=28933.333333333332, ans=0.0 2023-12-04 00:07:07,068 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=29066.666666666668, ans=0.125 2023-12-04 00:07:13,629 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=29133.333333333332, ans=0.004536231884057972 2023-12-04 00:07:14,706 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 00:07:25,086 INFO [train.py:1087] (2/4) Epoch 5, batch 800, loss[loss=0.2434, simple_loss=0.3182, pruned_loss=0.08434, over 24806.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3272, pruned_loss=0.08945, over 4731526.96 frames. ], batch size: 62, lr: 3.39e-02, grad_scale: 32.0 2023-12-04 00:07:30,781 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=29200.0, ans=0.09899494936611666 2023-12-04 00:07:31,778 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=29200.0, ans=0.125 2023-12-04 00:07:51,007 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.95 vs. limit=15.0 2023-12-04 00:07:56,521 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.574e+02 2.263e+02 2.786e+02 3.430e+02 5.590e+02, threshold=5.572e+02, percent-clipped=1.0 2023-12-04 00:08:17,893 INFO [train.py:1087] (2/4) Epoch 5, batch 850, loss[loss=0.2651, simple_loss=0.3387, pruned_loss=0.09574, over 23449.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.327, pruned_loss=0.08923, over 4757940.38 frames. ], batch size: 94, lr: 3.39e-02, grad_scale: 32.0 2023-12-04 00:09:19,640 INFO [train.py:1087] (2/4) Epoch 6, batch 0, loss[loss=0.2626, simple_loss=0.3319, pruned_loss=0.09664, over 24769.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3319, pruned_loss=0.09664, over 24769.00 frames. 
], batch size: 64, lr: 3.16e-02, grad_scale: 32.0 2023-12-04 00:09:19,641 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 00:09:31,938 INFO [train.py:1119] (2/4) Epoch 6, validation: loss=0.2086, simple_loss=0.3076, pruned_loss=0.05475, over 944034.00 frames. 2023-12-04 00:09:31,938 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 00:09:50,575 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=29900.0, ans=0.0 2023-12-04 00:09:57,374 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.88 vs. limit=6.0 2023-12-04 00:10:00,325 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=29966.666666666668, ans=0.95 2023-12-04 00:10:11,102 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.671e+02 2.341e+02 2.934e+02 3.953e+02 6.299e+02, threshold=5.868e+02, percent-clipped=2.0 2023-12-04 00:10:16,734 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=30100.0, ans=0.125 2023-12-04 00:10:27,933 INFO [train.py:1087] (2/4) Epoch 6, batch 50, loss[loss=0.2411, simple_loss=0.3232, pruned_loss=0.07952, over 24790.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3257, pruned_loss=0.0878, over 1079184.81 frames. ], batch size: 62, lr: 3.15e-02, grad_scale: 32.0 2023-12-04 00:10:28,260 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=30166.666666666668, ans=0.125 2023-12-04 00:10:28,464 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.04 vs. limit=15.0 2023-12-04 00:10:30,527 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.47 vs. limit=15.0 2023-12-04 00:10:36,858 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=30166.666666666668, ans=0.1 2023-12-04 00:10:41,080 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=30233.333333333332, ans=0.0 2023-12-04 00:10:48,937 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=30300.0, ans=0.125 2023-12-04 00:10:56,516 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=30300.0, ans=0.125 2023-12-04 00:11:10,565 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=30366.666666666668, ans=0.125 2023-12-04 00:11:23,791 INFO [train.py:1087] (2/4) Epoch 6, batch 100, loss[loss=0.2482, simple_loss=0.3253, pruned_loss=0.08557, over 24613.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3246, pruned_loss=0.08657, over 1914318.30 frames. 
], batch size: 68, lr: 3.15e-02, grad_scale: 32.0 2023-12-04 00:11:41,987 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=30566.666666666668, ans=0.1 2023-12-04 00:12:03,694 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.998e+02 2.383e+02 2.874e+02 5.461e+02, threshold=4.765e+02, percent-clipped=0.0 2023-12-04 00:12:06,113 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=30700.0, ans=0.125 2023-12-04 00:12:19,879 INFO [train.py:1087] (2/4) Epoch 6, batch 150, loss[loss=0.2364, simple_loss=0.3131, pruned_loss=0.07989, over 24545.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3238, pruned_loss=0.08645, over 2556656.44 frames. ], batch size: 62, lr: 3.14e-02, grad_scale: 32.0 2023-12-04 00:12:24,950 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=30833.333333333332, ans=0.125 2023-12-04 00:12:59,034 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=31033.333333333332, ans=0.004123188405797102 2023-12-04 00:13:14,651 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=31100.0, ans=0.125 2023-12-04 00:13:15,575 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=31166.666666666668, ans=0.125 2023-12-04 00:13:16,522 INFO [train.py:1087] (2/4) Epoch 6, batch 200, loss[loss=0.2205, simple_loss=0.3014, pruned_loss=0.06982, over 24774.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3227, pruned_loss=0.08568, over 3061073.55 frames. ], batch size: 73, lr: 3.14e-02, grad_scale: 32.0 2023-12-04 00:13:23,290 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=31166.666666666668, ans=0.125 2023-12-04 00:13:38,639 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=6.05 vs. limit=12.0 2023-12-04 00:13:43,804 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=31300.0, ans=0.004065217391304348 2023-12-04 00:13:51,290 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=31366.666666666668, ans=0.02 2023-12-04 00:13:55,243 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 2.132e+02 2.539e+02 3.184e+02 5.927e+02, threshold=5.079e+02, percent-clipped=2.0 2023-12-04 00:13:57,684 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=31366.666666666668, ans=0.125 2023-12-04 00:14:12,883 INFO [train.py:1087] (2/4) Epoch 6, batch 250, loss[loss=0.2262, simple_loss=0.3119, pruned_loss=0.07023, over 24769.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3224, pruned_loss=0.08544, over 3449212.24 frames. ], batch size: 71, lr: 3.13e-02, grad_scale: 32.0 2023-12-04 00:14:16,748 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.02 vs. 
limit=15.0 2023-12-04 00:14:18,498 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=31500.0, ans=0.2 2023-12-04 00:14:23,968 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=31566.666666666668, ans=0.0 2023-12-04 00:14:42,574 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=31633.333333333332, ans=0.0 2023-12-04 00:14:44,721 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=31633.333333333332, ans=0.2 2023-12-04 00:14:48,965 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=31700.0, ans=0.125 2023-12-04 00:14:55,372 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=31700.0, ans=0.125 2023-12-04 00:15:04,812 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31766.666666666668, ans=0.1 2023-12-04 00:15:09,560 INFO [train.py:1087] (2/4) Epoch 6, batch 300, loss[loss=0.2376, simple_loss=0.3145, pruned_loss=0.08034, over 24745.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.322, pruned_loss=0.08521, over 3756299.10 frames. ], batch size: 63, lr: 3.13e-02, grad_scale: 32.0 2023-12-04 00:15:14,050 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=31833.333333333332, ans=0.125 2023-12-04 00:15:20,527 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=31900.0, ans=0.0 2023-12-04 00:15:28,476 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.16 vs. limit=15.0 2023-12-04 00:15:48,194 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=32033.333333333332, ans=0.2 2023-12-04 00:15:48,953 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 2.044e+02 2.473e+02 2.984e+02 6.405e+02, threshold=4.945e+02, percent-clipped=3.0 2023-12-04 00:15:51,370 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=32033.333333333332, ans=0.125 2023-12-04 00:15:56,099 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.46 vs. limit=15.0 2023-12-04 00:15:57,092 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.93 vs. limit=15.0 2023-12-04 00:15:59,014 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=32100.0, ans=0.125 2023-12-04 00:16:05,113 INFO [train.py:1087] (2/4) Epoch 6, batch 350, loss[loss=0.2542, simple_loss=0.326, pruned_loss=0.09117, over 24759.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3221, pruned_loss=0.08562, over 3981261.77 frames. 
], batch size: 65, lr: 3.12e-02, grad_scale: 32.0 2023-12-04 00:16:09,257 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=32166.666666666668, ans=0.125 2023-12-04 00:16:17,639 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=32233.333333333332, ans=0.125 2023-12-04 00:16:21,380 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.67 vs. limit=15.0 2023-12-04 00:16:24,303 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=32233.333333333332, ans=0.125 2023-12-04 00:16:25,550 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=15.0 2023-12-04 00:16:38,584 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=32366.666666666668, ans=0.1 2023-12-04 00:16:38,616 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=32366.666666666668, ans=0.125 2023-12-04 00:17:01,505 INFO [train.py:1087] (2/4) Epoch 6, batch 400, loss[loss=0.2431, simple_loss=0.3221, pruned_loss=0.08205, over 24790.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3228, pruned_loss=0.08582, over 4159023.49 frames. ], batch size: 71, lr: 3.12e-02, grad_scale: 32.0 2023-12-04 00:17:02,874 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=32500.0, ans=0.0 2023-12-04 00:17:15,375 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=32566.666666666668, ans=0.125 2023-12-04 00:17:27,591 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=32633.333333333332, ans=0.125 2023-12-04 00:17:31,837 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=32633.333333333332, ans=0.125 2023-12-04 00:17:40,917 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 2.034e+02 2.326e+02 2.944e+02 4.882e+02, threshold=4.653e+02, percent-clipped=0.0 2023-12-04 00:17:57,681 INFO [train.py:1087] (2/4) Epoch 6, batch 450, loss[loss=0.22, simple_loss=0.3036, pruned_loss=0.06822, over 24755.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3225, pruned_loss=0.08556, over 4307141.00 frames. 
], batch size: 66, lr: 3.12e-02, grad_scale: 32.0 2023-12-04 00:18:08,529 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=32900.0, ans=0.125 2023-12-04 00:18:16,018 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 00:18:23,891 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=32966.666666666664, ans=0.05 2023-12-04 00:18:29,221 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=32966.666666666664, ans=0.003702898550724639 2023-12-04 00:18:48,859 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=33100.0, ans=0.125 2023-12-04 00:18:50,462 INFO [scaling.py:1022] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.17 vs. limit=5.0 2023-12-04 00:18:53,925 INFO [train.py:1087] (2/4) Epoch 6, batch 500, loss[loss=0.2848, simple_loss=0.3502, pruned_loss=0.1097, over 21566.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3222, pruned_loss=0.08552, over 4405565.81 frames. ], batch size: 127, lr: 3.11e-02, grad_scale: 32.0 2023-12-04 00:18:59,545 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=33166.666666666664, ans=0.2 2023-12-04 00:19:10,791 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=33233.333333333336, ans=0.09899494936611666 2023-12-04 00:19:15,383 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=33300.0, ans=0.125 2023-12-04 00:19:27,680 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.57 vs. limit=22.5 2023-12-04 00:19:33,510 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.594e+02 1.953e+02 2.208e+02 2.592e+02 4.594e+02, threshold=4.417e+02, percent-clipped=0.0 2023-12-04 00:19:44,123 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.33 vs. limit=15.0 2023-12-04 00:19:50,850 INFO [train.py:1087] (2/4) Epoch 6, batch 550, loss[loss=0.2583, simple_loss=0.3343, pruned_loss=0.09117, over 24039.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3212, pruned_loss=0.08486, over 4507230.74 frames. ], batch size: 87, lr: 3.11e-02, grad_scale: 32.0 2023-12-04 00:20:08,693 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=33566.666666666664, ans=0.125 2023-12-04 00:20:19,094 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.63 vs. limit=22.5 2023-12-04 00:20:26,286 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=33700.0, ans=0.2 2023-12-04 00:20:32,673 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=33700.0, ans=0.1 2023-12-04 00:20:47,009 INFO [train.py:1087] (2/4) Epoch 6, batch 600, loss[loss=0.2478, simple_loss=0.3247, pruned_loss=0.08541, over 24569.00 frames. 
], tot_loss[loss=0.2462, simple_loss=0.3217, pruned_loss=0.08534, over 4560344.91 frames. ], batch size: 65, lr: 3.10e-02, grad_scale: 32.0 2023-12-04 00:20:49,447 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=33833.333333333336, ans=0.125 2023-12-04 00:20:56,292 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=33833.333333333336, ans=22.5 2023-12-04 00:20:58,022 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=33900.0, ans=0.125 2023-12-04 00:21:00,205 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=33900.0, ans=0.07 2023-12-04 00:21:09,799 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=33966.666666666664, ans=0.125 2023-12-04 00:21:10,912 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=33966.666666666664, ans=0.0 2023-12-04 00:21:26,787 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 2.169e+02 2.721e+02 3.398e+02 7.625e+02, threshold=5.441e+02, percent-clipped=15.0 2023-12-04 00:21:33,946 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=34100.0, ans=15.0 2023-12-04 00:21:43,043 INFO [train.py:1087] (2/4) Epoch 6, batch 650, loss[loss=0.2508, simple_loss=0.3235, pruned_loss=0.08902, over 24761.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3205, pruned_loss=0.08464, over 4607225.42 frames. ], batch size: 66, lr: 3.10e-02, grad_scale: 32.0 2023-12-04 00:21:46,500 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=34166.666666666664, ans=0.2 2023-12-04 00:22:13,378 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=34300.0, ans=0.0 2023-12-04 00:22:19,090 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.50 vs. limit=15.0 2023-12-04 00:22:36,252 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=34433.333333333336, ans=0.125 2023-12-04 00:22:39,142 INFO [train.py:1087] (2/4) Epoch 6, batch 700, loss[loss=0.2555, simple_loss=0.3314, pruned_loss=0.08981, over 24452.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3198, pruned_loss=0.08368, over 4668589.17 frames. 
], batch size: 77, lr: 3.09e-02, grad_scale: 32.0 2023-12-04 00:23:01,921 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=34633.333333333336, ans=0.125 2023-12-04 00:23:18,442 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 2.071e+02 2.301e+02 2.749e+02 5.150e+02, threshold=4.601e+02, percent-clipped=0.0 2023-12-04 00:23:26,279 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34766.666666666664, ans=0.1 2023-12-04 00:23:32,355 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34766.666666666664, ans=0.1 2023-12-04 00:23:35,732 INFO [train.py:1087] (2/4) Epoch 6, batch 750, loss[loss=0.2338, simple_loss=0.3136, pruned_loss=0.07702, over 24811.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3198, pruned_loss=0.08374, over 4687929.78 frames. ], batch size: 73, lr: 3.09e-02, grad_scale: 32.0 2023-12-04 00:23:54,397 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.47 vs. limit=10.0 2023-12-04 00:24:07,063 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=34966.666666666664, ans=0.2 2023-12-04 00:24:19,316 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.42 vs. limit=6.0 2023-12-04 00:24:26,074 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.34 vs. limit=15.0 2023-12-04 00:24:26,561 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=35100.0, ans=0.04949747468305833 2023-12-04 00:24:30,995 INFO [train.py:1087] (2/4) Epoch 6, batch 800, loss[loss=0.2457, simple_loss=0.3222, pruned_loss=0.08464, over 24570.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3194, pruned_loss=0.08344, over 4716992.70 frames. ], batch size: 64, lr: 3.08e-02, grad_scale: 32.0 2023-12-04 00:24:39,316 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=35166.666666666664, ans=0.125 2023-12-04 00:24:39,364 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=35166.666666666664, ans=0.0032246376811594212 2023-12-04 00:24:41,101 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.61 vs. limit=15.0 2023-12-04 00:25:07,853 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 2.093e+02 2.608e+02 3.306e+02 8.466e+02, threshold=5.216e+02, percent-clipped=6.0 2023-12-04 00:25:20,816 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.33 vs. limit=6.0 2023-12-04 00:25:22,465 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.63 vs. limit=10.0 2023-12-04 00:25:23,101 INFO [train.py:1087] (2/4) Epoch 6, batch 850, loss[loss=0.2282, simple_loss=0.3063, pruned_loss=0.07498, over 24576.00 frames. 
], tot_loss[loss=0.2434, simple_loss=0.3196, pruned_loss=0.08362, over 4735276.04 frames. ], batch size: 64, lr: 3.08e-02, grad_scale: 32.0 2023-12-04 00:25:44,493 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 00:25:45,021 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=15.0 2023-12-04 00:25:55,434 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.06 vs. limit=22.5 2023-12-04 00:26:02,014 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=35700.0, ans=0.1 2023-12-04 00:26:04,952 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=35766.666666666664, ans=0.05 2023-12-04 00:26:04,995 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=35766.666666666664, ans=0.125 2023-12-04 00:26:25,684 INFO [train.py:1087] (2/4) Epoch 7, batch 0, loss[loss=0.2297, simple_loss=0.3122, pruned_loss=0.07357, over 24556.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3122, pruned_loss=0.07357, over 24556.00 frames. ], batch size: 62, lr: 2.88e-02, grad_scale: 32.0 2023-12-04 00:26:25,685 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 00:26:37,873 INFO [train.py:1119] (2/4) Epoch 7, validation: loss=0.199, simple_loss=0.2994, pruned_loss=0.0493, over 944034.00 frames. 2023-12-04 00:26:37,874 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 00:26:40,287 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=35800.0, ans=0.0 2023-12-04 00:26:51,715 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=35866.666666666664, ans=0.0 2023-12-04 00:27:16,782 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=36000.0, ans=0.1 2023-12-04 00:27:21,669 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 2.041e+02 2.433e+02 2.875e+02 3.924e+02, threshold=4.865e+02, percent-clipped=0.0 2023-12-04 00:27:33,030 INFO [train.py:1087] (2/4) Epoch 7, batch 50, loss[loss=0.2219, simple_loss=0.3047, pruned_loss=0.06953, over 24742.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3186, pruned_loss=0.08228, over 1088987.92 frames. ], batch size: 63, lr: 2.88e-02, grad_scale: 32.0 2023-12-04 00:27:33,157 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=36133.333333333336, ans=0.015 2023-12-04 00:28:12,456 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=36333.333333333336, ans=0.125 2023-12-04 00:28:16,783 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=36400.0, ans=0.2 2023-12-04 00:28:28,807 INFO [train.py:1087] (2/4) Epoch 7, batch 100, loss[loss=0.2466, simple_loss=0.3204, pruned_loss=0.08637, over 23983.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3185, pruned_loss=0.08249, over 1910567.16 frames. 
], batch size: 87, lr: 2.87e-02, grad_scale: 32.0 2023-12-04 00:28:52,023 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=36600.0, ans=0.0 2023-12-04 00:29:08,868 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=36666.666666666664, ans=0.125 2023-12-04 00:29:12,623 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.963e+02 2.300e+02 3.064e+02 5.989e+02, threshold=4.600e+02, percent-clipped=1.0 2023-12-04 00:29:22,078 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.82 vs. limit=15.0 2023-12-04 00:29:23,687 INFO [train.py:1087] (2/4) Epoch 7, batch 150, loss[loss=0.222, simple_loss=0.3021, pruned_loss=0.07098, over 24752.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3174, pruned_loss=0.08208, over 2549034.26 frames. ], batch size: 66, lr: 2.87e-02, grad_scale: 32.0 2023-12-04 00:29:25,976 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=36800.0, ans=0.2 2023-12-04 00:29:35,662 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=36866.666666666664, ans=0.125 2023-12-04 00:29:49,960 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=36933.333333333336, ans=0.0 2023-12-04 00:30:01,001 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=25.86 vs. limit=22.5 2023-12-04 00:30:06,700 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=37000.0, ans=0.0028260869565217396 2023-12-04 00:30:19,461 INFO [train.py:1087] (2/4) Epoch 7, batch 200, loss[loss=0.24, simple_loss=0.3161, pruned_loss=0.08196, over 24540.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3157, pruned_loss=0.08037, over 3072600.62 frames. ], batch size: 62, lr: 2.86e-02, grad_scale: 32.0 2023-12-04 00:30:20,820 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=37133.333333333336, ans=0.0 2023-12-04 00:30:30,352 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=37200.0, ans=0.0027826086956521745 2023-12-04 00:30:35,789 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=37200.0, ans=0.125 2023-12-04 00:31:03,595 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=37400.0, ans=10.0 2023-12-04 00:31:04,585 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 2.199e+02 2.530e+02 3.188e+02 4.211e+02, threshold=5.060e+02, percent-clipped=0.0 2023-12-04 00:31:08,985 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=37400.0, ans=0.0027391304347826086 2023-12-04 00:31:10,129 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.84 vs. 
limit=15.0 2023-12-04 00:31:13,003 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=37400.0, ans=0.0 2023-12-04 00:31:15,242 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=37466.666666666664, ans=0.125 2023-12-04 00:31:15,987 INFO [train.py:1087] (2/4) Epoch 7, batch 250, loss[loss=0.2957, simple_loss=0.3542, pruned_loss=0.1186, over 17200.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3158, pruned_loss=0.08104, over 3444758.57 frames. ], batch size: 176, lr: 2.86e-02, grad_scale: 32.0 2023-12-04 00:31:39,182 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=5.71 vs. limit=12.0 2023-12-04 00:31:41,077 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=37600.0, ans=0.0 2023-12-04 00:31:47,499 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 00:31:54,989 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=37666.666666666664, ans=0.125 2023-12-04 00:32:09,617 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=37733.333333333336, ans=0.0 2023-12-04 00:32:10,851 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=37800.0, ans=0.125 2023-12-04 00:32:11,624 INFO [train.py:1087] (2/4) Epoch 7, batch 300, loss[loss=0.2387, simple_loss=0.3125, pruned_loss=0.08244, over 24541.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.316, pruned_loss=0.08094, over 3744405.29 frames. ], batch size: 63, lr: 2.85e-02, grad_scale: 32.0 2023-12-04 00:32:16,244 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=37800.0, ans=0.1 2023-12-04 00:32:40,310 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=37933.333333333336, ans=0.0 2023-12-04 00:32:42,448 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=37933.333333333336, ans=0.002623188405797101 2023-12-04 00:32:54,965 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.970e+02 2.253e+02 2.674e+02 4.249e+02, threshold=4.505e+02, percent-clipped=0.0 2023-12-04 00:33:03,533 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.88 vs. limit=15.0 2023-12-04 00:33:05,988 INFO [train.py:1087] (2/4) Epoch 7, batch 350, loss[loss=0.2813, simple_loss=0.3492, pruned_loss=0.1067, over 21488.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3162, pruned_loss=0.08123, over 3972684.29 frames. ], batch size: 128, lr: 2.85e-02, grad_scale: 32.0 2023-12-04 00:33:06,779 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.14 vs. 
limit=15.0 2023-12-04 00:33:17,453 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=38200.0, ans=0.125 2023-12-04 00:33:18,016 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.10 vs. limit=10.0 2023-12-04 00:33:21,670 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=38200.0, ans=0.2 2023-12-04 00:33:53,080 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-12-04 00:34:00,825 INFO [train.py:1087] (2/4) Epoch 7, batch 400, loss[loss=0.221, simple_loss=0.3057, pruned_loss=0.0682, over 24764.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3156, pruned_loss=0.08074, over 4150569.99 frames. ], batch size: 70, lr: 2.84e-02, grad_scale: 32.0 2023-12-04 00:34:15,822 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=38533.333333333336, ans=0.1 2023-12-04 00:34:22,621 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.72 vs. limit=10.0 2023-12-04 00:34:27,389 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=38600.0, ans=0.125 2023-12-04 00:34:35,241 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=38666.666666666664, ans=0.125 2023-12-04 00:34:45,276 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 2.109e+02 2.328e+02 2.740e+02 4.257e+02, threshold=4.655e+02, percent-clipped=0.0 2023-12-04 00:34:47,650 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=38733.333333333336, ans=0.0 2023-12-04 00:34:49,855 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=38733.333333333336, ans=0.125 2023-12-04 00:34:54,048 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=38733.333333333336, ans=0.1 2023-12-04 00:34:55,197 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=38800.0, ans=0.04949747468305833 2023-12-04 00:34:55,956 INFO [train.py:1087] (2/4) Epoch 7, batch 450, loss[loss=0.2409, simple_loss=0.3166, pruned_loss=0.08256, over 24573.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.314, pruned_loss=0.07968, over 4301170.51 frames. ], batch size: 64, lr: 2.84e-02, grad_scale: 32.0 2023-12-04 00:35:04,801 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=38800.0, ans=0.95 2023-12-04 00:35:12,148 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.81 vs. 
limit=6.0 2023-12-04 00:35:14,006 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=38866.666666666664, ans=0.125 2023-12-04 00:35:23,572 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=38933.333333333336, ans=0.1 2023-12-04 00:35:27,872 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=39000.0, ans=0.0 2023-12-04 00:35:35,368 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 00:35:51,173 INFO [train.py:1087] (2/4) Epoch 7, batch 500, loss[loss=0.2338, simple_loss=0.3149, pruned_loss=0.07641, over 24753.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3144, pruned_loss=0.08014, over 4403679.13 frames. ], batch size: 66, lr: 2.83e-02, grad_scale: 32.0 2023-12-04 00:35:52,628 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=39133.333333333336, ans=0.05 2023-12-04 00:35:53,523 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=39133.333333333336, ans=0.0023623188405797095 2023-12-04 00:35:54,677 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=39133.333333333336, ans=0.0023623188405797095 2023-12-04 00:36:03,476 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=5.64 vs. limit=12.0 2023-12-04 00:36:08,397 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=39200.0, ans=0.2 2023-12-04 00:36:21,707 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=39266.666666666664, ans=0.125 2023-12-04 00:36:30,376 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.68 vs. limit=15.0 2023-12-04 00:36:33,286 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=39333.333333333336, ans=0.125 2023-12-04 00:36:35,206 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.950e+02 2.350e+02 2.840e+02 4.583e+02, threshold=4.699e+02, percent-clipped=0.0 2023-12-04 00:36:42,694 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=39400.0, ans=0.0 2023-12-04 00:36:46,545 INFO [train.py:1087] (2/4) Epoch 7, batch 550, loss[loss=0.222, simple_loss=0.3052, pruned_loss=0.06942, over 24791.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3141, pruned_loss=0.07994, over 4498338.10 frames. ], batch size: 71, lr: 2.83e-02, grad_scale: 32.0 2023-12-04 00:36:59,962 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=39533.333333333336, ans=0.125 2023-12-04 00:37:01,517 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.68 vs. limit=22.5 2023-12-04 00:37:08,928 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.45 vs. 
limit=22.5 2023-12-04 00:37:10,797 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=39600.0, ans=0.1 2023-12-04 00:37:23,636 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=39666.666666666664, ans=0.125 2023-12-04 00:37:24,805 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=39666.666666666664, ans=0.2 2023-12-04 00:37:30,032 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=39733.333333333336, ans=0.125 2023-12-04 00:37:31,445 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.45 vs. limit=22.5 2023-12-04 00:37:40,817 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=39800.0, ans=0.0 2023-12-04 00:37:41,618 INFO [train.py:1087] (2/4) Epoch 7, batch 600, loss[loss=0.2364, simple_loss=0.3158, pruned_loss=0.07852, over 24317.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3138, pruned_loss=0.07946, over 4577428.45 frames. ], batch size: 79, lr: 2.83e-02, grad_scale: 32.0 2023-12-04 00:37:49,047 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=39800.0, ans=0.125 2023-12-04 00:37:53,408 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=39866.666666666664, ans=0.0022028985507246386 2023-12-04 00:38:02,248 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=39866.666666666664, ans=0.1 2023-12-04 00:38:15,304 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=40000.0, ans=0.1 2023-12-04 00:38:21,513 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=40000.0, ans=0.1 2023-12-04 00:38:26,563 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 2.019e+02 2.331e+02 2.930e+02 4.659e+02, threshold=4.662e+02, percent-clipped=0.0 2023-12-04 00:38:37,664 INFO [train.py:1087] (2/4) Epoch 7, batch 650, loss[loss=0.2532, simple_loss=0.3271, pruned_loss=0.08966, over 24794.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3132, pruned_loss=0.07899, over 4626591.79 frames. ], batch size: 73, lr: 2.82e-02, grad_scale: 32.0 2023-12-04 00:38:39,074 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=40133.333333333336, ans=0.125 2023-12-04 00:39:04,027 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=40266.666666666664, ans=0.125 2023-12-04 00:39:29,904 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=40400.0, ans=0.125 2023-12-04 00:39:34,254 INFO [train.py:1087] (2/4) Epoch 7, batch 700, loss[loss=0.2221, simple_loss=0.3031, pruned_loss=0.07059, over 24796.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3128, pruned_loss=0.07884, over 4665977.66 frames. 
], batch size: 71, lr: 2.82e-02, grad_scale: 32.0 2023-12-04 00:39:53,830 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=40533.333333333336, ans=0.0 2023-12-04 00:40:04,420 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=40600.0, ans=0.2 2023-12-04 00:40:18,499 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.925e+02 2.239e+02 2.672e+02 4.779e+02, threshold=4.479e+02, percent-clipped=1.0 2023-12-04 00:40:22,916 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.57 vs. limit=15.0 2023-12-04 00:40:30,505 INFO [train.py:1087] (2/4) Epoch 7, batch 750, loss[loss=0.2199, simple_loss=0.2988, pruned_loss=0.07049, over 24783.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3123, pruned_loss=0.07825, over 4705144.53 frames. ], batch size: 73, lr: 2.81e-02, grad_scale: 32.0 2023-12-04 00:40:32,940 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=40800.0, ans=0.125 2023-12-04 00:40:40,674 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=40866.666666666664, ans=0.1 2023-12-04 00:40:41,742 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=40866.666666666664, ans=0.0 2023-12-04 00:41:07,288 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=41000.0, ans=0.001956521739130435 2023-12-04 00:41:09,526 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 00:41:26,405 INFO [train.py:1087] (2/4) Epoch 7, batch 800, loss[loss=0.2298, simple_loss=0.3159, pruned_loss=0.07184, over 24601.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3113, pruned_loss=0.07725, over 4749403.19 frames. ], batch size: 68, lr: 2.81e-02, grad_scale: 32.0 2023-12-04 00:41:26,583 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=41133.333333333336, ans=0.0 2023-12-04 00:41:30,979 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=41133.333333333336, ans=0.125 2023-12-04 00:42:07,420 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.550e+02 2.055e+02 2.330e+02 2.823e+02 4.114e+02, threshold=4.659e+02, percent-clipped=0.0 2023-12-04 00:42:11,631 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=41400.0, ans=0.0 2023-12-04 00:42:15,906 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=41400.0, ans=0.125 2023-12-04 00:42:17,743 INFO [train.py:1087] (2/4) Epoch 7, batch 850, loss[loss=0.2315, simple_loss=0.3081, pruned_loss=0.07743, over 24522.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3116, pruned_loss=0.07779, over 4755133.39 frames. 
], batch size: 75, lr: 2.80e-02, grad_scale: 32.0 2023-12-04 00:42:25,873 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=41466.666666666664, ans=0.2 2023-12-04 00:42:44,885 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=41600.0, ans=0.035 2023-12-04 00:42:45,031 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 00:42:55,415 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=41666.666666666664, ans=0.0018115942028985518 2023-12-04 00:42:58,292 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=41733.333333333336, ans=0.0 2023-12-04 00:42:58,686 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.60 vs. limit=12.0 2023-12-04 00:42:59,364 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=41733.333333333336, ans=0.125 2023-12-04 00:43:20,715 INFO [train.py:1087] (2/4) Epoch 8, batch 0, loss[loss=0.1996, simple_loss=0.2879, pruned_loss=0.05562, over 24796.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2879, pruned_loss=0.05562, over 24796.00 frames. ], batch size: 71, lr: 2.64e-02, grad_scale: 32.0 2023-12-04 00:43:20,716 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 00:43:32,846 INFO [train.py:1119] (2/4) Epoch 8, validation: loss=0.1925, simple_loss=0.2931, pruned_loss=0.04594, over 944034.00 frames. 2023-12-04 00:43:32,847 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 00:44:03,706 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.19 vs. limit=10.0 2023-12-04 00:44:15,164 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=42033.333333333336, ans=0.2 2023-12-04 00:44:22,024 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.593e+02 1.906e+02 2.199e+02 2.514e+02 5.001e+02, threshold=4.398e+02, percent-clipped=1.0 2023-12-04 00:44:25,491 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=42033.333333333336, ans=0.0017318840579710146 2023-12-04 00:44:27,267 INFO [train.py:1087] (2/4) Epoch 8, batch 50, loss[loss=0.2098, simple_loss=0.2925, pruned_loss=0.06353, over 24837.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3101, pruned_loss=0.07578, over 1095221.00 frames. ], batch size: 68, lr: 2.63e-02, grad_scale: 32.0 2023-12-04 00:44:32,923 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=42100.0, ans=0.125 2023-12-04 00:44:51,768 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.10 vs. limit=15.0 2023-12-04 00:45:07,184 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.55 vs. limit=15.0 2023-12-04 00:45:21,889 INFO [train.py:1087] (2/4) Epoch 8, batch 100, loss[loss=0.229, simple_loss=0.3067, pruned_loss=0.07564, over 24710.00 frames. 
], tot_loss[loss=0.2301, simple_loss=0.3091, pruned_loss=0.07559, over 1917721.73 frames. ], batch size: 74, lr: 2.63e-02, grad_scale: 32.0 2023-12-04 00:45:31,519 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=42433.333333333336, ans=0.125 2023-12-04 00:46:11,435 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=42700.0, ans=0.0 2023-12-04 00:46:13,386 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.850e+02 2.203e+02 2.662e+02 5.574e+02, threshold=4.406e+02, percent-clipped=2.0 2023-12-04 00:46:17,670 INFO [train.py:1087] (2/4) Epoch 8, batch 150, loss[loss=0.2273, simple_loss=0.3074, pruned_loss=0.07367, over 24819.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3088, pruned_loss=0.07544, over 2558013.86 frames. ], batch size: 73, lr: 2.62e-02, grad_scale: 16.0 2023-12-04 00:46:29,692 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=42833.333333333336, ans=0.125 2023-12-04 00:46:36,599 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=42833.333333333336, ans=0.125 2023-12-04 00:46:46,157 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=42900.0, ans=0.2 2023-12-04 00:46:47,075 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=42900.0, ans=0.125 2023-12-04 00:47:13,942 INFO [train.py:1087] (2/4) Epoch 8, batch 200, loss[loss=0.2382, simple_loss=0.3138, pruned_loss=0.08132, over 24853.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3092, pruned_loss=0.0759, over 3050718.34 frames. ], batch size: 68, lr: 2.62e-02, grad_scale: 16.0 2023-12-04 00:47:14,139 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=43100.0, ans=0.0014999999999999996 2023-12-04 00:47:36,271 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=43233.333333333336, ans=0.0 2023-12-04 00:47:36,407 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=43233.333333333336, ans=0.0014710144927536223 2023-12-04 00:47:40,977 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=43233.333333333336, ans=0.0 2023-12-04 00:47:49,562 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=43300.0, ans=0.125 2023-12-04 00:47:52,885 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.13 vs. limit=15.0 2023-12-04 00:48:00,970 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=43366.666666666664, ans=0.1 2023-12-04 00:48:04,957 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.325e+02 1.946e+02 2.243e+02 2.682e+02 4.927e+02, threshold=4.486e+02, percent-clipped=1.0 2023-12-04 00:48:09,310 INFO [train.py:1087] (2/4) Epoch 8, batch 250, loss[loss=0.2375, simple_loss=0.3163, pruned_loss=0.07935, over 24556.00 frames. 
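The `tot_loss[...] over N frames` figures above are not single-batch numbers: the frame count keeps growing through the epoch (1917721 at batch 100, 2558013 at batch 150, 3050718 at batch 200), and the fractional frame counts hint that older batches are down-weighted rather than summed outright. A minimal sketch of such a decayed, frame-weighted running average is below; the class name `RunningLoss` and the 1/200 decay constant are assumptions for illustration, not the tracker used in train.py.

```python
class RunningLoss:
    """Frame-weighted running average of the training loss.

    Each batch contributes its summed loss and its frame count; both
    accumulators decay by (1 - 1/decay_batches) per step, so the reported
    tot_loss reflects roughly the last `decay_batches` batches. The decay
    constant is an assumption made for this sketch.
    """

    def __init__(self, decay_batches: int = 200):
        self.decay = 1.0 - 1.0 / decay_batches
        self.loss_sum = 0.0
        self.frame_sum = 0.0

    def update(self, batch_loss_per_frame: float, num_frames: float) -> None:
        self.loss_sum = self.loss_sum * self.decay + batch_loss_per_frame * num_frames
        self.frame_sum = self.frame_sum * self.decay + num_frames

    @property
    def tot_loss(self) -> float:
        return self.loss_sum / max(self.frame_sum, 1.0)


if __name__ == "__main__":
    tracker = RunningLoss()
    # Fake per-batch losses and frame counts, just to exercise the tracker.
    for step, (loss, frames) in enumerate([(0.25, 24500), (0.23, 24800), (0.24, 23900)]):
        tracker.update(loss, frames)
        print(f"batch {step}: loss={loss:.3f}, tot_loss={tracker.tot_loss:.4f} "
              f"over {tracker.frame_sum:.2f} frames")
```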
], tot_loss[loss=0.2305, simple_loss=0.3092, pruned_loss=0.07593, over 3433787.07 frames. ], batch size: 63, lr: 2.61e-02, grad_scale: 16.0 2023-12-04 00:48:39,402 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=43566.666666666664, ans=0.125 2023-12-04 00:49:03,113 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=43700.0, ans=0.05 2023-12-04 00:49:05,044 INFO [train.py:1087] (2/4) Epoch 8, batch 300, loss[loss=0.2177, simple_loss=0.3018, pruned_loss=0.06682, over 24706.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3084, pruned_loss=0.07493, over 3757509.00 frames. ], batch size: 69, lr: 2.61e-02, grad_scale: 16.0 2023-12-04 00:49:10,662 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=43766.666666666664, ans=0.125 2023-12-04 00:49:47,523 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=43966.666666666664, ans=0.2 2023-12-04 00:49:53,169 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=44033.333333333336, ans=0.125 2023-12-04 00:49:56,102 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.870e+02 2.164e+02 2.569e+02 7.767e+02, threshold=4.327e+02, percent-clipped=1.0 2023-12-04 00:50:00,874 INFO [train.py:1087] (2/4) Epoch 8, batch 350, loss[loss=0.2071, simple_loss=0.2935, pruned_loss=0.06031, over 24732.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3085, pruned_loss=0.07467, over 3999737.79 frames. ], batch size: 63, lr: 2.61e-02, grad_scale: 16.0 2023-12-04 00:50:01,485 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 00:50:37,420 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=44300.0, ans=0.0 2023-12-04 00:50:44,763 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=44366.666666666664, ans=0.125 2023-12-04 00:50:48,490 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.99 vs. limit=15.0 2023-12-04 00:50:51,261 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=44366.666666666664, ans=0.125 2023-12-04 00:50:56,837 INFO [train.py:1087] (2/4) Epoch 8, batch 400, loss[loss=0.2227, simple_loss=0.3054, pruned_loss=0.07004, over 24418.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3082, pruned_loss=0.07476, over 4186806.46 frames. ], batch size: 77, lr: 2.60e-02, grad_scale: 32.0 2023-12-04 00:51:07,975 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.05 vs. 
limit=6.0 2023-12-04 00:51:08,462 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=44500.0, ans=0.0011956521739130439 2023-12-04 00:51:22,717 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=44566.666666666664, ans=0.125 2023-12-04 00:51:22,728 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=44566.666666666664, ans=0.001181159420289856 2023-12-04 00:51:23,119 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=12.0 2023-12-04 00:51:42,317 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=44700.0, ans=0.1 2023-12-04 00:51:48,401 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.929e+02 2.224e+02 2.693e+02 3.819e+02, threshold=4.448e+02, percent-clipped=0.0 2023-12-04 00:51:52,692 INFO [train.py:1087] (2/4) Epoch 8, batch 450, loss[loss=0.2381, simple_loss=0.3133, pruned_loss=0.08149, over 24591.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.308, pruned_loss=0.0747, over 4327693.81 frames. ], batch size: 64, lr: 2.60e-02, grad_scale: 32.0 2023-12-04 00:52:19,103 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=44900.0, ans=0.0 2023-12-04 00:52:21,259 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=44900.0, ans=0.125 2023-12-04 00:52:27,714 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=44966.666666666664, ans=0.09899494936611666 2023-12-04 00:52:43,084 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.74 vs. limit=15.0 2023-12-04 00:52:48,933 INFO [train.py:1087] (2/4) Epoch 8, batch 500, loss[loss=0.2114, simple_loss=0.296, pruned_loss=0.06338, over 24556.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3079, pruned_loss=0.07489, over 4424848.78 frames. ], batch size: 66, lr: 2.59e-02, grad_scale: 32.0 2023-12-04 00:52:55,570 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=45100.0, ans=0.125 2023-12-04 00:53:09,901 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=45233.333333333336, ans=0.125 2023-12-04 00:53:22,995 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=45300.0, ans=0.2 2023-12-04 00:53:39,030 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.796e+02 2.227e+02 2.542e+02 4.822e+02, threshold=4.453e+02, percent-clipped=1.0 2023-12-04 00:53:44,483 INFO [train.py:1087] (2/4) Epoch 8, batch 550, loss[loss=0.2358, simple_loss=0.3096, pruned_loss=0.08101, over 24806.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3075, pruned_loss=0.07454, over 4514584.21 frames. 
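The recurring `Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=...` entries summarize adaptive gradient clipping: the five values read like a min/25%/median/75%/max summary of recently observed gradient norms, and in every such entry above the threshold is about `Clipping_scale` times the middle value (for instance 2.0 × 2.224e+02 = 4.448e+02 in the entry just above). The sketch below implements that median-based rule as an illustration only; it is not the optimizer code behind optim.py, and the warm-up count and window size are assumptions.

```python
from collections import deque

import numpy as np
import torch


def clip_by_running_median(parameters, recent_norms: deque,
                           clipping_scale: float = 2.0) -> float:
    """Clip gradients to clipping_scale * median of recently seen grad norms.

    `recent_norms` is a deque (e.g. deque(maxlen=128)) kept by the caller
    across steps. Returns the total grad norm before clipping. Illustrative
    sketch only; the 10-step warm-up before clipping is an assumption.
    """
    grads = [p.grad.detach() for p in parameters if p.grad is not None]
    total_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2).item()

    if len(recent_norms) >= 10:
        quartiles = np.quantile(list(recent_norms), [0.0, 0.25, 0.5, 0.75, 1.0])
        threshold = clipping_scale * quartiles[2]
        if total_norm > threshold:
            scale = threshold / (total_norm + 1e-20)
            for g in grads:
                g.mul_(scale)          # in-place scaling of the gradients

    recent_norms.append(total_norm)
    return total_norm


if __name__ == "__main__":
    model = torch.nn.Linear(8, 8)
    norms = deque(maxlen=128)
    for _ in range(20):
        model.zero_grad()
        model(torch.randn(4, 8)).sum().backward()
        clip_by_running_median(model.parameters(), norms)
```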
], batch size: 72, lr: 2.59e-02, grad_scale: 32.0 2023-12-04 00:53:53,613 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=45433.333333333336, ans=0.04949747468305833 2023-12-04 00:54:14,159 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=45566.666666666664, ans=0.125 2023-12-04 00:54:36,603 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.44 vs. limit=10.0 2023-12-04 00:54:38,499 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=45766.666666666664, ans=0.125 2023-12-04 00:54:39,354 INFO [train.py:1087] (2/4) Epoch 8, batch 600, loss[loss=0.2391, simple_loss=0.3164, pruned_loss=0.08091, over 24784.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3071, pruned_loss=0.0741, over 4589302.27 frames. ], batch size: 73, lr: 2.58e-02, grad_scale: 32.0 2023-12-04 00:55:07,759 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=45900.0, ans=0.0008913043478260864 2023-12-04 00:55:31,382 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.959e+02 2.312e+02 2.796e+02 4.187e+02, threshold=4.624e+02, percent-clipped=0.0 2023-12-04 00:55:35,667 INFO [train.py:1087] (2/4) Epoch 8, batch 650, loss[loss=0.2117, simple_loss=0.2966, pruned_loss=0.06339, over 24742.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3072, pruned_loss=0.07428, over 4637804.48 frames. ], batch size: 63, lr: 2.58e-02, grad_scale: 16.0 2023-12-04 00:55:43,639 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.80 vs. limit=6.0 2023-12-04 00:55:57,142 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=15.0 2023-12-04 00:56:09,812 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=6.18 vs. limit=12.0 2023-12-04 00:56:10,569 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=46300.0, ans=0.1 2023-12-04 00:56:32,274 INFO [train.py:1087] (2/4) Epoch 8, batch 700, loss[loss=0.2244, simple_loss=0.3051, pruned_loss=0.07181, over 24305.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3071, pruned_loss=0.07399, over 4684005.85 frames. ], batch size: 79, lr: 2.58e-02, grad_scale: 16.0 2023-12-04 00:56:52,261 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=46500.0, ans=0.125 2023-12-04 00:56:58,058 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=46566.666666666664, ans=0.125 2023-12-04 00:57:04,818 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=46633.333333333336, ans=0.125 2023-12-04 00:57:05,180 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.90 vs. 
limit=15.0 2023-12-04 00:57:24,715 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.828e+02 2.074e+02 2.357e+02 1.042e+03, threshold=4.149e+02, percent-clipped=1.0 2023-12-04 00:57:26,411 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.66 vs. limit=6.0 2023-12-04 00:57:27,945 INFO [train.py:1087] (2/4) Epoch 8, batch 750, loss[loss=0.2088, simple_loss=0.2899, pruned_loss=0.06386, over 24544.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.307, pruned_loss=0.07377, over 4716267.60 frames. ], batch size: 62, lr: 2.57e-02, grad_scale: 16.0 2023-12-04 00:58:02,944 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.35 vs. limit=12.0 2023-12-04 00:58:03,775 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=46966.666666666664, ans=0.125 2023-12-04 00:58:09,985 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=46966.666666666664, ans=15.0 2023-12-04 00:58:23,760 INFO [train.py:1087] (2/4) Epoch 8, batch 800, loss[loss=0.2272, simple_loss=0.3089, pruned_loss=0.07276, over 24855.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3071, pruned_loss=0.07394, over 4731837.04 frames. ], batch size: 68, lr: 2.57e-02, grad_scale: 32.0 2023-12-04 00:59:09,894 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=47366.666666666664, ans=0.125 2023-12-04 00:59:11,665 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.773e+02 2.012e+02 2.288e+02 3.220e+02, threshold=4.025e+02, percent-clipped=0.0 2023-12-04 00:59:14,706 INFO [train.py:1087] (2/4) Epoch 8, batch 850, loss[loss=0.2422, simple_loss=0.3209, pruned_loss=0.08174, over 24291.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3067, pruned_loss=0.07376, over 4747063.05 frames. ], batch size: 79, lr: 2.56e-02, grad_scale: 32.0 2023-12-04 00:59:15,870 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=47433.333333333336, ans=0.0005579710144927533 2023-12-04 00:59:18,352 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.98 vs. limit=22.5 2023-12-04 00:59:20,802 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=47433.333333333336, ans=0.125 2023-12-04 00:59:42,978 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=47566.666666666664, ans=0.09899494936611666 2023-12-04 00:59:47,204 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=47633.333333333336, ans=0.125 2023-12-04 01:00:16,323 INFO [train.py:1087] (2/4) Epoch 9, batch 0, loss[loss=0.2089, simple_loss=0.2979, pruned_loss=0.06, over 24786.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2979, pruned_loss=0.06, over 24786.00 frames. 
], batch size: 71, lr: 2.42e-02, grad_scale: 32.0 2023-12-04 01:00:16,323 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 01:00:28,728 INFO [train.py:1119] (2/4) Epoch 9, validation: loss=0.1852, simple_loss=0.2876, pruned_loss=0.04143, over 944034.00 frames. 2023-12-04 01:00:28,729 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 01:00:45,521 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=47800.0, ans=0.125 2023-12-04 01:00:49,464 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=47800.0, ans=0.125 2023-12-04 01:01:00,121 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=47866.666666666664, ans=0.125 2023-12-04 01:01:02,143 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=47933.333333333336, ans=0.0004492753623188406 2023-12-04 01:01:13,703 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=48000.0, ans=0.125 2023-12-04 01:01:24,225 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=48066.666666666664, ans=0.000420289855072465 2023-12-04 01:01:25,137 INFO [train.py:1087] (2/4) Epoch 9, batch 50, loss[loss=0.2239, simple_loss=0.304, pruned_loss=0.07194, over 20942.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3056, pruned_loss=0.07183, over 1083812.82 frames. ], batch size: 50, lr: 2.42e-02, grad_scale: 32.0 2023-12-04 01:01:27,192 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.995e+02 2.275e+02 2.569e+02 4.783e+02, threshold=4.550e+02, percent-clipped=4.0 2023-12-04 01:01:32,619 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=48066.666666666664, ans=0.2 2023-12-04 01:01:55,404 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=48200.0, ans=0.0 2023-12-04 01:02:15,798 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.77 vs. limit=12.0 2023-12-04 01:02:16,583 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=48333.333333333336, ans=0.125 2023-12-04 01:02:20,330 INFO [train.py:1087] (2/4) Epoch 9, batch 100, loss[loss=0.2029, simple_loss=0.2923, pruned_loss=0.05673, over 24792.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3038, pruned_loss=0.07086, over 1927840.32 frames. 
], batch size: 73, lr: 2.41e-02, grad_scale: 32.0 2023-12-04 01:02:43,403 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=48533.333333333336, ans=0.1 2023-12-04 01:02:49,646 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=48533.333333333336, ans=0.125 2023-12-04 01:02:54,763 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=48600.0, ans=0.125 2023-12-04 01:02:58,191 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=48600.0, ans=0.1 2023-12-04 01:03:03,591 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=48666.666666666664, ans=0.125 2023-12-04 01:03:07,824 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=48666.666666666664, ans=0.125 2023-12-04 01:03:14,896 INFO [train.py:1087] (2/4) Epoch 9, batch 150, loss[loss=0.216, simple_loss=0.2986, pruned_loss=0.06668, over 24555.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3034, pruned_loss=0.07086, over 2570546.15 frames. ], batch size: 66, lr: 2.41e-02, grad_scale: 32.0 2023-12-04 01:03:17,017 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.784e+02 1.989e+02 2.181e+02 3.017e+02, threshold=3.978e+02, percent-clipped=0.0 2023-12-04 01:03:22,707 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=48733.333333333336, ans=0.125 2023-12-04 01:03:42,075 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=48866.666666666664, ans=0.125 2023-12-04 01:04:07,716 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=49000.0, ans=0.0 2023-12-04 01:04:10,641 INFO [train.py:1087] (2/4) Epoch 9, batch 200, loss[loss=0.219, simple_loss=0.3017, pruned_loss=0.06818, over 24494.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3039, pruned_loss=0.07119, over 3061720.84 frames. ], batch size: 75, lr: 2.41e-02, grad_scale: 32.0 2023-12-04 01:04:19,435 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=49066.666666666664, ans=0.00020289855072463947 2023-12-04 01:04:23,546 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=49133.333333333336, ans=0.0 2023-12-04 01:04:37,446 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=49200.0, ans=0.0 2023-12-04 01:04:40,627 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=49200.0, ans=0.0 2023-12-04 01:04:50,703 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-12-04 01:04:53,424 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=49333.333333333336, ans=0.0 2023-12-04 01:05:02,613 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.79 vs. 
limit=15.0 2023-12-04 01:05:06,622 INFO [train.py:1087] (2/4) Epoch 9, batch 250, loss[loss=0.2176, simple_loss=0.2971, pruned_loss=0.06907, over 24498.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3039, pruned_loss=0.07151, over 3450445.77 frames. ], batch size: 77, lr: 2.40e-02, grad_scale: 32.0 2023-12-04 01:05:08,695 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.304e+02 1.787e+02 2.020e+02 2.356e+02 3.922e+02, threshold=4.039e+02, percent-clipped=0.0 2023-12-04 01:05:29,674 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-04 01:05:37,078 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=49533.333333333336, ans=0.1 2023-12-04 01:05:41,939 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.90 vs. limit=22.5 2023-12-04 01:06:02,140 INFO [train.py:1087] (2/4) Epoch 9, batch 300, loss[loss=0.2083, simple_loss=0.2927, pruned_loss=0.06192, over 24774.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3037, pruned_loss=0.07139, over 3743054.47 frames. ], batch size: 73, lr: 2.40e-02, grad_scale: 32.0 2023-12-04 01:06:34,865 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=49933.333333333336, ans=0.125 2023-12-04 01:06:37,367 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=49933.333333333336, ans=0.125 2023-12-04 01:06:45,477 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.97 vs. limit=22.5 2023-12-04 01:06:47,066 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=50000.0, ans=0.1 2023-12-04 01:06:53,551 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=50000.0, ans=0.0 2023-12-04 01:06:53,572 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=50000.0, ans=0.125 2023-12-04 01:06:54,591 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=50000.0, ans=0.125 2023-12-04 01:06:55,606 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=50000.0, ans=0.1 2023-12-04 01:06:55,743 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 01:06:57,517 INFO [train.py:1087] (2/4) Epoch 9, batch 350, loss[loss=0.2321, simple_loss=0.3096, pruned_loss=0.07733, over 21375.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3034, pruned_loss=0.07128, over 3975385.74 frames. 
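The `Whitening: ... metric=X vs. limit=Y` entries compare, for each instrumented activation, a measure of how far its feature covariance is from a multiple of the identity against a scheduled limit; the comparison only matters when the metric exceeds the limit, as in the `metric=25.86 vs. limit=22.5` case earlier in the log. The metric sketched below (mean squared eigenvalue over squared mean eigenvalue) is one such measure that equals 1 for perfectly white features and grows as the spectrum becomes uneven; it is an illustrative stand-in and may differ from the exact definition in scaling.py.

```python
import torch


def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """How far the feature covariance of x (frames x channels) is from white.

    Returns a value >= 1 that equals 1 exactly when the covariance is a
    multiple of the identity. Illustrative definition only.
    """
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]                  # (channels, channels)
    mean_diag = cov.diagonal().mean()               # mean eigenvalue
    mean_sq = (cov @ cov).diagonal().mean()         # mean squared eigenvalue
    return mean_sq / (mean_diag ** 2 + 1e-20)


if __name__ == "__main__":
    torch.manual_seed(0)
    white = torch.randn(1000, 256)                   # roughly white features
    skewed = white * torch.linspace(0.1, 3.0, 256)   # very uneven channel variances
    limit = 15.0
    for name, feats in [("white", white), ("skewed", skewed)]:
        print(f"{name}: metric={whitening_metric(feats).item():.2f} vs. limit={limit}")
```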
], batch size: 127, lr: 2.39e-02, grad_scale: 32.0 2023-12-04 01:06:58,854 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=50066.666666666664, ans=0.125 2023-12-04 01:06:59,592 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.894e+02 2.166e+02 2.413e+02 4.274e+02, threshold=4.333e+02, percent-clipped=1.0 2023-12-04 01:07:07,490 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=50066.666666666664, ans=0.125 2023-12-04 01:07:22,730 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 01:07:23,841 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=50200.0, ans=0.0 2023-12-04 01:07:31,340 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=50266.666666666664, ans=0.125 2023-12-04 01:07:39,899 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=50266.666666666664, ans=0.125 2023-12-04 01:07:53,119 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=50400.0, ans=0.1 2023-12-04 01:07:53,830 INFO [train.py:1087] (2/4) Epoch 9, batch 400, loss[loss=0.2132, simple_loss=0.2982, pruned_loss=0.06409, over 24330.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.303, pruned_loss=0.07111, over 4157855.07 frames. ], batch size: 79, lr: 2.39e-02, grad_scale: 32.0 2023-12-04 01:08:01,686 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.54 vs. limit=15.0 2023-12-04 01:08:13,173 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=50466.666666666664, ans=0.125 2023-12-04 01:08:19,471 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=50533.333333333336, ans=0.125 2023-12-04 01:08:27,240 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=50600.0, ans=0.0 2023-12-04 01:08:30,818 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.62 vs. limit=15.0 2023-12-04 01:08:43,293 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=50666.666666666664, ans=0.125 2023-12-04 01:08:43,523 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.68 vs. limit=10.0 2023-12-04 01:08:49,401 INFO [train.py:1087] (2/4) Epoch 9, batch 450, loss[loss=0.2239, simple_loss=0.3033, pruned_loss=0.07229, over 24768.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3028, pruned_loss=0.07086, over 4316161.11 frames. 
], batch size: 73, lr: 2.39e-02, grad_scale: 32.0 2023-12-04 01:08:50,563 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=50733.333333333336, ans=0.125 2023-12-04 01:08:51,475 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.723e+02 1.991e+02 2.313e+02 3.429e+02, threshold=3.983e+02, percent-clipped=0.0 2023-12-04 01:09:09,935 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=50800.0, ans=0.0 2023-12-04 01:09:15,641 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=50866.666666666664, ans=0.1 2023-12-04 01:09:31,788 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=50933.333333333336, ans=0.0 2023-12-04 01:09:45,257 INFO [train.py:1087] (2/4) Epoch 9, batch 500, loss[loss=0.2083, simple_loss=0.288, pruned_loss=0.06428, over 24802.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3031, pruned_loss=0.07131, over 4406976.88 frames. ], batch size: 73, lr: 2.38e-02, grad_scale: 32.0 2023-12-04 01:09:46,793 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.21 vs. limit=15.0 2023-12-04 01:09:49,123 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.05 vs. limit=15.0 2023-12-04 01:09:57,214 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=51133.333333333336, ans=0.125 2023-12-04 01:10:00,463 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=51133.333333333336, ans=0.125 2023-12-04 01:10:40,040 INFO [train.py:1087] (2/4) Epoch 9, batch 550, loss[loss=0.2235, simple_loss=0.3043, pruned_loss=0.07134, over 24695.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3024, pruned_loss=0.07088, over 4507003.27 frames. ], batch size: 74, lr: 2.38e-02, grad_scale: 32.0 2023-12-04 01:10:42,510 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.303e+02 1.788e+02 2.021e+02 2.273e+02 3.294e+02, threshold=4.042e+02, percent-clipped=0.0 2023-12-04 01:10:50,636 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=51466.666666666664, ans=0.125 2023-12-04 01:11:00,072 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=51466.666666666664, ans=0.0 2023-12-04 01:11:12,406 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=5.71 vs. limit=12.0 2023-12-04 01:11:14,899 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.77 vs. limit=15.0 2023-12-04 01:11:19,210 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=51600.0, ans=0.125 2023-12-04 01:11:34,958 INFO [train.py:1087] (2/4) Epoch 9, batch 600, loss[loss=0.2036, simple_loss=0.2841, pruned_loss=0.06161, over 24710.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3016, pruned_loss=0.07017, over 4579781.32 frames. 
], batch size: 69, lr: 2.37e-02, grad_scale: 32.0 2023-12-04 01:11:38,510 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=51733.333333333336, ans=0.0 2023-12-04 01:11:48,405 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.98 vs. limit=15.0 2023-12-04 01:11:58,622 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=51866.666666666664, ans=0.2 2023-12-04 01:12:30,796 INFO [train.py:1087] (2/4) Epoch 9, batch 650, loss[loss=0.2373, simple_loss=0.3149, pruned_loss=0.0798, over 22654.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3022, pruned_loss=0.07073, over 4612539.80 frames. ], batch size: 106, lr: 2.37e-02, grad_scale: 32.0 2023-12-04 01:12:32,131 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=52066.666666666664, ans=0.05 2023-12-04 01:12:32,884 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.872e+02 2.101e+02 2.365e+02 3.683e+02, threshold=4.202e+02, percent-clipped=0.0 2023-12-04 01:12:33,140 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=52066.666666666664, ans=0.07 2023-12-04 01:12:34,263 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=52066.666666666664, ans=0.125 2023-12-04 01:12:35,343 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=52066.666666666664, ans=0.125 2023-12-04 01:12:36,467 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=52066.666666666664, ans=0.125 2023-12-04 01:12:44,272 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=52133.333333333336, ans=0.125 2023-12-04 01:12:55,596 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=52200.0, ans=0.125 2023-12-04 01:12:57,736 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=52200.0, ans=0.125 2023-12-04 01:13:21,877 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=52333.333333333336, ans=0.125 2023-12-04 01:13:26,271 INFO [train.py:1087] (2/4) Epoch 9, batch 700, loss[loss=0.2098, simple_loss=0.2921, pruned_loss=0.06375, over 24757.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.3013, pruned_loss=0.06972, over 4677150.48 frames. ], batch size: 65, lr: 2.37e-02, grad_scale: 32.0 2023-12-04 01:13:37,185 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=52466.666666666664, ans=0.125 2023-12-04 01:14:21,183 INFO [train.py:1087] (2/4) Epoch 9, batch 750, loss[loss=0.2083, simple_loss=0.2951, pruned_loss=0.06073, over 24555.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.3012, pruned_loss=0.06956, over 4715285.48 frames. 
], batch size: 62, lr: 2.36e-02, grad_scale: 32.0 2023-12-04 01:14:23,656 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.728e+02 1.920e+02 2.174e+02 3.625e+02, threshold=3.841e+02, percent-clipped=0.0 2023-12-04 01:14:45,459 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=52866.666666666664, ans=0.0 2023-12-04 01:14:46,523 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=52866.666666666664, ans=0.125 2023-12-04 01:15:09,710 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=53000.0, ans=0.125 2023-12-04 01:15:11,715 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=53000.0, ans=0.125 2023-12-04 01:15:15,727 INFO [train.py:1087] (2/4) Epoch 9, batch 800, loss[loss=0.2044, simple_loss=0.2837, pruned_loss=0.06257, over 24557.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.301, pruned_loss=0.06957, over 4729168.80 frames. ], batch size: 63, lr: 2.36e-02, grad_scale: 32.0 2023-12-04 01:15:45,740 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=53200.0, ans=0.0 2023-12-04 01:15:55,781 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=53266.666666666664, ans=0.125 2023-12-04 01:16:10,157 INFO [train.py:1087] (2/4) Epoch 9, batch 850, loss[loss=0.2361, simple_loss=0.3151, pruned_loss=0.07855, over 24772.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3016, pruned_loss=0.07007, over 4751245.31 frames. ], batch size: 70, lr: 2.36e-02, grad_scale: 16.0 2023-12-04 01:16:13,090 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.782e+02 1.992e+02 2.397e+02 4.275e+02, threshold=3.984e+02, percent-clipped=1.0 2023-12-04 01:16:33,429 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=53533.333333333336, ans=0.0 2023-12-04 01:17:12,464 INFO [train.py:1087] (2/4) Epoch 10, batch 0, loss[loss=0.2063, simple_loss=0.2964, pruned_loss=0.0581, over 24798.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2964, pruned_loss=0.0581, over 24798.00 frames. ], batch size: 72, lr: 2.24e-02, grad_scale: 32.0 2023-12-04 01:17:12,465 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 01:17:24,573 INFO [train.py:1119] (2/4) Epoch 10, validation: loss=0.1824, simple_loss=0.2846, pruned_loss=0.0401, over 944034.00 frames. 2023-12-04 01:17:24,574 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 01:17:36,404 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_na.min_abs, batch_count=53766.666666666664, ans=0.02 2023-12-04 01:17:42,964 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=53766.666666666664, ans=0.1 2023-12-04 01:18:09,866 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=53966.666666666664, ans=0.0 2023-12-04 01:18:20,177 INFO [train.py:1087] (2/4) Epoch 10, batch 50, loss[loss=0.2259, simple_loss=0.3079, pruned_loss=0.07199, over 24765.00 frames. 
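At the start of each epoch the trainer pauses to compute a validation loss over a fixed dev set (the same 944034 frames every time) and reports the peak CUDA memory observed so far. A bare-bones version of that step is sketched below; `model`, `criterion` and the dummy batches are placeholders, not the actual objects in train.py.

```python
import torch


@torch.no_grad()
def compute_validation_loss(model, criterion, valid_batches, device="cpu") -> float:
    """Frame-weighted validation loss over a fixed dev set (illustrative sketch)."""
    model.eval()
    loss_sum, frame_sum = 0.0, 0.0
    for feats, targets in valid_batches:            # placeholder batch layout
        feats, targets = feats.to(device), targets.to(device)
        logits = model(feats)                       # (batch, frames, classes)
        loss = criterion(logits.reshape(-1, logits.shape[-1]), targets.reshape(-1))
        num_frames = targets.numel()
        loss_sum += loss.item() * num_frames
        frame_sum += num_frames
    model.train()

    if torch.cuda.is_available() and torch.device(device).type == "cuda":
        peak_mb = torch.cuda.max_memory_allocated(device) // (1024 ** 2)
        print(f"Maximum memory allocated so far is {peak_mb}MB")
    return loss_sum / max(frame_sum, 1)


if __name__ == "__main__":
    # Dummy stand-ins for the real model and dev cuts.
    model = torch.nn.Linear(80, 500)
    criterion = torch.nn.CrossEntropyLoss()
    batches = [(torch.randn(8, 100, 80), torch.randint(0, 500, (8, 100)))
               for _ in range(3)]
    print(f"validation: loss={compute_validation_loss(model, criterion, batches):.4f}")
```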
], tot_loss[loss=0.2178, simple_loss=0.2999, pruned_loss=0.06787, over 1086450.22 frames. ], batch size: 65, lr: 2.23e-02, grad_scale: 32.0 2023-12-04 01:18:29,052 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.771e+02 2.016e+02 2.500e+02 4.270e+02, threshold=4.032e+02, percent-clipped=1.0 2023-12-04 01:18:37,916 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=54100.0, ans=0.125 2023-12-04 01:18:37,941 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=54100.0, ans=0.0 2023-12-04 01:18:41,159 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=54166.666666666664, ans=0.07 2023-12-04 01:18:41,165 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=54166.666666666664, ans=0.125 2023-12-04 01:18:45,436 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=54166.666666666664, ans=0.125 2023-12-04 01:19:15,726 INFO [train.py:1087] (2/4) Epoch 10, batch 100, loss[loss=0.2397, simple_loss=0.3165, pruned_loss=0.08151, over 24475.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.3013, pruned_loss=0.06919, over 1907842.80 frames. ], batch size: 77, lr: 2.23e-02, grad_scale: 32.0 2023-12-04 01:19:17,765 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.41 vs. limit=15.0 2023-12-04 01:19:26,849 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=54433.333333333336, ans=0.95 2023-12-04 01:19:29,893 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=54433.333333333336, ans=0.1 2023-12-04 01:19:38,818 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=54500.0, ans=0.125 2023-12-04 01:19:56,562 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=54566.666666666664, ans=10.0 2023-12-04 01:19:59,606 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=54633.333333333336, ans=0.125 2023-12-04 01:20:10,800 INFO [train.py:1087] (2/4) Epoch 10, batch 150, loss[loss=0.2271, simple_loss=0.305, pruned_loss=0.07465, over 24494.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3024, pruned_loss=0.07038, over 2526906.09 frames. ], batch size: 75, lr: 2.23e-02, grad_scale: 32.0 2023-12-04 01:20:20,273 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.308e+02 1.785e+02 2.047e+02 2.385e+02 4.094e+02, threshold=4.093e+02, percent-clipped=1.0 2023-12-04 01:20:24,906 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.88 vs. limit=22.5 2023-12-04 01:20:39,823 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=54833.333333333336, ans=0.125 2023-12-04 01:21:05,654 INFO [train.py:1087] (2/4) Epoch 10, batch 200, loss[loss=0.2094, simple_loss=0.2955, pruned_loss=0.06163, over 24689.00 frames. 
], tot_loss[loss=0.2198, simple_loss=0.3009, pruned_loss=0.06937, over 3039370.36 frames. ], batch size: 74, lr: 2.22e-02, grad_scale: 32.0 2023-12-04 01:21:09,374 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=55033.333333333336, ans=0.125 2023-12-04 01:21:09,472 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=55033.333333333336, ans=0.0 2023-12-04 01:21:19,812 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=55100.0, ans=0.125 2023-12-04 01:21:22,389 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=55100.0, ans=0.0 2023-12-04 01:21:25,488 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=55100.0, ans=0.125 2023-12-04 01:21:33,950 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=55166.666666666664, ans=0.125 2023-12-04 01:21:39,209 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=55233.333333333336, ans=0.125 2023-12-04 01:21:40,397 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 01:21:45,953 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=55233.333333333336, ans=0.1 2023-12-04 01:21:55,535 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=55300.0, ans=0.125 2023-12-04 01:22:01,732 INFO [train.py:1087] (2/4) Epoch 10, batch 250, loss[loss=0.2232, simple_loss=0.3018, pruned_loss=0.07229, over 24794.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.3004, pruned_loss=0.06918, over 3430944.76 frames. ], batch size: 71, lr: 2.22e-02, grad_scale: 32.0 2023-12-04 01:22:10,176 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.785e+02 2.066e+02 2.411e+02 3.389e+02, threshold=4.132e+02, percent-clipped=0.0 2023-12-04 01:22:12,517 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=55433.333333333336, ans=0.0 2023-12-04 01:22:32,618 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=55500.0, ans=10.0 2023-12-04 01:22:32,622 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=55500.0, ans=0.125 2023-12-04 01:22:36,898 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=55566.666666666664, ans=0.125 2023-12-04 01:22:57,210 INFO [train.py:1087] (2/4) Epoch 10, batch 300, loss[loss=0.2077, simple_loss=0.2932, pruned_loss=0.06114, over 24801.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.3003, pruned_loss=0.06936, over 3721981.99 frames. 
], batch size: 73, lr: 2.21e-02, grad_scale: 32.0 2023-12-04 01:22:58,588 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=55700.0, ans=0.0 2023-12-04 01:23:18,575 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=55833.333333333336, ans=0.1 2023-12-04 01:23:36,755 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=55900.0, ans=0.125 2023-12-04 01:23:37,832 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=55900.0, ans=0.125 2023-12-04 01:23:50,772 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=56033.333333333336, ans=0.125 2023-12-04 01:23:51,962 INFO [train.py:1087] (2/4) Epoch 10, batch 350, loss[loss=0.2407, simple_loss=0.3178, pruned_loss=0.08184, over 22726.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.3006, pruned_loss=0.06953, over 3947307.07 frames. ], batch size: 106, lr: 2.21e-02, grad_scale: 32.0 2023-12-04 01:24:00,756 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.739e+02 1.916e+02 2.181e+02 3.253e+02, threshold=3.831e+02, percent-clipped=0.0 2023-12-04 01:24:06,460 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=56100.0, ans=0.125 2023-12-04 01:24:09,604 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=56100.0, ans=0.125 2023-12-04 01:24:19,689 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.25 vs. limit=15.0 2023-12-04 01:24:20,323 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=56166.666666666664, ans=0.0 2023-12-04 01:24:46,835 INFO [train.py:1087] (2/4) Epoch 10, batch 400, loss[loss=0.2212, simple_loss=0.3019, pruned_loss=0.07019, over 24577.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.3003, pruned_loss=0.06933, over 4124638.46 frames. 
], batch size: 65, lr: 2.21e-02, grad_scale: 32.0 2023-12-04 01:25:00,236 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=56433.333333333336, ans=0.09899494936611666 2023-12-04 01:25:05,526 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=56433.333333333336, ans=0.0 2023-12-04 01:25:09,731 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=56500.0, ans=0.125 2023-12-04 01:25:11,699 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=56500.0, ans=0.035 2023-12-04 01:25:17,076 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=56500.0, ans=0.125 2023-12-04 01:25:33,170 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=56633.333333333336, ans=0.1 2023-12-04 01:25:36,339 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=56633.333333333336, ans=0.1 2023-12-04 01:25:38,384 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=56633.333333333336, ans=0.125 2023-12-04 01:25:42,326 INFO [train.py:1087] (2/4) Epoch 10, batch 450, loss[loss=0.216, simple_loss=0.2985, pruned_loss=0.0667, over 20873.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2995, pruned_loss=0.06871, over 4272444.87 frames. ], batch size: 50, lr: 2.20e-02, grad_scale: 32.0 2023-12-04 01:25:42,884 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.11 vs. limit=15.0 2023-12-04 01:25:47,166 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.00 vs. limit=15.0 2023-12-04 01:25:50,718 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.719e+02 1.900e+02 2.177e+02 5.504e+02, threshold=3.801e+02, percent-clipped=1.0 2023-12-04 01:26:22,008 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=56900.0, ans=0.2 2023-12-04 01:26:37,598 INFO [train.py:1087] (2/4) Epoch 10, batch 500, loss[loss=0.2408, simple_loss=0.3198, pruned_loss=0.0809, over 22815.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2996, pruned_loss=0.06871, over 4379458.70 frames. 
], batch size: 106, lr: 2.20e-02, grad_scale: 32.0 2023-12-04 01:26:57,029 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=57100.0, ans=0.125 2023-12-04 01:26:58,745 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=57166.666666666664, ans=0.2 2023-12-04 01:27:03,374 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=57166.666666666664, ans=0.1 2023-12-04 01:27:03,488 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=57166.666666666664, ans=0.2 2023-12-04 01:27:16,798 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.99 vs. limit=15.0 2023-12-04 01:27:28,932 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=12.0 2023-12-04 01:27:33,193 INFO [train.py:1087] (2/4) Epoch 10, batch 550, loss[loss=0.2392, simple_loss=0.317, pruned_loss=0.08074, over 24070.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2993, pruned_loss=0.06841, over 4471930.01 frames. ], batch size: 87, lr: 2.20e-02, grad_scale: 32.0 2023-12-04 01:27:34,690 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.31 vs. limit=15.0 2023-12-04 01:27:39,797 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=57366.666666666664, ans=0.125 2023-12-04 01:27:41,656 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.828e+02 1.979e+02 2.335e+02 3.865e+02, threshold=3.957e+02, percent-clipped=2.0 2023-12-04 01:27:47,304 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=57433.333333333336, ans=0.125 2023-12-04 01:27:51,850 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=57433.333333333336, ans=0.09899494936611666 2023-12-04 01:28:05,417 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=17.84 vs. limit=15.0 2023-12-04 01:28:08,146 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=57566.666666666664, ans=0.125 2023-12-04 01:28:10,154 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=57566.666666666664, ans=0.1 2023-12-04 01:28:28,433 INFO [train.py:1087] (2/4) Epoch 10, batch 600, loss[loss=0.2189, simple_loss=0.2968, pruned_loss=0.07049, over 24547.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2986, pruned_loss=0.06804, over 4554847.23 frames. 
], batch size: 62, lr: 2.19e-02, grad_scale: 32.0 2023-12-04 01:28:38,396 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=57700.0, ans=0.0 2023-12-04 01:28:41,684 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=57766.666666666664, ans=0.2 2023-12-04 01:29:09,822 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=57900.0, ans=0.0 2023-12-04 01:29:09,836 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=57900.0, ans=0.2 2023-12-04 01:29:24,554 INFO [train.py:1087] (2/4) Epoch 10, batch 650, loss[loss=0.2077, simple_loss=0.2964, pruned_loss=0.05953, over 24680.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2982, pruned_loss=0.06748, over 4623010.95 frames. ], batch size: 74, lr: 2.19e-02, grad_scale: 32.0 2023-12-04 01:29:24,786 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=58033.333333333336, ans=0.1 2023-12-04 01:29:31,530 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=58033.333333333336, ans=0.0 2023-12-04 01:29:33,305 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.737e+02 1.939e+02 2.220e+02 3.387e+02, threshold=3.878e+02, percent-clipped=0.0 2023-12-04 01:29:54,886 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=58166.666666666664, ans=0.125 2023-12-04 01:29:56,057 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.70 vs. limit=15.0 2023-12-04 01:29:57,915 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=58233.333333333336, ans=0.125 2023-12-04 01:30:20,098 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.67 vs. limit=6.0 2023-12-04 01:30:20,538 INFO [train.py:1087] (2/4) Epoch 10, batch 700, loss[loss=0.2036, simple_loss=0.291, pruned_loss=0.05813, over 24778.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2976, pruned_loss=0.06694, over 4672604.63 frames. 
], batch size: 70, lr: 2.19e-02, grad_scale: 32.0 2023-12-04 01:30:27,168 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=58366.666666666664, ans=0.125 2023-12-04 01:30:30,136 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=58433.333333333336, ans=0.0 2023-12-04 01:30:45,679 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=58500.0, ans=0.0 2023-12-04 01:30:56,413 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=58566.666666666664, ans=0.125 2023-12-04 01:31:01,835 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=58566.666666666664, ans=0.1 2023-12-04 01:31:15,468 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=58633.333333333336, ans=0.125 2023-12-04 01:31:17,316 INFO [train.py:1087] (2/4) Epoch 10, batch 750, loss[loss=0.2103, simple_loss=0.2953, pruned_loss=0.06267, over 24805.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2975, pruned_loss=0.06711, over 4687904.83 frames. ], batch size: 62, lr: 2.18e-02, grad_scale: 32.0 2023-12-04 01:31:17,676 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=58700.0, ans=0.0 2023-12-04 01:31:22,186 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.57 vs. limit=22.5 2023-12-04 01:31:25,742 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.642e+02 1.916e+02 2.183e+02 2.930e+02, threshold=3.832e+02, percent-clipped=0.0 2023-12-04 01:31:51,962 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=58900.0, ans=0.125 2023-12-04 01:32:11,835 INFO [train.py:1087] (2/4) Epoch 10, batch 800, loss[loss=0.2033, simple_loss=0.2874, pruned_loss=0.05956, over 24713.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2969, pruned_loss=0.06662, over 4724977.58 frames. ], batch size: 74, lr: 2.18e-02, grad_scale: 32.0 2023-12-04 01:32:41,433 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=59166.666666666664, ans=0.0 2023-12-04 01:33:03,321 INFO [train.py:1087] (2/4) Epoch 10, batch 850, loss[loss=0.2307, simple_loss=0.3045, pruned_loss=0.07842, over 22873.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2968, pruned_loss=0.06668, over 4734327.09 frames. ], batch size: 106, lr: 2.17e-02, grad_scale: 32.0 2023-12-04 01:33:09,967 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.80 vs. limit=15.0 2023-12-04 01:33:11,374 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.283e+02 1.702e+02 1.889e+02 2.161e+02 3.150e+02, threshold=3.778e+02, percent-clipped=0.0 2023-12-04 01:33:17,884 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.84 vs. 
limit=12.0 2023-12-04 01:33:22,696 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=59500.0, ans=0.0 2023-12-04 01:33:42,006 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=59566.666666666664, ans=0.125 2023-12-04 01:33:46,969 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=59633.333333333336, ans=0.0 2023-12-04 01:34:04,727 INFO [train.py:1087] (2/4) Epoch 11, batch 0, loss[loss=0.2049, simple_loss=0.2904, pruned_loss=0.05972, over 24750.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2904, pruned_loss=0.05972, over 24750.00 frames. ], batch size: 70, lr: 2.07e-02, grad_scale: 32.0 2023-12-04 01:34:04,728 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 01:34:16,791 INFO [train.py:1119] (2/4) Epoch 11, validation: loss=0.1777, simple_loss=0.28, pruned_loss=0.03772, over 944034.00 frames. 2023-12-04 01:34:16,792 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 01:34:22,343 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=59666.666666666664, ans=0.125 2023-12-04 01:34:40,456 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.83 vs. limit=15.0 2023-12-04 01:34:47,854 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=59800.0, ans=0.1 2023-12-04 01:34:48,959 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=59866.666666666664, ans=0.2 2023-12-04 01:34:50,057 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=59866.666666666664, ans=10.0 2023-12-04 01:35:12,536 INFO [train.py:1087] (2/4) Epoch 11, batch 50, loss[loss=0.2019, simple_loss=0.2822, pruned_loss=0.06076, over 24726.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2927, pruned_loss=0.06333, over 1100284.16 frames. ], batch size: 61, lr: 2.07e-02, grad_scale: 32.0 2023-12-04 01:35:18,049 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=60000.0, ans=0.125 2023-12-04 01:35:20,468 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=60000.0, ans=0.125 2023-12-04 01:35:21,544 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=60000.0, ans=0.0 2023-12-04 01:35:26,543 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.307e+02 1.726e+02 1.897e+02 2.276e+02 3.906e+02, threshold=3.795e+02, percent-clipped=1.0 2023-12-04 01:35:28,260 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.93 vs. 
limit=15.0 2023-12-04 01:35:32,130 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=60066.666666666664, ans=0.125 2023-12-04 01:35:35,277 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=60133.333333333336, ans=0.125 2023-12-04 01:35:44,878 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=60200.0, ans=0.125 2023-12-04 01:35:56,775 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=60266.666666666664, ans=0.02 2023-12-04 01:35:56,782 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=60266.666666666664, ans=0.125 2023-12-04 01:36:07,140 INFO [train.py:1087] (2/4) Epoch 11, batch 100, loss[loss=0.2127, simple_loss=0.2991, pruned_loss=0.06308, over 24792.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2945, pruned_loss=0.0639, over 1926596.73 frames. ], batch size: 62, lr: 2.07e-02, grad_scale: 32.0 2023-12-04 01:36:12,653 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=60333.333333333336, ans=0.2 2023-12-04 01:36:26,287 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=8.989e-03 2023-12-04 01:36:28,807 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=60466.666666666664, ans=0.125 2023-12-04 01:37:03,001 INFO [train.py:1087] (2/4) Epoch 11, batch 150, loss[loss=0.2362, simple_loss=0.3145, pruned_loss=0.07897, over 22934.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2953, pruned_loss=0.06488, over 2563716.03 frames. ], batch size: 106, lr: 2.06e-02, grad_scale: 32.0 2023-12-04 01:37:10,124 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.78 vs. limit=22.5 2023-12-04 01:37:17,694 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.734e+02 1.958e+02 2.233e+02 3.565e+02, threshold=3.917e+02, percent-clipped=0.0 2023-12-04 01:37:41,757 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=60866.666666666664, ans=0.125 2023-12-04 01:37:44,333 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=60866.666666666664, ans=0.0 2023-12-04 01:37:53,145 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=60933.333333333336, ans=0.2 2023-12-04 01:37:54,222 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=60933.333333333336, ans=0.0 2023-12-04 01:37:58,249 INFO [train.py:1087] (2/4) Epoch 11, batch 200, loss[loss=0.2245, simple_loss=0.31, pruned_loss=0.06944, over 24569.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2951, pruned_loss=0.06507, over 3058114.61 frames. ], batch size: 63, lr: 2.06e-02, grad_scale: 32.0 2023-12-04 01:38:08,838 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.30 vs. 
limit=10.0 2023-12-04 01:38:24,386 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=61133.333333333336, ans=0.125 2023-12-04 01:38:29,640 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=61133.333333333336, ans=0.0 2023-12-04 01:38:30,646 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=61200.0, ans=0.2 2023-12-04 01:38:31,731 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=61200.0, ans=0.125 2023-12-04 01:38:54,397 INFO [train.py:1087] (2/4) Epoch 11, batch 250, loss[loss=0.2181, simple_loss=0.2971, pruned_loss=0.06953, over 24480.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2953, pruned_loss=0.06526, over 3450273.26 frames. ], batch size: 75, lr: 2.06e-02, grad_scale: 32.0 2023-12-04 01:38:55,626 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=61333.333333333336, ans=0.1 2023-12-04 01:39:08,132 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.303e+02 1.716e+02 2.037e+02 2.457e+02 3.688e+02, threshold=4.073e+02, percent-clipped=0.0 2023-12-04 01:39:12,951 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=61400.0, ans=0.1 2023-12-04 01:39:19,493 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=61466.666666666664, ans=0.09899494936611666 2023-12-04 01:39:20,530 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=61466.666666666664, ans=10.0 2023-12-04 01:39:29,355 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=15.0 2023-12-04 01:39:47,261 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=61600.0, ans=0.125 2023-12-04 01:39:48,557 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=61666.666666666664, ans=0.0 2023-12-04 01:39:49,379 INFO [train.py:1087] (2/4) Epoch 11, batch 300, loss[loss=0.2312, simple_loss=0.3064, pruned_loss=0.07802, over 24613.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2951, pruned_loss=0.06494, over 3749380.10 frames. ], batch size: 68, lr: 2.05e-02, grad_scale: 32.0 2023-12-04 01:39:51,566 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=61666.666666666664, ans=0.1 2023-12-04 01:40:07,344 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=61733.333333333336, ans=0.0 2023-12-04 01:40:11,606 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=61800.0, ans=0.0 2023-12-04 01:40:21,508 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=61800.0, ans=0.125 2023-12-04 01:40:44,749 INFO [train.py:1087] (2/4) Epoch 11, batch 350, loss[loss=0.1993, simple_loss=0.2865, pruned_loss=0.05601, over 24697.00 frames. 
], tot_loss[loss=0.2135, simple_loss=0.2961, pruned_loss=0.0655, over 3969723.31 frames. ], batch size: 74, lr: 2.05e-02, grad_scale: 32.0 2023-12-04 01:40:58,829 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=62066.666666666664, ans=0.125 2023-12-04 01:40:59,579 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.779e+02 2.043e+02 2.515e+02 5.209e+02, threshold=4.086e+02, percent-clipped=2.0 2023-12-04 01:41:00,926 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=62066.666666666664, ans=0.125 2023-12-04 01:41:08,567 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=62133.333333333336, ans=0.125 2023-12-04 01:41:13,908 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=62133.333333333336, ans=0.2 2023-12-04 01:41:23,089 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=62200.0, ans=0.125 2023-12-04 01:41:40,177 INFO [train.py:1087] (2/4) Epoch 11, batch 400, loss[loss=0.2158, simple_loss=0.2956, pruned_loss=0.06805, over 24799.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.295, pruned_loss=0.06492, over 4157641.88 frames. ], batch size: 71, lr: 2.05e-02, grad_scale: 32.0 2023-12-04 01:41:44,745 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=62333.333333333336, ans=0.0 2023-12-04 01:41:52,571 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=62400.0, ans=0.07 2023-12-04 01:41:59,787 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=62400.0, ans=0.035 2023-12-04 01:42:22,219 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=62533.333333333336, ans=0.125 2023-12-04 01:42:27,342 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=62600.0, ans=0.125 2023-12-04 01:42:36,132 INFO [train.py:1087] (2/4) Epoch 11, batch 450, loss[loss=0.1946, simple_loss=0.2839, pruned_loss=0.05263, over 24568.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2941, pruned_loss=0.06406, over 4314462.88 frames. ], batch size: 65, lr: 2.04e-02, grad_scale: 32.0 2023-12-04 01:42:45,963 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=62733.333333333336, ans=0.05 2023-12-04 01:42:49,896 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.649e+02 1.867e+02 2.132e+02 3.251e+02, threshold=3.733e+02, percent-clipped=0.0 2023-12-04 01:42:58,483 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=62800.0, ans=0.125 2023-12-04 01:42:59,501 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=62800.0, ans=0.125 2023-12-04 01:43:31,098 INFO [train.py:1087] (2/4) Epoch 11, batch 500, loss[loss=0.2176, simple_loss=0.2997, pruned_loss=0.06774, over 24604.00 frames. ], tot_loss[loss=0.211, simple_loss=0.294, pruned_loss=0.06399, over 4422491.94 frames. 
], batch size: 68, lr: 2.04e-02, grad_scale: 32.0 2023-12-04 01:43:50,910 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=24.10 vs. limit=22.5 2023-12-04 01:44:25,529 INFO [train.py:1087] (2/4) Epoch 11, batch 550, loss[loss=0.1947, simple_loss=0.2781, pruned_loss=0.05564, over 24751.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2934, pruned_loss=0.06358, over 4519167.50 frames. ], batch size: 70, lr: 2.04e-02, grad_scale: 32.0 2023-12-04 01:44:29,179 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=63333.333333333336, ans=0.125 2023-12-04 01:44:30,643 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=63333.333333333336, ans=0.0 2023-12-04 01:44:34,141 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.48 vs. limit=10.0 2023-12-04 01:44:39,144 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=63400.0, ans=0.125 2023-12-04 01:44:39,893 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.319e+02 1.675e+02 1.832e+02 2.050e+02 3.249e+02, threshold=3.664e+02, percent-clipped=0.0 2023-12-04 01:44:40,188 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=63400.0, ans=0.125 2023-12-04 01:44:44,731 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=63400.0, ans=0.1 2023-12-04 01:44:47,968 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=63466.666666666664, ans=0.2 2023-12-04 01:44:48,930 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=63466.666666666664, ans=0.125 2023-12-04 01:45:02,793 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=63533.333333333336, ans=0.0 2023-12-04 01:45:08,380 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=63533.333333333336, ans=0.1 2023-12-04 01:45:20,375 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=63666.666666666664, ans=0.1 2023-12-04 01:45:21,181 INFO [train.py:1087] (2/4) Epoch 11, batch 600, loss[loss=0.1852, simple_loss=0.2715, pruned_loss=0.04944, over 24570.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2931, pruned_loss=0.06336, over 4599911.84 frames. ], batch size: 64, lr: 2.03e-02, grad_scale: 32.0 2023-12-04 01:45:40,619 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=63733.333333333336, ans=0.1 2023-12-04 01:45:44,200 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.78 vs. 
limit=15.0 2023-12-04 01:45:54,781 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=63866.666666666664, ans=0.0 2023-12-04 01:46:17,155 INFO [train.py:1087] (2/4) Epoch 11, batch 650, loss[loss=0.2037, simple_loss=0.2898, pruned_loss=0.05883, over 24782.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2933, pruned_loss=0.06354, over 4644712.58 frames. ], batch size: 71, lr: 2.03e-02, grad_scale: 32.0 2023-12-04 01:46:18,414 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=64000.0, ans=0.1 2023-12-04 01:46:31,349 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.714e+02 1.930e+02 2.208e+02 4.063e+02, threshold=3.860e+02, percent-clipped=1.0 2023-12-04 01:46:39,083 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=64133.333333333336, ans=0.0 2023-12-04 01:46:39,247 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=64133.333333333336, ans=0.125 2023-12-04 01:46:40,160 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=64133.333333333336, ans=0.125 2023-12-04 01:46:51,038 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.00 vs. limit=15.0 2023-12-04 01:46:59,504 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=64200.0, ans=0.125 2023-12-04 01:47:03,741 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=64266.666666666664, ans=0.125 2023-12-04 01:47:12,590 INFO [train.py:1087] (2/4) Epoch 11, batch 700, loss[loss=0.2035, simple_loss=0.296, pruned_loss=0.05548, over 24795.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2927, pruned_loss=0.06343, over 4680826.12 frames. ], batch size: 71, lr: 2.03e-02, grad_scale: 32.0 2023-12-04 01:47:19,493 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=15.0 2023-12-04 01:47:46,855 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=64533.333333333336, ans=0.125 2023-12-04 01:47:49,094 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=64533.333333333336, ans=0.125 2023-12-04 01:47:53,880 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.72 vs. limit=6.0 2023-12-04 01:47:57,445 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=64600.0, ans=0.1 2023-12-04 01:48:07,709 INFO [train.py:1087] (2/4) Epoch 11, batch 750, loss[loss=0.2114, simple_loss=0.2976, pruned_loss=0.06256, over 24582.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2926, pruned_loss=0.06312, over 4726573.72 frames. 
], batch size: 64, lr: 2.02e-02, grad_scale: 32.0 2023-12-04 01:48:12,658 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=64666.666666666664, ans=0.1 2023-12-04 01:48:21,833 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.352e+02 1.768e+02 1.995e+02 2.303e+02 4.532e+02, threshold=3.990e+02, percent-clipped=1.0 2023-12-04 01:48:39,384 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=64800.0, ans=0.125 2023-12-04 01:48:45,001 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=64866.666666666664, ans=0.125 2023-12-04 01:48:50,638 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=5.78 vs. limit=12.0 2023-12-04 01:48:53,592 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=64933.333333333336, ans=0.125 2023-12-04 01:49:02,918 INFO [train.py:1087] (2/4) Epoch 11, batch 800, loss[loss=0.2062, simple_loss=0.2936, pruned_loss=0.05936, over 24721.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2927, pruned_loss=0.06319, over 4733745.71 frames. ], batch size: 69, lr: 2.02e-02, grad_scale: 32.0 2023-12-04 01:49:06,896 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=65000.0, ans=0.125 2023-12-04 01:49:09,397 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=65000.0, ans=0.125 2023-12-04 01:49:13,796 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=65066.666666666664, ans=0.0 2023-12-04 01:49:22,729 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=65066.666666666664, ans=0.0 2023-12-04 01:49:23,208 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.12 vs. limit=15.0 2023-12-04 01:49:24,726 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=65133.333333333336, ans=0.0 2023-12-04 01:49:31,635 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=65133.333333333336, ans=0.1 2023-12-04 01:49:47,589 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=65266.666666666664, ans=0.125 2023-12-04 01:49:54,689 INFO [train.py:1087] (2/4) Epoch 11, batch 850, loss[loss=0.2051, simple_loss=0.2889, pruned_loss=0.06065, over 24787.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2933, pruned_loss=0.06371, over 4738961.12 frames. 
], batch size: 62, lr: 2.02e-02, grad_scale: 32.0 2023-12-04 01:50:06,856 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=65400.0, ans=0.0 2023-12-04 01:50:07,559 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.337e+02 1.714e+02 1.902e+02 2.245e+02 3.869e+02, threshold=3.803e+02, percent-clipped=0.0 2023-12-04 01:50:09,285 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.00 vs. limit=15.0 2023-12-04 01:50:24,743 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=65533.333333333336, ans=0.2 2023-12-04 01:50:36,953 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=65600.0, ans=0.0 2023-12-04 01:50:56,254 INFO [train.py:1087] (2/4) Epoch 12, batch 0, loss[loss=0.2, simple_loss=0.287, pruned_loss=0.05648, over 24723.00 frames. ], tot_loss[loss=0.2, simple_loss=0.287, pruned_loss=0.05648, over 24723.00 frames. ], batch size: 67, lr: 1.93e-02, grad_scale: 32.0 2023-12-04 01:50:56,254 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 01:51:08,731 INFO [train.py:1119] (2/4) Epoch 12, validation: loss=0.1762, simple_loss=0.2782, pruned_loss=0.03709, over 944034.00 frames. 2023-12-04 01:51:08,732 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 01:51:21,767 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=65700.0, ans=0.09899494936611666 2023-12-04 01:51:22,274 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.79 vs. limit=15.0 2023-12-04 01:51:22,819 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=65700.0, ans=0.0 2023-12-04 01:51:34,160 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=65766.66666666667, ans=0.125 2023-12-04 01:51:36,248 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=65766.66666666667, ans=0.125 2023-12-04 01:51:54,601 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=65900.0, ans=0.2 2023-12-04 01:51:57,301 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.19 vs. limit=15.0 2023-12-04 01:52:03,489 INFO [train.py:1087] (2/4) Epoch 12, batch 50, loss[loss=0.2356, simple_loss=0.3176, pruned_loss=0.07684, over 24458.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2925, pruned_loss=0.06413, over 1076719.33 frames. ], batch size: 75, lr: 1.93e-02, grad_scale: 32.0 2023-12-04 01:52:03,817 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=65966.66666666667, ans=0.125 2023-12-04 01:52:04,083 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.06 vs. 
limit=15.0 2023-12-04 01:52:12,136 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=65966.66666666667, ans=0.125 2023-12-04 01:52:16,534 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=66033.33333333333, ans=0.125 2023-12-04 01:52:21,954 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=66033.33333333333, ans=0.2 2023-12-04 01:52:22,784 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.664e+02 1.876e+02 2.080e+02 3.636e+02, threshold=3.752e+02, percent-clipped=0.0 2023-12-04 01:52:36,850 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=66166.66666666667, ans=0.125 2023-12-04 01:52:37,958 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=66166.66666666667, ans=0.0 2023-12-04 01:52:53,646 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=66233.33333333333, ans=0.2 2023-12-04 01:52:58,358 INFO [train.py:1087] (2/4) Epoch 12, batch 100, loss[loss=0.1955, simple_loss=0.2851, pruned_loss=0.05289, over 24760.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2924, pruned_loss=0.06302, over 1906682.52 frames. ], batch size: 61, lr: 1.92e-02, grad_scale: 32.0 2023-12-04 01:52:58,556 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=66300.0, ans=0.125 2023-12-04 01:53:08,428 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=66366.66666666667, ans=0.2 2023-12-04 01:53:13,746 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=66366.66666666667, ans=0.1 2023-12-04 01:53:22,618 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.00 vs. limit=22.5 2023-12-04 01:53:44,522 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=66566.66666666667, ans=0.125 2023-12-04 01:53:52,731 INFO [train.py:1087] (2/4) Epoch 12, batch 150, loss[loss=0.2029, simple_loss=0.2823, pruned_loss=0.06173, over 24737.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2913, pruned_loss=0.0619, over 2568371.97 frames. ], batch size: 63, lr: 1.92e-02, grad_scale: 64.0 2023-12-04 01:54:06,045 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=15.0 2023-12-04 01:54:07,347 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.04 vs. 
limit=15.0 2023-12-04 01:54:13,085 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.311e+02 1.585e+02 1.776e+02 2.069e+02 2.709e+02, threshold=3.552e+02, percent-clipped=0.0 2023-12-04 01:54:21,224 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=66766.66666666667, ans=0.1 2023-12-04 01:54:23,766 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.43 vs. limit=22.5 2023-12-04 01:54:35,882 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=66833.33333333333, ans=0.1 2023-12-04 01:54:48,679 INFO [train.py:1087] (2/4) Epoch 12, batch 200, loss[loss=0.2085, simple_loss=0.2935, pruned_loss=0.06176, over 24487.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.291, pruned_loss=0.06202, over 3067658.32 frames. ], batch size: 77, lr: 1.92e-02, grad_scale: 64.0 2023-12-04 01:54:57,456 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=66966.66666666667, ans=0.0 2023-12-04 01:55:03,524 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=67033.33333333333, ans=0.125 2023-12-04 01:55:22,302 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.35 vs. limit=15.0 2023-12-04 01:55:43,626 INFO [train.py:1087] (2/4) Epoch 12, batch 250, loss[loss=0.2102, simple_loss=0.2946, pruned_loss=0.06286, over 22475.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2911, pruned_loss=0.06242, over 3449051.33 frames. ], batch size: 54, lr: 1.91e-02, grad_scale: 64.0 2023-12-04 01:56:03,354 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.629e+02 1.883e+02 2.332e+02 4.483e+02, threshold=3.767e+02, percent-clipped=5.0 2023-12-04 01:56:17,714 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=67500.0, ans=0.0 2023-12-04 01:56:17,802 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=67500.0, ans=0.0 2023-12-04 01:56:17,930 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.93 vs. limit=6.0 2023-12-04 01:56:38,736 INFO [train.py:1087] (2/4) Epoch 12, batch 300, loss[loss=0.1873, simple_loss=0.2716, pruned_loss=0.05145, over 24756.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2908, pruned_loss=0.06198, over 3756368.94 frames. 
], batch size: 66, lr: 1.91e-02, grad_scale: 64.0 2023-12-04 01:56:39,009 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=67633.33333333333, ans=0.1 2023-12-04 01:56:43,313 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=67633.33333333333, ans=0.125 2023-12-04 01:56:51,846 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=67700.0, ans=0.2 2023-12-04 01:57:18,575 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=67833.33333333333, ans=0.125 2023-12-04 01:57:27,091 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=67900.0, ans=0.125 2023-12-04 01:57:29,249 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=67900.0, ans=0.0 2023-12-04 01:57:33,579 INFO [train.py:1087] (2/4) Epoch 12, batch 350, loss[loss=0.2189, simple_loss=0.2935, pruned_loss=0.07217, over 24159.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2906, pruned_loss=0.06189, over 3996345.60 frames. ], batch size: 82, lr: 1.91e-02, grad_scale: 64.0 2023-12-04 01:57:41,188 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=67966.66666666667, ans=0.1 2023-12-04 01:57:51,767 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=68033.33333333333, ans=0.125 2023-12-04 01:57:53,625 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.303e+02 1.644e+02 1.847e+02 2.070e+02 4.257e+02, threshold=3.695e+02, percent-clipped=1.0 2023-12-04 01:58:18,265 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=68233.33333333333, ans=0.1 2023-12-04 01:58:21,492 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=68233.33333333333, ans=0.125 2023-12-04 01:58:27,731 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=68300.0, ans=0.0 2023-12-04 01:58:28,587 INFO [train.py:1087] (2/4) Epoch 12, batch 400, loss[loss=0.232, simple_loss=0.3056, pruned_loss=0.07916, over 24508.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2901, pruned_loss=0.06179, over 4172484.61 frames. ], batch size: 77, lr: 1.90e-02, grad_scale: 64.0 2023-12-04 01:58:29,380 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.08 vs. limit=22.5 2023-12-04 01:58:29,962 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=68300.0, ans=0.5 2023-12-04 01:58:38,637 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.32 vs. 
limit=12.0 2023-12-04 01:58:44,722 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=68366.66666666667, ans=0.125 2023-12-04 01:59:16,404 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=68566.66666666667, ans=0.0 2023-12-04 01:59:17,405 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=68566.66666666667, ans=0.05 2023-12-04 01:59:23,530 INFO [train.py:1087] (2/4) Epoch 12, batch 450, loss[loss=0.2729, simple_loss=0.3313, pruned_loss=0.1072, over 16835.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2898, pruned_loss=0.06159, over 4317915.13 frames. ], batch size: 177, lr: 1.90e-02, grad_scale: 64.0 2023-12-04 01:59:31,509 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.44 vs. limit=15.0 2023-12-04 01:59:43,216 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.320e+02 1.590e+02 1.776e+02 1.985e+02 3.300e+02, threshold=3.553e+02, percent-clipped=0.0 2023-12-04 01:59:50,092 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=68766.66666666667, ans=0.07 2023-12-04 02:00:06,936 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:00:18,733 INFO [train.py:1087] (2/4) Epoch 12, batch 500, loss[loss=0.1884, simple_loss=0.2768, pruned_loss=0.05007, over 24768.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2906, pruned_loss=0.06223, over 4412586.46 frames. ], batch size: 65, lr: 1.90e-02, grad_scale: 64.0 2023-12-04 02:00:26,798 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.12 vs. limit=6.0 2023-12-04 02:00:32,623 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=69033.33333333333, ans=0.1 2023-12-04 02:00:39,161 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=69100.0, ans=0.2 2023-12-04 02:00:40,270 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=69100.0, ans=0.125 2023-12-04 02:01:12,874 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.11 vs. limit=15.0 2023-12-04 02:01:13,283 INFO [train.py:1087] (2/4) Epoch 12, batch 550, loss[loss=0.2004, simple_loss=0.2867, pruned_loss=0.05705, over 24581.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2914, pruned_loss=0.06277, over 4480886.97 frames. ], batch size: 64, lr: 1.90e-02, grad_scale: 32.0 2023-12-04 02:01:35,169 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.724e+02 2.043e+02 2.317e+02 3.835e+02, threshold=4.086e+02, percent-clipped=1.0 2023-12-04 02:01:39,475 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=69433.33333333333, ans=0.1 2023-12-04 02:01:45,366 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.18 vs. 
limit=12.0 2023-12-04 02:01:57,437 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=69566.66666666667, ans=0.125 2023-12-04 02:02:06,814 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.28 vs. limit=15.0 2023-12-04 02:02:09,169 INFO [train.py:1087] (2/4) Epoch 12, batch 600, loss[loss=0.1912, simple_loss=0.2789, pruned_loss=0.05171, over 24718.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2904, pruned_loss=0.06206, over 4556678.43 frames. ], batch size: 74, lr: 1.89e-02, grad_scale: 32.0 2023-12-04 02:02:22,909 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=69700.0, ans=0.1 2023-12-04 02:02:29,679 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=69700.0, ans=0.125 2023-12-04 02:02:53,807 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=69900.0, ans=0.0 2023-12-04 02:03:01,365 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.50 vs. limit=12.0 2023-12-04 02:03:04,484 INFO [train.py:1087] (2/4) Epoch 12, batch 650, loss[loss=0.2437, simple_loss=0.3209, pruned_loss=0.0833, over 21485.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.29, pruned_loss=0.06157, over 4621839.82 frames. ], batch size: 127, lr: 1.89e-02, grad_scale: 32.0 2023-12-04 02:03:08,142 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.54 vs. limit=15.0 2023-12-04 02:03:08,273 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.24 vs. limit=15.0 2023-12-04 02:03:16,714 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.21 vs. limit=6.0 2023-12-04 02:03:21,182 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=70033.33333333333, ans=0.125 2023-12-04 02:03:24,306 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.80 vs. limit=6.0 2023-12-04 02:03:25,723 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.683e+02 1.982e+02 2.241e+02 4.584e+02, threshold=3.964e+02, percent-clipped=1.0 2023-12-04 02:03:33,553 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=70100.0, ans=0.2 2023-12-04 02:04:00,262 INFO [train.py:1087] (2/4) Epoch 12, batch 700, loss[loss=0.2063, simple_loss=0.286, pruned_loss=0.06327, over 23987.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2897, pruned_loss=0.0611, over 4667102.44 frames. 
], batch size: 87, lr: 1.89e-02, grad_scale: 32.0 2023-12-04 02:04:09,173 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=70300.0, ans=0.125 2023-12-04 02:04:09,347 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=70300.0, ans=0.125 2023-12-04 02:04:16,607 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=70366.66666666667, ans=0.0 2023-12-04 02:04:16,615 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=70366.66666666667, ans=0.1 2023-12-04 02:04:18,069 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.75 vs. limit=15.0 2023-12-04 02:04:20,774 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=70433.33333333333, ans=0.125 2023-12-04 02:04:36,774 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=70500.0, ans=0.0 2023-12-04 02:04:43,484 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=70566.66666666667, ans=0.1 2023-12-04 02:04:44,722 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.02 vs. limit=10.0 2023-12-04 02:04:52,039 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=70566.66666666667, ans=0.1 2023-12-04 02:04:55,254 INFO [train.py:1087] (2/4) Epoch 12, batch 750, loss[loss=0.1983, simple_loss=0.2828, pruned_loss=0.05689, over 24716.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2899, pruned_loss=0.06115, over 4692546.88 frames. ], batch size: 69, lr: 1.88e-02, grad_scale: 32.0 2023-12-04 02:05:08,642 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=70700.0, ans=0.0 2023-12-04 02:05:12,109 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=70700.0, ans=0.125 2023-12-04 02:05:15,417 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=70700.0, ans=0.125 2023-12-04 02:05:16,172 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.641e+02 1.850e+02 2.061e+02 3.864e+02, threshold=3.700e+02, percent-clipped=0.0 2023-12-04 02:05:20,994 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.21 vs. 
limit=15.0 2023-12-04 02:05:23,734 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:05:24,788 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=70766.66666666667, ans=0.125 2023-12-04 02:05:28,755 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=70833.33333333333, ans=0.125 2023-12-04 02:05:32,031 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=70833.33333333333, ans=0.0 2023-12-04 02:05:44,865 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=70900.0, ans=0.1 2023-12-04 02:05:50,306 INFO [train.py:1087] (2/4) Epoch 12, batch 800, loss[loss=0.2107, simple_loss=0.2947, pruned_loss=0.06331, over 24701.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2891, pruned_loss=0.0606, over 4731749.83 frames. ], batch size: 74, lr: 1.88e-02, grad_scale: 32.0 2023-12-04 02:05:54,805 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=70966.66666666667, ans=0.125 2023-12-04 02:06:17,804 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:06:17,970 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=22.5 2023-12-04 02:06:20,784 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=71166.66666666667, ans=0.125 2023-12-04 02:06:26,659 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=71166.66666666667, ans=0.125 2023-12-04 02:06:36,475 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=71233.33333333333, ans=0.2 2023-12-04 02:06:41,334 INFO [train.py:1087] (2/4) Epoch 12, batch 850, loss[loss=0.2024, simple_loss=0.2871, pruned_loss=0.05879, over 24557.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2894, pruned_loss=0.06073, over 4760946.28 frames. ], batch size: 63, lr: 1.88e-02, grad_scale: 32.0 2023-12-04 02:06:56,199 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.95 vs. limit=6.0 2023-12-04 02:07:00,559 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.337e+02 1.606e+02 1.785e+02 2.120e+02 3.462e+02, threshold=3.569e+02, percent-clipped=0.0 2023-12-04 02:07:01,791 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=71433.33333333333, ans=0.09899494936611666 2023-12-04 02:07:07,809 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=71433.33333333333, ans=0.2 2023-12-04 02:07:12,298 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.14 vs. limit=22.5 2023-12-04 02:07:42,593 INFO [train.py:1087] (2/4) Epoch 13, batch 0, loss[loss=0.2108, simple_loss=0.2969, pruned_loss=0.06239, over 24768.00 frames. 
], tot_loss[loss=0.2108, simple_loss=0.2969, pruned_loss=0.06239, over 24768.00 frames. ], batch size: 65, lr: 1.80e-02, grad_scale: 32.0 2023-12-04 02:07:42,593 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 02:07:54,988 INFO [train.py:1119] (2/4) Epoch 13, validation: loss=0.173, simple_loss=0.2751, pruned_loss=0.03551, over 944034.00 frames. 2023-12-04 02:07:54,988 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 02:07:58,505 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=71600.0, ans=0.125 2023-12-04 02:07:59,037 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.31 vs. limit=6.0 2023-12-04 02:07:59,908 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.13 vs. limit=10.0 2023-12-04 02:08:15,034 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=71666.66666666667, ans=0.2 2023-12-04 02:08:22,982 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=71733.33333333333, ans=10.0 2023-12-04 02:08:23,881 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:08:24,020 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=71733.33333333333, ans=0.1 2023-12-04 02:08:28,488 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-12-04 02:08:31,526 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=71800.0, ans=0.125 2023-12-04 02:08:34,661 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=71800.0, ans=0.0 2023-12-04 02:08:49,960 INFO [train.py:1087] (2/4) Epoch 13, batch 50, loss[loss=0.1929, simple_loss=0.2824, pruned_loss=0.05168, over 24721.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2908, pruned_loss=0.0608, over 1087395.48 frames. ], batch size: 74, lr: 1.80e-02, grad_scale: 32.0 2023-12-04 02:09:09,954 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=72000.0, ans=0.125 2023-12-04 02:09:14,612 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=72066.66666666667, ans=0.125 2023-12-04 02:09:16,453 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.642e+02 1.871e+02 2.165e+02 4.330e+02, threshold=3.741e+02, percent-clipped=2.0 2023-12-04 02:09:17,735 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=72066.66666666667, ans=0.2 2023-12-04 02:09:37,096 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=72200.0, ans=0.05 2023-12-04 02:09:44,743 INFO [train.py:1087] (2/4) Epoch 13, batch 100, loss[loss=0.2202, simple_loss=0.3052, pruned_loss=0.06764, over 24508.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2881, pruned_loss=0.05966, over 1921791.06 frames. 
], batch size: 75, lr: 1.80e-02, grad_scale: 32.0 2023-12-04 02:09:49,179 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=72266.66666666667, ans=0.95 2023-12-04 02:09:49,752 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.59 vs. limit=12.0 2023-12-04 02:09:51,345 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=72266.66666666667, ans=0.125 2023-12-04 02:09:52,467 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=72266.66666666667, ans=0.07 2023-12-04 02:10:03,323 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=72333.33333333333, ans=0.0 2023-12-04 02:10:07,420 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=72400.0, ans=0.125 2023-12-04 02:10:19,644 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=72466.66666666667, ans=0.2 2023-12-04 02:10:24,857 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=72466.66666666667, ans=0.125 2023-12-04 02:10:25,840 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=72466.66666666667, ans=0.2 2023-12-04 02:10:29,484 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=72533.33333333333, ans=0.0 2023-12-04 02:10:33,622 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=72533.33333333333, ans=0.125 2023-12-04 02:10:38,681 INFO [train.py:1087] (2/4) Epoch 13, batch 150, loss[loss=0.193, simple_loss=0.2777, pruned_loss=0.05409, over 24550.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2876, pruned_loss=0.05955, over 2559105.13 frames. ], batch size: 63, lr: 1.79e-02, grad_scale: 32.0 2023-12-04 02:10:38,779 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=72600.0, ans=0.015 2023-12-04 02:10:41,333 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.37 vs. limit=12.0 2023-12-04 02:10:42,058 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=72600.0, ans=0.125 2023-12-04 02:10:44,301 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=72600.0, ans=0.125 2023-12-04 02:10:48,830 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=72666.66666666667, ans=0.125 2023-12-04 02:10:50,878 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=72666.66666666667, ans=0.125 2023-12-04 02:10:57,613 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.44 vs. 
limit=22.5 2023-12-04 02:11:04,796 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.578e+02 1.747e+02 1.985e+02 3.078e+02, threshold=3.494e+02, percent-clipped=0.0 2023-12-04 02:11:11,377 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=72800.0, ans=0.1 2023-12-04 02:11:13,507 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=72800.0, ans=0.125 2023-12-04 02:11:20,303 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=72800.0, ans=0.2 2023-12-04 02:11:32,880 INFO [train.py:1087] (2/4) Epoch 13, batch 200, loss[loss=0.1951, simple_loss=0.282, pruned_loss=0.0541, over 24736.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2879, pruned_loss=0.05929, over 3071342.47 frames. ], batch size: 69, lr: 1.79e-02, grad_scale: 32.0 2023-12-04 02:11:39,943 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:11:59,545 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=73066.66666666667, ans=0.0 2023-12-04 02:12:01,504 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=73066.66666666667, ans=0.125 2023-12-04 02:12:02,615 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=73066.66666666667, ans=0.1 2023-12-04 02:12:02,658 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:12:14,805 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:12:27,853 INFO [train.py:1087] (2/4) Epoch 13, batch 250, loss[loss=0.2178, simple_loss=0.3005, pruned_loss=0.06756, over 22929.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2874, pruned_loss=0.05914, over 3464128.01 frames. ], batch size: 106, lr: 1.79e-02, grad_scale: 32.0 2023-12-04 02:12:44,747 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=73333.33333333333, ans=0.0 2023-12-04 02:12:45,860 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=73333.33333333333, ans=0.5 2023-12-04 02:12:53,065 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=73400.0, ans=0.2 2023-12-04 02:12:55,828 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.329e+02 1.605e+02 1.874e+02 2.151e+02 3.142e+02, threshold=3.748e+02, percent-clipped=0.0 2023-12-04 02:12:59,573 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.01 vs. 
limit=15.0 2023-12-04 02:13:00,369 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=73466.66666666667, ans=0.125 2023-12-04 02:13:08,771 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=73466.66666666667, ans=0.0 2023-12-04 02:13:11,946 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=73533.33333333333, ans=0.125 2023-12-04 02:13:18,662 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=73533.33333333333, ans=0.125 2023-12-04 02:13:23,424 INFO [train.py:1087] (2/4) Epoch 13, batch 300, loss[loss=0.1837, simple_loss=0.2732, pruned_loss=0.04713, over 24690.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2883, pruned_loss=0.0599, over 3737069.26 frames. ], batch size: 74, lr: 1.78e-02, grad_scale: 16.0 2023-12-04 02:13:30,085 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=73600.0, ans=0.5 2023-12-04 02:13:52,004 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=73733.33333333333, ans=0.125 2023-12-04 02:14:15,857 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=15.0 2023-12-04 02:14:17,919 INFO [train.py:1087] (2/4) Epoch 13, batch 350, loss[loss=0.1966, simple_loss=0.2809, pruned_loss=0.05616, over 24566.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.288, pruned_loss=0.05977, over 3981223.53 frames. ], batch size: 64, lr: 1.78e-02, grad_scale: 16.0 2023-12-04 02:14:22,808 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=73933.33333333333, ans=0.125 2023-12-04 02:14:26,030 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73933.33333333333, ans=0.1 2023-12-04 02:14:38,845 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=74066.66666666667, ans=0.125 2023-12-04 02:14:42,922 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=74066.66666666667, ans=0.125 2023-12-04 02:14:44,777 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.629e+02 1.793e+02 1.917e+02 2.643e+02, threshold=3.587e+02, percent-clipped=0.0 2023-12-04 02:15:03,709 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=74200.0, ans=0.2 2023-12-04 02:15:12,034 INFO [train.py:1087] (2/4) Epoch 13, batch 400, loss[loss=0.1963, simple_loss=0.2828, pruned_loss=0.05491, over 24721.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2879, pruned_loss=0.05971, over 4164276.65 frames. 
], batch size: 69, lr: 1.78e-02, grad_scale: 32.0 2023-12-04 02:15:29,645 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=74333.33333333333, ans=0.125 2023-12-04 02:15:34,896 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=74400.0, ans=0.0 2023-12-04 02:15:35,765 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=74400.0, ans=0.125 2023-12-04 02:15:37,884 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=74400.0, ans=0.1 2023-12-04 02:15:37,984 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=74400.0, ans=0.125 2023-12-04 02:15:39,054 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=74400.0, ans=0.125 2023-12-04 02:15:41,225 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=74400.0, ans=0.125 2023-12-04 02:15:42,314 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=74400.0, ans=0.125 2023-12-04 02:15:57,920 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.69 vs. limit=6.0 2023-12-04 02:16:00,693 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=74533.33333333333, ans=0.1 2023-12-04 02:16:06,898 INFO [train.py:1087] (2/4) Epoch 13, batch 450, loss[loss=0.2182, simple_loss=0.3006, pruned_loss=0.0679, over 21255.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2873, pruned_loss=0.05953, over 4298558.64 frames. ], batch size: 127, lr: 1.78e-02, grad_scale: 32.0 2023-12-04 02:16:16,769 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=74666.66666666667, ans=0.0 2023-12-04 02:16:34,977 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.725e+02 1.922e+02 2.219e+02 4.035e+02, threshold=3.844e+02, percent-clipped=3.0 2023-12-04 02:16:39,499 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=74800.0, ans=0.125 2023-12-04 02:16:50,309 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=74866.66666666667, ans=0.0 2023-12-04 02:16:56,320 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.40 vs. limit=12.0 2023-12-04 02:17:02,473 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.65 vs. limit=22.5 2023-12-04 02:17:02,968 INFO [train.py:1087] (2/4) Epoch 13, batch 500, loss[loss=0.205, simple_loss=0.2882, pruned_loss=0.06094, over 24008.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2872, pruned_loss=0.05944, over 4411015.13 frames. ], batch size: 87, lr: 1.77e-02, grad_scale: 32.0 2023-12-04 02:17:03,504 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.62 vs. 
limit=15.0 2023-12-04 02:17:06,430 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=74933.33333333333, ans=0.09899494936611666 2023-12-04 02:17:25,531 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=75066.66666666667, ans=0.125 2023-12-04 02:17:45,709 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=75200.0, ans=0.1 2023-12-04 02:17:45,719 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=75200.0, ans=0.125 2023-12-04 02:17:57,588 INFO [train.py:1087] (2/4) Epoch 13, batch 550, loss[loss=0.1977, simple_loss=0.2813, pruned_loss=0.05706, over 24773.00 frames. ], tot_loss[loss=0.203, simple_loss=0.287, pruned_loss=0.05944, over 4494038.29 frames. ], batch size: 71, lr: 1.77e-02, grad_scale: 32.0 2023-12-04 02:18:00,657 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=75266.66666666667, ans=0.1 2023-12-04 02:18:06,099 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=75266.66666666667, ans=0.125 2023-12-04 02:18:10,571 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=75333.33333333333, ans=0.125 2023-12-04 02:18:25,349 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.288e+02 1.590e+02 1.790e+02 2.037e+02 3.276e+02, threshold=3.580e+02, percent-clipped=0.0 2023-12-04 02:18:29,595 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.75 vs. limit=15.0 2023-12-04 02:18:53,007 INFO [train.py:1087] (2/4) Epoch 13, batch 600, loss[loss=0.2161, simple_loss=0.2977, pruned_loss=0.06724, over 23407.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2866, pruned_loss=0.05915, over 4576461.54 frames. ], batch size: 94, lr: 1.77e-02, grad_scale: 32.0 2023-12-04 02:18:54,305 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=75600.0, ans=0.0 2023-12-04 02:18:54,347 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=75600.0, ans=0.2 2023-12-04 02:19:10,832 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=3.572e-03 2023-12-04 02:19:13,358 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff3.min_abs, batch_count=75666.66666666667, ans=0.2 2023-12-04 02:19:15,433 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=75733.33333333333, ans=0.2 2023-12-04 02:19:20,242 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.76 vs. 
limit=22.5 2023-12-04 02:19:27,234 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=75800.0, ans=0.125 2023-12-04 02:19:39,725 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:19:41,893 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=75866.66666666667, ans=0.0 2023-12-04 02:19:41,981 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.17 vs. limit=15.0 2023-12-04 02:19:45,290 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=75866.66666666667, ans=0.1 2023-12-04 02:19:48,262 INFO [train.py:1087] (2/4) Epoch 13, batch 650, loss[loss=0.2815, simple_loss=0.3361, pruned_loss=0.1134, over 16054.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2862, pruned_loss=0.05897, over 4631559.93 frames. ], batch size: 176, lr: 1.77e-02, grad_scale: 32.0 2023-12-04 02:19:49,634 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=75933.33333333333, ans=0.125 2023-12-04 02:19:53,681 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=75933.33333333333, ans=0.0 2023-12-04 02:19:57,016 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.05 vs. limit=22.5 2023-12-04 02:20:06,848 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=76000.0, ans=0.1 2023-12-04 02:20:16,052 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.248e+02 1.571e+02 1.754e+02 1.973e+02 3.590e+02, threshold=3.508e+02, percent-clipped=1.0 2023-12-04 02:20:16,818 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.94 vs. limit=6.0 2023-12-04 02:20:27,763 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.71 vs. limit=15.0 2023-12-04 02:20:37,715 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=76200.0, ans=0.125 2023-12-04 02:20:38,810 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=76200.0, ans=0.09899494936611666 2023-12-04 02:20:43,847 INFO [train.py:1087] (2/4) Epoch 13, batch 700, loss[loss=0.2164, simple_loss=0.2957, pruned_loss=0.0685, over 24231.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2863, pruned_loss=0.05886, over 4683367.02 frames. ], batch size: 82, lr: 1.76e-02, grad_scale: 32.0 2023-12-04 02:20:45,519 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.11 vs. limit=12.0 2023-12-04 02:21:02,827 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.80 vs. 
limit=22.5 2023-12-04 02:21:03,970 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=76333.33333333333, ans=0.0 2023-12-04 02:21:20,108 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=76466.66666666667, ans=0.0 2023-12-04 02:21:20,360 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=18.81 vs. limit=15.0 2023-12-04 02:21:28,967 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=76533.33333333333, ans=0.0 2023-12-04 02:21:31,121 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=76533.33333333333, ans=0.125 2023-12-04 02:21:39,147 INFO [train.py:1087] (2/4) Epoch 13, batch 750, loss[loss=0.2099, simple_loss=0.2928, pruned_loss=0.06348, over 24798.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2862, pruned_loss=0.05874, over 4712310.04 frames. ], batch size: 71, lr: 1.76e-02, grad_scale: 32.0 2023-12-04 02:21:42,833 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.55 vs. limit=15.0 2023-12-04 02:21:47,653 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=76600.0, ans=0.2 2023-12-04 02:21:47,745 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=76600.0, ans=10.0 2023-12-04 02:22:04,121 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.44 vs. limit=15.0 2023-12-04 02:22:05,602 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.679e+02 1.812e+02 2.035e+02 3.313e+02, threshold=3.625e+02, percent-clipped=0.0 2023-12-04 02:22:32,466 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=76933.33333333333, ans=0.07 2023-12-04 02:22:33,244 INFO [train.py:1087] (2/4) Epoch 13, batch 800, loss[loss=0.2099, simple_loss=0.2963, pruned_loss=0.06178, over 22791.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2863, pruned_loss=0.05882, over 4738586.01 frames. ], batch size: 106, lr: 1.76e-02, grad_scale: 32.0 2023-12-04 02:22:43,622 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=77000.0, ans=0.2 2023-12-04 02:22:45,923 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.52 vs. limit=15.0 2023-12-04 02:22:48,021 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.91 vs. limit=22.5 2023-12-04 02:22:52,967 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.69 vs. 
limit=15.0 2023-12-04 02:22:55,646 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=77066.66666666667, ans=0.125 2023-12-04 02:22:59,684 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=77066.66666666667, ans=0.07 2023-12-04 02:23:00,668 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=77066.66666666667, ans=0.125 2023-12-04 02:23:02,837 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=5.18 vs. limit=12.0 2023-12-04 02:23:03,558 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=77133.33333333333, ans=0.2 2023-12-04 02:23:16,754 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.44 vs. limit=15.0 2023-12-04 02:23:24,367 INFO [train.py:1087] (2/4) Epoch 13, batch 850, loss[loss=0.213, simple_loss=0.2989, pruned_loss=0.06354, over 24120.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2859, pruned_loss=0.05841, over 4777752.18 frames. ], batch size: 82, lr: 1.76e-02, grad_scale: 32.0 2023-12-04 02:23:27,784 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=77266.66666666667, ans=0.2 2023-12-04 02:23:35,079 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=77333.33333333333, ans=0.1 2023-12-04 02:23:47,565 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.24 vs. limit=15.0 2023-12-04 02:23:50,107 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.257e+02 1.571e+02 1.695e+02 1.880e+02 3.580e+02, threshold=3.390e+02, percent-clipped=0.0 2023-12-04 02:23:51,400 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:23:54,229 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=77466.66666666667, ans=0.0 2023-12-04 02:24:02,665 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=15.0 2023-12-04 02:24:26,547 INFO [train.py:1087] (2/4) Epoch 14, batch 0, loss[loss=0.1954, simple_loss=0.2809, pruned_loss=0.05501, over 24766.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2809, pruned_loss=0.05501, over 24766.00 frames. ], batch size: 65, lr: 1.69e-02, grad_scale: 32.0 2023-12-04 02:24:26,547 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 02:24:38,625 INFO [train.py:1119] (2/4) Epoch 14, validation: loss=0.1708, simple_loss=0.273, pruned_loss=0.03427, over 944034.00 frames. 2023-12-04 02:24:38,625 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 02:24:42,345 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.69 vs. 
limit=15.0 2023-12-04 02:24:48,312 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=77633.33333333333, ans=0.04949747468305833 2023-12-04 02:24:48,356 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=77633.33333333333, ans=0.2 2023-12-04 02:24:53,787 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=77633.33333333333, ans=0.125 2023-12-04 02:25:08,542 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=77700.0, ans=0.125 2023-12-04 02:25:08,873 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.22 vs. limit=22.5 2023-12-04 02:25:15,823 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=77766.66666666667, ans=0.2 2023-12-04 02:25:33,286 INFO [train.py:1087] (2/4) Epoch 14, batch 50, loss[loss=0.1908, simple_loss=0.2711, pruned_loss=0.05528, over 24781.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.286, pruned_loss=0.05836, over 1101888.01 frames. ], batch size: 72, lr: 1.69e-02, grad_scale: 32.0 2023-12-04 02:25:38,173 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=77900.0, ans=0.125 2023-12-04 02:25:44,692 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=77966.66666666667, ans=0.125 2023-12-04 02:25:49,959 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=77966.66666666667, ans=0.0 2023-12-04 02:25:51,046 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=77966.66666666667, ans=0.125 2023-12-04 02:25:59,152 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.42 vs. limit=15.0 2023-12-04 02:26:06,277 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.290e+02 1.648e+02 1.784e+02 1.938e+02 4.407e+02, threshold=3.569e+02, percent-clipped=1.0 2023-12-04 02:26:19,069 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.90 vs. limit=15.0 2023-12-04 02:26:20,049 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=78166.66666666667, ans=15.0 2023-12-04 02:26:28,100 INFO [train.py:1087] (2/4) Epoch 14, batch 100, loss[loss=0.2025, simple_loss=0.287, pruned_loss=0.05898, over 24285.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2853, pruned_loss=0.05786, over 1931092.22 frames. 
], batch size: 79, lr: 1.68e-02, grad_scale: 32.0 2023-12-04 02:26:38,673 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=78300.0, ans=0.125 2023-12-04 02:27:17,841 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.whiten.whitening_limit, batch_count=78500.0, ans=15.0 2023-12-04 02:27:22,673 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=78566.66666666667, ans=0.125 2023-12-04 02:27:23,495 INFO [train.py:1087] (2/4) Epoch 14, batch 150, loss[loss=0.2432, simple_loss=0.3195, pruned_loss=0.08347, over 21539.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2853, pruned_loss=0.05831, over 2561950.64 frames. ], batch size: 128, lr: 1.68e-02, grad_scale: 32.0 2023-12-04 02:27:26,854 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:27:27,844 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=78566.66666666667, ans=0.125 2023-12-04 02:27:43,617 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=78633.33333333333, ans=0.1 2023-12-04 02:27:45,080 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=78700.0, ans=0.0 2023-12-04 02:27:56,503 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.581e+02 1.793e+02 2.072e+02 3.114e+02, threshold=3.586e+02, percent-clipped=0.0 2023-12-04 02:28:15,525 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=78833.33333333333, ans=0.125 2023-12-04 02:28:18,741 INFO [train.py:1087] (2/4) Epoch 14, batch 200, loss[loss=0.1898, simple_loss=0.2775, pruned_loss=0.05103, over 24492.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2851, pruned_loss=0.05808, over 3042476.22 frames. ], batch size: 77, lr: 1.68e-02, grad_scale: 32.0 2023-12-04 02:28:30,900 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=78966.66666666667, ans=0.125 2023-12-04 02:28:34,635 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=78966.66666666667, ans=0.125 2023-12-04 02:29:08,287 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=79166.66666666667, ans=0.125 2023-12-04 02:29:11,489 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=79166.66666666667, ans=0.125 2023-12-04 02:29:14,399 INFO [train.py:1087] (2/4) Epoch 14, batch 250, loss[loss=0.1852, simple_loss=0.2701, pruned_loss=0.0501, over 24768.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2847, pruned_loss=0.0577, over 3434484.31 frames. 
], batch size: 65, lr: 1.68e-02, grad_scale: 32.0 2023-12-04 02:29:23,408 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=79233.33333333333, ans=0.125 2023-12-04 02:29:47,208 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.189e+02 1.553e+02 1.699e+02 1.898e+02 2.927e+02, threshold=3.398e+02, percent-clipped=0.0 2023-12-04 02:29:47,571 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=79433.33333333333, ans=0.1 2023-12-04 02:30:09,051 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=79566.66666666667, ans=0.0 2023-12-04 02:30:09,763 INFO [train.py:1087] (2/4) Epoch 14, batch 300, loss[loss=0.1942, simple_loss=0.2786, pruned_loss=0.05486, over 24692.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2842, pruned_loss=0.05751, over 3748779.45 frames. ], batch size: 74, lr: 1.67e-02, grad_scale: 32.0 2023-12-04 02:30:09,961 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=79566.66666666667, ans=0.125 2023-12-04 02:30:15,191 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=79566.66666666667, ans=0.0 2023-12-04 02:30:38,208 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=79700.0, ans=0.1 2023-12-04 02:31:04,243 INFO [train.py:1087] (2/4) Epoch 14, batch 350, loss[loss=0.1863, simple_loss=0.2763, pruned_loss=0.04814, over 24768.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2839, pruned_loss=0.05736, over 3976557.06 frames. ], batch size: 64, lr: 1.67e-02, grad_scale: 32.0 2023-12-04 02:31:18,743 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=79966.66666666667, ans=0.125 2023-12-04 02:31:39,264 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.287e+02 1.517e+02 1.760e+02 2.062e+02 3.143e+02, threshold=3.521e+02, percent-clipped=0.0 2023-12-04 02:31:47,271 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=80100.0, ans=0.125 2023-12-04 02:31:49,426 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=80166.66666666667, ans=0.0 2023-12-04 02:31:56,898 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=80166.66666666667, ans=0.125 2023-12-04 02:32:00,862 INFO [train.py:1087] (2/4) Epoch 14, batch 400, loss[loss=0.2114, simple_loss=0.2969, pruned_loss=0.06294, over 24747.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.284, pruned_loss=0.05729, over 4165038.17 frames. 
], batch size: 70, lr: 1.67e-02, grad_scale: 32.0 2023-12-04 02:32:15,607 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=80300.0, ans=0.0 2023-12-04 02:32:26,133 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=80366.66666666667, ans=0.05 2023-12-04 02:32:34,793 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=80433.33333333333, ans=0.07 2023-12-04 02:32:34,810 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=80433.33333333333, ans=0.025 2023-12-04 02:32:52,820 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=80500.0, ans=0.09899494936611666 2023-12-04 02:32:55,649 INFO [train.py:1087] (2/4) Epoch 14, batch 450, loss[loss=0.1977, simple_loss=0.2852, pruned_loss=0.05513, over 24811.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2834, pruned_loss=0.05672, over 4333193.82 frames. ], batch size: 72, lr: 1.67e-02, grad_scale: 32.0 2023-12-04 02:33:01,297 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=80566.66666666667, ans=0.125 2023-12-04 02:33:06,515 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=80633.33333333333, ans=0.125 2023-12-04 02:33:12,139 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=80633.33333333333, ans=0.125 2023-12-04 02:33:28,840 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.331e+02 1.570e+02 1.760e+02 1.959e+02 2.604e+02, threshold=3.520e+02, percent-clipped=0.0 2023-12-04 02:33:36,937 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.80 vs. limit=15.0 2023-12-04 02:33:51,222 INFO [train.py:1087] (2/4) Epoch 14, batch 500, loss[loss=0.1894, simple_loss=0.2793, pruned_loss=0.04975, over 24793.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2836, pruned_loss=0.05688, over 4431528.49 frames. ], batch size: 71, lr: 1.66e-02, grad_scale: 32.0 2023-12-04 02:33:53,503 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=80900.0, ans=0.95 2023-12-04 02:33:55,607 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=80900.0, ans=0.125 2023-12-04 02:34:11,511 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=81033.33333333333, ans=0.125 2023-12-04 02:34:17,258 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.12 vs. limit=15.0 2023-12-04 02:34:44,308 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=81233.33333333333, ans=0.125 2023-12-04 02:34:45,062 INFO [train.py:1087] (2/4) Epoch 14, batch 550, loss[loss=0.1871, simple_loss=0.2704, pruned_loss=0.05191, over 24502.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2845, pruned_loss=0.05747, over 4510682.47 frames. 
], batch size: 75, lr: 1.66e-02, grad_scale: 32.0 2023-12-04 02:34:49,275 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81233.33333333333, ans=0.1 2023-12-04 02:34:53,811 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.84 vs. limit=15.0 2023-12-04 02:34:58,753 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=81300.0, ans=0.07 2023-12-04 02:35:11,828 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=81366.66666666667, ans=0.125 2023-12-04 02:35:17,826 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.319e+02 1.552e+02 1.720e+02 1.872e+02 2.900e+02, threshold=3.440e+02, percent-clipped=0.0 2023-12-04 02:35:40,197 INFO [train.py:1087] (2/4) Epoch 14, batch 600, loss[loss=0.2111, simple_loss=0.2956, pruned_loss=0.0633, over 24001.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2841, pruned_loss=0.05714, over 4576662.55 frames. ], batch size: 87, lr: 1.66e-02, grad_scale: 32.0 2023-12-04 02:35:46,732 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=81566.66666666667, ans=0.125 2023-12-04 02:35:46,820 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=81566.66666666667, ans=0.125 2023-12-04 02:35:57,516 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=81633.33333333333, ans=0.0 2023-12-04 02:35:58,424 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=81633.33333333333, ans=0.125 2023-12-04 02:36:03,781 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=81700.0, ans=0.0 2023-12-04 02:36:07,005 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=81700.0, ans=0.07 2023-12-04 02:36:12,282 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=81766.66666666667, ans=0.05 2023-12-04 02:36:23,889 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81833.33333333333, ans=0.1 2023-12-04 02:36:29,460 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=81833.33333333333, ans=0.0 2023-12-04 02:36:35,520 INFO [train.py:1087] (2/4) Epoch 14, batch 650, loss[loss=0.2174, simple_loss=0.2966, pruned_loss=0.06914, over 24019.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2839, pruned_loss=0.057, over 4637563.19 frames. 
], batch size: 87, lr: 1.66e-02, grad_scale: 32.0 2023-12-04 02:36:42,134 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:36:59,309 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=82033.33333333333, ans=0.125 2023-12-04 02:37:07,924 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.536e+02 1.690e+02 1.852e+02 2.525e+02, threshold=3.381e+02, percent-clipped=0.0 2023-12-04 02:37:29,274 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=82233.33333333333, ans=0.125 2023-12-04 02:37:30,033 INFO [train.py:1087] (2/4) Epoch 14, batch 700, loss[loss=0.195, simple_loss=0.2805, pruned_loss=0.05473, over 24779.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2837, pruned_loss=0.05701, over 4678834.50 frames. ], batch size: 73, lr: 1.65e-02, grad_scale: 32.0 2023-12-04 02:37:34,797 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=82233.33333333333, ans=0.125 2023-12-04 02:37:38,957 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=82233.33333333333, ans=0.1 2023-12-04 02:37:43,155 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=82300.0, ans=0.125 2023-12-04 02:37:47,388 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=82300.0, ans=0.1 2023-12-04 02:37:54,753 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=82366.66666666667, ans=0.125 2023-12-04 02:37:58,136 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=82366.66666666667, ans=0.125 2023-12-04 02:38:07,897 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=82433.33333333333, ans=0.0 2023-12-04 02:38:13,115 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=82500.0, ans=0.1 2023-12-04 02:38:15,336 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=82500.0, ans=0.0 2023-12-04 02:38:17,366 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=82500.0, ans=0.125 2023-12-04 02:38:18,593 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=82500.0, ans=0.2 2023-12-04 02:38:24,632 INFO [train.py:1087] (2/4) Epoch 14, batch 750, loss[loss=0.2103, simple_loss=0.2923, pruned_loss=0.06415, over 24743.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2836, pruned_loss=0.05673, over 4717049.41 frames. 
], batch size: 63, lr: 1.65e-02, grad_scale: 32.0 2023-12-04 02:38:24,910 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=82566.66666666667, ans=0.0 2023-12-04 02:38:31,261 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=82566.66666666667, ans=0.125 2023-12-04 02:38:57,729 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=82766.66666666667, ans=0.125 2023-12-04 02:38:58,922 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.291e+02 1.618e+02 1.933e+02 2.376e+02 4.506e+02, threshold=3.866e+02, percent-clipped=2.0 2023-12-04 02:39:00,770 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=82766.66666666667, ans=0.125 2023-12-04 02:39:02,690 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=82766.66666666667, ans=0.07 2023-12-04 02:39:17,006 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=82833.33333333333, ans=0.2 2023-12-04 02:39:20,997 INFO [train.py:1087] (2/4) Epoch 14, batch 800, loss[loss=0.1962, simple_loss=0.2784, pruned_loss=0.05699, over 24294.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2832, pruned_loss=0.05655, over 4736764.10 frames. ], batch size: 79, lr: 1.65e-02, grad_scale: 32.0 2023-12-04 02:39:36,225 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=82966.66666666667, ans=0.1 2023-12-04 02:39:37,225 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=82966.66666666667, ans=0.0 2023-12-04 02:40:12,648 INFO [train.py:1087] (2/4) Epoch 14, batch 850, loss[loss=0.1907, simple_loss=0.2767, pruned_loss=0.0524, over 24759.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2835, pruned_loss=0.05676, over 4750886.36 frames. ], batch size: 64, lr: 1.65e-02, grad_scale: 32.0 2023-12-04 02:40:19,930 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=83233.33333333333, ans=0.125 2023-12-04 02:40:27,997 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=83300.0, ans=0.0 2023-12-04 02:40:41,344 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-12-04 02:40:42,923 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.264e+02 1.548e+02 1.697e+02 1.848e+02 3.947e+02, threshold=3.394e+02, percent-clipped=1.0 2023-12-04 02:40:56,440 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.51 vs. limit=15.0 2023-12-04 02:41:14,499 INFO [train.py:1087] (2/4) Epoch 15, batch 0, loss[loss=0.1942, simple_loss=0.2783, pruned_loss=0.05501, over 24568.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2783, pruned_loss=0.05501, over 24568.00 frames. 
], batch size: 64, lr: 1.59e-02, grad_scale: 32.0 2023-12-04 02:41:14,500 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 02:41:26,634 INFO [train.py:1119] (2/4) Epoch 15, validation: loss=0.1681, simple_loss=0.2702, pruned_loss=0.03297, over 944034.00 frames. 2023-12-04 02:41:26,635 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 02:41:27,861 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=83533.33333333333, ans=0.1 2023-12-04 02:41:28,990 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=83533.33333333333, ans=0.125 2023-12-04 02:41:29,897 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=83533.33333333333, ans=0.125 2023-12-04 02:41:50,375 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=83666.66666666667, ans=0.0 2023-12-04 02:41:50,582 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.22 vs. limit=15.0 2023-12-04 02:42:19,888 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=83800.0, ans=0.0 2023-12-04 02:42:21,110 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=22.5 2023-12-04 02:42:21,688 INFO [train.py:1087] (2/4) Epoch 15, batch 50, loss[loss=0.1859, simple_loss=0.2732, pruned_loss=0.04928, over 24788.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.283, pruned_loss=0.05711, over 1071557.24 frames. ], batch size: 71, lr: 1.59e-02, grad_scale: 32.0 2023-12-04 02:42:37,598 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.07 vs. limit=15.0 2023-12-04 02:43:00,522 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.523e+02 1.656e+02 1.880e+02 3.765e+02, threshold=3.311e+02, percent-clipped=1.0 2023-12-04 02:43:10,420 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=84133.33333333333, ans=0.0 2023-12-04 02:43:16,501 INFO [train.py:1087] (2/4) Epoch 15, batch 100, loss[loss=0.1853, simple_loss=0.2708, pruned_loss=0.04987, over 24755.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2826, pruned_loss=0.05631, over 1889530.88 frames. ], batch size: 64, lr: 1.58e-02, grad_scale: 32.0 2023-12-04 02:43:21,832 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.97 vs. 
limit=15.0 2023-12-04 02:43:22,427 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=84200.0, ans=0.2 2023-12-04 02:43:31,398 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=84266.66666666667, ans=0.125 2023-12-04 02:43:39,832 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=84333.33333333333, ans=0.0 2023-12-04 02:43:40,752 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=84333.33333333333, ans=0.0 2023-12-04 02:44:11,284 INFO [train.py:1087] (2/4) Epoch 15, batch 150, loss[loss=0.2111, simple_loss=0.294, pruned_loss=0.06407, over 24246.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.2819, pruned_loss=0.05565, over 2547346.35 frames. ], batch size: 82, lr: 1.58e-02, grad_scale: 32.0 2023-12-04 02:44:27,341 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=84600.0, ans=0.1 2023-12-04 02:44:39,537 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=84666.66666666667, ans=0.2 2023-12-04 02:44:49,729 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.330e+02 1.528e+02 1.638e+02 1.857e+02 3.053e+02, threshold=3.277e+02, percent-clipped=0.0 2023-12-04 02:44:58,112 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=84800.0, ans=0.035 2023-12-04 02:45:02,374 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=84800.0, ans=0.1 2023-12-04 02:45:06,742 INFO [train.py:1087] (2/4) Epoch 15, batch 200, loss[loss=0.2053, simple_loss=0.2881, pruned_loss=0.06128, over 20918.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.2816, pruned_loss=0.05555, over 3051783.83 frames. ], batch size: 50, lr: 1.58e-02, grad_scale: 32.0 2023-12-04 02:45:16,727 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff2.min_abs, batch_count=84933.33333333333, ans=0.1 2023-12-04 02:45:26,421 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=84933.33333333333, ans=0.1 2023-12-04 02:45:32,610 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=85000.0, ans=0.2 2023-12-04 02:45:56,089 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=85133.33333333333, ans=0.1 2023-12-04 02:46:02,253 INFO [train.py:1087] (2/4) Epoch 15, batch 250, loss[loss=0.1856, simple_loss=0.2727, pruned_loss=0.04925, over 24577.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2813, pruned_loss=0.0552, over 3443091.90 frames. 
], batch size: 64, lr: 1.58e-02, grad_scale: 32.0 2023-12-04 02:46:16,557 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=85266.66666666667, ans=0.025 2023-12-04 02:46:40,746 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.213e+02 1.454e+02 1.592e+02 1.753e+02 2.965e+02, threshold=3.184e+02, percent-clipped=0.0 2023-12-04 02:46:47,971 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.55 vs. limit=22.5 2023-12-04 02:46:57,802 INFO [train.py:1087] (2/4) Epoch 15, batch 300, loss[loss=0.1816, simple_loss=0.268, pruned_loss=0.04761, over 24554.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.2812, pruned_loss=0.0549, over 3757099.13 frames. ], batch size: 66, lr: 1.57e-02, grad_scale: 32.0 2023-12-04 02:47:01,185 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=85533.33333333333, ans=0.125 2023-12-04 02:47:09,919 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.20 vs. limit=15.0 2023-12-04 02:47:11,777 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=85600.0, ans=0.0 2023-12-04 02:47:12,969 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=85600.0, ans=0.125 2023-12-04 02:47:19,772 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=85666.66666666667, ans=0.0 2023-12-04 02:47:26,655 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.26 vs. limit=15.0 2023-12-04 02:47:30,252 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:47:30,577 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=85733.33333333333, ans=12.0 2023-12-04 02:47:45,897 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=85800.0, ans=0.125 2023-12-04 02:47:49,861 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.98 vs. limit=22.5 2023-12-04 02:47:52,370 INFO [train.py:1087] (2/4) Epoch 15, batch 350, loss[loss=0.2091, simple_loss=0.2916, pruned_loss=0.06332, over 23625.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2809, pruned_loss=0.05485, over 3988813.66 frames. ], batch size: 95, lr: 1.57e-02, grad_scale: 32.0 2023-12-04 02:47:59,328 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.71 vs. limit=22.5 2023-12-04 02:48:18,393 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.72 vs. 
limit=15.0 2023-12-04 02:48:31,959 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.237e+02 1.522e+02 1.709e+02 1.914e+02 2.589e+02, threshold=3.419e+02, percent-clipped=0.0 2023-12-04 02:48:32,112 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=86066.66666666667, ans=0.125 2023-12-04 02:48:37,589 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=86133.33333333333, ans=0.95 2023-12-04 02:48:45,973 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=86133.33333333333, ans=0.1 2023-12-04 02:48:47,878 INFO [train.py:1087] (2/4) Epoch 15, batch 400, loss[loss=0.1888, simple_loss=0.2754, pruned_loss=0.0511, over 24548.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.2811, pruned_loss=0.05519, over 4166210.51 frames. ], batch size: 62, lr: 1.57e-02, grad_scale: 32.0 2023-12-04 02:49:07,607 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=86266.66666666667, ans=0.0 2023-12-04 02:49:12,965 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=86333.33333333333, ans=0.05 2023-12-04 02:49:18,056 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:49:21,530 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. limit=6.0 2023-12-04 02:49:24,537 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=86400.0, ans=0.0 2023-12-04 02:49:43,579 INFO [train.py:1087] (2/4) Epoch 15, batch 450, loss[loss=0.192, simple_loss=0.2801, pruned_loss=0.05195, over 24556.00 frames. ], tot_loss[loss=0.1956, simple_loss=0.281, pruned_loss=0.05511, over 4304115.94 frames. ], batch size: 63, lr: 1.57e-02, grad_scale: 32.0 2023-12-04 02:49:45,353 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. limit=10.0 2023-12-04 02:50:08,333 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=86666.66666666667, ans=0.125 2023-12-04 02:50:09,418 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=86666.66666666667, ans=0.2 2023-12-04 02:50:17,930 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=86733.33333333333, ans=0.1 2023-12-04 02:50:21,846 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.235e+02 1.620e+02 1.806e+02 1.993e+02 3.052e+02, threshold=3.612e+02, percent-clipped=0.0 2023-12-04 02:50:39,161 INFO [train.py:1087] (2/4) Epoch 15, batch 500, loss[loss=0.1906, simple_loss=0.277, pruned_loss=0.05205, over 24628.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2811, pruned_loss=0.05528, over 4420267.57 frames. 
], batch size: 68, lr: 1.57e-02, grad_scale: 64.0 2023-12-04 02:51:01,646 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=87000.0, ans=0.125 2023-12-04 02:51:17,108 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.94 vs. limit=6.0 2023-12-04 02:51:21,955 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=87133.33333333333, ans=0.125 2023-12-04 02:51:25,264 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=87133.33333333333, ans=0.125 2023-12-04 02:51:29,561 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=87133.33333333333, ans=0.1 2023-12-04 02:51:33,481 INFO [train.py:1087] (2/4) Epoch 15, batch 550, loss[loss=0.2007, simple_loss=0.2859, pruned_loss=0.0577, over 24780.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.2812, pruned_loss=0.0551, over 4514193.90 frames. ], batch size: 62, lr: 1.56e-02, grad_scale: 64.0 2023-12-04 02:51:33,769 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=87200.0, ans=0.2 2023-12-04 02:51:51,952 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=87266.66666666667, ans=0.2 2023-12-04 02:51:55,046 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=87333.33333333333, ans=0.0 2023-12-04 02:52:10,871 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=87400.0, ans=0.2 2023-12-04 02:52:12,754 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.308e+02 1.596e+02 1.795e+02 1.987e+02 2.603e+02, threshold=3.589e+02, percent-clipped=0.0 2023-12-04 02:52:12,969 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=87400.0, ans=0.0 2023-12-04 02:52:29,062 INFO [train.py:1087] (2/4) Epoch 15, batch 600, loss[loss=0.2021, simple_loss=0.2808, pruned_loss=0.06175, over 24561.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2813, pruned_loss=0.05514, over 4574215.03 frames. ], batch size: 62, lr: 1.56e-02, grad_scale: 64.0 2023-12-04 02:52:41,728 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=87600.0, ans=0.2 2023-12-04 02:52:52,987 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.58 vs. limit=12.0 2023-12-04 02:52:59,009 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=87666.66666666667, ans=0.1 2023-12-04 02:53:10,240 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=87733.33333333333, ans=0.125 2023-12-04 02:53:14,323 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. 
limit=15.0 2023-12-04 02:53:21,370 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=87800.0, ans=0.0 2023-12-04 02:53:24,768 INFO [train.py:1087] (2/4) Epoch 15, batch 650, loss[loss=0.1979, simple_loss=0.2826, pruned_loss=0.05657, over 24552.00 frames. ], tot_loss[loss=0.1956, simple_loss=0.2813, pruned_loss=0.05498, over 4628010.52 frames. ], batch size: 62, lr: 1.56e-02, grad_scale: 64.0 2023-12-04 02:53:27,145 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=87866.66666666667, ans=0.1 2023-12-04 02:53:29,338 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=87866.66666666667, ans=0.05 2023-12-04 02:54:02,770 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=88066.66666666667, ans=0.125 2023-12-04 02:54:03,473 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.576e+02 1.749e+02 1.959e+02 2.517e+02, threshold=3.499e+02, percent-clipped=0.0 2023-12-04 02:54:08,048 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=88133.33333333333, ans=0.1 2023-12-04 02:54:09,244 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=22.5 2023-12-04 02:54:20,521 INFO [train.py:1087] (2/4) Epoch 15, batch 700, loss[loss=0.1911, simple_loss=0.2779, pruned_loss=0.05217, over 24789.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.2819, pruned_loss=0.05539, over 4651303.27 frames. ], batch size: 73, lr: 1.56e-02, grad_scale: 64.0 2023-12-04 02:54:20,742 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=88200.0, ans=0.07 2023-12-04 02:54:26,100 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=88200.0, ans=0.2 2023-12-04 02:54:31,605 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=88266.66666666667, ans=0.1 2023-12-04 02:54:33,751 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=88266.66666666667, ans=0.0 2023-12-04 02:54:45,630 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=88333.33333333333, ans=0.125 2023-12-04 02:55:08,287 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=88466.66666666667, ans=0.1 2023-12-04 02:55:10,402 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=88466.66666666667, ans=0.2 2023-12-04 02:55:12,465 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=88466.66666666667, ans=0.1 2023-12-04 02:55:12,514 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=88466.66666666667, ans=0.2 2023-12-04 02:55:16,301 INFO [train.py:1087] (2/4) Epoch 15, batch 750, loss[loss=0.2006, simple_loss=0.2893, pruned_loss=0.05594, over 24451.00 frames. 
], tot_loss[loss=0.1969, simple_loss=0.2823, pruned_loss=0.05576, over 4665610.62 frames. ], batch size: 77, lr: 1.55e-02, grad_scale: 64.0 2023-12-04 02:55:18,646 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=88533.33333333333, ans=0.125 2023-12-04 02:55:18,829 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.93 vs. limit=10.0 2023-12-04 02:55:22,897 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=88533.33333333333, ans=10.0 2023-12-04 02:55:25,986 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=88600.0, ans=0.0 2023-12-04 02:55:31,337 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=88600.0, ans=0.04949747468305833 2023-12-04 02:55:42,343 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=88666.66666666667, ans=0.125 2023-12-04 02:55:52,782 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=88733.33333333333, ans=0.0 2023-12-04 02:55:54,578 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.298e+02 1.502e+02 1.659e+02 1.932e+02 2.669e+02, threshold=3.318e+02, percent-clipped=0.0 2023-12-04 02:55:56,905 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=88733.33333333333, ans=0.1 2023-12-04 02:55:57,968 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=88733.33333333333, ans=0.1 2023-12-04 02:55:58,177 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.89 vs. limit=10.0 2023-12-04 02:56:10,932 INFO [train.py:1087] (2/4) Epoch 15, batch 800, loss[loss=0.2099, simple_loss=0.2928, pruned_loss=0.06354, over 21344.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.2818, pruned_loss=0.05547, over 4701339.35 frames. ], batch size: 127, lr: 1.55e-02, grad_scale: 64.0 2023-12-04 02:56:24,055 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.85 vs. limit=22.5 2023-12-04 02:56:29,720 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=88933.33333333333, ans=0.125 2023-12-04 02:56:56,626 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=89133.33333333333, ans=0.125 2023-12-04 02:57:02,422 INFO [train.py:1087] (2/4) Epoch 15, batch 850, loss[loss=0.1865, simple_loss=0.279, pruned_loss=0.04705, over 24717.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2813, pruned_loss=0.05516, over 4721249.50 frames. 
], batch size: 69, lr: 1.55e-02, grad_scale: 64.0 2023-12-04 02:57:14,834 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=89266.66666666667, ans=10.0 2023-12-04 02:57:29,417 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.55 vs. limit=22.5 2023-12-04 02:57:38,601 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.272e+02 1.620e+02 1.776e+02 2.057e+02 2.968e+02, threshold=3.552e+02, percent-clipped=0.0 2023-12-04 02:58:03,684 INFO [train.py:1087] (2/4) Epoch 16, batch 0, loss[loss=0.1855, simple_loss=0.2725, pruned_loss=0.04922, over 24854.00 frames. ], tot_loss[loss=0.1855, simple_loss=0.2725, pruned_loss=0.04922, over 24854.00 frames. ], batch size: 68, lr: 1.50e-02, grad_scale: 32.0 2023-12-04 02:58:03,684 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 02:58:15,875 INFO [train.py:1119] (2/4) Epoch 16, validation: loss=0.1672, simple_loss=0.2691, pruned_loss=0.03271, over 944034.00 frames. 2023-12-04 02:58:15,875 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 02:58:19,363 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:58:22,567 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=89500.0, ans=0.2 2023-12-04 02:58:24,694 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=89500.0, ans=0.125 2023-12-04 02:58:39,795 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.72 vs. limit=15.0 2023-12-04 02:58:41,796 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.94 vs. limit=10.0 2023-12-04 02:59:11,230 INFO [train.py:1087] (2/4) Epoch 16, batch 50, loss[loss=0.183, simple_loss=0.2712, pruned_loss=0.04739, over 24723.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2806, pruned_loss=0.05395, over 1083040.30 frames. ], batch size: 67, lr: 1.50e-02, grad_scale: 32.0 2023-12-04 02:59:14,593 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=89833.33333333333, ans=0.125 2023-12-04 02:59:44,509 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=90033.33333333333, ans=0.125 2023-12-04 02:59:56,094 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.220e+02 1.481e+02 1.701e+02 1.981e+02 2.994e+02, threshold=3.402e+02, percent-clipped=0.0 2023-12-04 02:59:59,540 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=90100.0, ans=0.1 2023-12-04 03:00:00,563 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=90100.0, ans=0.1 2023-12-04 03:00:05,583 INFO [train.py:1087] (2/4) Epoch 16, batch 100, loss[loss=0.186, simple_loss=0.2776, pruned_loss=0.04722, over 24858.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2801, pruned_loss=0.0533, over 1912616.93 frames. 
], batch size: 68, lr: 1.49e-02, grad_scale: 32.0 2023-12-04 03:00:25,156 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-12-04 03:00:29,481 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=90300.0, ans=0.125 2023-12-04 03:00:31,519 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=90300.0, ans=0.125 2023-12-04 03:00:32,513 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=90300.0, ans=0.0 2023-12-04 03:00:55,609 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.59 vs. limit=15.0 2023-12-04 03:01:00,644 INFO [train.py:1087] (2/4) Epoch 16, batch 150, loss[loss=0.1901, simple_loss=0.279, pruned_loss=0.0506, over 24738.00 frames. ], tot_loss[loss=0.1922, simple_loss=0.2788, pruned_loss=0.05278, over 2557248.57 frames. ], batch size: 61, lr: 1.49e-02, grad_scale: 32.0 2023-12-04 03:01:19,778 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=90566.66666666667, ans=0.0 2023-12-04 03:01:32,954 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.35 vs. limit=15.0 2023-12-04 03:01:46,642 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.290e+02 1.563e+02 1.755e+02 1.966e+02 2.864e+02, threshold=3.511e+02, percent-clipped=0.0 2023-12-04 03:01:56,346 INFO [train.py:1087] (2/4) Epoch 16, batch 200, loss[loss=0.1901, simple_loss=0.2758, pruned_loss=0.05218, over 24546.00 frames. ], tot_loss[loss=0.1928, simple_loss=0.2792, pruned_loss=0.05323, over 3064251.99 frames. ], batch size: 63, lr: 1.49e-02, grad_scale: 32.0 2023-12-04 03:02:09,784 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=90900.0, ans=0.125 2023-12-04 03:02:17,239 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=90900.0, ans=0.1 2023-12-04 03:02:43,145 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=91100.0, ans=0.2 2023-12-04 03:02:50,527 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.46 vs. limit=15.0 2023-12-04 03:02:52,148 INFO [train.py:1087] (2/4) Epoch 16, batch 250, loss[loss=0.2025, simple_loss=0.2834, pruned_loss=0.06084, over 24339.00 frames. ], tot_loss[loss=0.193, simple_loss=0.2792, pruned_loss=0.05345, over 3453762.79 frames. 
], batch size: 79, lr: 1.49e-02, grad_scale: 32.0 2023-12-04 03:02:53,389 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=91166.66666666667, ans=0.2 2023-12-04 03:03:01,771 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=91233.33333333333, ans=0.0 2023-12-04 03:03:22,508 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=91300.0, ans=0.125 2023-12-04 03:03:28,812 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=91366.66666666667, ans=0.2 2023-12-04 03:03:29,933 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=91366.66666666667, ans=0.1 2023-12-04 03:03:36,433 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=91433.33333333333, ans=0.1 2023-12-04 03:03:37,190 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.310e+02 1.524e+02 1.658e+02 1.918e+02 2.871e+02, threshold=3.316e+02, percent-clipped=0.0 2023-12-04 03:03:45,193 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=91433.33333333333, ans=0.2 2023-12-04 03:03:47,749 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=91500.0, ans=0.0 2023-12-04 03:03:48,447 INFO [train.py:1087] (2/4) Epoch 16, batch 300, loss[loss=0.1927, simple_loss=0.2784, pruned_loss=0.05349, over 24751.00 frames. ], tot_loss[loss=0.1945, simple_loss=0.2802, pruned_loss=0.05436, over 3722105.47 frames. ], batch size: 65, lr: 1.49e-02, grad_scale: 32.0 2023-12-04 03:03:54,051 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=91500.0, ans=0.0 2023-12-04 03:04:01,600 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=91566.66666666667, ans=0.125 2023-12-04 03:04:02,952 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.39 vs. limit=15.0 2023-12-04 03:04:07,451 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.65 vs. limit=12.0 2023-12-04 03:04:12,309 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=91633.33333333333, ans=0.0 2023-12-04 03:04:42,563 INFO [train.py:1087] (2/4) Epoch 16, batch 350, loss[loss=0.1896, simple_loss=0.2765, pruned_loss=0.05134, over 24554.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2802, pruned_loss=0.05426, over 3971070.38 frames. ], batch size: 66, lr: 1.48e-02, grad_scale: 32.0 2023-12-04 03:04:42,869 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=91833.33333333333, ans=0.125 2023-12-04 03:04:59,258 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=91900.0, ans=0.2 2023-12-04 03:05:03,937 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.26 vs. 
limit=15.0 2023-12-04 03:05:13,705 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-12-04 03:05:28,191 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.299e+02 1.477e+02 1.566e+02 1.706e+02 2.525e+02, threshold=3.132e+02, percent-clipped=0.0 2023-12-04 03:05:37,753 INFO [train.py:1087] (2/4) Epoch 16, batch 400, loss[loss=0.1823, simple_loss=0.2728, pruned_loss=0.0459, over 24758.00 frames. ], tot_loss[loss=0.1939, simple_loss=0.28, pruned_loss=0.05397, over 4166796.95 frames. ], batch size: 70, lr: 1.48e-02, grad_scale: 32.0 2023-12-04 03:05:39,043 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=92166.66666666667, ans=0.2 2023-12-04 03:05:55,043 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=92233.33333333333, ans=0.0 2023-12-04 03:05:56,041 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=92233.33333333333, ans=0.1 2023-12-04 03:06:07,839 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=92300.0, ans=0.0 2023-12-04 03:06:26,578 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=92433.33333333333, ans=0.125 2023-12-04 03:06:27,968 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=92433.33333333333, ans=0.0 2023-12-04 03:06:33,021 INFO [train.py:1087] (2/4) Epoch 16, batch 450, loss[loss=0.1934, simple_loss=0.2819, pruned_loss=0.05242, over 24488.00 frames. ], tot_loss[loss=0.1939, simple_loss=0.2799, pruned_loss=0.05394, over 4296985.44 frames. ], batch size: 77, lr: 1.48e-02, grad_scale: 32.0 2023-12-04 03:06:42,158 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.whiten.whitening_limit, batch_count=92500.0, ans=12.0 2023-12-04 03:06:42,851 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=92566.66666666667, ans=0.125 2023-12-04 03:06:53,082 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=92566.66666666667, ans=0.0 2023-12-04 03:07:09,536 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.74 vs. limit=15.0 2023-12-04 03:07:17,580 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.269e+02 1.537e+02 1.693e+02 1.843e+02 2.757e+02, threshold=3.387e+02, percent-clipped=0.0 2023-12-04 03:07:28,233 INFO [train.py:1087] (2/4) Epoch 16, batch 500, loss[loss=0.186, simple_loss=0.2719, pruned_loss=0.0501, over 24749.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.2792, pruned_loss=0.05334, over 4424402.40 frames. ], batch size: 63, lr: 1.48e-02, grad_scale: 32.0 2023-12-04 03:07:36,308 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=92833.33333333333, ans=0.125 2023-12-04 03:07:50,085 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.35 vs. 
limit=6.0 2023-12-04 03:08:13,375 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=93100.0, ans=0.025 2023-12-04 03:08:22,771 INFO [train.py:1087] (2/4) Epoch 16, batch 550, loss[loss=0.2272, simple_loss=0.3003, pruned_loss=0.07702, over 16993.00 frames. ], tot_loss[loss=0.1925, simple_loss=0.2787, pruned_loss=0.05317, over 4503139.01 frames. ], batch size: 178, lr: 1.48e-02, grad_scale: 32.0 2023-12-04 03:08:27,657 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=9.32 vs. limit=15.0 2023-12-04 03:08:39,307 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=93233.33333333333, ans=0.025 2023-12-04 03:08:39,310 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=93233.33333333333, ans=0.2 2023-12-04 03:08:41,829 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.42 vs. limit=15.0 2023-12-04 03:08:51,080 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=93300.0, ans=0.0 2023-12-04 03:08:56,731 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=93366.66666666667, ans=0.2 2023-12-04 03:09:08,443 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.166e+02 1.494e+02 1.570e+02 1.787e+02 2.496e+02, threshold=3.140e+02, percent-clipped=0.0 2023-12-04 03:09:11,688 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.57 vs. limit=15.0 2023-12-04 03:09:18,305 INFO [train.py:1087] (2/4) Epoch 16, batch 600, loss[loss=0.1944, simple_loss=0.277, pruned_loss=0.05592, over 24168.00 frames. ], tot_loss[loss=0.1919, simple_loss=0.2783, pruned_loss=0.05277, over 4588211.49 frames. ], batch size: 82, lr: 1.47e-02, grad_scale: 32.0 2023-12-04 03:09:29,060 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=93566.66666666667, ans=0.95 2023-12-04 03:09:38,578 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=93566.66666666667, ans=0.125 2023-12-04 03:09:43,747 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.81 vs. limit=22.5 2023-12-04 03:09:45,303 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=93633.33333333333, ans=0.125 2023-12-04 03:09:51,696 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=93700.0, ans=0.125 2023-12-04 03:09:55,873 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=93700.0, ans=0.0 2023-12-04 03:10:13,595 INFO [train.py:1087] (2/4) Epoch 16, batch 650, loss[loss=0.2098, simple_loss=0.2924, pruned_loss=0.06361, over 21470.00 frames. ], tot_loss[loss=0.1928, simple_loss=0.2791, pruned_loss=0.05328, over 4615600.39 frames. 
], batch size: 128, lr: 1.47e-02, grad_scale: 32.0 2023-12-04 03:10:31,203 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=93900.0, ans=0.2 2023-12-04 03:10:32,251 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=93900.0, ans=0.125 2023-12-04 03:10:36,515 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 03:10:52,805 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=94033.33333333333, ans=0.0 2023-12-04 03:10:57,839 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.251e+02 1.496e+02 1.627e+02 1.957e+02 3.259e+02, threshold=3.254e+02, percent-clipped=1.0 2023-12-04 03:11:05,131 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=94100.0, ans=0.2 2023-12-04 03:11:08,063 INFO [train.py:1087] (2/4) Epoch 16, batch 700, loss[loss=0.1905, simple_loss=0.2777, pruned_loss=0.05163, over 24575.00 frames. ], tot_loss[loss=0.1926, simple_loss=0.2791, pruned_loss=0.0531, over 4670834.59 frames. ], batch size: 65, lr: 1.47e-02, grad_scale: 32.0 2023-12-04 03:11:21,447 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=94233.33333333333, ans=0.125 2023-12-04 03:11:45,745 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=94366.66666666667, ans=0.125 2023-12-04 03:11:46,675 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=94366.66666666667, ans=0.0 2023-12-04 03:11:47,920 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.95 vs. limit=15.0 2023-12-04 03:11:58,839 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=94433.33333333333, ans=0.1 2023-12-04 03:12:01,303 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.56 vs. limit=12.0 2023-12-04 03:12:03,066 INFO [train.py:1087] (2/4) Epoch 16, batch 750, loss[loss=0.1953, simple_loss=0.2836, pruned_loss=0.05346, over 24786.00 frames. ], tot_loss[loss=0.1927, simple_loss=0.2789, pruned_loss=0.05328, over 4709388.97 frames. ], batch size: 72, lr: 1.47e-02, grad_scale: 32.0 2023-12-04 03:12:16,320 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=94566.66666666667, ans=0.125 2023-12-04 03:12:27,387 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=94633.33333333333, ans=0.125 2023-12-04 03:12:35,490 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=94700.0, ans=0.125 2023-12-04 03:12:38,038 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.97 vs. 
limit=22.5 2023-12-04 03:12:44,051 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=94700.0, ans=0.125 2023-12-04 03:12:48,110 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.254e+02 1.571e+02 1.796e+02 2.145e+02 3.958e+02, threshold=3.592e+02, percent-clipped=2.0 2023-12-04 03:12:57,864 INFO [train.py:1087] (2/4) Epoch 16, batch 800, loss[loss=0.2268, simple_loss=0.3064, pruned_loss=0.07357, over 21233.00 frames. ], tot_loss[loss=0.1926, simple_loss=0.2788, pruned_loss=0.05322, over 4730572.03 frames. ], batch size: 128, lr: 1.46e-02, grad_scale: 32.0 2023-12-04 03:13:17,300 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=94900.0, ans=0.125 2023-12-04 03:13:34,879 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.98 vs. limit=10.0 2023-12-04 03:13:49,135 INFO [train.py:1087] (2/4) Epoch 16, batch 850, loss[loss=0.1863, simple_loss=0.2698, pruned_loss=0.05142, over 24446.00 frames. ], tot_loss[loss=0.1927, simple_loss=0.2787, pruned_loss=0.05332, over 4747334.29 frames. ], batch size: 77, lr: 1.46e-02, grad_scale: 32.0 2023-12-04 03:13:49,326 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=95166.66666666667, ans=0.125 2023-12-04 03:13:56,269 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=95166.66666666667, ans=0.125 2023-12-04 03:13:58,150 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=95233.33333333333, ans=0.125 2023-12-04 03:14:03,099 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=95233.33333333333, ans=0.0 2023-12-04 03:14:08,426 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.59 vs. limit=22.5 2023-12-04 03:14:18,443 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=95366.66666666667, ans=0.125 2023-12-04 03:14:30,473 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.285e+02 1.524e+02 1.746e+02 2.004e+02 3.223e+02, threshold=3.492e+02, percent-clipped=0.0 2023-12-04 03:14:49,884 INFO [train.py:1087] (2/4) Epoch 17, batch 0, loss[loss=0.1795, simple_loss=0.2686, pruned_loss=0.04516, over 24617.00 frames. ], tot_loss[loss=0.1795, simple_loss=0.2686, pruned_loss=0.04516, over 24617.00 frames. ], batch size: 68, lr: 1.42e-02, grad_scale: 32.0 2023-12-04 03:14:49,885 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 03:14:59,777 INFO [zipformer.py:1876] (2/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.2182, 3.0933, 3.1858, 4.7104], device='cuda:2') 2023-12-04 03:15:02,235 INFO [train.py:1119] (2/4) Epoch 17, validation: loss=0.165, simple_loss=0.267, pruned_loss=0.03149, over 944034.00 frames. 2023-12-04 03:15:02,236 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 03:15:02,848 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.39 vs. 
limit=15.0 2023-12-04 03:15:12,966 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=95533.33333333333, ans=0.125 2023-12-04 03:15:17,621 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=95533.33333333333, ans=0.025 2023-12-04 03:15:26,856 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 03:15:27,787 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=95600.0, ans=0.125 2023-12-04 03:15:41,051 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.91 vs. limit=15.0 2023-12-04 03:15:41,657 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=95666.66666666667, ans=0.125 2023-12-04 03:15:46,242 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=95733.33333333333, ans=0.125 2023-12-04 03:15:57,564 INFO [train.py:1087] (2/4) Epoch 17, batch 50, loss[loss=0.1759, simple_loss=0.264, pruned_loss=0.04392, over 24718.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.2823, pruned_loss=0.05546, over 1054593.03 frames. ], batch size: 67, lr: 1.41e-02, grad_scale: 32.0 2023-12-04 03:16:09,506 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=95866.66666666667, ans=0.1 2023-12-04 03:16:13,740 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 03:16:17,301 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=95866.66666666667, ans=0.125 2023-12-04 03:16:28,643 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=95933.33333333333, ans=0.1 2023-12-04 03:16:28,661 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=95933.33333333333, ans=0.0 2023-12-04 03:16:41,268 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.29 vs. limit=15.0 2023-12-04 03:16:48,005 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.286e+02 1.540e+02 1.659e+02 1.795e+02 3.449e+02, threshold=3.318e+02, percent-clipped=0.0 2023-12-04 03:16:52,983 INFO [train.py:1087] (2/4) Epoch 17, batch 100, loss[loss=0.1805, simple_loss=0.2679, pruned_loss=0.04658, over 24749.00 frames. ], tot_loss[loss=0.1913, simple_loss=0.2779, pruned_loss=0.05239, over 1902731.99 frames. ], batch size: 64, lr: 1.41e-02, grad_scale: 32.0 2023-12-04 03:16:53,498 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.00 vs. 
limit=15.0 2023-12-04 03:17:07,595 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=96200.0, ans=0.125 2023-12-04 03:17:08,691 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=96200.0, ans=0.2 2023-12-04 03:17:11,970 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 03:17:32,337 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.41 vs. limit=6.0 2023-12-04 03:17:36,026 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.09 vs. limit=15.0 2023-12-04 03:17:47,769 INFO [train.py:1087] (2/4) Epoch 17, batch 150, loss[loss=0.1783, simple_loss=0.2707, pruned_loss=0.04293, over 24574.00 frames. ], tot_loss[loss=0.1912, simple_loss=0.2777, pruned_loss=0.05231, over 2530714.68 frames. ], batch size: 64, lr: 1.41e-02, grad_scale: 32.0 2023-12-04 03:17:58,740 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=96533.33333333333, ans=0.2 2023-12-04 03:18:22,668 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=96666.66666666667, ans=0.1 2023-12-04 03:18:31,640 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=96733.33333333333, ans=0.0 2023-12-04 03:18:39,126 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.196e+02 1.417e+02 1.576e+02 1.746e+02 2.655e+02, threshold=3.153e+02, percent-clipped=0.0 2023-12-04 03:18:39,386 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=96733.33333333333, ans=0.1 2023-12-04 03:18:39,419 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=96733.33333333333, ans=0.1 2023-12-04 03:18:42,553 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=96800.0, ans=0.125 2023-12-04 03:18:43,416 INFO [train.py:1087] (2/4) Epoch 17, batch 200, loss[loss=0.1867, simple_loss=0.2704, pruned_loss=0.05149, over 24748.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.2769, pruned_loss=0.05183, over 3044352.72 frames. ], batch size: 64, lr: 1.41e-02, grad_scale: 32.0 2023-12-04 03:18:47,928 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=96800.0, ans=0.1 2023-12-04 03:18:49,017 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=96800.0, ans=0.1 2023-12-04 03:18:52,116 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=96800.0, ans=0.2 2023-12-04 03:19:04,554 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=96933.33333333333, ans=0.125 2023-12-04 03:19:11,439 INFO [scaling.py:1022] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.17 vs. 
limit=8.0 2023-12-04 03:19:15,370 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=97000.0, ans=0.125 2023-12-04 03:19:28,395 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=97066.66666666667, ans=0.1 2023-12-04 03:19:37,983 INFO [train.py:1087] (2/4) Epoch 17, batch 250, loss[loss=0.1988, simple_loss=0.2768, pruned_loss=0.0604, over 24472.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2771, pruned_loss=0.05185, over 3443449.03 frames. ], batch size: 75, lr: 1.41e-02, grad_scale: 32.0 2023-12-04 03:19:41,494 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=97133.33333333333, ans=0.125 2023-12-04 03:19:49,258 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=97200.0, ans=0.125 2023-12-04 03:20:13,597 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=97333.33333333333, ans=0.2 2023-12-04 03:20:28,367 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=97400.0, ans=0.0 2023-12-04 03:20:29,367 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.229e+02 1.456e+02 1.611e+02 1.755e+02 2.197e+02, threshold=3.223e+02, percent-clipped=0.0 2023-12-04 03:20:33,564 INFO [train.py:1087] (2/4) Epoch 17, batch 300, loss[loss=0.1996, simple_loss=0.2817, pruned_loss=0.05874, over 24475.00 frames. ], tot_loss[loss=0.1914, simple_loss=0.2779, pruned_loss=0.05238, over 3732964.96 frames. ], batch size: 75, lr: 1.41e-02, grad_scale: 32.0 2023-12-04 03:20:41,682 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.18 vs. limit=15.0 2023-12-04 03:20:46,555 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=97533.33333333333, ans=0.1 2023-12-04 03:20:57,865 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=97600.0, ans=0.125 2023-12-04 03:21:26,535 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=97733.33333333333, ans=0.125 2023-12-04 03:21:28,765 INFO [train.py:1087] (2/4) Epoch 17, batch 350, loss[loss=0.197, simple_loss=0.2868, pruned_loss=0.05358, over 24037.00 frames. ], tot_loss[loss=0.1912, simple_loss=0.2779, pruned_loss=0.05223, over 3970845.48 frames. 
], batch size: 87, lr: 1.40e-02, grad_scale: 32.0 2023-12-04 03:21:38,175 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=97800.0, ans=0.125 2023-12-04 03:21:45,537 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=97866.66666666667, ans=0.125 2023-12-04 03:22:19,383 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.204e+02 1.476e+02 1.651e+02 1.851e+02 2.677e+02, threshold=3.301e+02, percent-clipped=0.0 2023-12-04 03:22:20,725 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=98066.66666666667, ans=0.0 2023-12-04 03:22:23,651 INFO [train.py:1087] (2/4) Epoch 17, batch 400, loss[loss=0.1784, simple_loss=0.2679, pruned_loss=0.04447, over 24749.00 frames. ], tot_loss[loss=0.1907, simple_loss=0.2777, pruned_loss=0.05186, over 4154925.86 frames. ], batch size: 66, lr: 1.40e-02, grad_scale: 32.0 2023-12-04 03:22:28,070 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=98133.33333333333, ans=0.09899494936611666 2023-12-04 03:22:35,631 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=98200.0, ans=0.2 2023-12-04 03:22:35,658 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=98200.0, ans=0.0 2023-12-04 03:22:44,972 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=98266.66666666667, ans=0.125 2023-12-04 03:23:02,453 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=98333.33333333333, ans=0.125 2023-12-04 03:23:10,489 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=98400.0, ans=0.125 2023-12-04 03:23:18,717 INFO [train.py:1087] (2/4) Epoch 17, batch 450, loss[loss=0.1836, simple_loss=0.2666, pruned_loss=0.05028, over 24134.00 frames. ], tot_loss[loss=0.1901, simple_loss=0.2769, pruned_loss=0.0516, over 4313020.85 frames. ], batch size: 82, lr: 1.40e-02, grad_scale: 32.0 2023-12-04 03:23:38,976 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=98533.33333333333, ans=0.125 2023-12-04 03:24:09,763 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.258e+02 1.473e+02 1.580e+02 1.739e+02 2.730e+02, threshold=3.159e+02, percent-clipped=0.0 2023-12-04 03:24:13,670 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.01 vs. limit=15.0 2023-12-04 03:24:14,102 INFO [train.py:1087] (2/4) Epoch 17, batch 500, loss[loss=0.181, simple_loss=0.2687, pruned_loss=0.04664, over 24760.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2763, pruned_loss=0.05135, over 4425007.61 frames. ], batch size: 64, lr: 1.40e-02, grad_scale: 32.0 2023-12-04 03:24:22,064 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.18 vs. 
limit=15.0 2023-12-04 03:24:23,702 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=98866.66666666667, ans=0.035 2023-12-04 03:24:25,907 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=98866.66666666667, ans=0.0 2023-12-04 03:24:26,997 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=98866.66666666667, ans=0.125 2023-12-04 03:24:29,040 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=98866.66666666667, ans=0.1 2023-12-04 03:24:34,340 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=98933.33333333333, ans=0.07 2023-12-04 03:24:35,813 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.31 vs. limit=15.0 2023-12-04 03:24:40,450 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=98933.33333333333, ans=0.125 2023-12-04 03:24:40,529 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=98933.33333333333, ans=0.0 2023-12-04 03:24:42,916 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.91 vs. limit=15.0 2023-12-04 03:24:58,476 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=99066.66666666667, ans=0.0 2023-12-04 03:25:08,080 INFO [train.py:1087] (2/4) Epoch 17, batch 550, loss[loss=0.1837, simple_loss=0.271, pruned_loss=0.04818, over 24610.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2765, pruned_loss=0.0513, over 4522145.14 frames. ], batch size: 68, lr: 1.40e-02, grad_scale: 32.0 2023-12-04 03:25:41,249 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=99333.33333333333, ans=0.015 2023-12-04 03:25:44,848 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=99333.33333333333, ans=0.2 2023-12-04 03:25:58,105 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 03:25:58,813 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.285e+02 1.529e+02 1.713e+02 1.854e+02 2.755e+02, threshold=3.426e+02, percent-clipped=0.0 2023-12-04 03:26:03,158 INFO [train.py:1087] (2/4) Epoch 17, batch 600, loss[loss=0.1869, simple_loss=0.274, pruned_loss=0.0499, over 24763.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2764, pruned_loss=0.05124, over 4582752.94 frames. 
], batch size: 65, lr: 1.39e-02, grad_scale: 32.0 2023-12-04 03:26:26,130 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=99600.0, ans=0.2 2023-12-04 03:26:27,141 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=99600.0, ans=0.125 2023-12-04 03:26:27,254 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=99600.0, ans=0.125 2023-12-04 03:26:40,863 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=99666.66666666667, ans=0.0 2023-12-04 03:26:59,239 INFO [train.py:1087] (2/4) Epoch 17, batch 650, loss[loss=0.1855, simple_loss=0.2769, pruned_loss=0.04701, over 24795.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2764, pruned_loss=0.05124, over 4637287.23 frames. ], batch size: 62, lr: 1.39e-02, grad_scale: 32.0 2023-12-04 03:27:07,930 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=99800.0, ans=0.125 2023-12-04 03:27:47,531 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=100066.66666666667, ans=0.0 2023-12-04 03:27:50,901 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.216e+02 1.486e+02 1.651e+02 1.836e+02 2.522e+02, threshold=3.302e+02, percent-clipped=0.0 2023-12-04 03:27:51,564 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.70 vs. limit=15.0 2023-12-04 03:27:55,167 INFO [train.py:1087] (2/4) Epoch 17, batch 700, loss[loss=0.2102, simple_loss=0.2915, pruned_loss=0.06441, over 24466.00 frames. ], tot_loss[loss=0.1898, simple_loss=0.2766, pruned_loss=0.0515, over 4661803.50 frames. ], batch size: 75, lr: 1.39e-02, grad_scale: 32.0 2023-12-04 03:28:19,530 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=100266.66666666667, ans=0.2 2023-12-04 03:28:19,815 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.68 vs. limit=15.0 2023-12-04 03:28:27,302 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 03:28:48,094 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=100400.0, ans=0.125 2023-12-04 03:28:50,306 INFO [train.py:1087] (2/4) Epoch 17, batch 750, loss[loss=0.1889, simple_loss=0.2768, pruned_loss=0.05047, over 24570.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2765, pruned_loss=0.05146, over 4696365.33 frames. ], batch size: 65, lr: 1.39e-02, grad_scale: 32.0 2023-12-04 03:29:17,905 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=100600.0, ans=0.125 2023-12-04 03:29:41,094 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.245e+02 1.557e+02 1.699e+02 1.928e+02 3.020e+02, threshold=3.398e+02, percent-clipped=0.0 2023-12-04 03:29:45,357 INFO [train.py:1087] (2/4) Epoch 17, batch 800, loss[loss=0.1841, simple_loss=0.2716, pruned_loss=0.04832, over 24792.00 frames. 
], tot_loss[loss=0.1893, simple_loss=0.276, pruned_loss=0.05135, over 4725101.95 frames. ], batch size: 62, lr: 1.39e-02, grad_scale: 32.0 2023-12-04 03:30:36,416 INFO [train.py:1087] (2/4) Epoch 17, batch 850, loss[loss=0.1899, simple_loss=0.2739, pruned_loss=0.05302, over 24716.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.276, pruned_loss=0.05155, over 4744128.68 frames. ], batch size: 69, lr: 1.38e-02, grad_scale: 32.0 2023-12-04 03:30:39,891 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=101133.33333333333, ans=0.0 2023-12-04 03:30:46,902 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=101200.0, ans=0.0 2023-12-04 03:30:48,915 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=101200.0, ans=0.125 2023-12-04 03:30:49,989 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=101200.0, ans=0.0 2023-12-04 03:31:04,641 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.48 vs. limit=15.0 2023-12-04 03:31:17,518 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=101400.0, ans=0.125 2023-12-04 03:31:18,487 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=101400.0, ans=0.125 2023-12-04 03:31:38,640 INFO [train.py:1087] (2/4) Epoch 18, batch 0, loss[loss=0.183, simple_loss=0.2711, pruned_loss=0.04742, over 24740.00 frames. ], tot_loss[loss=0.183, simple_loss=0.2711, pruned_loss=0.04742, over 24740.00 frames. ], batch size: 63, lr: 1.34e-02, grad_scale: 32.0 2023-12-04 03:31:38,641 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 03:31:50,770 INFO [train.py:1119] (2/4) Epoch 18, validation: loss=0.1646, simple_loss=0.2659, pruned_loss=0.03165, over 944034.00 frames. 2023-12-04 03:31:50,770 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 03:31:51,846 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.254e+02 1.509e+02 1.678e+02 1.874e+02 3.730e+02, threshold=3.357e+02, percent-clipped=2.0 2023-12-04 03:31:55,195 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=101433.33333333333, ans=0.1 2023-12-04 03:32:02,524 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=101500.0, ans=0.125 2023-12-04 03:32:21,292 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.35 vs. limit=15.0 2023-12-04 03:32:28,765 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.54 vs. limit=12.0 2023-12-04 03:32:40,090 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=101700.0, ans=0.2 2023-12-04 03:32:41,163 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=101700.0, ans=0.0 2023-12-04 03:32:46,112 INFO [train.py:1087] (2/4) Epoch 18, batch 50, loss[loss=0.1865, simple_loss=0.2774, pruned_loss=0.04785, over 24773.00 frames. 
], tot_loss[loss=0.1885, simple_loss=0.2758, pruned_loss=0.05056, over 1093553.52 frames. ], batch size: 70, lr: 1.34e-02, grad_scale: 32.0 2023-12-04 03:33:09,312 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=101900.0, ans=0.0 2023-12-04 03:33:24,964 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=101966.66666666667, ans=0.0 2023-12-04 03:33:36,603 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=102033.33333333333, ans=0.125 2023-12-04 03:33:41,198 INFO [train.py:1087] (2/4) Epoch 18, batch 100, loss[loss=0.1939, simple_loss=0.2812, pruned_loss=0.05331, over 24469.00 frames. ], tot_loss[loss=0.1877, simple_loss=0.2752, pruned_loss=0.05009, over 1916051.96 frames. ], batch size: 77, lr: 1.34e-02, grad_scale: 32.0 2023-12-04 03:33:42,276 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.270e+02 1.448e+02 1.568e+02 1.739e+02 2.957e+02, threshold=3.137e+02, percent-clipped=0.0 2023-12-04 03:34:02,006 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=102233.33333333333, ans=0.125 2023-12-04 03:34:10,635 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=102233.33333333333, ans=0.5 2023-12-04 03:34:15,696 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=102300.0, ans=0.125 2023-12-04 03:34:35,702 INFO [train.py:1087] (2/4) Epoch 18, batch 150, loss[loss=0.1945, simple_loss=0.2796, pruned_loss=0.05471, over 23400.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2748, pruned_loss=0.05018, over 2563499.91 frames. ], batch size: 94, lr: 1.34e-02, grad_scale: 32.0 2023-12-04 03:34:40,042 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=102433.33333333333, ans=0.0 2023-12-04 03:34:41,108 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=102433.33333333333, ans=0.2 2023-12-04 03:34:47,471 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.94 vs. limit=12.0 2023-12-04 03:34:50,615 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.75 vs. limit=15.0 2023-12-04 03:34:51,412 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=102500.0, ans=0.0 2023-12-04 03:34:55,494 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=102500.0, ans=0.1 2023-12-04 03:35:04,738 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.57 vs. 
limit=12.0 2023-12-04 03:35:07,431 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=102633.33333333333, ans=0.1 2023-12-04 03:35:18,180 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=102633.33333333333, ans=0.1 2023-12-04 03:35:26,759 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=20.98 vs. limit=15.0 2023-12-04 03:35:28,676 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=102700.0, ans=0.125 2023-12-04 03:35:30,497 INFO [train.py:1087] (2/4) Epoch 18, batch 200, loss[loss=0.1766, simple_loss=0.266, pruned_loss=0.04354, over 24771.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2741, pruned_loss=0.05, over 3065049.00 frames. ], batch size: 70, lr: 1.34e-02, grad_scale: 64.0 2023-12-04 03:35:31,514 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.294e+02 1.477e+02 1.619e+02 1.796e+02 3.001e+02, threshold=3.237e+02, percent-clipped=0.0 2023-12-04 03:35:48,646 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=102833.33333333333, ans=0.0 2023-12-04 03:36:08,005 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=102966.66666666667, ans=0.125 2023-12-04 03:36:11,215 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=102966.66666666667, ans=0.0 2023-12-04 03:36:17,760 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=103033.33333333333, ans=0.0 2023-12-04 03:36:25,786 INFO [train.py:1087] (2/4) Epoch 18, batch 250, loss[loss=0.2003, simple_loss=0.2874, pruned_loss=0.05655, over 22817.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2749, pruned_loss=0.05074, over 3441538.34 frames. ], batch size: 106, lr: 1.33e-02, grad_scale: 64.0 2023-12-04 03:36:51,222 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=103233.33333333333, ans=0.125 2023-12-04 03:36:59,654 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=103300.0, ans=0.125 2023-12-04 03:37:06,703 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=103300.0, ans=0.1 2023-12-04 03:37:12,757 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.07 vs. limit=15.0 2023-12-04 03:37:16,643 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=103366.66666666667, ans=0.0 2023-12-04 03:37:19,142 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=103366.66666666667, ans=0.0 2023-12-04 03:37:21,845 INFO [train.py:1087] (2/4) Epoch 18, batch 300, loss[loss=0.181, simple_loss=0.2735, pruned_loss=0.04422, over 24716.00 frames. ], tot_loss[loss=0.1879, simple_loss=0.2749, pruned_loss=0.05049, over 3753443.58 frames. 
], batch size: 69, lr: 1.33e-02, grad_scale: 64.0 2023-12-04 03:37:22,888 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.255e+02 1.409e+02 1.517e+02 1.696e+02 2.292e+02, threshold=3.035e+02, percent-clipped=0.0 2023-12-04 03:37:47,397 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=103566.66666666667, ans=0.125 2023-12-04 03:38:13,632 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=103700.0, ans=0.125 2023-12-04 03:38:16,496 INFO [train.py:1087] (2/4) Epoch 18, batch 350, loss[loss=0.1866, simple_loss=0.2728, pruned_loss=0.05021, over 24590.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.2748, pruned_loss=0.05043, over 3994494.58 frames. ], batch size: 68, lr: 1.33e-02, grad_scale: 64.0 2023-12-04 03:38:26,573 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=103766.66666666667, ans=0.125 2023-12-04 03:39:09,440 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=104033.33333333333, ans=0.1 2023-12-04 03:39:12,315 INFO [train.py:1087] (2/4) Epoch 18, batch 400, loss[loss=0.1884, simple_loss=0.28, pruned_loss=0.04838, over 24782.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.2745, pruned_loss=0.05003, over 4187502.87 frames. ], batch size: 62, lr: 1.33e-02, grad_scale: 64.0 2023-12-04 03:39:13,354 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.229e+02 1.517e+02 1.661e+02 1.855e+02 2.860e+02, threshold=3.323e+02, percent-clipped=0.0 2023-12-04 03:39:22,859 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.19 vs. limit=15.0 2023-12-04 03:40:07,624 INFO [train.py:1087] (2/4) Epoch 18, batch 450, loss[loss=0.1697, simple_loss=0.2592, pruned_loss=0.04006, over 24563.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2744, pruned_loss=0.04993, over 4326426.09 frames. ], batch size: 62, lr: 1.33e-02, grad_scale: 64.0 2023-12-04 03:40:23,940 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=104500.0, ans=0.125 2023-12-04 03:40:46,301 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=104633.33333333333, ans=0.0 2023-12-04 03:40:47,927 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=104633.33333333333, ans=22.5 2023-12-04 03:41:00,030 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=104700.0, ans=0.0 2023-12-04 03:41:00,526 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=104700.0, ans=0.125 2023-12-04 03:41:03,346 INFO [train.py:1087] (2/4) Epoch 18, batch 500, loss[loss=0.1843, simple_loss=0.2764, pruned_loss=0.04607, over 24703.00 frames. ], tot_loss[loss=0.1874, simple_loss=0.2746, pruned_loss=0.0501, over 4446351.55 frames. 
], batch size: 69, lr: 1.33e-02, grad_scale: 64.0 2023-12-04 03:41:04,772 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.191e+02 1.505e+02 1.621e+02 1.794e+02 2.576e+02, threshold=3.242e+02, percent-clipped=0.0 2023-12-04 03:41:23,107 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=104833.33333333333, ans=10.0 2023-12-04 03:41:36,216 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=104966.66666666667, ans=0.1 2023-12-04 03:41:40,416 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 03:41:42,601 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=104966.66666666667, ans=0.0 2023-12-04 03:41:53,172 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=105033.33333333333, ans=0.125 2023-12-04 03:41:59,015 INFO [train.py:1087] (2/4) Epoch 18, batch 550, loss[loss=0.1952, simple_loss=0.2854, pruned_loss=0.05253, over 22545.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2748, pruned_loss=0.05016, over 4523003.28 frames. ], batch size: 54, lr: 1.32e-02, grad_scale: 32.0 2023-12-04 03:42:10,499 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=105166.66666666667, ans=0.0 2023-12-04 03:42:13,737 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=105166.66666666667, ans=0.2 2023-12-04 03:42:17,003 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=105166.66666666667, ans=0.09899494936611666 2023-12-04 03:42:40,806 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=105300.0, ans=0.125 2023-12-04 03:42:41,154 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.27 vs. limit=15.0 2023-12-04 03:42:50,858 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=105366.66666666667, ans=0.2 2023-12-04 03:42:54,874 INFO [train.py:1087] (2/4) Epoch 18, batch 600, loss[loss=0.1712, simple_loss=0.2626, pruned_loss=0.03995, over 24557.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.275, pruned_loss=0.05029, over 4562869.50 frames. ], batch size: 62, lr: 1.32e-02, grad_scale: 32.0 2023-12-04 03:42:57,010 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.209e+02 1.490e+02 1.631e+02 1.843e+02 3.006e+02, threshold=3.263e+02, percent-clipped=0.0 2023-12-04 03:43:08,267 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.19 vs. 
limit=22.5 2023-12-04 03:43:09,216 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=105500.0, ans=0.125 2023-12-04 03:43:14,549 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=105500.0, ans=0.125 2023-12-04 03:43:29,067 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.05 vs. limit=12.0 2023-12-04 03:43:50,970 INFO [train.py:1087] (2/4) Epoch 18, batch 650, loss[loss=0.1851, simple_loss=0.2719, pruned_loss=0.04913, over 24565.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.2751, pruned_loss=0.05028, over 4627447.33 frames. ], batch size: 64, lr: 1.32e-02, grad_scale: 32.0 2023-12-04 03:43:52,299 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=105766.66666666667, ans=0.125 2023-12-04 03:44:17,588 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=105900.0, ans=0.125 2023-12-04 03:44:24,318 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=105966.66666666667, ans=0.0 2023-12-04 03:44:32,906 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=105966.66666666667, ans=0.125 2023-12-04 03:44:46,353 INFO [train.py:1087] (2/4) Epoch 18, batch 700, loss[loss=0.2483, simple_loss=0.3178, pruned_loss=0.08938, over 16965.00 frames. ], tot_loss[loss=0.1872, simple_loss=0.2746, pruned_loss=0.04991, over 4668908.22 frames. ], batch size: 177, lr: 1.32e-02, grad_scale: 32.0 2023-12-04 03:44:46,617 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=106100.0, ans=0.025 2023-12-04 03:44:48,454 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.200e+02 1.438e+02 1.544e+02 1.710e+02 2.632e+02, threshold=3.088e+02, percent-clipped=0.0 2023-12-04 03:44:52,951 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-04 03:45:18,638 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=106300.0, ans=0.2 2023-12-04 03:45:19,834 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=106300.0, ans=0.2 2023-12-04 03:45:22,981 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=106300.0, ans=0.1 2023-12-04 03:45:42,181 INFO [train.py:1087] (2/4) Epoch 18, batch 750, loss[loss=0.186, simple_loss=0.2768, pruned_loss=0.04757, over 24765.00 frames. ], tot_loss[loss=0.1869, simple_loss=0.2743, pruned_loss=0.04972, over 4706774.81 frames. 
], batch size: 71, lr: 1.32e-02, grad_scale: 32.0 2023-12-04 03:45:42,485 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=106433.33333333333, ans=0.2 2023-12-04 03:45:48,085 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=106433.33333333333, ans=0.0 2023-12-04 03:46:16,141 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=106633.33333333333, ans=0.1 2023-12-04 03:46:19,317 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=106633.33333333333, ans=0.2 2023-12-04 03:46:39,668 INFO [train.py:1087] (2/4) Epoch 18, batch 800, loss[loss=0.2092, simple_loss=0.289, pruned_loss=0.06469, over 23413.00 frames. ], tot_loss[loss=0.1871, simple_loss=0.2744, pruned_loss=0.04987, over 4724099.31 frames. ], batch size: 94, lr: 1.32e-02, grad_scale: 32.0 2023-12-04 03:46:41,752 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.203e+02 1.420e+02 1.554e+02 1.731e+02 2.690e+02, threshold=3.108e+02, percent-clipped=0.0 2023-12-04 03:46:47,966 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=106766.66666666667, ans=0.0 2023-12-04 03:47:04,453 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=106900.0, ans=0.125 2023-12-04 03:47:06,713 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.19 vs. limit=12.0 2023-12-04 03:47:15,632 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.25 vs. limit=15.0 2023-12-04 03:47:16,665 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.33 vs. limit=15.0 2023-12-04 03:47:21,436 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=107033.33333333333, ans=0.0 2023-12-04 03:47:27,511 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=107033.33333333333, ans=0.2 2023-12-04 03:47:28,417 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=107033.33333333333, ans=0.0 2023-12-04 03:47:31,256 INFO [train.py:1087] (2/4) Epoch 18, batch 850, loss[loss=0.1961, simple_loss=0.2834, pruned_loss=0.05443, over 24764.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.2746, pruned_loss=0.05004, over 4744194.17 frames. 
], batch size: 64, lr: 1.31e-02, grad_scale: 32.0 2023-12-04 03:47:48,732 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=107166.66666666667, ans=0.125 2023-12-04 03:47:56,699 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=107233.33333333333, ans=0.125 2023-12-04 03:47:57,723 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=107233.33333333333, ans=0.1 2023-12-04 03:48:09,690 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=107300.0, ans=0.125 2023-12-04 03:48:33,732 INFO [train.py:1087] (2/4) Epoch 19, batch 0, loss[loss=0.1828, simple_loss=0.2713, pruned_loss=0.0472, over 24768.00 frames. ], tot_loss[loss=0.1828, simple_loss=0.2713, pruned_loss=0.0472, over 24768.00 frames. ], batch size: 65, lr: 1.28e-02, grad_scale: 32.0 2023-12-04 03:48:33,733 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 03:48:45,998 INFO [train.py:1119] (2/4) Epoch 19, validation: loss=0.1614, simple_loss=0.2634, pruned_loss=0.02973, over 944034.00 frames. 2023-12-04 03:48:45,999 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 03:48:47,362 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=107400.0, ans=0.125 2023-12-04 03:48:47,607 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.10 vs. limit=10.0 2023-12-04 03:48:50,472 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=107400.0, ans=0.0 2023-12-04 03:48:53,374 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.136e+02 1.474e+02 1.635e+02 1.770e+02 2.891e+02, threshold=3.271e+02, percent-clipped=0.0 2023-12-04 03:48:57,878 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=107466.66666666667, ans=0.125 2023-12-04 03:49:07,989 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=107533.33333333333, ans=0.0 2023-12-04 03:49:23,231 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 03:49:26,230 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=107600.0, ans=0.0 2023-12-04 03:49:29,495 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=107666.66666666667, ans=0.0 2023-12-04 03:49:40,989 INFO [train.py:1087] (2/4) Epoch 19, batch 50, loss[loss=0.1782, simple_loss=0.2678, pruned_loss=0.04423, over 24732.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2741, pruned_loss=0.04916, over 1106115.64 frames. 
], batch size: 61, lr: 1.28e-02, grad_scale: 32.0 2023-12-04 03:49:44,276 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=107733.33333333333, ans=0.0 2023-12-04 03:50:16,321 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=107933.33333333333, ans=0.2 2023-12-04 03:50:17,313 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=107933.33333333333, ans=0.1 2023-12-04 03:50:21,629 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=107933.33333333333, ans=0.0 2023-12-04 03:50:35,775 INFO [train.py:1087] (2/4) Epoch 19, batch 100, loss[loss=0.1866, simple_loss=0.2732, pruned_loss=0.05002, over 24476.00 frames. ], tot_loss[loss=0.1847, simple_loss=0.2731, pruned_loss=0.04816, over 1940950.51 frames. ], batch size: 77, lr: 1.27e-02, grad_scale: 32.0 2023-12-04 03:50:39,999 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=108066.66666666667, ans=0.2 2023-12-04 03:50:44,318 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.178e+02 1.474e+02 1.630e+02 1.863e+02 2.336e+02, threshold=3.260e+02, percent-clipped=0.0 2023-12-04 03:50:46,026 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=108066.66666666667, ans=15.0 2023-12-04 03:51:02,927 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=108200.0, ans=0.2 2023-12-04 03:51:03,908 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=108200.0, ans=0.1 2023-12-04 03:51:17,275 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=108266.66666666667, ans=0.125 2023-12-04 03:51:18,441 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=108266.66666666667, ans=0.0 2023-12-04 03:51:18,491 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=108266.66666666667, ans=0.0 2023-12-04 03:51:18,873 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.13 vs. limit=22.5 2023-12-04 03:51:23,742 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=108333.33333333333, ans=0.125 2023-12-04 03:51:30,828 INFO [train.py:1087] (2/4) Epoch 19, batch 150, loss[loss=0.1736, simple_loss=0.2596, pruned_loss=0.04376, over 24725.00 frames. ], tot_loss[loss=0.1852, simple_loss=0.273, pruned_loss=0.04868, over 2574842.52 frames. ], batch size: 67, lr: 1.27e-02, grad_scale: 32.0 2023-12-04 03:51:47,329 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.59 vs. 
limit=15.0 2023-12-04 03:51:48,226 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.whiten.whitening_limit, batch_count=108466.66666666667, ans=12.0 2023-12-04 03:52:00,781 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=108533.33333333333, ans=0.125 2023-12-04 03:52:13,966 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=108666.66666666667, ans=0.0 2023-12-04 03:52:24,967 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=108733.33333333333, ans=0.125 2023-12-04 03:52:25,846 INFO [train.py:1087] (2/4) Epoch 19, batch 200, loss[loss=0.194, simple_loss=0.2813, pruned_loss=0.05332, over 23471.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2734, pruned_loss=0.04921, over 3064381.92 frames. ], batch size: 94, lr: 1.27e-02, grad_scale: 32.0 2023-12-04 03:52:28,172 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=108733.33333333333, ans=0.125 2023-12-04 03:52:33,215 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.228e+02 1.475e+02 1.584e+02 1.733e+02 2.643e+02, threshold=3.168e+02, percent-clipped=0.0 2023-12-04 03:52:33,438 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=108733.33333333333, ans=0.125 2023-12-04 03:52:40,266 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=108800.0, ans=0.125 2023-12-04 03:53:07,518 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=108933.33333333333, ans=0.125 2023-12-04 03:53:21,524 INFO [train.py:1087] (2/4) Epoch 19, batch 250, loss[loss=0.1849, simple_loss=0.271, pruned_loss=0.04946, over 24571.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.2739, pruned_loss=0.0496, over 3441242.11 frames. ], batch size: 65, lr: 1.27e-02, grad_scale: 32.0 2023-12-04 03:53:25,008 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=109066.66666666667, ans=0.125 2023-12-04 03:53:30,486 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=109066.66666666667, ans=0.125 2023-12-04 03:53:34,578 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=109133.33333333333, ans=0.0 2023-12-04 03:54:04,786 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=109333.33333333333, ans=0.2 2023-12-04 03:54:14,446 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=109333.33333333333, ans=0.1 2023-12-04 03:54:17,665 INFO [train.py:1087] (2/4) Epoch 19, batch 300, loss[loss=0.1867, simple_loss=0.2705, pruned_loss=0.05142, over 24807.00 frames. ], tot_loss[loss=0.1863, simple_loss=0.2737, pruned_loss=0.04948, over 3741002.29 frames. 
], batch size: 62, lr: 1.27e-02, grad_scale: 32.0 2023-12-04 03:54:17,829 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=109400.0, ans=0.0 2023-12-04 03:54:22,459 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=109400.0, ans=0.0 2023-12-04 03:54:25,402 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.230e+02 1.469e+02 1.582e+02 1.881e+02 2.761e+02, threshold=3.164e+02, percent-clipped=0.0 2023-12-04 03:54:27,992 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.14 vs. limit=22.5 2023-12-04 03:54:30,141 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.12 vs. limit=15.0 2023-12-04 03:54:32,977 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=109466.66666666667, ans=0.0 2023-12-04 03:54:59,700 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=109600.0, ans=0.0 2023-12-04 03:55:10,834 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.91 vs. limit=15.0 2023-12-04 03:55:12,208 INFO [train.py:1087] (2/4) Epoch 19, batch 350, loss[loss=0.1815, simple_loss=0.2686, pruned_loss=0.04721, over 24777.00 frames. ], tot_loss[loss=0.1859, simple_loss=0.2733, pruned_loss=0.04923, over 3989484.86 frames. ], batch size: 71, lr: 1.27e-02, grad_scale: 32.0 2023-12-04 03:55:12,443 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=109733.33333333333, ans=0.1 2023-12-04 03:55:27,605 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=109800.0, ans=0.0 2023-12-04 03:55:28,004 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.93 vs. limit=15.0 2023-12-04 03:55:52,151 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=109933.33333333333, ans=0.125 2023-12-04 03:55:56,917 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.78 vs. limit=22.5 2023-12-04 03:55:58,038 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=110000.0, ans=22.5 2023-12-04 03:56:07,943 INFO [train.py:1087] (2/4) Epoch 19, batch 400, loss[loss=0.1682, simple_loss=0.2609, pruned_loss=0.03778, over 24780.00 frames. ], tot_loss[loss=0.1857, simple_loss=0.2732, pruned_loss=0.04911, over 4166827.76 frames. 
], batch size: 73, lr: 1.26e-02, grad_scale: 32.0 2023-12-04 03:56:08,148 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=110066.66666666667, ans=0.125 2023-12-04 03:56:14,522 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=110066.66666666667, ans=0.1 2023-12-04 03:56:15,315 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.278e+02 1.532e+02 1.716e+02 1.999e+02 2.876e+02, threshold=3.431e+02, percent-clipped=0.0 2023-12-04 03:56:25,274 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=110133.33333333333, ans=0.0 2023-12-04 03:57:03,727 INFO [train.py:1087] (2/4) Epoch 19, batch 450, loss[loss=0.1985, simple_loss=0.2828, pruned_loss=0.05708, over 23625.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2724, pruned_loss=0.0487, over 4311526.53 frames. ], batch size: 94, lr: 1.26e-02, grad_scale: 32.0 2023-12-04 03:57:08,627 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.30 vs. limit=22.5 2023-12-04 03:57:12,585 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=110400.0, ans=0.09899494936611666 2023-12-04 03:57:34,222 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=110533.33333333333, ans=0.125 2023-12-04 03:57:43,851 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=110600.0, ans=0.0 2023-12-04 03:57:44,344 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.96 vs. limit=15.0 2023-12-04 03:57:45,857 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=110600.0, ans=0.07 2023-12-04 03:57:49,369 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=110666.66666666667, ans=0.2 2023-12-04 03:57:53,575 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=110666.66666666667, ans=0.2 2023-12-04 03:57:55,642 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.11 vs. limit=6.0 2023-12-04 03:57:59,709 INFO [train.py:1087] (2/4) Epoch 19, batch 500, loss[loss=0.1993, simple_loss=0.2849, pruned_loss=0.05687, over 21508.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.273, pruned_loss=0.04884, over 4425262.27 frames. ], batch size: 128, lr: 1.26e-02, grad_scale: 32.0 2023-12-04 03:58:06,723 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=110733.33333333333, ans=0.025 2023-12-04 03:58:07,432 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.246e+02 1.421e+02 1.568e+02 1.756e+02 2.852e+02, threshold=3.136e+02, percent-clipped=0.0 2023-12-04 03:58:19,711 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.09 vs. 
limit=15.0 2023-12-04 03:58:20,426 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=110866.66666666667, ans=0.125 2023-12-04 03:58:35,332 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=110933.33333333333, ans=0.0 2023-12-04 03:58:46,918 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.78 vs. limit=15.0 2023-12-04 03:58:50,827 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=111000.0, ans=0.125 2023-12-04 03:58:54,911 INFO [train.py:1087] (2/4) Epoch 19, batch 550, loss[loss=0.1842, simple_loss=0.2773, pruned_loss=0.04555, over 24752.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2729, pruned_loss=0.04861, over 4520126.79 frames. ], batch size: 70, lr: 1.26e-02, grad_scale: 32.0 2023-12-04 03:59:04,767 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=111066.66666666667, ans=0.125 2023-12-04 03:59:21,166 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=111200.0, ans=0.2 2023-12-04 03:59:44,541 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=111333.33333333333, ans=0.125 2023-12-04 03:59:46,158 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.87 vs. limit=15.0 2023-12-04 03:59:50,214 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=111400.0, ans=0.025 2023-12-04 03:59:50,937 INFO [train.py:1087] (2/4) Epoch 19, batch 600, loss[loss=0.1718, simple_loss=0.2619, pruned_loss=0.04084, over 24765.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.273, pruned_loss=0.04874, over 4596180.74 frames. ], batch size: 64, lr: 1.26e-02, grad_scale: 32.0 2023-12-04 03:59:58,416 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.173e+02 1.428e+02 1.598e+02 1.767e+02 2.995e+02, threshold=3.197e+02, percent-clipped=0.0 2023-12-04 04:00:05,783 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=111466.66666666667, ans=0.125 2023-12-04 04:00:11,459 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.40 vs. limit=15.0 2023-12-04 04:00:12,254 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=111533.33333333333, ans=0.125 2023-12-04 04:00:38,382 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=111666.66666666667, ans=0.0 2023-12-04 04:00:46,495 INFO [train.py:1087] (2/4) Epoch 19, batch 650, loss[loss=0.1862, simple_loss=0.2764, pruned_loss=0.04806, over 22709.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2726, pruned_loss=0.04846, over 4655577.72 frames. 
], batch size: 106, lr: 1.26e-02, grad_scale: 32.0 2023-12-04 04:00:54,417 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=111733.33333333333, ans=0.125 2023-12-04 04:00:56,082 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.78 vs. limit=15.0 2023-12-04 04:01:02,036 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten.whitening_limit, batch_count=111800.0, ans=22.5 2023-12-04 04:01:42,495 INFO [train.py:1087] (2/4) Epoch 19, batch 700, loss[loss=0.177, simple_loss=0.2694, pruned_loss=0.04234, over 24691.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2728, pruned_loss=0.04889, over 4666455.06 frames. ], batch size: 61, lr: 1.25e-02, grad_scale: 32.0 2023-12-04 04:01:49,797 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.258e+02 1.467e+02 1.598e+02 1.818e+02 2.734e+02, threshold=3.197e+02, percent-clipped=0.0 2023-12-04 04:01:50,531 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-12-04 04:02:02,110 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=112133.33333333333, ans=0.1 2023-12-04 04:02:10,419 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=112200.0, ans=0.1 2023-12-04 04:02:27,175 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=112333.33333333333, ans=0.1 2023-12-04 04:02:37,380 INFO [train.py:1087] (2/4) Epoch 19, batch 750, loss[loss=0.1846, simple_loss=0.2746, pruned_loss=0.04725, over 24777.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2728, pruned_loss=0.04886, over 4698055.49 frames. ], batch size: 71, lr: 1.25e-02, grad_scale: 32.0 2023-12-04 04:02:43,369 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=112400.0, ans=0.07 2023-12-04 04:02:46,414 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=112400.0, ans=0.125 2023-12-04 04:02:57,012 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=112466.66666666667, ans=0.0 2023-12-04 04:03:08,376 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=112533.33333333333, ans=0.0 2023-12-04 04:03:26,991 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=112666.66666666667, ans=0.2 2023-12-04 04:03:32,476 INFO [train.py:1087] (2/4) Epoch 19, batch 800, loss[loss=0.1681, simple_loss=0.2568, pruned_loss=0.03966, over 23783.00 frames. ], tot_loss[loss=0.1847, simple_loss=0.2722, pruned_loss=0.04858, over 4728697.70 frames. 
], batch size: 57, lr: 1.25e-02, grad_scale: 32.0 2023-12-04 04:03:40,494 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.182e+02 1.431e+02 1.569e+02 1.763e+02 3.495e+02, threshold=3.139e+02, percent-clipped=1.0 2023-12-04 04:04:00,179 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:04:01,039 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=112866.66666666667, ans=0.125 2023-12-04 04:04:14,093 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:04:18,129 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=113000.0, ans=0.125 2023-12-04 04:04:23,888 INFO [train.py:1087] (2/4) Epoch 19, batch 850, loss[loss=0.1908, simple_loss=0.2794, pruned_loss=0.05108, over 24314.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.2728, pruned_loss=0.04901, over 4737927.13 frames. ], batch size: 79, lr: 1.25e-02, grad_scale: 32.0 2023-12-04 04:04:36,074 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=113133.33333333333, ans=0.125 2023-12-04 04:05:26,255 INFO [train.py:1087] (2/4) Epoch 20, batch 0, loss[loss=0.1787, simple_loss=0.2705, pruned_loss=0.04344, over 24741.00 frames. ], tot_loss[loss=0.1787, simple_loss=0.2705, pruned_loss=0.04344, over 24741.00 frames. ], batch size: 63, lr: 1.22e-02, grad_scale: 32.0 2023-12-04 04:05:26,256 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 04:05:38,414 INFO [train.py:1119] (2/4) Epoch 20, validation: loss=0.1617, simple_loss=0.2631, pruned_loss=0.03021, over 944034.00 frames. 2023-12-04 04:05:38,414 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 04:05:50,315 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=113433.33333333333, ans=0.0 2023-12-04 04:05:51,047 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.230e+02 1.506e+02 1.674e+02 1.994e+02 3.219e+02, threshold=3.349e+02, percent-clipped=1.0 2023-12-04 04:05:51,761 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-12-04 04:05:52,373 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=113433.33333333333, ans=0.125 2023-12-04 04:06:14,688 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=113566.66666666667, ans=6.0 2023-12-04 04:06:19,648 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=113566.66666666667, ans=0.125 2023-12-04 04:06:28,999 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=113633.33333333333, ans=0.025 2023-12-04 04:06:33,103 INFO [train.py:1087] (2/4) Epoch 20, batch 50, loss[loss=0.1785, simple_loss=0.2686, pruned_loss=0.04418, over 24740.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.2744, pruned_loss=0.05012, over 1087191.39 frames. 
], batch size: 63, lr: 1.22e-02, grad_scale: 32.0 2023-12-04 04:06:46,408 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=113766.66666666667, ans=0.125 2023-12-04 04:07:06,031 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=113900.0, ans=0.0 2023-12-04 04:07:27,372 INFO [train.py:1087] (2/4) Epoch 20, batch 100, loss[loss=0.1858, simple_loss=0.2706, pruned_loss=0.05052, over 24067.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.2729, pruned_loss=0.04892, over 1924469.44 frames. ], batch size: 87, lr: 1.21e-02, grad_scale: 32.0 2023-12-04 04:07:28,058 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=12.0 2023-12-04 04:07:31,815 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.08 vs. limit=10.0 2023-12-04 04:07:41,264 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.273e+02 1.491e+02 1.624e+02 1.858e+02 3.040e+02, threshold=3.248e+02, percent-clipped=0.0 2023-12-04 04:08:01,749 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.20 vs. limit=15.0 2023-12-04 04:08:08,699 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=114233.33333333333, ans=0.125 2023-12-04 04:08:22,642 INFO [train.py:1087] (2/4) Epoch 20, batch 150, loss[loss=0.1865, simple_loss=0.2721, pruned_loss=0.05042, over 24554.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2725, pruned_loss=0.04851, over 2565315.96 frames. ], batch size: 62, lr: 1.21e-02, grad_scale: 16.0 2023-12-04 04:08:30,297 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=114366.66666666667, ans=0.0 2023-12-04 04:08:50,255 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.30 vs. limit=15.0 2023-12-04 04:08:59,500 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=114566.66666666667, ans=0.125 2023-12-04 04:09:13,116 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=114633.33333333333, ans=0.1 2023-12-04 04:09:18,240 INFO [train.py:1087] (2/4) Epoch 20, batch 200, loss[loss=0.1843, simple_loss=0.2772, pruned_loss=0.04567, over 24340.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2725, pruned_loss=0.04823, over 3061248.17 frames. 
], batch size: 79, lr: 1.21e-02, grad_scale: 16.0 2023-12-04 04:09:29,515 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:09:32,385 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.200e+02 1.461e+02 1.607e+02 1.753e+02 2.582e+02, threshold=3.215e+02, percent-clipped=0.0 2023-12-04 04:09:56,771 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=114900.0, ans=0.125 2023-12-04 04:09:58,859 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=114900.0, ans=0.125 2023-12-04 04:10:00,959 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:10:02,204 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=114966.66666666667, ans=0.125 2023-12-04 04:10:05,287 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=114966.66666666667, ans=0.1 2023-12-04 04:10:07,861 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=114966.66666666667, ans=0.0 2023-12-04 04:10:12,527 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=115033.33333333333, ans=0.1 2023-12-04 04:10:13,391 INFO [train.py:1087] (2/4) Epoch 20, batch 250, loss[loss=0.188, simple_loss=0.2755, pruned_loss=0.05024, over 24439.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2722, pruned_loss=0.0484, over 3439287.63 frames. ], batch size: 77, lr: 1.21e-02, grad_scale: 16.0 2023-12-04 04:10:23,252 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=115100.0, ans=0.0 2023-12-04 04:10:35,529 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=115166.66666666667, ans=0.2 2023-12-04 04:10:44,746 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=115166.66666666667, ans=10.0 2023-12-04 04:10:48,868 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=115233.33333333333, ans=0.0 2023-12-04 04:10:50,983 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=115233.33333333333, ans=0.125 2023-12-04 04:11:08,281 INFO [train.py:1087] (2/4) Epoch 20, batch 300, loss[loss=0.1791, simple_loss=0.2694, pruned_loss=0.04439, over 24782.00 frames. ], tot_loss[loss=0.1829, simple_loss=0.2709, pruned_loss=0.04748, over 3763440.44 frames. ], batch size: 72, lr: 1.21e-02, grad_scale: 16.0 2023-12-04 04:11:16,552 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=115366.66666666667, ans=0.0 2023-12-04 04:11:17,977 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.30 vs. 
limit=15.0 2023-12-04 04:11:22,633 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.290e+02 1.437e+02 1.530e+02 1.741e+02 2.975e+02, threshold=3.060e+02, percent-clipped=0.0 2023-12-04 04:11:25,537 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.63 vs. limit=15.0 2023-12-04 04:11:32,753 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.70 vs. limit=15.0 2023-12-04 04:11:35,045 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=115500.0, ans=0.125 2023-12-04 04:11:49,937 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=15.0 2023-12-04 04:11:59,138 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=115633.33333333333, ans=0.0 2023-12-04 04:12:03,182 INFO [train.py:1087] (2/4) Epoch 20, batch 350, loss[loss=0.1764, simple_loss=0.2628, pruned_loss=0.045, over 24580.00 frames. ], tot_loss[loss=0.1831, simple_loss=0.2708, pruned_loss=0.0477, over 3992062.98 frames. ], batch size: 64, lr: 1.21e-02, grad_scale: 16.0 2023-12-04 04:12:19,497 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=115766.66666666667, ans=0.07 2023-12-04 04:12:45,818 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=115900.0, ans=0.125 2023-12-04 04:12:49,394 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=115966.66666666667, ans=0.0 2023-12-04 04:12:50,401 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=115966.66666666667, ans=0.0 2023-12-04 04:12:56,106 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.22 vs. limit=6.0 2023-12-04 04:12:58,839 INFO [train.py:1087] (2/4) Epoch 20, batch 400, loss[loss=0.1976, simple_loss=0.2837, pruned_loss=0.0558, over 21693.00 frames. ], tot_loss[loss=0.1834, simple_loss=0.271, pruned_loss=0.04791, over 4167105.59 frames. ], batch size: 52, lr: 1.20e-02, grad_scale: 32.0 2023-12-04 04:12:59,004 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=116033.33333333333, ans=0.5 2023-12-04 04:13:06,607 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:13:13,100 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.156e+02 1.517e+02 1.725e+02 1.895e+02 2.824e+02, threshold=3.451e+02, percent-clipped=0.0 2023-12-04 04:13:16,510 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:13:24,893 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.37 vs. 
limit=15.0 2023-12-04 04:13:27,789 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=116166.66666666667, ans=0.125 2023-12-04 04:13:54,172 INFO [train.py:1087] (2/4) Epoch 20, batch 450, loss[loss=0.1889, simple_loss=0.276, pruned_loss=0.05091, over 24774.00 frames. ], tot_loss[loss=0.1834, simple_loss=0.271, pruned_loss=0.04789, over 4318375.60 frames. ], batch size: 64, lr: 1.20e-02, grad_scale: 16.0 2023-12-04 04:14:07,464 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.40 vs. limit=15.0 2023-12-04 04:14:10,402 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=116433.33333333333, ans=0.2 2023-12-04 04:14:20,425 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.22 vs. limit=22.5 2023-12-04 04:14:22,201 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=116500.0, ans=0.125 2023-12-04 04:14:48,755 INFO [train.py:1087] (2/4) Epoch 20, batch 500, loss[loss=0.1717, simple_loss=0.2661, pruned_loss=0.0386, over 24556.00 frames. ], tot_loss[loss=0.1834, simple_loss=0.2711, pruned_loss=0.04789, over 4433622.12 frames. ], batch size: 66, lr: 1.20e-02, grad_scale: 16.0 2023-12-04 04:15:03,788 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.217e+02 1.461e+02 1.591e+02 1.740e+02 2.413e+02, threshold=3.181e+02, percent-clipped=0.0 2023-12-04 04:15:17,272 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.40 vs. limit=22.5 2023-12-04 04:15:19,277 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-12-04 04:15:22,223 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=15.0 2023-12-04 04:15:43,007 INFO [train.py:1087] (2/4) Epoch 20, batch 550, loss[loss=0.1793, simple_loss=0.2695, pruned_loss=0.04454, over 24771.00 frames. ], tot_loss[loss=0.1837, simple_loss=0.2712, pruned_loss=0.04806, over 4516810.92 frames. ], batch size: 70, lr: 1.20e-02, grad_scale: 8.0 2023-12-04 04:16:39,071 INFO [train.py:1087] (2/4) Epoch 20, batch 600, loss[loss=0.1747, simple_loss=0.2661, pruned_loss=0.04164, over 24774.00 frames. ], tot_loss[loss=0.183, simple_loss=0.2707, pruned_loss=0.04765, over 4584601.01 frames. 
], batch size: 72, lr: 1.20e-02, grad_scale: 8.0 2023-12-04 04:16:41,383 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=117366.66666666667, ans=0.0 2023-12-04 04:16:41,475 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=117366.66666666667, ans=0.2 2023-12-04 04:16:43,492 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=117366.66666666667, ans=0.2 2023-12-04 04:16:43,557 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=117366.66666666667, ans=0.125 2023-12-04 04:16:55,667 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.211e+02 1.445e+02 1.548e+02 1.707e+02 4.018e+02, threshold=3.095e+02, percent-clipped=1.0 2023-12-04 04:17:03,033 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=117500.0, ans=0.125 2023-12-04 04:17:07,220 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=117500.0, ans=0.125 2023-12-04 04:17:09,217 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=117500.0, ans=0.125 2023-12-04 04:17:22,156 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=117633.33333333333, ans=0.0 2023-12-04 04:17:23,177 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=117633.33333333333, ans=0.125 2023-12-04 04:17:35,042 INFO [train.py:1087] (2/4) Epoch 20, batch 650, loss[loss=0.1787, simple_loss=0.2683, pruned_loss=0.04456, over 24798.00 frames. ], tot_loss[loss=0.1838, simple_loss=0.2714, pruned_loss=0.04806, over 4624518.07 frames. ], batch size: 71, lr: 1.20e-02, grad_scale: 8.0 2023-12-04 04:17:35,261 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=117700.0, ans=0.0 2023-12-04 04:17:36,265 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=117700.0, ans=0.125 2023-12-04 04:17:45,028 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=117766.66666666667, ans=0.0 2023-12-04 04:17:46,046 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=117766.66666666667, ans=0.0 2023-12-04 04:17:48,168 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=117766.66666666667, ans=0.125 2023-12-04 04:18:02,083 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.85 vs. limit=22.5 2023-12-04 04:18:21,183 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=117966.66666666667, ans=0.0 2023-12-04 04:18:30,186 INFO [train.py:1087] (2/4) Epoch 20, batch 700, loss[loss=0.1783, simple_loss=0.2695, pruned_loss=0.04353, over 24568.00 frames. ], tot_loss[loss=0.1831, simple_loss=0.271, pruned_loss=0.0476, over 4669172.54 frames. 
], batch size: 63, lr: 1.20e-02, grad_scale: 8.0 2023-12-04 04:18:30,412 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=118033.33333333333, ans=0.125 2023-12-04 04:18:40,758 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=118100.0, ans=0.1 2023-12-04 04:18:46,759 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.264e+02 1.613e+02 1.825e+02 2.244e+02 3.207e+02, threshold=3.650e+02, percent-clipped=2.0 2023-12-04 04:19:03,216 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.28 vs. limit=15.0 2023-12-04 04:19:15,081 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=118300.0, ans=0.2 2023-12-04 04:19:25,347 INFO [train.py:1087] (2/4) Epoch 20, batch 750, loss[loss=0.1908, simple_loss=0.2822, pruned_loss=0.04967, over 21214.00 frames. ], tot_loss[loss=0.1834, simple_loss=0.2713, pruned_loss=0.04773, over 4691897.86 frames. ], batch size: 127, lr: 1.19e-02, grad_scale: 8.0 2023-12-04 04:19:25,656 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=118366.66666666667, ans=0.2 2023-12-04 04:19:29,023 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=118366.66666666667, ans=0.125 2023-12-04 04:19:38,695 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.90 vs. limit=15.0 2023-12-04 04:19:40,514 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=118433.33333333333, ans=0.1 2023-12-04 04:19:53,470 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=118500.0, ans=0.035 2023-12-04 04:19:55,728 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=118500.0, ans=0.0 2023-12-04 04:20:21,375 INFO [train.py:1087] (2/4) Epoch 20, batch 800, loss[loss=0.1825, simple_loss=0.2709, pruned_loss=0.04703, over 24800.00 frames. ], tot_loss[loss=0.183, simple_loss=0.2709, pruned_loss=0.04754, over 4721081.85 frames. 
], batch size: 71, lr: 1.19e-02, grad_scale: 8.0 2023-12-04 04:20:23,632 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=118700.0, ans=0.0 2023-12-04 04:20:28,894 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=118700.0, ans=0.0 2023-12-04 04:20:38,595 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.154e+02 1.491e+02 1.643e+02 1.888e+02 3.122e+02, threshold=3.286e+02, percent-clipped=0.0 2023-12-04 04:20:46,910 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=118833.33333333333, ans=0.0 2023-12-04 04:20:46,943 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=118833.33333333333, ans=0.0 2023-12-04 04:20:52,881 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=118900.0, ans=0.125 2023-12-04 04:21:12,871 INFO [train.py:1087] (2/4) Epoch 20, batch 850, loss[loss=0.2293, simple_loss=0.3041, pruned_loss=0.0772, over 16771.00 frames. ], tot_loss[loss=0.183, simple_loss=0.271, pruned_loss=0.04755, over 4732094.35 frames. ], batch size: 176, lr: 1.19e-02, grad_scale: 8.0 2023-12-04 04:21:13,144 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=119033.33333333333, ans=0.07 2023-12-04 04:21:30,419 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=119100.0, ans=0.95 2023-12-04 04:21:40,571 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=119166.66666666667, ans=0.0 2023-12-04 04:21:45,454 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=119233.33333333333, ans=0.125 2023-12-04 04:22:12,130 INFO [train.py:1087] (2/4) Epoch 21, batch 0, loss[loss=0.1752, simple_loss=0.2667, pruned_loss=0.04186, over 24763.00 frames. ], tot_loss[loss=0.1752, simple_loss=0.2667, pruned_loss=0.04186, over 24763.00 frames. ], batch size: 65, lr: 1.16e-02, grad_scale: 16.0 2023-12-04 04:22:12,131 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 04:22:24,247 INFO [train.py:1119] (2/4) Epoch 21, validation: loss=0.1615, simple_loss=0.2627, pruned_loss=0.03013, over 944034.00 frames. 
2023-12-04 04:22:24,248 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 04:22:31,802 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:22:44,374 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:22:46,409 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=119466.66666666667, ans=0.125 2023-12-04 04:22:47,218 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.228e+02 1.497e+02 1.650e+02 1.872e+02 2.842e+02, threshold=3.300e+02, percent-clipped=0.0 2023-12-04 04:22:56,328 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=119533.33333333333, ans=0.0 2023-12-04 04:23:11,045 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=119600.0, ans=0.125 2023-12-04 04:23:19,157 INFO [train.py:1087] (2/4) Epoch 21, batch 50, loss[loss=0.1902, simple_loss=0.2732, pruned_loss=0.05363, over 24076.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.2722, pruned_loss=0.04775, over 1091351.56 frames. ], batch size: 87, lr: 1.16e-02, grad_scale: 16.0 2023-12-04 04:23:21,910 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=119666.66666666667, ans=0.0 2023-12-04 04:23:23,983 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=119666.66666666667, ans=0.1 2023-12-04 04:23:32,722 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.19 vs. limit=10.0 2023-12-04 04:23:34,640 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=119733.33333333333, ans=0.125 2023-12-04 04:23:48,833 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:23:52,309 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=119866.66666666667, ans=0.1 2023-12-04 04:23:56,610 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:24:13,796 INFO [train.py:1087] (2/4) Epoch 21, batch 100, loss[loss=0.19, simple_loss=0.2794, pruned_loss=0.05032, over 21695.00 frames. ], tot_loss[loss=0.1835, simple_loss=0.272, pruned_loss=0.04752, over 1910083.33 frames. 
], batch size: 127, lr: 1.16e-02, grad_scale: 16.0 2023-12-04 04:24:37,169 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.224e+02 1.430e+02 1.557e+02 1.757e+02 2.969e+02, threshold=3.114e+02, percent-clipped=0.0 2023-12-04 04:24:49,792 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=120200.0, ans=0.95 2023-12-04 04:24:53,979 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=120200.0, ans=0.125 2023-12-04 04:24:55,179 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=120200.0, ans=0.0 2023-12-04 04:25:04,054 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=120266.66666666667, ans=0.04949747468305833 2023-12-04 04:25:09,002 INFO [train.py:1087] (2/4) Epoch 21, batch 150, loss[loss=0.176, simple_loss=0.2631, pruned_loss=0.04447, over 24767.00 frames. ], tot_loss[loss=0.1823, simple_loss=0.2708, pruned_loss=0.04686, over 2566540.82 frames. ], batch size: 64, lr: 1.16e-02, grad_scale: 8.0 2023-12-04 04:25:09,348 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=120333.33333333333, ans=0.125 2023-12-04 04:25:28,989 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=120400.0, ans=0.0 2023-12-04 04:25:37,815 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=120466.66666666667, ans=0.2 2023-12-04 04:25:49,841 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=12.0 2023-12-04 04:25:51,831 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=120533.33333333333, ans=0.05 2023-12-04 04:25:53,089 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.92 vs. limit=15.0 2023-12-04 04:25:56,009 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=120600.0, ans=0.1 2023-12-04 04:26:03,064 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.11 vs. limit=22.5 2023-12-04 04:26:04,734 INFO [train.py:1087] (2/4) Epoch 21, batch 200, loss[loss=0.1826, simple_loss=0.2673, pruned_loss=0.04889, over 24559.00 frames. ], tot_loss[loss=0.1831, simple_loss=0.271, pruned_loss=0.04762, over 3038100.52 frames. ], batch size: 63, lr: 1.16e-02, grad_scale: 8.0 2023-12-04 04:26:26,985 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.42 vs. limit=10.0 2023-12-04 04:26:29,413 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.197e+02 1.469e+02 1.704e+02 1.964e+02 2.913e+02, threshold=3.408e+02, percent-clipped=0.0 2023-12-04 04:26:44,136 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.19 vs. 
limit=15.0 2023-12-04 04:26:53,915 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=120933.33333333333, ans=0.2 2023-12-04 04:27:00,141 INFO [train.py:1087] (2/4) Epoch 21, batch 250, loss[loss=0.1662, simple_loss=0.2575, pruned_loss=0.03746, over 24791.00 frames. ], tot_loss[loss=0.1833, simple_loss=0.2711, pruned_loss=0.04776, over 3429045.42 frames. ], batch size: 73, lr: 1.15e-02, grad_scale: 8.0 2023-12-04 04:27:03,710 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=121000.0, ans=0.125 2023-12-04 04:27:09,298 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=121000.0, ans=0.5 2023-12-04 04:27:13,787 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.19 vs. limit=12.0 2023-12-04 04:27:17,816 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=121066.66666666667, ans=0.125 2023-12-04 04:27:34,175 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.16 vs. limit=12.0 2023-12-04 04:27:56,412 INFO [train.py:1087] (2/4) Epoch 21, batch 300, loss[loss=0.1871, simple_loss=0.2699, pruned_loss=0.05211, over 24562.00 frames. ], tot_loss[loss=0.1826, simple_loss=0.2706, pruned_loss=0.0473, over 3734259.90 frames. ], batch size: 63, lr: 1.15e-02, grad_scale: 8.0 2023-12-04 04:28:09,462 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=121400.0, ans=0.09899494936611666 2023-12-04 04:28:10,464 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=121400.0, ans=0.125 2023-12-04 04:28:18,501 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=121466.66666666667, ans=0.125 2023-12-04 04:28:20,266 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.254e+02 1.454e+02 1.562e+02 1.743e+02 3.143e+02, threshold=3.125e+02, percent-clipped=0.0 2023-12-04 04:28:22,050 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=121466.66666666667, ans=0.125 2023-12-04 04:28:29,153 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=121533.33333333333, ans=0.07 2023-12-04 04:28:41,233 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.45 vs. limit=6.0 2023-12-04 04:28:51,491 INFO [train.py:1087] (2/4) Epoch 21, batch 350, loss[loss=0.1684, simple_loss=0.26, pruned_loss=0.03845, over 24769.00 frames. ], tot_loss[loss=0.1824, simple_loss=0.2705, pruned_loss=0.04712, over 3968694.39 frames. 
], batch size: 64, lr: 1.15e-02, grad_scale: 8.0 2023-12-04 04:28:52,230 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=121666.66666666667, ans=0.125 2023-12-04 04:29:28,192 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=121866.66666666667, ans=0.125 2023-12-04 04:29:29,674 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=121866.66666666667, ans=0.125 2023-12-04 04:29:47,546 INFO [train.py:1087] (2/4) Epoch 21, batch 400, loss[loss=0.1847, simple_loss=0.2731, pruned_loss=0.04816, over 21251.00 frames. ], tot_loss[loss=0.1818, simple_loss=0.2701, pruned_loss=0.04676, over 4155570.78 frames. ], batch size: 127, lr: 1.15e-02, grad_scale: 16.0 2023-12-04 04:29:53,570 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=122000.0, ans=10.0 2023-12-04 04:30:12,438 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.189e+02 1.444e+02 1.634e+02 1.974e+02 2.926e+02, threshold=3.268e+02, percent-clipped=0.0 2023-12-04 04:30:21,113 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=122200.0, ans=0.05 2023-12-04 04:30:31,795 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=122266.66666666667, ans=0.0 2023-12-04 04:30:38,573 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=122266.66666666667, ans=0.125 2023-12-04 04:30:43,588 INFO [train.py:1087] (2/4) Epoch 21, batch 450, loss[loss=0.1737, simple_loss=0.2625, pruned_loss=0.04248, over 24802.00 frames. ], tot_loss[loss=0.1824, simple_loss=0.2704, pruned_loss=0.04727, over 4297416.94 frames. ], batch size: 70, lr: 1.15e-02, grad_scale: 16.0 2023-12-04 04:30:50,307 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=122333.33333333333, ans=0.125 2023-12-04 04:31:16,930 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=122533.33333333333, ans=0.0 2023-12-04 04:31:21,266 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=122533.33333333333, ans=0.0 2023-12-04 04:31:31,644 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=122600.0, ans=0.125 2023-12-04 04:31:39,268 INFO [train.py:1087] (2/4) Epoch 21, batch 500, loss[loss=0.1827, simple_loss=0.2701, pruned_loss=0.04763, over 24740.00 frames. ], tot_loss[loss=0.1817, simple_loss=0.2697, pruned_loss=0.0468, over 4431094.65 frames. ], batch size: 66, lr: 1.15e-02, grad_scale: 8.0 2023-12-04 04:31:46,928 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=122666.66666666667, ans=0.0 2023-12-04 04:32:03,942 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.232e+02 1.484e+02 1.598e+02 1.764e+02 4.123e+02, threshold=3.196e+02, percent-clipped=1.0 2023-12-04 04:32:16,259 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.06 vs. 
limit=10.0 2023-12-04 04:32:22,286 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:32:29,633 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=122933.33333333333, ans=0.0 2023-12-04 04:32:34,081 INFO [train.py:1087] (2/4) Epoch 21, batch 550, loss[loss=0.1874, simple_loss=0.2699, pruned_loss=0.0524, over 24188.00 frames. ], tot_loss[loss=0.1822, simple_loss=0.27, pruned_loss=0.0472, over 4486825.71 frames. ], batch size: 82, lr: 1.15e-02, grad_scale: 8.0 2023-12-04 04:32:41,475 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=123000.0, ans=0.125 2023-12-04 04:32:41,873 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=15.0 2023-12-04 04:32:47,801 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=123066.66666666667, ans=0.5 2023-12-04 04:32:51,988 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=123066.66666666667, ans=0.125 2023-12-04 04:33:10,285 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=123200.0, ans=0.0 2023-12-04 04:33:29,596 INFO [train.py:1087] (2/4) Epoch 21, batch 600, loss[loss=0.1714, simple_loss=0.2632, pruned_loss=0.03982, over 24773.00 frames. ], tot_loss[loss=0.1824, simple_loss=0.2703, pruned_loss=0.04724, over 4551006.54 frames. ], batch size: 70, lr: 1.15e-02, grad_scale: 8.0 2023-12-04 04:33:41,166 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=123400.0, ans=0.125 2023-12-04 04:33:41,445 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-12-04 04:33:55,485 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.192e+02 1.454e+02 1.634e+02 1.764e+02 2.243e+02, threshold=3.267e+02, percent-clipped=0.0 2023-12-04 04:34:10,600 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.73 vs. limit=22.5 2023-12-04 04:34:25,801 INFO [train.py:1087] (2/4) Epoch 21, batch 650, loss[loss=0.1811, simple_loss=0.269, pruned_loss=0.04659, over 24459.00 frames. ], tot_loss[loss=0.183, simple_loss=0.2709, pruned_loss=0.0476, over 4583306.63 frames. 
], batch size: 77, lr: 1.14e-02, grad_scale: 8.0 2023-12-04 04:34:26,208 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=123666.66666666667, ans=0.0 2023-12-04 04:34:33,561 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=123666.66666666667, ans=0.125 2023-12-04 04:34:38,304 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=123733.33333333333, ans=0.125 2023-12-04 04:34:39,757 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:34:57,336 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=123800.0, ans=0.0 2023-12-04 04:34:58,433 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=123866.66666666667, ans=0.125 2023-12-04 04:34:59,487 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=123866.66666666667, ans=0.0 2023-12-04 04:35:04,809 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=123866.66666666667, ans=0.125 2023-12-04 04:35:11,942 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=123933.33333333333, ans=0.05 2023-12-04 04:35:13,567 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.62 vs. limit=12.0 2023-12-04 04:35:16,578 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=123933.33333333333, ans=0.2 2023-12-04 04:35:22,127 INFO [train.py:1087] (2/4) Epoch 21, batch 700, loss[loss=0.1803, simple_loss=0.2722, pruned_loss=0.04417, over 24601.00 frames. ], tot_loss[loss=0.1823, simple_loss=0.2703, pruned_loss=0.04715, over 4637558.36 frames. ], batch size: 68, lr: 1.14e-02, grad_scale: 8.0 2023-12-04 04:35:29,772 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=124000.0, ans=0.125 2023-12-04 04:35:42,536 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=124066.66666666667, ans=0.125 2023-12-04 04:35:42,667 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.21 vs. limit=15.0 2023-12-04 04:35:47,783 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.184e+02 1.416e+02 1.577e+02 1.741e+02 2.814e+02, threshold=3.154e+02, percent-clipped=0.0 2023-12-04 04:35:54,143 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.14 vs. limit=22.5 2023-12-04 04:36:09,267 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=124266.66666666667, ans=0.0 2023-12-04 04:36:18,173 INFO [train.py:1087] (2/4) Epoch 21, batch 750, loss[loss=0.1839, simple_loss=0.269, pruned_loss=0.04938, over 22536.00 frames. ], tot_loss[loss=0.1819, simple_loss=0.2701, pruned_loss=0.04689, over 4680243.57 frames. 
], batch size: 54, lr: 1.14e-02, grad_scale: 8.0 2023-12-04 04:36:20,799 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.22 vs. limit=15.0 2023-12-04 04:36:22,015 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=15.0 2023-12-04 04:36:26,431 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.43 vs. limit=15.0 2023-12-04 04:36:33,009 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.33 vs. limit=15.0 2023-12-04 04:36:44,160 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=124466.66666666667, ans=0.125 2023-12-04 04:36:58,495 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=124533.33333333333, ans=0.125 2023-12-04 04:37:13,039 INFO [train.py:1087] (2/4) Epoch 21, batch 800, loss[loss=0.18, simple_loss=0.2665, pruned_loss=0.04678, over 24713.00 frames. ], tot_loss[loss=0.1817, simple_loss=0.27, pruned_loss=0.04676, over 4702966.22 frames. ], batch size: 69, lr: 1.14e-02, grad_scale: 16.0 2023-12-04 04:37:15,748 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=124666.66666666667, ans=0.0 2023-12-04 04:37:37,101 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.246e+02 1.466e+02 1.578e+02 1.737e+02 2.758e+02, threshold=3.156e+02, percent-clipped=0.0 2023-12-04 04:37:38,650 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.59 vs. limit=6.0 2023-12-04 04:37:44,465 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=124866.66666666667, ans=0.125 2023-12-04 04:37:52,402 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=124866.66666666667, ans=0.0 2023-12-04 04:38:00,473 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=124933.33333333333, ans=0.1 2023-12-04 04:38:04,652 INFO [train.py:1087] (2/4) Epoch 21, batch 850, loss[loss=0.2227, simple_loss=0.2979, pruned_loss=0.07374, over 17104.00 frames. ], tot_loss[loss=0.1821, simple_loss=0.2703, pruned_loss=0.04699, over 4710056.59 frames. 
], batch size: 177, lr: 1.14e-02, grad_scale: 16.0 2023-12-04 04:38:05,892 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=125000.0, ans=0.125 2023-12-04 04:38:27,543 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=125133.33333333333, ans=0.1 2023-12-04 04:38:31,579 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=125133.33333333333, ans=10.0 2023-12-04 04:38:34,561 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=125200.0, ans=0.09899494936611666 2023-12-04 04:38:44,953 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=125266.66666666667, ans=0.125 2023-12-04 04:38:45,799 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=125266.66666666667, ans=0.125 2023-12-04 04:39:02,746 INFO [train.py:1087] (2/4) Epoch 22, batch 0, loss[loss=0.1795, simple_loss=0.2674, pruned_loss=0.04584, over 24567.00 frames. ], tot_loss[loss=0.1795, simple_loss=0.2674, pruned_loss=0.04584, over 24567.00 frames. ], batch size: 64, lr: 1.11e-02, grad_scale: 32.0 2023-12-04 04:39:02,747 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 04:39:15,182 INFO [train.py:1119] (2/4) Epoch 22, validation: loss=0.1596, simple_loss=0.2606, pruned_loss=0.0293, over 944034.00 frames. 2023-12-04 04:39:15,183 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 04:39:25,201 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=125366.66666666667, ans=0.1 2023-12-04 04:39:38,389 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=125433.33333333333, ans=0.125 2023-12-04 04:39:45,901 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.163e+02 1.445e+02 1.599e+02 1.752e+02 2.849e+02, threshold=3.197e+02, percent-clipped=0.0 2023-12-04 04:40:03,815 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.52 vs. limit=15.0 2023-12-04 04:40:10,030 INFO [train.py:1087] (2/4) Epoch 22, batch 50, loss[loss=0.1776, simple_loss=0.2678, pruned_loss=0.04366, over 24582.00 frames. ], tot_loss[loss=0.1824, simple_loss=0.2705, pruned_loss=0.04719, over 1086350.95 frames. ], batch size: 65, lr: 1.11e-02, grad_scale: 32.0 2023-12-04 04:40:36,912 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=125766.66666666667, ans=0.07 2023-12-04 04:40:46,143 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-12-04 04:41:04,959 INFO [train.py:1087] (2/4) Epoch 22, batch 100, loss[loss=0.169, simple_loss=0.2569, pruned_loss=0.04061, over 24859.00 frames. ], tot_loss[loss=0.1813, simple_loss=0.2695, pruned_loss=0.04649, over 1905714.52 frames. 
], batch size: 68, lr: 1.11e-02, grad_scale: 32.0 2023-12-04 04:41:05,201 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=125966.66666666667, ans=0.125 2023-12-04 04:41:15,185 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:41:16,074 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=126033.33333333333, ans=0.125 2023-12-04 04:41:19,635 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=126033.33333333333, ans=0.125 2023-12-04 04:41:26,062 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:41:31,309 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=126100.0, ans=0.025 2023-12-04 04:41:33,648 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.46 vs. limit=6.0 2023-12-04 04:41:35,558 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.220e+02 1.440e+02 1.549e+02 1.793e+02 2.554e+02, threshold=3.098e+02, percent-clipped=0.0 2023-12-04 04:41:38,919 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=126166.66666666667, ans=0.125 2023-12-04 04:41:48,408 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=126233.33333333333, ans=0.125 2023-12-04 04:41:49,543 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=126233.33333333333, ans=0.125 2023-12-04 04:41:55,091 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=126233.33333333333, ans=0.125 2023-12-04 04:42:00,157 INFO [train.py:1087] (2/4) Epoch 22, batch 150, loss[loss=0.1683, simple_loss=0.2584, pruned_loss=0.03914, over 24571.00 frames. ], tot_loss[loss=0.1816, simple_loss=0.27, pruned_loss=0.04667, over 2547386.24 frames. ], batch size: 64, lr: 1.11e-02, grad_scale: 32.0 2023-12-04 04:42:34,075 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.69 vs. limit=6.0 2023-12-04 04:42:54,369 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.74 vs. limit=15.0 2023-12-04 04:42:55,947 INFO [train.py:1087] (2/4) Epoch 22, batch 200, loss[loss=0.194, simple_loss=0.2747, pruned_loss=0.05666, over 24492.00 frames. ], tot_loss[loss=0.1817, simple_loss=0.2699, pruned_loss=0.04677, over 3036712.48 frames. 
], batch size: 75, lr: 1.11e-02, grad_scale: 16.0 2023-12-04 04:43:12,480 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=126700.0, ans=0.2 2023-12-04 04:43:18,961 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=126766.66666666667, ans=0.1 2023-12-04 04:43:28,330 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.158e+02 1.405e+02 1.560e+02 1.711e+02 3.260e+02, threshold=3.121e+02, percent-clipped=1.0 2023-12-04 04:43:33,587 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.71 vs. limit=15.0 2023-12-04 04:43:45,793 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:43:46,813 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=126900.0, ans=0.0 2023-12-04 04:43:50,038 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=126900.0, ans=0.0 2023-12-04 04:43:51,865 INFO [train.py:1087] (2/4) Epoch 22, batch 250, loss[loss=0.1934, simple_loss=0.2788, pruned_loss=0.05401, over 21039.00 frames. ], tot_loss[loss=0.1806, simple_loss=0.2691, pruned_loss=0.0461, over 3428664.75 frames. ], batch size: 127, lr: 1.10e-02, grad_scale: 16.0 2023-12-04 04:44:02,896 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=127033.33333333333, ans=0.1 2023-12-04 04:44:04,358 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=127033.33333333333, ans=0.125 2023-12-04 04:44:09,549 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=127033.33333333333, ans=0.1 2023-12-04 04:44:19,798 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=127100.0, ans=0.05 2023-12-04 04:44:21,366 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.93 vs. limit=15.0 2023-12-04 04:44:37,154 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=127233.33333333333, ans=0.2 2023-12-04 04:44:47,173 INFO [train.py:1087] (2/4) Epoch 22, batch 300, loss[loss=0.1662, simple_loss=0.2569, pruned_loss=0.03775, over 24768.00 frames. ], tot_loss[loss=0.1809, simple_loss=0.2692, pruned_loss=0.04628, over 3722858.55 frames. ], batch size: 73, lr: 1.10e-02, grad_scale: 16.0 2023-12-04 04:44:52,705 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=127300.0, ans=0.09899494936611666 2023-12-04 04:45:02,456 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.90 vs. 
limit=22.5 2023-12-04 04:45:04,291 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=127366.66666666667, ans=0.0 2023-12-04 04:45:18,868 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.295e+02 1.432e+02 1.560e+02 1.813e+02 2.392e+02, threshold=3.120e+02, percent-clipped=0.0 2023-12-04 04:45:33,853 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=127566.66666666667, ans=0.125 2023-12-04 04:45:40,896 INFO [train.py:1087] (2/4) Epoch 22, batch 350, loss[loss=0.1911, simple_loss=0.2796, pruned_loss=0.05135, over 22737.00 frames. ], tot_loss[loss=0.1806, simple_loss=0.2691, pruned_loss=0.04604, over 3977953.22 frames. ], batch size: 106, lr: 1.10e-02, grad_scale: 16.0 2023-12-04 04:45:46,303 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=127633.33333333333, ans=0.1 2023-12-04 04:46:07,363 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=127766.66666666667, ans=0.2 2023-12-04 04:46:12,089 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.70 vs. limit=10.0 2023-12-04 04:46:12,567 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=127766.66666666667, ans=0.1 2023-12-04 04:46:33,330 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=127900.0, ans=0.125 2023-12-04 04:46:37,415 INFO [train.py:1087] (2/4) Epoch 22, batch 400, loss[loss=0.1863, simple_loss=0.2727, pruned_loss=0.04991, over 21242.00 frames. ], tot_loss[loss=0.181, simple_loss=0.2694, pruned_loss=0.0463, over 4145540.78 frames. ], batch size: 127, lr: 1.10e-02, grad_scale: 32.0 2023-12-04 04:46:45,461 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=127966.66666666667, ans=0.2 2023-12-04 04:46:59,849 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.96 vs. limit=15.0 2023-12-04 04:47:00,608 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=128100.0, ans=0.1 2023-12-04 04:47:03,718 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=128100.0, ans=0.1 2023-12-04 04:47:09,771 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.178e+02 1.423e+02 1.544e+02 1.672e+02 2.386e+02, threshold=3.087e+02, percent-clipped=0.0 2023-12-04 04:47:21,072 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=128233.33333333333, ans=0.125 2023-12-04 04:47:22,727 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.71 vs. limit=15.0 2023-12-04 04:47:24,935 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=128233.33333333333, ans=0.2 2023-12-04 04:47:28,705 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.99 vs. 
limit=15.0 2023-12-04 04:47:33,188 INFO [train.py:1087] (2/4) Epoch 22, batch 450, loss[loss=0.1881, simple_loss=0.28, pruned_loss=0.04808, over 21475.00 frames. ], tot_loss[loss=0.1805, simple_loss=0.2688, pruned_loss=0.0461, over 4284182.42 frames. ], batch size: 128, lr: 1.10e-02, grad_scale: 16.0 2023-12-04 04:47:47,245 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=128366.66666666667, ans=0.125 2023-12-04 04:48:20,691 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=128566.66666666667, ans=0.125 2023-12-04 04:48:22,953 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=128566.66666666667, ans=0.035 2023-12-04 04:48:29,320 INFO [train.py:1087] (2/4) Epoch 22, batch 500, loss[loss=0.1938, simple_loss=0.2865, pruned_loss=0.0506, over 21651.00 frames. ], tot_loss[loss=0.1805, simple_loss=0.2687, pruned_loss=0.04615, over 4406781.40 frames. ], batch size: 127, lr: 1.10e-02, grad_scale: 16.0 2023-12-04 04:48:30,801 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.02 vs. limit=22.5 2023-12-04 04:48:32,605 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=128633.33333333333, ans=0.125 2023-12-04 04:48:33,823 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=128633.33333333333, ans=0.0 2023-12-04 04:48:47,871 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.95 vs. limit=6.0 2023-12-04 04:48:54,809 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=128766.66666666667, ans=0.2 2023-12-04 04:48:56,935 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=128766.66666666667, ans=0.125 2023-12-04 04:49:01,867 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=128833.33333333333, ans=0.125 2023-12-04 04:49:02,609 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.157e+02 1.487e+02 1.631e+02 1.827e+02 2.916e+02, threshold=3.262e+02, percent-clipped=0.0 2023-12-04 04:49:08,041 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=128833.33333333333, ans=0.0 2023-12-04 04:49:14,634 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.40 vs. limit=22.5 2023-12-04 04:49:24,096 INFO [train.py:1087] (2/4) Epoch 22, batch 550, loss[loss=0.1839, simple_loss=0.2714, pruned_loss=0.04822, over 23567.00 frames. ], tot_loss[loss=0.1806, simple_loss=0.269, pruned_loss=0.04611, over 4493155.90 frames. 
], batch size: 94, lr: 1.10e-02, grad_scale: 16.0 2023-12-04 04:49:31,366 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=128966.66666666667, ans=0.125 2023-12-04 04:49:35,009 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=129033.33333333333, ans=0.07 2023-12-04 04:49:54,418 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=129100.0, ans=0.125 2023-12-04 04:50:08,745 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.25 vs. limit=15.0 2023-12-04 04:50:14,717 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=129233.33333333333, ans=0.125 2023-12-04 04:50:15,778 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=129233.33333333333, ans=0.125 2023-12-04 04:50:19,880 INFO [train.py:1087] (2/4) Epoch 22, batch 600, loss[loss=0.1853, simple_loss=0.2716, pruned_loss=0.04953, over 24582.00 frames. ], tot_loss[loss=0.1809, simple_loss=0.2693, pruned_loss=0.04624, over 4560951.45 frames. ], batch size: 65, lr: 1.10e-02, grad_scale: 16.0 2023-12-04 04:50:53,446 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.181e+02 1.429e+02 1.565e+02 1.721e+02 2.657e+02, threshold=3.131e+02, percent-clipped=0.0 2023-12-04 04:51:01,817 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=129500.0, ans=0.125 2023-12-04 04:51:07,414 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=129566.66666666667, ans=0.125 2023-12-04 04:51:07,546 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=129566.66666666667, ans=0.125 2023-12-04 04:51:12,735 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=129566.66666666667, ans=0.125 2023-12-04 04:51:15,939 INFO [train.py:1087] (2/4) Epoch 22, batch 650, loss[loss=0.1779, simple_loss=0.2689, pruned_loss=0.0435, over 24772.00 frames. ], tot_loss[loss=0.1805, simple_loss=0.2688, pruned_loss=0.0461, over 4611686.71 frames. ], batch size: 64, lr: 1.09e-02, grad_scale: 16.0 2023-12-04 04:51:16,146 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=129633.33333333333, ans=0.05 2023-12-04 04:51:16,290 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=129633.33333333333, ans=0.125 2023-12-04 04:51:17,185 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=129633.33333333333, ans=0.125 2023-12-04 04:51:21,837 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.65 vs. 
limit=15.0 2023-12-04 04:51:22,557 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=129633.33333333333, ans=0.0 2023-12-04 04:51:31,736 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.84 vs. limit=15.0 2023-12-04 04:51:42,790 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=129766.66666666667, ans=0.0 2023-12-04 04:52:10,916 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.77 vs. limit=10.0 2023-12-04 04:52:11,362 INFO [train.py:1087] (2/4) Epoch 22, batch 700, loss[loss=0.1755, simple_loss=0.2668, pruned_loss=0.04209, over 24611.00 frames. ], tot_loss[loss=0.1804, simple_loss=0.2687, pruned_loss=0.04609, over 4655178.02 frames. ], batch size: 68, lr: 1.09e-02, grad_scale: 16.0 2023-12-04 04:52:22,814 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.87 vs. limit=15.0 2023-12-04 04:52:34,614 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=130100.0, ans=0.125 2023-12-04 04:52:44,310 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.157e+02 1.399e+02 1.549e+02 1.743e+02 2.457e+02, threshold=3.098e+02, percent-clipped=0.0 2023-12-04 04:52:57,443 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=130233.33333333333, ans=0.125 2023-12-04 04:52:59,611 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=130233.33333333333, ans=0.125 2023-12-04 04:53:03,117 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=130233.33333333333, ans=0.125 2023-12-04 04:53:06,046 INFO [train.py:1087] (2/4) Epoch 22, batch 750, loss[loss=0.1728, simple_loss=0.2641, pruned_loss=0.04071, over 24548.00 frames. ], tot_loss[loss=0.1797, simple_loss=0.2682, pruned_loss=0.04561, over 4692733.88 frames. ], batch size: 77, lr: 1.09e-02, grad_scale: 16.0 2023-12-04 04:53:16,253 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:53:20,777 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.32 vs. limit=22.5 2023-12-04 04:53:43,870 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=130500.0, ans=0.07 2023-12-04 04:53:51,560 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.06 vs. limit=22.5 2023-12-04 04:53:55,509 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=130566.66666666667, ans=0.0 2023-12-04 04:53:55,570 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=130566.66666666667, ans=0.0 2023-12-04 04:54:00,952 INFO [train.py:1087] (2/4) Epoch 22, batch 800, loss[loss=0.2028, simple_loss=0.2876, pruned_loss=0.05903, over 22694.00 frames. 
], tot_loss[loss=0.1796, simple_loss=0.2679, pruned_loss=0.04563, over 4713284.36 frames. ], batch size: 106, lr: 1.09e-02, grad_scale: 32.0 2023-12-04 04:54:01,198 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=130633.33333333333, ans=0.0 2023-12-04 04:54:06,153 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=6.36 vs. limit=12.0 2023-12-04 04:54:16,317 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=130700.0, ans=0.1 2023-12-04 04:54:23,306 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:54:23,351 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=130766.66666666667, ans=0.125 2023-12-04 04:54:33,122 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.151e+02 1.431e+02 1.591e+02 1.794e+02 2.662e+02, threshold=3.181e+02, percent-clipped=0.0 2023-12-04 04:54:41,435 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=130833.33333333333, ans=0.1 2023-12-04 04:54:53,214 INFO [train.py:1087] (2/4) Epoch 22, batch 850, loss[loss=0.1732, simple_loss=0.2638, pruned_loss=0.04131, over 24289.00 frames. ], tot_loss[loss=0.1804, simple_loss=0.2684, pruned_loss=0.04616, over 4704259.02 frames. ], batch size: 79, lr: 1.09e-02, grad_scale: 32.0 2023-12-04 04:54:59,773 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.94 vs. limit=22.5 2023-12-04 04:55:06,574 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten.whitening_limit, batch_count=131033.33333333333, ans=22.5 2023-12-04 04:55:11,701 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=131033.33333333333, ans=0.0 2023-12-04 04:55:23,379 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.39 vs. limit=15.0 2023-12-04 04:55:29,321 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.19 vs. limit=15.0 2023-12-04 04:55:32,054 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=15.0 2023-12-04 04:55:35,096 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.69 vs. limit=15.0 2023-12-04 04:55:45,878 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.54 vs. limit=15.0 2023-12-04 04:55:55,149 INFO [train.py:1087] (2/4) Epoch 23, batch 0, loss[loss=0.1822, simple_loss=0.2696, pruned_loss=0.04737, over 24491.00 frames. ], tot_loss[loss=0.1822, simple_loss=0.2696, pruned_loss=0.04737, over 24491.00 frames. 
], batch size: 77, lr: 1.06e-02, grad_scale: 32.0 2023-12-04 04:55:55,150 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 04:56:07,333 INFO [train.py:1119] (2/4) Epoch 23, validation: loss=0.1586, simple_loss=0.2601, pruned_loss=0.02859, over 944034.00 frames. 2023-12-04 04:56:07,334 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 04:56:10,690 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=131266.66666666666, ans=0.125 2023-12-04 04:56:38,471 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=131400.0, ans=12.0 2023-12-04 04:56:43,937 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=131466.66666666666, ans=0.125 2023-12-04 04:56:45,819 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.211e+02 1.445e+02 1.626e+02 1.881e+02 3.357e+02, threshold=3.253e+02, percent-clipped=2.0 2023-12-04 04:57:02,807 INFO [train.py:1087] (2/4) Epoch 23, batch 50, loss[loss=0.1871, simple_loss=0.2716, pruned_loss=0.05134, over 24448.00 frames. ], tot_loss[loss=0.1779, simple_loss=0.2674, pruned_loss=0.04418, over 1086164.95 frames. ], batch size: 77, lr: 1.06e-02, grad_scale: 32.0 2023-12-04 04:57:06,213 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=131600.0, ans=0.0 2023-12-04 04:57:06,305 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=131600.0, ans=0.0 2023-12-04 04:57:14,172 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=131666.66666666666, ans=0.2 2023-12-04 04:57:19,484 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=131666.66666666666, ans=0.125 2023-12-04 04:57:21,590 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=131666.66666666666, ans=0.125 2023-12-04 04:57:26,269 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=131733.33333333334, ans=0.0 2023-12-04 04:57:26,270 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:57:48,316 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=131866.66666666666, ans=0.07 2023-12-04 04:57:57,949 INFO [train.py:1087] (2/4) Epoch 23, batch 100, loss[loss=0.1792, simple_loss=0.2688, pruned_loss=0.0448, over 24795.00 frames. ], tot_loss[loss=0.1773, simple_loss=0.2664, pruned_loss=0.04406, over 1908759.99 frames. ], batch size: 73, lr: 1.06e-02, grad_scale: 32.0 2023-12-04 04:58:00,744 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=131933.33333333334, ans=0.125 2023-12-04 04:58:04,298 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.68 vs. 
limit=15.0 2023-12-04 04:58:25,012 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.17 vs. limit=22.5 2023-12-04 04:58:32,306 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.14 vs. limit=15.0 2023-12-04 04:58:37,411 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.142e+02 1.409e+02 1.496e+02 1.675e+02 3.177e+02, threshold=2.991e+02, percent-clipped=0.0 2023-12-04 04:58:44,437 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-12-04 04:58:53,607 INFO [train.py:1087] (2/4) Epoch 23, batch 150, loss[loss=0.1685, simple_loss=0.2563, pruned_loss=0.04033, over 24551.00 frames. ], tot_loss[loss=0.1768, simple_loss=0.2663, pruned_loss=0.04371, over 2572715.37 frames. ], batch size: 62, lr: 1.06e-02, grad_scale: 32.0 2023-12-04 04:59:00,496 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=132266.66666666666, ans=0.125 2023-12-04 04:59:03,023 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=132266.66666666666, ans=10.0 2023-12-04 04:59:07,626 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=132333.33333333334, ans=0.125 2023-12-04 04:59:34,215 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=132466.66666666666, ans=0.0 2023-12-04 04:59:34,248 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=132466.66666666666, ans=0.125 2023-12-04 04:59:34,343 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=132466.66666666666, ans=0.1 2023-12-04 04:59:43,542 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.07 vs. limit=15.0 2023-12-04 04:59:49,247 INFO [train.py:1087] (2/4) Epoch 23, batch 200, loss[loss=0.1745, simple_loss=0.2632, pruned_loss=0.04294, over 24741.00 frames. ], tot_loss[loss=0.1786, simple_loss=0.2674, pruned_loss=0.04495, over 3061953.56 frames. ], batch size: 63, lr: 1.06e-02, grad_scale: 32.0 2023-12-04 05:00:18,748 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=132733.33333333334, ans=0.1 2023-12-04 05:00:25,435 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=4.46 vs. limit=15.0 2023-12-04 05:00:27,845 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.207e+02 1.403e+02 1.524e+02 1.738e+02 3.583e+02, threshold=3.048e+02, percent-clipped=1.0 2023-12-04 05:00:42,495 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=132866.66666666666, ans=0.125 2023-12-04 05:00:45,488 INFO [train.py:1087] (2/4) Epoch 23, batch 250, loss[loss=0.1716, simple_loss=0.2598, pruned_loss=0.04171, over 24768.00 frames. 
], tot_loss[loss=0.1784, simple_loss=0.2672, pruned_loss=0.04482, over 3455783.98 frames. ], batch size: 65, lr: 1.06e-02, grad_scale: 32.0 2023-12-04 05:00:45,698 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 05:00:55,346 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=133000.0, ans=0.1 2023-12-04 05:01:03,923 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=133000.0, ans=0.04949747468305833 2023-12-04 05:01:21,924 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=133133.33333333334, ans=0.2 2023-12-04 05:01:24,249 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.56 vs. limit=15.0 2023-12-04 05:01:25,104 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=133133.33333333334, ans=0.0 2023-12-04 05:01:28,478 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=133200.0, ans=0.0 2023-12-04 05:01:40,334 INFO [train.py:1087] (2/4) Epoch 23, batch 300, loss[loss=0.1836, simple_loss=0.2755, pruned_loss=0.04583, over 23500.00 frames. ], tot_loss[loss=0.1777, simple_loss=0.2667, pruned_loss=0.04432, over 3769004.50 frames. ], batch size: 94, lr: 1.06e-02, grad_scale: 32.0 2023-12-04 05:01:40,893 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.82 vs. limit=15.0 2023-12-04 05:01:44,492 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=133266.66666666666, ans=0.125 2023-12-04 05:02:06,937 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=133400.0, ans=0.0 2023-12-04 05:02:17,750 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=133466.66666666666, ans=0.1 2023-12-04 05:02:21,685 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.188e+02 1.411e+02 1.532e+02 1.698e+02 2.321e+02, threshold=3.065e+02, percent-clipped=0.0 2023-12-04 05:02:29,630 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0 2023-12-04 05:02:31,473 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=133533.33333333334, ans=0.125 2023-12-04 05:02:37,643 INFO [train.py:1087] (2/4) Epoch 23, batch 350, loss[loss=0.1831, simple_loss=0.2726, pruned_loss=0.04682, over 24760.00 frames. ], tot_loss[loss=0.1787, simple_loss=0.2675, pruned_loss=0.04498, over 3975854.64 frames. 
], batch size: 70, lr: 1.06e-02, grad_scale: 32.0 2023-12-04 05:02:38,950 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=133600.0, ans=0.05 2023-12-04 05:02:39,009 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=133600.0, ans=0.0 2023-12-04 05:02:44,718 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 05:03:07,294 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=133733.33333333334, ans=0.125 2023-12-04 05:03:16,653 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=133800.0, ans=0.0 2023-12-04 05:03:19,098 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=133800.0, ans=0.1 2023-12-04 05:03:32,706 INFO [train.py:1087] (2/4) Epoch 23, batch 400, loss[loss=0.1953, simple_loss=0.2785, pruned_loss=0.05609, over 22807.00 frames. ], tot_loss[loss=0.1791, simple_loss=0.2674, pruned_loss=0.04535, over 4133005.86 frames. ], batch size: 106, lr: 1.05e-02, grad_scale: 32.0 2023-12-04 05:03:38,178 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=133933.33333333334, ans=0.125 2023-12-04 05:03:54,367 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=134066.66666666666, ans=0.0 2023-12-04 05:03:56,484 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=134066.66666666666, ans=0.1 2023-12-04 05:04:03,768 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=134066.66666666666, ans=0.1 2023-12-04 05:04:12,108 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.207e+02 1.404e+02 1.564e+02 1.778e+02 2.505e+02, threshold=3.128e+02, percent-clipped=0.0 2023-12-04 05:04:28,733 INFO [train.py:1087] (2/4) Epoch 23, batch 450, loss[loss=0.167, simple_loss=0.2537, pruned_loss=0.04021, over 24701.00 frames. ], tot_loss[loss=0.1779, simple_loss=0.2664, pruned_loss=0.04473, over 4285579.45 frames. ], batch size: 74, lr: 1.05e-02, grad_scale: 32.0 2023-12-04 05:04:49,788 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=134400.0, ans=0.125 2023-12-04 05:04:52,207 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=134400.0, ans=0.1 2023-12-04 05:05:06,400 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=134466.66666666666, ans=0.2 2023-12-04 05:05:19,892 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=134533.33333333334, ans=0.0 2023-12-04 05:05:24,235 INFO [train.py:1087] (2/4) Epoch 23, batch 500, loss[loss=0.1891, simple_loss=0.2766, pruned_loss=0.05077, over 23994.00 frames. ], tot_loss[loss=0.1791, simple_loss=0.2674, pruned_loss=0.04542, over 4364194.42 frames. 
], batch size: 87, lr: 1.05e-02, grad_scale: 16.0 2023-12-04 05:05:25,563 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 05:05:44,154 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=134666.66666666666, ans=0.125 2023-12-04 05:06:04,788 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=134800.0, ans=0.05 2023-12-04 05:06:05,593 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.182e+02 1.442e+02 1.601e+02 1.815e+02 2.698e+02, threshold=3.202e+02, percent-clipped=0.0 2023-12-04 05:06:08,958 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=134866.66666666666, ans=0.2 2023-12-04 05:06:19,118 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=134933.33333333334, ans=0.0 2023-12-04 05:06:19,790 INFO [train.py:1087] (2/4) Epoch 23, batch 550, loss[loss=0.1909, simple_loss=0.28, pruned_loss=0.05088, over 24334.00 frames. ], tot_loss[loss=0.179, simple_loss=0.2674, pruned_loss=0.04526, over 4468801.63 frames. ], batch size: 79, lr: 1.05e-02, grad_scale: 16.0 2023-12-04 05:06:27,497 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.88 vs. limit=22.5 2023-12-04 05:06:32,467 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=135000.0, ans=0.1 2023-12-04 05:06:44,413 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=135066.66666666666, ans=0.125 2023-12-04 05:06:46,510 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=135066.66666666666, ans=0.125 2023-12-04 05:06:52,082 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=135133.33333333334, ans=0.0 2023-12-04 05:06:52,106 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=135133.33333333334, ans=0.0 2023-12-04 05:06:55,696 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=135133.33333333334, ans=0.0 2023-12-04 05:07:04,462 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=135200.0, ans=0.125 2023-12-04 05:07:15,071 INFO [train.py:1087] (2/4) Epoch 23, batch 600, loss[loss=0.1637, simple_loss=0.2589, pruned_loss=0.03423, over 24867.00 frames. ], tot_loss[loss=0.1795, simple_loss=0.2679, pruned_loss=0.04557, over 4532248.35 frames. 
], batch size: 68, lr: 1.05e-02, grad_scale: 8.0 2023-12-04 05:07:22,894 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=135266.66666666666, ans=0.125 2023-12-04 05:07:26,477 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=135333.33333333334, ans=0.1 2023-12-04 05:07:27,963 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=135333.33333333334, ans=0.0 2023-12-04 05:07:34,302 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=135333.33333333334, ans=0.05 2023-12-04 05:07:39,888 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=135400.0, ans=0.1 2023-12-04 05:07:43,140 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=135400.0, ans=0.0 2023-12-04 05:07:49,850 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=15.23 vs. limit=15.0 2023-12-04 05:07:57,305 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.178e+02 1.383e+02 1.459e+02 1.628e+02 2.950e+02, threshold=2.918e+02, percent-clipped=0.0 2023-12-04 05:08:06,674 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=135533.33333333334, ans=0.1 2023-12-04 05:08:10,607 INFO [train.py:1087] (2/4) Epoch 23, batch 650, loss[loss=0.1698, simple_loss=0.2629, pruned_loss=0.03836, over 24745.00 frames. ], tot_loss[loss=0.1794, simple_loss=0.2679, pruned_loss=0.04546, over 4587013.49 frames. ], batch size: 66, lr: 1.05e-02, grad_scale: 8.0 2023-12-04 05:08:10,867 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=135600.0, ans=0.0 2023-12-04 05:08:26,404 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=135666.66666666666, ans=0.125 2023-12-04 05:08:55,546 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.42 vs. limit=22.5 2023-12-04 05:09:06,362 INFO [train.py:1087] (2/4) Epoch 23, batch 700, loss[loss=0.1669, simple_loss=0.2587, pruned_loss=0.03755, over 24767.00 frames. ], tot_loss[loss=0.179, simple_loss=0.2675, pruned_loss=0.04523, over 4631321.90 frames. 
], batch size: 66, lr: 1.05e-02, grad_scale: 8.0 2023-12-04 05:09:09,873 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=135933.33333333334, ans=0.0 2023-12-04 05:09:22,256 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=136000.0, ans=0.0 2023-12-04 05:09:23,398 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=136000.0, ans=0.09899494936611666 2023-12-04 05:09:41,423 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=136133.33333333334, ans=0.125 2023-12-04 05:09:45,771 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=136133.33333333334, ans=0.0 2023-12-04 05:09:48,658 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.220e+02 1.387e+02 1.558e+02 1.692e+02 2.997e+02, threshold=3.117e+02, percent-clipped=1.0 2023-12-04 05:09:53,347 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=136200.0, ans=0.125 2023-12-04 05:10:00,259 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0 2023-12-04 05:10:02,087 INFO [train.py:1087] (2/4) Epoch 23, batch 750, loss[loss=0.1708, simple_loss=0.2598, pruned_loss=0.04091, over 24742.00 frames. ], tot_loss[loss=0.178, simple_loss=0.2668, pruned_loss=0.04461, over 4695157.89 frames. ], batch size: 63, lr: 1.05e-02, grad_scale: 8.0 2023-12-04 05:10:12,697 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=136333.33333333334, ans=0.125 2023-12-04 05:10:33,137 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=136400.0, ans=0.0 2023-12-04 05:10:51,093 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.85 vs. limit=15.0 2023-12-04 05:10:56,917 INFO [train.py:1087] (2/4) Epoch 23, batch 800, loss[loss=0.184, simple_loss=0.2709, pruned_loss=0.04855, over 24764.00 frames. ], tot_loss[loss=0.1769, simple_loss=0.2659, pruned_loss=0.04394, over 4738221.01 frames. ], batch size: 66, lr: 1.05e-02, grad_scale: 16.0 2023-12-04 05:11:18,286 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=136733.33333333334, ans=0.125 2023-12-04 05:11:18,292 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=136733.33333333334, ans=0.125 2023-12-04 05:11:37,274 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.190e+02 1.432e+02 1.525e+02 1.721e+02 2.600e+02, threshold=3.050e+02, percent-clipped=0.0 2023-12-04 05:11:38,550 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=136866.66666666666, ans=0.125 2023-12-04 05:11:49,175 INFO [train.py:1087] (2/4) Epoch 23, batch 850, loss[loss=0.2223, simple_loss=0.2953, pruned_loss=0.07467, over 17170.00 frames. ], tot_loss[loss=0.1773, simple_loss=0.2663, pruned_loss=0.04415, over 4747560.93 frames. 
], batch size: 178, lr: 1.04e-02, grad_scale: 16.0 2023-12-04 05:11:53,215 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=136933.33333333334, ans=0.125 2023-12-04 05:11:58,363 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=137000.0, ans=0.125 2023-12-04 05:11:59,574 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.27 vs. limit=15.0 2023-12-04 05:12:03,304 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=137000.0, ans=0.125 2023-12-04 05:12:08,378 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=137066.66666666666, ans=0.125 2023-12-04 05:12:51,365 INFO [train.py:1087] (2/4) Epoch 24, batch 0, loss[loss=0.1743, simple_loss=0.2666, pruned_loss=0.04097, over 24703.00 frames. ], tot_loss[loss=0.1743, simple_loss=0.2666, pruned_loss=0.04097, over 24703.00 frames. ], batch size: 74, lr: 1.02e-02, grad_scale: 32.0 2023-12-04 05:12:51,366 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 05:13:03,463 INFO [train.py:1119] (2/4) Epoch 24, validation: loss=0.1585, simple_loss=0.2596, pruned_loss=0.02867, over 944034.00 frames. 2023-12-04 05:13:03,464 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 05:13:08,863 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=137233.33333333334, ans=0.025 2023-12-04 05:13:25,735 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=137366.66666666666, ans=0.1 2023-12-04 05:13:37,168 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-12-04 05:13:40,102 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.20 vs. limit=15.0 2023-12-04 05:13:45,170 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=137433.33333333334, ans=0.0 2023-12-04 05:13:47,324 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=137500.0, ans=0.0 2023-12-04 05:13:50,533 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.171e+02 1.490e+02 1.634e+02 1.835e+02 3.241e+02, threshold=3.268e+02, percent-clipped=2.0 2023-12-04 05:13:53,189 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=137500.0, ans=0.1 2023-12-04 05:13:58,650 INFO [train.py:1087] (2/4) Epoch 24, batch 50, loss[loss=0.1726, simple_loss=0.2646, pruned_loss=0.04032, over 24572.00 frames. ], tot_loss[loss=0.1782, simple_loss=0.2671, pruned_loss=0.04463, over 1085205.09 frames. ], batch size: 65, lr: 1.02e-02, grad_scale: 32.0 2023-12-04 05:14:12,937 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.90 vs. 
limit=15.0 2023-12-04 05:14:15,566 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=137633.33333333334, ans=0.0 2023-12-04 05:14:29,472 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.51 vs. limit=15.0 2023-12-04 05:14:36,204 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-12-04 05:14:37,039 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=137766.66666666666, ans=0.125 2023-12-04 05:14:53,181 INFO [train.py:1087] (2/4) Epoch 24, batch 100, loss[loss=0.1708, simple_loss=0.2599, pruned_loss=0.04088, over 24556.00 frames. ], tot_loss[loss=0.1792, simple_loss=0.2681, pruned_loss=0.0451, over 1903959.28 frames. ], batch size: 63, lr: 1.02e-02, grad_scale: 16.0 2023-12-04 05:15:01,412 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=137900.0, ans=0.0 2023-12-04 05:15:02,797 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.34 vs. limit=15.0 2023-12-04 05:15:10,665 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.05 vs. limit=15.0 2023-12-04 05:15:31,518 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=15.0 2023-12-04 05:15:32,095 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=138100.0, ans=0.1 2023-12-04 05:15:41,782 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.250e+02 1.417e+02 1.577e+02 1.700e+02 2.458e+02, threshold=3.154e+02, percent-clipped=0.0 2023-12-04 05:15:48,294 INFO [train.py:1087] (2/4) Epoch 24, batch 150, loss[loss=0.155, simple_loss=0.2473, pruned_loss=0.03134, over 24571.00 frames. ], tot_loss[loss=0.1786, simple_loss=0.2676, pruned_loss=0.04477, over 2532599.09 frames. ], batch size: 64, lr: 1.02e-02, grad_scale: 16.0 2023-12-04 05:15:50,631 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=138233.33333333334, ans=0.125 2023-12-04 05:16:10,530 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.23 vs. limit=15.0 2023-12-04 05:16:39,539 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=138500.0, ans=6.0 2023-12-04 05:16:41,504 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=138500.0, ans=0.125 2023-12-04 05:16:43,688 INFO [train.py:1087] (2/4) Epoch 24, batch 200, loss[loss=0.1795, simple_loss=0.274, pruned_loss=0.04252, over 24720.00 frames. ], tot_loss[loss=0.1775, simple_loss=0.2666, pruned_loss=0.04421, over 3037903.78 frames. 
], batch size: 69, lr: 1.02e-02, grad_scale: 16.0 2023-12-04 05:16:43,920 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=138566.66666666666, ans=0.2 2023-12-04 05:16:54,507 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=138633.33333333334, ans=0.125 2023-12-04 05:16:54,911 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.09 vs. limit=22.5 2023-12-04 05:16:56,171 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.71 vs. limit=22.5 2023-12-04 05:17:05,539 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.34 vs. limit=12.0 2023-12-04 05:17:33,089 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.176e+02 1.351e+02 1.474e+02 1.677e+02 2.689e+02, threshold=2.947e+02, percent-clipped=0.0 2023-12-04 05:17:34,571 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.15 vs. limit=12.0 2023-12-04 05:17:39,410 INFO [train.py:1087] (2/4) Epoch 24, batch 250, loss[loss=0.1635, simple_loss=0.2539, pruned_loss=0.03652, over 24804.00 frames. ], tot_loss[loss=0.1777, simple_loss=0.2666, pruned_loss=0.04441, over 3412400.06 frames. ], batch size: 62, lr: 1.02e-02, grad_scale: 16.0 2023-12-04 05:17:48,074 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 05:18:00,574 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=139033.33333333334, ans=0.2 2023-12-04 05:18:09,503 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=139033.33333333334, ans=0.125 2023-12-04 05:18:24,687 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=139166.66666666666, ans=0.2 2023-12-04 05:18:35,184 INFO [train.py:1087] (2/4) Epoch 24, batch 300, loss[loss=0.168, simple_loss=0.2603, pruned_loss=0.03785, over 24758.00 frames. ], tot_loss[loss=0.1772, simple_loss=0.2662, pruned_loss=0.0441, over 3734398.87 frames. ], batch size: 65, lr: 1.01e-02, grad_scale: 16.0 2023-12-04 05:19:11,136 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.88 vs. limit=15.0 2023-12-04 05:19:15,944 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=139433.33333333334, ans=0.2 2023-12-04 05:19:23,147 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.186e+02 1.392e+02 1.510e+02 1.643e+02 2.800e+02, threshold=3.020e+02, percent-clipped=0.0 2023-12-04 05:19:23,477 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=139500.0, ans=0.0 2023-12-04 05:19:25,472 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=139500.0, ans=0.125 2023-12-04 05:19:30,021 INFO [train.py:1087] (2/4) Epoch 24, batch 350, loss[loss=0.1616, simple_loss=0.2541, pruned_loss=0.03454, over 24801.00 frames. 
], tot_loss[loss=0.1767, simple_loss=0.2658, pruned_loss=0.04378, over 3977710.06 frames. ], batch size: 72, lr: 1.01e-02, grad_scale: 16.0 2023-12-04 05:19:35,304 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=139566.66666666666, ans=0.2 2023-12-04 05:19:46,854 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=139633.33333333334, ans=0.04949747468305833 2023-12-04 05:20:03,073 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=139766.66666666666, ans=0.125 2023-12-04 05:20:21,772 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=139833.33333333334, ans=0.125 2023-12-04 05:20:24,712 INFO [train.py:1087] (2/4) Epoch 24, batch 400, loss[loss=0.1856, simple_loss=0.2757, pruned_loss=0.04774, over 23735.00 frames. ], tot_loss[loss=0.1762, simple_loss=0.2653, pruned_loss=0.04356, over 4181603.42 frames. ], batch size: 95, lr: 1.01e-02, grad_scale: 32.0 2023-12-04 05:20:26,355 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.80 vs. limit=15.0 2023-12-04 05:20:37,713 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=139966.66666666666, ans=0.0 2023-12-04 05:20:37,897 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.36 vs. limit=22.5 2023-12-04 05:20:38,183 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.99 vs. limit=10.0 2023-12-04 05:20:43,287 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=139966.66666666666, ans=0.0 2023-12-04 05:20:49,681 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=140033.33333333334, ans=0.0 2023-12-04 05:21:13,396 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=140166.66666666666, ans=0.2 2023-12-04 05:21:14,177 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.193e+02 1.386e+02 1.500e+02 1.610e+02 2.229e+02, threshold=2.999e+02, percent-clipped=0.0 2023-12-04 05:21:20,588 INFO [train.py:1087] (2/4) Epoch 24, batch 450, loss[loss=0.1643, simple_loss=0.2554, pruned_loss=0.0366, over 24582.00 frames. ], tot_loss[loss=0.1758, simple_loss=0.265, pruned_loss=0.04331, over 4319656.91 frames. ], batch size: 64, lr: 1.01e-02, grad_scale: 32.0 2023-12-04 05:21:39,109 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.66 vs. limit=22.5 2023-12-04 05:21:43,885 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.47 vs. limit=15.0 2023-12-04 05:22:04,268 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.01 vs. 
limit=12.0 2023-12-04 05:22:16,263 INFO [train.py:1087] (2/4) Epoch 24, batch 500, loss[loss=0.1774, simple_loss=0.2703, pruned_loss=0.0422, over 24772.00 frames. ], tot_loss[loss=0.1763, simple_loss=0.2655, pruned_loss=0.04358, over 4414025.43 frames. ], batch size: 73, lr: 1.01e-02, grad_scale: 32.0 2023-12-04 05:22:17,545 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=140566.66666666666, ans=0.2 2023-12-04 05:22:39,383 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=140700.0, ans=0.125 2023-12-04 05:22:44,995 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=140700.0, ans=0.0 2023-12-04 05:23:03,215 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=140833.33333333334, ans=0.125 2023-12-04 05:23:04,031 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.391e+02 1.542e+02 1.716e+02 2.195e+02, threshold=3.083e+02, percent-clipped=0.0 2023-12-04 05:23:10,178 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=140900.0, ans=0.125 2023-12-04 05:23:11,416 INFO [train.py:1087] (2/4) Epoch 24, batch 550, loss[loss=0.1824, simple_loss=0.2685, pruned_loss=0.04819, over 24486.00 frames. ], tot_loss[loss=0.1768, simple_loss=0.2659, pruned_loss=0.04388, over 4499338.64 frames. ], batch size: 77, lr: 1.01e-02, grad_scale: 32.0 2023-12-04 05:23:12,682 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 05:23:15,652 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.06 vs. limit=22.5 2023-12-04 05:23:16,250 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=140900.0, ans=0.125 2023-12-04 05:23:21,710 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=140966.66666666666, ans=0.0 2023-12-04 05:23:26,904 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=140966.66666666666, ans=0.0 2023-12-04 05:23:53,639 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=141100.0, ans=0.125 2023-12-04 05:24:06,077 INFO [train.py:1087] (2/4) Epoch 24, batch 600, loss[loss=0.1982, simple_loss=0.2844, pruned_loss=0.05604, over 22895.00 frames. ], tot_loss[loss=0.1758, simple_loss=0.2651, pruned_loss=0.04326, over 4581743.20 frames. 
], batch size: 106, lr: 1.01e-02, grad_scale: 32.0 2023-12-04 05:24:08,576 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=141233.33333333334, ans=0.0 2023-12-04 05:24:16,729 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=141300.0, ans=0.0 2023-12-04 05:24:24,790 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=141300.0, ans=0.0 2023-12-04 05:24:41,709 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=141433.33333333334, ans=0.125 2023-12-04 05:24:46,802 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 05:24:52,440 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=141500.0, ans=0.09899494936611666 2023-12-04 05:24:55,383 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.225e+02 1.415e+02 1.530e+02 1.698e+02 2.648e+02, threshold=3.060e+02, percent-clipped=0.0 2023-12-04 05:24:57,888 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=141500.0, ans=0.125 2023-12-04 05:25:01,788 INFO [train.py:1087] (2/4) Epoch 24, batch 650, loss[loss=0.2198, simple_loss=0.2949, pruned_loss=0.07232, over 16624.00 frames. ], tot_loss[loss=0.1761, simple_loss=0.2652, pruned_loss=0.04346, over 4618539.21 frames. ], batch size: 176, lr: 1.01e-02, grad_scale: 32.0 2023-12-04 05:25:20,216 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=141633.33333333334, ans=0.0 2023-12-04 05:25:52,142 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=141833.33333333334, ans=0.125 2023-12-04 05:25:57,574 INFO [train.py:1087] (2/4) Epoch 24, batch 700, loss[loss=0.1789, simple_loss=0.2648, pruned_loss=0.04651, over 24573.00 frames. ], tot_loss[loss=0.1757, simple_loss=0.265, pruned_loss=0.04325, over 4668418.30 frames. ], batch size: 64, lr: 1.01e-02, grad_scale: 32.0 2023-12-04 05:26:43,011 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=142166.66666666666, ans=0.125 2023-12-04 05:26:47,012 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.211e+02 1.394e+02 1.531e+02 1.673e+02 2.876e+02, threshold=3.062e+02, percent-clipped=0.0 2023-12-04 05:26:52,394 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff2.min_abs, batch_count=142233.33333333334, ans=0.1 2023-12-04 05:26:53,149 INFO [train.py:1087] (2/4) Epoch 24, batch 750, loss[loss=0.1675, simple_loss=0.2577, pruned_loss=0.03864, over 24601.00 frames. ], tot_loss[loss=0.176, simple_loss=0.2652, pruned_loss=0.04342, over 4699762.37 frames. 
], batch size: 68, lr: 1.01e-02, grad_scale: 16.0 2023-12-04 05:26:53,496 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=142233.33333333334, ans=0.2 2023-12-04 05:27:20,434 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=142366.66666666666, ans=0.125 2023-12-04 05:27:28,608 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=142433.33333333334, ans=0.2 2023-12-04 05:27:32,919 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=142433.33333333334, ans=0.2 2023-12-04 05:27:36,450 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 05:27:41,831 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=142500.0, ans=0.0 2023-12-04 05:27:47,856 INFO [train.py:1087] (2/4) Epoch 24, batch 800, loss[loss=0.1734, simple_loss=0.2588, pruned_loss=0.04399, over 24706.00 frames. ], tot_loss[loss=0.1762, simple_loss=0.2654, pruned_loss=0.04353, over 4718193.81 frames. ], batch size: 69, lr: 1.00e-02, grad_scale: 32.0 2023-12-04 05:27:48,146 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=142566.66666666666, ans=0.05 2023-12-04 05:27:54,417 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=142566.66666666666, ans=0.0 2023-12-04 05:28:11,568 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=142700.0, ans=0.125 2023-12-04 05:28:18,521 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=142766.66666666666, ans=0.0 2023-12-04 05:28:18,804 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.12 vs. limit=15.0 2023-12-04 05:28:20,434 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=142766.66666666666, ans=0.125 2023-12-04 05:28:26,496 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=142766.66666666666, ans=0.125 2023-12-04 05:28:28,517 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=142833.33333333334, ans=0.5 2023-12-04 05:28:34,637 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.200e+02 1.402e+02 1.503e+02 1.645e+02 2.328e+02, threshold=3.007e+02, percent-clipped=0.0 2023-12-04 05:28:39,615 INFO [train.py:1087] (2/4) Epoch 24, batch 850, loss[loss=0.1965, simple_loss=0.2866, pruned_loss=0.05321, over 21659.00 frames. ], tot_loss[loss=0.1763, simple_loss=0.2654, pruned_loss=0.04359, over 4728902.50 frames. ], batch size: 128, lr: 1.00e-02, grad_scale: 32.0 2023-12-04 05:28:42,888 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=142900.0, ans=0.05 2023-12-04 05:28:44,921 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.20 vs. 
limit=12.0 2023-12-04 05:28:45,702 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=142900.0, ans=0.125 2023-12-04 05:28:50,634 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=142966.66666666666, ans=0.125 2023-12-04 05:28:56,486 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=142966.66666666666, ans=0.125 2023-12-04 05:29:01,478 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=143033.33333333334, ans=0.125 2023-12-04 05:29:05,365 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=143033.33333333334, ans=0.125 2023-12-04 05:29:06,364 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=143033.33333333334, ans=0.1 2023-12-04 05:29:20,483 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=143166.66666666666, ans=0.1 2023-12-04 05:29:21,045 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.42 vs. limit=6.0 2023-12-04 05:29:39,883 INFO [train.py:1087] (2/4) Epoch 25, batch 0, loss[loss=0.1824, simple_loss=0.2771, pruned_loss=0.04387, over 24555.00 frames. ], tot_loss[loss=0.1824, simple_loss=0.2771, pruned_loss=0.04387, over 24555.00 frames. ], batch size: 62, lr: 9.81e-03, grad_scale: 32.0 2023-12-04 05:29:39,884 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 05:29:52,070 INFO [train.py:1119] (2/4) Epoch 25, validation: loss=0.1569, simple_loss=0.258, pruned_loss=0.02794, over 944034.00 frames. 2023-12-04 05:29:52,071 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 05:30:47,249 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.087e+02 1.371e+02 1.534e+02 1.745e+02 2.652e+02, threshold=3.068e+02, percent-clipped=0.0 2023-12-04 05:30:47,277 INFO [train.py:1087] (2/4) Epoch 25, batch 50, loss[loss=0.183, simple_loss=0.2727, pruned_loss=0.04667, over 24203.00 frames. ], tot_loss[loss=0.1755, simple_loss=0.2659, pruned_loss=0.04253, over 1090292.18 frames. ], batch size: 82, lr: 9.80e-03, grad_scale: 32.0 2023-12-04 05:30:52,620 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=143533.33333333334, ans=0.125 2023-12-04 05:31:14,535 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=143666.66666666666, ans=0.0 2023-12-04 05:31:22,106 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.47 vs. limit=10.0 2023-12-04 05:31:35,718 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.56 vs. limit=10.0 2023-12-04 05:31:41,484 INFO [train.py:1087] (2/4) Epoch 25, batch 100, loss[loss=0.1783, simple_loss=0.2681, pruned_loss=0.04422, over 24568.00 frames. ], tot_loss[loss=0.1756, simple_loss=0.2654, pruned_loss=0.04297, over 1918658.99 frames. 
], batch size: 62, lr: 9.79e-03, grad_scale: 32.0 2023-12-04 05:31:49,352 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=143866.66666666666, ans=0.125 2023-12-04 05:32:00,470 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-12-04 05:32:10,086 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=144000.0, ans=0.125 2023-12-04 05:32:14,980 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.21 vs. limit=15.0 2023-12-04 05:32:15,090 INFO [scaling.py:1022] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.92 vs. limit=5.0 2023-12-04 05:32:37,713 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.162e+02 1.373e+02 1.509e+02 1.642e+02 2.631e+02, threshold=3.018e+02, percent-clipped=0.0 2023-12-04 05:32:37,739 INFO [train.py:1087] (2/4) Epoch 25, batch 150, loss[loss=0.1777, simple_loss=0.2671, pruned_loss=0.04417, over 24553.00 frames. ], tot_loss[loss=0.1754, simple_loss=0.2655, pruned_loss=0.04269, over 2568264.72 frames. ], batch size: 62, lr: 9.78e-03, grad_scale: 32.0 2023-12-04 05:33:03,436 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=144333.33333333334, ans=0.125 2023-12-04 05:33:06,597 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=144333.33333333334, ans=0.0 2023-12-04 05:33:22,241 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=144466.66666666666, ans=0.0 2023-12-04 05:33:31,525 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.67 vs. limit=12.0 2023-12-04 05:33:33,285 INFO [train.py:1087] (2/4) Epoch 25, batch 200, loss[loss=0.1702, simple_loss=0.2624, pruned_loss=0.03903, over 23543.00 frames. ], tot_loss[loss=0.1751, simple_loss=0.2651, pruned_loss=0.04258, over 3075534.85 frames. ], batch size: 94, lr: 9.77e-03, grad_scale: 16.0 2023-12-04 05:34:00,875 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=144666.66666666666, ans=0.125 2023-12-04 05:34:02,863 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=144666.66666666666, ans=0.125 2023-12-04 05:34:12,880 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=144733.33333333334, ans=0.125 2023-12-04 05:34:15,001 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=144733.33333333334, ans=0.0 2023-12-04 05:34:21,733 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=144800.0, ans=0.0 2023-12-04 05:34:23,435 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.78 vs. limit=15.0 2023-12-04 05:34:28,530 INFO [train.py:1087] (2/4) Epoch 25, batch 250, loss[loss=0.1762, simple_loss=0.2702, pruned_loss=0.04109, over 24712.00 frames. 
], tot_loss[loss=0.1756, simple_loss=0.2653, pruned_loss=0.04294, over 3461201.35 frames. ], batch size: 74, lr: 9.76e-03, grad_scale: 16.0 2023-12-04 05:34:29,518 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.125e+02 1.419e+02 1.558e+02 1.718e+02 2.856e+02, threshold=3.117e+02, percent-clipped=0.0 2023-12-04 05:34:30,726 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=144866.66666666666, ans=0.0 2023-12-04 05:34:50,215 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=145000.0, ans=0.125 2023-12-04 05:34:56,270 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=145000.0, ans=0.09899494936611666 2023-12-04 05:35:13,603 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=145133.33333333334, ans=0.125 2023-12-04 05:35:15,805 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=145133.33333333334, ans=0.125 2023-12-04 05:35:17,890 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=145133.33333333334, ans=0.1 2023-12-04 05:35:23,217 INFO [train.py:1087] (2/4) Epoch 25, batch 300, loss[loss=0.1851, simple_loss=0.2683, pruned_loss=0.05093, over 24294.00 frames. ], tot_loss[loss=0.1759, simple_loss=0.2653, pruned_loss=0.04324, over 3748301.56 frames. ], batch size: 79, lr: 9.75e-03, grad_scale: 16.0 2023-12-04 05:35:28,256 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0 2023-12-04 05:36:01,378 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=145400.0, ans=0.05 2023-12-04 05:36:06,723 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=145466.66666666666, ans=0.1 2023-12-04 05:36:15,580 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=145466.66666666666, ans=0.0 2023-12-04 05:36:18,582 INFO [train.py:1087] (2/4) Epoch 25, batch 350, loss[loss=0.1738, simple_loss=0.265, pruned_loss=0.04127, over 24708.00 frames. ], tot_loss[loss=0.176, simple_loss=0.2653, pruned_loss=0.04333, over 3990896.41 frames. 
], batch size: 69, lr: 9.74e-03, grad_scale: 16.0 2023-12-04 05:36:19,597 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.169e+02 1.342e+02 1.479e+02 1.657e+02 2.661e+02, threshold=2.958e+02, percent-clipped=0.0 2023-12-04 05:36:19,857 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=145533.33333333334, ans=0.125 2023-12-04 05:36:27,991 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=145533.33333333334, ans=0.2 2023-12-04 05:36:29,567 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=145600.0, ans=0.125 2023-12-04 05:36:43,188 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=145666.66666666666, ans=0.125 2023-12-04 05:37:14,130 INFO [train.py:1087] (2/4) Epoch 25, batch 400, loss[loss=0.1706, simple_loss=0.2588, pruned_loss=0.04122, over 24784.00 frames. ], tot_loss[loss=0.1756, simple_loss=0.2647, pruned_loss=0.04323, over 4162171.82 frames. ], batch size: 62, lr: 9.73e-03, grad_scale: 32.0 2023-12-04 05:37:15,449 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=145866.66666666666, ans=0.125 2023-12-04 05:37:25,382 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=145933.33333333334, ans=0.125 2023-12-04 05:37:25,434 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=145933.33333333334, ans=0.125 2023-12-04 05:37:46,164 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=146066.66666666666, ans=0.0 2023-12-04 05:37:49,371 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=146066.66666666666, ans=0.0 2023-12-04 05:37:53,465 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=146066.66666666666, ans=0.125 2023-12-04 05:37:54,095 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.49 vs. limit=22.5 2023-12-04 05:37:56,228 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=146066.66666666666, ans=0.2 2023-12-04 05:37:57,878 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=15.0 2023-12-04 05:37:58,455 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=146133.33333333334, ans=0.0 2023-12-04 05:38:03,470 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=146133.33333333334, ans=0.2 2023-12-04 05:38:09,559 INFO [train.py:1087] (2/4) Epoch 25, batch 450, loss[loss=0.1752, simple_loss=0.2586, pruned_loss=0.04589, over 24717.00 frames. ], tot_loss[loss=0.1761, simple_loss=0.2652, pruned_loss=0.04347, over 4293724.04 frames. 
], batch size: 67, lr: 9.72e-03, grad_scale: 32.0 2023-12-04 05:38:10,570 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.152e+02 1.420e+02 1.535e+02 1.721e+02 2.483e+02, threshold=3.070e+02, percent-clipped=0.0 2023-12-04 05:38:18,332 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=146200.0, ans=10.0 2023-12-04 05:38:26,100 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=146266.66666666666, ans=0.0 2023-12-04 05:38:55,794 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=146466.66666666666, ans=0.0 2023-12-04 05:39:05,045 INFO [train.py:1087] (2/4) Epoch 25, batch 500, loss[loss=0.1712, simple_loss=0.2624, pruned_loss=0.03996, over 24777.00 frames. ], tot_loss[loss=0.1754, simple_loss=0.2647, pruned_loss=0.04299, over 4419513.31 frames. ], batch size: 64, lr: 9.71e-03, grad_scale: 32.0 2023-12-04 05:39:34,899 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=146666.66666666666, ans=0.1 2023-12-04 05:39:45,860 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=146733.33333333334, ans=0.95 2023-12-04 05:39:59,311 INFO [train.py:1087] (2/4) Epoch 25, batch 550, loss[loss=0.1719, simple_loss=0.2596, pruned_loss=0.04204, over 24567.00 frames. ], tot_loss[loss=0.1753, simple_loss=0.2647, pruned_loss=0.0429, over 4509775.27 frames. ], batch size: 65, lr: 9.70e-03, grad_scale: 16.0 2023-12-04 05:40:01,391 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.158e+02 1.385e+02 1.522e+02 1.751e+02 2.434e+02, threshold=3.044e+02, percent-clipped=0.0 2023-12-04 05:40:12,260 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=146933.33333333334, ans=0.125 2023-12-04 05:40:22,728 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=147000.0, ans=0.1 2023-12-04 05:40:42,606 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=147133.33333333334, ans=0.125 2023-12-04 05:40:48,856 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=147133.33333333334, ans=0.1 2023-12-04 05:40:53,784 INFO [train.py:1087] (2/4) Epoch 25, batch 600, loss[loss=0.1677, simple_loss=0.259, pruned_loss=0.03816, over 24801.00 frames. ], tot_loss[loss=0.1754, simple_loss=0.2649, pruned_loss=0.043, over 4572362.22 frames. ], batch size: 72, lr: 9.69e-03, grad_scale: 16.0 2023-12-04 05:40:59,345 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=147200.0, ans=0.0 2023-12-04 05:41:49,103 INFO [train.py:1087] (2/4) Epoch 25, batch 650, loss[loss=0.1789, simple_loss=0.2677, pruned_loss=0.045, over 22834.00 frames. ], tot_loss[loss=0.1752, simple_loss=0.2646, pruned_loss=0.04293, over 4630312.99 frames. 
], batch size: 106, lr: 9.68e-03, grad_scale: 16.0 2023-12-04 05:41:51,298 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.150e+02 1.379e+02 1.470e+02 1.642e+02 3.747e+02, threshold=2.939e+02, percent-clipped=1.0 2023-12-04 05:41:51,564 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=147533.33333333334, ans=0.2 2023-12-04 05:42:11,418 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=147666.66666666666, ans=0.125 2023-12-04 05:42:13,288 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.34 vs. limit=15.0 2023-12-04 05:42:13,308 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.62 vs. limit=15.0 2023-12-04 05:42:22,539 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=147733.33333333334, ans=0.0 2023-12-04 05:42:24,661 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=147733.33333333334, ans=0.0 2023-12-04 05:42:30,803 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=147733.33333333334, ans=0.2 2023-12-04 05:42:36,234 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=147800.0, ans=0.125 2023-12-04 05:42:36,243 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=147800.0, ans=0.2 2023-12-04 05:42:44,037 INFO [train.py:1087] (2/4) Epoch 25, batch 700, loss[loss=0.1773, simple_loss=0.2678, pruned_loss=0.04335, over 24575.00 frames. ], tot_loss[loss=0.1754, simple_loss=0.2648, pruned_loss=0.043, over 4655078.92 frames. ], batch size: 65, lr: 9.67e-03, grad_scale: 16.0 2023-12-04 05:43:17,447 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=148066.66666666666, ans=0.125 2023-12-04 05:43:39,040 INFO [train.py:1087] (2/4) Epoch 25, batch 750, loss[loss=0.1722, simple_loss=0.2611, pruned_loss=0.04162, over 24767.00 frames. ], tot_loss[loss=0.1752, simple_loss=0.2645, pruned_loss=0.04296, over 4687308.95 frames. 
], batch size: 64, lr: 9.67e-03, grad_scale: 16.0 2023-12-04 05:43:41,170 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.201e+02 1.392e+02 1.520e+02 1.675e+02 2.210e+02, threshold=3.040e+02, percent-clipped=0.0 2023-12-04 05:43:41,377 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=148200.0, ans=0.0 2023-12-04 05:43:48,334 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=148200.0, ans=0.125 2023-12-04 05:43:52,688 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=148266.66666666666, ans=0.0 2023-12-04 05:44:04,298 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=148333.33333333334, ans=0.125 2023-12-04 05:44:10,965 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=148400.0, ans=0.125 2023-12-04 05:44:14,568 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=148400.0, ans=0.0 2023-12-04 05:44:33,503 INFO [train.py:1087] (2/4) Epoch 25, batch 800, loss[loss=0.1848, simple_loss=0.2741, pruned_loss=0.0477, over 24721.00 frames. ], tot_loss[loss=0.1749, simple_loss=0.2643, pruned_loss=0.04277, over 4714953.06 frames. ], batch size: 67, lr: 9.66e-03, grad_scale: 32.0 2023-12-04 05:44:54,156 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 05:45:23,322 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 05:45:24,701 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.82 vs. limit=22.5 2023-12-04 05:45:25,059 INFO [train.py:1087] (2/4) Epoch 25, batch 850, loss[loss=0.1798, simple_loss=0.267, pruned_loss=0.04628, over 24523.00 frames. ], tot_loss[loss=0.1751, simple_loss=0.2644, pruned_loss=0.04286, over 4744367.12 frames. ], batch size: 77, lr: 9.65e-03, grad_scale: 32.0 2023-12-04 05:45:26,993 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.224e+02 1.434e+02 1.522e+02 1.667e+02 2.579e+02, threshold=3.044e+02, percent-clipped=0.0 2023-12-04 05:45:27,188 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=148866.66666666666, ans=0.0 2023-12-04 05:45:27,191 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=148866.66666666666, ans=0.0 2023-12-04 05:45:52,049 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=149000.0, ans=0.0 2023-12-04 05:46:07,027 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=149133.33333333334, ans=0.125 2023-12-04 05:46:25,820 INFO [train.py:1087] (2/4) Epoch 26, batch 0, loss[loss=0.1761, simple_loss=0.2687, pruned_loss=0.0418, over 22118.00 frames. ], tot_loss[loss=0.1761, simple_loss=0.2687, pruned_loss=0.0418, over 22118.00 frames. 
], batch size: 53, lr: 9.45e-03, grad_scale: 32.0 2023-12-04 05:46:25,820 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 05:46:37,860 INFO [train.py:1119] (2/4) Epoch 26, validation: loss=0.1564, simple_loss=0.2574, pruned_loss=0.02768, over 944034.00 frames. 2023-12-04 05:46:37,861 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 05:46:46,627 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=149166.66666666666, ans=0.0 2023-12-04 05:46:49,827 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=149233.33333333334, ans=0.0 2023-12-04 05:46:50,953 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=149233.33333333334, ans=0.2 2023-12-04 05:46:52,345 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=149233.33333333334, ans=0.125 2023-12-04 05:47:00,129 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=149300.0, ans=0.125 2023-12-04 05:47:31,516 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.95 vs. limit=22.5 2023-12-04 05:47:34,148 INFO [train.py:1087] (2/4) Epoch 26, batch 50, loss[loss=0.1673, simple_loss=0.253, pruned_loss=0.04077, over 24137.00 frames. ], tot_loss[loss=0.1761, simple_loss=0.2658, pruned_loss=0.04326, over 1085863.86 frames. ], batch size: 58, lr: 9.44e-03, grad_scale: 32.0 2023-12-04 05:47:41,535 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.171e+02 1.353e+02 1.478e+02 1.657e+02 3.231e+02, threshold=2.956e+02, percent-clipped=1.0 2023-12-04 05:47:41,867 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=149500.0, ans=0.0 2023-12-04 05:47:43,865 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=149566.66666666666, ans=0.125 2023-12-04 05:48:03,049 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.56 vs. limit=10.0 2023-12-04 05:48:14,599 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.28 vs. limit=15.0 2023-12-04 05:48:28,395 INFO [train.py:1087] (2/4) Epoch 26, batch 100, loss[loss=0.211, simple_loss=0.2885, pruned_loss=0.06671, over 16708.00 frames. ], tot_loss[loss=0.1748, simple_loss=0.2645, pruned_loss=0.04258, over 1913598.55 frames. 
], batch size: 179, lr: 9.43e-03, grad_scale: 32.0 2023-12-04 05:48:50,019 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=149966.66666666666, ans=0.95 2023-12-04 05:48:52,137 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=149966.66666666666, ans=0.125 2023-12-04 05:49:10,715 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=150033.33333333334, ans=0.2 2023-12-04 05:49:23,248 INFO [train.py:1087] (2/4) Epoch 26, batch 150, loss[loss=0.1603, simple_loss=0.2519, pruned_loss=0.03441, over 24793.00 frames. ], tot_loss[loss=0.1744, simple_loss=0.2638, pruned_loss=0.04246, over 2549047.03 frames. ], batch size: 72, lr: 9.42e-03, grad_scale: 32.0 2023-12-04 05:49:25,021 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.73 vs. limit=10.0 2023-12-04 05:49:31,506 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.129e+02 1.354e+02 1.451e+02 1.669e+02 2.320e+02, threshold=2.903e+02, percent-clipped=0.0 2023-12-04 05:49:51,424 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=150300.0, ans=0.125 2023-12-04 05:49:57,226 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=15.0 2023-12-04 05:49:57,771 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=150366.66666666666, ans=0.125 2023-12-04 05:50:11,440 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=150433.33333333334, ans=0.0 2023-12-04 05:50:15,761 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=150433.33333333334, ans=0.125 2023-12-04 05:50:18,525 INFO [train.py:1087] (2/4) Epoch 26, batch 200, loss[loss=0.1909, simple_loss=0.2793, pruned_loss=0.05128, over 23345.00 frames. ], tot_loss[loss=0.1742, simple_loss=0.2636, pruned_loss=0.04238, over 3051615.08 frames. ], batch size: 56, lr: 9.41e-03, grad_scale: 32.0 2023-12-04 05:51:14,163 INFO [train.py:1087] (2/4) Epoch 26, batch 250, loss[loss=0.1707, simple_loss=0.2598, pruned_loss=0.04082, over 24115.00 frames. ], tot_loss[loss=0.1744, simple_loss=0.2638, pruned_loss=0.04248, over 3446048.76 frames. 
], batch size: 87, lr: 9.40e-03, grad_scale: 32.0 2023-12-04 05:51:15,929 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=150833.33333333334, ans=0.0 2023-12-04 05:51:19,955 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=150833.33333333334, ans=0.125 2023-12-04 05:51:21,781 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.178e+02 1.341e+02 1.458e+02 1.607e+02 2.850e+02, threshold=2.916e+02, percent-clipped=0.0 2023-12-04 05:51:42,703 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=150966.66666666666, ans=0.125 2023-12-04 05:51:50,391 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=151033.33333333334, ans=0.125 2023-12-04 05:51:50,472 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=151033.33333333334, ans=0.0 2023-12-04 05:52:09,148 INFO [train.py:1087] (2/4) Epoch 26, batch 300, loss[loss=0.1699, simple_loss=0.2574, pruned_loss=0.04116, over 24571.00 frames. ], tot_loss[loss=0.1741, simple_loss=0.2637, pruned_loss=0.04224, over 3757711.46 frames. ], batch size: 65, lr: 9.39e-03, grad_scale: 32.0 2023-12-04 05:52:17,712 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=151166.66666666666, ans=0.125 2023-12-04 05:52:21,230 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.65 vs. limit=15.0 2023-12-04 05:52:51,788 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=151433.33333333334, ans=0.125 2023-12-04 05:52:51,875 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=151433.33333333334, ans=0.125 2023-12-04 05:53:03,538 INFO [train.py:1087] (2/4) Epoch 26, batch 350, loss[loss=0.1685, simple_loss=0.2566, pruned_loss=0.04021, over 24756.00 frames. ], tot_loss[loss=0.1742, simple_loss=0.2638, pruned_loss=0.04229, over 3999557.00 frames. ], batch size: 66, lr: 9.38e-03, grad_scale: 32.0 2023-12-04 05:53:11,695 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.145e+02 1.373e+02 1.473e+02 1.593e+02 2.083e+02, threshold=2.946e+02, percent-clipped=0.0 2023-12-04 05:53:30,400 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=151633.33333333334, ans=0.5 2023-12-04 05:53:33,486 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=151633.33333333334, ans=0.2 2023-12-04 05:53:53,821 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.16 vs. limit=15.0 2023-12-04 05:53:58,348 INFO [train.py:1087] (2/4) Epoch 26, batch 400, loss[loss=0.1668, simple_loss=0.2542, pruned_loss=0.03963, over 24557.00 frames. ], tot_loss[loss=0.1742, simple_loss=0.2637, pruned_loss=0.04234, over 4165782.54 frames. 
], batch size: 66, lr: 9.37e-03, grad_scale: 32.0 2023-12-04 05:54:05,251 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=151833.33333333334, ans=0.125 2023-12-04 05:54:33,225 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=152033.33333333334, ans=0.0 2023-12-04 05:54:37,783 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.25 vs. limit=15.0 2023-12-04 05:54:43,529 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=152100.0, ans=0.125 2023-12-04 05:54:53,830 INFO [train.py:1087] (2/4) Epoch 26, batch 450, loss[loss=0.1735, simple_loss=0.2633, pruned_loss=0.04184, over 24546.00 frames. ], tot_loss[loss=0.1741, simple_loss=0.2637, pruned_loss=0.04229, over 4309291.37 frames. ], batch size: 63, lr: 9.36e-03, grad_scale: 32.0 2023-12-04 05:54:56,303 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=152166.66666666666, ans=0.125 2023-12-04 05:55:01,288 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.219e+02 1.411e+02 1.506e+02 1.676e+02 2.196e+02, threshold=3.012e+02, percent-clipped=0.0 2023-12-04 05:55:10,376 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=152233.33333333334, ans=0.125 2023-12-04 05:55:38,888 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.99 vs. limit=15.0 2023-12-04 05:55:42,991 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=152433.33333333334, ans=0.125 2023-12-04 05:55:49,575 INFO [train.py:1087] (2/4) Epoch 26, batch 500, loss[loss=0.1759, simple_loss=0.2636, pruned_loss=0.04407, over 24484.00 frames. ], tot_loss[loss=0.1739, simple_loss=0.2634, pruned_loss=0.04215, over 4426142.25 frames. ], batch size: 75, lr: 9.35e-03, grad_scale: 32.0 2023-12-04 05:56:06,866 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=152566.66666666666, ans=0.0 2023-12-04 05:56:12,633 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=152633.33333333334, ans=0.125 2023-12-04 05:56:15,435 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=152633.33333333334, ans=0.125 2023-12-04 05:56:45,233 INFO [train.py:1087] (2/4) Epoch 26, batch 550, loss[loss=0.1812, simple_loss=0.267, pruned_loss=0.04765, over 24578.00 frames. ], tot_loss[loss=0.1738, simple_loss=0.2635, pruned_loss=0.04206, over 4515443.36 frames. 
], batch size: 65, lr: 9.34e-03, grad_scale: 32.0 2023-12-04 05:56:52,942 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.180e+02 1.362e+02 1.478e+02 1.633e+02 2.091e+02, threshold=2.955e+02, percent-clipped=0.0 2023-12-04 05:57:09,311 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=152966.66666666666, ans=0.1 2023-12-04 05:57:13,645 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=152966.66666666666, ans=0.0 2023-12-04 05:57:19,055 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=153033.33333333334, ans=0.2 2023-12-04 05:57:21,212 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.17 vs. limit=12.0 2023-12-04 05:57:39,769 INFO [train.py:1087] (2/4) Epoch 26, batch 600, loss[loss=0.1763, simple_loss=0.2657, pruned_loss=0.0434, over 24511.00 frames. ], tot_loss[loss=0.1741, simple_loss=0.2639, pruned_loss=0.0421, over 4574003.15 frames. ], batch size: 75, lr: 9.33e-03, grad_scale: 32.0 2023-12-04 05:57:42,209 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=153166.66666666666, ans=0.0 2023-12-04 05:58:10,809 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=153300.0, ans=0.05 2023-12-04 05:58:16,473 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=153366.66666666666, ans=0.1 2023-12-04 05:58:22,776 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 05:58:35,281 INFO [train.py:1087] (2/4) Epoch 26, batch 650, loss[loss=0.1897, simple_loss=0.2765, pruned_loss=0.05144, over 23471.00 frames. ], tot_loss[loss=0.1742, simple_loss=0.2639, pruned_loss=0.04222, over 4620947.39 frames. ], batch size: 94, lr: 9.32e-03, grad_scale: 32.0 2023-12-04 05:58:42,601 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.125e+02 1.332e+02 1.436e+02 1.583e+02 2.553e+02, threshold=2.873e+02, percent-clipped=0.0 2023-12-04 05:58:56,722 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=153633.33333333334, ans=0.0 2023-12-04 05:59:07,447 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=153700.0, ans=0.0 2023-12-04 05:59:30,010 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=153833.33333333334, ans=0.0 2023-12-04 05:59:30,808 INFO [train.py:1087] (2/4) Epoch 26, batch 700, loss[loss=0.1688, simple_loss=0.263, pruned_loss=0.03727, over 24858.00 frames. ], tot_loss[loss=0.1741, simple_loss=0.2638, pruned_loss=0.04223, over 4663728.72 frames. 
], batch size: 68, lr: 9.32e-03, grad_scale: 32.0 2023-12-04 05:59:37,506 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=153833.33333333334, ans=0.125 2023-12-04 05:59:55,175 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=153966.66666666666, ans=0.1 2023-12-04 05:59:57,578 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 06:00:19,722 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=154100.0, ans=0.0 2023-12-04 06:00:19,831 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=154100.0, ans=0.07 2023-12-04 06:00:25,922 INFO [train.py:1087] (2/4) Epoch 26, batch 750, loss[loss=0.1739, simple_loss=0.2674, pruned_loss=0.04025, over 24796.00 frames. ], tot_loss[loss=0.1736, simple_loss=0.2633, pruned_loss=0.04191, over 4710528.43 frames. ], batch size: 72, lr: 9.31e-03, grad_scale: 32.0 2023-12-04 06:00:29,452 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=154166.66666666666, ans=0.0 2023-12-04 06:00:32,907 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=154166.66666666666, ans=0.125 2023-12-04 06:00:33,760 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.063e+02 1.380e+02 1.538e+02 1.731e+02 2.192e+02, threshold=3.076e+02, percent-clipped=0.0 2023-12-04 06:00:41,509 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=154233.33333333334, ans=0.0 2023-12-04 06:00:42,425 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=154233.33333333334, ans=0.125 2023-12-04 06:00:55,800 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=154300.0, ans=0.125 2023-12-04 06:00:59,391 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=154366.66666666666, ans=0.0 2023-12-04 06:01:07,273 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=154366.66666666666, ans=0.125 2023-12-04 06:01:20,922 INFO [train.py:1087] (2/4) Epoch 26, batch 800, loss[loss=0.1678, simple_loss=0.2574, pruned_loss=0.03908, over 24757.00 frames. ], tot_loss[loss=0.1737, simple_loss=0.2635, pruned_loss=0.04197, over 4738821.40 frames. 
], batch size: 66, lr: 9.30e-03, grad_scale: 32.0 2023-12-04 06:01:23,502 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=154500.0, ans=0.015 2023-12-04 06:01:32,540 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=154566.66666666666, ans=0.1 2023-12-04 06:02:03,075 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=154766.66666666666, ans=0.0 2023-12-04 06:02:07,087 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=154766.66666666666, ans=0.125 2023-12-04 06:02:12,819 INFO [train.py:1087] (2/4) Epoch 26, batch 850, loss[loss=0.1624, simple_loss=0.2547, pruned_loss=0.03507, over 24800.00 frames. ], tot_loss[loss=0.1748, simple_loss=0.2642, pruned_loss=0.04274, over 4734545.44 frames. ], batch size: 72, lr: 9.29e-03, grad_scale: 32.0 2023-12-04 06:02:16,021 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=154833.33333333334, ans=0.2 2023-12-04 06:02:19,851 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.214e+02 1.388e+02 1.511e+02 1.708e+02 2.505e+02, threshold=3.022e+02, percent-clipped=0.0 2023-12-04 06:02:48,396 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=155033.33333333334, ans=0.125 2023-12-04 06:03:14,465 INFO [train.py:1087] (2/4) Epoch 27, batch 0, loss[loss=0.1742, simple_loss=0.2644, pruned_loss=0.04195, over 24352.00 frames. ], tot_loss[loss=0.1742, simple_loss=0.2644, pruned_loss=0.04195, over 24352.00 frames. ], batch size: 79, lr: 9.10e-03, grad_scale: 32.0 2023-12-04 06:03:14,465 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 06:03:26,615 INFO [train.py:1119] (2/4) Epoch 27, validation: loss=0.1567, simple_loss=0.2572, pruned_loss=0.02815, over 944034.00 frames. 2023-12-04 06:03:26,616 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 06:03:54,500 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.96 vs. limit=15.0 2023-12-04 06:03:57,109 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=155266.66666666666, ans=0.125 2023-12-04 06:03:57,513 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.41 vs. limit=15.0 2023-12-04 06:04:01,447 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=155333.33333333334, ans=0.125 2023-12-04 06:04:05,740 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=155333.33333333334, ans=0.0 2023-12-04 06:04:09,209 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=155400.0, ans=0.125 2023-12-04 06:04:12,351 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=155400.0, ans=0.0 2023-12-04 06:04:21,333 INFO [train.py:1087] (2/4) Epoch 27, batch 50, loss[loss=0.1729, simple_loss=0.2599, pruned_loss=0.04298, over 24557.00 frames. 
], tot_loss[loss=0.1763, simple_loss=0.265, pruned_loss=0.04383, over 1080830.11 frames. ], batch size: 64, lr: 9.09e-03, grad_scale: 32.0 2023-12-04 06:04:25,727 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=155466.66666666666, ans=0.1 2023-12-04 06:04:29,021 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=155466.66666666666, ans=0.125 2023-12-04 06:04:30,109 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=155466.66666666666, ans=0.125 2023-12-04 06:04:31,039 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=155533.33333333334, ans=0.2 2023-12-04 06:04:34,948 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.198e+02 1.419e+02 1.535e+02 1.765e+02 3.473e+02, threshold=3.070e+02, percent-clipped=1.0 2023-12-04 06:04:36,545 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.64 vs. limit=10.0 2023-12-04 06:04:40,836 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=155533.33333333334, ans=0.0 2023-12-04 06:04:40,973 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=155533.33333333334, ans=0.1 2023-12-04 06:04:52,221 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=155600.0, ans=0.09899494936611666 2023-12-04 06:04:59,559 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=155666.66666666666, ans=0.0 2023-12-04 06:05:16,282 INFO [train.py:1087] (2/4) Epoch 27, batch 100, loss[loss=0.1686, simple_loss=0.2546, pruned_loss=0.04134, over 24747.00 frames. ], tot_loss[loss=0.1741, simple_loss=0.2637, pruned_loss=0.04227, over 1914232.68 frames. ], batch size: 65, lr: 9.09e-03, grad_scale: 32.0 2023-12-04 06:05:24,911 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=155800.0, ans=0.125 2023-12-04 06:05:33,663 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=155866.66666666666, ans=0.125 2023-12-04 06:05:49,686 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=156000.0, ans=0.125 2023-12-04 06:05:52,714 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=156000.0, ans=0.1 2023-12-04 06:05:59,188 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=12.0 2023-12-04 06:06:10,418 INFO [train.py:1087] (2/4) Epoch 27, batch 150, loss[loss=0.179, simple_loss=0.2692, pruned_loss=0.04445, over 24737.00 frames. ], tot_loss[loss=0.1731, simple_loss=0.2631, pruned_loss=0.04156, over 2561901.18 frames. 
], batch size: 74, lr: 9.08e-03, grad_scale: 32.0 2023-12-04 06:06:10,630 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=156133.33333333334, ans=0.2 2023-12-04 06:06:12,777 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=156133.33333333334, ans=0.0 2023-12-04 06:06:25,101 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.219e+02 1.353e+02 1.457e+02 1.592e+02 2.359e+02, threshold=2.913e+02, percent-clipped=0.0 2023-12-04 06:06:34,872 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=156266.66666666666, ans=0.125 2023-12-04 06:06:40,206 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=156266.66666666666, ans=0.09899494936611666 2023-12-04 06:06:52,744 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=156333.33333333334, ans=0.07 2023-12-04 06:06:59,034 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 06:06:59,138 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=156400.0, ans=0.0 2023-12-04 06:07:05,160 INFO [train.py:1087] (2/4) Epoch 27, batch 200, loss[loss=0.1773, simple_loss=0.2704, pruned_loss=0.04214, over 24216.00 frames. ], tot_loss[loss=0.1724, simple_loss=0.2624, pruned_loss=0.0412, over 3063554.28 frames. ], batch size: 82, lr: 9.07e-03, grad_scale: 32.0 2023-12-04 06:07:12,033 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.13 vs. limit=22.5 2023-12-04 06:07:25,598 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=156533.33333333334, ans=0.0 2023-12-04 06:07:46,433 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=156666.66666666666, ans=0.2 2023-12-04 06:08:00,028 INFO [train.py:1087] (2/4) Epoch 27, batch 250, loss[loss=0.1746, simple_loss=0.2665, pruned_loss=0.0414, over 24730.00 frames. ], tot_loss[loss=0.1725, simple_loss=0.2624, pruned_loss=0.04135, over 3452113.55 frames. 
], batch size: 61, lr: 9.06e-03, grad_scale: 32.0 2023-12-04 06:08:13,749 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.187e+02 1.395e+02 1.466e+02 1.676e+02 3.271e+02, threshold=2.931e+02, percent-clipped=1.0 2023-12-04 06:08:17,179 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=156866.66666666666, ans=0.0 2023-12-04 06:08:20,327 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=156933.33333333334, ans=0.1 2023-12-04 06:08:37,258 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=157000.0, ans=0.125 2023-12-04 06:08:39,456 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=157000.0, ans=0.2 2023-12-04 06:08:39,481 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=157000.0, ans=0.2 2023-12-04 06:08:46,802 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=157066.66666666666, ans=0.125 2023-12-04 06:08:47,883 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=157066.66666666666, ans=0.125 2023-12-04 06:08:49,963 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=157066.66666666666, ans=10.0 2023-12-04 06:08:54,207 INFO [train.py:1087] (2/4) Epoch 27, batch 300, loss[loss=0.1576, simple_loss=0.2491, pruned_loss=0.03307, over 24700.00 frames. ], tot_loss[loss=0.1725, simple_loss=0.2623, pruned_loss=0.04134, over 3753723.62 frames. ], batch size: 74, lr: 9.05e-03, grad_scale: 32.0 2023-12-04 06:09:21,574 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=157266.66666666666, ans=0.1 2023-12-04 06:09:25,120 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=157266.66666666666, ans=0.0 2023-12-04 06:09:33,219 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=157333.33333333334, ans=0.125 2023-12-04 06:09:37,051 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.80 vs. limit=22.5 2023-12-04 06:09:49,182 INFO [train.py:1087] (2/4) Epoch 27, batch 350, loss[loss=0.1724, simple_loss=0.2614, pruned_loss=0.04167, over 24871.00 frames. ], tot_loss[loss=0.1732, simple_loss=0.2627, pruned_loss=0.04188, over 3976968.32 frames. 
], batch size: 68, lr: 9.04e-03, grad_scale: 32.0 2023-12-04 06:09:59,515 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=157533.33333333334, ans=0.1 2023-12-04 06:10:03,549 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.157e+02 1.341e+02 1.468e+02 1.612e+02 2.620e+02, threshold=2.936e+02, percent-clipped=0.0 2023-12-04 06:10:18,782 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=157600.0, ans=0.1 2023-12-04 06:10:18,902 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=157600.0, ans=0.125 2023-12-04 06:10:19,000 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=157600.0, ans=0.2 2023-12-04 06:10:20,029 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=157600.0, ans=0.125 2023-12-04 06:10:21,279 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.52 vs. limit=15.0 2023-12-04 06:10:25,761 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=157666.66666666666, ans=0.2 2023-12-04 06:10:31,406 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=157666.66666666666, ans=0.125 2023-12-04 06:10:38,753 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=157733.33333333334, ans=0.125 2023-12-04 06:10:44,315 INFO [train.py:1087] (2/4) Epoch 27, batch 400, loss[loss=0.1692, simple_loss=0.2626, pruned_loss=0.03793, over 24726.00 frames. ], tot_loss[loss=0.1732, simple_loss=0.2627, pruned_loss=0.04186, over 4150218.59 frames. ], batch size: 67, lr: 9.03e-03, grad_scale: 32.0 2023-12-04 06:10:49,680 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=157800.0, ans=0.125 2023-12-04 06:10:55,337 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.09 vs. limit=15.0 2023-12-04 06:11:21,846 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=158000.0, ans=0.0 2023-12-04 06:11:39,243 INFO [train.py:1087] (2/4) Epoch 27, batch 450, loss[loss=0.1873, simple_loss=0.2722, pruned_loss=0.05124, over 24710.00 frames. ], tot_loss[loss=0.1731, simple_loss=0.2626, pruned_loss=0.04183, over 4295479.20 frames. 
], batch size: 69, lr: 9.02e-03, grad_scale: 32.0 2023-12-04 06:11:44,735 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=158133.33333333334, ans=0.2 2023-12-04 06:11:52,512 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=158200.0, ans=0.1 2023-12-04 06:11:53,334 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.105e+02 1.360e+02 1.480e+02 1.620e+02 2.583e+02, threshold=2.959e+02, percent-clipped=0.0 2023-12-04 06:11:57,761 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 06:12:04,806 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=158266.66666666666, ans=0.0 2023-12-04 06:12:10,036 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=158266.66666666666, ans=0.125 2023-12-04 06:12:27,411 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=158400.0, ans=0.125 2023-12-04 06:12:33,885 INFO [train.py:1087] (2/4) Epoch 27, batch 500, loss[loss=0.1561, simple_loss=0.247, pruned_loss=0.03257, over 24571.00 frames. ], tot_loss[loss=0.1725, simple_loss=0.2621, pruned_loss=0.04143, over 4419719.52 frames. ], batch size: 64, lr: 9.02e-03, grad_scale: 32.0 2023-12-04 06:12:35,157 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=158466.66666666666, ans=0.1 2023-12-04 06:12:35,399 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=158466.66666666666, ans=22.5 2023-12-04 06:12:40,856 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=158466.66666666666, ans=0.125 2023-12-04 06:12:48,118 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=158533.33333333334, ans=0.125 2023-12-04 06:13:03,048 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=158600.0, ans=0.015 2023-12-04 06:13:07,182 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=158666.66666666666, ans=0.1 2023-12-04 06:13:07,234 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=158666.66666666666, ans=0.125 2023-12-04 06:13:11,520 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=158666.66666666666, ans=0.025 2023-12-04 06:13:12,833 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.18 vs. 
limit=6.0 2023-12-04 06:13:16,810 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=158733.33333333334, ans=0.2 2023-12-04 06:13:21,073 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=158733.33333333334, ans=0.125 2023-12-04 06:13:26,732 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=158733.33333333334, ans=0.125 2023-12-04 06:13:28,571 INFO [train.py:1087] (2/4) Epoch 27, batch 550, loss[loss=0.1648, simple_loss=0.2558, pruned_loss=0.03688, over 24694.00 frames. ], tot_loss[loss=0.1722, simple_loss=0.2618, pruned_loss=0.04124, over 4521512.93 frames. ], batch size: 74, lr: 9.01e-03, grad_scale: 32.0 2023-12-04 06:13:42,756 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.198e+02 1.388e+02 1.537e+02 1.705e+02 2.928e+02, threshold=3.074e+02, percent-clipped=0.0 2023-12-04 06:13:44,182 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=158866.66666666666, ans=0.0 2023-12-04 06:13:50,883 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.60 vs. limit=15.0 2023-12-04 06:13:52,737 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=158933.33333333334, ans=0.0 2023-12-04 06:13:59,658 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.48 vs. limit=15.0 2023-12-04 06:14:01,444 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=159000.0, ans=0.125 2023-12-04 06:14:22,647 INFO [train.py:1087] (2/4) Epoch 27, batch 600, loss[loss=0.1809, simple_loss=0.2687, pruned_loss=0.04653, over 22577.00 frames. ], tot_loss[loss=0.1726, simple_loss=0.2624, pruned_loss=0.04147, over 4568537.44 frames. ], batch size: 54, lr: 9.00e-03, grad_scale: 32.0 2023-12-04 06:14:28,661 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=159133.33333333334, ans=0.1 2023-12-04 06:14:55,360 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=159333.33333333334, ans=0.125 2023-12-04 06:15:09,708 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.70 vs. limit=22.5 2023-12-04 06:15:11,846 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.70 vs. limit=22.5 2023-12-04 06:15:16,119 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.21 vs. limit=15.0 2023-12-04 06:15:17,605 INFO [train.py:1087] (2/4) Epoch 27, batch 650, loss[loss=0.1694, simple_loss=0.2548, pruned_loss=0.04202, over 24553.00 frames. ], tot_loss[loss=0.1727, simple_loss=0.2623, pruned_loss=0.04155, over 4624932.71 frames. 
], batch size: 66, lr: 8.99e-03, grad_scale: 16.0 2023-12-04 06:15:27,350 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=159533.33333333334, ans=0.0 2023-12-04 06:15:32,775 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.165e+02 1.395e+02 1.557e+02 1.777e+02 2.410e+02, threshold=3.114e+02, percent-clipped=0.0 2023-12-04 06:15:45,117 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=159600.0, ans=0.125 2023-12-04 06:15:52,615 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=159666.66666666666, ans=0.125 2023-12-04 06:15:56,848 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=159666.66666666666, ans=0.125 2023-12-04 06:15:57,127 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.75 vs. limit=22.5 2023-12-04 06:16:00,997 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=159733.33333333334, ans=0.125 2023-12-04 06:16:12,609 INFO [train.py:1087] (2/4) Epoch 27, batch 700, loss[loss=0.178, simple_loss=0.2662, pruned_loss=0.04496, over 24245.00 frames. ], tot_loss[loss=0.1727, simple_loss=0.2624, pruned_loss=0.04153, over 4659990.84 frames. ], batch size: 82, lr: 8.98e-03, grad_scale: 16.0 2023-12-04 06:16:25,600 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=159866.66666666666, ans=0.125 2023-12-04 06:16:41,391 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=159933.33333333334, ans=0.0 2023-12-04 06:16:50,699 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.90 vs. limit=15.0 2023-12-04 06:17:09,795 INFO [train.py:1087] (2/4) Epoch 27, batch 750, loss[loss=0.163, simple_loss=0.2546, pruned_loss=0.03571, over 24727.00 frames. ], tot_loss[loss=0.1728, simple_loss=0.2625, pruned_loss=0.04159, over 4679040.55 frames. ], batch size: 69, lr: 8.97e-03, grad_scale: 16.0 2023-12-04 06:17:21,473 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=22.5 2023-12-04 06:17:25,234 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.212e+02 1.358e+02 1.459e+02 1.607e+02 2.648e+02, threshold=2.917e+02, percent-clipped=0.0 2023-12-04 06:17:30,142 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.74 vs. limit=22.5 2023-12-04 06:17:31,353 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.28 vs. 
limit=15.0 2023-12-04 06:17:39,598 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 06:17:47,776 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=160333.33333333334, ans=0.125 2023-12-04 06:17:53,652 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.89 vs. limit=10.0 2023-12-04 06:17:55,223 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=160400.0, ans=0.125 2023-12-04 06:17:58,828 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.21 vs. limit=22.5 2023-12-04 06:18:04,561 INFO [train.py:1087] (2/4) Epoch 27, batch 800, loss[loss=0.1777, simple_loss=0.2666, pruned_loss=0.04436, over 24703.00 frames. ], tot_loss[loss=0.1728, simple_loss=0.2624, pruned_loss=0.04164, over 4712199.81 frames. ], batch size: 69, lr: 8.96e-03, grad_scale: 32.0 2023-12-04 06:18:14,816 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.43 vs. limit=15.0 2023-12-04 06:18:17,902 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=160533.33333333334, ans=0.125 2023-12-04 06:18:24,662 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=160533.33333333334, ans=22.5 2023-12-04 06:18:40,246 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=160666.66666666666, ans=0.1 2023-12-04 06:18:49,384 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=160733.33333333334, ans=0.07 2023-12-04 06:18:56,077 INFO [train.py:1087] (2/4) Epoch 27, batch 850, loss[loss=0.1604, simple_loss=0.2517, pruned_loss=0.03454, over 24554.00 frames. ], tot_loss[loss=0.1731, simple_loss=0.2626, pruned_loss=0.04176, over 4733806.19 frames. 
], batch size: 66, lr: 8.96e-03, grad_scale: 32.0 2023-12-04 06:19:00,268 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=160800.0, ans=0.1 2023-12-04 06:19:07,226 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=160866.66666666666, ans=0.125 2023-12-04 06:19:07,286 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=160866.66666666666, ans=0.0 2023-12-04 06:19:09,189 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=160866.66666666666, ans=0.125 2023-12-04 06:19:09,988 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.165e+02 1.345e+02 1.467e+02 1.580e+02 2.866e+02, threshold=2.933e+02, percent-clipped=0.0 2023-12-04 06:19:11,120 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=160866.66666666666, ans=0.125 2023-12-04 06:19:13,099 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=160866.66666666666, ans=0.1 2023-12-04 06:19:21,720 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.09 vs. limit=15.0 2023-12-04 06:19:27,426 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=161000.0, ans=0.0 2023-12-04 06:19:32,440 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=161000.0, ans=0.0 2023-12-04 06:19:54,850 INFO [train.py:1087] (2/4) Epoch 28, batch 0, loss[loss=0.1713, simple_loss=0.262, pruned_loss=0.04026, over 24570.00 frames. ], tot_loss[loss=0.1713, simple_loss=0.262, pruned_loss=0.04026, over 24570.00 frames. ], batch size: 64, lr: 8.78e-03, grad_scale: 32.0 2023-12-04 06:19:54,851 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 06:20:07,429 INFO [train.py:1119] (2/4) Epoch 28, validation: loss=0.1564, simple_loss=0.2567, pruned_loss=0.02802, over 944034.00 frames. 2023-12-04 06:20:07,429 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 06:20:26,730 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=161166.66666666666, ans=0.0 2023-12-04 06:20:34,215 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=161233.33333333334, ans=0.2 2023-12-04 06:20:34,688 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.87 vs. limit=22.5 2023-12-04 06:20:42,244 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.16 vs. limit=15.0 2023-12-04 06:20:53,422 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.77 vs. limit=15.0 2023-12-04 06:21:02,988 INFO [train.py:1087] (2/4) Epoch 28, batch 50, loss[loss=0.1583, simple_loss=0.2544, pruned_loss=0.03107, over 24786.00 frames. ], tot_loss[loss=0.1706, simple_loss=0.2608, pruned_loss=0.04016, over 1085938.21 frames. 
], batch size: 73, lr: 8.78e-03, grad_scale: 32.0 2023-12-04 06:21:06,494 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=161433.33333333334, ans=0.0 2023-12-04 06:21:23,515 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.198e+02 1.413e+02 1.493e+02 1.807e+02 2.971e+02, threshold=2.986e+02, percent-clipped=1.0 2023-12-04 06:21:28,760 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=161566.66666666666, ans=0.125 2023-12-04 06:21:57,351 INFO [train.py:1087] (2/4) Epoch 28, batch 100, loss[loss=0.1725, simple_loss=0.2644, pruned_loss=0.04027, over 24086.00 frames. ], tot_loss[loss=0.171, simple_loss=0.2616, pruned_loss=0.04021, over 1923821.60 frames. ], batch size: 87, lr: 8.77e-03, grad_scale: 32.0 2023-12-04 06:22:09,790 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=161833.33333333334, ans=0.2 2023-12-04 06:22:13,317 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.20 vs. limit=15.0 2023-12-04 06:22:27,283 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=161900.0, ans=0.0 2023-12-04 06:22:52,755 INFO [train.py:1087] (2/4) Epoch 28, batch 150, loss[loss=0.1629, simple_loss=0.2566, pruned_loss=0.03459, over 24555.00 frames. ], tot_loss[loss=0.1714, simple_loss=0.2616, pruned_loss=0.04064, over 2540418.17 frames. ], batch size: 63, lr: 8.76e-03, grad_scale: 32.0 2023-12-04 06:23:02,796 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=162100.0, ans=0.125 2023-12-04 06:23:10,176 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 06:23:14,104 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.314e+02 1.419e+02 1.618e+02 2.443e+02, threshold=2.839e+02, percent-clipped=0.0 2023-12-04 06:23:48,632 INFO [train.py:1087] (2/4) Epoch 28, batch 200, loss[loss=0.1905, simple_loss=0.2804, pruned_loss=0.05034, over 24099.00 frames. ], tot_loss[loss=0.1704, simple_loss=0.2609, pruned_loss=0.03997, over 3051461.35 frames. 
], batch size: 58, lr: 8.75e-03, grad_scale: 32.0 2023-12-04 06:23:52,134 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=162433.33333333334, ans=0.09899494936611666 2023-12-04 06:23:54,225 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 06:24:02,003 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=162500.0, ans=0.1 2023-12-04 06:24:05,824 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=162500.0, ans=0.125 2023-12-04 06:24:12,747 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=162566.66666666666, ans=0.0 2023-12-04 06:24:34,621 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=162700.0, ans=0.2 2023-12-04 06:24:45,041 INFO [train.py:1087] (2/4) Epoch 28, batch 250, loss[loss=0.1561, simple_loss=0.2483, pruned_loss=0.03197, over 24863.00 frames. ], tot_loss[loss=0.1706, simple_loss=0.2607, pruned_loss=0.04021, over 3444676.25 frames. ], batch size: 68, lr: 8.74e-03, grad_scale: 32.0 2023-12-04 06:24:56,813 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=162833.33333333334, ans=0.125 2023-12-04 06:25:05,487 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.092e+02 1.354e+02 1.477e+02 1.650e+02 2.112e+02, threshold=2.954e+02, percent-clipped=0.0 2023-12-04 06:25:07,132 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.57 vs. limit=15.0 2023-12-04 06:25:20,591 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=162966.66666666666, ans=0.1 2023-12-04 06:25:20,813 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.23 vs. limit=15.0 2023-12-04 06:25:33,606 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.24 vs. limit=15.0 2023-12-04 06:25:40,322 INFO [train.py:1087] (2/4) Epoch 28, batch 300, loss[loss=0.1652, simple_loss=0.2516, pruned_loss=0.03943, over 24781.00 frames. ], tot_loss[loss=0.1708, simple_loss=0.2609, pruned_loss=0.0403, over 3754173.47 frames. ], batch size: 64, lr: 8.73e-03, grad_scale: 32.0 2023-12-04 06:25:43,481 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=163100.0, ans=0.125 2023-12-04 06:25:45,555 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=163100.0, ans=0.125 2023-12-04 06:25:46,628 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=163100.0, ans=0.1 2023-12-04 06:26:00,674 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.12 vs. 
limit=22.5 2023-12-04 06:26:01,843 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.41 vs. limit=15.0 2023-12-04 06:26:08,688 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=163233.33333333334, ans=0.1 2023-12-04 06:26:12,665 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=163300.0, ans=0.1 2023-12-04 06:26:35,160 INFO [train.py:1087] (2/4) Epoch 28, batch 350, loss[loss=0.1653, simple_loss=0.2559, pruned_loss=0.03735, over 24766.00 frames. ], tot_loss[loss=0.1709, simple_loss=0.2609, pruned_loss=0.04044, over 3975095.19 frames. ], batch size: 70, lr: 8.73e-03, grad_scale: 32.0 2023-12-04 06:26:37,607 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=163433.33333333334, ans=0.125 2023-12-04 06:26:50,336 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=163500.0, ans=0.2 2023-12-04 06:26:53,602 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=163500.0, ans=0.1 2023-12-04 06:26:56,398 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.203e+02 1.381e+02 1.505e+02 1.634e+02 2.897e+02, threshold=3.011e+02, percent-clipped=0.0 2023-12-04 06:27:06,057 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=163566.66666666666, ans=0.025 2023-12-04 06:27:16,961 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.03 vs. limit=15.0 2023-12-04 06:27:30,267 INFO [train.py:1087] (2/4) Epoch 28, batch 400, loss[loss=0.1713, simple_loss=0.2673, pruned_loss=0.0376, over 24148.00 frames. ], tot_loss[loss=0.171, simple_loss=0.2613, pruned_loss=0.04038, over 4165391.74 frames. ], batch size: 58, lr: 8.72e-03, grad_scale: 32.0 2023-12-04 06:27:31,630 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=163766.66666666666, ans=0.125 2023-12-04 06:27:41,440 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten.whitening_limit, batch_count=163833.33333333334, ans=15.0 2023-12-04 06:27:42,979 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.18 vs. limit=15.0 2023-12-04 06:27:43,628 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=163833.33333333334, ans=0.0 2023-12-04 06:27:48,795 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.42 vs. 
limit=22.5 2023-12-04 06:28:00,669 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=163900.0, ans=0.2 2023-12-04 06:28:01,730 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=163900.0, ans=0.125 2023-12-04 06:28:15,047 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=164033.33333333334, ans=0.1 2023-12-04 06:28:22,385 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.83 vs. limit=22.5 2023-12-04 06:28:26,474 INFO [train.py:1087] (2/4) Epoch 28, batch 450, loss[loss=0.1857, simple_loss=0.2738, pruned_loss=0.04879, over 23001.00 frames. ], tot_loss[loss=0.1708, simple_loss=0.2609, pruned_loss=0.04032, over 4308928.18 frames. ], batch size: 106, lr: 8.71e-03, grad_scale: 32.0 2023-12-04 06:28:31,261 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.60 vs. limit=10.0 2023-12-04 06:28:36,162 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=164166.66666666666, ans=0.0 2023-12-04 06:28:40,613 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=164166.66666666666, ans=0.125 2023-12-04 06:28:46,906 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.121e+02 1.427e+02 1.550e+02 1.706e+02 2.380e+02, threshold=3.099e+02, percent-clipped=0.0 2023-12-04 06:28:57,806 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=164233.33333333334, ans=0.125 2023-12-04 06:29:06,139 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=164300.0, ans=0.0 2023-12-04 06:29:07,132 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=164300.0, ans=0.125 2023-12-04 06:29:16,474 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.18 vs. limit=15.0 2023-12-04 06:29:21,528 INFO [train.py:1087] (2/4) Epoch 28, batch 500, loss[loss=0.16, simple_loss=0.2543, pruned_loss=0.03287, over 24775.00 frames. ], tot_loss[loss=0.1705, simple_loss=0.2606, pruned_loss=0.04023, over 4416054.72 frames. 
], batch size: 73, lr: 8.70e-03, grad_scale: 32.0 2023-12-04 06:29:28,574 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=164433.33333333334, ans=0.0 2023-12-04 06:29:31,028 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=164433.33333333334, ans=0.2 2023-12-04 06:29:50,478 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=164566.66666666666, ans=0.1 2023-12-04 06:29:53,494 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=164566.66666666666, ans=0.1 2023-12-04 06:29:59,706 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=164633.33333333334, ans=0.125 2023-12-04 06:30:16,787 INFO [train.py:1087] (2/4) Epoch 28, batch 550, loss[loss=0.1617, simple_loss=0.2518, pruned_loss=0.03575, over 24549.00 frames. ], tot_loss[loss=0.1706, simple_loss=0.2606, pruned_loss=0.0403, over 4502561.86 frames. ], batch size: 62, lr: 8.69e-03, grad_scale: 32.0 2023-12-04 06:30:35,667 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.74 vs. limit=15.0 2023-12-04 06:30:38,241 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.161e+02 1.410e+02 1.492e+02 1.603e+02 2.301e+02, threshold=2.985e+02, percent-clipped=0.0 2023-12-04 06:30:50,953 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=22.5 2023-12-04 06:31:11,596 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=165100.0, ans=0.125 2023-12-04 06:31:12,384 INFO [train.py:1087] (2/4) Epoch 28, batch 600, loss[loss=0.1665, simple_loss=0.2533, pruned_loss=0.03984, over 24570.00 frames. ], tot_loss[loss=0.1711, simple_loss=0.2609, pruned_loss=0.04059, over 4561236.59 frames. ], batch size: 64, lr: 8.69e-03, grad_scale: 32.0 2023-12-04 06:31:19,038 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=165100.0, ans=0.2 2023-12-04 06:31:27,662 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=165166.66666666666, ans=0.125 2023-12-04 06:31:39,620 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.44 vs. limit=12.0 2023-12-04 06:31:41,394 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=165233.33333333334, ans=0.0 2023-12-04 06:31:44,904 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.16 vs. limit=15.0 2023-12-04 06:31:57,167 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=165366.66666666666, ans=0.125 2023-12-04 06:32:07,895 INFO [train.py:1087] (2/4) Epoch 28, batch 650, loss[loss=0.1714, simple_loss=0.2633, pruned_loss=0.03973, over 24694.00 frames. ], tot_loss[loss=0.1706, simple_loss=0.2606, pruned_loss=0.04025, over 4628864.94 frames. 
], batch size: 74, lr: 8.68e-03, grad_scale: 32.0 2023-12-04 06:32:10,253 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=165433.33333333334, ans=0.125 2023-12-04 06:32:18,009 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=165500.0, ans=0.125 2023-12-04 06:32:22,301 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=165500.0, ans=0.125 2023-12-04 06:32:28,974 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.188e+02 1.357e+02 1.538e+02 1.703e+02 2.135e+02, threshold=3.076e+02, percent-clipped=0.0 2023-12-04 06:32:38,027 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.05 vs. limit=22.5 2023-12-04 06:33:03,414 INFO [train.py:1087] (2/4) Epoch 28, batch 700, loss[loss=0.1648, simple_loss=0.2599, pruned_loss=0.03484, over 24561.00 frames. ], tot_loss[loss=0.1703, simple_loss=0.2605, pruned_loss=0.04006, over 4674549.48 frames. ], batch size: 64, lr: 8.67e-03, grad_scale: 32.0 2023-12-04 06:33:20,018 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=165833.33333333334, ans=0.125 2023-12-04 06:33:26,830 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=165900.0, ans=0.125 2023-12-04 06:33:27,113 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.36 vs. limit=6.0 2023-12-04 06:33:36,230 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=165966.66666666666, ans=0.125 2023-12-04 06:33:46,583 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=166033.33333333334, ans=0.125 2023-12-04 06:33:53,267 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=166033.33333333334, ans=0.0 2023-12-04 06:33:58,746 INFO [train.py:1087] (2/4) Epoch 28, batch 750, loss[loss=0.1663, simple_loss=0.255, pruned_loss=0.03881, over 24546.00 frames. ], tot_loss[loss=0.1704, simple_loss=0.2605, pruned_loss=0.04013, over 4706308.26 frames. 
], batch size: 62, lr: 8.66e-03, grad_scale: 32.0 2023-12-04 06:34:06,745 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=166100.0, ans=0.125 2023-12-04 06:34:06,859 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=166100.0, ans=0.0 2023-12-04 06:34:20,247 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.145e+02 1.371e+02 1.508e+02 1.622e+02 2.380e+02, threshold=3.017e+02, percent-clipped=0.0 2023-12-04 06:34:25,136 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=166233.33333333334, ans=0.025 2023-12-04 06:34:27,045 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=166233.33333333334, ans=0.5 2023-12-04 06:34:36,749 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=166300.0, ans=0.0 2023-12-04 06:34:37,783 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=166300.0, ans=0.1 2023-12-04 06:34:45,259 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=166366.66666666666, ans=0.0 2023-12-04 06:34:48,478 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=166366.66666666666, ans=0.05 2023-12-04 06:34:51,885 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.01 vs. limit=15.0 2023-12-04 06:34:53,460 INFO [train.py:1087] (2/4) Epoch 28, batch 800, loss[loss=0.1796, simple_loss=0.2701, pruned_loss=0.04448, over 23453.00 frames. ], tot_loss[loss=0.1708, simple_loss=0.2609, pruned_loss=0.04034, over 4729638.68 frames. ], batch size: 94, lr: 8.65e-03, grad_scale: 32.0 2023-12-04 06:35:04,249 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.47 vs. limit=12.0 2023-12-04 06:35:28,514 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=166633.33333333334, ans=0.0 2023-12-04 06:35:29,545 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=166633.33333333334, ans=0.125 2023-12-04 06:35:29,635 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=166633.33333333334, ans=0.1 2023-12-04 06:35:30,894 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=166633.33333333334, ans=0.5 2023-12-04 06:35:35,815 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=166700.0, ans=0.1 2023-12-04 06:35:36,809 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=166700.0, ans=0.125 2023-12-04 06:35:45,560 INFO [train.py:1087] (2/4) Epoch 28, batch 850, loss[loss=0.1924, simple_loss=0.2761, pruned_loss=0.05428, over 24221.00 frames. 
], tot_loss[loss=0.1711, simple_loss=0.2612, pruned_loss=0.0405, over 4747792.98 frames. ], batch size: 82, lr: 8.65e-03, grad_scale: 32.0 2023-12-04 06:36:03,517 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=166833.33333333334, ans=0.125 2023-12-04 06:36:05,283 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.159e+02 1.336e+02 1.417e+02 1.578e+02 2.215e+02, threshold=2.834e+02, percent-clipped=0.0 2023-12-04 06:36:23,818 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=166966.66666666666, ans=0.125 2023-12-04 06:36:45,588 INFO [train.py:1087] (2/4) Epoch 29, batch 0, loss[loss=0.1753, simple_loss=0.2653, pruned_loss=0.04267, over 24001.00 frames. ], tot_loss[loss=0.1753, simple_loss=0.2653, pruned_loss=0.04267, over 24001.00 frames. ], batch size: 87, lr: 8.49e-03, grad_scale: 32.0 2023-12-04 06:36:45,588 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 06:36:56,942 INFO [zipformer.py:1876] (2/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.6886, 3.3705, 3.3795, 5.3714], device='cuda:2') 2023-12-04 06:36:57,635 INFO [train.py:1119] (2/4) Epoch 29, validation: loss=0.1551, simple_loss=0.2558, pruned_loss=0.02721, over 944034.00 frames. 2023-12-04 06:36:57,636 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 06:36:57,967 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=167066.66666666666, ans=0.0 2023-12-04 06:37:08,444 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=167133.33333333334, ans=0.125 2023-12-04 06:37:51,876 INFO [train.py:1087] (2/4) Epoch 29, batch 50, loss[loss=0.1628, simple_loss=0.2541, pruned_loss=0.03573, over 24752.00 frames. ], tot_loss[loss=0.1715, simple_loss=0.2613, pruned_loss=0.04088, over 1082737.57 frames. ], batch size: 66, lr: 8.48e-03, grad_scale: 32.0 2023-12-04 06:37:54,748 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.65 vs. 
limit=22.5 2023-12-04 06:37:55,451 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=167400.0, ans=0.125 2023-12-04 06:38:04,792 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=167466.66666666666, ans=0.1 2023-12-04 06:38:10,006 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=167466.66666666666, ans=0.125 2023-12-04 06:38:11,048 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=167466.66666666666, ans=0.1 2023-12-04 06:38:18,981 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=167533.33333333334, ans=0.0 2023-12-04 06:38:19,740 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.229e+02 1.470e+02 1.626e+02 1.821e+02 2.720e+02, threshold=3.251e+02, percent-clipped=0.0 2023-12-04 06:38:19,967 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=167533.33333333334, ans=0.2 2023-12-04 06:38:19,991 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=167533.33333333334, ans=0.125 2023-12-04 06:38:40,137 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.49 vs. limit=22.5 2023-12-04 06:38:43,935 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=167666.66666666666, ans=0.125 2023-12-04 06:38:46,863 INFO [train.py:1087] (2/4) Epoch 29, batch 100, loss[loss=0.1771, simple_loss=0.2649, pruned_loss=0.04468, over 24480.00 frames. ], tot_loss[loss=0.1715, simple_loss=0.2613, pruned_loss=0.0408, over 1905488.52 frames. ], batch size: 77, lr: 8.47e-03, grad_scale: 32.0 2023-12-04 06:39:17,226 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=167866.66666666666, ans=0.0 2023-12-04 06:39:23,942 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=167933.33333333334, ans=0.0 2023-12-04 06:39:31,861 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=168000.0, ans=0.125 2023-12-04 06:39:36,597 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.74 vs. limit=15.0 2023-12-04 06:39:42,348 INFO [train.py:1087] (2/4) Epoch 29, batch 150, loss[loss=0.1706, simple_loss=0.2596, pruned_loss=0.04085, over 24553.00 frames. ], tot_loss[loss=0.1711, simple_loss=0.261, pruned_loss=0.04056, over 2563525.71 frames. 
], batch size: 66, lr: 8.46e-03, grad_scale: 32.0 2023-12-04 06:40:03,612 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=168200.0, ans=0.125 2023-12-04 06:40:10,045 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.150e+02 1.375e+02 1.505e+02 1.698e+02 3.083e+02, threshold=3.010e+02, percent-clipped=0.0 2023-12-04 06:40:24,514 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=168266.66666666666, ans=0.0 2023-12-04 06:40:37,330 INFO [train.py:1087] (2/4) Epoch 29, batch 200, loss[loss=0.18, simple_loss=0.2696, pruned_loss=0.04521, over 23482.00 frames. ], tot_loss[loss=0.1713, simple_loss=0.2611, pruned_loss=0.04072, over 3058292.35 frames. ], batch size: 94, lr: 8.46e-03, grad_scale: 32.0 2023-12-04 06:41:17,353 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=168600.0, ans=0.0 2023-12-04 06:41:19,467 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=168600.0, ans=0.125 2023-12-04 06:41:27,495 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.20 vs. limit=15.0 2023-12-04 06:41:32,668 INFO [train.py:1087] (2/4) Epoch 29, batch 250, loss[loss=0.1829, simple_loss=0.2728, pruned_loss=0.04647, over 24468.00 frames. ], tot_loss[loss=0.1711, simple_loss=0.2611, pruned_loss=0.04056, over 3468297.56 frames. ], batch size: 77, lr: 8.45e-03, grad_scale: 32.0 2023-12-04 06:41:35,960 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=168733.33333333334, ans=0.125 2023-12-04 06:41:41,363 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=168733.33333333334, ans=0.125 2023-12-04 06:41:54,241 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=168866.66666666666, ans=0.2 2023-12-04 06:41:58,877 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=168866.66666666666, ans=0.1 2023-12-04 06:41:59,580 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.135e+02 1.376e+02 1.495e+02 1.723e+02 2.429e+02, threshold=2.991e+02, percent-clipped=0.0 2023-12-04 06:42:26,619 INFO [train.py:1087] (2/4) Epoch 29, batch 300, loss[loss=0.1634, simple_loss=0.2529, pruned_loss=0.03697, over 24725.00 frames. ], tot_loss[loss=0.1712, simple_loss=0.2612, pruned_loss=0.04064, over 3755633.40 frames. ], batch size: 69, lr: 8.44e-03, grad_scale: 32.0 2023-12-04 06:42:38,388 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-04 06:42:43,570 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=169133.33333333334, ans=0.1 2023-12-04 06:42:46,722 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=169133.33333333334, ans=0.125 2023-12-04 06:43:22,059 INFO [train.py:1087] (2/4) Epoch 29, batch 350, loss[loss=0.1656, simple_loss=0.2582, pruned_loss=0.0365, over 24784.00 frames. 
], tot_loss[loss=0.1711, simple_loss=0.2611, pruned_loss=0.04057, over 3989634.11 frames. ], batch size: 73, lr: 8.43e-03, grad_scale: 32.0 2023-12-04 06:43:24,450 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=169400.0, ans=0.125 2023-12-04 06:43:43,909 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.38 vs. limit=15.0 2023-12-04 06:43:45,526 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=169533.33333333334, ans=0.05 2023-12-04 06:43:49,505 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.187e+02 1.359e+02 1.460e+02 1.582e+02 2.441e+02, threshold=2.921e+02, percent-clipped=0.0 2023-12-04 06:44:00,900 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=169600.0, ans=0.125 2023-12-04 06:44:15,646 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=169666.66666666666, ans=0.0 2023-12-04 06:44:17,599 INFO [train.py:1087] (2/4) Epoch 29, batch 400, loss[loss=0.1574, simple_loss=0.2533, pruned_loss=0.03072, over 24759.00 frames. ], tot_loss[loss=0.1706, simple_loss=0.2605, pruned_loss=0.04033, over 4177239.97 frames. ], batch size: 65, lr: 8.42e-03, grad_scale: 32.0 2023-12-04 06:44:20,029 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=169733.33333333334, ans=0.0 2023-12-04 06:44:28,502 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=169800.0, ans=0.125 2023-12-04 06:44:29,418 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=169800.0, ans=10.0 2023-12-04 06:44:32,985 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=169800.0, ans=0.2 2023-12-04 06:44:33,256 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.03 vs. limit=15.0 2023-12-04 06:44:36,795 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.33 vs. limit=15.0 2023-12-04 06:44:37,587 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=169800.0, ans=0.07 2023-12-04 06:44:55,855 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=169933.33333333334, ans=0.0 2023-12-04 06:45:12,743 INFO [train.py:1087] (2/4) Epoch 29, batch 450, loss[loss=0.1639, simple_loss=0.2598, pruned_loss=0.03399, over 24714.00 frames. ], tot_loss[loss=0.1705, simple_loss=0.2605, pruned_loss=0.04021, over 4314871.08 frames. 
], batch size: 67, lr: 8.42e-03, grad_scale: 32.0 2023-12-04 06:45:19,765 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=170066.66666666666, ans=0.0 2023-12-04 06:45:31,420 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=170133.33333333334, ans=0.0 2023-12-04 06:45:40,585 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.168e+02 1.390e+02 1.484e+02 1.747e+02 2.437e+02, threshold=2.967e+02, percent-clipped=0.0 2023-12-04 06:45:44,438 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=170200.0, ans=0.125 2023-12-04 06:45:48,670 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=170266.66666666666, ans=0.0 2023-12-04 06:46:01,771 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.29 vs. limit=15.0 2023-12-04 06:46:07,941 INFO [train.py:1087] (2/4) Epoch 29, batch 500, loss[loss=0.1793, simple_loss=0.267, pruned_loss=0.04584, over 24758.00 frames. ], tot_loss[loss=0.1708, simple_loss=0.2607, pruned_loss=0.04045, over 4401160.50 frames. ], batch size: 63, lr: 8.41e-03, grad_scale: 32.0 2023-12-04 06:46:23,246 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=170466.66666666666, ans=0.0 2023-12-04 06:46:24,501 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-12-04 06:46:26,292 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=170466.66666666666, ans=0.125 2023-12-04 06:46:38,315 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=170533.33333333334, ans=0.0 2023-12-04 06:46:51,255 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.07 vs. limit=10.0 2023-12-04 06:46:57,592 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=170666.66666666666, ans=0.125 2023-12-04 06:47:03,540 INFO [train.py:1087] (2/4) Epoch 29, batch 550, loss[loss=0.1524, simple_loss=0.2446, pruned_loss=0.03014, over 24745.00 frames. ], tot_loss[loss=0.1713, simple_loss=0.2611, pruned_loss=0.04077, over 4469961.67 frames. ], batch size: 61, lr: 8.40e-03, grad_scale: 32.0 2023-12-04 06:47:14,433 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=170800.0, ans=0.125 2023-12-04 06:47:22,071 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=170800.0, ans=0.0 2023-12-04 06:47:31,297 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.156e+02 1.371e+02 1.473e+02 1.639e+02 2.975e+02, threshold=2.945e+02, percent-clipped=1.0 2023-12-04 06:47:47,521 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.52 vs. 
limit=15.0 2023-12-04 06:47:58,651 INFO [train.py:1087] (2/4) Epoch 29, batch 600, loss[loss=0.1682, simple_loss=0.2539, pruned_loss=0.04127, over 24765.00 frames. ], tot_loss[loss=0.1709, simple_loss=0.2608, pruned_loss=0.04049, over 4548968.18 frames. ], batch size: 64, lr: 8.39e-03, grad_scale: 32.0 2023-12-04 06:48:26,218 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=171200.0, ans=0.2 2023-12-04 06:48:54,930 INFO [train.py:1087] (2/4) Epoch 29, batch 650, loss[loss=0.172, simple_loss=0.2575, pruned_loss=0.0432, over 24467.00 frames. ], tot_loss[loss=0.1706, simple_loss=0.2606, pruned_loss=0.04029, over 4604233.63 frames. ], batch size: 75, lr: 8.39e-03, grad_scale: 32.0 2023-12-04 06:49:03,687 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=171400.0, ans=0.0 2023-12-04 06:49:03,708 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=171400.0, ans=0.125 2023-12-04 06:49:05,749 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=171466.66666666666, ans=0.0 2023-12-04 06:49:20,724 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.20 vs. limit=15.0 2023-12-04 06:49:22,501 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.140e+02 1.349e+02 1.422e+02 1.549e+02 2.709e+02, threshold=2.844e+02, percent-clipped=0.0 2023-12-04 06:49:22,776 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=171533.33333333334, ans=0.1 2023-12-04 06:49:24,322 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.72 vs. limit=10.0 2023-12-04 06:49:32,618 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=171600.0, ans=0.125 2023-12-04 06:49:45,426 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.27 vs. limit=22.5 2023-12-04 06:49:50,067 INFO [train.py:1087] (2/4) Epoch 29, batch 700, loss[loss=0.1716, simple_loss=0.2626, pruned_loss=0.04024, over 24218.00 frames. ], tot_loss[loss=0.1705, simple_loss=0.2607, pruned_loss=0.04022, over 4639084.84 frames. ], batch size: 58, lr: 8.38e-03, grad_scale: 32.0 2023-12-04 06:49:53,827 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=171733.33333333334, ans=0.2 2023-12-04 06:49:59,558 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=171733.33333333334, ans=0.0 2023-12-04 06:50:01,916 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.19 vs. 
limit=15.0 2023-12-04 06:50:19,361 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=171866.66666666666, ans=0.125 2023-12-04 06:50:44,093 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=172000.0, ans=0.0 2023-12-04 06:50:46,329 INFO [train.py:1087] (2/4) Epoch 29, batch 750, loss[loss=0.166, simple_loss=0.2559, pruned_loss=0.03799, over 24726.00 frames. ], tot_loss[loss=0.1702, simple_loss=0.2604, pruned_loss=0.04001, over 4678849.44 frames. ], batch size: 67, lr: 8.37e-03, grad_scale: 32.0 2023-12-04 06:50:51,329 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=172066.66666666666, ans=0.1 2023-12-04 06:50:52,225 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=172066.66666666666, ans=0.2 2023-12-04 06:50:54,846 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=172066.66666666666, ans=0.0 2023-12-04 06:51:13,163 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=172200.0, ans=0.0 2023-12-04 06:51:13,866 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.143e+02 1.338e+02 1.442e+02 1.731e+02 2.430e+02, threshold=2.884e+02, percent-clipped=0.0 2023-12-04 06:51:33,176 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=172333.33333333334, ans=0.0 2023-12-04 06:51:41,779 INFO [train.py:1087] (2/4) Epoch 29, batch 800, loss[loss=0.1721, simple_loss=0.2621, pruned_loss=0.04106, over 21268.00 frames. ], tot_loss[loss=0.1703, simple_loss=0.2606, pruned_loss=0.04004, over 4698349.77 frames. ], batch size: 127, lr: 8.36e-03, grad_scale: 32.0 2023-12-04 06:52:16,619 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=172600.0, ans=0.2 2023-12-04 06:52:20,876 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.92 vs. limit=12.0 2023-12-04 06:52:33,318 INFO [train.py:1087] (2/4) Epoch 29, batch 850, loss[loss=0.1636, simple_loss=0.2524, pruned_loss=0.03736, over 24573.00 frames. ], tot_loss[loss=0.1705, simple_loss=0.2607, pruned_loss=0.04011, over 4722997.58 frames. 
], batch size: 64, lr: 8.36e-03, grad_scale: 32.0 2023-12-04 06:52:58,525 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.185e+02 1.381e+02 1.487e+02 1.682e+02 2.314e+02, threshold=2.974e+02, percent-clipped=0.0 2023-12-04 06:53:08,587 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=172933.33333333334, ans=0.125 2023-12-04 06:53:11,568 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=172933.33333333334, ans=0.0 2023-12-04 06:53:14,570 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=173000.0, ans=0.0 2023-12-04 06:53:15,587 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=173000.0, ans=0.2 2023-12-04 06:53:34,384 INFO [train.py:1087] (2/4) Epoch 30, batch 0, loss[loss=0.1719, simple_loss=0.2671, pruned_loss=0.03836, over 24808.00 frames. ], tot_loss[loss=0.1719, simple_loss=0.2671, pruned_loss=0.03836, over 24808.00 frames. ], batch size: 72, lr: 8.21e-03, grad_scale: 32.0 2023-12-04 06:53:34,385 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 06:53:42,344 INFO [zipformer.py:1876] (2/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([6.6923, 6.4716, 6.0522, 6.1210], device='cuda:2') 2023-12-04 06:53:46,442 INFO [train.py:1119] (2/4) Epoch 30, validation: loss=0.155, simple_loss=0.2554, pruned_loss=0.02733, over 944034.00 frames. 2023-12-04 06:53:46,443 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 06:54:07,280 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=173166.66666666666, ans=0.1 2023-12-04 06:54:09,491 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=173166.66666666666, ans=0.0 2023-12-04 06:54:24,897 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=173233.33333333334, ans=0.125 2023-12-04 06:54:25,395 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.49 vs. limit=12.0 2023-12-04 06:54:32,714 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.37 vs. limit=15.0 2023-12-04 06:54:41,379 INFO [train.py:1087] (2/4) Epoch 30, batch 50, loss[loss=0.1611, simple_loss=0.2578, pruned_loss=0.03217, over 24844.00 frames. ], tot_loss[loss=0.1714, simple_loss=0.2618, pruned_loss=0.04049, over 1079289.95 frames. 
], batch size: 68, lr: 8.20e-03, grad_scale: 32.0 2023-12-04 06:54:41,550 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=173366.66666666666, ans=0.125 2023-12-04 06:54:42,726 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=173366.66666666666, ans=0.0 2023-12-04 06:54:51,999 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=173433.33333333334, ans=0.125 2023-12-04 06:55:05,113 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=173500.0, ans=0.0 2023-12-04 06:55:07,621 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=173500.0, ans=0.2 2023-12-04 06:55:14,576 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.164e+02 1.323e+02 1.434e+02 1.672e+02 2.619e+02, threshold=2.868e+02, percent-clipped=0.0 2023-12-04 06:55:19,418 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=173566.66666666666, ans=0.0 2023-12-04 06:55:22,742 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.82 vs. limit=22.5 2023-12-04 06:55:26,105 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=173633.33333333334, ans=0.2 2023-12-04 06:55:36,608 INFO [train.py:1087] (2/4) Epoch 30, batch 100, loss[loss=0.155, simple_loss=0.2461, pruned_loss=0.03198, over 24797.00 frames. ], tot_loss[loss=0.171, simple_loss=0.2615, pruned_loss=0.04029, over 1916188.82 frames. ], batch size: 72, lr: 8.19e-03, grad_scale: 32.0 2023-12-04 06:55:44,592 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=173700.0, ans=0.125 2023-12-04 06:56:18,317 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=173900.0, ans=0.125 2023-12-04 06:56:21,548 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=173966.66666666666, ans=0.125 2023-12-04 06:56:25,641 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.28 vs. limit=15.0 2023-12-04 06:56:26,329 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=173966.66666666666, ans=0.1 2023-12-04 06:56:31,349 INFO [train.py:1087] (2/4) Epoch 30, batch 150, loss[loss=0.1716, simple_loss=0.259, pruned_loss=0.04207, over 24574.00 frames. ], tot_loss[loss=0.1703, simple_loss=0.2605, pruned_loss=0.04007, over 2555635.01 frames. 
], batch size: 62, lr: 8.19e-03, grad_scale: 32.0 2023-12-04 06:56:39,361 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=174033.33333333334, ans=0.07 2023-12-04 06:57:02,645 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.whiten.whitening_limit, batch_count=174166.66666666666, ans=12.0 2023-12-04 06:57:05,075 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.152e+02 1.321e+02 1.434e+02 1.605e+02 3.805e+02, threshold=2.868e+02, percent-clipped=1.0 2023-12-04 06:57:07,912 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=174233.33333333334, ans=0.1 2023-12-04 06:57:18,749 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=174300.0, ans=0.1 2023-12-04 06:57:21,946 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=174300.0, ans=0.125 2023-12-04 06:57:22,021 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=174300.0, ans=0.2 2023-12-04 06:57:24,070 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=174300.0, ans=0.09899494936611666 2023-12-04 06:57:26,223 INFO [train.py:1087] (2/4) Epoch 30, batch 200, loss[loss=0.1721, simple_loss=0.2644, pruned_loss=0.03994, over 24688.00 frames. ], tot_loss[loss=0.1695, simple_loss=0.2597, pruned_loss=0.03963, over 3051717.37 frames. ], batch size: 74, lr: 8.18e-03, grad_scale: 16.0 2023-12-04 06:57:31,171 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=174366.66666666666, ans=0.0 2023-12-04 06:58:04,830 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.69 vs. limit=15.0 2023-12-04 06:58:18,737 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=174633.33333333334, ans=0.0 2023-12-04 06:58:19,943 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=174633.33333333334, ans=0.2 2023-12-04 06:58:21,717 INFO [train.py:1087] (2/4) Epoch 30, batch 250, loss[loss=0.1703, simple_loss=0.266, pruned_loss=0.0373, over 24157.00 frames. ], tot_loss[loss=0.1699, simple_loss=0.26, pruned_loss=0.03987, over 3437839.88 frames. ], batch size: 58, lr: 8.17e-03, grad_scale: 16.0 2023-12-04 06:58:45,706 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=174833.33333333334, ans=0.125 2023-12-04 06:58:55,598 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.24 vs. limit=15.0 2023-12-04 06:58:56,036 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.146e+02 1.388e+02 1.491e+02 1.678e+02 2.390e+02, threshold=2.982e+02, percent-clipped=0.0 2023-12-04 06:59:00,972 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=174900.0, ans=0.0 2023-12-04 06:59:17,731 INFO [train.py:1087] (2/4) Epoch 30, batch 300, loss[loss=0.1646, simple_loss=0.2568, pruned_loss=0.03615, over 24797.00 frames. 
], tot_loss[loss=0.1701, simple_loss=0.2602, pruned_loss=0.04002, over 3725842.17 frames. ], batch size: 72, lr: 8.16e-03, grad_scale: 16.0 2023-12-04 06:59:31,124 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=175100.0, ans=0.125 2023-12-04 06:59:33,354 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=175100.0, ans=0.125 2023-12-04 06:59:34,784 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.89 vs. limit=22.5 2023-12-04 06:59:59,893 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=175233.33333333334, ans=0.0 2023-12-04 07:00:03,479 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=175300.0, ans=0.0 2023-12-04 07:00:07,587 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=175300.0, ans=0.0 2023-12-04 07:00:09,938 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=175300.0, ans=0.1 2023-12-04 07:00:13,213 INFO [train.py:1087] (2/4) Epoch 30, batch 350, loss[loss=0.1752, simple_loss=0.2607, pruned_loss=0.04481, over 24498.00 frames. ], tot_loss[loss=0.17, simple_loss=0.2598, pruned_loss=0.04009, over 3952118.84 frames. ], batch size: 77, lr: 8.16e-03, grad_scale: 16.0 2023-12-04 07:00:25,924 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=175433.33333333334, ans=0.125 2023-12-04 07:00:28,186 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=175433.33333333334, ans=0.125 2023-12-04 07:00:43,595 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=175500.0, ans=0.125 2023-12-04 07:00:48,226 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.146e+02 1.393e+02 1.497e+02 1.658e+02 2.053e+02, threshold=2.994e+02, percent-clipped=0.0 2023-12-04 07:00:49,994 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=175566.66666666666, ans=0.04949747468305833 2023-12-04 07:00:54,201 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=175566.66666666666, ans=0.125 2023-12-04 07:01:09,273 INFO [train.py:1087] (2/4) Epoch 30, batch 400, loss[loss=0.162, simple_loss=0.2553, pruned_loss=0.03432, over 24787.00 frames. ], tot_loss[loss=0.17, simple_loss=0.26, pruned_loss=0.03998, over 4138205.08 frames. 
], batch size: 73, lr: 8.15e-03, grad_scale: 32.0 2023-12-04 07:01:20,798 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 07:01:32,645 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=175833.33333333334, ans=0.0 2023-12-04 07:01:48,101 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=175900.0, ans=0.125 2023-12-04 07:01:56,850 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=175966.66666666666, ans=0.0 2023-12-04 07:02:04,272 INFO [train.py:1087] (2/4) Epoch 30, batch 450, loss[loss=0.1744, simple_loss=0.2612, pruned_loss=0.04378, over 24000.00 frames. ], tot_loss[loss=0.1697, simple_loss=0.2598, pruned_loss=0.03975, over 4290011.93 frames. ], batch size: 87, lr: 8.14e-03, grad_scale: 32.0 2023-12-04 07:02:15,752 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=176100.0, ans=0.05 2023-12-04 07:02:18,714 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=176100.0, ans=0.0 2023-12-04 07:02:18,765 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=176100.0, ans=0.125 2023-12-04 07:02:38,482 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.195e+02 1.360e+02 1.451e+02 1.639e+02 2.185e+02, threshold=2.903e+02, percent-clipped=0.0 2023-12-04 07:02:48,788 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=176300.0, ans=0.0 2023-12-04 07:02:50,238 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=176300.0, ans=0.125 2023-12-04 07:03:00,276 INFO [train.py:1087] (2/4) Epoch 30, batch 500, loss[loss=0.1753, simple_loss=0.2637, pruned_loss=0.04345, over 24255.00 frames. ], tot_loss[loss=0.1699, simple_loss=0.2598, pruned_loss=0.03993, over 4393122.27 frames. ], batch size: 82, lr: 8.13e-03, grad_scale: 32.0 2023-12-04 07:03:16,212 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=22.5 2023-12-04 07:03:20,489 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=176433.33333333334, ans=0.125 2023-12-04 07:03:28,391 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=176500.0, ans=0.05 2023-12-04 07:03:55,704 INFO [train.py:1087] (2/4) Epoch 30, batch 550, loss[loss=0.1611, simple_loss=0.2531, pruned_loss=0.03451, over 24849.00 frames. ], tot_loss[loss=0.1693, simple_loss=0.2595, pruned_loss=0.03956, over 4495380.99 frames. ], batch size: 68, lr: 8.13e-03, grad_scale: 32.0 2023-12-04 07:04:12,697 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=176766.66666666666, ans=0.0 2023-12-04 07:04:15,288 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.06 vs. 
limit=15.0 2023-12-04 07:04:30,798 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.234e+02 1.385e+02 1.470e+02 1.609e+02 2.556e+02, threshold=2.939e+02, percent-clipped=0.0 2023-12-04 07:04:37,516 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=176900.0, ans=0.125 2023-12-04 07:04:51,870 INFO [train.py:1087] (2/4) Epoch 30, batch 600, loss[loss=0.1602, simple_loss=0.2531, pruned_loss=0.03366, over 24763.00 frames. ], tot_loss[loss=0.1688, simple_loss=0.259, pruned_loss=0.03927, over 4590875.73 frames. ], batch size: 66, lr: 8.12e-03, grad_scale: 32.0 2023-12-04 07:04:51,981 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=177033.33333333334, ans=0.125 2023-12-04 07:04:59,895 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=177033.33333333334, ans=0.2 2023-12-04 07:05:38,774 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=177300.0, ans=0.0 2023-12-04 07:05:47,483 INFO [train.py:1087] (2/4) Epoch 30, batch 650, loss[loss=0.1559, simple_loss=0.2491, pruned_loss=0.0314, over 24777.00 frames. ], tot_loss[loss=0.1688, simple_loss=0.259, pruned_loss=0.03929, over 4636770.87 frames. ], batch size: 71, lr: 8.11e-03, grad_scale: 32.0 2023-12-04 07:06:09,906 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=177500.0, ans=0.0 2023-12-04 07:06:21,614 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.163e+02 1.367e+02 1.457e+02 1.593e+02 2.870e+02, threshold=2.913e+02, percent-clipped=0.0 2023-12-04 07:06:24,362 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=177566.66666666666, ans=0.1 2023-12-04 07:06:27,690 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=177566.66666666666, ans=0.2 2023-12-04 07:06:36,863 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.16 vs. limit=22.5 2023-12-04 07:06:41,373 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=177633.33333333334, ans=0.2 2023-12-04 07:06:43,207 INFO [train.py:1087] (2/4) Epoch 30, batch 700, loss[loss=0.1736, simple_loss=0.2612, pruned_loss=0.04301, over 24758.00 frames. ], tot_loss[loss=0.169, simple_loss=0.259, pruned_loss=0.03954, over 4660897.56 frames. 
], batch size: 61, lr: 8.11e-03, grad_scale: 32.0 2023-12-04 07:06:44,667 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=177700.0, ans=0.125 2023-12-04 07:06:51,318 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 07:07:17,859 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=177900.0, ans=0.1 2023-12-04 07:07:21,126 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=177900.0, ans=0.125 2023-12-04 07:07:26,937 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=177966.66666666666, ans=0.0 2023-12-04 07:07:37,853 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=178033.33333333334, ans=0.125 2023-12-04 07:07:37,868 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=178033.33333333334, ans=0.0 2023-12-04 07:07:38,658 INFO [train.py:1087] (2/4) Epoch 30, batch 750, loss[loss=0.1708, simple_loss=0.2615, pruned_loss=0.04007, over 24749.00 frames. ], tot_loss[loss=0.1689, simple_loss=0.2589, pruned_loss=0.03944, over 4695364.78 frames. ], batch size: 61, lr: 8.10e-03, grad_scale: 32.0 2023-12-04 07:07:47,795 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=178033.33333333334, ans=0.125 2023-12-04 07:07:57,829 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=178100.0, ans=0.0 2023-12-04 07:08:13,528 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.101e+02 1.341e+02 1.453e+02 1.706e+02 2.595e+02, threshold=2.907e+02, percent-clipped=0.0 2023-12-04 07:08:34,608 INFO [train.py:1087] (2/4) Epoch 30, batch 800, loss[loss=0.1812, simple_loss=0.2708, pruned_loss=0.04584, over 23922.00 frames. ], tot_loss[loss=0.169, simple_loss=0.259, pruned_loss=0.03951, over 4722544.39 frames. ], batch size: 87, lr: 8.09e-03, grad_scale: 32.0 2023-12-04 07:08:58,873 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 07:09:06,008 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=178566.66666666666, ans=0.125 2023-12-04 07:09:26,660 INFO [train.py:1087] (2/4) Epoch 30, batch 850, loss[loss=0.1706, simple_loss=0.2595, pruned_loss=0.04089, over 21583.00 frames. ], tot_loss[loss=0.1694, simple_loss=0.2594, pruned_loss=0.0397, over 4727782.96 frames. ], batch size: 128, lr: 8.09e-03, grad_scale: 16.0 2023-12-04 07:09:28,008 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.44 vs. 
limit=22.5 2023-12-04 07:09:41,861 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=178766.66666666666, ans=0.5 2023-12-04 07:09:44,800 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=178766.66666666666, ans=0.0 2023-12-04 07:09:58,683 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.143e+02 1.343e+02 1.447e+02 1.578e+02 2.521e+02, threshold=2.894e+02, percent-clipped=0.0 2023-12-04 07:10:08,568 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=178966.66666666666, ans=0.0 2023-12-04 07:10:25,785 INFO [train.py:1087] (2/4) Epoch 31, batch 0, loss[loss=0.1615, simple_loss=0.255, pruned_loss=0.03395, over 24695.00 frames. ], tot_loss[loss=0.1615, simple_loss=0.255, pruned_loss=0.03395, over 24695.00 frames. ], batch size: 74, lr: 7.95e-03, grad_scale: 32.0 2023-12-04 07:10:25,786 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 07:10:36,967 INFO [zipformer.py:1876] (2/4) name=encoder.encoders.3.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([3.7416, 4.2297, 3.6570, 4.5079, 4.1488, 3.8769, 4.3478, 4.1163], device='cuda:2') 2023-12-04 07:10:37,262 INFO [zipformer.py:1876] (2/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.0970, 4.4301, 4.0829, 4.6861], device='cuda:2') 2023-12-04 07:10:38,037 INFO [train.py:1119] (2/4) Epoch 31, validation: loss=0.1549, simple_loss=0.2551, pruned_loss=0.02731, over 944034.00 frames. 2023-12-04 07:10:38,037 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 07:10:41,398 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=179000.0, ans=0.125 2023-12-04 07:10:43,801 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.15 vs. limit=15.0 2023-12-04 07:10:45,600 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=179000.0, ans=0.125 2023-12-04 07:11:11,234 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=179200.0, ans=0.0 2023-12-04 07:11:17,466 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=179200.0, ans=0.125 2023-12-04 07:11:26,605 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=179266.66666666666, ans=0.0 2023-12-04 07:11:28,863 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=179266.66666666666, ans=0.125 2023-12-04 07:11:32,494 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=179333.33333333334, ans=0.2 2023-12-04 07:11:33,320 INFO [train.py:1087] (2/4) Epoch 31, batch 50, loss[loss=0.1743, simple_loss=0.2676, pruned_loss=0.04053, over 24463.00 frames. ], tot_loss[loss=0.1696, simple_loss=0.2601, pruned_loss=0.0395, over 1078992.62 frames. 
], batch size: 77, lr: 7.94e-03, grad_scale: 32.0 2023-12-04 07:11:34,627 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=179333.33333333334, ans=0.09899494936611666 2023-12-04 07:11:45,588 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 07:11:54,353 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.51 vs. limit=15.0 2023-12-04 07:12:01,375 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=179466.66666666666, ans=0.2 2023-12-04 07:12:03,513 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=179466.66666666666, ans=0.1 2023-12-04 07:12:03,615 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=179466.66666666666, ans=0.125 2023-12-04 07:12:13,926 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.146e+02 1.356e+02 1.502e+02 1.739e+02 2.837e+02, threshold=3.004e+02, percent-clipped=0.0 2023-12-04 07:12:28,789 INFO [train.py:1087] (2/4) Epoch 31, batch 100, loss[loss=0.1898, simple_loss=0.2742, pruned_loss=0.0527, over 16840.00 frames. ], tot_loss[loss=0.1689, simple_loss=0.2592, pruned_loss=0.03933, over 1894998.59 frames. ], batch size: 179, lr: 7.93e-03, grad_scale: 32.0 2023-12-04 07:13:13,539 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=179933.33333333334, ans=0.2 2023-12-04 07:13:24,536 INFO [train.py:1087] (2/4) Epoch 31, batch 150, loss[loss=0.1501, simple_loss=0.2421, pruned_loss=0.02906, over 24758.00 frames. ], tot_loss[loss=0.1681, simple_loss=0.2585, pruned_loss=0.0388, over 2546459.88 frames. ], batch size: 64, lr: 7.92e-03, grad_scale: 32.0 2023-12-04 07:13:35,093 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 07:13:56,154 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=180133.33333333334, ans=0.0 2023-12-04 07:14:03,102 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=180200.0, ans=0.0 2023-12-04 07:14:06,421 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.186e+02 1.368e+02 1.478e+02 1.660e+02 2.413e+02, threshold=2.957e+02, percent-clipped=0.0 2023-12-04 07:14:17,187 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=180266.66666666666, ans=0.125 2023-12-04 07:14:20,133 INFO [train.py:1087] (2/4) Epoch 31, batch 200, loss[loss=0.1654, simple_loss=0.2561, pruned_loss=0.0374, over 24809.00 frames. ], tot_loss[loss=0.1681, simple_loss=0.2586, pruned_loss=0.0388, over 3039485.55 frames. 
], batch size: 73, lr: 7.92e-03, grad_scale: 32.0 2023-12-04 07:14:20,407 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=180333.33333333334, ans=0.125 2023-12-04 07:14:26,839 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=180333.33333333334, ans=0.5 2023-12-04 07:14:40,370 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.56 vs. limit=15.0 2023-12-04 07:15:14,216 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.93 vs. limit=15.0 2023-12-04 07:15:15,745 INFO [train.py:1087] (2/4) Epoch 31, batch 250, loss[loss=0.1683, simple_loss=0.2597, pruned_loss=0.03848, over 24576.00 frames. ], tot_loss[loss=0.1685, simple_loss=0.2588, pruned_loss=0.03908, over 3435263.89 frames. ], batch size: 65, lr: 7.91e-03, grad_scale: 32.0 2023-12-04 07:15:31,805 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=180733.33333333334, ans=0.125 2023-12-04 07:15:55,903 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.359e+02 1.512e+02 1.653e+02 2.162e+02, threshold=3.025e+02, percent-clipped=0.0 2023-12-04 07:16:00,628 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.20 vs. limit=15.0 2023-12-04 07:16:02,645 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=180933.33333333334, ans=0.125 2023-12-04 07:16:10,752 INFO [train.py:1087] (2/4) Epoch 31, batch 300, loss[loss=0.1766, simple_loss=0.2686, pruned_loss=0.04229, over 24782.00 frames. ], tot_loss[loss=0.1687, simple_loss=0.2591, pruned_loss=0.03911, over 3749109.05 frames. ], batch size: 71, lr: 7.90e-03, grad_scale: 32.0 2023-12-04 07:16:23,970 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=181066.66666666666, ans=0.0 2023-12-04 07:16:36,382 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=181133.33333333334, ans=0.0 2023-12-04 07:16:37,403 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=181133.33333333334, ans=0.2 2023-12-04 07:16:38,728 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.26 vs. limit=15.0 2023-12-04 07:16:41,191 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.73 vs. limit=22.5 2023-12-04 07:16:46,317 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=181200.0, ans=0.125 2023-12-04 07:17:03,917 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=181333.33333333334, ans=0.015 2023-12-04 07:17:05,138 INFO [train.py:1087] (2/4) Epoch 31, batch 350, loss[loss=0.1588, simple_loss=0.2502, pruned_loss=0.03377, over 24758.00 frames. ], tot_loss[loss=0.1684, simple_loss=0.2591, pruned_loss=0.03891, over 3994941.74 frames. 
], batch size: 65, lr: 7.90e-03, grad_scale: 32.0 2023-12-04 07:17:15,375 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=181333.33333333334, ans=0.1 2023-12-04 07:17:46,589 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=181533.33333333334, ans=0.125 2023-12-04 07:17:47,357 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.102e+02 1.342e+02 1.424e+02 1.549e+02 2.896e+02, threshold=2.849e+02, percent-clipped=0.0 2023-12-04 07:18:01,136 INFO [train.py:1087] (2/4) Epoch 31, batch 400, loss[loss=0.1583, simple_loss=0.2489, pruned_loss=0.03384, over 24712.00 frames. ], tot_loss[loss=0.1686, simple_loss=0.2592, pruned_loss=0.03894, over 4168804.11 frames. ], batch size: 67, lr: 7.89e-03, grad_scale: 32.0 2023-12-04 07:18:26,006 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=181800.0, ans=0.1 2023-12-04 07:18:33,363 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=181866.66666666666, ans=0.0 2023-12-04 07:18:37,695 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=181866.66666666666, ans=0.125 2023-12-04 07:18:39,702 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=181866.66666666666, ans=0.125 2023-12-04 07:18:55,856 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=182000.0, ans=0.2 2023-12-04 07:18:56,750 INFO [train.py:1087] (2/4) Epoch 31, batch 450, loss[loss=0.1647, simple_loss=0.254, pruned_loss=0.03774, over 24707.00 frames. ], tot_loss[loss=0.1682, simple_loss=0.2588, pruned_loss=0.03878, over 4312669.37 frames. ], batch size: 69, lr: 7.88e-03, grad_scale: 32.0 2023-12-04 07:18:57,964 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=182000.0, ans=0.05 2023-12-04 07:19:15,888 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=182066.66666666666, ans=0.0 2023-12-04 07:19:27,069 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=182133.33333333334, ans=0.125 2023-12-04 07:19:33,814 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.44 vs. limit=15.0 2023-12-04 07:19:36,519 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=182200.0, ans=0.5 2023-12-04 07:19:37,346 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.132e+02 1.319e+02 1.420e+02 1.584e+02 2.231e+02, threshold=2.839e+02, percent-clipped=0.0 2023-12-04 07:19:38,749 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=182200.0, ans=0.0 2023-12-04 07:19:52,202 INFO [train.py:1087] (2/4) Epoch 31, batch 500, loss[loss=0.1704, simple_loss=0.2536, pruned_loss=0.04357, over 24470.00 frames. ], tot_loss[loss=0.1681, simple_loss=0.2587, pruned_loss=0.03876, over 4420237.98 frames. 
], batch size: 75, lr: 7.88e-03, grad_scale: 32.0 2023-12-04 07:19:56,143 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.65 vs. limit=22.5 2023-12-04 07:19:56,885 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.39 vs. limit=22.5 2023-12-04 07:19:59,168 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=182333.33333333334, ans=0.125 2023-12-04 07:20:02,376 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=182400.0, ans=0.0 2023-12-04 07:20:08,614 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=182400.0, ans=0.0 2023-12-04 07:20:27,400 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=182533.33333333334, ans=0.125 2023-12-04 07:20:29,325 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=182533.33333333334, ans=0.125 2023-12-04 07:20:38,410 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=182600.0, ans=0.2 2023-12-04 07:20:44,176 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.66 vs. limit=22.5 2023-12-04 07:20:46,817 INFO [train.py:1087] (2/4) Epoch 31, batch 550, loss[loss=0.1659, simple_loss=0.2617, pruned_loss=0.03509, over 24793.00 frames. ], tot_loss[loss=0.1682, simple_loss=0.2589, pruned_loss=0.0388, over 4513213.99 frames. ], batch size: 72, lr: 7.87e-03, grad_scale: 32.0 2023-12-04 07:21:06,813 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=182733.33333333334, ans=0.2 2023-12-04 07:21:09,815 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=182800.0, ans=0.125 2023-12-04 07:21:28,708 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.198e+02 1.420e+02 1.529e+02 1.688e+02 2.421e+02, threshold=3.059e+02, percent-clipped=0.0 2023-12-04 07:21:35,942 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.58 vs. limit=10.0 2023-12-04 07:21:36,584 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=182933.33333333334, ans=0.0 2023-12-04 07:21:37,687 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=182933.33333333334, ans=0.0 2023-12-04 07:21:37,689 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=182933.33333333334, ans=0.0 2023-12-04 07:21:42,749 INFO [train.py:1087] (2/4) Epoch 31, batch 600, loss[loss=0.1648, simple_loss=0.2558, pruned_loss=0.03693, over 21587.00 frames. ], tot_loss[loss=0.1686, simple_loss=0.259, pruned_loss=0.03906, over 4573518.47 frames. 
], batch size: 127, lr: 7.86e-03, grad_scale: 32.0 2023-12-04 07:22:11,045 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=183133.33333333334, ans=0.0 2023-12-04 07:22:17,388 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=183200.0, ans=0.09899494936611666 2023-12-04 07:22:26,721 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=183266.66666666666, ans=0.1 2023-12-04 07:22:35,143 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=183266.66666666666, ans=0.2 2023-12-04 07:22:36,456 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.68 vs. limit=15.0 2023-12-04 07:22:38,098 INFO [train.py:1087] (2/4) Epoch 31, batch 650, loss[loss=0.1677, simple_loss=0.2604, pruned_loss=0.03749, over 24709.00 frames. ], tot_loss[loss=0.1686, simple_loss=0.2591, pruned_loss=0.03901, over 4622726.51 frames. ], batch size: 67, lr: 7.86e-03, grad_scale: 32.0 2023-12-04 07:22:40,861 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=183333.33333333334, ans=0.0 2023-12-04 07:22:52,929 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=183400.0, ans=0.125 2023-12-04 07:23:04,147 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=183466.66666666666, ans=0.1 2023-12-04 07:23:20,162 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.144e+02 1.330e+02 1.393e+02 1.586e+02 2.143e+02, threshold=2.786e+02, percent-clipped=0.0 2023-12-04 07:23:20,673 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=22.5 2023-12-04 07:23:30,244 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=183600.0, ans=0.0 2023-12-04 07:23:34,292 INFO [train.py:1087] (2/4) Epoch 31, batch 700, loss[loss=0.1611, simple_loss=0.2509, pruned_loss=0.03568, over 24798.00 frames. ], tot_loss[loss=0.1686, simple_loss=0.2591, pruned_loss=0.03906, over 4657048.12 frames. ], batch size: 62, lr: 7.85e-03, grad_scale: 16.0 2023-12-04 07:23:34,738 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.54 vs. 
limit=22.5 2023-12-04 07:23:42,034 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=183666.66666666666, ans=0.0 2023-12-04 07:23:47,644 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=183733.33333333334, ans=0.0 2023-12-04 07:23:47,715 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=183733.33333333334, ans=0.125 2023-12-04 07:23:50,770 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=183733.33333333334, ans=0.125 2023-12-04 07:24:02,435 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=183800.0, ans=0.0 2023-12-04 07:24:06,747 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=183866.66666666666, ans=0.125 2023-12-04 07:24:17,445 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=183933.33333333334, ans=0.1 2023-12-04 07:24:28,048 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=183933.33333333334, ans=0.125 2023-12-04 07:24:30,235 INFO [train.py:1087] (2/4) Epoch 31, batch 750, loss[loss=0.1633, simple_loss=0.2543, pruned_loss=0.03611, over 24554.00 frames. ], tot_loss[loss=0.1681, simple_loss=0.2586, pruned_loss=0.03874, over 4706895.76 frames. ], batch size: 62, lr: 7.84e-03, grad_scale: 16.0 2023-12-04 07:24:37,192 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=184000.0, ans=0.0 2023-12-04 07:24:41,564 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=184066.66666666666, ans=0.125 2023-12-04 07:24:59,317 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=184133.33333333334, ans=0.125 2023-12-04 07:25:01,714 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=184133.33333333334, ans=0.0 2023-12-04 07:25:10,515 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=184200.0, ans=0.0 2023-12-04 07:25:12,340 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.134e+02 1.325e+02 1.411e+02 1.526e+02 2.166e+02, threshold=2.822e+02, percent-clipped=0.0 2023-12-04 07:25:24,014 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=184333.33333333334, ans=0.125 2023-12-04 07:25:25,311 INFO [train.py:1087] (2/4) Epoch 31, batch 800, loss[loss=0.1604, simple_loss=0.2533, pruned_loss=0.03377, over 24742.00 frames. ], tot_loss[loss=0.1678, simple_loss=0.2583, pruned_loss=0.03864, over 4711774.52 frames. 
], batch size: 63, lr: 7.84e-03, grad_scale: 32.0 2023-12-04 07:25:36,480 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=184400.0, ans=0.2 2023-12-04 07:25:57,702 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=184533.33333333334, ans=0.2 2023-12-04 07:25:58,749 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=184533.33333333334, ans=0.125 2023-12-04 07:25:58,879 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=184533.33333333334, ans=0.125 2023-12-04 07:26:01,855 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=184533.33333333334, ans=0.2 2023-12-04 07:26:16,534 INFO [train.py:1087] (2/4) Epoch 31, batch 850, loss[loss=0.1726, simple_loss=0.2661, pruned_loss=0.03951, over 24719.00 frames. ], tot_loss[loss=0.168, simple_loss=0.2585, pruned_loss=0.03873, over 4745251.10 frames. ], batch size: 69, lr: 7.83e-03, grad_scale: 32.0 2023-12-04 07:26:20,639 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=184666.66666666666, ans=0.1 2023-12-04 07:26:20,667 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=184666.66666666666, ans=0.0 2023-12-04 07:26:24,956 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.67 vs. limit=22.5 2023-12-04 07:26:54,694 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.218e+02 1.395e+02 1.510e+02 1.646e+02 2.236e+02, threshold=3.020e+02, percent-clipped=0.0 2023-12-04 07:26:58,353 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.59 vs. limit=15.0 2023-12-04 07:26:59,384 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.44 vs. limit=15.0 2023-12-04 07:27:16,716 INFO [train.py:1087] (2/4) Epoch 32, batch 0, loss[loss=0.1724, simple_loss=0.2659, pruned_loss=0.03942, over 23554.00 frames. ], tot_loss[loss=0.1724, simple_loss=0.2659, pruned_loss=0.03942, over 23554.00 frames. ], batch size: 94, lr: 7.70e-03, grad_scale: 32.0 2023-12-04 07:27:16,717 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 07:27:28,739 INFO [train.py:1119] (2/4) Epoch 32, validation: loss=0.154, simple_loss=0.2543, pruned_loss=0.02682, over 944034.00 frames. 
2023-12-04 07:27:28,740 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 07:27:30,053 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=184966.66666666666, ans=0.125 2023-12-04 07:27:35,535 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=184966.66666666666, ans=0.0 2023-12-04 07:27:40,737 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=185033.33333333334, ans=0.125 2023-12-04 07:27:57,386 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=185100.0, ans=0.125 2023-12-04 07:27:58,499 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=185100.0, ans=0.125 2023-12-04 07:28:01,564 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=185166.66666666666, ans=0.0 2023-12-04 07:28:01,609 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=185166.66666666666, ans=0.125 2023-12-04 07:28:04,617 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=185166.66666666666, ans=0.125 2023-12-04 07:28:16,779 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=185233.33333333334, ans=0.125 2023-12-04 07:28:18,733 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=185233.33333333334, ans=0.1 2023-12-04 07:28:22,742 INFO [train.py:1087] (2/4) Epoch 32, batch 50, loss[loss=0.1622, simple_loss=0.2552, pruned_loss=0.03466, over 24569.00 frames. ], tot_loss[loss=0.1707, simple_loss=0.261, pruned_loss=0.04027, over 1069215.54 frames. ], batch size: 65, lr: 7.69e-03, grad_scale: 32.0 2023-12-04 07:28:32,271 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=185300.0, ans=0.07 2023-12-04 07:28:39,588 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=185366.66666666666, ans=0.125 2023-12-04 07:28:47,404 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=185433.33333333334, ans=0.0 2023-12-04 07:28:52,491 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.33 vs. limit=22.5 2023-12-04 07:28:53,290 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=185433.33333333334, ans=0.1 2023-12-04 07:28:54,375 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=185433.33333333334, ans=0.0 2023-12-04 07:29:10,205 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.164e+02 1.353e+02 1.432e+02 1.598e+02 2.705e+02, threshold=2.864e+02, percent-clipped=0.0 2023-12-04 07:29:10,740 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.87 vs. 
limit=15.0 2023-12-04 07:29:16,167 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.56 vs. limit=6.0 2023-12-04 07:29:17,656 INFO [train.py:1087] (2/4) Epoch 32, batch 100, loss[loss=0.1793, simple_loss=0.2697, pruned_loss=0.04446, over 22726.00 frames. ], tot_loss[loss=0.169, simple_loss=0.2596, pruned_loss=0.03925, over 1909717.12 frames. ], batch size: 106, lr: 7.69e-03, grad_scale: 32.0 2023-12-04 07:29:38,461 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=185700.0, ans=0.0 2023-12-04 07:29:41,597 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=185766.66666666666, ans=0.125 2023-12-04 07:29:45,820 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=185766.66666666666, ans=0.0 2023-12-04 07:30:05,802 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=185900.0, ans=0.1 2023-12-04 07:30:12,473 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=185966.66666666666, ans=0.125 2023-12-04 07:30:13,318 INFO [train.py:1087] (2/4) Epoch 32, batch 150, loss[loss=0.1671, simple_loss=0.2582, pruned_loss=0.03796, over 24764.00 frames. ], tot_loss[loss=0.1686, simple_loss=0.2595, pruned_loss=0.03883, over 2555878.34 frames. ], batch size: 66, lr: 7.68e-03, grad_scale: 32.0 2023-12-04 07:30:28,792 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.08 vs. limit=6.0 2023-12-04 07:30:30,454 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 07:31:01,349 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.124e+02 1.314e+02 1.382e+02 1.497e+02 2.003e+02, threshold=2.764e+02, percent-clipped=0.0 2023-12-04 07:31:03,662 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=186233.33333333334, ans=0.125 2023-12-04 07:31:08,724 INFO [train.py:1087] (2/4) Epoch 32, batch 200, loss[loss=0.1602, simple_loss=0.2525, pruned_loss=0.03393, over 24688.00 frames. ], tot_loss[loss=0.1678, simple_loss=0.2588, pruned_loss=0.03837, over 3060507.45 frames. ], batch size: 74, lr: 7.67e-03, grad_scale: 32.0 2023-12-04 07:31:24,177 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=186366.66666666666, ans=0.125 2023-12-04 07:32:04,818 INFO [train.py:1087] (2/4) Epoch 32, batch 250, loss[loss=0.1612, simple_loss=0.2542, pruned_loss=0.03408, over 24558.00 frames. ], tot_loss[loss=0.1673, simple_loss=0.2581, pruned_loss=0.03825, over 3457458.46 frames. ], batch size: 63, lr: 7.67e-03, grad_scale: 32.0 2023-12-04 07:32:06,078 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=186633.33333333334, ans=0.0 2023-12-04 07:32:52,775 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.39 vs. 
limit=15.0 2023-12-04 07:32:54,438 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.104e+02 1.314e+02 1.423e+02 1.624e+02 2.429e+02, threshold=2.846e+02, percent-clipped=0.0 2023-12-04 07:32:58,883 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=186900.0, ans=0.125 2023-12-04 07:33:01,847 INFO [train.py:1087] (2/4) Epoch 32, batch 300, loss[loss=0.1734, simple_loss=0.2655, pruned_loss=0.0406, over 24797.00 frames. ], tot_loss[loss=0.1676, simple_loss=0.2585, pruned_loss=0.03835, over 3760421.67 frames. ], batch size: 71, lr: 7.66e-03, grad_scale: 32.0 2023-12-04 07:33:03,261 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=186966.66666666666, ans=0.125 2023-12-04 07:33:37,453 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=187166.66666666666, ans=0.125 2023-12-04 07:33:42,923 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=187166.66666666666, ans=0.1 2023-12-04 07:33:44,339 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.40 vs. limit=15.0 2023-12-04 07:33:52,836 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.69 vs. limit=22.5 2023-12-04 07:33:57,892 INFO [train.py:1087] (2/4) Epoch 32, batch 350, loss[loss=0.1844, simple_loss=0.2685, pruned_loss=0.0502, over 17496.00 frames. ], tot_loss[loss=0.1675, simple_loss=0.2583, pruned_loss=0.03838, over 3979948.55 frames. ], batch size: 176, lr: 7.65e-03, grad_scale: 16.0 2023-12-04 07:34:36,108 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.15 vs. limit=12.0 2023-12-04 07:34:46,780 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.198e+02 1.371e+02 1.458e+02 1.584e+02 2.134e+02, threshold=2.917e+02, percent-clipped=0.0 2023-12-04 07:34:48,137 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=187566.66666666666, ans=0.0 2023-12-04 07:34:50,286 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=187566.66666666666, ans=0.5 2023-12-04 07:34:53,083 INFO [train.py:1087] (2/4) Epoch 32, batch 400, loss[loss=0.1666, simple_loss=0.2569, pruned_loss=0.03819, over 24740.00 frames. ], tot_loss[loss=0.1681, simple_loss=0.2586, pruned_loss=0.03879, over 4142054.93 frames. 
], batch size: 63, lr: 7.65e-03, grad_scale: 32.0 2023-12-04 07:34:58,589 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=187633.33333333334, ans=10.0 2023-12-04 07:35:02,183 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=187633.33333333334, ans=0.1 2023-12-04 07:35:20,852 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 07:35:26,056 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=187833.33333333334, ans=0.09899494936611666 2023-12-04 07:35:29,175 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=187833.33333333334, ans=0.0 2023-12-04 07:35:35,340 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=187833.33333333334, ans=0.125 2023-12-04 07:35:38,859 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=187900.0, ans=0.125 2023-12-04 07:35:42,492 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=187900.0, ans=0.125 2023-12-04 07:35:48,525 INFO [train.py:1087] (2/4) Epoch 32, batch 450, loss[loss=0.1713, simple_loss=0.2652, pruned_loss=0.03869, over 24854.00 frames. ], tot_loss[loss=0.1682, simple_loss=0.2588, pruned_loss=0.03883, over 4286616.18 frames. ], batch size: 68, lr: 7.64e-03, grad_scale: 32.0 2023-12-04 07:35:55,576 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=187966.66666666666, ans=0.125 2023-12-04 07:36:37,067 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.180e+02 1.326e+02 1.426e+02 1.575e+02 2.216e+02, threshold=2.852e+02, percent-clipped=0.0 2023-12-04 07:36:44,498 INFO [train.py:1087] (2/4) Epoch 32, batch 500, loss[loss=0.1512, simple_loss=0.241, pruned_loss=0.03071, over 24730.00 frames. ], tot_loss[loss=0.1676, simple_loss=0.2583, pruned_loss=0.03849, over 4410237.18 frames. ], batch size: 69, lr: 7.64e-03, grad_scale: 32.0 2023-12-04 07:36:46,874 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=188300.0, ans=0.0 2023-12-04 07:37:02,868 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=188366.66666666666, ans=0.1 2023-12-04 07:37:31,475 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=188566.66666666666, ans=0.0 2023-12-04 07:37:38,051 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-12-04 07:37:38,695 INFO [train.py:1087] (2/4) Epoch 32, batch 550, loss[loss=0.1595, simple_loss=0.2473, pruned_loss=0.03581, over 24736.00 frames. ], tot_loss[loss=0.1674, simple_loss=0.258, pruned_loss=0.03841, over 4500278.48 frames. 
], batch size: 63, lr: 7.63e-03, grad_scale: 32.0 2023-12-04 07:37:58,951 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=188700.0, ans=0.1 2023-12-04 07:38:05,453 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=188766.66666666666, ans=0.0 2023-12-04 07:38:10,660 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=188833.33333333334, ans=0.125 2023-12-04 07:38:27,418 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.197e+02 1.355e+02 1.536e+02 1.656e+02 2.166e+02, threshold=3.073e+02, percent-clipped=0.0 2023-12-04 07:38:32,860 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=188966.66666666666, ans=0.0 2023-12-04 07:38:33,697 INFO [train.py:1087] (2/4) Epoch 32, batch 600, loss[loss=0.1541, simple_loss=0.2458, pruned_loss=0.03121, over 24793.00 frames. ], tot_loss[loss=0.1673, simple_loss=0.2578, pruned_loss=0.03842, over 4572110.49 frames. ], batch size: 73, lr: 7.62e-03, grad_scale: 32.0 2023-12-04 07:38:33,934 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=188966.66666666666, ans=0.0 2023-12-04 07:38:34,988 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=188966.66666666666, ans=0.05 2023-12-04 07:38:39,299 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=188966.66666666666, ans=0.0 2023-12-04 07:38:44,306 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=189033.33333333334, ans=0.0 2023-12-04 07:39:15,416 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=189166.66666666666, ans=0.2 2023-12-04 07:39:29,169 INFO [train.py:1087] (2/4) Epoch 32, batch 650, loss[loss=0.1721, simple_loss=0.2615, pruned_loss=0.04132, over 24477.00 frames. ], tot_loss[loss=0.1672, simple_loss=0.2577, pruned_loss=0.03833, over 4632602.37 frames. ], batch size: 77, lr: 7.62e-03, grad_scale: 32.0 2023-12-04 07:39:29,361 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=189300.0, ans=0.0 2023-12-04 07:39:34,950 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=189300.0, ans=0.0 2023-12-04 07:39:54,215 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 07:40:14,912 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=189566.66666666666, ans=0.0 2023-12-04 07:40:18,423 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.158e+02 1.335e+02 1.423e+02 1.569e+02 2.985e+02, threshold=2.847e+02, percent-clipped=0.0 2023-12-04 07:40:25,273 INFO [train.py:1087] (2/4) Epoch 32, batch 700, loss[loss=0.1555, simple_loss=0.2515, pruned_loss=0.02978, over 24692.00 frames. ], tot_loss[loss=0.1672, simple_loss=0.2579, pruned_loss=0.03827, over 4675554.87 frames. 
], batch size: 69, lr: 7.61e-03, grad_scale: 32.0 2023-12-04 07:40:29,162 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.74 vs. limit=10.0 2023-12-04 07:40:45,846 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=189766.66666666666, ans=0.1 2023-12-04 07:40:48,949 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=189766.66666666666, ans=0.125 2023-12-04 07:41:05,800 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.04 vs. limit=22.5 2023-12-04 07:41:10,680 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=189900.0, ans=0.125 2023-12-04 07:41:18,543 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.86 vs. limit=15.0 2023-12-04 07:41:20,637 INFO [train.py:1087] (2/4) Epoch 32, batch 750, loss[loss=0.1679, simple_loss=0.2579, pruned_loss=0.03897, over 24734.00 frames. ], tot_loss[loss=0.1673, simple_loss=0.258, pruned_loss=0.0383, over 4710780.87 frames. ], batch size: 63, lr: 7.60e-03, grad_scale: 32.0 2023-12-04 07:41:20,844 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=189966.66666666666, ans=0.07 2023-12-04 07:41:22,841 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=189966.66666666666, ans=0.2 2023-12-04 07:42:09,087 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.040e+02 1.317e+02 1.435e+02 1.596e+02 2.082e+02, threshold=2.871e+02, percent-clipped=0.0 2023-12-04 07:42:09,473 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=190233.33333333334, ans=0.125 2023-12-04 07:42:15,436 INFO [train.py:1087] (2/4) Epoch 32, batch 800, loss[loss=0.1823, simple_loss=0.2752, pruned_loss=0.04469, over 21563.00 frames. ], tot_loss[loss=0.1681, simple_loss=0.2587, pruned_loss=0.03875, over 4713361.22 frames. ], batch size: 127, lr: 7.60e-03, grad_scale: 32.0 2023-12-04 07:42:17,717 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=190300.0, ans=0.125 2023-12-04 07:42:32,871 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=190366.66666666666, ans=0.125 2023-12-04 07:42:34,763 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=190366.66666666666, ans=0.09899494936611666 2023-12-04 07:42:45,104 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.55 vs. 
limit=22.5 2023-12-04 07:42:51,586 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=190500.0, ans=0.125 2023-12-04 07:42:54,519 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=190500.0, ans=0.125 2023-12-04 07:42:57,544 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=190566.66666666666, ans=0.09899494936611666 2023-12-04 07:43:06,174 INFO [train.py:1087] (2/4) Epoch 32, batch 850, loss[loss=0.162, simple_loss=0.252, pruned_loss=0.036, over 21206.00 frames. ], tot_loss[loss=0.1682, simple_loss=0.2586, pruned_loss=0.03892, over 4717023.66 frames. ], batch size: 128, lr: 7.59e-03, grad_scale: 32.0 2023-12-04 07:43:10,886 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.23 vs. limit=15.0 2023-12-04 07:43:15,936 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=190700.0, ans=0.1 2023-12-04 07:43:20,891 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=190700.0, ans=0.125 2023-12-04 07:43:24,450 INFO [scaling.py:1022] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.91 vs. limit=5.0 2023-12-04 07:43:30,875 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=190766.66666666666, ans=0.0 2023-12-04 07:43:35,722 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=190833.33333333334, ans=0.05 2023-12-04 07:43:48,724 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.30 vs. limit=10.0 2023-12-04 07:43:57,614 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=190933.33333333334, ans=0.1 2023-12-04 07:44:04,799 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.162e+02 1.355e+02 1.461e+02 1.606e+02 3.064e+02, threshold=2.922e+02, percent-clipped=2.0 2023-12-04 07:44:04,826 INFO [train.py:1087] (2/4) Epoch 33, batch 0, loss[loss=0.1643, simple_loss=0.2579, pruned_loss=0.03529, over 24313.00 frames. ], tot_loss[loss=0.1643, simple_loss=0.2579, pruned_loss=0.03529, over 24313.00 frames. ], batch size: 79, lr: 7.47e-03, grad_scale: 32.0 2023-12-04 07:44:04,827 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 07:44:16,803 INFO [train.py:1119] (2/4) Epoch 33, validation: loss=0.154, simple_loss=0.2541, pruned_loss=0.02696, over 944034.00 frames. 
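Note: the two entries just above report the frame-weighted validation loss computed at the start of epoch 33 ("Computing validation loss" / "validation: loss=0.154, ... over 944034.00 frames"), and the entry that follows reports peak CUDA memory. As a rough illustration only — this is not the actual icefall train.py code; the model call, criterion, and batch field names below are assumed placeholders — a frame-weighted validation average and a peak-memory figure in MB could be produced along these lines:

    import torch

    @torch.no_grad()
    def compute_validation_loss(model, valid_loader, device):
        # Illustrative sketch: sum per-batch losses and normalize by the total
        # number of acoustic frames, so the result reads like the
        # "validation: loss=X ... over N frames" lines in this log.
        model.eval()
        total_loss = 0.0
        total_frames = 0
        for batch in valid_loader:
            feats = batch["features"].to(device)          # (B, T, F) -- assumed field name
            feat_lens = batch["feature_lens"].to(device)  # (B,)      -- assumed field name
            # Hypothetical interface: the model returns a summed loss for the batch.
            loss = model(feats, feat_lens, batch["targets"])
            total_loss += float(loss.item())
            total_frames += int(feat_lens.sum().item())
        model.train()
        return total_loss / max(total_frames, 1), total_frames

    def max_allocated_mb(device) -> int:
        # Peak CUDA memory in MB; the "Maximum memory allocated so far is ...MB"
        # entries are presumably derived from a counter like this one.
        return int(torch.cuda.max_memory_allocated(device) / (1024 * 1024))

Normalizing the summed loss by the total frame count keeps the reported value comparable across validation sets of different sizes, consistent with the "over 944034.00 frames" qualifier in the entries above.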
2023-12-04 07:44:16,804 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 07:44:20,290 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=190933.33333333334, ans=0.125 2023-12-04 07:44:34,690 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=191000.0, ans=0.125 2023-12-04 07:44:36,667 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=191000.0, ans=0.125 2023-12-04 07:44:42,513 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=191066.66666666666, ans=0.1 2023-12-04 07:44:49,663 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=191133.33333333334, ans=10.0 2023-12-04 07:44:53,970 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=191133.33333333334, ans=0.1 2023-12-04 07:45:00,272 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=191200.0, ans=0.0 2023-12-04 07:45:09,865 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=191200.0, ans=0.09899494936611666 2023-12-04 07:45:12,133 INFO [train.py:1087] (2/4) Epoch 33, batch 50, loss[loss=0.1594, simple_loss=0.2481, pruned_loss=0.03537, over 24359.00 frames. ], tot_loss[loss=0.1682, simple_loss=0.2584, pruned_loss=0.03898, over 1062930.49 frames. ], batch size: 79, lr: 7.46e-03, grad_scale: 32.0 2023-12-04 07:45:12,358 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=191266.66666666666, ans=0.0 2023-12-04 07:45:17,029 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.36 vs. limit=15.0 2023-12-04 07:45:18,680 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=191266.66666666666, ans=0.125 2023-12-04 07:45:20,900 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=191266.66666666666, ans=0.0 2023-12-04 07:45:33,110 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=191400.0, ans=0.125 2023-12-04 07:45:36,819 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=191400.0, ans=0.0 2023-12-04 07:45:51,158 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.71 vs. limit=15.0 2023-12-04 07:46:07,473 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.212e+02 1.378e+02 1.539e+02 1.731e+02 2.265e+02, threshold=3.079e+02, percent-clipped=0.0 2023-12-04 07:46:07,499 INFO [train.py:1087] (2/4) Epoch 33, batch 100, loss[loss=0.1649, simple_loss=0.2603, pruned_loss=0.03474, over 24581.00 frames. ], tot_loss[loss=0.168, simple_loss=0.2587, pruned_loss=0.0387, over 1888110.02 frames. 
], batch size: 65, lr: 7.46e-03, grad_scale: 32.0 2023-12-04 07:46:11,908 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=191600.0, ans=0.05 2023-12-04 07:46:16,520 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=191600.0, ans=0.2 2023-12-04 07:46:16,551 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 07:46:18,610 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=191666.66666666666, ans=0.125 2023-12-04 07:46:20,756 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=191666.66666666666, ans=0.125 2023-12-04 07:46:21,831 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=191666.66666666666, ans=0.09899494936611666 2023-12-04 07:46:21,908 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=191666.66666666666, ans=0.04949747468305833 2023-12-04 07:46:32,447 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=191733.33333333334, ans=0.125 2023-12-04 07:46:37,340 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=191733.33333333334, ans=0.125 2023-12-04 07:46:50,936 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.75 vs. limit=22.5 2023-12-04 07:46:53,706 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=191866.66666666666, ans=0.125 2023-12-04 07:47:01,780 INFO [train.py:1087] (2/4) Epoch 33, batch 150, loss[loss=0.1691, simple_loss=0.2535, pruned_loss=0.0423, over 24540.00 frames. ], tot_loss[loss=0.168, simple_loss=0.2584, pruned_loss=0.03881, over 2528256.95 frames. ], batch size: 62, lr: 7.45e-03, grad_scale: 32.0 2023-12-04 07:47:17,308 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=192000.0, ans=0.125 2023-12-04 07:47:43,566 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.30 vs. limit=15.0 2023-12-04 07:47:46,254 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=192200.0, ans=0.1 2023-12-04 07:47:48,437 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=192200.0, ans=0.125 2023-12-04 07:47:52,743 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=192200.0, ans=0.0 2023-12-04 07:47:58,015 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.177e+02 1.283e+02 1.371e+02 1.480e+02 2.275e+02, threshold=2.743e+02, percent-clipped=0.0 2023-12-04 07:47:58,045 INFO [train.py:1087] (2/4) Epoch 33, batch 200, loss[loss=0.1584, simple_loss=0.2532, pruned_loss=0.03175, over 24784.00 frames. ], tot_loss[loss=0.1665, simple_loss=0.2574, pruned_loss=0.03784, over 3040276.77 frames. 
], batch size: 71, lr: 7.44e-03, grad_scale: 32.0 2023-12-04 07:48:05,557 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=192266.66666666666, ans=0.125 2023-12-04 07:48:42,679 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=192533.33333333334, ans=0.2 2023-12-04 07:48:46,838 INFO [scaling.py:1022] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.89 vs. limit=5.0 2023-12-04 07:48:53,310 INFO [train.py:1087] (2/4) Epoch 33, batch 250, loss[loss=0.1572, simple_loss=0.2534, pruned_loss=0.0305, over 24746.00 frames. ], tot_loss[loss=0.167, simple_loss=0.2576, pruned_loss=0.03823, over 3429859.68 frames. ], batch size: 66, lr: 7.44e-03, grad_scale: 32.0 2023-12-04 07:49:05,861 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.12 vs. limit=22.5 2023-12-04 07:49:11,231 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=192666.66666666666, ans=0.125 2023-12-04 07:49:14,765 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=192733.33333333334, ans=0.0 2023-12-04 07:49:39,097 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.59 vs. limit=15.0 2023-12-04 07:49:49,259 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.155e+02 1.292e+02 1.428e+02 1.566e+02 2.178e+02, threshold=2.856e+02, percent-clipped=0.0 2023-12-04 07:49:49,285 INFO [train.py:1087] (2/4) Epoch 33, batch 300, loss[loss=0.1654, simple_loss=0.2575, pruned_loss=0.03664, over 24570.00 frames. ], tot_loss[loss=0.1669, simple_loss=0.2576, pruned_loss=0.03807, over 3741161.91 frames. ], batch size: 65, lr: 7.43e-03, grad_scale: 32.0 2023-12-04 07:49:56,839 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=192933.33333333334, ans=0.125 2023-12-04 07:50:07,730 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=193000.0, ans=0.0 2023-12-04 07:50:08,792 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=193000.0, ans=0.125 2023-12-04 07:50:44,308 INFO [train.py:1087] (2/4) Epoch 33, batch 350, loss[loss=0.159, simple_loss=0.2501, pruned_loss=0.03397, over 24718.00 frames. ], tot_loss[loss=0.1671, simple_loss=0.2579, pruned_loss=0.03815, over 3979635.64 frames. ], batch size: 67, lr: 7.43e-03, grad_scale: 32.0 2023-12-04 07:51:18,491 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=193466.66666666666, ans=0.125 2023-12-04 07:51:29,592 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=193533.33333333334, ans=0.1 2023-12-04 07:51:33,023 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.00 vs. 
limit=15.0 2023-12-04 07:51:39,778 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.124e+02 1.316e+02 1.411e+02 1.589e+02 2.112e+02, threshold=2.823e+02, percent-clipped=0.0 2023-12-04 07:51:39,804 INFO [train.py:1087] (2/4) Epoch 33, batch 400, loss[loss=0.1646, simple_loss=0.2564, pruned_loss=0.03642, over 24807.00 frames. ], tot_loss[loss=0.1672, simple_loss=0.258, pruned_loss=0.03822, over 4163366.53 frames. ], batch size: 71, lr: 7.42e-03, grad_scale: 32.0 2023-12-04 07:51:41,083 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=193600.0, ans=0.025 2023-12-04 07:51:52,743 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=193666.66666666666, ans=0.125 2023-12-04 07:51:58,362 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=193666.66666666666, ans=0.1 2023-12-04 07:52:04,753 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=193733.33333333334, ans=0.0 2023-12-04 07:52:14,413 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.85 vs. limit=10.0 2023-12-04 07:52:17,602 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=193800.0, ans=0.0 2023-12-04 07:52:33,450 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=193866.66666666666, ans=0.125 2023-12-04 07:52:35,359 INFO [train.py:1087] (2/4) Epoch 33, batch 450, loss[loss=0.1508, simple_loss=0.2469, pruned_loss=0.02732, over 24710.00 frames. ], tot_loss[loss=0.1668, simple_loss=0.2575, pruned_loss=0.03801, over 4309403.68 frames. 
], batch size: 74, lr: 7.41e-03, grad_scale: 32.0 2023-12-04 07:52:35,662 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=193933.33333333334, ans=0.0 2023-12-04 07:52:39,796 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=193933.33333333334, ans=0.125 2023-12-04 07:52:41,860 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=193933.33333333334, ans=0.2 2023-12-04 07:53:06,802 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=194133.33333333334, ans=0.125 2023-12-04 07:53:15,692 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=194133.33333333334, ans=0.125 2023-12-04 07:53:16,677 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=194133.33333333334, ans=0.125 2023-12-04 07:53:20,769 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=194200.0, ans=0.2 2023-12-04 07:53:25,839 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=194200.0, ans=0.125 2023-12-04 07:53:30,271 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.218e+02 1.385e+02 1.524e+02 1.698e+02 2.358e+02, threshold=3.049e+02, percent-clipped=0.0 2023-12-04 07:53:30,297 INFO [train.py:1087] (2/4) Epoch 33, batch 500, loss[loss=0.1528, simple_loss=0.2406, pruned_loss=0.03252, over 24766.00 frames. ], tot_loss[loss=0.167, simple_loss=0.2575, pruned_loss=0.03822, over 4404077.85 frames. ], batch size: 70, lr: 7.41e-03, grad_scale: 32.0 2023-12-04 07:53:36,767 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=194266.66666666666, ans=0.125 2023-12-04 07:53:37,145 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.86 vs. limit=12.0 2023-12-04 07:54:24,728 INFO [train.py:1087] (2/4) Epoch 33, batch 550, loss[loss=0.1593, simple_loss=0.2505, pruned_loss=0.03407, over 24751.00 frames. ], tot_loss[loss=0.1666, simple_loss=0.2573, pruned_loss=0.03798, over 4508622.29 frames. ], batch size: 65, lr: 7.40e-03, grad_scale: 32.0 2023-12-04 07:54:40,065 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=194666.66666666666, ans=0.125 2023-12-04 07:54:56,876 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=194733.33333333334, ans=0.1 2023-12-04 07:55:01,408 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=194800.0, ans=0.125 2023-12-04 07:55:20,564 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.073e+02 1.284e+02 1.392e+02 1.505e+02 1.929e+02, threshold=2.784e+02, percent-clipped=0.0 2023-12-04 07:55:20,589 INFO [train.py:1087] (2/4) Epoch 33, batch 600, loss[loss=0.1533, simple_loss=0.2449, pruned_loss=0.03088, over 24745.00 frames. ], tot_loss[loss=0.1661, simple_loss=0.2568, pruned_loss=0.0377, over 4572263.43 frames. 
], batch size: 63, lr: 7.40e-03, grad_scale: 32.0 2023-12-04 07:55:21,889 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=194933.33333333334, ans=0.125 2023-12-04 07:55:26,043 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=194933.33333333334, ans=0.0 2023-12-04 07:55:39,164 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=195000.0, ans=0.1 2023-12-04 07:56:10,284 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=195200.0, ans=0.125 2023-12-04 07:56:16,544 INFO [train.py:1087] (2/4) Epoch 33, batch 650, loss[loss=0.1679, simple_loss=0.2602, pruned_loss=0.03782, over 24578.00 frames. ], tot_loss[loss=0.1664, simple_loss=0.2571, pruned_loss=0.03786, over 4626969.26 frames. ], batch size: 65, lr: 7.39e-03, grad_scale: 32.0 2023-12-04 07:56:25,168 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=195266.66666666666, ans=0.0 2023-12-04 07:56:32,376 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=195333.33333333334, ans=0.95 2023-12-04 07:56:59,482 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=195533.33333333334, ans=0.0 2023-12-04 07:56:59,592 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=195533.33333333334, ans=0.2 2023-12-04 07:57:06,188 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-12-04 07:57:12,034 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.091e+02 1.333e+02 1.459e+02 1.652e+02 2.138e+02, threshold=2.917e+02, percent-clipped=0.0 2023-12-04 07:57:12,060 INFO [train.py:1087] (2/4) Epoch 33, batch 700, loss[loss=0.1627, simple_loss=0.2477, pruned_loss=0.03884, over 24129.00 frames. ], tot_loss[loss=0.166, simple_loss=0.2566, pruned_loss=0.03765, over 4677756.95 frames. ], batch size: 58, lr: 7.38e-03, grad_scale: 32.0 2023-12-04 07:57:19,010 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=195600.0, ans=0.0 2023-12-04 07:57:30,846 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=195666.66666666666, ans=0.1 2023-12-04 07:57:48,822 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=195800.0, ans=0.04949747468305833 2023-12-04 07:57:48,823 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=195800.0, ans=0.0 2023-12-04 07:58:07,766 INFO [train.py:1087] (2/4) Epoch 33, batch 750, loss[loss=0.1707, simple_loss=0.2596, pruned_loss=0.04089, over 24314.00 frames. ], tot_loss[loss=0.1661, simple_loss=0.2567, pruned_loss=0.03775, over 4717329.17 frames. 
], batch size: 79, lr: 7.38e-03, grad_scale: 32.0 2023-12-04 07:58:10,538 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=195933.33333333334, ans=0.125 2023-12-04 07:58:10,589 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=195933.33333333334, ans=0.2 2023-12-04 07:58:25,968 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=196000.0, ans=0.125 2023-12-04 07:58:31,301 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=196066.66666666666, ans=0.125 2023-12-04 07:58:32,232 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=196066.66666666666, ans=0.035 2023-12-04 07:58:32,801 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.72 vs. limit=12.0 2023-12-04 07:59:03,165 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.171e+02 1.329e+02 1.417e+02 1.532e+02 2.129e+02, threshold=2.834e+02, percent-clipped=0.0 2023-12-04 07:59:03,191 INFO [train.py:1087] (2/4) Epoch 33, batch 800, loss[loss=0.2003, simple_loss=0.2755, pruned_loss=0.06256, over 16564.00 frames. ], tot_loss[loss=0.1659, simple_loss=0.2566, pruned_loss=0.03763, over 4737210.07 frames. ], batch size: 177, lr: 7.37e-03, grad_scale: 32.0 2023-12-04 07:59:18,767 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.79 vs. limit=15.0 2023-12-04 07:59:31,694 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=196400.0, ans=0.125 2023-12-04 07:59:34,653 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=196466.66666666666, ans=0.0 2023-12-04 07:59:40,654 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=196466.66666666666, ans=0.2 2023-12-04 07:59:45,640 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=196533.33333333334, ans=0.2 2023-12-04 07:59:54,526 INFO [train.py:1087] (2/4) Epoch 33, batch 850, loss[loss=0.155, simple_loss=0.2528, pruned_loss=0.02855, over 24730.00 frames. ], tot_loss[loss=0.1659, simple_loss=0.2565, pruned_loss=0.0376, over 4766467.78 frames. ], batch size: 67, lr: 7.37e-03, grad_scale: 32.0 2023-12-04 07:59:55,739 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 07:59:58,041 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.81 vs. limit=22.5 2023-12-04 08:00:00,586 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.45 vs. 
limit=12.0 2023-12-04 08:00:18,084 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=196733.33333333334, ans=0.1 2023-12-04 08:00:24,055 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=196800.0, ans=0.0 2023-12-04 08:00:32,025 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=196800.0, ans=0.0 2023-12-04 08:00:52,440 INFO [train.py:1087] (2/4) Epoch 34, batch 0, loss[loss=0.1632, simple_loss=0.2553, pruned_loss=0.03553, over 24590.00 frames. ], tot_loss[loss=0.1632, simple_loss=0.2553, pruned_loss=0.03553, over 24590.00 frames. ], batch size: 65, lr: 7.25e-03, grad_scale: 32.0 2023-12-04 08:00:52,441 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 08:01:04,660 INFO [train.py:1119] (2/4) Epoch 34, validation: loss=0.1541, simple_loss=0.2542, pruned_loss=0.02698, over 944034.00 frames. 2023-12-04 08:01:04,661 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 08:01:07,985 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=196900.0, ans=0.2 2023-12-04 08:01:09,872 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.159e+02 1.334e+02 1.484e+02 1.658e+02 2.498e+02, threshold=2.968e+02, percent-clipped=0.0 2023-12-04 08:01:25,665 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=197033.33333333334, ans=0.1 2023-12-04 08:01:26,843 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=197033.33333333334, ans=0.0 2023-12-04 08:01:59,551 INFO [train.py:1087] (2/4) Epoch 34, batch 50, loss[loss=0.1757, simple_loss=0.2668, pruned_loss=0.0423, over 23459.00 frames. ], tot_loss[loss=0.1666, simple_loss=0.258, pruned_loss=0.03765, over 1088767.51 frames. ], batch size: 94, lr: 7.24e-03, grad_scale: 32.0 2023-12-04 08:02:03,308 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:02:09,708 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=197300.0, ans=0.1 2023-12-04 08:02:16,525 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=197300.0, ans=0.1 2023-12-04 08:02:17,446 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=197300.0, ans=0.0 2023-12-04 08:02:19,679 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:02:25,115 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=197366.66666666666, ans=0.2 2023-12-04 08:02:33,417 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=197433.33333333334, ans=0.125 2023-12-04 08:02:41,781 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.87 vs. 
limit=15.0 2023-12-04 08:02:55,518 INFO [train.py:1087] (2/4) Epoch 34, batch 100, loss[loss=0.1518, simple_loss=0.2404, pruned_loss=0.03158, over 24550.00 frames. ], tot_loss[loss=0.1662, simple_loss=0.2573, pruned_loss=0.0375, over 1910079.39 frames. ], batch size: 66, lr: 7.24e-03, grad_scale: 32.0 2023-12-04 08:03:00,824 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.108e+02 1.348e+02 1.470e+02 1.630e+02 2.768e+02, threshold=2.940e+02, percent-clipped=0.0 2023-12-04 08:03:05,293 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=197633.33333333334, ans=0.0 2023-12-04 08:03:10,068 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=197633.33333333334, ans=0.0 2023-12-04 08:03:13,251 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=197633.33333333334, ans=0.2 2023-12-04 08:03:20,698 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=197700.0, ans=0.125 2023-12-04 08:03:35,867 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.42 vs. limit=15.0 2023-12-04 08:03:46,312 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=197833.33333333334, ans=0.125 2023-12-04 08:03:47,250 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=197833.33333333334, ans=0.1 2023-12-04 08:03:50,207 INFO [train.py:1087] (2/4) Epoch 34, batch 150, loss[loss=0.1636, simple_loss=0.2502, pruned_loss=0.03854, over 24523.00 frames. ], tot_loss[loss=0.1662, simple_loss=0.2569, pruned_loss=0.03773, over 2551511.19 frames. ], batch size: 75, lr: 7.23e-03, grad_scale: 32.0 2023-12-04 08:03:57,763 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.53 vs. limit=22.5 2023-12-04 08:04:30,075 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.42 vs. limit=15.0 2023-12-04 08:04:41,526 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=198166.66666666666, ans=0.0 2023-12-04 08:04:45,517 INFO [train.py:1087] (2/4) Epoch 34, batch 200, loss[loss=0.1606, simple_loss=0.2544, pruned_loss=0.03339, over 24763.00 frames. ], tot_loss[loss=0.1652, simple_loss=0.2563, pruned_loss=0.03707, over 3069355.72 frames. ], batch size: 70, lr: 7.23e-03, grad_scale: 32.0 2023-12-04 08:04:51,204 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.099e+02 1.341e+02 1.461e+02 1.618e+02 2.204e+02, threshold=2.921e+02, percent-clipped=0.0 2023-12-04 08:04:53,618 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=198233.33333333334, ans=0.125 2023-12-04 08:05:01,870 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.33 vs. 
limit=15.0 2023-12-04 08:05:06,974 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=198366.66666666666, ans=0.035 2023-12-04 08:05:12,189 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=198366.66666666666, ans=0.0 2023-12-04 08:05:19,077 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=198433.33333333334, ans=0.125 2023-12-04 08:05:22,311 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=198433.33333333334, ans=0.2 2023-12-04 08:05:26,495 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=198433.33333333334, ans=0.0 2023-12-04 08:05:26,576 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:05:32,930 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.79 vs. limit=15.0 2023-12-04 08:05:36,767 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:05:36,815 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=198500.0, ans=0.0 2023-12-04 08:05:40,863 INFO [train.py:1087] (2/4) Epoch 34, batch 250, loss[loss=0.1734, simple_loss=0.2605, pruned_loss=0.0432, over 24471.00 frames. ], tot_loss[loss=0.1653, simple_loss=0.2561, pruned_loss=0.03722, over 3456831.11 frames. ], batch size: 75, lr: 7.22e-03, grad_scale: 32.0 2023-12-04 08:05:45,466 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=198566.66666666666, ans=0.125 2023-12-04 08:05:55,484 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=198633.33333333334, ans=0.125 2023-12-04 08:06:03,415 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=198700.0, ans=0.1 2023-12-04 08:06:06,515 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=198700.0, ans=0.125 2023-12-04 08:06:19,800 INFO [scaling.py:1022] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.49 vs. limit=5.0 2023-12-04 08:06:20,366 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=198766.66666666666, ans=0.1 2023-12-04 08:06:21,365 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=198766.66666666666, ans=0.0 2023-12-04 08:06:35,544 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=198833.33333333334, ans=0.125 2023-12-04 08:06:37,414 INFO [train.py:1087] (2/4) Epoch 34, batch 300, loss[loss=0.1586, simple_loss=0.2528, pruned_loss=0.03225, over 24738.00 frames. ], tot_loss[loss=0.1646, simple_loss=0.2557, pruned_loss=0.03679, over 3769693.17 frames. 
], batch size: 61, lr: 7.22e-03, grad_scale: 32.0 2023-12-04 08:06:37,676 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=198900.0, ans=0.125 2023-12-04 08:06:42,627 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.180e+02 1.316e+02 1.407e+02 1.530e+02 2.162e+02, threshold=2.814e+02, percent-clipped=0.0 2023-12-04 08:07:03,544 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.06 vs. limit=22.5 2023-12-04 08:07:27,661 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=199166.66666666666, ans=0.1 2023-12-04 08:07:33,270 INFO [train.py:1087] (2/4) Epoch 34, batch 350, loss[loss=0.1675, simple_loss=0.2586, pruned_loss=0.03824, over 24568.00 frames. ], tot_loss[loss=0.1649, simple_loss=0.2558, pruned_loss=0.03696, over 4003567.79 frames. ], batch size: 65, lr: 7.21e-03, grad_scale: 32.0 2023-12-04 08:07:33,626 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=199233.33333333334, ans=0.1 2023-12-04 08:07:41,908 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=199233.33333333334, ans=0.125 2023-12-04 08:07:56,026 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.58 vs. limit=22.5 2023-12-04 08:08:03,637 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=199366.66666666666, ans=0.0 2023-12-04 08:08:22,945 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=199500.0, ans=0.1 2023-12-04 08:08:29,121 INFO [train.py:1087] (2/4) Epoch 34, batch 400, loss[loss=0.1718, simple_loss=0.2599, pruned_loss=0.04186, over 24507.00 frames. ], tot_loss[loss=0.165, simple_loss=0.256, pruned_loss=0.03702, over 4185562.60 frames. ], batch size: 77, lr: 7.20e-03, grad_scale: 32.0 2023-12-04 08:08:34,932 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.097e+02 1.322e+02 1.440e+02 1.552e+02 2.302e+02, threshold=2.880e+02, percent-clipped=0.0 2023-12-04 08:08:39,826 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=199633.33333333334, ans=0.1 2023-12-04 08:08:59,616 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=199700.0, ans=0.2 2023-12-04 08:09:02,775 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=199766.66666666666, ans=0.1 2023-12-04 08:09:04,979 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=199766.66666666666, ans=0.0 2023-12-04 08:09:08,737 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.95 vs. 
limit=15.0 2023-12-04 08:09:15,871 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=199833.33333333334, ans=0.5 2023-12-04 08:09:17,818 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:09:19,957 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=199833.33333333334, ans=0.2 2023-12-04 08:09:24,987 INFO [train.py:1087] (2/4) Epoch 34, batch 450, loss[loss=0.1595, simple_loss=0.2526, pruned_loss=0.03325, over 24796.00 frames. ], tot_loss[loss=0.1655, simple_loss=0.2564, pruned_loss=0.03729, over 4319017.64 frames. ], batch size: 72, lr: 7.20e-03, grad_scale: 32.0 2023-12-04 08:09:28,792 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.97 vs. limit=15.0 2023-12-04 08:09:30,507 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=199900.0, ans=0.0 2023-12-04 08:09:48,937 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=200033.33333333334, ans=0.0 2023-12-04 08:09:51,017 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=200033.33333333334, ans=0.0 2023-12-04 08:09:52,061 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=200033.33333333334, ans=0.125 2023-12-04 08:10:02,635 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=200100.0, ans=0.0 2023-12-04 08:10:09,419 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.61 vs. limit=6.0 2023-12-04 08:10:15,528 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=200166.66666666666, ans=0.125 2023-12-04 08:10:20,537 INFO [train.py:1087] (2/4) Epoch 34, batch 500, loss[loss=0.1744, simple_loss=0.2667, pruned_loss=0.04109, over 23546.00 frames. ], tot_loss[loss=0.1658, simple_loss=0.2566, pruned_loss=0.03745, over 4429249.85 frames. ], batch size: 94, lr: 7.19e-03, grad_scale: 32.0 2023-12-04 08:10:21,134 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=200233.33333333334, ans=6.0 2023-12-04 08:10:25,726 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.157e+02 1.353e+02 1.445e+02 1.601e+02 2.190e+02, threshold=2.891e+02, percent-clipped=0.0 2023-12-04 08:10:31,631 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.53 vs. limit=15.0 2023-12-04 08:10:48,397 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=200366.66666666666, ans=0.0 2023-12-04 08:10:55,061 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.48 vs. 
limit=15.0 2023-12-04 08:11:01,176 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=200433.33333333334, ans=0.125 2023-12-04 08:11:15,041 INFO [train.py:1087] (2/4) Epoch 34, batch 550, loss[loss=0.1736, simple_loss=0.2647, pruned_loss=0.04126, over 24796.00 frames. ], tot_loss[loss=0.1658, simple_loss=0.2568, pruned_loss=0.03738, over 4521865.42 frames. ], batch size: 73, lr: 7.19e-03, grad_scale: 32.0 2023-12-04 08:11:37,830 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.61 vs. limit=10.0 2023-12-04 08:11:40,751 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=200700.0, ans=0.2 2023-12-04 08:11:49,344 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=200766.66666666666, ans=0.1 2023-12-04 08:11:59,192 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=200833.33333333334, ans=0.125 2023-12-04 08:12:03,355 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=200833.33333333334, ans=0.125 2023-12-04 08:12:04,408 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=200833.33333333334, ans=0.125 2023-12-04 08:12:06,595 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=200833.33333333334, ans=0.125 2023-12-04 08:12:09,882 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=200900.0, ans=0.1 2023-12-04 08:12:10,642 INFO [train.py:1087] (2/4) Epoch 34, batch 600, loss[loss=0.1621, simple_loss=0.2526, pruned_loss=0.03586, over 24794.00 frames. ], tot_loss[loss=0.1654, simple_loss=0.2562, pruned_loss=0.03725, over 4589510.60 frames. ], batch size: 72, lr: 7.18e-03, grad_scale: 32.0 2023-12-04 08:12:11,925 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=200900.0, ans=0.0 2023-12-04 08:12:13,272 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.49 vs. limit=10.0 2023-12-04 08:12:16,342 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.191e+02 1.333e+02 1.438e+02 1.560e+02 2.058e+02, threshold=2.876e+02, percent-clipped=0.0 2023-12-04 08:12:36,824 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=201033.33333333334, ans=0.125 2023-12-04 08:12:42,541 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-12-04 08:12:50,268 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.79 vs. 
limit=15.0 2023-12-04 08:12:57,131 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=201166.66666666666, ans=0.125 2023-12-04 08:13:02,729 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=201166.66666666666, ans=0.5 2023-12-04 08:13:06,574 INFO [train.py:1087] (2/4) Epoch 34, batch 650, loss[loss=0.1569, simple_loss=0.2522, pruned_loss=0.03081, over 24849.00 frames. ], tot_loss[loss=0.1652, simple_loss=0.2561, pruned_loss=0.03717, over 4637720.15 frames. ], batch size: 68, lr: 7.18e-03, grad_scale: 32.0 2023-12-04 08:13:33,962 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:14:02,195 INFO [train.py:1087] (2/4) Epoch 34, batch 700, loss[loss=0.169, simple_loss=0.2563, pruned_loss=0.04091, over 24846.00 frames. ], tot_loss[loss=0.1651, simple_loss=0.256, pruned_loss=0.0371, over 4677539.04 frames. ], batch size: 68, lr: 7.17e-03, grad_scale: 32.0 2023-12-04 08:14:07,803 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.314e+02 1.390e+02 1.555e+02 2.195e+02, threshold=2.780e+02, percent-clipped=0.0 2023-12-04 08:14:19,885 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:14:34,544 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:14:41,492 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=201766.66666666666, ans=0.2 2023-12-04 08:14:57,571 INFO [train.py:1087] (2/4) Epoch 34, batch 750, loss[loss=0.1759, simple_loss=0.2674, pruned_loss=0.04221, over 22678.00 frames. ], tot_loss[loss=0.1645, simple_loss=0.2555, pruned_loss=0.0367, over 4712386.49 frames. ], batch size: 106, lr: 7.17e-03, grad_scale: 32.0 2023-12-04 08:14:57,794 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=201900.0, ans=0.125 2023-12-04 08:15:07,182 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=201900.0, ans=0.0 2023-12-04 08:15:32,045 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=202100.0, ans=0.0 2023-12-04 08:15:32,117 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=202100.0, ans=0.0 2023-12-04 08:15:53,553 INFO [train.py:1087] (2/4) Epoch 34, batch 800, loss[loss=0.1684, simple_loss=0.2568, pruned_loss=0.04, over 24555.00 frames. ], tot_loss[loss=0.1644, simple_loss=0.2555, pruned_loss=0.03665, over 4732778.13 frames. 
], batch size: 62, lr: 7.16e-03, grad_scale: 32.0 2023-12-04 08:15:59,238 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.175e+02 1.341e+02 1.452e+02 1.557e+02 2.379e+02, threshold=2.904e+02, percent-clipped=0.0 2023-12-04 08:16:01,442 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=202233.33333333334, ans=0.125 2023-12-04 08:16:36,365 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=202500.0, ans=0.0 2023-12-04 08:16:36,438 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=202500.0, ans=0.0 2023-12-04 08:16:45,298 INFO [train.py:1087] (2/4) Epoch 34, batch 850, loss[loss=0.1647, simple_loss=0.2558, pruned_loss=0.03679, over 24865.00 frames. ], tot_loss[loss=0.165, simple_loss=0.256, pruned_loss=0.03698, over 4748134.28 frames. ], batch size: 68, lr: 7.15e-03, grad_scale: 32.0 2023-12-04 08:17:05,361 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=202700.0, ans=0.0 2023-12-04 08:17:21,402 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=202766.66666666666, ans=0.125 2023-12-04 08:17:45,356 INFO [train.py:1087] (2/4) Epoch 35, batch 0, loss[loss=0.1519, simple_loss=0.2454, pruned_loss=0.02923, over 24553.00 frames. ], tot_loss[loss=0.1519, simple_loss=0.2454, pruned_loss=0.02923, over 24553.00 frames. ], batch size: 66, lr: 7.04e-03, grad_scale: 32.0 2023-12-04 08:17:45,357 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 08:17:57,372 INFO [train.py:1119] (2/4) Epoch 35, validation: loss=0.1534, simple_loss=0.2532, pruned_loss=0.02686, over 944034.00 frames. 2023-12-04 08:17:57,373 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 08:18:07,931 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.168e+02 1.315e+02 1.438e+02 1.571e+02 2.308e+02, threshold=2.875e+02, percent-clipped=0.0 2023-12-04 08:18:13,150 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.68 vs. limit=15.0 2023-12-04 08:18:24,995 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.08 vs. limit=22.5 2023-12-04 08:18:25,808 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=203000.0, ans=0.0 2023-12-04 08:18:32,093 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=203066.66666666666, ans=0.2 2023-12-04 08:18:36,203 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=203066.66666666666, ans=0.1 2023-12-04 08:18:39,764 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.84 vs. limit=15.0 2023-12-04 08:18:53,067 INFO [train.py:1087] (2/4) Epoch 35, batch 50, loss[loss=0.1596, simple_loss=0.2526, pruned_loss=0.03332, over 24752.00 frames. ], tot_loss[loss=0.164, simple_loss=0.2555, pruned_loss=0.03626, over 1102332.61 frames. 
], batch size: 70, lr: 7.04e-03, grad_scale: 32.0 2023-12-04 08:18:54,384 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=203200.0, ans=0.125 2023-12-04 08:19:09,248 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=203266.66666666666, ans=0.125 2023-12-04 08:19:09,567 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.72 vs. limit=22.5 2023-12-04 08:19:39,831 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=203466.66666666666, ans=0.2 2023-12-04 08:19:46,873 INFO [train.py:1087] (2/4) Epoch 35, batch 100, loss[loss=0.1667, simple_loss=0.2579, pruned_loss=0.03774, over 24712.00 frames. ], tot_loss[loss=0.164, simple_loss=0.2557, pruned_loss=0.03616, over 1926973.40 frames. ], batch size: 74, lr: 7.03e-03, grad_scale: 32.0 2023-12-04 08:19:48,151 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=203533.33333333334, ans=0.0 2023-12-04 08:19:58,777 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.180e+02 1.339e+02 1.417e+02 1.550e+02 3.139e+02, threshold=2.834e+02, percent-clipped=1.0 2023-12-04 08:20:03,254 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=203600.0, ans=0.125 2023-12-04 08:20:04,325 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=203600.0, ans=0.04949747468305833 2023-12-04 08:20:42,457 INFO [train.py:1087] (2/4) Epoch 35, batch 150, loss[loss=0.1625, simple_loss=0.2503, pruned_loss=0.03735, over 24786.00 frames. ], tot_loss[loss=0.1645, simple_loss=0.2559, pruned_loss=0.03653, over 2559604.64 frames. ], batch size: 62, lr: 7.03e-03, grad_scale: 32.0 2023-12-04 08:21:03,878 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=204000.0, ans=0.125 2023-12-04 08:21:05,666 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=204000.0, ans=0.07 2023-12-04 08:21:14,586 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.45 vs. limit=6.0 2023-12-04 08:21:15,106 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=204066.66666666666, ans=0.1 2023-12-04 08:21:17,282 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=204066.66666666666, ans=0.125 2023-12-04 08:21:25,755 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=204066.66666666666, ans=0.1 2023-12-04 08:21:25,806 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=204066.66666666666, ans=0.1 2023-12-04 08:21:38,516 INFO [train.py:1087] (2/4) Epoch 35, batch 200, loss[loss=0.1569, simple_loss=0.2477, pruned_loss=0.03303, over 24564.00 frames. ], tot_loss[loss=0.1647, simple_loss=0.2559, pruned_loss=0.03673, over 3058884.12 frames. 
], batch size: 63, lr: 7.02e-03, grad_scale: 32.0 2023-12-04 08:21:47,256 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=204200.0, ans=0.1 2023-12-04 08:21:49,182 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.112e+02 1.298e+02 1.418e+02 1.532e+02 2.225e+02, threshold=2.837e+02, percent-clipped=0.0 2023-12-04 08:21:51,533 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=204266.66666666666, ans=0.0 2023-12-04 08:21:51,590 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=204266.66666666666, ans=0.125 2023-12-04 08:21:59,989 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=204333.33333333334, ans=0.1 2023-12-04 08:22:01,035 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=204333.33333333334, ans=0.1 2023-12-04 08:22:34,153 INFO [train.py:1087] (2/4) Epoch 35, batch 250, loss[loss=0.1777, simple_loss=0.2727, pruned_loss=0.04141, over 24317.00 frames. ], tot_loss[loss=0.1651, simple_loss=0.2565, pruned_loss=0.0369, over 3454251.48 frames. ], batch size: 79, lr: 7.02e-03, grad_scale: 64.0 2023-12-04 08:22:35,509 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=204533.33333333334, ans=0.09899494936611666 2023-12-04 08:22:36,512 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=204533.33333333334, ans=0.125 2023-12-04 08:22:49,501 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=204600.0, ans=0.125 2023-12-04 08:22:51,068 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=12.0 2023-12-04 08:22:52,720 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=204600.0, ans=0.125 2023-12-04 08:23:04,422 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=204666.66666666666, ans=0.125 2023-12-04 08:23:11,914 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=204733.33333333334, ans=0.125 2023-12-04 08:23:23,909 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=204800.0, ans=0.0 2023-12-04 08:23:29,179 INFO [train.py:1087] (2/4) Epoch 35, batch 300, loss[loss=0.1603, simple_loss=0.2507, pruned_loss=0.03491, over 24727.00 frames. ], tot_loss[loss=0.165, simple_loss=0.2563, pruned_loss=0.03685, over 3763788.63 frames. ], batch size: 61, lr: 7.01e-03, grad_scale: 32.0 2023-12-04 08:23:41,636 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.318e+02 1.389e+02 1.563e+02 2.065e+02, threshold=2.778e+02, percent-clipped=0.0 2023-12-04 08:23:44,474 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.98 vs. 
limit=15.0 2023-12-04 08:23:57,425 INFO [scaling.py:1022] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.55 vs. limit=8.0 2023-12-04 08:24:09,453 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=205066.66666666666, ans=0.0 2023-12-04 08:24:20,494 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.76 vs. limit=22.5 2023-12-04 08:24:24,354 INFO [train.py:1087] (2/4) Epoch 35, batch 350, loss[loss=0.1674, simple_loss=0.2597, pruned_loss=0.03753, over 24570.00 frames. ], tot_loss[loss=0.1651, simple_loss=0.2562, pruned_loss=0.03701, over 3993268.88 frames. ], batch size: 64, lr: 7.01e-03, grad_scale: 32.0 2023-12-04 08:24:37,200 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=205266.66666666666, ans=0.2 2023-12-04 08:24:58,338 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.02 vs. limit=15.0 2023-12-04 08:25:07,039 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=205400.0, ans=0.125 2023-12-04 08:25:19,669 INFO [train.py:1087] (2/4) Epoch 35, batch 400, loss[loss=0.1717, simple_loss=0.2652, pruned_loss=0.03912, over 23971.00 frames. ], tot_loss[loss=0.1652, simple_loss=0.2562, pruned_loss=0.03713, over 4171837.48 frames. ], batch size: 87, lr: 7.00e-03, grad_scale: 32.0 2023-12-04 08:25:27,453 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=205533.33333333334, ans=0.125 2023-12-04 08:25:30,956 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=205600.0, ans=0.125 2023-12-04 08:25:31,777 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.126e+02 1.342e+02 1.464e+02 1.631e+02 2.167e+02, threshold=2.928e+02, percent-clipped=0.0 2023-12-04 08:25:40,432 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=205600.0, ans=0.2 2023-12-04 08:25:45,664 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=205666.66666666666, ans=0.0 2023-12-04 08:25:47,954 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=205666.66666666666, ans=0.1 2023-12-04 08:26:04,479 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.34 vs. limit=15.0 2023-12-04 08:26:15,231 INFO [train.py:1087] (2/4) Epoch 35, batch 450, loss[loss=0.1645, simple_loss=0.2563, pruned_loss=0.03634, over 24775.00 frames. ], tot_loss[loss=0.1654, simple_loss=0.2562, pruned_loss=0.03731, over 4303776.35 frames. 
], batch size: 71, lr: 7.00e-03, grad_scale: 32.0 2023-12-04 08:26:15,417 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=205866.66666666666, ans=0.0 2023-12-04 08:26:21,778 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=205866.66666666666, ans=0.1 2023-12-04 08:26:51,679 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=206066.66666666666, ans=0.5 2023-12-04 08:26:53,154 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.49 vs. limit=12.0 2023-12-04 08:27:02,638 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.15 vs. limit=6.0 2023-12-04 08:27:10,535 INFO [train.py:1087] (2/4) Epoch 35, batch 500, loss[loss=0.1578, simple_loss=0.2496, pruned_loss=0.03299, over 24853.00 frames. ], tot_loss[loss=0.1649, simple_loss=0.2559, pruned_loss=0.03694, over 4414086.76 frames. ], batch size: 68, lr: 6.99e-03, grad_scale: 32.0 2023-12-04 08:27:13,984 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=206200.0, ans=0.125 2023-12-04 08:27:22,124 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.152e+02 1.328e+02 1.465e+02 1.669e+02 2.352e+02, threshold=2.929e+02, percent-clipped=0.0 2023-12-04 08:27:22,386 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=206266.66666666666, ans=0.0 2023-12-04 08:27:32,840 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=206333.33333333334, ans=0.125 2023-12-04 08:27:38,262 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=206333.33333333334, ans=0.125 2023-12-04 08:27:51,950 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=206400.0, ans=0.0 2023-12-04 08:28:01,653 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=206466.66666666666, ans=0.125 2023-12-04 08:28:04,619 INFO [train.py:1087] (2/4) Epoch 35, batch 550, loss[loss=0.1616, simple_loss=0.253, pruned_loss=0.03508, over 24482.00 frames. ], tot_loss[loss=0.1649, simple_loss=0.2559, pruned_loss=0.03693, over 4506526.87 frames. ], batch size: 77, lr: 6.98e-03, grad_scale: 32.0 2023-12-04 08:28:13,161 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=206533.33333333334, ans=0.09899494936611666 2023-12-04 08:28:17,482 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=206600.0, ans=0.0 2023-12-04 08:28:17,734 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.36 vs. 
limit=15.0 2023-12-04 08:28:20,630 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=206600.0, ans=0.09899494936611666 2023-12-04 08:28:20,631 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=206600.0, ans=0.0 2023-12-04 08:28:22,241 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.41 vs. limit=15.0 2023-12-04 08:28:28,380 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=206666.66666666666, ans=0.125 2023-12-04 08:28:44,482 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.49 vs. limit=12.0 2023-12-04 08:28:45,551 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.54 vs. limit=15.0 2023-12-04 08:28:59,956 INFO [train.py:1087] (2/4) Epoch 35, batch 600, loss[loss=0.1672, simple_loss=0.2548, pruned_loss=0.03976, over 24784.00 frames. ], tot_loss[loss=0.1649, simple_loss=0.2559, pruned_loss=0.03699, over 4575382.38 frames. ], batch size: 62, lr: 6.98e-03, grad_scale: 32.0 2023-12-04 08:29:01,368 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=206866.66666666666, ans=0.1 2023-12-04 08:29:04,083 INFO [scaling.py:1022] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.47 vs. limit=5.0 2023-12-04 08:29:06,493 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=206866.66666666666, ans=0.0 2023-12-04 08:29:12,029 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.106e+02 1.323e+02 1.412e+02 1.529e+02 2.400e+02, threshold=2.825e+02, percent-clipped=0.0 2023-12-04 08:29:25,956 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=207000.0, ans=15.0 2023-12-04 08:29:55,536 INFO [train.py:1087] (2/4) Epoch 35, batch 650, loss[loss=0.1644, simple_loss=0.255, pruned_loss=0.03695, over 22118.00 frames. ], tot_loss[loss=0.1648, simple_loss=0.2557, pruned_loss=0.03692, over 4623047.47 frames. ], batch size: 53, lr: 6.97e-03, grad_scale: 32.0 2023-12-04 08:29:56,729 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=207200.0, ans=0.2 2023-12-04 08:29:59,973 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=207200.0, ans=0.0 2023-12-04 08:30:12,600 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=207266.66666666666, ans=0.5 2023-12-04 08:30:13,657 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=207266.66666666666, ans=0.05 2023-12-04 08:30:22,167 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=207333.33333333334, ans=0.1 2023-12-04 08:30:24,997 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.73 vs. 
limit=22.5 2023-12-04 08:30:28,947 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=207400.0, ans=0.05 2023-12-04 08:30:45,947 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=207466.66666666666, ans=0.125 2023-12-04 08:30:49,846 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=207533.33333333334, ans=0.0 2023-12-04 08:30:50,705 INFO [train.py:1087] (2/4) Epoch 35, batch 700, loss[loss=0.1756, simple_loss=0.2669, pruned_loss=0.04218, over 23532.00 frames. ], tot_loss[loss=0.1646, simple_loss=0.2556, pruned_loss=0.03678, over 4684262.70 frames. ], batch size: 94, lr: 6.97e-03, grad_scale: 32.0 2023-12-04 08:30:59,781 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=207533.33333333334, ans=0.0 2023-12-04 08:31:01,010 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=207600.0, ans=0.125 2023-12-04 08:31:02,123 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=207600.0, ans=0.125 2023-12-04 08:31:02,811 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.168e+02 1.354e+02 1.423e+02 1.583e+02 2.167e+02, threshold=2.846e+02, percent-clipped=0.0 2023-12-04 08:31:04,121 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=207600.0, ans=0.125 2023-12-04 08:31:09,669 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.35 vs. limit=15.0 2023-12-04 08:31:27,472 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=207733.33333333334, ans=0.025 2023-12-04 08:31:45,651 INFO [train.py:1087] (2/4) Epoch 35, batch 750, loss[loss=0.154, simple_loss=0.2494, pruned_loss=0.0293, over 24716.00 frames. ], tot_loss[loss=0.1646, simple_loss=0.2556, pruned_loss=0.03684, over 4716775.85 frames. ], batch size: 69, lr: 6.96e-03, grad_scale: 32.0 2023-12-04 08:31:55,438 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=207866.66666666666, ans=0.125 2023-12-04 08:32:19,209 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=208066.66666666666, ans=0.0 2023-12-04 08:32:22,112 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:32:29,494 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=208133.33333333334, ans=0.05 2023-12-04 08:32:34,029 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=208133.33333333334, ans=0.2 2023-12-04 08:32:41,127 INFO [train.py:1087] (2/4) Epoch 35, batch 800, loss[loss=0.1637, simple_loss=0.2527, pruned_loss=0.03729, over 24716.00 frames. ], tot_loss[loss=0.1645, simple_loss=0.2555, pruned_loss=0.03674, over 4735314.26 frames. 
], batch size: 69, lr: 6.96e-03, grad_scale: 32.0 2023-12-04 08:32:46,711 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=208200.0, ans=0.0 2023-12-04 08:32:52,906 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.157e+02 1.339e+02 1.480e+02 1.653e+02 2.128e+02, threshold=2.959e+02, percent-clipped=0.0 2023-12-04 08:33:00,514 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=208266.66666666666, ans=0.125 2023-12-04 08:33:08,564 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=208333.33333333334, ans=0.125 2023-12-04 08:33:32,760 INFO [train.py:1087] (2/4) Epoch 35, batch 850, loss[loss=0.1674, simple_loss=0.2606, pruned_loss=0.03708, over 24362.00 frames. ], tot_loss[loss=0.1645, simple_loss=0.2555, pruned_loss=0.03677, over 4756114.95 frames. ], batch size: 79, lr: 6.95e-03, grad_scale: 32.0 2023-12-04 08:33:42,886 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=208600.0, ans=0.125 2023-12-04 08:34:06,971 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=208733.33333333334, ans=0.125 2023-12-04 08:34:33,683 INFO [train.py:1087] (2/4) Epoch 36, batch 0, loss[loss=0.1609, simple_loss=0.254, pruned_loss=0.03391, over 24556.00 frames. ], tot_loss[loss=0.1609, simple_loss=0.254, pruned_loss=0.03391, over 24556.00 frames. ], batch size: 63, lr: 6.85e-03, grad_scale: 32.0 2023-12-04 08:34:33,684 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 08:34:45,809 INFO [train.py:1119] (2/4) Epoch 36, validation: loss=0.1524, simple_loss=0.2526, pruned_loss=0.0261, over 944034.00 frames. 2023-12-04 08:34:45,809 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 08:34:51,316 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=208833.33333333334, ans=0.0 2023-12-04 08:34:58,566 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=208900.0, ans=0.1 2023-12-04 08:35:02,241 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=208900.0, ans=0.125 2023-12-04 08:35:03,384 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.121e+02 1.345e+02 1.425e+02 1.631e+02 2.922e+02, threshold=2.850e+02, percent-clipped=0.0 2023-12-04 08:35:03,631 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=208900.0, ans=0.125 2023-12-04 08:35:08,276 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.26 vs. limit=15.0 2023-12-04 08:35:09,907 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=208966.66666666666, ans=0.125 2023-12-04 08:35:19,367 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=209033.33333333334, ans=0.125 2023-12-04 08:35:41,321 INFO [train.py:1087] (2/4) Epoch 36, batch 50, loss[loss=0.1623, simple_loss=0.2565, pruned_loss=0.03409, over 24170.00 frames. 
], tot_loss[loss=0.1635, simple_loss=0.2548, pruned_loss=0.03613, over 1089759.67 frames. ], batch size: 82, lr: 6.84e-03, grad_scale: 32.0 2023-12-04 08:35:51,320 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=209233.33333333334, ans=0.0 2023-12-04 08:35:58,642 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=209233.33333333334, ans=0.125 2023-12-04 08:36:01,248 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.29 vs. limit=15.0 2023-12-04 08:36:05,442 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=209300.0, ans=0.0 2023-12-04 08:36:05,529 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=209300.0, ans=0.125 2023-12-04 08:36:07,994 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=209300.0, ans=0.05 2023-12-04 08:36:20,363 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:36:26,627 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=209433.33333333334, ans=0.5 2023-12-04 08:36:35,711 INFO [train.py:1087] (2/4) Epoch 36, batch 100, loss[loss=0.1727, simple_loss=0.2559, pruned_loss=0.04474, over 24513.00 frames. ], tot_loss[loss=0.1639, simple_loss=0.2552, pruned_loss=0.03631, over 1921321.30 frames. ], batch size: 75, lr: 6.84e-03, grad_scale: 32.0 2023-12-04 08:36:36,691 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.21 vs. limit=22.5 2023-12-04 08:36:53,708 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.161e+02 1.355e+02 1.450e+02 1.613e+02 2.179e+02, threshold=2.899e+02, percent-clipped=0.0 2023-12-04 08:36:55,093 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=209566.66666666666, ans=0.0 2023-12-04 08:36:56,016 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=209566.66666666666, ans=0.1 2023-12-04 08:37:00,756 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.31 vs. limit=15.0 2023-12-04 08:37:05,551 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=209633.33333333334, ans=0.125 2023-12-04 08:37:27,016 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=209766.66666666666, ans=0.1 2023-12-04 08:37:30,869 INFO [train.py:1087] (2/4) Epoch 36, batch 150, loss[loss=0.1636, simple_loss=0.2523, pruned_loss=0.03743, over 24564.00 frames. ], tot_loss[loss=0.164, simple_loss=0.2551, pruned_loss=0.03639, over 2574483.56 frames. 
], batch size: 63, lr: 6.83e-03, grad_scale: 32.0 2023-12-04 08:37:40,251 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=209833.33333333334, ans=0.1 2023-12-04 08:37:40,326 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=209833.33333333334, ans=0.0 2023-12-04 08:37:43,769 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=209900.0, ans=0.0 2023-12-04 08:38:04,829 INFO [scaling.py:1022] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.86 vs. limit=8.0 2023-12-04 08:38:07,368 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=210033.33333333334, ans=0.0 2023-12-04 08:38:26,339 INFO [train.py:1087] (2/4) Epoch 36, batch 200, loss[loss=0.1639, simple_loss=0.2538, pruned_loss=0.03702, over 24606.00 frames. ], tot_loss[loss=0.1645, simple_loss=0.2554, pruned_loss=0.03681, over 3048766.70 frames. ], batch size: 68, lr: 6.83e-03, grad_scale: 32.0 2023-12-04 08:38:35,029 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=210166.66666666666, ans=0.1 2023-12-04 08:38:37,797 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.97 vs. limit=15.0 2023-12-04 08:38:43,446 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=210233.33333333334, ans=0.0 2023-12-04 08:38:44,192 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.131e+02 1.333e+02 1.452e+02 1.584e+02 2.539e+02, threshold=2.905e+02, percent-clipped=0.0 2023-12-04 08:39:21,279 INFO [train.py:1087] (2/4) Epoch 36, batch 250, loss[loss=0.1689, simple_loss=0.2603, pruned_loss=0.03873, over 24215.00 frames. ], tot_loss[loss=0.1651, simple_loss=0.2559, pruned_loss=0.03711, over 3421694.28 frames. ], batch size: 82, lr: 6.82e-03, grad_scale: 16.0 2023-12-04 08:39:24,700 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=210500.0, ans=0.125 2023-12-04 08:39:43,041 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=210633.33333333334, ans=0.0 2023-12-04 08:40:02,688 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=210700.0, ans=0.125 2023-12-04 08:40:16,963 INFO [train.py:1087] (2/4) Epoch 36, batch 300, loss[loss=0.1515, simple_loss=0.2417, pruned_loss=0.03066, over 24561.00 frames. ], tot_loss[loss=0.1647, simple_loss=0.2557, pruned_loss=0.03684, over 3741837.07 frames. ], batch size: 66, lr: 6.82e-03, grad_scale: 16.0 2023-12-04 08:40:35,392 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.122e+02 1.298e+02 1.397e+02 1.525e+02 2.396e+02, threshold=2.794e+02, percent-clipped=0.0 2023-12-04 08:40:35,718 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=210900.0, ans=0.125 2023-12-04 08:40:39,120 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.55 vs. 
limit=12.0 2023-12-04 08:40:48,493 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.25 vs. limit=15.0 2023-12-04 08:40:55,466 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=211033.33333333334, ans=0.125 2023-12-04 08:40:57,618 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=211033.33333333334, ans=0.2 2023-12-04 08:41:11,509 INFO [train.py:1087] (2/4) Epoch 36, batch 350, loss[loss=0.1621, simple_loss=0.2544, pruned_loss=0.03486, over 24703.00 frames. ], tot_loss[loss=0.165, simple_loss=0.2559, pruned_loss=0.037, over 3963227.96 frames. ], batch size: 69, lr: 6.81e-03, grad_scale: 16.0 2023-12-04 08:41:15,734 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=211166.66666666666, ans=0.09899494936611666 2023-12-04 08:41:28,920 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:41:29,074 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=12.0 2023-12-04 08:41:44,211 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:41:55,813 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=211433.33333333334, ans=0.125 2023-12-04 08:42:02,080 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=211433.33333333334, ans=0.125 2023-12-04 08:42:07,142 INFO [train.py:1087] (2/4) Epoch 36, batch 400, loss[loss=0.1574, simple_loss=0.2487, pruned_loss=0.03307, over 24555.00 frames. ], tot_loss[loss=0.1645, simple_loss=0.2556, pruned_loss=0.0367, over 4155501.46 frames. ], batch size: 66, lr: 6.81e-03, grad_scale: 32.0 2023-12-04 08:42:12,724 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=211500.0, ans=0.09899494936611666 2023-12-04 08:42:26,283 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.148e+02 1.324e+02 1.435e+02 1.583e+02 2.493e+02, threshold=2.870e+02, percent-clipped=0.0 2023-12-04 08:42:26,536 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=211566.66666666666, ans=0.0 2023-12-04 08:42:28,507 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=211633.33333333334, ans=0.0 2023-12-04 08:42:33,860 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=211633.33333333334, ans=0.2 2023-12-04 08:43:02,163 INFO [train.py:1087] (2/4) Epoch 36, batch 450, loss[loss=0.1543, simple_loss=0.2427, pruned_loss=0.03297, over 24552.00 frames. ], tot_loss[loss=0.1637, simple_loss=0.2549, pruned_loss=0.03629, over 4291449.41 frames. 
], batch size: 66, lr: 6.80e-03, grad_scale: 32.0 2023-12-04 08:43:25,435 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=211966.66666666666, ans=0.125 2023-12-04 08:43:33,037 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=211966.66666666666, ans=0.2 2023-12-04 08:43:36,081 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=212033.33333333334, ans=0.1 2023-12-04 08:43:36,082 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=212033.33333333334, ans=0.125 2023-12-04 08:43:43,556 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=212033.33333333334, ans=0.0 2023-12-04 08:43:46,762 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=212100.0, ans=0.0 2023-12-04 08:43:49,216 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:43:57,322 INFO [train.py:1087] (2/4) Epoch 36, batch 500, loss[loss=0.1654, simple_loss=0.2535, pruned_loss=0.03864, over 24697.00 frames. ], tot_loss[loss=0.1635, simple_loss=0.2547, pruned_loss=0.03619, over 4410572.54 frames. ], batch size: 69, lr: 6.80e-03, grad_scale: 32.0 2023-12-04 08:44:06,976 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=212233.33333333334, ans=0.125 2023-12-04 08:44:15,375 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.137e+02 1.299e+02 1.402e+02 1.525e+02 1.849e+02, threshold=2.804e+02, percent-clipped=0.0 2023-12-04 08:44:16,724 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=212233.33333333334, ans=0.0 2023-12-04 08:44:28,820 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.54 vs. limit=15.0 2023-12-04 08:44:36,904 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=212366.66666666666, ans=0.0 2023-12-04 08:44:51,608 INFO [train.py:1087] (2/4) Epoch 36, batch 550, loss[loss=0.1547, simple_loss=0.2546, pruned_loss=0.02742, over 24786.00 frames. ], tot_loss[loss=0.1632, simple_loss=0.2544, pruned_loss=0.03601, over 4508925.23 frames. 
], batch size: 71, lr: 6.79e-03, grad_scale: 32.0 2023-12-04 08:44:59,344 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:45:09,005 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=212566.66666666666, ans=0.1 2023-12-04 08:45:14,340 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=212633.33333333334, ans=0.0 2023-12-04 08:45:25,966 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=212700.0, ans=0.125 2023-12-04 08:45:46,451 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=212833.33333333334, ans=0.0 2023-12-04 08:45:47,207 INFO [train.py:1087] (2/4) Epoch 36, batch 600, loss[loss=0.1683, simple_loss=0.2609, pruned_loss=0.03789, over 24478.00 frames. ], tot_loss[loss=0.163, simple_loss=0.2543, pruned_loss=0.03583, over 4583055.04 frames. ], batch size: 75, lr: 6.79e-03, grad_scale: 32.0 2023-12-04 08:45:51,704 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=212833.33333333334, ans=0.05 2023-12-04 08:46:06,702 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.142e+02 1.299e+02 1.430e+02 1.549e+02 1.919e+02, threshold=2.861e+02, percent-clipped=0.0 2023-12-04 08:46:38,885 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=213100.0, ans=0.1 2023-12-04 08:46:42,911 INFO [train.py:1087] (2/4) Epoch 36, batch 650, loss[loss=0.1608, simple_loss=0.2548, pruned_loss=0.03335, over 24762.00 frames. ], tot_loss[loss=0.1631, simple_loss=0.2544, pruned_loss=0.03587, over 4625618.86 frames. ], batch size: 64, lr: 6.78e-03, grad_scale: 32.0 2023-12-04 08:46:43,113 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=213166.66666666666, ans=0.125 2023-12-04 08:46:50,487 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=213166.66666666666, ans=0.0 2023-12-04 08:46:53,806 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=213233.33333333334, ans=0.0 2023-12-04 08:46:58,877 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.84 vs. limit=15.0 2023-12-04 08:47:07,774 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.28 vs. limit=22.5 2023-12-04 08:47:24,909 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=213366.66666666666, ans=0.125 2023-12-04 08:47:34,243 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=213433.33333333334, ans=0.0 2023-12-04 08:47:40,335 INFO [train.py:1087] (2/4) Epoch 36, batch 700, loss[loss=0.1883, simple_loss=0.2707, pruned_loss=0.05291, over 17210.00 frames. ], tot_loss[loss=0.1631, simple_loss=0.2544, pruned_loss=0.0359, over 4673003.87 frames. 
], batch size: 176, lr: 6.78e-03, grad_scale: 32.0 2023-12-04 08:47:53,608 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:47:58,628 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.184e+02 1.337e+02 1.455e+02 1.588e+02 2.425e+02, threshold=2.911e+02, percent-clipped=0.0 2023-12-04 08:48:15,285 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-12-04 08:48:29,118 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=213766.66666666666, ans=0.125 2023-12-04 08:48:35,891 INFO [train.py:1087] (2/4) Epoch 36, batch 750, loss[loss=0.1655, simple_loss=0.2534, pruned_loss=0.03877, over 24753.00 frames. ], tot_loss[loss=0.1627, simple_loss=0.2541, pruned_loss=0.03569, over 4715163.69 frames. ], batch size: 70, lr: 6.77e-03, grad_scale: 32.0 2023-12-04 08:48:39,675 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=213833.33333333334, ans=0.2 2023-12-04 08:48:48,493 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.66 vs. limit=6.0 2023-12-04 08:48:51,732 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=213900.0, ans=0.2 2023-12-04 08:48:52,999 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=213900.0, ans=15.0 2023-12-04 08:48:58,907 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=213966.66666666666, ans=0.1 2023-12-04 08:48:59,010 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=213966.66666666666, ans=0.0 2023-12-04 08:49:01,015 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=213966.66666666666, ans=0.125 2023-12-04 08:49:12,888 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=214033.33333333334, ans=0.125 2023-12-04 08:49:31,178 INFO [train.py:1087] (2/4) Epoch 36, batch 800, loss[loss=0.159, simple_loss=0.2517, pruned_loss=0.03312, over 24704.00 frames. ], tot_loss[loss=0.1625, simple_loss=0.2538, pruned_loss=0.03559, over 4748003.92 frames. 
], batch size: 69, lr: 6.77e-03, grad_scale: 32.0 2023-12-04 08:49:36,043 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=214166.66666666666, ans=0.1 2023-12-04 08:49:36,132 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=214166.66666666666, ans=0.125 2023-12-04 08:49:39,561 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=214166.66666666666, ans=0.1 2023-12-04 08:49:45,589 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=214233.33333333334, ans=0.2 2023-12-04 08:49:49,604 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.102e+02 1.303e+02 1.430e+02 1.543e+02 2.060e+02, threshold=2.860e+02, percent-clipped=0.0 2023-12-04 08:49:49,791 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:50:09,686 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=214366.66666666666, ans=0.125 2023-12-04 08:50:22,389 INFO [train.py:1087] (2/4) Epoch 36, batch 850, loss[loss=0.1708, simple_loss=0.265, pruned_loss=0.03827, over 24561.00 frames. ], tot_loss[loss=0.1631, simple_loss=0.2543, pruned_loss=0.03599, over 4756347.47 frames. ], batch size: 63, lr: 6.76e-03, grad_scale: 32.0 2023-12-04 08:50:36,551 INFO [scaling.py:1022] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.92 vs. limit=5.0 2023-12-04 08:50:36,993 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=214566.66666666666, ans=0.0 2023-12-04 08:50:41,932 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=214633.33333333334, ans=0.125 2023-12-04 08:50:53,269 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=214700.0, ans=0.1 2023-12-04 08:51:23,558 INFO [train.py:1087] (2/4) Epoch 37, batch 0, loss[loss=0.1938, simple_loss=0.2724, pruned_loss=0.05756, over 16772.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2724, pruned_loss=0.05756, over 16772.00 frames. ], batch size: 178, lr: 6.67e-03, grad_scale: 32.0 2023-12-04 08:51:23,561 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 08:51:34,157 INFO [zipformer.py:1876] (2/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.6232, 4.3503, 4.4101, 4.3646], device='cuda:2') 2023-12-04 08:51:35,845 INFO [train.py:1119] (2/4) Epoch 37, validation: loss=0.1531, simple_loss=0.2525, pruned_loss=0.02684, over 944034.00 frames. 2023-12-04 08:51:35,845 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 08:51:39,495 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.48 vs. 
limit=15.0 2023-12-04 08:51:40,333 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=214800.0, ans=0.125 2023-12-04 08:51:54,173 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=214866.66666666666, ans=0.2 2023-12-04 08:52:00,135 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.176e+02 1.321e+02 1.427e+02 1.586e+02 2.182e+02, threshold=2.854e+02, percent-clipped=0.0 2023-12-04 08:52:02,519 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=214933.33333333334, ans=0.2 2023-12-04 08:52:08,235 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=215000.0, ans=0.125 2023-12-04 08:52:13,689 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.30 vs. limit=15.0 2023-12-04 08:52:27,810 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=215066.66666666666, ans=0.02 2023-12-04 08:52:30,597 INFO [train.py:1087] (2/4) Epoch 37, batch 50, loss[loss=0.1634, simple_loss=0.2549, pruned_loss=0.03596, over 24764.00 frames. ], tot_loss[loss=0.1627, simple_loss=0.2542, pruned_loss=0.03555, over 1085060.75 frames. ], batch size: 65, lr: 6.66e-03, grad_scale: 32.0 2023-12-04 08:52:39,829 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=215133.33333333334, ans=0.1 2023-12-04 08:52:44,124 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=215200.0, ans=0.125 2023-12-04 08:52:45,174 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=215200.0, ans=0.025 2023-12-04 08:52:48,321 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=215200.0, ans=0.1 2023-12-04 08:52:53,036 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=215266.66666666666, ans=0.125 2023-12-04 08:53:04,516 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=215333.33333333334, ans=0.02 2023-12-04 08:53:20,919 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=215400.0, ans=0.125 2023-12-04 08:53:21,899 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=215400.0, ans=0.125 2023-12-04 08:53:26,350 INFO [train.py:1087] (2/4) Epoch 37, batch 100, loss[loss=0.1483, simple_loss=0.243, pruned_loss=0.0268, over 24777.00 frames. ], tot_loss[loss=0.1633, simple_loss=0.2549, pruned_loss=0.03582, over 1908764.60 frames. ], batch size: 71, lr: 6.66e-03, grad_scale: 32.0 2023-12-04 08:53:31,388 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=215466.66666666666, ans=0.0 2023-12-04 08:53:34,969 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.67 vs. 
limit=15.0 2023-12-04 08:53:47,764 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=215600.0, ans=0.125 2023-12-04 08:53:50,664 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.117e+02 1.319e+02 1.413e+02 1.518e+02 2.064e+02, threshold=2.826e+02, percent-clipped=0.0 2023-12-04 08:53:57,017 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=215600.0, ans=0.0 2023-12-04 08:54:21,515 INFO [train.py:1087] (2/4) Epoch 37, batch 150, loss[loss=0.1565, simple_loss=0.2485, pruned_loss=0.03224, over 24561.00 frames. ], tot_loss[loss=0.1644, simple_loss=0.2559, pruned_loss=0.0365, over 2546359.21 frames. ], batch size: 62, lr: 6.65e-03, grad_scale: 16.0 2023-12-04 08:54:51,383 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=215933.33333333334, ans=0.2 2023-12-04 08:55:06,474 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=216066.66666666666, ans=0.0 2023-12-04 08:55:06,481 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=216066.66666666666, ans=0.125 2023-12-04 08:55:16,872 INFO [train.py:1087] (2/4) Epoch 37, batch 200, loss[loss=0.1573, simple_loss=0.2507, pruned_loss=0.03194, over 24558.00 frames. ], tot_loss[loss=0.1637, simple_loss=0.2551, pruned_loss=0.03613, over 3048721.33 frames. ], batch size: 66, lr: 6.65e-03, grad_scale: 16.0 2023-12-04 08:55:31,859 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=216200.0, ans=0.2 2023-12-04 08:55:32,154 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.59 vs. limit=15.0 2023-12-04 08:55:42,489 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.133e+02 1.342e+02 1.437e+02 1.645e+02 2.318e+02, threshold=2.874e+02, percent-clipped=0.0 2023-12-04 08:55:45,196 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.50 vs. limit=10.0 2023-12-04 08:55:51,347 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=216333.33333333334, ans=10.0 2023-12-04 08:55:53,608 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.42 vs. limit=15.0 2023-12-04 08:55:59,059 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=216333.33333333334, ans=0.0 2023-12-04 08:56:12,552 INFO [train.py:1087] (2/4) Epoch 37, batch 250, loss[loss=0.1566, simple_loss=0.2514, pruned_loss=0.03087, over 24779.00 frames. ], tot_loss[loss=0.1632, simple_loss=0.2545, pruned_loss=0.03591, over 3456565.39 frames. 
], batch size: 71, lr: 6.64e-03, grad_scale: 16.0 2023-12-04 08:56:17,042 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=216466.66666666666, ans=0.125 2023-12-04 08:56:23,470 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=216533.33333333334, ans=0.125 2023-12-04 08:56:23,525 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=216533.33333333334, ans=0.1 2023-12-04 08:56:56,851 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=216733.33333333334, ans=0.125 2023-12-04 08:57:08,618 INFO [train.py:1087] (2/4) Epoch 37, batch 300, loss[loss=0.1827, simple_loss=0.2778, pruned_loss=0.04383, over 21283.00 frames. ], tot_loss[loss=0.1632, simple_loss=0.2546, pruned_loss=0.0359, over 3743917.04 frames. ], batch size: 127, lr: 6.64e-03, grad_scale: 16.0 2023-12-04 08:57:17,613 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.36 vs. limit=15.0 2023-12-04 08:57:30,504 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=216933.33333333334, ans=0.125 2023-12-04 08:57:33,743 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.199e+02 1.407e+02 1.504e+02 1.639e+02 2.110e+02, threshold=3.009e+02, percent-clipped=0.0 2023-12-04 08:57:37,988 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=216933.33333333334, ans=0.125 2023-12-04 08:57:40,171 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=216933.33333333334, ans=0.035 2023-12-04 08:58:03,918 INFO [train.py:1087] (2/4) Epoch 37, batch 350, loss[loss=0.1532, simple_loss=0.243, pruned_loss=0.03175, over 24713.00 frames. ], tot_loss[loss=0.1637, simple_loss=0.255, pruned_loss=0.03624, over 3966482.47 frames. ], batch size: 74, lr: 6.63e-03, grad_scale: 16.0 2023-12-04 08:58:05,326 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=217133.33333333334, ans=0.0 2023-12-04 08:58:15,717 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=217200.0, ans=0.125 2023-12-04 08:58:18,821 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:58:22,098 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=217200.0, ans=0.125 2023-12-04 08:58:23,107 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=217200.0, ans=0.125 2023-12-04 08:58:49,673 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=217400.0, ans=0.125 2023-12-04 08:58:59,068 INFO [train.py:1087] (2/4) Epoch 37, batch 400, loss[loss=0.1623, simple_loss=0.2564, pruned_loss=0.03409, over 24576.00 frames. ], tot_loss[loss=0.1635, simple_loss=0.255, pruned_loss=0.03601, over 4161879.06 frames. 
], batch size: 65, lr: 6.63e-03, grad_scale: 32.0 2023-12-04 08:59:06,021 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=217466.66666666666, ans=0.0 2023-12-04 08:59:20,001 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=217533.33333333334, ans=0.0 2023-12-04 08:59:24,982 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.110e+02 1.319e+02 1.432e+02 1.652e+02 2.173e+02, threshold=2.865e+02, percent-clipped=0.0 2023-12-04 08:59:32,738 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=217666.66666666666, ans=0.1 2023-12-04 08:59:36,031 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=217666.66666666666, ans=0.2 2023-12-04 08:59:53,298 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=217733.33333333334, ans=0.125 2023-12-04 08:59:55,240 INFO [train.py:1087] (2/4) Epoch 37, batch 450, loss[loss=0.1722, simple_loss=0.2666, pruned_loss=0.03894, over 23732.00 frames. ], tot_loss[loss=0.1633, simple_loss=0.2548, pruned_loss=0.03585, over 4304628.31 frames. ], batch size: 57, lr: 6.62e-03, grad_scale: 32.0 2023-12-04 09:00:11,510 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=217866.66666666666, ans=0.0 2023-12-04 09:00:24,549 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=217933.33333333334, ans=0.0 2023-12-04 09:00:30,832 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=218000.0, ans=0.125 2023-12-04 09:00:37,351 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=218000.0, ans=0.125 2023-12-04 09:00:51,479 INFO [train.py:1087] (2/4) Epoch 37, batch 500, loss[loss=0.1638, simple_loss=0.2505, pruned_loss=0.03856, over 24511.00 frames. ], tot_loss[loss=0.1634, simple_loss=0.2548, pruned_loss=0.03599, over 4397138.63 frames. ], batch size: 75, lr: 6.62e-03, grad_scale: 16.0 2023-12-04 09:00:53,844 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=218133.33333333334, ans=0.125 2023-12-04 09:01:16,259 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=218266.66666666666, ans=0.125 2023-12-04 09:01:18,057 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.127e+02 1.330e+02 1.427e+02 1.554e+02 2.935e+02, threshold=2.854e+02, percent-clipped=1.0 2023-12-04 09:01:22,907 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.85 vs. 
limit=6.0 2023-12-04 09:01:23,636 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=218333.33333333334, ans=0.0 2023-12-04 09:01:30,427 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=218333.33333333334, ans=0.07 2023-12-04 09:01:34,003 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=218333.33333333334, ans=15.0 2023-12-04 09:01:46,686 INFO [train.py:1087] (2/4) Epoch 37, batch 550, loss[loss=0.167, simple_loss=0.2571, pruned_loss=0.03843, over 24851.00 frames. ], tot_loss[loss=0.1633, simple_loss=0.2547, pruned_loss=0.03597, over 4501105.69 frames. ], batch size: 68, lr: 6.61e-03, grad_scale: 16.0 2023-12-04 09:01:47,205 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.86 vs. limit=15.0 2023-12-04 09:01:58,929 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=218533.33333333334, ans=0.0 2023-12-04 09:02:04,724 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.68 vs. limit=10.0 2023-12-04 09:02:08,573 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=218600.0, ans=0.125 2023-12-04 09:02:08,639 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=218600.0, ans=0.1 2023-12-04 09:02:29,207 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=218666.66666666666, ans=0.125 2023-12-04 09:02:41,951 INFO [train.py:1087] (2/4) Epoch 37, batch 600, loss[loss=0.1534, simple_loss=0.2463, pruned_loss=0.03025, over 24752.00 frames. ], tot_loss[loss=0.1635, simple_loss=0.2549, pruned_loss=0.03606, over 4572730.85 frames. ], batch size: 63, lr: 6.61e-03, grad_scale: 16.0 2023-12-04 09:02:42,301 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=218800.0, ans=0.125 2023-12-04 09:02:53,870 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=218866.66666666666, ans=0.125 2023-12-04 09:03:04,930 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.42 vs. limit=15.0 2023-12-04 09:03:08,027 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=218933.33333333334, ans=0.125 2023-12-04 09:03:08,804 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.139e+02 1.340e+02 1.431e+02 1.602e+02 2.000e+02, threshold=2.862e+02, percent-clipped=0.0 2023-12-04 09:03:33,682 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=219066.66666666666, ans=0.125 2023-12-04 09:03:38,136 INFO [train.py:1087] (2/4) Epoch 37, batch 650, loss[loss=0.1645, simple_loss=0.2561, pruned_loss=0.03648, over 24714.00 frames. ], tot_loss[loss=0.1633, simple_loss=0.2546, pruned_loss=0.03602, over 4607711.72 frames. 
], batch size: 69, lr: 6.60e-03, grad_scale: 16.0 2023-12-04 09:03:39,468 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=219133.33333333334, ans=0.0 2023-12-04 09:03:49,251 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.21 vs. limit=15.0 2023-12-04 09:03:55,107 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=219200.0, ans=0.125 2023-12-04 09:04:15,651 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=219333.33333333334, ans=0.125 2023-12-04 09:04:23,342 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.33 vs. limit=10.0 2023-12-04 09:04:28,059 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:04:33,166 INFO [train.py:1087] (2/4) Epoch 37, batch 700, loss[loss=0.1757, simple_loss=0.2645, pruned_loss=0.04347, over 24506.00 frames. ], tot_loss[loss=0.164, simple_loss=0.2553, pruned_loss=0.03638, over 4649228.74 frames. ], batch size: 75, lr: 6.60e-03, grad_scale: 16.0 2023-12-04 09:04:58,238 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.19 vs. limit=15.0 2023-12-04 09:04:59,772 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.126e+02 1.349e+02 1.440e+02 1.623e+02 2.299e+02, threshold=2.881e+02, percent-clipped=0.0 2023-12-04 09:05:04,762 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.84 vs. limit=15.0 2023-12-04 09:05:16,206 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=219733.33333333334, ans=0.0 2023-12-04 09:05:25,692 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=219733.33333333334, ans=0.125 2023-12-04 09:05:27,810 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=219800.0, ans=0.125 2023-12-04 09:05:28,568 INFO [train.py:1087] (2/4) Epoch 37, batch 750, loss[loss=0.1734, simple_loss=0.2663, pruned_loss=0.04027, over 24021.00 frames. ], tot_loss[loss=0.1641, simple_loss=0.2553, pruned_loss=0.03648, over 4664702.14 frames. ], batch size: 87, lr: 6.59e-03, grad_scale: 16.0 2023-12-04 09:05:39,568 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=219866.66666666666, ans=0.0 2023-12-04 09:05:40,763 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=219866.66666666666, ans=0.0 2023-12-04 09:05:45,083 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=219866.66666666666, ans=0.0 2023-12-04 09:05:51,846 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=219933.33333333334, ans=0.2 2023-12-04 09:05:55,245 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.54 vs. 
limit=12.0 2023-12-04 09:06:11,354 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=220000.0, ans=0.2 2023-12-04 09:06:18,837 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=220066.66666666666, ans=0.2 2023-12-04 09:06:24,254 INFO [train.py:1087] (2/4) Epoch 37, batch 800, loss[loss=0.1633, simple_loss=0.2559, pruned_loss=0.03531, over 22662.00 frames. ], tot_loss[loss=0.1635, simple_loss=0.2549, pruned_loss=0.03606, over 4704618.49 frames. ], batch size: 106, lr: 6.59e-03, grad_scale: 32.0 2023-12-04 09:06:35,102 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-12-04 09:06:49,584 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.141e+02 1.291e+02 1.369e+02 1.487e+02 2.326e+02, threshold=2.738e+02, percent-clipped=0.0 2023-12-04 09:07:07,718 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=220400.0, ans=0.1 2023-12-04 09:07:07,750 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=220400.0, ans=0.1 2023-12-04 09:07:08,011 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.46 vs. limit=22.5 2023-12-04 09:07:11,608 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=220400.0, ans=0.1 2023-12-04 09:07:15,677 INFO [train.py:1087] (2/4) Epoch 37, batch 850, loss[loss=0.1631, simple_loss=0.2546, pruned_loss=0.03586, over 24554.00 frames. ], tot_loss[loss=0.164, simple_loss=0.2553, pruned_loss=0.03631, over 4725891.41 frames. ], batch size: 66, lr: 6.58e-03, grad_scale: 32.0 2023-12-04 09:07:18,926 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=220466.66666666666, ans=0.125 2023-12-04 09:07:27,323 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=15.0 2023-12-04 09:07:48,981 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=220666.66666666666, ans=0.0 2023-12-04 09:07:58,931 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=220733.33333333334, ans=0.125 2023-12-04 09:08:14,608 INFO [train.py:1087] (2/4) Epoch 38, batch 0, loss[loss=0.1697, simple_loss=0.2678, pruned_loss=0.03583, over 22763.00 frames. ], tot_loss[loss=0.1697, simple_loss=0.2678, pruned_loss=0.03583, over 22763.00 frames. ], batch size: 106, lr: 6.49e-03, grad_scale: 32.0 2023-12-04 09:08:14,609 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 09:08:26,961 INFO [train.py:1119] (2/4) Epoch 38, validation: loss=0.1535, simple_loss=0.2525, pruned_loss=0.02723, over 944034.00 frames. 
2023-12-04 09:08:26,962 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 09:08:31,497 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=220766.66666666666, ans=0.0 2023-12-04 09:08:32,932 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.19 vs. limit=15.0 2023-12-04 09:08:34,901 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.96 vs. limit=6.0 2023-12-04 09:08:35,622 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=220766.66666666666, ans=0.0 2023-12-04 09:08:43,851 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=220833.33333333334, ans=0.0 2023-12-04 09:08:54,774 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=220900.0, ans=0.1 2023-12-04 09:08:58,608 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.075e+02 1.382e+02 1.554e+02 1.707e+02 2.505e+02, threshold=3.109e+02, percent-clipped=0.0 2023-12-04 09:09:07,771 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.54 vs. limit=15.0 2023-12-04 09:09:12,754 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=221033.33333333334, ans=0.0 2023-12-04 09:09:17,079 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.32 vs. limit=15.0 2023-12-04 09:09:22,246 INFO [train.py:1087] (2/4) Epoch 38, batch 50, loss[loss=0.1712, simple_loss=0.2634, pruned_loss=0.03948, over 24797.00 frames. ], tot_loss[loss=0.1627, simple_loss=0.2544, pruned_loss=0.03546, over 1074828.69 frames. ], batch size: 72, lr: 6.49e-03, grad_scale: 32.0 2023-12-04 09:09:23,786 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=4.07 vs. limit=12.0 2023-12-04 09:09:33,237 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=221166.66666666666, ans=0.125 2023-12-04 09:09:37,576 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=221166.66666666666, ans=0.0 2023-12-04 09:09:47,244 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.87 vs. limit=22.5 2023-12-04 09:09:58,762 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=221300.0, ans=0.2 2023-12-04 09:09:59,242 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.48 vs. 
limit=22.5 2023-12-04 09:10:06,439 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=221366.66666666666, ans=0.2 2023-12-04 09:10:12,821 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=221366.66666666666, ans=0.125 2023-12-04 09:10:17,224 INFO [train.py:1087] (2/4) Epoch 38, batch 100, loss[loss=0.1569, simple_loss=0.2507, pruned_loss=0.03153, over 24795.00 frames. ], tot_loss[loss=0.1624, simple_loss=0.2539, pruned_loss=0.03544, over 1909208.63 frames. ], batch size: 73, lr: 6.48e-03, grad_scale: 32.0 2023-12-04 09:10:29,252 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=221500.0, ans=0.125 2023-12-04 09:10:46,503 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.56 vs. limit=15.0 2023-12-04 09:10:49,262 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.127e+02 1.289e+02 1.400e+02 1.532e+02 1.887e+02, threshold=2.800e+02, percent-clipped=0.0 2023-12-04 09:10:55,233 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=221633.33333333334, ans=0.1 2023-12-04 09:11:12,383 INFO [train.py:1087] (2/4) Epoch 38, batch 150, loss[loss=0.1756, simple_loss=0.2666, pruned_loss=0.0423, over 22927.00 frames. ], tot_loss[loss=0.1636, simple_loss=0.2547, pruned_loss=0.03624, over 2544809.48 frames. ], batch size: 106, lr: 6.48e-03, grad_scale: 32.0 2023-12-04 09:11:41,631 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=221900.0, ans=0.125 2023-12-04 09:11:47,735 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=221966.66666666666, ans=0.1 2023-12-04 09:12:02,263 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=222033.33333333334, ans=0.125 2023-12-04 09:12:07,437 INFO [train.py:1087] (2/4) Epoch 38, batch 200, loss[loss=0.1632, simple_loss=0.261, pruned_loss=0.03273, over 24762.00 frames. ], tot_loss[loss=0.1634, simple_loss=0.2546, pruned_loss=0.03614, over 3050233.74 frames. ], batch size: 66, lr: 6.47e-03, grad_scale: 32.0 2023-12-04 09:12:31,697 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.72 vs. limit=15.0 2023-12-04 09:12:39,652 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.109e+02 1.338e+02 1.405e+02 1.498e+02 2.257e+02, threshold=2.810e+02, percent-clipped=0.0 2023-12-04 09:12:41,325 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.55 vs. limit=15.0 2023-12-04 09:12:46,645 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=222300.0, ans=0.0 2023-12-04 09:13:03,355 INFO [train.py:1087] (2/4) Epoch 38, batch 250, loss[loss=0.1675, simple_loss=0.2571, pruned_loss=0.03896, over 24727.00 frames. ], tot_loss[loss=0.1634, simple_loss=0.2546, pruned_loss=0.03614, over 3450436.89 frames. 
], batch size: 69, lr: 6.47e-03, grad_scale: 32.0 2023-12-04 09:13:26,491 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:13:26,539 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=222566.66666666666, ans=0.125 2023-12-04 09:13:37,368 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=222633.33333333334, ans=0.125 2023-12-04 09:13:48,445 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=222700.0, ans=0.125 2023-12-04 09:13:51,037 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.38 vs. limit=22.5 2023-12-04 09:13:58,198 INFO [train.py:1087] (2/4) Epoch 38, batch 300, loss[loss=0.1549, simple_loss=0.2514, pruned_loss=0.02916, over 24805.00 frames. ], tot_loss[loss=0.1628, simple_loss=0.254, pruned_loss=0.03581, over 3763054.04 frames. ], batch size: 72, lr: 6.46e-03, grad_scale: 32.0 2023-12-04 09:14:05,305 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=222766.66666666666, ans=0.125 2023-12-04 09:14:15,207 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.76 vs. limit=15.0 2023-12-04 09:14:22,448 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=222900.0, ans=0.1 2023-12-04 09:14:29,660 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.99 vs. limit=12.0 2023-12-04 09:14:30,429 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.095e+02 1.310e+02 1.414e+02 1.549e+02 2.107e+02, threshold=2.828e+02, percent-clipped=0.0 2023-12-04 09:14:53,391 INFO [train.py:1087] (2/4) Epoch 38, batch 350, loss[loss=0.1594, simple_loss=0.2521, pruned_loss=0.0334, over 24555.00 frames. ], tot_loss[loss=0.1625, simple_loss=0.2538, pruned_loss=0.03564, over 4008276.71 frames. ], batch size: 62, lr: 6.46e-03, grad_scale: 32.0 2023-12-04 09:14:53,741 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=223100.0, ans=0.1 2023-12-04 09:15:06,926 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=223166.66666666666, ans=0.125 2023-12-04 09:15:18,617 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=223233.33333333334, ans=0.04949747468305833 2023-12-04 09:15:38,319 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.86 vs. limit=22.5 2023-12-04 09:15:48,421 INFO [train.py:1087] (2/4) Epoch 38, batch 400, loss[loss=0.1591, simple_loss=0.2511, pruned_loss=0.0335, over 24850.00 frames. ], tot_loss[loss=0.1634, simple_loss=0.2547, pruned_loss=0.03605, over 4170365.95 frames. 
], batch size: 68, lr: 6.45e-03, grad_scale: 32.0 2023-12-04 09:15:59,586 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=223500.0, ans=0.125 2023-12-04 09:16:06,936 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=223500.0, ans=6.0 2023-12-04 09:16:15,116 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=223566.66666666666, ans=0.0 2023-12-04 09:16:20,067 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.114e+02 1.304e+02 1.431e+02 1.583e+02 2.238e+02, threshold=2.862e+02, percent-clipped=0.0 2023-12-04 09:16:38,661 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=223700.0, ans=0.2 2023-12-04 09:16:43,664 INFO [train.py:1087] (2/4) Epoch 38, batch 450, loss[loss=0.1623, simple_loss=0.2597, pruned_loss=0.03246, over 24549.00 frames. ], tot_loss[loss=0.1628, simple_loss=0.2541, pruned_loss=0.03572, over 4307413.68 frames. ], batch size: 66, lr: 6.45e-03, grad_scale: 32.0 2023-12-04 09:16:47,387 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=22.5 2023-12-04 09:16:48,294 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=223766.66666666666, ans=0.125 2023-12-04 09:17:27,723 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=224033.33333333334, ans=0.2 2023-12-04 09:17:28,163 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.08 vs. limit=12.0 2023-12-04 09:17:29,725 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:17:38,698 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=224100.0, ans=0.1 2023-12-04 09:17:39,379 INFO [train.py:1087] (2/4) Epoch 38, batch 500, loss[loss=0.1601, simple_loss=0.2554, pruned_loss=0.03239, over 24718.00 frames. ], tot_loss[loss=0.1628, simple_loss=0.254, pruned_loss=0.0358, over 4412990.85 frames. ], batch size: 67, lr: 6.44e-03, grad_scale: 16.0 2023-12-04 09:17:49,367 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=224166.66666666666, ans=0.0 2023-12-04 09:17:51,759 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.83 vs. limit=22.5 2023-12-04 09:18:12,356 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.035e+02 1.312e+02 1.439e+02 1.556e+02 2.116e+02, threshold=2.878e+02, percent-clipped=0.0 2023-12-04 09:18:12,676 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=224300.0, ans=0.1 2023-12-04 09:18:20,595 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.06 vs. 
limit=22.5 2023-12-04 09:18:29,822 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=224366.66666666666, ans=0.2 2023-12-04 09:18:31,085 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.47 vs. limit=15.0 2023-12-04 09:18:33,754 INFO [train.py:1087] (2/4) Epoch 38, batch 550, loss[loss=0.1564, simple_loss=0.2478, pruned_loss=0.03252, over 24179.00 frames. ], tot_loss[loss=0.1627, simple_loss=0.254, pruned_loss=0.03572, over 4508409.23 frames. ], batch size: 82, lr: 6.44e-03, grad_scale: 16.0 2023-12-04 09:18:43,919 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.31 vs. limit=15.0 2023-12-04 09:18:58,522 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=224566.66666666666, ans=0.1 2023-12-04 09:19:01,960 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-12-04 09:19:04,864 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=224566.66666666666, ans=0.2 2023-12-04 09:19:06,186 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0 2023-12-04 09:19:24,923 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:19:28,928 INFO [train.py:1087] (2/4) Epoch 38, batch 600, loss[loss=0.1576, simple_loss=0.2483, pruned_loss=0.03343, over 24733.00 frames. ], tot_loss[loss=0.1629, simple_loss=0.2541, pruned_loss=0.03588, over 4582719.80 frames. ], batch size: 63, lr: 6.43e-03, grad_scale: 16.0 2023-12-04 09:19:37,010 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.55 vs. limit=22.5 2023-12-04 09:19:58,312 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=224900.0, ans=0.0 2023-12-04 09:19:58,328 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=224900.0, ans=0.2 2023-12-04 09:20:02,201 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.187e+02 1.317e+02 1.414e+02 1.597e+02 2.110e+02, threshold=2.828e+02, percent-clipped=0.0 2023-12-04 09:20:22,613 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=225033.33333333334, ans=0.125 2023-12-04 09:20:24,616 INFO [train.py:1087] (2/4) Epoch 38, batch 650, loss[loss=0.1619, simple_loss=0.2558, pruned_loss=0.03401, over 24751.00 frames. ], tot_loss[loss=0.1627, simple_loss=0.254, pruned_loss=0.03567, over 4621783.22 frames. 
], batch size: 61, lr: 6.43e-03, grad_scale: 16.0 2023-12-04 09:20:25,798 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=225100.0, ans=0.1 2023-12-04 09:20:25,807 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=225100.0, ans=0.125 2023-12-04 09:20:48,472 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.40 vs. limit=15.0 2023-12-04 09:21:04,585 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=225300.0, ans=0.0 2023-12-04 09:21:07,733 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=225366.66666666666, ans=0.0 2023-12-04 09:21:14,725 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=225366.66666666666, ans=0.125 2023-12-04 09:21:20,127 INFO [train.py:1087] (2/4) Epoch 38, batch 700, loss[loss=0.1652, simple_loss=0.2567, pruned_loss=0.03689, over 22868.00 frames. ], tot_loss[loss=0.1621, simple_loss=0.2537, pruned_loss=0.0353, over 4674503.24 frames. ], batch size: 106, lr: 6.43e-03, grad_scale: 16.0 2023-12-04 09:21:36,620 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=225500.0, ans=0.5 2023-12-04 09:21:44,109 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=225566.66666666666, ans=0.125 2023-12-04 09:21:53,076 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.143e+02 1.338e+02 1.467e+02 1.594e+02 2.367e+02, threshold=2.933e+02, percent-clipped=0.0 2023-12-04 09:22:15,031 INFO [train.py:1087] (2/4) Epoch 38, batch 750, loss[loss=0.1513, simple_loss=0.2398, pruned_loss=0.03144, over 24559.00 frames. ], tot_loss[loss=0.1625, simple_loss=0.2539, pruned_loss=0.03558, over 4687302.27 frames. ], batch size: 64, lr: 6.42e-03, grad_scale: 16.0 2023-12-04 09:22:15,337 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=225766.66666666666, ans=0.125 2023-12-04 09:22:41,507 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.60 vs. limit=10.0 2023-12-04 09:22:42,397 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=225900.0, ans=0.0 2023-12-04 09:22:47,788 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=225966.66666666666, ans=0.125 2023-12-04 09:22:52,048 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=225966.66666666666, ans=0.0 2023-12-04 09:22:52,442 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.53 vs. 
limit=10.0 2023-12-04 09:22:54,222 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=225966.66666666666, ans=0.125 2023-12-04 09:23:00,012 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.45 vs. limit=15.0 2023-12-04 09:23:10,305 INFO [train.py:1087] (2/4) Epoch 38, batch 800, loss[loss=0.16, simple_loss=0.2517, pruned_loss=0.03412, over 24754.00 frames. ], tot_loss[loss=0.1625, simple_loss=0.2538, pruned_loss=0.03557, over 4697075.25 frames. ], batch size: 65, lr: 6.42e-03, grad_scale: 32.0 2023-12-04 09:23:13,745 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=226100.0, ans=0.125 2023-12-04 09:23:21,944 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-12-04 09:23:26,659 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=226166.66666666666, ans=0.125 2023-12-04 09:23:32,734 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:23:38,971 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=226233.33333333334, ans=10.0 2023-12-04 09:23:41,318 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.109e+02 1.328e+02 1.439e+02 1.587e+02 2.613e+02, threshold=2.879e+02, percent-clipped=0.0 2023-12-04 09:23:47,422 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=226300.0, ans=0.0 2023-12-04 09:23:50,469 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=226366.66666666666, ans=0.125 2023-12-04 09:24:01,253 INFO [train.py:1087] (2/4) Epoch 38, batch 850, loss[loss=0.157, simple_loss=0.2493, pruned_loss=0.03232, over 24785.00 frames. ], tot_loss[loss=0.1621, simple_loss=0.2535, pruned_loss=0.0353, over 4729307.55 frames. ], batch size: 71, lr: 6.41e-03, grad_scale: 16.0 2023-12-04 09:24:12,705 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=226500.0, ans=0.95 2023-12-04 09:24:24,614 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=226566.66666666666, ans=0.1 2023-12-04 09:24:57,874 INFO [train.py:1087] (2/4) Epoch 39, batch 0, loss[loss=0.1545, simple_loss=0.2514, pruned_loss=0.0288, over 24773.00 frames. ], tot_loss[loss=0.1545, simple_loss=0.2514, pruned_loss=0.0288, over 24773.00 frames. ], batch size: 70, lr: 6.32e-03, grad_scale: 32.0 2023-12-04 09:24:57,875 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 09:25:09,957 INFO [train.py:1119] (2/4) Epoch 39, validation: loss=0.1525, simple_loss=0.252, pruned_loss=0.02647, over 944034.00 frames. 2023-12-04 09:25:09,958 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 09:25:11,420 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.67 vs. 
limit=22.5 2023-12-04 09:25:22,946 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=12.0 2023-12-04 09:25:31,758 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.82 vs. limit=15.0 2023-12-04 09:25:39,278 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=226866.66666666666, ans=0.125 2023-12-04 09:25:43,574 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=226933.33333333334, ans=0.0 2023-12-04 09:25:46,703 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=226933.33333333334, ans=0.0 2023-12-04 09:25:49,546 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.120e+02 1.344e+02 1.473e+02 1.644e+02 2.722e+02, threshold=2.946e+02, percent-clipped=0.0 2023-12-04 09:25:49,802 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=226933.33333333334, ans=0.2 2023-12-04 09:26:05,136 INFO [train.py:1087] (2/4) Epoch 39, batch 50, loss[loss=0.1936, simple_loss=0.2783, pruned_loss=0.05449, over 17228.00 frames. ], tot_loss[loss=0.1635, simple_loss=0.2549, pruned_loss=0.03602, over 1084443.79 frames. ], batch size: 177, lr: 6.32e-03, grad_scale: 32.0 2023-12-04 09:26:06,922 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.37 vs. limit=15.0 2023-12-04 09:26:32,853 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=227200.0, ans=0.1 2023-12-04 09:26:43,963 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=227266.66666666666, ans=0.0 2023-12-04 09:26:59,958 INFO [train.py:1087] (2/4) Epoch 39, batch 100, loss[loss=0.1484, simple_loss=0.2432, pruned_loss=0.02681, over 24723.00 frames. ], tot_loss[loss=0.1626, simple_loss=0.2542, pruned_loss=0.03553, over 1918715.18 frames. ], batch size: 69, lr: 6.32e-03, grad_scale: 32.0 2023-12-04 09:27:19,014 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=227466.66666666666, ans=0.125 2023-12-04 09:27:34,484 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=227600.0, ans=0.125 2023-12-04 09:27:41,490 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.111e+02 1.334e+02 1.448e+02 1.608e+02 2.260e+02, threshold=2.895e+02, percent-clipped=0.0 2023-12-04 09:27:53,774 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=227733.33333333334, ans=0.1 2023-12-04 09:27:54,539 INFO [train.py:1087] (2/4) Epoch 39, batch 150, loss[loss=0.1562, simple_loss=0.2512, pruned_loss=0.03062, over 24542.00 frames. ], tot_loss[loss=0.1625, simple_loss=0.254, pruned_loss=0.03552, over 2554411.05 frames. 
], batch size: 66, lr: 6.31e-03, grad_scale: 8.0 2023-12-04 09:28:00,410 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=227733.33333333334, ans=0.95 2023-12-04 09:28:03,810 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=227733.33333333334, ans=0.125 2023-12-04 09:28:05,857 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=227800.0, ans=0.0 2023-12-04 09:28:22,316 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=227866.66666666666, ans=0.0 2023-12-04 09:28:31,064 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=227933.33333333334, ans=0.0 2023-12-04 09:28:34,085 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.36 vs. limit=15.0 2023-12-04 09:28:49,457 INFO [train.py:1087] (2/4) Epoch 39, batch 200, loss[loss=0.1444, simple_loss=0.2359, pruned_loss=0.02644, over 24798.00 frames. ], tot_loss[loss=0.1616, simple_loss=0.2531, pruned_loss=0.03504, over 3067157.45 frames. ], batch size: 72, lr: 6.31e-03, grad_scale: 8.0 2023-12-04 09:29:10,593 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.69 vs. limit=15.0 2023-12-04 09:29:29,804 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=228266.66666666666, ans=0.07 2023-12-04 09:29:31,854 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.106e+02 1.311e+02 1.418e+02 1.591e+02 2.509e+02, threshold=2.835e+02, percent-clipped=0.0 2023-12-04 09:29:40,997 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:29:44,856 INFO [train.py:1087] (2/4) Epoch 39, batch 250, loss[loss=0.1648, simple_loss=0.258, pruned_loss=0.03585, over 23435.00 frames. ], tot_loss[loss=0.1619, simple_loss=0.2535, pruned_loss=0.03514, over 3445509.04 frames. ], batch size: 94, lr: 6.30e-03, grad_scale: 8.0 2023-12-04 09:29:55,750 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:30:00,071 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=228466.66666666666, ans=0.125 2023-12-04 09:30:03,930 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.33 vs. limit=12.0 2023-12-04 09:30:04,704 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=228466.66666666666, ans=0.0 2023-12-04 09:30:05,776 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=228533.33333333334, ans=0.2 2023-12-04 09:30:07,206 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=228533.33333333334, ans=0.125 2023-12-04 09:30:07,570 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.82 vs. 
limit=22.5 2023-12-04 09:30:13,580 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=228533.33333333334, ans=0.125 2023-12-04 09:30:33,677 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=228666.66666666666, ans=0.125 2023-12-04 09:30:39,543 INFO [train.py:1087] (2/4) Epoch 39, batch 300, loss[loss=0.1644, simple_loss=0.2567, pruned_loss=0.03607, over 24566.00 frames. ], tot_loss[loss=0.1611, simple_loss=0.2529, pruned_loss=0.03466, over 3757773.60 frames. ], batch size: 65, lr: 6.30e-03, grad_scale: 8.0 2023-12-04 09:30:44,366 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.82 vs. limit=15.0 2023-12-04 09:30:50,307 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=228800.0, ans=0.125 2023-12-04 09:31:20,380 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=228933.33333333334, ans=0.125 2023-12-04 09:31:20,470 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=228933.33333333334, ans=0.125 2023-12-04 09:31:21,184 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.172e+02 1.313e+02 1.429e+02 1.539e+02 2.008e+02, threshold=2.858e+02, percent-clipped=0.0 2023-12-04 09:31:22,572 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=229000.0, ans=0.0 2023-12-04 09:31:23,667 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=229000.0, ans=0.125 2023-12-04 09:31:28,974 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=229000.0, ans=0.0 2023-12-04 09:31:34,279 INFO [train.py:1087] (2/4) Epoch 39, batch 350, loss[loss=0.1573, simple_loss=0.2481, pruned_loss=0.03323, over 24583.00 frames. ], tot_loss[loss=0.161, simple_loss=0.2527, pruned_loss=0.03463, over 4004653.47 frames. 
], batch size: 65, lr: 6.29e-03, grad_scale: 8.0 2023-12-04 09:31:36,592 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=229066.66666666666, ans=0.125 2023-12-04 09:31:41,196 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=229066.66666666666, ans=0.0 2023-12-04 09:31:42,726 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=229066.66666666666, ans=0.0 2023-12-04 09:31:43,764 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=229066.66666666666, ans=0.2 2023-12-04 09:31:48,031 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=229133.33333333334, ans=0.125 2023-12-04 09:31:49,015 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=229133.33333333334, ans=0.125 2023-12-04 09:31:52,210 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=229133.33333333334, ans=0.0 2023-12-04 09:31:53,160 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=229133.33333333334, ans=0.0 2023-12-04 09:32:12,230 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=229266.66666666666, ans=0.1 2023-12-04 09:32:19,205 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.88 vs. limit=15.0 2023-12-04 09:32:23,118 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.18 vs. limit=12.0 2023-12-04 09:32:28,984 INFO [train.py:1087] (2/4) Epoch 39, batch 400, loss[loss=0.1617, simple_loss=0.2481, pruned_loss=0.03766, over 24556.00 frames. ], tot_loss[loss=0.1614, simple_loss=0.2531, pruned_loss=0.0349, over 4186626.51 frames. ], batch size: 62, lr: 6.29e-03, grad_scale: 16.0 2023-12-04 09:32:30,333 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=229400.0, ans=0.0 2023-12-04 09:32:43,933 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=229466.66666666666, ans=0.2 2023-12-04 09:33:00,933 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=229600.0, ans=0.035 2023-12-04 09:33:10,766 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.154e+02 1.303e+02 1.447e+02 1.594e+02 2.149e+02, threshold=2.894e+02, percent-clipped=0.0 2023-12-04 09:33:18,172 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=229666.66666666666, ans=0.125 2023-12-04 09:33:24,250 INFO [train.py:1087] (2/4) Epoch 39, batch 450, loss[loss=0.1556, simple_loss=0.2477, pruned_loss=0.03175, over 24780.00 frames. ], tot_loss[loss=0.1614, simple_loss=0.2531, pruned_loss=0.03483, over 4324376.65 frames. 
], batch size: 70, lr: 6.28e-03, grad_scale: 16.0 2023-12-04 09:33:27,816 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=229733.33333333334, ans=0.125 2023-12-04 09:33:31,134 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=229733.33333333334, ans=0.0 2023-12-04 09:33:35,311 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=229800.0, ans=0.1 2023-12-04 09:33:43,351 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=229800.0, ans=0.125 2023-12-04 09:33:45,151 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=229800.0, ans=0.125 2023-12-04 09:34:06,397 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=229933.33333333334, ans=0.0 2023-12-04 09:34:19,163 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=230066.66666666666, ans=0.125 2023-12-04 09:34:19,957 INFO [train.py:1087] (2/4) Epoch 39, batch 500, loss[loss=0.155, simple_loss=0.2429, pruned_loss=0.03353, over 24743.00 frames. ], tot_loss[loss=0.1611, simple_loss=0.2529, pruned_loss=0.03463, over 4443987.11 frames. ], batch size: 63, lr: 6.28e-03, grad_scale: 16.0 2023-12-04 09:34:40,576 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.24 vs. limit=15.0 2023-12-04 09:34:53,467 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=230266.66666666666, ans=0.125 2023-12-04 09:34:54,428 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=230266.66666666666, ans=0.04949747468305833 2023-12-04 09:35:01,529 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.182e+02 1.360e+02 1.542e+02 1.706e+02 2.514e+02, threshold=3.084e+02, percent-clipped=0.0 2023-12-04 09:35:11,218 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=230333.33333333334, ans=0.125 2023-12-04 09:35:13,602 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=230400.0, ans=0.125 2023-12-04 09:35:14,024 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=12.0 2023-12-04 09:35:14,376 INFO [train.py:1087] (2/4) Epoch 39, batch 550, loss[loss=0.1621, simple_loss=0.255, pruned_loss=0.03462, over 24753.00 frames. ], tot_loss[loss=0.1611, simple_loss=0.2528, pruned_loss=0.03468, over 4521981.38 frames. 
], batch size: 70, lr: 6.28e-03, grad_scale: 16.0 2023-12-04 09:35:14,663 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=230400.0, ans=0.125 2023-12-04 09:35:28,020 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=230466.66666666666, ans=0.125 2023-12-04 09:35:55,157 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=230600.0, ans=0.125 2023-12-04 09:36:01,867 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=230666.66666666666, ans=0.95 2023-12-04 09:36:02,855 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=230666.66666666666, ans=0.125 2023-12-04 09:36:11,140 INFO [train.py:1087] (2/4) Epoch 39, batch 600, loss[loss=0.15, simple_loss=0.2431, pruned_loss=0.02849, over 24602.00 frames. ], tot_loss[loss=0.1611, simple_loss=0.2528, pruned_loss=0.03475, over 4582273.60 frames. ], batch size: 68, lr: 6.27e-03, grad_scale: 16.0 2023-12-04 09:36:28,851 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=230800.0, ans=0.5 2023-12-04 09:36:36,242 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=230866.66666666666, ans=0.0 2023-12-04 09:36:41,553 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=230866.66666666666, ans=0.125 2023-12-04 09:36:43,708 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=230933.33333333334, ans=0.1 2023-12-04 09:36:53,451 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.071e+02 1.292e+02 1.379e+02 1.519e+02 2.695e+02, threshold=2.757e+02, percent-clipped=0.0 2023-12-04 09:37:07,290 INFO [train.py:1087] (2/4) Epoch 39, batch 650, loss[loss=0.1515, simple_loss=0.2437, pruned_loss=0.02964, over 24484.00 frames. ], tot_loss[loss=0.161, simple_loss=0.2527, pruned_loss=0.03465, over 4628936.31 frames. ], batch size: 77, lr: 6.27e-03, grad_scale: 16.0 2023-12-04 09:37:24,780 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=231133.33333333334, ans=0.0 2023-12-04 09:37:27,877 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.35 vs. limit=15.0 2023-12-04 09:37:29,790 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.72 vs. limit=15.0 2023-12-04 09:37:37,485 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. limit=6.0 2023-12-04 09:37:40,425 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=231266.66666666666, ans=0.125 2023-12-04 09:38:02,645 INFO [train.py:1087] (2/4) Epoch 39, batch 700, loss[loss=0.1456, simple_loss=0.2416, pruned_loss=0.02487, over 24758.00 frames. ], tot_loss[loss=0.1607, simple_loss=0.2526, pruned_loss=0.03447, over 4680857.06 frames. 
], batch size: 66, lr: 6.26e-03, grad_scale: 16.0 2023-12-04 09:38:03,924 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=231400.0, ans=0.125 2023-12-04 09:38:26,146 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=231533.33333333334, ans=0.125 2023-12-04 09:38:28,055 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.57 vs. limit=15.0 2023-12-04 09:38:28,738 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=231533.33333333334, ans=0.125 2023-12-04 09:38:33,029 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=231533.33333333334, ans=0.0 2023-12-04 09:38:45,048 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.144e+02 1.352e+02 1.512e+02 1.676e+02 2.495e+02, threshold=3.025e+02, percent-clipped=0.0 2023-12-04 09:38:47,553 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=231666.66666666666, ans=0.0 2023-12-04 09:38:53,922 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=231666.66666666666, ans=0.2 2023-12-04 09:38:57,488 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=231733.33333333334, ans=0.125 2023-12-04 09:38:58,348 INFO [train.py:1087] (2/4) Epoch 39, batch 750, loss[loss=0.1615, simple_loss=0.2558, pruned_loss=0.03364, over 24787.00 frames. ], tot_loss[loss=0.1617, simple_loss=0.2533, pruned_loss=0.03504, over 4686200.97 frames. ], batch size: 62, lr: 6.26e-03, grad_scale: 16.0 2023-12-04 09:39:12,175 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.42 vs. limit=10.0 2023-12-04 09:39:16,508 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=231800.0, ans=0.125 2023-12-04 09:39:27,446 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=231866.66666666666, ans=0.0 2023-12-04 09:39:29,728 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=231866.66666666666, ans=0.125 2023-12-04 09:39:34,092 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=231933.33333333334, ans=0.1 2023-12-04 09:39:53,561 INFO [train.py:1087] (2/4) Epoch 39, batch 800, loss[loss=0.1511, simple_loss=0.2475, pruned_loss=0.02733, over 24775.00 frames. ], tot_loss[loss=0.162, simple_loss=0.2535, pruned_loss=0.03521, over 4715981.65 frames. ], batch size: 71, lr: 6.25e-03, grad_scale: 32.0 2023-12-04 09:39:56,307 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.96 vs. 
limit=15.0 2023-12-04 09:39:59,471 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=232066.66666666666, ans=0.125 2023-12-04 09:40:10,786 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=232133.33333333334, ans=0.125 2023-12-04 09:40:16,748 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=232200.0, ans=0.125 2023-12-04 09:40:32,427 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.187e+02 1.325e+02 1.411e+02 1.635e+02 2.142e+02, threshold=2.822e+02, percent-clipped=0.0 2023-12-04 09:40:33,682 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=232333.33333333334, ans=0.1 2023-12-04 09:40:37,597 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=232333.33333333334, ans=0.1 2023-12-04 09:40:44,324 INFO [train.py:1087] (2/4) Epoch 39, batch 850, loss[loss=0.1658, simple_loss=0.2576, pruned_loss=0.03701, over 24210.00 frames. ], tot_loss[loss=0.1626, simple_loss=0.2538, pruned_loss=0.03567, over 4735860.79 frames. ], batch size: 82, lr: 6.25e-03, grad_scale: 32.0 2023-12-04 09:40:46,595 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=232400.0, ans=0.0 2023-12-04 09:41:06,227 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.91 vs. limit=15.0 2023-12-04 09:41:16,932 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:41:25,068 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=232666.66666666666, ans=0.1 2023-12-04 09:41:42,987 INFO [train.py:1087] (2/4) Epoch 40, batch 0, loss[loss=0.1467, simple_loss=0.2421, pruned_loss=0.02562, over 24782.00 frames. ], tot_loss[loss=0.1467, simple_loss=0.2421, pruned_loss=0.02562, over 24782.00 frames. ], batch size: 71, lr: 6.17e-03, grad_scale: 32.0 2023-12-04 09:41:42,988 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 09:41:55,145 INFO [train.py:1119] (2/4) Epoch 40, validation: loss=0.1533, simple_loss=0.2521, pruned_loss=0.02723, over 944034.00 frames. 2023-12-04 09:41:55,146 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 09:42:02,054 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.33 vs. limit=15.0 2023-12-04 09:42:07,267 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.77 vs. 
limit=15.0 2023-12-04 09:42:16,190 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=232833.33333333334, ans=0.07 2023-12-04 09:42:23,621 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=232833.33333333334, ans=0.125 2023-12-04 09:42:37,731 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=232966.66666666666, ans=0.05 2023-12-04 09:42:41,241 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=232966.66666666666, ans=0.1 2023-12-04 09:42:42,087 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.101e+02 1.284e+02 1.404e+02 1.572e+02 2.699e+02, threshold=2.808e+02, percent-clipped=0.0 2023-12-04 09:42:50,000 INFO [train.py:1087] (2/4) Epoch 40, batch 50, loss[loss=0.1705, simple_loss=0.2653, pruned_loss=0.03786, over 24767.00 frames. ], tot_loss[loss=0.1645, simple_loss=0.2557, pruned_loss=0.03665, over 1072979.66 frames. ], batch size: 65, lr: 6.16e-03, grad_scale: 32.0 2023-12-04 09:43:03,424 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=233100.0, ans=0.125 2023-12-04 09:43:21,018 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff3.min_abs, batch_count=233166.66666666666, ans=0.2 2023-12-04 09:43:22,188 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=233233.33333333334, ans=0.0 2023-12-04 09:43:23,550 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.56 vs. limit=15.0 2023-12-04 09:43:36,266 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:43:42,327 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=12.0 2023-12-04 09:43:42,460 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.89 vs. limit=15.0 2023-12-04 09:43:45,068 INFO [train.py:1087] (2/4) Epoch 40, batch 100, loss[loss=0.1486, simple_loss=0.2421, pruned_loss=0.02754, over 24725.00 frames. ], tot_loss[loss=0.1622, simple_loss=0.2541, pruned_loss=0.03514, over 1902546.57 frames. ], batch size: 67, lr: 6.16e-03, grad_scale: 16.0 2023-12-04 09:43:55,193 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=233433.33333333334, ans=0.05 2023-12-04 09:44:03,845 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=233433.33333333334, ans=0.125 2023-12-04 09:44:26,108 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.41 vs. 
limit=15.0 2023-12-04 09:44:32,786 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.103e+02 1.305e+02 1.412e+02 1.578e+02 2.389e+02, threshold=2.824e+02, percent-clipped=0.0 2023-12-04 09:44:39,545 INFO [train.py:1087] (2/4) Epoch 40, batch 150, loss[loss=0.1588, simple_loss=0.2557, pruned_loss=0.03093, over 22805.00 frames. ], tot_loss[loss=0.162, simple_loss=0.2534, pruned_loss=0.03526, over 2536977.81 frames. ], batch size: 106, lr: 6.15e-03, grad_scale: 16.0 2023-12-04 09:44:39,879 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=233700.0, ans=0.0 2023-12-04 09:44:47,472 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=233700.0, ans=0.125 2023-12-04 09:45:17,446 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=233900.0, ans=0.0 2023-12-04 09:45:34,894 INFO [train.py:1087] (2/4) Epoch 40, batch 200, loss[loss=0.1718, simple_loss=0.2611, pruned_loss=0.04131, over 24737.00 frames. ], tot_loss[loss=0.162, simple_loss=0.2534, pruned_loss=0.03532, over 3027002.14 frames. ], batch size: 61, lr: 6.15e-03, grad_scale: 16.0 2023-12-04 09:45:45,112 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=234100.0, ans=0.1 2023-12-04 09:45:56,077 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.27 vs. limit=15.0 2023-12-04 09:45:58,953 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=234166.66666666666, ans=0.0 2023-12-04 09:46:07,509 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=234233.33333333334, ans=0.1 2023-12-04 09:46:21,128 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:46:24,458 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.063e+02 1.299e+02 1.417e+02 1.559e+02 2.264e+02, threshold=2.834e+02, percent-clipped=0.0 2023-12-04 09:46:27,932 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=234300.0, ans=0.125 2023-12-04 09:46:30,820 INFO [train.py:1087] (2/4) Epoch 40, batch 250, loss[loss=0.1555, simple_loss=0.2487, pruned_loss=0.03114, over 24568.00 frames. ], tot_loss[loss=0.1622, simple_loss=0.2538, pruned_loss=0.03532, over 3411228.36 frames. 
], batch size: 64, lr: 6.15e-03, grad_scale: 16.0 2023-12-04 09:46:35,346 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:46:38,494 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=234366.66666666666, ans=0.125 2023-12-04 09:46:39,512 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=234366.66666666666, ans=0.0 2023-12-04 09:46:46,929 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=234433.33333333334, ans=0.1 2023-12-04 09:46:52,319 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=234500.0, ans=0.125 2023-12-04 09:46:53,536 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.36 vs. limit=15.0 2023-12-04 09:47:00,447 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.07 vs. limit=22.5 2023-12-04 09:47:07,750 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:47:27,385 INFO [train.py:1087] (2/4) Epoch 40, batch 300, loss[loss=0.1672, simple_loss=0.2576, pruned_loss=0.03837, over 24474.00 frames. ], tot_loss[loss=0.1616, simple_loss=0.2532, pruned_loss=0.03502, over 3722681.54 frames. ], batch size: 75, lr: 6.14e-03, grad_scale: 16.0 2023-12-04 09:48:00,331 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0 2023-12-04 09:48:15,706 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.151e+02 1.298e+02 1.389e+02 1.491e+02 2.155e+02, threshold=2.778e+02, percent-clipped=0.0 2023-12-04 09:48:16,989 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=3.256e-03 2023-12-04 09:48:22,223 INFO [train.py:1087] (2/4) Epoch 40, batch 350, loss[loss=0.1742, simple_loss=0.2652, pruned_loss=0.0416, over 24436.00 frames. ], tot_loss[loss=0.1619, simple_loss=0.2533, pruned_loss=0.03523, over 3966451.39 frames. ], batch size: 77, lr: 6.14e-03, grad_scale: 16.0 2023-12-04 09:48:41,407 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.37 vs. limit=12.0 2023-12-04 09:48:48,911 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=4.78 vs. limit=15.0 2023-12-04 09:48:52,656 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=235166.66666666666, ans=0.125 2023-12-04 09:49:10,625 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=235300.0, ans=0.125 2023-12-04 09:49:17,920 INFO [train.py:1087] (2/4) Epoch 40, batch 400, loss[loss=0.1943, simple_loss=0.2763, pruned_loss=0.05614, over 16707.00 frames. ], tot_loss[loss=0.1623, simple_loss=0.2537, pruned_loss=0.03543, over 4129801.15 frames. 
], batch size: 177, lr: 6.13e-03, grad_scale: 32.0 2023-12-04 09:49:21,425 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=235366.66666666666, ans=0.0 2023-12-04 09:49:22,445 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=235366.66666666666, ans=0.125 2023-12-04 09:49:23,703 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-12-04 09:49:36,929 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=235433.33333333334, ans=0.0 2023-12-04 09:49:45,221 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=235500.0, ans=0.0 2023-12-04 09:50:05,569 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=235633.33333333334, ans=0.125 2023-12-04 09:50:07,953 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.102e+02 1.305e+02 1.392e+02 1.533e+02 2.243e+02, threshold=2.783e+02, percent-clipped=0.0 2023-12-04 09:50:14,513 INFO [train.py:1087] (2/4) Epoch 40, batch 450, loss[loss=0.1818, simple_loss=0.2723, pruned_loss=0.04562, over 23473.00 frames. ], tot_loss[loss=0.1624, simple_loss=0.2536, pruned_loss=0.03555, over 4280688.32 frames. ], batch size: 94, lr: 6.13e-03, grad_scale: 32.0 2023-12-04 09:50:40,642 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0 2023-12-04 09:50:46,260 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=235833.33333333334, ans=0.05 2023-12-04 09:50:51,587 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=235900.0, ans=0.5 2023-12-04 09:50:53,244 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.98 vs. limit=6.0 2023-12-04 09:50:56,932 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=235900.0, ans=0.1 2023-12-04 09:51:10,700 INFO [train.py:1087] (2/4) Epoch 40, batch 500, loss[loss=0.1679, simple_loss=0.2544, pruned_loss=0.04072, over 23503.00 frames. ], tot_loss[loss=0.1618, simple_loss=0.2531, pruned_loss=0.03524, over 4402230.05 frames. ], batch size: 94, lr: 6.12e-03, grad_scale: 32.0 2023-12-04 09:51:15,197 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=236033.33333333334, ans=0.125 2023-12-04 09:51:18,032 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.29 vs. limit=15.0 2023-12-04 09:51:21,986 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=236100.0, ans=0.0 2023-12-04 09:51:26,711 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.45 vs. 
limit=6.0 2023-12-04 09:51:27,467 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=236100.0, ans=0.1 2023-12-04 09:51:28,790 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.03 vs. limit=15.0 2023-12-04 09:51:37,622 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=236166.66666666666, ans=0.125 2023-12-04 09:51:37,709 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=236166.66666666666, ans=0.125 2023-12-04 09:51:39,815 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=236166.66666666666, ans=0.125 2023-12-04 09:51:45,600 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=236233.33333333334, ans=0.125 2023-12-04 09:51:48,789 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=236233.33333333334, ans=0.2 2023-12-04 09:51:55,257 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.29 vs. limit=15.0 2023-12-04 09:51:59,778 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.125e+02 1.340e+02 1.474e+02 1.657e+02 2.312e+02, threshold=2.948e+02, percent-clipped=0.0 2023-12-04 09:52:00,011 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:52:06,741 INFO [train.py:1087] (2/4) Epoch 40, batch 550, loss[loss=0.1674, simple_loss=0.259, pruned_loss=0.03788, over 24142.00 frames. ], tot_loss[loss=0.1613, simple_loss=0.2526, pruned_loss=0.03493, over 4491625.83 frames. ], batch size: 82, lr: 6.12e-03, grad_scale: 32.0 2023-12-04 09:52:16,320 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=236366.66666666666, ans=0.0 2023-12-04 09:52:28,279 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=236500.0, ans=0.0 2023-12-04 09:52:35,075 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=236500.0, ans=0.0 2023-12-04 09:52:40,777 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=236566.66666666666, ans=0.0 2023-12-04 09:53:02,505 INFO [train.py:1087] (2/4) Epoch 40, batch 600, loss[loss=0.1613, simple_loss=0.2565, pruned_loss=0.03302, over 22832.00 frames. ], tot_loss[loss=0.1613, simple_loss=0.2528, pruned_loss=0.03496, over 4565162.81 frames. ], batch size: 106, lr: 6.12e-03, grad_scale: 32.0 2023-12-04 09:53:07,257 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.64 vs. 
limit=15.0 2023-12-04 09:53:19,928 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=236766.66666666666, ans=0.125 2023-12-04 09:53:23,114 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=236766.66666666666, ans=0.125 2023-12-04 09:53:52,195 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.146e+02 1.328e+02 1.464e+02 1.572e+02 2.479e+02, threshold=2.928e+02, percent-clipped=0.0 2023-12-04 09:53:59,066 INFO [train.py:1087] (2/4) Epoch 40, batch 650, loss[loss=0.171, simple_loss=0.2571, pruned_loss=0.04247, over 24742.00 frames. ], tot_loss[loss=0.1615, simple_loss=0.2528, pruned_loss=0.03505, over 4601982.20 frames. ], batch size: 61, lr: 6.11e-03, grad_scale: 32.0 2023-12-04 09:54:02,676 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=237033.33333333334, ans=0.125 2023-12-04 09:54:06,738 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=237033.33333333334, ans=0.125 2023-12-04 09:54:09,356 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=237100.0, ans=0.1 2023-12-04 09:54:14,035 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=237100.0, ans=0.1 2023-12-04 09:54:17,685 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:54:38,103 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=237233.33333333334, ans=0.2 2023-12-04 09:54:48,842 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=237300.0, ans=0.1 2023-12-04 09:54:52,263 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=237300.0, ans=0.2 2023-12-04 09:54:55,141 INFO [train.py:1087] (2/4) Epoch 40, batch 700, loss[loss=0.1513, simple_loss=0.2394, pruned_loss=0.03157, over 24566.00 frames. ], tot_loss[loss=0.1611, simple_loss=0.2527, pruned_loss=0.03474, over 4643765.74 frames. ], batch size: 64, lr: 6.11e-03, grad_scale: 32.0 2023-12-04 09:55:43,683 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=237633.33333333334, ans=0.0 2023-12-04 09:55:44,493 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.112e+02 1.347e+02 1.457e+02 1.642e+02 2.295e+02, threshold=2.914e+02, percent-clipped=0.0 2023-12-04 09:55:51,577 INFO [train.py:1087] (2/4) Epoch 40, batch 750, loss[loss=0.1614, simple_loss=0.257, pruned_loss=0.03293, over 22826.00 frames. ], tot_loss[loss=0.1612, simple_loss=0.253, pruned_loss=0.03475, over 4664215.74 frames. ], batch size: 106, lr: 6.10e-03, grad_scale: 32.0 2023-12-04 09:56:09,308 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.25 vs. 
limit=6.0 2023-12-04 09:56:11,988 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=237833.33333333334, ans=0.035 2023-12-04 09:56:13,580 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=237833.33333333334, ans=0.125 2023-12-04 09:56:25,885 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=237900.0, ans=0.025 2023-12-04 09:56:34,425 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=237966.66666666666, ans=0.025 2023-12-04 09:56:44,211 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=237966.66666666666, ans=0.2 2023-12-04 09:56:46,402 INFO [train.py:1087] (2/4) Epoch 40, batch 800, loss[loss=0.1606, simple_loss=0.2493, pruned_loss=0.03592, over 24766.00 frames. ], tot_loss[loss=0.1612, simple_loss=0.2529, pruned_loss=0.03477, over 4690314.27 frames. ], batch size: 66, lr: 6.10e-03, grad_scale: 32.0 2023-12-04 09:56:52,100 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=238033.33333333334, ans=0.125 2023-12-04 09:57:10,557 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=238166.66666666666, ans=0.0 2023-12-04 09:57:15,559 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:57:23,926 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.38 vs. limit=12.0 2023-12-04 09:57:25,689 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=238233.33333333334, ans=0.125 2023-12-04 09:57:31,414 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.127e+02 1.267e+02 1.373e+02 1.459e+02 2.137e+02, threshold=2.746e+02, percent-clipped=0.0 2023-12-04 09:57:37,470 INFO [train.py:1087] (2/4) Epoch 40, batch 850, loss[loss=0.1597, simple_loss=0.2519, pruned_loss=0.03378, over 24804.00 frames. ], tot_loss[loss=0.1608, simple_loss=0.2525, pruned_loss=0.03452, over 4728097.66 frames. 
], batch size: 73, lr: 6.10e-03, grad_scale: 32.0 2023-12-04 09:57:44,616 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:57:51,479 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=238433.33333333334, ans=0.125 2023-12-04 09:57:53,884 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=238433.33333333334, ans=0.2 2023-12-04 09:57:55,758 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=238433.33333333334, ans=0.125 2023-12-04 09:58:06,885 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=238566.66666666666, ans=0.2 2023-12-04 09:58:08,866 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=238566.66666666666, ans=0.0 2023-12-04 09:58:36,652 INFO [train.py:1087] (2/4) Epoch 41, batch 0, loss[loss=0.1582, simple_loss=0.253, pruned_loss=0.03167, over 24774.00 frames. ], tot_loss[loss=0.1582, simple_loss=0.253, pruned_loss=0.03167, over 24774.00 frames. ], batch size: 61, lr: 6.02e-03, grad_scale: 32.0 2023-12-04 09:58:36,652 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 09:58:48,636 INFO [train.py:1119] (2/4) Epoch 41, validation: loss=0.1523, simple_loss=0.2513, pruned_loss=0.02667, over 944034.00 frames. 2023-12-04 09:58:48,636 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 09:59:16,685 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=238800.0, ans=0.1 2023-12-04 09:59:43,913 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.101e+02 1.329e+02 1.429e+02 1.632e+02 2.414e+02, threshold=2.858e+02, percent-clipped=0.0 2023-12-04 09:59:43,939 INFO [train.py:1087] (2/4) Epoch 41, batch 50, loss[loss=0.1554, simple_loss=0.2463, pruned_loss=0.03228, over 24571.00 frames. ], tot_loss[loss=0.1605, simple_loss=0.2531, pruned_loss=0.034, over 1096355.04 frames. ], batch size: 65, lr: 6.01e-03, grad_scale: 32.0 2023-12-04 09:59:57,570 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=239066.66666666666, ans=0.0 2023-12-04 10:00:26,674 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=239200.0, ans=0.025 2023-12-04 10:00:39,562 INFO [train.py:1087] (2/4) Epoch 41, batch 100, loss[loss=0.1413, simple_loss=0.2338, pruned_loss=0.02446, over 24575.00 frames. ], tot_loss[loss=0.1605, simple_loss=0.2525, pruned_loss=0.03427, over 1916491.17 frames. 
], batch size: 64, lr: 6.01e-03, grad_scale: 16.0 2023-12-04 10:00:52,751 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=239400.0, ans=0.125 2023-12-04 10:00:54,792 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=239400.0, ans=0.1 2023-12-04 10:01:14,812 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 10:01:18,073 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=239533.33333333334, ans=0.1 2023-12-04 10:01:34,716 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=239666.66666666666, ans=0.1 2023-12-04 10:01:35,550 INFO [train.py:1087] (2/4) Epoch 41, batch 150, loss[loss=0.1573, simple_loss=0.2507, pruned_loss=0.03198, over 24330.00 frames. ], tot_loss[loss=0.1604, simple_loss=0.2521, pruned_loss=0.03437, over 2560481.52 frames. ], batch size: 79, lr: 6.00e-03, grad_scale: 16.0 2023-12-04 10:01:36,539 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.192e+02 1.310e+02 1.399e+02 1.503e+02 2.273e+02, threshold=2.797e+02, percent-clipped=0.0 2023-12-04 10:01:49,409 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.20 vs. limit=22.5 2023-12-04 10:01:57,052 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=239800.0, ans=0.125 2023-12-04 10:02:09,914 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=239866.66666666666, ans=0.125 2023-12-04 10:02:11,022 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=239866.66666666666, ans=0.09899494936611666 2023-12-04 10:02:26,202 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-12-04 10:02:29,519 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.74 vs. limit=22.5 2023-12-04 10:02:33,836 INFO [train.py:1087] (2/4) Epoch 41, batch 200, loss[loss=0.1546, simple_loss=0.25, pruned_loss=0.02959, over 24769.00 frames. ], tot_loss[loss=0.1609, simple_loss=0.2524, pruned_loss=0.03475, over 3042985.47 frames. ], batch size: 65, lr: 6.00e-03, grad_scale: 16.0 2023-12-04 10:02:58,503 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=240133.33333333334, ans=0.0 2023-12-04 10:03:01,664 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=240133.33333333334, ans=0.0 2023-12-04 10:03:02,788 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=240133.33333333334, ans=0.2 2023-12-04 10:03:29,204 INFO [train.py:1087] (2/4) Epoch 41, batch 250, loss[loss=0.1616, simple_loss=0.2557, pruned_loss=0.03368, over 24715.00 frames. ], tot_loss[loss=0.1607, simple_loss=0.2522, pruned_loss=0.03455, over 3443322.37 frames. 
], batch size: 67, lr: 6.00e-03, grad_scale: 8.0 2023-12-04 10:03:31,685 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.119e+02 1.301e+02 1.390e+02 1.563e+02 2.001e+02, threshold=2.780e+02, percent-clipped=0.0 2023-12-04 10:04:09,747 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=240533.33333333334, ans=0.0 2023-12-04 10:04:22,041 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=240600.0, ans=0.125 2023-12-04 10:04:25,024 INFO [train.py:1087] (2/4) Epoch 41, batch 300, loss[loss=0.1631, simple_loss=0.2568, pruned_loss=0.03475, over 24006.00 frames. ], tot_loss[loss=0.1608, simple_loss=0.2526, pruned_loss=0.03457, over 3739722.86 frames. ], batch size: 87, lr: 5.99e-03, grad_scale: 8.0 2023-12-04 10:04:40,861 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=240733.33333333334, ans=0.125 2023-12-04 10:05:12,893 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=240933.33333333334, ans=0.125 2023-12-04 10:05:13,100 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.43 vs. limit=10.0 2023-12-04 10:05:18,730 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.13 vs. limit=15.0 2023-12-04 10:05:20,349 INFO [train.py:1087] (2/4) Epoch 41, batch 350, loss[loss=0.1597, simple_loss=0.252, pruned_loss=0.03373, over 24770.00 frames. ], tot_loss[loss=0.1609, simple_loss=0.2526, pruned_loss=0.0346, over 3968652.38 frames. ], batch size: 70, lr: 5.99e-03, grad_scale: 8.0 2023-12-04 10:05:22,138 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=241000.0, ans=0.0 2023-12-04 10:05:22,890 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.090e+02 1.319e+02 1.439e+02 1.596e+02 2.414e+02, threshold=2.878e+02, percent-clipped=0.0 2023-12-04 10:05:53,396 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=241200.0, ans=0.0 2023-12-04 10:06:06,860 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=241266.66666666666, ans=0.2 2023-12-04 10:06:09,122 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=241266.66666666666, ans=0.125 2023-12-04 10:06:13,416 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=241266.66666666666, ans=0.0 2023-12-04 10:06:15,456 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=241333.33333333334, ans=0.125 2023-12-04 10:06:16,602 INFO [train.py:1087] (2/4) Epoch 41, batch 400, loss[loss=0.1964, simple_loss=0.2791, pruned_loss=0.05689, over 17072.00 frames. ], tot_loss[loss=0.1606, simple_loss=0.2522, pruned_loss=0.03449, over 4144458.50 frames. ], batch size: 178, lr: 5.98e-03, grad_scale: 16.0 2023-12-04 10:06:36,178 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.50 vs. 
limit=15.0 2023-12-04 10:06:40,554 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=241466.66666666666, ans=0.125 2023-12-04 10:06:41,599 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=241466.66666666666, ans=0.1 2023-12-04 10:06:41,690 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=241466.66666666666, ans=0.125 2023-12-04 10:07:01,262 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=241600.0, ans=0.125 2023-12-04 10:07:01,709 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.13 vs. limit=15.0 2023-12-04 10:07:02,373 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 10:07:12,054 INFO [train.py:1087] (2/4) Epoch 41, batch 450, loss[loss=0.1547, simple_loss=0.2489, pruned_loss=0.03021, over 22846.00 frames. ], tot_loss[loss=0.1609, simple_loss=0.2525, pruned_loss=0.03464, over 4281149.80 frames. ], batch size: 106, lr: 5.98e-03, grad_scale: 16.0 2023-12-04 10:07:14,197 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.126e+02 1.276e+02 1.409e+02 1.560e+02 1.984e+02, threshold=2.818e+02, percent-clipped=0.0 2023-12-04 10:07:24,861 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=241733.33333333334, ans=0.2 2023-12-04 10:07:36,577 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=241800.0, ans=0.2 2023-12-04 10:07:58,725 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=241933.33333333334, ans=0.125 2023-12-04 10:08:07,666 INFO [train.py:1087] (2/4) Epoch 41, batch 500, loss[loss=0.1456, simple_loss=0.2383, pruned_loss=0.02645, over 24552.00 frames. ], tot_loss[loss=0.1606, simple_loss=0.2523, pruned_loss=0.03449, over 4393636.42 frames. ], batch size: 63, lr: 5.98e-03, grad_scale: 16.0 2023-12-04 10:08:21,009 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=242066.66666666666, ans=0.125 2023-12-04 10:08:34,854 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0 2023-12-04 10:08:41,327 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=242200.0, ans=0.125 2023-12-04 10:08:47,027 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=242200.0, ans=0.125 2023-12-04 10:08:50,214 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=242200.0, ans=0.04949747468305833 2023-12-04 10:08:53,423 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=242266.66666666666, ans=0.125 2023-12-04 10:09:02,791 INFO [scaling.py:1022] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.47 vs. 
limit=8.0 2023-12-04 10:09:02,997 INFO [train.py:1087] (2/4) Epoch 41, batch 550, loss[loss=0.1545, simple_loss=0.2435, pruned_loss=0.03277, over 24741.00 frames. ], tot_loss[loss=0.1609, simple_loss=0.2525, pruned_loss=0.03462, over 4498786.39 frames. ], batch size: 61, lr: 5.97e-03, grad_scale: 16.0 2023-12-04 10:09:03,191 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=242333.33333333334, ans=0.0 2023-12-04 10:09:04,645 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.84 vs. limit=15.0 2023-12-04 10:09:05,084 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.114e+02 1.265e+02 1.332e+02 1.489e+02 1.835e+02, threshold=2.663e+02, percent-clipped=0.0 2023-12-04 10:09:50,499 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=242600.0, ans=0.0 2023-12-04 10:09:58,807 INFO [train.py:1087] (2/4) Epoch 41, batch 600, loss[loss=0.151, simple_loss=0.2446, pruned_loss=0.02866, over 24792.00 frames. ], tot_loss[loss=0.1607, simple_loss=0.2523, pruned_loss=0.03454, over 4577211.88 frames. ], batch size: 73, lr: 5.97e-03, grad_scale: 16.0 2023-12-04 10:10:04,779 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=242666.66666666666, ans=0.0 2023-12-04 10:10:09,960 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=242733.33333333334, ans=0.125 2023-12-04 10:10:17,614 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=242733.33333333334, ans=0.0 2023-12-04 10:10:23,159 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.57 vs. limit=15.0 2023-12-04 10:10:36,963 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=242866.66666666666, ans=0.125 2023-12-04 10:10:48,527 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=242933.33333333334, ans=0.125 2023-12-04 10:10:50,605 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=242933.33333333334, ans=0.125 2023-12-04 10:10:52,753 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=242933.33333333334, ans=0.125 2023-12-04 10:10:54,635 INFO [train.py:1087] (2/4) Epoch 41, batch 650, loss[loss=0.1602, simple_loss=0.2556, pruned_loss=0.03236, over 24738.00 frames. ], tot_loss[loss=0.16, simple_loss=0.2517, pruned_loss=0.03412, over 4636129.99 frames. ], batch size: 63, lr: 5.96e-03, grad_scale: 16.0 2023-12-04 10:10:56,048 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=243000.0, ans=0.1 2023-12-04 10:10:56,711 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.134e+02 1.277e+02 1.430e+02 1.557e+02 2.173e+02, threshold=2.860e+02, percent-clipped=0.0 2023-12-04 10:11:17,449 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.46 vs. 
limit=22.5 2023-12-04 10:11:26,237 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=243133.33333333334, ans=0.0 2023-12-04 10:11:27,347 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=243200.0, ans=0.0 2023-12-04 10:11:49,907 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.93 vs. limit=15.0 2023-12-04 10:11:50,473 INFO [train.py:1087] (2/4) Epoch 41, batch 700, loss[loss=0.1555, simple_loss=0.2508, pruned_loss=0.0301, over 24701.00 frames. ], tot_loss[loss=0.1602, simple_loss=0.2519, pruned_loss=0.03429, over 4670055.24 frames. ], batch size: 69, lr: 5.96e-03, grad_scale: 16.0 2023-12-04 10:12:13,975 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=243466.66666666666, ans=0.07 2023-12-04 10:12:24,288 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=243533.33333333334, ans=0.0 2023-12-04 10:12:32,016 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=243533.33333333334, ans=0.125 2023-12-04 10:12:35,370 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=243600.0, ans=0.2 2023-12-04 10:12:45,928 INFO [train.py:1087] (2/4) Epoch 41, batch 750, loss[loss=0.1557, simple_loss=0.2488, pruned_loss=0.03129, over 24607.00 frames. ], tot_loss[loss=0.16, simple_loss=0.2517, pruned_loss=0.03416, over 4698030.31 frames. ], batch size: 68, lr: 5.96e-03, grad_scale: 16.0 2023-12-04 10:12:48,082 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.112e+02 1.288e+02 1.387e+02 1.474e+02 2.010e+02, threshold=2.774e+02, percent-clipped=0.0 2023-12-04 10:12:48,423 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=243666.66666666666, ans=0.2 2023-12-04 10:12:53,014 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 10:13:10,978 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.49 vs. limit=22.5 2023-12-04 10:13:41,946 INFO [train.py:1087] (2/4) Epoch 41, batch 800, loss[loss=0.1503, simple_loss=0.2396, pruned_loss=0.03049, over 24489.00 frames. ], tot_loss[loss=0.1599, simple_loss=0.2515, pruned_loss=0.03409, over 4727846.80 frames. ], batch size: 75, lr: 5.95e-03, grad_scale: 32.0 2023-12-04 10:13:50,089 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=244000.0, ans=0.1 2023-12-04 10:13:55,127 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=244066.66666666666, ans=0.04949747468305833 2023-12-04 10:14:13,896 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=244200.0, ans=0.125 2023-12-04 10:14:33,539 INFO [train.py:1087] (2/4) Epoch 41, batch 850, loss[loss=0.1802, simple_loss=0.2617, pruned_loss=0.04937, over 16540.00 frames. ], tot_loss[loss=0.1599, simple_loss=0.2515, pruned_loss=0.03416, over 4756959.65 frames. 
], batch size: 178, lr: 5.95e-03, grad_scale: 32.0 2023-12-04 10:14:35,530 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.088e+02 1.320e+02 1.423e+02 1.595e+02 2.103e+02, threshold=2.847e+02, percent-clipped=0.0 2023-12-04 10:14:55,689 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.92 vs. limit=15.0 2023-12-04 10:15:34,186 INFO [train.py:1087] (2/4) Epoch 42, batch 0, loss[loss=0.1576, simple_loss=0.2532, pruned_loss=0.03106, over 24455.00 frames. ], tot_loss[loss=0.1576, simple_loss=0.2532, pruned_loss=0.03106, over 24455.00 frames. ], batch size: 77, lr: 5.87e-03, grad_scale: 32.0 2023-12-04 10:15:34,187 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 10:15:46,436 INFO [train.py:1119] (2/4) Epoch 42, validation: loss=0.1528, simple_loss=0.2516, pruned_loss=0.02702, over 944034.00 frames. 2023-12-04 10:15:46,436 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 10:15:48,783 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=244633.33333333334, ans=0.1 2023-12-04 10:15:51,340 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.97 vs. limit=22.5 2023-12-04 10:15:53,012 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=244633.33333333334, ans=0.0 2023-12-04 10:16:01,856 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=244700.0, ans=0.125 2023-12-04 10:16:01,886 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=244700.0, ans=0.95 2023-12-04 10:16:02,888 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=244700.0, ans=0.125 2023-12-04 10:16:08,212 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=244766.66666666666, ans=0.125 2023-12-04 10:16:08,260 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=244766.66666666666, ans=0.0 2023-12-04 10:16:16,887 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=244766.66666666666, ans=0.125 2023-12-04 10:16:19,180 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=22.5 2023-12-04 10:16:25,731 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.59 vs. limit=15.0 2023-12-04 10:16:38,901 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.18 vs. limit=10.0 2023-12-04 10:16:41,507 INFO [train.py:1087] (2/4) Epoch 42, batch 50, loss[loss=0.1605, simple_loss=0.2576, pruned_loss=0.03167, over 24547.00 frames. ], tot_loss[loss=0.1628, simple_loss=0.2535, pruned_loss=0.03602, over 1067329.68 frames. 
], batch size: 62, lr: 5.87e-03, grad_scale: 32.0 2023-12-04 10:16:50,071 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.028e+02 1.299e+02 1.391e+02 1.562e+02 2.248e+02, threshold=2.782e+02, percent-clipped=0.0 2023-12-04 10:16:51,747 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.42 vs. limit=15.0 2023-12-04 10:16:54,604 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=245033.33333333334, ans=0.04949747468305833 2023-12-04 10:17:00,932 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=245033.33333333334, ans=0.09899494936611666 2023-12-04 10:17:17,899 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=245166.66666666666, ans=0.125 2023-12-04 10:17:30,569 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=15.0 2023-12-04 10:17:37,901 INFO [train.py:1087] (2/4) Epoch 42, batch 100, loss[loss=0.1593, simple_loss=0.2509, pruned_loss=0.03385, over 24598.00 frames. ], tot_loss[loss=0.1624, simple_loss=0.2538, pruned_loss=0.0355, over 1885220.29 frames. ], batch size: 68, lr: 5.87e-03, grad_scale: 32.0 2023-12-04 10:17:47,421 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=22.5 2023-12-04 10:18:04,297 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.60 vs. limit=15.0 2023-12-04 10:18:10,662 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=245500.0, ans=0.0 2023-12-04 10:18:13,391 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.84 vs. limit=15.0 2023-12-04 10:18:23,242 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=245566.66666666666, ans=0.1 2023-12-04 10:18:34,003 INFO [train.py:1087] (2/4) Epoch 42, batch 150, loss[loss=0.1645, simple_loss=0.2536, pruned_loss=0.03768, over 24020.00 frames. ], tot_loss[loss=0.1603, simple_loss=0.2521, pruned_loss=0.0342, over 2543552.95 frames. ], batch size: 87, lr: 5.86e-03, grad_scale: 32.0 2023-12-04 10:18:41,710 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.066e+02 1.301e+02 1.400e+02 1.505e+02 2.131e+02, threshold=2.800e+02, percent-clipped=0.0 2023-12-04 10:18:52,915 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=245700.0, ans=0.0 2023-12-04 10:18:54,001 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=245700.0, ans=0.95 2023-12-04 10:19:00,868 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=245766.66666666666, ans=22.5 2023-12-04 10:19:02,102 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.57 vs. 
limit=15.0 2023-12-04 10:19:29,589 INFO [train.py:1087] (2/4) Epoch 42, batch 200, loss[loss=0.1579, simple_loss=0.2502, pruned_loss=0.03284, over 24316.00 frames. ], tot_loss[loss=0.1606, simple_loss=0.2523, pruned_loss=0.03445, over 3037512.89 frames. ], batch size: 79, lr: 5.86e-03, grad_scale: 16.0 2023-12-04 10:19:41,268 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=246033.33333333334, ans=0.0 2023-12-04 10:19:44,360 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=246033.33333333334, ans=0.0 2023-12-04 10:20:02,821 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.22 vs. limit=15.0 2023-12-04 10:20:14,686 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_na.min_abs, batch_count=246233.33333333334, ans=0.02 2023-12-04 10:20:17,841 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=246233.33333333334, ans=0.0 2023-12-04 10:20:25,031 INFO [train.py:1087] (2/4) Epoch 42, batch 250, loss[loss=0.1544, simple_loss=0.2479, pruned_loss=0.0304, over 24764.00 frames. ], tot_loss[loss=0.1602, simple_loss=0.2519, pruned_loss=0.03425, over 3438148.18 frames. ], batch size: 66, lr: 5.85e-03, grad_scale: 16.0 2023-12-04 10:20:29,501 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.32 vs. limit=10.0 2023-12-04 10:20:34,330 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.152e+02 1.303e+02 1.399e+02 1.549e+02 3.272e+02, threshold=2.797e+02, percent-clipped=1.0 2023-12-04 10:20:44,987 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=246366.66666666666, ans=0.0 2023-12-04 10:20:45,888 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=246366.66666666666, ans=0.1 2023-12-04 10:20:47,227 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.58 vs. limit=15.0 2023-12-04 10:20:51,265 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=246433.33333333334, ans=0.0 2023-12-04 10:21:03,735 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=246500.0, ans=0.0 2023-12-04 10:21:09,084 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=246566.66666666666, ans=0.0 2023-12-04 10:21:21,123 INFO [train.py:1087] (2/4) Epoch 42, batch 300, loss[loss=0.1621, simple_loss=0.2484, pruned_loss=0.03792, over 23918.00 frames. ], tot_loss[loss=0.1605, simple_loss=0.2523, pruned_loss=0.03433, over 3742760.48 frames. 
], batch size: 87, lr: 5.85e-03, grad_scale: 16.0 2023-12-04 10:21:22,363 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=246633.33333333334, ans=0.0 2023-12-04 10:21:24,625 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=246633.33333333334, ans=0.1 2023-12-04 10:21:29,172 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 10:21:42,083 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=12.0 2023-12-04 10:21:42,552 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=246766.66666666666, ans=0.015 2023-12-04 10:22:09,339 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.12 vs. limit=15.0 2023-12-04 10:22:12,296 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.37 vs. limit=10.0 2023-12-04 10:22:16,310 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=246966.66666666666, ans=0.0 2023-12-04 10:22:17,166 INFO [train.py:1087] (2/4) Epoch 42, batch 350, loss[loss=0.1635, simple_loss=0.2582, pruned_loss=0.03443, over 22971.00 frames. ], tot_loss[loss=0.1605, simple_loss=0.2521, pruned_loss=0.03444, over 3972490.07 frames. ], batch size: 106, lr: 5.85e-03, grad_scale: 16.0 2023-12-04 10:22:26,402 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.129e+02 1.309e+02 1.424e+02 1.589e+02 2.700e+02, threshold=2.847e+02, percent-clipped=0.0 2023-12-04 10:22:34,446 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=247033.33333333334, ans=0.125 2023-12-04 10:22:35,518 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=247033.33333333334, ans=0.0 2023-12-04 10:23:08,325 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 10:23:13,396 INFO [train.py:1087] (2/4) Epoch 42, batch 400, loss[loss=0.147, simple_loss=0.2413, pruned_loss=0.02638, over 24603.00 frames. ], tot_loss[loss=0.1602, simple_loss=0.2518, pruned_loss=0.03433, over 4159059.05 frames. ], batch size: 68, lr: 5.84e-03, grad_scale: 32.0 2023-12-04 10:23:13,558 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=247300.0, ans=0.125 2023-12-04 10:23:24,593 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=247366.66666666666, ans=0.125 2023-12-04 10:23:24,645 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=247366.66666666666, ans=0.125 2023-12-04 10:23:30,127 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=247366.66666666666, ans=0.1 2023-12-04 10:23:52,431 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.20 vs. 
limit=22.5 2023-12-04 10:24:01,495 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 10:24:04,630 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=247566.66666666666, ans=0.2 2023-12-04 10:24:08,927 INFO [train.py:1087] (2/4) Epoch 42, batch 450, loss[loss=0.1516, simple_loss=0.2441, pruned_loss=0.02956, over 24771.00 frames. ], tot_loss[loss=0.1598, simple_loss=0.2515, pruned_loss=0.03405, over 4308964.57 frames. ], batch size: 65, lr: 5.84e-03, grad_scale: 32.0 2023-12-04 10:24:09,113 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=247633.33333333334, ans=0.125 2023-12-04 10:24:15,937 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=247633.33333333334, ans=0.0 2023-12-04 10:24:17,794 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.077e+02 1.260e+02 1.360e+02 1.525e+02 2.031e+02, threshold=2.721e+02, percent-clipped=0.0 2023-12-04 10:24:20,337 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=247700.0, ans=0.0 2023-12-04 10:24:30,367 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=247766.66666666666, ans=0.0 2023-12-04 10:24:38,045 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=247766.66666666666, ans=0.025 2023-12-04 10:25:04,716 INFO [train.py:1087] (2/4) Epoch 42, batch 500, loss[loss=0.1615, simple_loss=0.2555, pruned_loss=0.03379, over 24609.00 frames. ], tot_loss[loss=0.1602, simple_loss=0.2518, pruned_loss=0.03432, over 4406254.38 frames. ], batch size: 68, lr: 5.83e-03, grad_scale: 32.0 2023-12-04 10:25:33,346 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=248100.0, ans=0.125 2023-12-04 10:25:43,830 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=248166.66666666666, ans=0.2 2023-12-04 10:25:45,081 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.02 vs. limit=15.0 2023-12-04 10:25:47,314 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=248166.66666666666, ans=0.1 2023-12-04 10:25:48,295 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=248166.66666666666, ans=0.125 2023-12-04 10:25:48,408 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=248166.66666666666, ans=0.0 2023-12-04 10:25:49,503 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=248233.33333333334, ans=0.1 2023-12-04 10:26:01,397 INFO [train.py:1087] (2/4) Epoch 42, batch 550, loss[loss=0.1585, simple_loss=0.2505, pruned_loss=0.03323, over 24550.00 frames. ], tot_loss[loss=0.16, simple_loss=0.2518, pruned_loss=0.03408, over 4486173.64 frames. 
], batch size: 62, lr: 5.83e-03, grad_scale: 32.0 2023-12-04 10:26:02,666 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=248300.0, ans=0.125 2023-12-04 10:26:09,901 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.144e+02 1.331e+02 1.425e+02 1.593e+02 2.159e+02, threshold=2.850e+02, percent-clipped=0.0 2023-12-04 10:26:42,602 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=248500.0, ans=0.125 2023-12-04 10:26:48,757 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 10:26:57,147 INFO [train.py:1087] (2/4) Epoch 42, batch 600, loss[loss=0.1641, simple_loss=0.257, pruned_loss=0.03559, over 24801.00 frames. ], tot_loss[loss=0.1597, simple_loss=0.2515, pruned_loss=0.0339, over 4565818.06 frames. ], batch size: 71, lr: 5.83e-03, grad_scale: 32.0 2023-12-04 10:27:02,147 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=248633.33333333334, ans=0.5 2023-12-04 10:27:06,440 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=248633.33333333334, ans=0.0 2023-12-04 10:27:32,803 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=248833.33333333334, ans=0.0 2023-12-04 10:27:34,254 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=248833.33333333334, ans=0.125 2023-12-04 10:27:36,383 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=248833.33333333334, ans=0.0 2023-12-04 10:27:40,620 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 10:27:46,965 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=248900.0, ans=0.0 2023-12-04 10:27:47,178 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.96 vs. limit=22.5 2023-12-04 10:27:50,048 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=248900.0, ans=0.125 2023-12-04 10:27:52,675 INFO [train.py:1087] (2/4) Epoch 42, batch 650, loss[loss=0.1631, simple_loss=0.2516, pruned_loss=0.03732, over 24323.00 frames. ], tot_loss[loss=0.16, simple_loss=0.2516, pruned_loss=0.0342, over 4614871.03 frames. ], batch size: 79, lr: 5.82e-03, grad_scale: 32.0 2023-12-04 10:28:01,587 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.069e+02 1.365e+02 1.477e+02 1.651e+02 2.256e+02, threshold=2.955e+02, percent-clipped=0.0 2023-12-04 10:28:17,191 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 10:28:18,273 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=249100.0, ans=0.125 2023-12-04 10:28:18,580 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.08 vs. 
limit=22.5 2023-12-04 10:28:23,450 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=249100.0, ans=0.0 2023-12-04 10:28:48,489 INFO [train.py:1087] (2/4) Epoch 42, batch 700, loss[loss=0.1633, simple_loss=0.2545, pruned_loss=0.03605, over 23558.00 frames. ], tot_loss[loss=0.1598, simple_loss=0.2515, pruned_loss=0.03409, over 4653124.75 frames. ], batch size: 94, lr: 5.82e-03, grad_scale: 32.0 2023-12-04 10:28:49,850 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=249300.0, ans=0.95 2023-12-04 10:29:07,477 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=249366.66666666666, ans=0.125 2023-12-04 10:29:40,042 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=249566.66666666666, ans=0.1 2023-12-04 10:29:45,064 INFO [train.py:1087] (2/4) Epoch 42, batch 750, loss[loss=0.1749, simple_loss=0.269, pruned_loss=0.04039, over 22776.00 frames. ], tot_loss[loss=0.16, simple_loss=0.2516, pruned_loss=0.03414, over 4694195.96 frames. ], batch size: 106, lr: 5.82e-03, grad_scale: 32.0 2023-12-04 10:29:47,414 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=249633.33333333334, ans=0.125 2023-12-04 10:29:51,982 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.35 vs. limit=15.0 2023-12-04 10:29:53,624 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.164e+02 1.308e+02 1.406e+02 1.557e+02 2.045e+02, threshold=2.813e+02, percent-clipped=0.0 2023-12-04 10:29:58,010 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=249700.0, ans=0.0 2023-12-04 10:30:05,949 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=249700.0, ans=0.2 2023-12-04 10:30:08,232 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=249766.66666666666, ans=0.1 2023-12-04 10:30:21,571 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=249833.33333333334, ans=0.0 2023-12-04 10:30:34,341 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=22.5 2023-12-04 10:30:42,246 INFO [train.py:1087] (2/4) Epoch 42, batch 800, loss[loss=0.1566, simple_loss=0.2473, pruned_loss=0.03289, over 24575.00 frames. ], tot_loss[loss=0.1597, simple_loss=0.2514, pruned_loss=0.03398, over 4715392.10 frames. ], batch size: 64, lr: 5.81e-03, grad_scale: 32.0 2023-12-04 10:30:43,625 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=249966.66666666666, ans=0.125 2023-12-04 10:31:33,892 INFO [train.py:1087] (2/4) Epoch 42, batch 850, loss[loss=0.1684, simple_loss=0.2565, pruned_loss=0.04018, over 23974.00 frames. ], tot_loss[loss=0.16, simple_loss=0.2516, pruned_loss=0.03416, over 4728702.04 frames. 
], batch size: 87, lr: 5.81e-03, grad_scale: 32.0 2023-12-04 10:31:41,982 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.120e+02 1.298e+02 1.360e+02 1.535e+02 1.939e+02, threshold=2.720e+02, percent-clipped=0.0 2023-12-04 10:31:49,289 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=250366.66666666666, ans=0.05 2023-12-04 10:31:59,901 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=250433.33333333334, ans=0.0 2023-12-04 10:32:02,843 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=250433.33333333334, ans=0.125 2023-12-04 10:32:16,687 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=250566.66666666666, ans=0.125 2023-12-04 10:32:16,790 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=250566.66666666666, ans=0.0 2023-12-04 10:32:32,949 INFO [train.py:1087] (2/4) Epoch 43, batch 0, loss[loss=0.1868, simple_loss=0.2628, pruned_loss=0.05538, over 16695.00 frames. ], tot_loss[loss=0.1868, simple_loss=0.2628, pruned_loss=0.05538, over 16695.00 frames. ], batch size: 177, lr: 5.74e-03, grad_scale: 32.0 2023-12-04 10:32:32,950 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 10:32:41,914 INFO [zipformer.py:1876] (2/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.9379, 4.0870, 3.7051, 3.8070], device='cuda:2') 2023-12-04 10:32:45,211 INFO [train.py:1119] (2/4) Epoch 43, validation: loss=0.1523, simple_loss=0.2509, pruned_loss=0.02682, over 944034.00 frames. 2023-12-04 10:32:45,212 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 10:32:51,506 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=250600.0, ans=0.125 2023-12-04 10:32:55,565 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 10:32:58,024 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=250666.66666666666, ans=10.0 2023-12-04 10:33:01,580 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.18 vs. limit=15.0 2023-12-04 10:33:05,679 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.96 vs. limit=15.0 2023-12-04 10:33:12,809 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=250733.33333333334, ans=0.125 2023-12-04 10:33:16,444 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=250733.33333333334, ans=0.125 2023-12-04 10:33:23,031 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.34 vs. 
limit=12.0 2023-12-04 10:33:28,086 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=250800.0, ans=0.125 2023-12-04 10:33:30,258 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=250866.66666666666, ans=0.0 2023-12-04 10:33:32,734 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.00 vs. limit=15.0 2023-12-04 10:33:38,003 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.40 vs. limit=15.0 2023-12-04 10:33:40,669 INFO [train.py:1087] (2/4) Epoch 43, batch 50, loss[loss=0.1565, simple_loss=0.2508, pruned_loss=0.03115, over 24460.00 frames. ], tot_loss[loss=0.1615, simple_loss=0.2531, pruned_loss=0.03494, over 1080947.05 frames. ], batch size: 75, lr: 5.73e-03, grad_scale: 32.0 2023-12-04 10:33:49,128 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.64 vs. limit=22.5 2023-12-04 10:33:56,847 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.150e+02 1.362e+02 1.452e+02 1.646e+02 2.202e+02, threshold=2.904e+02, percent-clipped=0.0 2023-12-04 10:34:09,717 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=251066.66666666666, ans=0.2 2023-12-04 10:34:12,972 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=251133.33333333334, ans=0.5 2023-12-04 10:34:36,622 INFO [train.py:1087] (2/4) Epoch 43, batch 100, loss[loss=0.1637, simple_loss=0.2546, pruned_loss=0.03641, over 24151.00 frames. ], tot_loss[loss=0.1603, simple_loss=0.2526, pruned_loss=0.034, over 1899959.44 frames. ], batch size: 82, lr: 5.73e-03, grad_scale: 32.0 2023-12-04 10:35:10,389 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=251466.66666666666, ans=0.125 2023-12-04 10:35:12,602 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=251466.66666666666, ans=0.0 2023-12-04 10:35:26,786 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=251533.33333333334, ans=0.125 2023-12-04 10:35:33,169 INFO [train.py:1087] (2/4) Epoch 43, batch 150, loss[loss=0.1828, simple_loss=0.2709, pruned_loss=0.04738, over 24456.00 frames. ], tot_loss[loss=0.1603, simple_loss=0.2524, pruned_loss=0.03413, over 2541572.63 frames. ], batch size: 77, lr: 5.73e-03, grad_scale: 32.0 2023-12-04 10:35:37,018 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.67 vs. 
limit=15.0 2023-12-04 10:35:39,831 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=251600.0, ans=0.2 2023-12-04 10:35:44,026 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=251666.66666666666, ans=0.125 2023-12-04 10:35:46,180 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=251666.66666666666, ans=0.2 2023-12-04 10:35:48,107 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.117e+02 1.291e+02 1.391e+02 1.529e+02 2.044e+02, threshold=2.781e+02, percent-clipped=0.0 2023-12-04 10:36:06,558 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=251800.0, ans=0.1 2023-12-04 10:36:08,006 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=251800.0, ans=10.0 2023-12-04 10:36:22,381 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.24 vs. limit=10.0 2023-12-04 10:36:28,319 INFO [train.py:1087] (2/4) Epoch 43, batch 200, loss[loss=0.1551, simple_loss=0.2537, pruned_loss=0.02824, over 24708.00 frames. ], tot_loss[loss=0.16, simple_loss=0.2518, pruned_loss=0.0341, over 3055816.33 frames. ], batch size: 69, lr: 5.72e-03, grad_scale: 32.0 2023-12-04 10:36:29,637 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=251933.33333333334, ans=0.0 2023-12-04 10:36:30,675 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=251933.33333333334, ans=0.0 2023-12-04 10:36:38,826 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=252000.0, ans=0.2 2023-12-04 10:36:41,727 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=252000.0, ans=0.125 2023-12-04 10:36:42,770 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=252000.0, ans=0.1 2023-12-04 10:36:58,329 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.16 vs. limit=10.0 2023-12-04 10:37:06,023 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=252133.33333333334, ans=0.125 2023-12-04 10:37:07,447 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.01 vs. limit=22.5 2023-12-04 10:37:08,327 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=252133.33333333334, ans=0.0 2023-12-04 10:37:24,309 INFO [train.py:1087] (2/4) Epoch 43, batch 250, loss[loss=0.1721, simple_loss=0.2634, pruned_loss=0.04037, over 24806.00 frames. ], tot_loss[loss=0.1603, simple_loss=0.252, pruned_loss=0.03433, over 3435706.69 frames. 
], batch size: 62, lr: 5.72e-03, grad_scale: 16.0 2023-12-04 10:37:25,648 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=252266.66666666666, ans=0.0 2023-12-04 10:37:34,956 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.71 vs. limit=15.0 2023-12-04 10:37:41,479 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.129e+02 1.276e+02 1.375e+02 1.506e+02 1.790e+02, threshold=2.750e+02, percent-clipped=0.0 2023-12-04 10:37:44,999 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=252333.33333333334, ans=0.0 2023-12-04 10:38:08,182 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=252533.33333333334, ans=0.125 2023-12-04 10:38:10,774 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=252533.33333333334, ans=0.125 2023-12-04 10:38:20,489 INFO [train.py:1087] (2/4) Epoch 43, batch 300, loss[loss=0.1588, simple_loss=0.2535, pruned_loss=0.0321, over 24569.00 frames. ], tot_loss[loss=0.1598, simple_loss=0.2516, pruned_loss=0.03397, over 3753557.58 frames. ], batch size: 65, lr: 5.71e-03, grad_scale: 16.0 2023-12-04 10:38:21,835 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=252600.0, ans=0.125 2023-12-04 10:38:45,624 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=252733.33333333334, ans=0.125 2023-12-04 10:39:09,601 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=252866.66666666666, ans=0.1 2023-12-04 10:39:16,580 INFO [train.py:1087] (2/4) Epoch 43, batch 350, loss[loss=0.1653, simple_loss=0.2563, pruned_loss=0.03714, over 24323.00 frames. ], tot_loss[loss=0.1595, simple_loss=0.2515, pruned_loss=0.03377, over 3996467.74 frames. ], batch size: 79, lr: 5.71e-03, grad_scale: 16.0 2023-12-04 10:39:19,954 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=252933.33333333334, ans=0.2 2023-12-04 10:39:33,086 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.120e+02 1.311e+02 1.434e+02 1.594e+02 2.368e+02, threshold=2.867e+02, percent-clipped=0.0 2023-12-04 10:39:46,536 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.88 vs. limit=15.0 2023-12-04 10:39:48,518 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=253066.66666666666, ans=0.2 2023-12-04 10:40:02,667 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=253200.0, ans=0.0 2023-12-04 10:40:12,451 INFO [train.py:1087] (2/4) Epoch 43, batch 400, loss[loss=0.1596, simple_loss=0.2555, pruned_loss=0.03178, over 23473.00 frames. ], tot_loss[loss=0.1599, simple_loss=0.2517, pruned_loss=0.03401, over 4158117.41 frames. 
], batch size: 94, lr: 5.71e-03, grad_scale: 32.0 2023-12-04 10:40:50,220 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=253466.66666666666, ans=0.0 2023-12-04 10:40:50,251 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=253466.66666666666, ans=0.1 2023-12-04 10:40:52,727 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=253466.66666666666, ans=15.0 2023-12-04 10:40:57,679 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 10:41:08,414 INFO [train.py:1087] (2/4) Epoch 43, batch 450, loss[loss=0.1543, simple_loss=0.2478, pruned_loss=0.03045, over 24555.00 frames. ], tot_loss[loss=0.1592, simple_loss=0.2511, pruned_loss=0.03366, over 4307814.66 frames. ], batch size: 62, lr: 5.70e-03, grad_scale: 32.0 2023-12-04 10:41:19,999 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=253666.66666666666, ans=0.09899494936611666 2023-12-04 10:41:25,331 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.083e+02 1.283e+02 1.403e+02 1.492e+02 2.872e+02, threshold=2.807e+02, percent-clipped=1.0 2023-12-04 10:41:27,815 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=253666.66666666666, ans=0.125 2023-12-04 10:41:48,567 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=253800.0, ans=0.125 2023-12-04 10:41:51,436 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.58 vs. limit=10.0 2023-12-04 10:41:52,060 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=253866.66666666666, ans=0.125 2023-12-04 10:42:01,084 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=253866.66666666666, ans=0.125 2023-12-04 10:42:03,946 INFO [train.py:1087] (2/4) Epoch 43, batch 500, loss[loss=0.1652, simple_loss=0.2575, pruned_loss=0.03649, over 23654.00 frames. ], tot_loss[loss=0.1589, simple_loss=0.2508, pruned_loss=0.03349, over 4428339.52 frames. 
], batch size: 95, lr: 5.70e-03, grad_scale: 32.0 2023-12-04 10:42:12,519 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=253933.33333333334, ans=0.1 2023-12-04 10:42:24,442 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=254000.0, ans=0.125 2023-12-04 10:42:25,788 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=254066.66666666666, ans=0.1 2023-12-04 10:42:28,983 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=254066.66666666666, ans=0.0 2023-12-04 10:42:53,975 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=254200.0, ans=0.125 2023-12-04 10:42:58,616 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=254200.0, ans=0.0 2023-12-04 10:43:00,426 INFO [train.py:1087] (2/4) Epoch 43, batch 550, loss[loss=0.1571, simple_loss=0.2561, pruned_loss=0.02909, over 21185.00 frames. ], tot_loss[loss=0.1592, simple_loss=0.2512, pruned_loss=0.03362, over 4502469.28 frames. ], batch size: 127, lr: 5.70e-03, grad_scale: 32.0 2023-12-04 10:43:13,762 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=254333.33333333334, ans=0.125 2023-12-04 10:43:16,754 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.141e+02 1.284e+02 1.383e+02 1.508e+02 2.379e+02, threshold=2.767e+02, percent-clipped=0.0 2023-12-04 10:43:24,127 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=254400.0, ans=0.0 2023-12-04 10:43:37,660 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=254466.66666666666, ans=0.2 2023-12-04 10:43:38,720 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=254466.66666666666, ans=0.0 2023-12-04 10:43:42,005 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=254466.66666666666, ans=0.125 2023-12-04 10:43:44,459 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.32 vs. limit=15.0 2023-12-04 10:43:56,356 INFO [train.py:1087] (2/4) Epoch 43, batch 600, loss[loss=0.1539, simple_loss=0.2435, pruned_loss=0.03211, over 24725.00 frames. ], tot_loss[loss=0.1591, simple_loss=0.2511, pruned_loss=0.0336, over 4580657.26 frames. ], batch size: 61, lr: 5.69e-03, grad_scale: 16.0 2023-12-04 10:44:05,890 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=254600.0, ans=0.125 2023-12-04 10:44:07,482 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.49 vs. 
limit=15.0 2023-12-04 10:44:12,731 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=254666.66666666666, ans=0.125 2023-12-04 10:44:14,859 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=254666.66666666666, ans=0.125 2023-12-04 10:44:15,988 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=254666.66666666666, ans=0.125 2023-12-04 10:44:44,996 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=254866.66666666666, ans=0.125 2023-12-04 10:44:52,169 INFO [train.py:1087] (2/4) Epoch 43, batch 650, loss[loss=0.1522, simple_loss=0.2458, pruned_loss=0.02932, over 24718.00 frames. ], tot_loss[loss=0.1592, simple_loss=0.2509, pruned_loss=0.03379, over 4633001.78 frames. ], batch size: 74, lr: 5.69e-03, grad_scale: 16.0 2023-12-04 10:45:00,338 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=254933.33333333334, ans=0.125 2023-12-04 10:45:02,876 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=255000.0, ans=0.125 2023-12-04 10:45:10,801 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.137e+02 1.306e+02 1.396e+02 1.562e+02 2.104e+02, threshold=2.792e+02, percent-clipped=0.0 2023-12-04 10:45:21,730 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=255066.66666666666, ans=0.0 2023-12-04 10:45:21,780 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=255066.66666666666, ans=0.0 2023-12-04 10:45:30,754 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=255133.33333333334, ans=0.5 2023-12-04 10:45:37,932 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=255200.0, ans=0.2 2023-12-04 10:45:42,302 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=255200.0, ans=0.125 2023-12-04 10:45:44,378 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=255200.0, ans=0.125 2023-12-04 10:45:48,310 INFO [train.py:1087] (2/4) Epoch 43, batch 700, loss[loss=0.1725, simple_loss=0.2598, pruned_loss=0.04255, over 24129.00 frames. ], tot_loss[loss=0.1591, simple_loss=0.2508, pruned_loss=0.03375, over 4668156.91 frames. 
], batch size: 82, lr: 5.69e-03, grad_scale: 16.0 2023-12-04 10:45:59,395 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=255333.33333333334, ans=0.125 2023-12-04 10:46:22,907 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=255466.66666666666, ans=0.0 2023-12-04 10:46:27,258 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=255466.66666666666, ans=22.5 2023-12-04 10:46:35,146 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=255533.33333333334, ans=0.125 2023-12-04 10:46:44,373 INFO [train.py:1087] (2/4) Epoch 43, batch 750, loss[loss=0.1576, simple_loss=0.2447, pruned_loss=0.0352, over 24742.00 frames. ], tot_loss[loss=0.159, simple_loss=0.2507, pruned_loss=0.03364, over 4697924.39 frames. ], batch size: 63, lr: 5.68e-03, grad_scale: 16.0 2023-12-04 10:46:53,797 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=4.79 vs. limit=15.0 2023-12-04 10:47:01,889 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.310e+02 1.439e+02 1.594e+02 2.322e+02, threshold=2.878e+02, percent-clipped=0.0 2023-12-04 10:47:04,955 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.69 vs. limit=12.0 2023-12-04 10:47:31,130 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=255866.66666666666, ans=0.125 2023-12-04 10:47:40,075 INFO [train.py:1087] (2/4) Epoch 43, batch 800, loss[loss=0.1571, simple_loss=0.2519, pruned_loss=0.03118, over 24874.00 frames. ], tot_loss[loss=0.1587, simple_loss=0.2504, pruned_loss=0.03347, over 4720505.06 frames. ], batch size: 68, lr: 5.68e-03, grad_scale: 32.0 2023-12-04 10:47:47,120 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=255933.33333333334, ans=0.125 2023-12-04 10:47:54,411 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.70 vs. limit=15.0 2023-12-04 10:48:06,338 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=256066.66666666666, ans=10.0 2023-12-04 10:48:13,476 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=256133.33333333334, ans=0.0 2023-12-04 10:48:13,531 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=256133.33333333334, ans=0.0 2023-12-04 10:48:19,439 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=256133.33333333334, ans=0.125 2023-12-04 10:48:22,373 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=256200.0, ans=0.0 2023-12-04 10:48:24,662 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.04 vs. 
limit=15.0 2023-12-04 10:48:25,320 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=256200.0, ans=0.125 2023-12-04 10:48:32,504 INFO [train.py:1087] (2/4) Epoch 43, batch 850, loss[loss=0.1507, simple_loss=0.2469, pruned_loss=0.0273, over 24605.00 frames. ], tot_loss[loss=0.1585, simple_loss=0.2502, pruned_loss=0.03337, over 4744497.63 frames. ], batch size: 68, lr: 5.67e-03, grad_scale: 32.0 2023-12-04 10:48:48,708 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.110e+02 1.268e+02 1.363e+02 1.524e+02 1.914e+02, threshold=2.726e+02, percent-clipped=0.0 2023-12-04 10:48:59,613 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=256400.0, ans=0.05 2023-12-04 10:49:09,551 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=256466.66666666666, ans=0.0 2023-12-04 10:49:31,982 INFO [train.py:1087] (2/4) Epoch 44, batch 0, loss[loss=0.1568, simple_loss=0.2493, pruned_loss=0.03215, over 24567.00 frames. ], tot_loss[loss=0.1568, simple_loss=0.2493, pruned_loss=0.03215, over 24567.00 frames. ], batch size: 64, lr: 5.61e-03, grad_scale: 32.0 2023-12-04 10:49:31,983 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 10:49:44,324 INFO [train.py:1119] (2/4) Epoch 44, validation: loss=0.1512, simple_loss=0.2503, pruned_loss=0.02602, over 944034.00 frames. 2023-12-04 10:49:44,325 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 10:49:50,999 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=256566.66666666666, ans=0.2 2023-12-04 10:50:02,486 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=256633.33333333334, ans=0.125 2023-12-04 10:50:16,883 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=256766.66666666666, ans=0.1 2023-12-04 10:50:23,213 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=256766.66666666666, ans=0.0 2023-12-04 10:50:32,595 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 10:50:37,281 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.84 vs. limit=15.0 2023-12-04 10:50:39,600 INFO [train.py:1087] (2/4) Epoch 44, batch 50, loss[loss=0.1648, simple_loss=0.2554, pruned_loss=0.03707, over 24009.00 frames. ], tot_loss[loss=0.1605, simple_loss=0.2523, pruned_loss=0.03435, over 1082753.79 frames. 
], batch size: 87, lr: 5.60e-03, grad_scale: 32.0 2023-12-04 10:50:39,866 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=256900.0, ans=0.125 2023-12-04 10:50:46,915 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=256900.0, ans=0.0 2023-12-04 10:50:47,971 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=256900.0, ans=0.1 2023-12-04 10:50:49,085 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=256900.0, ans=0.0 2023-12-04 10:50:54,437 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=256966.66666666666, ans=0.125 2023-12-04 10:51:04,412 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.095e+02 1.346e+02 1.454e+02 1.617e+02 2.444e+02, threshold=2.907e+02, percent-clipped=0.0 2023-12-04 10:51:05,833 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=257033.33333333334, ans=0.04949747468305833 2023-12-04 10:51:17,602 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=257100.0, ans=0.125 2023-12-04 10:51:35,076 INFO [train.py:1087] (2/4) Epoch 44, batch 100, loss[loss=0.1651, simple_loss=0.2588, pruned_loss=0.03569, over 21788.00 frames. ], tot_loss[loss=0.1591, simple_loss=0.2515, pruned_loss=0.03337, over 1924945.28 frames. ], batch size: 128, lr: 5.60e-03, grad_scale: 32.0 2023-12-04 10:51:38,811 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=257233.33333333334, ans=0.0 2023-12-04 10:51:45,162 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=257300.0, ans=0.125 2023-12-04 10:51:45,310 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=257300.0, ans=0.04949747468305833 2023-12-04 10:51:52,688 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=257300.0, ans=0.0 2023-12-04 10:51:53,672 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=257300.0, ans=0.0 2023-12-04 10:52:19,264 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=257500.0, ans=0.125 2023-12-04 10:52:21,060 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.53 vs. limit=15.0 2023-12-04 10:52:31,059 INFO [train.py:1087] (2/4) Epoch 44, batch 150, loss[loss=0.1592, simple_loss=0.2523, pruned_loss=0.03308, over 24754.00 frames. ], tot_loss[loss=0.1588, simple_loss=0.251, pruned_loss=0.03328, over 2561516.02 frames. 
], batch size: 66, lr: 5.60e-03, grad_scale: 32.0 2023-12-04 10:52:44,453 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=257633.33333333334, ans=0.125 2023-12-04 10:52:55,811 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.149e+02 1.310e+02 1.411e+02 1.524e+02 1.806e+02, threshold=2.823e+02, percent-clipped=0.0 2023-12-04 10:52:56,355 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.42 vs. limit=22.5 2023-12-04 10:53:26,998 INFO [train.py:1087] (2/4) Epoch 44, batch 200, loss[loss=0.1487, simple_loss=0.2481, pruned_loss=0.02468, over 24577.00 frames. ], tot_loss[loss=0.159, simple_loss=0.251, pruned_loss=0.03348, over 3051157.86 frames. ], batch size: 64, lr: 5.59e-03, grad_scale: 32.0 2023-12-04 10:53:30,614 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=257900.0, ans=0.125 2023-12-04 10:53:34,777 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=257900.0, ans=0.125 2023-12-04 10:53:37,525 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0 2023-12-04 10:53:48,018 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=258033.33333333334, ans=0.015 2023-12-04 10:54:22,605 INFO [train.py:1087] (2/4) Epoch 44, batch 250, loss[loss=0.1538, simple_loss=0.2477, pruned_loss=0.02992, over 24564.00 frames. ], tot_loss[loss=0.1585, simple_loss=0.2505, pruned_loss=0.03331, over 3439254.20 frames. ], batch size: 63, lr: 5.59e-03, grad_scale: 32.0 2023-12-04 10:54:24,015 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=258233.33333333334, ans=0.1 2023-12-04 10:54:24,212 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.32 vs. limit=15.0 2023-12-04 10:54:25,286 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=258233.33333333334, ans=0.125 2023-12-04 10:54:47,512 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.099e+02 1.280e+02 1.390e+02 1.521e+02 2.279e+02, threshold=2.780e+02, percent-clipped=0.0 2023-12-04 10:55:07,635 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=258500.0, ans=0.0 2023-12-04 10:55:18,618 INFO [train.py:1087] (2/4) Epoch 44, batch 300, loss[loss=0.1528, simple_loss=0.2451, pruned_loss=0.03028, over 24798.00 frames. ], tot_loss[loss=0.1593, simple_loss=0.2511, pruned_loss=0.03377, over 3724609.04 frames. 
], batch size: 71, lr: 5.58e-03, grad_scale: 16.0 2023-12-04 10:55:34,527 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=258633.33333333334, ans=0.1 2023-12-04 10:55:37,008 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=258633.33333333334, ans=0.0 2023-12-04 10:55:48,472 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=258700.0, ans=0.2 2023-12-04 10:55:49,844 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.52 vs. limit=15.0 2023-12-04 10:56:02,789 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=258833.33333333334, ans=0.1 2023-12-04 10:56:15,389 INFO [train.py:1087] (2/4) Epoch 44, batch 350, loss[loss=0.1505, simple_loss=0.2422, pruned_loss=0.02941, over 24623.00 frames. ], tot_loss[loss=0.1596, simple_loss=0.2515, pruned_loss=0.03382, over 3950012.12 frames. ], batch size: 68, lr: 5.58e-03, grad_scale: 16.0 2023-12-04 10:56:22,138 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.44 vs. limit=12.0 2023-12-04 10:56:29,624 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=258966.66666666666, ans=0.125 2023-12-04 10:56:31,791 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=258966.66666666666, ans=0.125 2023-12-04 10:56:33,841 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=258966.66666666666, ans=0.125 2023-12-04 10:56:38,099 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=259033.33333333334, ans=0.125 2023-12-04 10:56:40,379 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.084e+02 1.276e+02 1.381e+02 1.490e+02 2.030e+02, threshold=2.761e+02, percent-clipped=0.0 2023-12-04 10:56:46,729 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=259033.33333333334, ans=0.125 2023-12-04 10:57:10,183 INFO [train.py:1087] (2/4) Epoch 44, batch 400, loss[loss=0.1533, simple_loss=0.244, pruned_loss=0.03128, over 24548.00 frames. ], tot_loss[loss=0.1589, simple_loss=0.2508, pruned_loss=0.03353, over 4151413.24 frames. ], batch size: 62, lr: 5.58e-03, grad_scale: 32.0 2023-12-04 10:57:12,858 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=259233.33333333334, ans=0.0 2023-12-04 10:57:14,037 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=259233.33333333334, ans=0.125 2023-12-04 10:57:21,414 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.77 vs. 
limit=15.0 2023-12-04 10:57:40,767 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=259366.66666666666, ans=0.125 2023-12-04 10:57:48,985 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 10:58:06,560 INFO [train.py:1087] (2/4) Epoch 44, batch 450, loss[loss=0.1445, simple_loss=0.2335, pruned_loss=0.02775, over 24572.00 frames. ], tot_loss[loss=0.1582, simple_loss=0.2501, pruned_loss=0.03314, over 4293434.19 frames. ], batch size: 65, lr: 5.57e-03, grad_scale: 32.0 2023-12-04 10:58:12,221 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=259566.66666666666, ans=0.125 2023-12-04 10:58:15,246 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=259566.66666666666, ans=0.1 2023-12-04 10:58:20,032 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.99 vs. limit=15.0 2023-12-04 10:58:23,780 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=259633.33333333334, ans=0.125 2023-12-04 10:58:32,033 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.111e+02 1.255e+02 1.336e+02 1.462e+02 2.228e+02, threshold=2.673e+02, percent-clipped=0.0 2023-12-04 10:59:02,276 INFO [train.py:1087] (2/4) Epoch 44, batch 500, loss[loss=0.1503, simple_loss=0.2431, pruned_loss=0.02874, over 24773.00 frames. ], tot_loss[loss=0.1578, simple_loss=0.2499, pruned_loss=0.03287, over 4432638.67 frames. ], batch size: 71, lr: 5.57e-03, grad_scale: 32.0 2023-12-04 10:59:10,263 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=259900.0, ans=0.1 2023-12-04 10:59:22,077 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=259966.66666666666, ans=0.2 2023-12-04 10:59:37,889 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=260100.0, ans=0.2 2023-12-04 10:59:38,878 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=260100.0, ans=0.1 2023-12-04 10:59:56,286 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=260233.33333333334, ans=0.2 2023-12-04 10:59:57,565 INFO [train.py:1087] (2/4) Epoch 44, batch 550, loss[loss=0.1594, simple_loss=0.25, pruned_loss=0.03438, over 24548.00 frames. ], tot_loss[loss=0.1583, simple_loss=0.2504, pruned_loss=0.03316, over 4526263.86 frames. 
], batch size: 62, lr: 5.57e-03, grad_scale: 32.0 2023-12-04 10:59:58,809 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=260233.33333333334, ans=0.125 2023-12-04 11:00:03,548 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 11:00:18,866 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=260366.66666666666, ans=0.125 2023-12-04 11:00:21,200 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=6.54 vs. limit=15.0 2023-12-04 11:00:22,736 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.162e+02 1.301e+02 1.379e+02 1.493e+02 2.705e+02, threshold=2.759e+02, percent-clipped=1.0 2023-12-04 11:00:27,454 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=260366.66666666666, ans=0.0 2023-12-04 11:00:29,315 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=260366.66666666666, ans=0.2 2023-12-04 11:00:53,043 INFO [train.py:1087] (2/4) Epoch 44, batch 600, loss[loss=0.1457, simple_loss=0.2395, pruned_loss=0.02594, over 24843.00 frames. ], tot_loss[loss=0.1581, simple_loss=0.2502, pruned_loss=0.03305, over 4602639.81 frames. ], batch size: 68, lr: 5.56e-03, grad_scale: 32.0 2023-12-04 11:01:27,780 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=260766.66666666666, ans=0.1 2023-12-04 11:01:46,726 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=260833.33333333334, ans=0.1 2023-12-04 11:01:48,800 INFO [train.py:1087] (2/4) Epoch 44, batch 650, loss[loss=0.1697, simple_loss=0.2578, pruned_loss=0.04084, over 24186.00 frames. ], tot_loss[loss=0.1581, simple_loss=0.2501, pruned_loss=0.03302, over 4665270.36 frames. ], batch size: 82, lr: 5.56e-03, grad_scale: 32.0 2023-12-04 11:01:53,041 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.96 vs. limit=15.0 2023-12-04 11:02:07,026 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=260966.66666666666, ans=0.2 2023-12-04 11:02:08,056 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=260966.66666666666, ans=0.125 2023-12-04 11:02:14,552 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.100e+02 1.257e+02 1.345e+02 1.436e+02 1.946e+02, threshold=2.691e+02, percent-clipped=0.0 2023-12-04 11:02:14,739 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=261033.33333333334, ans=0.035 2023-12-04 11:02:19,012 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=261033.33333333334, ans=0.125 2023-12-04 11:02:25,063 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.88 vs. 
limit=15.0 2023-12-04 11:02:28,055 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=261100.0, ans=0.0 2023-12-04 11:02:37,532 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=261166.66666666666, ans=0.125 2023-12-04 11:02:44,564 INFO [train.py:1087] (2/4) Epoch 44, batch 700, loss[loss=0.1505, simple_loss=0.2456, pruned_loss=0.02773, over 24766.00 frames. ], tot_loss[loss=0.1581, simple_loss=0.2501, pruned_loss=0.03301, over 4706691.37 frames. ], batch size: 64, lr: 5.56e-03, grad_scale: 32.0 2023-12-04 11:02:46,928 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=261233.33333333334, ans=0.0 2023-12-04 11:03:04,041 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=261300.0, ans=0.125 2023-12-04 11:03:16,662 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=261366.66666666666, ans=0.125 2023-12-04 11:03:18,773 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=261433.33333333334, ans=0.2 2023-12-04 11:03:38,573 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=261500.0, ans=0.125 2023-12-04 11:03:40,858 INFO [train.py:1087] (2/4) Epoch 44, batch 750, loss[loss=0.1617, simple_loss=0.2517, pruned_loss=0.03589, over 24756.00 frames. ], tot_loss[loss=0.158, simple_loss=0.25, pruned_loss=0.03303, over 4745668.95 frames. ], batch size: 63, lr: 5.55e-03, grad_scale: 32.0 2023-12-04 11:04:06,500 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.131e+02 1.296e+02 1.395e+02 1.494e+02 1.834e+02, threshold=2.790e+02, percent-clipped=0.0 2023-12-04 11:04:33,086 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.82 vs. limit=15.0 2023-12-04 11:04:35,249 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.99 vs. limit=10.0 2023-12-04 11:04:36,018 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=261900.0, ans=0.1 2023-12-04 11:04:36,802 INFO [train.py:1087] (2/4) Epoch 44, batch 800, loss[loss=0.1521, simple_loss=0.2447, pruned_loss=0.02972, over 24562.00 frames. ], tot_loss[loss=0.1583, simple_loss=0.2502, pruned_loss=0.03314, over 4761073.52 frames. 
], batch size: 63, lr: 5.55e-03, grad_scale: 32.0 2023-12-04 11:04:57,412 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=262033.33333333334, ans=0.1 2023-12-04 11:04:58,451 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=262033.33333333334, ans=0.05 2023-12-04 11:05:05,415 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=262033.33333333334, ans=0.09899494936611666 2023-12-04 11:05:10,514 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=262100.0, ans=0.0 2023-12-04 11:05:26,832 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=262166.6666666667, ans=0.125 2023-12-04 11:05:28,968 INFO [train.py:1087] (2/4) Epoch 44, batch 850, loss[loss=0.1609, simple_loss=0.251, pruned_loss=0.03547, over 24316.00 frames. ], tot_loss[loss=0.1588, simple_loss=0.2507, pruned_loss=0.03346, over 4761480.51 frames. ], batch size: 79, lr: 5.55e-03, grad_scale: 32.0 2023-12-04 11:05:31,170 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=262233.3333333333, ans=0.125 2023-12-04 11:05:37,161 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=262233.3333333333, ans=0.125 2023-12-04 11:05:48,291 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=262366.6666666667, ans=0.1 2023-12-04 11:05:50,561 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.63 vs. limit=15.0 2023-12-04 11:05:52,167 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.290e+02 1.372e+02 1.489e+02 2.246e+02, threshold=2.744e+02, percent-clipped=0.0 2023-12-04 11:05:52,431 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=262366.6666666667, ans=0.125 2023-12-04 11:06:03,218 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=262433.3333333333, ans=0.0 2023-12-04 11:06:04,390 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=262433.3333333333, ans=0.2 2023-12-04 11:06:11,213 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=262500.0, ans=0.125 2023-12-04 11:06:30,262 INFO [train.py:1087] (2/4) Epoch 45, batch 0, loss[loss=0.1404, simple_loss=0.2321, pruned_loss=0.0243, over 24720.00 frames. ], tot_loss[loss=0.1404, simple_loss=0.2321, pruned_loss=0.0243, over 24720.00 frames. ], batch size: 67, lr: 5.48e-03, grad_scale: 32.0 2023-12-04 11:06:30,262 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 11:06:42,730 INFO [train.py:1119] (2/4) Epoch 45, validation: loss=0.1525, simple_loss=0.2511, pruned_loss=0.02696, over 944034.00 frames. 
2023-12-04 11:06:42,731 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 11:07:16,707 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=262733.3333333333, ans=0.125 2023-12-04 11:07:20,969 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=262733.3333333333, ans=0.125 2023-12-04 11:07:23,180 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=262733.3333333333, ans=0.1 2023-12-04 11:07:38,494 INFO [train.py:1087] (2/4) Epoch 45, batch 50, loss[loss=0.1648, simple_loss=0.2533, pruned_loss=0.03814, over 21400.00 frames. ], tot_loss[loss=0.1594, simple_loss=0.2511, pruned_loss=0.03379, over 1081092.97 frames. ], batch size: 127, lr: 5.48e-03, grad_scale: 16.0 2023-12-04 11:07:43,425 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=262866.6666666667, ans=0.125 2023-12-04 11:07:46,532 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=262866.6666666667, ans=0.125 2023-12-04 11:07:57,327 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=262933.3333333333, ans=0.0 2023-12-04 11:08:04,317 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=263000.0, ans=0.1 2023-12-04 11:08:08,621 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=263000.0, ans=0.2 2023-12-04 11:08:10,395 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.054e+02 1.301e+02 1.423e+02 1.672e+02 2.279e+02, threshold=2.846e+02, percent-clipped=0.0 2023-12-04 11:08:17,577 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=263066.6666666667, ans=0.125 2023-12-04 11:08:33,394 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=263200.0, ans=0.0 2023-12-04 11:08:34,114 INFO [train.py:1087] (2/4) Epoch 45, batch 100, loss[loss=0.1507, simple_loss=0.2432, pruned_loss=0.02908, over 24541.00 frames. ], tot_loss[loss=0.1595, simple_loss=0.2509, pruned_loss=0.03408, over 1893165.77 frames. ], batch size: 62, lr: 5.47e-03, grad_scale: 16.0 2023-12-04 11:08:38,982 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.71 vs. 
limit=22.5 2023-12-04 11:08:49,716 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=263266.6666666667, ans=0.125 2023-12-04 11:09:05,516 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=263333.3333333333, ans=0.125 2023-12-04 11:09:07,725 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=263400.0, ans=0.125 2023-12-04 11:09:07,726 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=263400.0, ans=0.125 2023-12-04 11:09:08,803 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=263400.0, ans=0.1 2023-12-04 11:09:22,156 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=263466.6666666667, ans=0.0 2023-12-04 11:09:29,388 INFO [train.py:1087] (2/4) Epoch 45, batch 150, loss[loss=0.1594, simple_loss=0.2511, pruned_loss=0.03386, over 24761.00 frames. ], tot_loss[loss=0.1606, simple_loss=0.2521, pruned_loss=0.03457, over 2520808.55 frames. ], batch size: 64, lr: 5.47e-03, grad_scale: 16.0 2023-12-04 11:09:33,633 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=263533.3333333333, ans=0.125 2023-12-04 11:09:46,205 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.01 vs. limit=10.0 2023-12-04 11:10:01,268 INFO [scaling.py:1022] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.89 vs. limit=5.0 2023-12-04 11:10:02,904 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.140e+02 1.302e+02 1.364e+02 1.498e+02 1.785e+02, threshold=2.728e+02, percent-clipped=0.0 2023-12-04 11:10:06,746 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=263733.3333333333, ans=0.125 2023-12-04 11:10:12,194 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=263733.3333333333, ans=0.125 2023-12-04 11:10:13,256 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=263733.3333333333, ans=0.125 2023-12-04 11:10:19,766 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=263800.0, ans=0.125 2023-12-04 11:10:20,851 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=263800.0, ans=0.0 2023-12-04 11:10:22,187 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.77 vs. limit=15.0 2023-12-04 11:10:26,277 INFO [train.py:1087] (2/4) Epoch 45, batch 200, loss[loss=0.1572, simple_loss=0.2479, pruned_loss=0.03325, over 23338.00 frames. ], tot_loss[loss=0.16, simple_loss=0.2516, pruned_loss=0.03419, over 3015900.63 frames. 
], batch size: 56, lr: 5.47e-03, grad_scale: 16.0 2023-12-04 11:10:39,942 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=263933.3333333333, ans=0.2 2023-12-04 11:10:52,154 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=264000.0, ans=0.125 2023-12-04 11:10:52,226 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=264000.0, ans=0.125 2023-12-04 11:10:55,712 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.38 vs. limit=22.5 2023-12-04 11:10:57,708 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=264000.0, ans=0.2 2023-12-04 11:11:13,999 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=264133.3333333333, ans=0.125 2023-12-04 11:11:19,485 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=264133.3333333333, ans=0.0 2023-12-04 11:11:22,367 INFO [train.py:1087] (2/4) Epoch 45, batch 250, loss[loss=0.1643, simple_loss=0.263, pruned_loss=0.03285, over 20865.00 frames. ], tot_loss[loss=0.1599, simple_loss=0.2517, pruned_loss=0.03409, over 3410948.09 frames. ], batch size: 50, lr: 5.46e-03, grad_scale: 16.0 2023-12-04 11:11:30,214 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=264200.0, ans=0.0 2023-12-04 11:11:42,882 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=264266.6666666667, ans=0.0 2023-12-04 11:11:43,286 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.61 vs. limit=6.0 2023-12-04 11:11:54,844 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.106e+02 1.261e+02 1.380e+02 1.511e+02 1.936e+02, threshold=2.761e+02, percent-clipped=0.0 2023-12-04 11:12:19,000 INFO [train.py:1087] (2/4) Epoch 45, batch 300, loss[loss=0.1562, simple_loss=0.2475, pruned_loss=0.03246, over 24812.00 frames. ], tot_loss[loss=0.16, simple_loss=0.2516, pruned_loss=0.03421, over 3686675.39 frames. ], batch size: 62, lr: 5.46e-03, grad_scale: 16.0 2023-12-04 11:12:27,197 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.44 vs. limit=15.0 2023-12-04 11:12:28,896 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=264600.0, ans=0.125 2023-12-04 11:12:43,949 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=264666.6666666667, ans=0.2 2023-12-04 11:12:49,526 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=264666.6666666667, ans=0.025 2023-12-04 11:13:10,278 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=264800.0, ans=0.125 2023-12-04 11:13:14,918 INFO [train.py:1087] (2/4) Epoch 45, batch 350, loss[loss=0.1457, simple_loss=0.2383, pruned_loss=0.02652, over 24796.00 frames. 
], tot_loss[loss=0.1602, simple_loss=0.2517, pruned_loss=0.03431, over 3915390.12 frames. ], batch size: 72, lr: 5.46e-03, grad_scale: 16.0 2023-12-04 11:13:21,829 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=264866.6666666667, ans=0.2 2023-12-04 11:13:40,450 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=265000.0, ans=0.125 2023-12-04 11:13:43,876 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=265000.0, ans=0.125 2023-12-04 11:13:47,424 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.125e+02 1.292e+02 1.391e+02 1.519e+02 2.133e+02, threshold=2.783e+02, percent-clipped=0.0 2023-12-04 11:13:47,690 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=265066.6666666667, ans=0.125 2023-12-04 11:13:52,477 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=265066.6666666667, ans=0.125 2023-12-04 11:13:53,491 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=265066.6666666667, ans=0.2 2023-12-04 11:13:54,556 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=265066.6666666667, ans=0.0 2023-12-04 11:13:57,776 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=265066.6666666667, ans=0.125 2023-12-04 11:14:10,356 INFO [train.py:1087] (2/4) Epoch 45, batch 400, loss[loss=0.151, simple_loss=0.2372, pruned_loss=0.03234, over 24762.00 frames. ], tot_loss[loss=0.1588, simple_loss=0.2505, pruned_loss=0.03356, over 4129559.34 frames. ], batch size: 64, lr: 5.45e-03, grad_scale: 32.0 2023-12-04 11:14:11,804 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.02 vs. limit=12.0 2023-12-04 11:14:17,669 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=265200.0, ans=0.125 2023-12-04 11:14:49,578 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=265400.0, ans=0.125 2023-12-04 11:15:00,742 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=265466.6666666667, ans=0.95 2023-12-04 11:15:06,947 INFO [train.py:1087] (2/4) Epoch 45, batch 450, loss[loss=0.1653, simple_loss=0.2518, pruned_loss=0.03939, over 24323.00 frames. ], tot_loss[loss=0.158, simple_loss=0.25, pruned_loss=0.03299, over 4299268.71 frames. 
], batch size: 79, lr: 5.45e-03, grad_scale: 32.0 2023-12-04 11:15:24,376 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=265600.0, ans=0.0 2023-12-04 11:15:39,535 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.125e+02 1.288e+02 1.390e+02 1.502e+02 2.726e+02, threshold=2.781e+02, percent-clipped=0.0 2023-12-04 11:15:51,927 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=265800.0, ans=0.125 2023-12-04 11:15:55,636 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=265800.0, ans=0.0 2023-12-04 11:16:03,386 INFO [train.py:1087] (2/4) Epoch 45, batch 500, loss[loss=0.1561, simple_loss=0.2547, pruned_loss=0.02876, over 24555.00 frames. ], tot_loss[loss=0.1585, simple_loss=0.2505, pruned_loss=0.03329, over 4396262.47 frames. ], batch size: 63, lr: 5.45e-03, grad_scale: 32.0 2023-12-04 11:16:37,544 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=266066.6666666667, ans=0.125 2023-12-04 11:16:42,774 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=266066.6666666667, ans=0.2 2023-12-04 11:16:57,431 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=266133.3333333333, ans=0.1 2023-12-04 11:16:59,713 INFO [train.py:1087] (2/4) Epoch 45, batch 550, loss[loss=0.1449, simple_loss=0.2417, pruned_loss=0.02405, over 24754.00 frames. ], tot_loss[loss=0.1587, simple_loss=0.2506, pruned_loss=0.03345, over 4463663.58 frames. ], batch size: 65, lr: 5.44e-03, grad_scale: 32.0 2023-12-04 11:17:10,017 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 11:17:16,416 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=266266.6666666667, ans=0.125 2023-12-04 11:17:16,746 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.54 vs. limit=15.0 2023-12-04 11:17:32,313 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.111e+02 1.283e+02 1.402e+02 1.532e+02 2.255e+02, threshold=2.804e+02, percent-clipped=0.0 2023-12-04 11:17:34,991 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=266400.0, ans=15.0 2023-12-04 11:17:42,542 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=266400.0, ans=0.0 2023-12-04 11:17:42,716 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=266400.0, ans=0.125 2023-12-04 11:17:55,089 INFO [train.py:1087] (2/4) Epoch 45, batch 600, loss[loss=0.1535, simple_loss=0.2521, pruned_loss=0.02742, over 24797.00 frames. ], tot_loss[loss=0.1583, simple_loss=0.2503, pruned_loss=0.03315, over 4546601.61 frames. 
], batch size: 73, lr: 5.44e-03, grad_scale: 16.0 2023-12-04 11:17:58,874 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=266533.3333333333, ans=0.015 2023-12-04 11:18:17,736 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.58 vs. limit=22.5 2023-12-04 11:18:21,414 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.79 vs. limit=15.0 2023-12-04 11:18:37,079 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=266733.3333333333, ans=0.125 2023-12-04 11:18:42,639 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=266800.0, ans=0.125 2023-12-04 11:18:45,686 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=266800.0, ans=0.125 2023-12-04 11:18:45,970 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.89 vs. limit=22.5 2023-12-04 11:18:51,583 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=266800.0, ans=0.0 2023-12-04 11:18:54,520 INFO [train.py:1087] (2/4) Epoch 45, batch 650, loss[loss=0.1508, simple_loss=0.2394, pruned_loss=0.03106, over 24773.00 frames. ], tot_loss[loss=0.1582, simple_loss=0.2502, pruned_loss=0.03311, over 4612041.30 frames. ], batch size: 64, lr: 5.44e-03, grad_scale: 16.0 2023-12-04 11:18:57,114 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=266866.6666666667, ans=0.1 2023-12-04 11:19:20,142 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=267000.0, ans=0.0 2023-12-04 11:19:23,015 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.00 vs. limit=6.0 2023-12-04 11:19:24,514 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=267000.0, ans=0.125 2023-12-04 11:19:28,564 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.158e+02 1.316e+02 1.419e+02 1.572e+02 2.122e+02, threshold=2.839e+02, percent-clipped=0.0 2023-12-04 11:19:51,004 INFO [train.py:1087] (2/4) Epoch 45, batch 700, loss[loss=0.166, simple_loss=0.2581, pruned_loss=0.03691, over 22735.00 frames. ], tot_loss[loss=0.1585, simple_loss=0.2506, pruned_loss=0.03324, over 4655220.87 frames. 
], batch size: 106, lr: 5.43e-03, grad_scale: 16.0 2023-12-04 11:20:11,567 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=267266.6666666667, ans=0.0 2023-12-04 11:20:19,225 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=267333.3333333333, ans=0.125 2023-12-04 11:20:19,391 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=267333.3333333333, ans=0.0 2023-12-04 11:20:22,698 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=267333.3333333333, ans=0.2 2023-12-04 11:20:32,675 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=267400.0, ans=0.125 2023-12-04 11:20:35,109 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.18 vs. limit=22.5 2023-12-04 11:20:47,164 INFO [train.py:1087] (2/4) Epoch 45, batch 750, loss[loss=0.1592, simple_loss=0.2541, pruned_loss=0.03212, over 24766.00 frames. ], tot_loss[loss=0.1588, simple_loss=0.2509, pruned_loss=0.03341, over 4684162.49 frames. ], batch size: 71, lr: 5.43e-03, grad_scale: 16.0 2023-12-04 11:20:50,681 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=267533.3333333333, ans=0.1 2023-12-04 11:20:54,852 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=267533.3333333333, ans=0.125 2023-12-04 11:21:13,772 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=15.0 2023-12-04 11:21:20,415 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.132e+02 1.281e+02 1.385e+02 1.471e+02 1.798e+02, threshold=2.769e+02, percent-clipped=0.0 2023-12-04 11:21:31,730 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.16 vs. limit=22.5 2023-12-04 11:21:42,212 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.13 vs. limit=15.0 2023-12-04 11:21:42,630 INFO [train.py:1087] (2/4) Epoch 45, batch 800, loss[loss=0.1632, simple_loss=0.2532, pruned_loss=0.03661, over 24753.00 frames. ], tot_loss[loss=0.1585, simple_loss=0.2505, pruned_loss=0.03326, over 4717636.13 frames. ], batch size: 63, lr: 5.43e-03, grad_scale: 32.0 2023-12-04 11:22:05,957 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 11:22:08,302 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.36 vs. limit=15.0 2023-12-04 11:22:12,204 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.62 vs. 
limit=15.0 2023-12-04 11:22:25,728 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=268133.3333333333, ans=0.0 2023-12-04 11:22:33,604 INFO [train.py:1087] (2/4) Epoch 45, batch 850, loss[loss=0.1532, simple_loss=0.2469, pruned_loss=0.0298, over 24796.00 frames. ], tot_loss[loss=0.1584, simple_loss=0.2504, pruned_loss=0.03321, over 4753226.03 frames. ], batch size: 71, lr: 5.42e-03, grad_scale: 32.0 2023-12-04 11:22:44,229 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=268266.6666666667, ans=0.0 2023-12-04 11:22:48,168 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=268266.6666666667, ans=0.0 2023-12-04 11:22:51,371 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=268266.6666666667, ans=0.025 2023-12-04 11:22:56,703 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=268333.3333333333, ans=0.1 2023-12-04 11:23:05,460 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.116e+02 1.317e+02 1.452e+02 1.588e+02 2.065e+02, threshold=2.904e+02, percent-clipped=0.0 2023-12-04 11:23:05,692 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=268400.0, ans=0.125 2023-12-04 11:23:08,702 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=268400.0, ans=0.125 2023-12-04 11:23:13,830 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=268466.6666666667, ans=0.125 2023-12-04 11:23:27,693 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=268500.0, ans=0.0 2023-12-04 11:23:27,882 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=15.0 2023-12-04 11:23:33,063 INFO [train.py:1087] (2/4) Epoch 46, batch 0, loss[loss=0.1441, simple_loss=0.2408, pruned_loss=0.02372, over 24602.00 frames. ], tot_loss[loss=0.1441, simple_loss=0.2408, pruned_loss=0.02372, over 24602.00 frames. ], batch size: 68, lr: 5.36e-03, grad_scale: 32.0 2023-12-04 11:23:33,064 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 11:23:45,365 INFO [train.py:1119] (2/4) Epoch 46, validation: loss=0.1518, simple_loss=0.2501, pruned_loss=0.02668, over 944034.00 frames. 
2023-12-04 11:23:45,366 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 11:23:48,853 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=268500.0, ans=0.0 2023-12-04 11:23:59,226 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=268566.6666666667, ans=0.2 2023-12-04 11:24:16,828 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=268633.3333333333, ans=0.1 2023-12-04 11:24:21,126 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=268700.0, ans=0.2 2023-12-04 11:24:24,706 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. limit=6.0 2023-12-04 11:24:40,865 INFO [train.py:1087] (2/4) Epoch 46, batch 50, loss[loss=0.1517, simple_loss=0.2467, pruned_loss=0.02834, over 24562.00 frames. ], tot_loss[loss=0.1585, simple_loss=0.2509, pruned_loss=0.03308, over 1070673.10 frames. ], batch size: 62, lr: 5.36e-03, grad_scale: 32.0 2023-12-04 11:24:57,681 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.12 vs. limit=15.0 2023-12-04 11:25:20,367 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.149e+02 1.282e+02 1.420e+02 1.605e+02 2.832e+02, threshold=2.839e+02, percent-clipped=0.0 2023-12-04 11:25:28,116 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=269100.0, ans=0.125 2023-12-04 11:25:35,879 INFO [train.py:1087] (2/4) Epoch 46, batch 100, loss[loss=0.1475, simple_loss=0.2378, pruned_loss=0.02855, over 24573.00 frames. ], tot_loss[loss=0.1574, simple_loss=0.2498, pruned_loss=0.03248, over 1893541.89 frames. ], batch size: 65, lr: 5.36e-03, grad_scale: 32.0 2023-12-04 11:26:01,513 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.13 vs. limit=12.0 2023-12-04 11:26:05,054 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=269300.0, ans=0.09899494936611666 2023-12-04 11:26:14,225 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=269366.6666666667, ans=0.0 2023-12-04 11:26:30,659 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 11:26:31,391 INFO [train.py:1087] (2/4) Epoch 46, batch 150, loss[loss=0.1578, simple_loss=0.2502, pruned_loss=0.03268, over 24577.00 frames. ], tot_loss[loss=0.1573, simple_loss=0.2496, pruned_loss=0.03254, over 2528704.38 frames. 
], batch size: 65, lr: 5.35e-03, grad_scale: 32.0 2023-12-04 11:26:35,816 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=269500.0, ans=0.1 2023-12-04 11:26:43,316 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=269566.6666666667, ans=0.125 2023-12-04 11:27:12,677 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.137e+02 1.277e+02 1.340e+02 1.477e+02 1.879e+02, threshold=2.679e+02, percent-clipped=0.0 2023-12-04 11:27:25,393 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0 2023-12-04 11:27:26,063 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=269833.3333333333, ans=0.125 2023-12-04 11:27:26,282 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.59 vs. limit=15.0 2023-12-04 11:27:26,812 INFO [train.py:1087] (2/4) Epoch 46, batch 200, loss[loss=0.1541, simple_loss=0.2453, pruned_loss=0.03147, over 24558.00 frames. ], tot_loss[loss=0.1574, simple_loss=0.2493, pruned_loss=0.03272, over 3022008.97 frames. ], batch size: 62, lr: 5.35e-03, grad_scale: 16.0 2023-12-04 11:27:35,055 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.76 vs. limit=15.0 2023-12-04 11:27:42,577 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=269900.0, ans=0.0 2023-12-04 11:27:52,857 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=269966.6666666667, ans=0.0 2023-12-04 11:28:10,569 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=270100.0, ans=0.125 2023-12-04 11:28:16,156 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=270100.0, ans=0.125 2023-12-04 11:28:22,404 INFO [train.py:1087] (2/4) Epoch 46, batch 250, loss[loss=0.1563, simple_loss=0.2499, pruned_loss=0.03133, over 24267.00 frames. ], tot_loss[loss=0.1571, simple_loss=0.2491, pruned_loss=0.03254, over 3424966.07 frames. 
], batch size: 79, lr: 5.35e-03, grad_scale: 16.0 2023-12-04 11:28:22,666 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=270166.6666666667, ans=0.0 2023-12-04 11:28:24,728 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=270166.6666666667, ans=0.125 2023-12-04 11:28:35,465 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=270233.3333333333, ans=0.5 2023-12-04 11:28:36,802 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=270233.3333333333, ans=0.1 2023-12-04 11:28:50,406 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=270300.0, ans=0.0 2023-12-04 11:28:54,148 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=15.0 2023-12-04 11:29:00,097 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=270366.6666666667, ans=0.1 2023-12-04 11:29:03,030 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.136e+02 1.293e+02 1.389e+02 1.495e+02 1.860e+02, threshold=2.779e+02, percent-clipped=0.0 2023-12-04 11:29:14,816 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=270433.3333333333, ans=0.125 2023-12-04 11:29:15,832 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=270433.3333333333, ans=10.0 2023-12-04 11:29:18,224 INFO [train.py:1087] (2/4) Epoch 46, batch 300, loss[loss=0.1471, simple_loss=0.2396, pruned_loss=0.02727, over 24553.00 frames. ], tot_loss[loss=0.1578, simple_loss=0.2497, pruned_loss=0.03301, over 3728832.19 frames. ], batch size: 63, lr: 5.34e-03, grad_scale: 16.0 2023-12-04 11:29:26,973 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=270500.0, ans=0.04949747468305833 2023-12-04 11:29:35,759 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=270566.6666666667, ans=0.04949747468305833 2023-12-04 11:29:37,965 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.74 vs. limit=15.0 2023-12-04 11:29:39,151 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.89 vs. limit=15.0 2023-12-04 11:29:51,008 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=270700.0, ans=0.0 2023-12-04 11:29:57,953 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.60 vs. limit=15.0 2023-12-04 11:30:08,211 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=270766.6666666667, ans=0.0 2023-12-04 11:30:08,523 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.58 vs. 
limit=15.0 2023-12-04 11:30:12,597 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=270833.3333333333, ans=0.0 2023-12-04 11:30:13,355 INFO [train.py:1087] (2/4) Epoch 46, batch 350, loss[loss=0.1467, simple_loss=0.243, pruned_loss=0.02522, over 24714.00 frames. ], tot_loss[loss=0.1581, simple_loss=0.2498, pruned_loss=0.03314, over 3958130.99 frames. ], batch size: 67, lr: 5.34e-03, grad_scale: 16.0 2023-12-04 11:30:17,604 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=270833.3333333333, ans=0.125 2023-12-04 11:30:39,412 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=270966.6666666667, ans=0.0 2023-12-04 11:30:39,849 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.73 vs. limit=22.5 2023-12-04 11:30:42,693 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=270966.6666666667, ans=0.025 2023-12-04 11:30:43,615 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=270966.6666666667, ans=0.1 2023-12-04 11:30:49,579 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=271033.3333333333, ans=0.0 2023-12-04 11:30:54,289 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=271033.3333333333, ans=0.0 2023-12-04 11:30:54,998 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.070e+02 1.318e+02 1.456e+02 1.567e+02 1.885e+02, threshold=2.913e+02, percent-clipped=0.0 2023-12-04 11:31:08,711 INFO [train.py:1087] (2/4) Epoch 46, batch 400, loss[loss=0.1504, simple_loss=0.2395, pruned_loss=0.03062, over 24455.00 frames. ], tot_loss[loss=0.1583, simple_loss=0.25, pruned_loss=0.03334, over 4127861.84 frames. ], batch size: 77, lr: 5.34e-03, grad_scale: 32.0 2023-12-04 11:31:11,106 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=271166.6666666667, ans=0.0 2023-12-04 11:31:18,801 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=271233.3333333333, ans=0.125 2023-12-04 11:31:36,832 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=271300.0, ans=0.0 2023-12-04 11:31:45,638 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=271366.6666666667, ans=0.0 2023-12-04 11:31:57,029 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=271433.3333333333, ans=0.125 2023-12-04 11:32:04,244 INFO [train.py:1087] (2/4) Epoch 46, batch 450, loss[loss=0.1509, simple_loss=0.2452, pruned_loss=0.02828, over 24611.00 frames. ], tot_loss[loss=0.1577, simple_loss=0.2496, pruned_loss=0.03295, over 4277689.21 frames. 
], batch size: 68, lr: 5.33e-03, grad_scale: 32.0 2023-12-04 11:32:16,640 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=271566.6666666667, ans=0.125 2023-12-04 11:32:18,986 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.74 vs. limit=15.0 2023-12-04 11:32:31,552 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=271633.3333333333, ans=0.125 2023-12-04 11:32:44,507 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=271700.0, ans=0.125 2023-12-04 11:32:45,216 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.132e+02 1.332e+02 1.442e+02 1.626e+02 2.367e+02, threshold=2.885e+02, percent-clipped=0.0 2023-12-04 11:32:45,542 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=271700.0, ans=0.1 2023-12-04 11:33:00,603 INFO [train.py:1087] (2/4) Epoch 46, batch 500, loss[loss=0.1543, simple_loss=0.2456, pruned_loss=0.0315, over 24777.00 frames. ], tot_loss[loss=0.1578, simple_loss=0.2497, pruned_loss=0.03293, over 4404024.54 frames. ], batch size: 73, lr: 5.33e-03, grad_scale: 32.0 2023-12-04 11:33:28,895 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.90 vs. limit=6.0 2023-12-04 11:33:56,367 INFO [train.py:1087] (2/4) Epoch 46, batch 550, loss[loss=0.1496, simple_loss=0.2412, pruned_loss=0.02895, over 24762.00 frames. ], tot_loss[loss=0.1577, simple_loss=0.2496, pruned_loss=0.03286, over 4489968.37 frames. ], batch size: 65, lr: 5.33e-03, grad_scale: 32.0 2023-12-04 11:34:00,144 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=272166.6666666667, ans=0.0 2023-12-04 11:34:11,621 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=272233.3333333333, ans=0.125 2023-12-04 11:34:15,853 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=272233.3333333333, ans=0.1 2023-12-04 11:34:29,185 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=272366.6666666667, ans=0.0 2023-12-04 11:34:31,702 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=272366.6666666667, ans=0.125 2023-12-04 11:34:32,644 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=272366.6666666667, ans=0.0 2023-12-04 11:34:38,246 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.085e+02 1.294e+02 1.384e+02 1.497e+02 2.316e+02, threshold=2.768e+02, percent-clipped=0.0 2023-12-04 11:34:52,067 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.60 vs. limit=6.0 2023-12-04 11:34:52,503 INFO [train.py:1087] (2/4) Epoch 46, batch 600, loss[loss=0.1683, simple_loss=0.2596, pruned_loss=0.03846, over 23999.00 frames. ], tot_loss[loss=0.1581, simple_loss=0.2499, pruned_loss=0.03312, over 4557215.18 frames. 
], batch size: 87, lr: 5.32e-03, grad_scale: 32.0 2023-12-04 11:34:53,860 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=272500.0, ans=0.125 2023-12-04 11:34:55,413 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=272500.0, ans=0.125 2023-12-04 11:35:16,569 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=272633.3333333333, ans=0.1 2023-12-04 11:35:19,658 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=272633.3333333333, ans=0.125 2023-12-04 11:35:27,665 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=272700.0, ans=0.125 2023-12-04 11:35:29,702 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=272700.0, ans=0.1 2023-12-04 11:35:30,655 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=272700.0, ans=0.125 2023-12-04 11:35:48,564 INFO [train.py:1087] (2/4) Epoch 46, batch 650, loss[loss=0.1543, simple_loss=0.249, pruned_loss=0.02977, over 24782.00 frames. ], tot_loss[loss=0.1581, simple_loss=0.25, pruned_loss=0.03313, over 4621988.67 frames. ], batch size: 73, lr: 5.32e-03, grad_scale: 32.0 2023-12-04 11:35:53,097 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=272833.3333333333, ans=0.1 2023-12-04 11:35:55,113 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=272833.3333333333, ans=0.0 2023-12-04 11:35:55,422 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.15 vs. limit=12.0 2023-12-04 11:36:01,864 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=272900.0, ans=0.125 2023-12-04 11:36:23,539 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-12-04 11:36:29,607 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.130e+02 1.280e+02 1.369e+02 1.496e+02 2.731e+02, threshold=2.737e+02, percent-clipped=0.0 2023-12-04 11:36:44,167 INFO [train.py:1087] (2/4) Epoch 46, batch 700, loss[loss=0.1534, simple_loss=0.2433, pruned_loss=0.03178, over 24856.00 frames. ], tot_loss[loss=0.1579, simple_loss=0.25, pruned_loss=0.03295, over 4663166.27 frames. 
], batch size: 68, lr: 5.32e-03, grad_scale: 32.0 2023-12-04 11:37:09,083 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=273300.0, ans=0.2 2023-12-04 11:37:16,982 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=273366.6666666667, ans=0.125 2023-12-04 11:37:17,930 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=273366.6666666667, ans=0.0 2023-12-04 11:37:29,178 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=273433.3333333333, ans=0.125 2023-12-04 11:37:39,929 INFO [train.py:1087] (2/4) Epoch 46, batch 750, loss[loss=0.1532, simple_loss=0.245, pruned_loss=0.03074, over 24478.00 frames. ], tot_loss[loss=0.1579, simple_loss=0.25, pruned_loss=0.03285, over 4698656.27 frames. ], batch size: 77, lr: 5.31e-03, grad_scale: 32.0 2023-12-04 11:37:41,651 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=273500.0, ans=0.2 2023-12-04 11:37:49,746 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.06 vs. limit=22.5 2023-12-04 11:38:07,298 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=273633.3333333333, ans=0.125 2023-12-04 11:38:18,476 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=273700.0, ans=0.125 2023-12-04 11:38:21,511 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.168e+02 1.302e+02 1.405e+02 1.601e+02 2.273e+02, threshold=2.810e+02, percent-clipped=0.0 2023-12-04 11:38:23,931 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=273766.6666666667, ans=0.125 2023-12-04 11:38:28,205 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=273766.6666666667, ans=0.125 2023-12-04 11:38:36,057 INFO [train.py:1087] (2/4) Epoch 46, batch 800, loss[loss=0.1642, simple_loss=0.2581, pruned_loss=0.03518, over 24768.00 frames. ], tot_loss[loss=0.1578, simple_loss=0.2499, pruned_loss=0.03281, over 4732052.97 frames. 
], batch size: 64, lr: 5.31e-03, grad_scale: 32.0 2023-12-04 11:38:36,319 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=273833.3333333333, ans=0.125 2023-12-04 11:38:38,532 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=273833.3333333333, ans=0.2 2023-12-04 11:38:45,664 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=273900.0, ans=0.0 2023-12-04 11:38:50,945 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=273900.0, ans=0.0 2023-12-04 11:38:51,874 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=273900.0, ans=0.2 2023-12-04 11:38:55,027 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=273900.0, ans=0.04949747468305833 2023-12-04 11:39:05,288 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=273966.6666666667, ans=0.125 2023-12-04 11:39:06,261 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=274033.3333333333, ans=0.0 2023-12-04 11:39:14,227 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=274033.3333333333, ans=0.125 2023-12-04 11:39:25,459 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 11:39:27,312 INFO [train.py:1087] (2/4) Epoch 46, batch 850, loss[loss=0.1671, simple_loss=0.2588, pruned_loss=0.03773, over 23478.00 frames. ], tot_loss[loss=0.1574, simple_loss=0.2496, pruned_loss=0.03262, over 4754157.84 frames. ], batch size: 94, lr: 5.31e-03, grad_scale: 32.0 2023-12-04 11:39:32,677 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=274166.6666666667, ans=0.1 2023-12-04 11:39:33,539 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=274166.6666666667, ans=0.0 2023-12-04 11:39:37,548 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=274233.3333333333, ans=0.0 2023-12-04 11:39:38,668 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=274233.3333333333, ans=0.125 2023-12-04 11:39:40,607 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=274233.3333333333, ans=0.125 2023-12-04 11:39:41,021 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.58 vs. limit=22.5 2023-12-04 11:40:00,410 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.10 vs. 
limit=15.0 2023-12-04 11:40:04,861 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.086e+02 1.284e+02 1.352e+02 1.445e+02 2.031e+02, threshold=2.705e+02, percent-clipped=0.0 2023-12-04 11:40:26,890 INFO [train.py:1087] (2/4) Epoch 47, batch 0, loss[loss=0.1438, simple_loss=0.241, pruned_loss=0.02325, over 24802.00 frames. ], tot_loss[loss=0.1438, simple_loss=0.241, pruned_loss=0.02325, over 24802.00 frames. ], batch size: 71, lr: 5.25e-03, grad_scale: 32.0 2023-12-04 11:40:26,890 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 11:40:38,155 INFO [zipformer.py:1876] (2/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.8005, 5.7058, 5.6355, 5.4971], device='cuda:2') 2023-12-04 11:40:39,383 INFO [train.py:1119] (2/4) Epoch 47, validation: loss=0.152, simple_loss=0.2504, pruned_loss=0.0268, over 944034.00 frames. 2023-12-04 11:40:39,383 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 11:40:42,719 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=274466.6666666667, ans=0.1 2023-12-04 11:40:53,815 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=274533.3333333333, ans=0.125 2023-12-04 11:41:17,407 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=274666.6666666667, ans=0.125 2023-12-04 11:41:25,154 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=274733.3333333333, ans=0.0 2023-12-04 11:41:35,061 INFO [train.py:1087] (2/4) Epoch 47, batch 50, loss[loss=0.1559, simple_loss=0.2479, pruned_loss=0.03194, over 24770.00 frames. ], tot_loss[loss=0.1586, simple_loss=0.2513, pruned_loss=0.03293, over 1091422.85 frames. ], batch size: 65, lr: 5.24e-03, grad_scale: 32.0 2023-12-04 11:41:54,061 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=274866.6666666667, ans=0.0 2023-12-04 11:42:10,769 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=275000.0, ans=0.125 2023-12-04 11:42:11,890 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=275000.0, ans=0.025 2023-12-04 11:42:11,936 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=275000.0, ans=0.125 2023-12-04 11:42:21,521 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.272e+02 1.416e+02 1.610e+02 2.646e+02, threshold=2.833e+02, percent-clipped=0.0 2023-12-04 11:42:30,853 INFO [train.py:1087] (2/4) Epoch 47, batch 100, loss[loss=0.1488, simple_loss=0.239, pruned_loss=0.02927, over 24719.00 frames. ], tot_loss[loss=0.1586, simple_loss=0.2508, pruned_loss=0.03318, over 1921885.68 frames. ], batch size: 69, lr: 5.24e-03, grad_scale: 32.0 2023-12-04 11:42:45,033 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.88 vs. 
limit=15.0 2023-12-04 11:42:59,671 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=275266.6666666667, ans=0.5 2023-12-04 11:43:14,356 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=275400.0, ans=0.05 2023-12-04 11:43:24,156 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.72 vs. limit=15.0 2023-12-04 11:43:25,661 INFO [train.py:1087] (2/4) Epoch 47, batch 150, loss[loss=0.1602, simple_loss=0.2495, pruned_loss=0.03548, over 24549.00 frames. ], tot_loss[loss=0.1578, simple_loss=0.25, pruned_loss=0.03279, over 2577746.44 frames. ], batch size: 62, lr: 5.24e-03, grad_scale: 32.0 2023-12-04 11:43:53,011 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=275600.0, ans=0.07 2023-12-04 11:44:04,280 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=275666.6666666667, ans=0.125 2023-12-04 11:44:13,362 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.064e+02 1.279e+02 1.366e+02 1.469e+02 1.889e+02, threshold=2.732e+02, percent-clipped=0.0 2023-12-04 11:44:20,234 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.96 vs. limit=22.5 2023-12-04 11:44:21,880 INFO [train.py:1087] (2/4) Epoch 47, batch 200, loss[loss=0.1694, simple_loss=0.2563, pruned_loss=0.04127, over 24471.00 frames. ], tot_loss[loss=0.158, simple_loss=0.2502, pruned_loss=0.03292, over 3071250.93 frames. ], batch size: 75, lr: 5.23e-03, grad_scale: 16.0 2023-12-04 11:44:36,653 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=275866.6666666667, ans=0.125 2023-12-04 11:44:44,751 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=275933.3333333333, ans=0.125 2023-12-04 11:45:18,579 INFO [train.py:1087] (2/4) Epoch 47, batch 250, loss[loss=0.1963, simple_loss=0.2757, pruned_loss=0.05841, over 16653.00 frames. ], tot_loss[loss=0.1579, simple_loss=0.2501, pruned_loss=0.03279, over 3454360.51 frames. 
], batch size: 177, lr: 5.23e-03, grad_scale: 16.0 2023-12-04 11:45:24,092 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=276133.3333333333, ans=0.125 2023-12-04 11:45:30,606 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=276200.0, ans=0.1 2023-12-04 11:45:54,121 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=276333.3333333333, ans=0.125 2023-12-04 11:46:03,870 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=276400.0, ans=0.025 2023-12-04 11:46:05,730 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.084e+02 1.247e+02 1.326e+02 1.449e+02 2.219e+02, threshold=2.651e+02, percent-clipped=0.0 2023-12-04 11:46:10,691 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=276400.0, ans=0.0 2023-12-04 11:46:14,077 INFO [train.py:1087] (2/4) Epoch 47, batch 300, loss[loss=0.1563, simple_loss=0.2494, pruned_loss=0.03161, over 24786.00 frames. ], tot_loss[loss=0.1583, simple_loss=0.2502, pruned_loss=0.03314, over 3745475.38 frames. ], batch size: 62, lr: 5.23e-03, grad_scale: 16.0 2023-12-04 11:46:25,812 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=276533.3333333333, ans=0.125 2023-12-04 11:46:27,789 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=276533.3333333333, ans=0.1 2023-12-04 11:46:37,349 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=276600.0, ans=0.125 2023-12-04 11:46:40,033 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.67 vs. limit=22.5 2023-12-04 11:46:42,347 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.38 vs. limit=10.0 2023-12-04 11:46:48,563 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=276666.6666666667, ans=0.125 2023-12-04 11:46:58,516 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=276733.3333333333, ans=0.0 2023-12-04 11:47:08,897 INFO [train.py:1087] (2/4) Epoch 47, batch 350, loss[loss=0.1566, simple_loss=0.25, pruned_loss=0.03161, over 24155.00 frames. ], tot_loss[loss=0.158, simple_loss=0.25, pruned_loss=0.03303, over 3983634.59 frames. 
], batch size: 58, lr: 5.23e-03, grad_scale: 16.0 2023-12-04 11:47:25,451 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=276866.6666666667, ans=0.125 2023-12-04 11:47:30,995 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=276933.3333333333, ans=0.125 2023-12-04 11:47:45,156 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=277000.0, ans=0.125 2023-12-04 11:47:57,758 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.133e+02 1.280e+02 1.373e+02 1.456e+02 2.082e+02, threshold=2.745e+02, percent-clipped=0.0 2023-12-04 11:47:59,427 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=15.0 2023-12-04 11:48:03,513 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=277066.6666666667, ans=0.09899494936611666 2023-12-04 11:48:05,381 INFO [train.py:1087] (2/4) Epoch 47, batch 400, loss[loss=0.1431, simple_loss=0.2354, pruned_loss=0.02545, over 24785.00 frames. ], tot_loss[loss=0.1578, simple_loss=0.25, pruned_loss=0.03283, over 4169552.19 frames. ], batch size: 71, lr: 5.22e-03, grad_scale: 32.0 2023-12-04 11:48:10,940 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=277133.3333333333, ans=0.125 2023-12-04 11:48:10,976 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=277133.3333333333, ans=0.1 2023-12-04 11:48:20,116 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=277200.0, ans=0.2 2023-12-04 11:48:25,812 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=277200.0, ans=0.125 2023-12-04 11:48:34,669 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=277266.6666666667, ans=0.1 2023-12-04 11:48:38,464 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=4.98 vs. limit=15.0 2023-12-04 11:48:55,452 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=277400.0, ans=0.2 2023-12-04 11:49:01,946 INFO [train.py:1087] (2/4) Epoch 47, batch 450, loss[loss=0.1961, simple_loss=0.2764, pruned_loss=0.0579, over 17183.00 frames. ], tot_loss[loss=0.1578, simple_loss=0.2497, pruned_loss=0.03296, over 4289380.80 frames. ], batch size: 178, lr: 5.22e-03, grad_scale: 32.0 2023-12-04 11:49:04,933 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-12-04 11:49:06,723 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=277466.6666666667, ans=0.1 2023-12-04 11:49:18,012 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.61 vs. 
limit=15.0 2023-12-04 11:49:18,724 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=277533.3333333333, ans=0.0 2023-12-04 11:49:31,364 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=277600.0, ans=22.5 2023-12-04 11:49:34,715 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=277666.6666666667, ans=0.125 2023-12-04 11:49:43,039 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=277666.6666666667, ans=0.125 2023-12-04 11:49:47,503 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=277733.3333333333, ans=0.0 2023-12-04 11:49:50,810 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.153e+02 1.318e+02 1.415e+02 1.584e+02 2.056e+02, threshold=2.829e+02, percent-clipped=0.0 2023-12-04 11:49:52,248 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=277733.3333333333, ans=0.1 2023-12-04 11:49:54,377 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=277733.3333333333, ans=0.0 2023-12-04 11:49:57,490 INFO [train.py:1087] (2/4) Epoch 47, batch 500, loss[loss=0.1769, simple_loss=0.2681, pruned_loss=0.04284, over 24506.00 frames. ], tot_loss[loss=0.158, simple_loss=0.25, pruned_loss=0.03302, over 4385976.44 frames. ], batch size: 77, lr: 5.22e-03, grad_scale: 16.0 2023-12-04 11:50:14,208 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.16 vs. limit=15.0 2023-12-04 11:50:16,100 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=277866.6666666667, ans=0.1 2023-12-04 11:50:37,413 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=278000.0, ans=0.0 2023-12-04 11:50:53,367 INFO [train.py:1087] (2/4) Epoch 47, batch 550, loss[loss=0.1559, simple_loss=0.251, pruned_loss=0.03037, over 24229.00 frames. ], tot_loss[loss=0.1576, simple_loss=0.2496, pruned_loss=0.03277, over 4477186.66 frames. ], batch size: 82, lr: 5.21e-03, grad_scale: 16.0 2023-12-04 11:51:02,107 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.37 vs. limit=15.0 2023-12-04 11:51:13,547 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=278200.0, ans=0.125 2023-12-04 11:51:42,220 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.155e+02 1.317e+02 1.437e+02 1.661e+02 2.158e+02, threshold=2.873e+02, percent-clipped=0.0 2023-12-04 11:51:49,080 INFO [train.py:1087] (2/4) Epoch 47, batch 600, loss[loss=0.1959, simple_loss=0.2742, pruned_loss=0.05882, over 16377.00 frames. ], tot_loss[loss=0.1575, simple_loss=0.2496, pruned_loss=0.03273, over 4528480.68 frames. 
], batch size: 179, lr: 5.21e-03, grad_scale: 16.0 2023-12-04 11:52:09,809 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=278533.3333333333, ans=0.125 2023-12-04 11:52:25,097 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.75 vs. limit=6.0 2023-12-04 11:52:45,489 INFO [train.py:1087] (2/4) Epoch 47, batch 650, loss[loss=0.1603, simple_loss=0.2587, pruned_loss=0.03099, over 23436.00 frames. ], tot_loss[loss=0.158, simple_loss=0.25, pruned_loss=0.03299, over 4583716.16 frames. ], batch size: 94, lr: 5.21e-03, grad_scale: 16.0 2023-12-04 11:52:57,362 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=278866.6666666667, ans=0.2 2023-12-04 11:53:13,263 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-12-04 11:53:17,298 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=278933.3333333333, ans=0.125 2023-12-04 11:53:23,008 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=279000.0, ans=0.125 2023-12-04 11:53:35,821 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.150e+02 1.314e+02 1.430e+02 1.582e+02 2.935e+02, threshold=2.859e+02, percent-clipped=1.0 2023-12-04 11:53:42,220 INFO [train.py:1087] (2/4) Epoch 47, batch 700, loss[loss=0.1532, simple_loss=0.2419, pruned_loss=0.03226, over 24575.00 frames. ], tot_loss[loss=0.1576, simple_loss=0.2497, pruned_loss=0.03276, over 4638831.24 frames. ], batch size: 64, lr: 5.20e-03, grad_scale: 16.0 2023-12-04 11:53:58,744 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.59 vs. limit=10.0 2023-12-04 11:54:08,493 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=279266.6666666667, ans=0.125 2023-12-04 11:54:30,268 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=279400.0, ans=0.125 2023-12-04 11:54:38,264 INFO [train.py:1087] (2/4) Epoch 47, batch 750, loss[loss=0.1579, simple_loss=0.2499, pruned_loss=0.0329, over 24470.00 frames. ], tot_loss[loss=0.1572, simple_loss=0.2493, pruned_loss=0.03255, over 4686149.67 frames. ], batch size: 77, lr: 5.20e-03, grad_scale: 16.0 2023-12-04 11:55:07,421 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=279600.0, ans=0.125 2023-12-04 11:55:12,071 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=279666.6666666667, ans=0.125 2023-12-04 11:55:14,114 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=279666.6666666667, ans=0.125 2023-12-04 11:55:26,820 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.155e+02 1.316e+02 1.459e+02 1.626e+02 2.190e+02, threshold=2.917e+02, percent-clipped=0.0 2023-12-04 11:55:33,705 INFO [train.py:1087] (2/4) Epoch 47, batch 800, loss[loss=0.1538, simple_loss=0.2423, pruned_loss=0.03263, over 24795.00 frames. 
], tot_loss[loss=0.157, simple_loss=0.2489, pruned_loss=0.03251, over 4714723.95 frames. ], batch size: 71, lr: 5.20e-03, grad_scale: 32.0 2023-12-04 11:55:36,556 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=279800.0, ans=0.2 2023-12-04 11:56:25,381 INFO [train.py:1087] (2/4) Epoch 47, batch 850, loss[loss=0.1534, simple_loss=0.2487, pruned_loss=0.02909, over 24712.00 frames. ], tot_loss[loss=0.1572, simple_loss=0.2492, pruned_loss=0.03256, over 4734090.45 frames. ], batch size: 67, lr: 5.20e-03, grad_scale: 16.0 2023-12-04 11:56:35,445 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=280200.0, ans=0.125 2023-12-04 11:56:58,952 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=280333.3333333333, ans=0.0 2023-12-04 11:56:58,963 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=280333.3333333333, ans=0.0 2023-12-04 11:57:16,983 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=280433.3333333333, ans=0.125 2023-12-04 11:57:19,342 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.062e+02 1.297e+02 1.418e+02 1.577e+02 2.456e+02, threshold=2.835e+02, percent-clipped=0.0 2023-12-04 11:57:19,369 INFO [train.py:1087] (2/4) Epoch 48, batch 0, loss[loss=0.1495, simple_loss=0.2425, pruned_loss=0.0283, over 24756.00 frames. ], tot_loss[loss=0.1495, simple_loss=0.2425, pruned_loss=0.0283, over 24756.00 frames. ], batch size: 70, lr: 5.14e-03, grad_scale: 32.0 2023-12-04 11:57:19,370 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 11:57:31,863 INFO [train.py:1119] (2/4) Epoch 48, validation: loss=0.152, simple_loss=0.2501, pruned_loss=0.02702, over 944034.00 frames. 2023-12-04 11:57:31,863 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 11:58:04,189 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=280633.3333333333, ans=0.125 2023-12-04 11:58:25,292 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=280700.0, ans=0.2 2023-12-04 11:58:27,421 INFO [train.py:1087] (2/4) Epoch 48, batch 50, loss[loss=0.1645, simple_loss=0.2507, pruned_loss=0.03914, over 22687.00 frames. ], tot_loss[loss=0.1571, simple_loss=0.2495, pruned_loss=0.03238, over 1095260.42 frames. ], batch size: 106, lr: 5.13e-03, grad_scale: 32.0 2023-12-04 11:58:28,958 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.31 vs. 
limit=10.0 2023-12-04 11:58:33,857 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=280766.6666666667, ans=0.125 2023-12-04 11:58:50,132 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=280900.0, ans=0.0 2023-12-04 11:58:52,518 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=280900.0, ans=0.07 2023-12-04 11:59:07,495 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=280966.6666666667, ans=10.0 2023-12-04 11:59:14,368 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=281033.3333333333, ans=0.125 2023-12-04 11:59:21,764 INFO [train.py:1087] (2/4) Epoch 48, batch 100, loss[loss=0.1636, simple_loss=0.2569, pruned_loss=0.03513, over 24549.00 frames. ], tot_loss[loss=0.1558, simple_loss=0.2488, pruned_loss=0.03145, over 1930096.41 frames. ], batch size: 63, lr: 5.13e-03, grad_scale: 16.0 2023-12-04 11:59:23,159 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.084e+02 1.273e+02 1.375e+02 1.487e+02 2.008e+02, threshold=2.750e+02, percent-clipped=0.0 2023-12-04 11:59:41,189 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.84 vs. limit=6.0 2023-12-04 11:59:42,921 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=281233.3333333333, ans=0.1 2023-12-04 11:59:51,531 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=281233.3333333333, ans=0.125 2023-12-04 12:00:04,675 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.09 vs. limit=6.0 2023-12-04 12:00:15,099 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=281366.6666666667, ans=0.2 2023-12-04 12:00:17,006 INFO [train.py:1087] (2/4) Epoch 48, batch 150, loss[loss=0.1368, simple_loss=0.232, pruned_loss=0.02083, over 24600.00 frames. ], tot_loss[loss=0.1572, simple_loss=0.2497, pruned_loss=0.03237, over 2555838.95 frames. ], batch size: 68, lr: 5.13e-03, grad_scale: 16.0 2023-12-04 12:00:18,664 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.80 vs. limit=15.0 2023-12-04 12:00:41,376 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=281566.6666666667, ans=0.125 2023-12-04 12:00:45,447 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 12:00:55,172 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=281633.3333333333, ans=0.2 2023-12-04 12:01:13,189 INFO [train.py:1087] (2/4) Epoch 48, batch 200, loss[loss=0.1637, simple_loss=0.255, pruned_loss=0.03618, over 24496.00 frames. ], tot_loss[loss=0.1573, simple_loss=0.2495, pruned_loss=0.03258, over 3045755.23 frames. 
], batch size: 75, lr: 5.13e-03, grad_scale: 16.0 2023-12-04 12:01:13,492 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=281766.6666666667, ans=0.0 2023-12-04 12:01:14,235 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.122e+02 1.241e+02 1.316e+02 1.436e+02 2.359e+02, threshold=2.632e+02, percent-clipped=0.0 2023-12-04 12:01:31,621 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=281833.3333333333, ans=0.125 2023-12-04 12:01:45,384 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=281966.6666666667, ans=0.125 2023-12-04 12:02:08,700 INFO [train.py:1087] (2/4) Epoch 48, batch 250, loss[loss=0.158, simple_loss=0.2519, pruned_loss=0.0321, over 24708.00 frames. ], tot_loss[loss=0.1571, simple_loss=0.2492, pruned_loss=0.03253, over 3449413.81 frames. ], batch size: 69, lr: 5.12e-03, grad_scale: 16.0 2023-12-04 12:02:15,323 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=282100.0, ans=0.125 2023-12-04 12:02:19,675 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=282166.6666666667, ans=0.015 2023-12-04 12:02:19,791 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=282166.6666666667, ans=0.0 2023-12-04 12:02:29,896 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=282233.3333333333, ans=0.125 2023-12-04 12:02:30,218 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-12-04 12:02:49,478 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=282300.0, ans=0.1 2023-12-04 12:03:00,798 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=282366.6666666667, ans=0.0 2023-12-04 12:03:03,798 INFO [train.py:1087] (2/4) Epoch 48, batch 300, loss[loss=0.1558, simple_loss=0.2522, pruned_loss=0.02965, over 24695.00 frames. ], tot_loss[loss=0.1569, simple_loss=0.2491, pruned_loss=0.03239, over 3758340.56 frames. ], batch size: 74, lr: 5.12e-03, grad_scale: 16.0 2023-12-04 12:03:05,185 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.119e+02 1.347e+02 1.460e+02 1.610e+02 2.169e+02, threshold=2.920e+02, percent-clipped=0.0 2023-12-04 12:03:11,020 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=282433.3333333333, ans=0.0 2023-12-04 12:03:11,094 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=282433.3333333333, ans=0.0 2023-12-04 12:03:27,428 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.69 vs. 
limit=15.0 2023-12-04 12:03:31,475 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=282566.6666666667, ans=0.1 2023-12-04 12:03:36,328 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.48 vs. limit=22.5 2023-12-04 12:03:37,451 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=282633.3333333333, ans=0.2 2023-12-04 12:03:37,673 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=15.0 2023-12-04 12:03:47,587 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=282700.0, ans=0.0 2023-12-04 12:03:59,248 INFO [train.py:1087] (2/4) Epoch 48, batch 350, loss[loss=0.1729, simple_loss=0.268, pruned_loss=0.03887, over 22830.00 frames. ], tot_loss[loss=0.1566, simple_loss=0.2488, pruned_loss=0.03217, over 3995776.69 frames. ], batch size: 106, lr: 5.12e-03, grad_scale: 16.0 2023-12-04 12:04:06,066 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=282766.6666666667, ans=0.0 2023-12-04 12:04:20,489 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=282900.0, ans=0.125 2023-12-04 12:04:51,735 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.70 vs. limit=6.0 2023-12-04 12:04:54,352 INFO [train.py:1087] (2/4) Epoch 48, batch 400, loss[loss=0.1512, simple_loss=0.243, pruned_loss=0.02971, over 24786.00 frames. ], tot_loss[loss=0.1569, simple_loss=0.2491, pruned_loss=0.03234, over 4178529.74 frames. ], batch size: 73, lr: 5.11e-03, grad_scale: 32.0 2023-12-04 12:04:55,400 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.124e+02 1.259e+02 1.354e+02 1.462e+02 2.607e+02, threshold=2.709e+02, percent-clipped=0.0 2023-12-04 12:05:08,192 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=283166.6666666667, ans=0.125 2023-12-04 12:05:09,136 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=283166.6666666667, ans=0.0 2023-12-04 12:05:44,595 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.39 vs. limit=6.0 2023-12-04 12:05:46,115 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=283366.6666666667, ans=0.125 2023-12-04 12:05:50,079 INFO [train.py:1087] (2/4) Epoch 48, batch 450, loss[loss=0.1642, simple_loss=0.257, pruned_loss=0.03575, over 24554.00 frames. ], tot_loss[loss=0.1573, simple_loss=0.2494, pruned_loss=0.03259, over 4301221.72 frames. 
], batch size: 62, lr: 5.11e-03, grad_scale: 16.0 2023-12-04 12:06:08,676 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=283500.0, ans=0.2 2023-12-04 12:06:27,491 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=283633.3333333333, ans=0.125 2023-12-04 12:06:45,302 INFO [train.py:1087] (2/4) Epoch 48, batch 500, loss[loss=0.146, simple_loss=0.2371, pruned_loss=0.02741, over 24794.00 frames. ], tot_loss[loss=0.1569, simple_loss=0.2489, pruned_loss=0.03244, over 4429279.13 frames. ], batch size: 62, lr: 5.11e-03, grad_scale: 16.0 2023-12-04 12:06:47,388 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.120e+02 1.281e+02 1.349e+02 1.438e+02 2.094e+02, threshold=2.698e+02, percent-clipped=0.0 2023-12-04 12:07:04,474 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=283833.3333333333, ans=0.125 2023-12-04 12:07:23,815 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=283966.6666666667, ans=0.95 2023-12-04 12:07:24,951 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=283966.6666666667, ans=0.07 2023-12-04 12:07:39,632 INFO [train.py:1087] (2/4) Epoch 48, batch 550, loss[loss=0.1509, simple_loss=0.2469, pruned_loss=0.02747, over 24688.00 frames. ], tot_loss[loss=0.1571, simple_loss=0.2489, pruned_loss=0.03262, over 4506074.83 frames. ], batch size: 74, lr: 5.11e-03, grad_scale: 16.0 2023-12-04 12:07:39,969 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=284100.0, ans=0.2 2023-12-04 12:07:47,015 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=284100.0, ans=0.0 2023-12-04 12:07:50,816 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=284166.6666666667, ans=0.125 2023-12-04 12:08:10,174 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=284233.3333333333, ans=0.125 2023-12-04 12:08:35,658 INFO [train.py:1087] (2/4) Epoch 48, batch 600, loss[loss=0.1524, simple_loss=0.2459, pruned_loss=0.02943, over 24784.00 frames. ], tot_loss[loss=0.1575, simple_loss=0.2493, pruned_loss=0.03288, over 4572578.10 frames. ], batch size: 70, lr: 5.10e-03, grad_scale: 16.0 2023-12-04 12:08:36,214 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.55 vs. 
limit=15.0 2023-12-04 12:08:37,852 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.118e+02 1.324e+02 1.469e+02 1.698e+02 2.383e+02, threshold=2.939e+02, percent-clipped=0.0 2023-12-04 12:08:40,131 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=284433.3333333333, ans=0.125 2023-12-04 12:08:48,622 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=284500.0, ans=0.04949747468305833 2023-12-04 12:08:54,144 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-04 12:08:56,976 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.89 vs. limit=15.0 2023-12-04 12:09:08,213 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=284633.3333333333, ans=0.125 2023-12-04 12:09:31,650 INFO [train.py:1087] (2/4) Epoch 48, batch 650, loss[loss=0.1474, simple_loss=0.2407, pruned_loss=0.02711, over 24694.00 frames. ], tot_loss[loss=0.1571, simple_loss=0.249, pruned_loss=0.03258, over 4627997.10 frames. ], batch size: 69, lr: 5.10e-03, grad_scale: 16.0 2023-12-04 12:09:52,048 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.26 vs. limit=15.0 2023-12-04 12:10:27,166 INFO [train.py:1087] (2/4) Epoch 48, batch 700, loss[loss=0.1545, simple_loss=0.2504, pruned_loss=0.02927, over 24791.00 frames. ], tot_loss[loss=0.1567, simple_loss=0.2488, pruned_loss=0.03226, over 4676418.20 frames. ], batch size: 71, lr: 5.10e-03, grad_scale: 16.0 2023-12-04 12:10:29,339 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.149e+02 1.347e+02 1.589e+02 1.708e+02 2.161e+02, threshold=3.178e+02, percent-clipped=0.0 2023-12-04 12:10:36,273 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=285100.0, ans=0.0 2023-12-04 12:11:14,952 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=285366.6666666667, ans=0.05 2023-12-04 12:11:17,977 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=285366.6666666667, ans=0.125 2023-12-04 12:11:22,115 INFO [train.py:1087] (2/4) Epoch 48, batch 750, loss[loss=0.16, simple_loss=0.2498, pruned_loss=0.03512, over 24850.00 frames. ], tot_loss[loss=0.1564, simple_loss=0.2485, pruned_loss=0.03209, over 4711467.43 frames. 
], batch size: 68, lr: 5.09e-03, grad_scale: 16.0 2023-12-04 12:11:25,915 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=285433.3333333333, ans=0.0 2023-12-04 12:11:34,799 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=285500.0, ans=0.125 2023-12-04 12:11:39,205 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=285500.0, ans=0.125 2023-12-04 12:11:49,887 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=285566.6666666667, ans=0.1 2023-12-04 12:11:55,607 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=285633.3333333333, ans=0.1 2023-12-04 12:12:14,909 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.36 vs. limit=22.5 2023-12-04 12:12:17,407 INFO [train.py:1087] (2/4) Epoch 48, batch 800, loss[loss=0.1491, simple_loss=0.2424, pruned_loss=0.02787, over 24716.00 frames. ], tot_loss[loss=0.1563, simple_loss=0.2485, pruned_loss=0.03204, over 4737210.97 frames. ], batch size: 69, lr: 5.09e-03, grad_scale: 32.0 2023-12-04 12:12:18,699 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=285766.6666666667, ans=0.0 2023-12-04 12:12:19,544 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.142e+02 1.291e+02 1.368e+02 1.474e+02 1.788e+02, threshold=2.736e+02, percent-clipped=0.0 2023-12-04 12:12:21,272 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.83 vs. limit=10.0 2023-12-04 12:12:34,048 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=285833.3333333333, ans=0.0 2023-12-04 12:12:37,087 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=285833.3333333333, ans=0.04949747468305833 2023-12-04 12:12:40,093 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=285900.0, ans=0.125 2023-12-04 12:12:42,339 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.94 vs. limit=10.0 2023-12-04 12:12:45,256 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=285900.0, ans=0.025 2023-12-04 12:12:49,248 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=285966.6666666667, ans=0.0 2023-12-04 12:12:49,758 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.35 vs. limit=15.0 2023-12-04 12:13:09,257 INFO [train.py:1087] (2/4) Epoch 48, batch 850, loss[loss=0.1494, simple_loss=0.2456, pruned_loss=0.02659, over 24728.00 frames. ], tot_loss[loss=0.1563, simple_loss=0.2485, pruned_loss=0.03203, over 4766958.47 frames. 
], batch size: 67, lr: 5.09e-03, grad_scale: 16.0 2023-12-04 12:13:10,553 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=286100.0, ans=0.1 2023-12-04 12:13:29,723 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=286233.3333333333, ans=0.0 2023-12-04 12:13:47,168 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.05 vs. limit=15.0 2023-12-04 12:14:06,581 INFO [train.py:1087] (2/4) Epoch 49, batch 0, loss[loss=0.1588, simple_loss=0.2524, pruned_loss=0.03257, over 24026.00 frames. ], tot_loss[loss=0.1588, simple_loss=0.2524, pruned_loss=0.03257, over 24026.00 frames. ], batch size: 87, lr: 5.03e-03, grad_scale: 32.0 2023-12-04 12:14:06,581 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 12:14:18,645 INFO [train.py:1119] (2/4) Epoch 49, validation: loss=0.1515, simple_loss=0.2498, pruned_loss=0.02665, over 944034.00 frames. 2023-12-04 12:14:18,646 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 12:14:26,504 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.37 vs. limit=15.0 2023-12-04 12:14:27,056 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.090e+02 1.277e+02 1.377e+02 1.551e+02 2.556e+02, threshold=2.753e+02, percent-clipped=0.0 2023-12-04 12:14:27,594 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.98 vs. limit=15.0 2023-12-04 12:14:31,608 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=286466.6666666667, ans=0.0 2023-12-04 12:14:32,580 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=286466.6666666667, ans=0.0 2023-12-04 12:14:37,968 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-12-04 12:14:45,530 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=286533.3333333333, ans=0.0 2023-12-04 12:14:52,946 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=286600.0, ans=0.0 2023-12-04 12:15:11,285 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=286666.6666666667, ans=0.0 2023-12-04 12:15:14,290 INFO [train.py:1087] (2/4) Epoch 49, batch 50, loss[loss=0.1556, simple_loss=0.2531, pruned_loss=0.0291, over 24776.00 frames. ], tot_loss[loss=0.1554, simple_loss=0.2487, pruned_loss=0.03102, over 1102221.66 frames. ], batch size: 73, lr: 5.03e-03, grad_scale: 32.0 2023-12-04 12:15:25,727 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=286800.0, ans=0.1 2023-12-04 12:16:07,842 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.72 vs. limit=22.5 2023-12-04 12:16:09,401 INFO [train.py:1087] (2/4) Epoch 49, batch 100, loss[loss=0.1546, simple_loss=0.2438, pruned_loss=0.03273, over 24754.00 frames. 
], tot_loss[loss=0.156, simple_loss=0.2492, pruned_loss=0.03139, over 1933279.01 frames. ], batch size: 66, lr: 5.03e-03, grad_scale: 32.0 2023-12-04 12:16:12,597 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=287066.6666666667, ans=0.05 2023-12-04 12:16:18,633 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.082e+02 1.265e+02 1.369e+02 1.471e+02 2.016e+02, threshold=2.739e+02, percent-clipped=0.0 2023-12-04 12:17:04,960 INFO [train.py:1087] (2/4) Epoch 49, batch 150, loss[loss=0.1552, simple_loss=0.2421, pruned_loss=0.03418, over 24461.00 frames. ], tot_loss[loss=0.1569, simple_loss=0.2495, pruned_loss=0.03208, over 2570878.97 frames. ], batch size: 77, lr: 5.02e-03, grad_scale: 16.0 2023-12-04 12:17:30,471 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=287533.3333333333, ans=0.125 2023-12-04 12:17:53,128 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 12:17:55,526 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.64 vs. limit=15.0 2023-12-04 12:18:00,537 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=287733.3333333333, ans=0.1 2023-12-04 12:18:01,286 INFO [train.py:1087] (2/4) Epoch 49, batch 200, loss[loss=0.1613, simple_loss=0.2518, pruned_loss=0.03536, over 24578.00 frames. ], tot_loss[loss=0.1571, simple_loss=0.2495, pruned_loss=0.03234, over 3064940.60 frames. ], batch size: 64, lr: 5.02e-03, grad_scale: 16.0 2023-12-04 12:18:11,197 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.073e+02 1.291e+02 1.371e+02 1.476e+02 1.985e+02, threshold=2.742e+02, percent-clipped=0.0 2023-12-04 12:18:31,850 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=287866.6666666667, ans=0.1 2023-12-04 12:18:42,275 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.03 vs. limit=15.0 2023-12-04 12:18:50,687 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=288000.0, ans=0.125 2023-12-04 12:18:54,951 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=288000.0, ans=0.125 2023-12-04 12:18:57,118 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=288066.6666666667, ans=0.5 2023-12-04 12:18:57,952 INFO [train.py:1087] (2/4) Epoch 49, batch 250, loss[loss=0.1631, simple_loss=0.256, pruned_loss=0.03509, over 21576.00 frames. ], tot_loss[loss=0.1568, simple_loss=0.2493, pruned_loss=0.03219, over 3463744.20 frames. 
], batch size: 128, lr: 5.02e-03, grad_scale: 16.0 2023-12-04 12:19:09,912 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=288133.3333333333, ans=0.0 2023-12-04 12:19:17,817 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=288133.3333333333, ans=0.95 2023-12-04 12:19:38,972 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=288266.6666666667, ans=0.0 2023-12-04 12:19:46,827 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=288333.3333333333, ans=0.125 2023-12-04 12:19:53,308 INFO [train.py:1087] (2/4) Epoch 49, batch 300, loss[loss=0.1586, simple_loss=0.2489, pruned_loss=0.03414, over 24775.00 frames. ], tot_loss[loss=0.1571, simple_loss=0.2496, pruned_loss=0.03229, over 3759487.04 frames. ], batch size: 61, lr: 5.02e-03, grad_scale: 16.0 2023-12-04 12:20:03,360 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.110e+02 1.319e+02 1.446e+02 1.575e+02 3.932e+02, threshold=2.892e+02, percent-clipped=1.0 2023-12-04 12:20:07,348 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-12-04 12:20:48,944 INFO [train.py:1087] (2/4) Epoch 49, batch 350, loss[loss=0.1476, simple_loss=0.2411, pruned_loss=0.02709, over 24562.00 frames. ], tot_loss[loss=0.1569, simple_loss=0.2493, pruned_loss=0.03224, over 3970638.59 frames. ], batch size: 63, lr: 5.01e-03, grad_scale: 16.0 2023-12-04 12:20:51,557 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=288733.3333333333, ans=0.125 2023-12-04 12:20:57,806 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=288733.3333333333, ans=0.125 2023-12-04 12:21:04,292 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=288800.0, ans=0.0 2023-12-04 12:21:11,119 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.88 vs. limit=15.0 2023-12-04 12:21:13,155 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.96 vs. limit=15.0 2023-12-04 12:21:22,837 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.37 vs. limit=22.5 2023-12-04 12:21:38,374 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.94 vs. limit=12.0 2023-12-04 12:21:39,231 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=289000.0, ans=0.0 2023-12-04 12:21:45,360 INFO [train.py:1087] (2/4) Epoch 49, batch 400, loss[loss=0.1574, simple_loss=0.2469, pruned_loss=0.03394, over 24558.00 frames. ], tot_loss[loss=0.1571, simple_loss=0.2495, pruned_loss=0.03235, over 4156175.01 frames. 
], batch size: 63, lr: 5.01e-03, grad_scale: 16.0 2023-12-04 12:21:45,696 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=289066.6666666667, ans=0.0 2023-12-04 12:21:56,374 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.071e+02 1.290e+02 1.383e+02 1.499e+02 2.166e+02, threshold=2.767e+02, percent-clipped=0.0 2023-12-04 12:22:03,027 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.53 vs. limit=15.0 2023-12-04 12:22:11,302 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=289200.0, ans=0.125 2023-12-04 12:22:17,678 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=289266.6666666667, ans=0.0 2023-12-04 12:22:41,510 INFO [train.py:1087] (2/4) Epoch 49, batch 450, loss[loss=0.1679, simple_loss=0.2569, pruned_loss=0.03949, over 24473.00 frames. ], tot_loss[loss=0.1567, simple_loss=0.249, pruned_loss=0.0322, over 4310308.71 frames. ], batch size: 77, lr: 5.01e-03, grad_scale: 16.0 2023-12-04 12:22:41,684 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=289400.0, ans=0.0 2023-12-04 12:22:47,266 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.75 vs. limit=15.0 2023-12-04 12:22:59,898 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=289466.6666666667, ans=0.125 2023-12-04 12:23:05,321 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=289533.3333333333, ans=0.0 2023-12-04 12:23:10,005 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.64 vs. limit=15.0 2023-12-04 12:23:31,179 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.47 vs. limit=15.0 2023-12-04 12:23:31,323 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.14 vs. limit=15.0 2023-12-04 12:23:37,220 INFO [train.py:1087] (2/4) Epoch 49, batch 500, loss[loss=0.1574, simple_loss=0.2545, pruned_loss=0.03012, over 24578.00 frames. ], tot_loss[loss=0.157, simple_loss=0.2491, pruned_loss=0.03243, over 4402994.15 frames. 
], batch size: 64, lr: 5.00e-03, grad_scale: 16.0 2023-12-04 12:23:41,676 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=289733.3333333333, ans=0.125 2023-12-04 12:23:47,868 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.118e+02 1.299e+02 1.421e+02 1.527e+02 2.302e+02, threshold=2.842e+02, percent-clipped=0.0 2023-12-04 12:24:02,558 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=289866.6666666667, ans=0.125 2023-12-04 12:24:06,537 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=289866.6666666667, ans=0.125 2023-12-04 12:24:19,639 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=289933.3333333333, ans=0.125 2023-12-04 12:24:27,287 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=290000.0, ans=0.0 2023-12-04 12:24:31,373 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=290066.6666666667, ans=0.07 2023-12-04 12:24:32,216 INFO [train.py:1087] (2/4) Epoch 49, batch 550, loss[loss=0.1597, simple_loss=0.2536, pruned_loss=0.03293, over 24548.00 frames. ], tot_loss[loss=0.1573, simple_loss=0.2494, pruned_loss=0.03257, over 4489831.27 frames. ], batch size: 63, lr: 5.00e-03, grad_scale: 16.0 2023-12-04 12:24:36,448 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=290066.6666666667, ans=0.2 2023-12-04 12:24:36,471 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=290066.6666666667, ans=0.2 2023-12-04 12:24:45,796 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.85 vs. limit=22.5 2023-12-04 12:24:54,172 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=290200.0, ans=0.0 2023-12-04 12:25:01,668 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 12:25:03,862 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=290200.0, ans=0.025 2023-12-04 12:25:10,696 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.41 vs. limit=15.0 2023-12-04 12:25:11,701 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.75 vs. limit=15.0 2023-12-04 12:25:12,454 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=290266.6666666667, ans=0.0 2023-12-04 12:25:13,671 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.87 vs. limit=15.0 2023-12-04 12:25:28,632 INFO [train.py:1087] (2/4) Epoch 49, batch 600, loss[loss=0.1591, simple_loss=0.2512, pruned_loss=0.03348, over 24838.00 frames. ], tot_loss[loss=0.157, simple_loss=0.2492, pruned_loss=0.03239, over 4563874.33 frames. 
], batch size: 68, lr: 5.00e-03, grad_scale: 16.0 2023-12-04 12:25:40,497 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.142e+02 1.365e+02 1.448e+02 1.602e+02 1.913e+02, threshold=2.897e+02, percent-clipped=0.0 2023-12-04 12:26:01,396 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 12:26:04,707 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.20 vs. limit=12.0 2023-12-04 12:26:06,568 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=290600.0, ans=0.0 2023-12-04 12:26:15,547 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=290666.6666666667, ans=0.2 2023-12-04 12:26:22,309 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=290666.6666666667, ans=0.125 2023-12-04 12:26:25,382 INFO [train.py:1087] (2/4) Epoch 49, batch 650, loss[loss=0.1573, simple_loss=0.2522, pruned_loss=0.03113, over 24745.00 frames. ], tot_loss[loss=0.1567, simple_loss=0.2489, pruned_loss=0.03222, over 4624143.75 frames. ], batch size: 61, lr: 5.00e-03, grad_scale: 16.0 2023-12-04 12:26:26,627 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=290733.3333333333, ans=0.0 2023-12-04 12:26:31,957 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=290733.3333333333, ans=0.0 2023-12-04 12:26:53,692 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=290866.6666666667, ans=0.125 2023-12-04 12:27:04,976 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=290933.3333333333, ans=0.05 2023-12-04 12:27:08,289 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=290933.3333333333, ans=0.125 2023-12-04 12:27:18,273 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=291000.0, ans=0.0 2023-12-04 12:27:19,218 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=291000.0, ans=0.1 2023-12-04 12:27:22,244 INFO [train.py:1087] (2/4) Epoch 49, batch 700, loss[loss=0.1467, simple_loss=0.2376, pruned_loss=0.02786, over 24776.00 frames. ], tot_loss[loss=0.1564, simple_loss=0.2487, pruned_loss=0.03206, over 4677200.54 frames. ], batch size: 71, lr: 4.99e-03, grad_scale: 16.0 2023-12-04 12:27:26,176 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0 2023-12-04 12:27:32,385 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=291133.3333333333, ans=0.125 2023-12-04 12:27:33,326 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.085e+02 1.253e+02 1.366e+02 1.468e+02 1.908e+02, threshold=2.731e+02, percent-clipped=0.0 2023-12-04 12:27:39,200 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.90 vs. 
limit=10.0 2023-12-04 12:27:41,902 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=291133.3333333333, ans=0.1 2023-12-04 12:27:59,353 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.02 vs. limit=15.0 2023-12-04 12:28:17,996 INFO [train.py:1087] (2/4) Epoch 49, batch 750, loss[loss=0.1634, simple_loss=0.2541, pruned_loss=0.03631, over 24793.00 frames. ], tot_loss[loss=0.1565, simple_loss=0.2487, pruned_loss=0.03211, over 4710692.20 frames. ], batch size: 62, lr: 4.99e-03, grad_scale: 16.0 2023-12-04 12:28:25,082 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=291400.0, ans=0.125 2023-12-04 12:28:37,734 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.77 vs. limit=15.0 2023-12-04 12:28:53,413 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=291600.0, ans=0.1 2023-12-04 12:29:00,792 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=291600.0, ans=0.125 2023-12-04 12:29:09,885 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=291666.6666666667, ans=0.95 2023-12-04 12:29:13,774 INFO [train.py:1087] (2/4) Epoch 49, batch 800, loss[loss=0.1593, simple_loss=0.2566, pruned_loss=0.03103, over 24858.00 frames. ], tot_loss[loss=0.1568, simple_loss=0.2489, pruned_loss=0.03232, over 4724459.12 frames. ], batch size: 68, lr: 4.99e-03, grad_scale: 32.0 2023-12-04 12:29:24,827 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.099e+02 1.268e+02 1.351e+02 1.470e+02 2.062e+02, threshold=2.703e+02, percent-clipped=0.0 2023-12-04 12:29:30,375 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=291800.0, ans=0.2 2023-12-04 12:29:36,876 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.59 vs. limit=22.5 2023-12-04 12:29:48,602 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=291933.3333333333, ans=0.125 2023-12-04 12:30:05,890 INFO [train.py:1087] (2/4) Epoch 49, batch 850, loss[loss=0.154, simple_loss=0.2503, pruned_loss=0.02891, over 24583.00 frames. ], tot_loss[loss=0.1569, simple_loss=0.2489, pruned_loss=0.03244, over 4730350.86 frames. 
], batch size: 65, lr: 4.98e-03, grad_scale: 32.0 2023-12-04 12:30:07,032 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=292066.6666666667, ans=0.5 2023-12-04 12:30:08,111 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=292066.6666666667, ans=0.125 2023-12-04 12:30:18,488 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=292133.3333333333, ans=0.05 2023-12-04 12:30:43,938 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=292266.6666666667, ans=0.125 2023-12-04 12:31:04,582 INFO [train.py:1087] (2/4) Epoch 50, batch 0, loss[loss=0.1605, simple_loss=0.2523, pruned_loss=0.03432, over 24762.00 frames. ], tot_loss[loss=0.1605, simple_loss=0.2523, pruned_loss=0.03432, over 24762.00 frames. ], batch size: 70, lr: 4.93e-03, grad_scale: 32.0 2023-12-04 12:31:04,583 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 12:31:15,404 INFO [zipformer.py:1876] (2/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([3.3639, 2.3879, 2.8389, 3.0777], device='cuda:2') 2023-12-04 12:31:16,958 INFO [train.py:1119] (2/4) Epoch 50, validation: loss=0.1516, simple_loss=0.2496, pruned_loss=0.02681, over 944034.00 frames. 2023-12-04 12:31:16,959 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 12:31:18,651 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.54 vs. limit=15.0 2023-12-04 12:31:32,638 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.104e+02 1.261e+02 1.358e+02 1.517e+02 2.252e+02, threshold=2.716e+02, percent-clipped=0.0 2023-12-04 12:31:51,662 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=292566.6666666667, ans=0.0 2023-12-04 12:31:55,401 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.17 vs. limit=15.0 2023-12-04 12:31:55,991 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=292566.6666666667, ans=0.125 2023-12-04 12:31:56,637 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.08 vs. limit=15.0 2023-12-04 12:32:00,227 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=292633.3333333333, ans=0.0 2023-12-04 12:32:01,396 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=292633.3333333333, ans=0.0 2023-12-04 12:32:11,087 INFO [train.py:1087] (2/4) Epoch 50, batch 50, loss[loss=0.1493, simple_loss=0.2446, pruned_loss=0.02695, over 24733.00 frames. ], tot_loss[loss=0.1565, simple_loss=0.2489, pruned_loss=0.03204, over 1078344.04 frames. 
], batch size: 61, lr: 4.93e-03, grad_scale: 32.0 2023-12-04 12:32:11,360 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=292700.0, ans=0.125 2023-12-04 12:32:23,678 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.00 vs. limit=22.5 2023-12-04 12:33:07,078 INFO [train.py:1087] (2/4) Epoch 50, batch 100, loss[loss=0.1461, simple_loss=0.2413, pruned_loss=0.02542, over 24559.00 frames. ], tot_loss[loss=0.1562, simple_loss=0.2489, pruned_loss=0.03177, over 1908871.46 frames. ], batch size: 66, lr: 4.93e-03, grad_scale: 32.0 2023-12-04 12:33:08,729 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff3.min_abs, batch_count=293033.3333333333, ans=0.2 2023-12-04 12:33:13,400 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=293033.3333333333, ans=15.0 2023-12-04 12:33:22,576 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=293100.0, ans=0.125 2023-12-04 12:33:23,422 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.081e+02 1.296e+02 1.400e+02 1.614e+02 2.724e+02, threshold=2.800e+02, percent-clipped=1.0 2023-12-04 12:33:32,413 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=293166.6666666667, ans=0.125 2023-12-04 12:33:33,522 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=293166.6666666667, ans=0.125 2023-12-04 12:33:33,620 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=293166.6666666667, ans=0.1 2023-12-04 12:33:33,657 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=293166.6666666667, ans=0.0 2023-12-04 12:33:33,853 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.44 vs. limit=15.0 2023-12-04 12:33:52,587 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=293300.0, ans=0.125 2023-12-04 12:33:59,509 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=293300.0, ans=0.1 2023-12-04 12:34:04,667 INFO [train.py:1087] (2/4) Epoch 50, batch 150, loss[loss=0.1834, simple_loss=0.2661, pruned_loss=0.05032, over 16873.00 frames. ], tot_loss[loss=0.1567, simple_loss=0.2491, pruned_loss=0.03219, over 2548502.56 frames. ], batch size: 177, lr: 4.92e-03, grad_scale: 32.0 2023-12-04 12:34:43,901 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-12-04 12:34:58,794 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=293700.0, ans=0.0 2023-12-04 12:34:59,660 INFO [train.py:1087] (2/4) Epoch 50, batch 200, loss[loss=0.1645, simple_loss=0.2543, pruned_loss=0.03733, over 24216.00 frames. ], tot_loss[loss=0.1563, simple_loss=0.2486, pruned_loss=0.032, over 3052459.98 frames. 
], batch size: 82, lr: 4.92e-03, grad_scale: 32.0 2023-12-04 12:35:02,359 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.49 vs. limit=15.0 2023-12-04 12:35:16,968 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=293766.6666666667, ans=0.0 2023-12-04 12:35:18,243 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.295e+02 1.378e+02 1.503e+02 2.109e+02, threshold=2.756e+02, percent-clipped=0.0 2023-12-04 12:35:21,844 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=293766.6666666667, ans=0.125 2023-12-04 12:35:26,504 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=293833.3333333333, ans=0.125 2023-12-04 12:35:43,093 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-12-04 12:36:02,479 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.82 vs. limit=6.0 2023-12-04 12:36:09,810 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=293966.6666666667, ans=0.0 2023-12-04 12:36:12,394 INFO [train.py:1087] (2/4) Epoch 50, batch 250, loss[loss=0.1499, simple_loss=0.242, pruned_loss=0.02891, over 24581.00 frames. ], tot_loss[loss=0.1561, simple_loss=0.2483, pruned_loss=0.03193, over 3448332.46 frames. ], batch size: 64, lr: 4.92e-03, grad_scale: 32.0 2023-12-04 12:36:12,882 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=294033.3333333333, ans=0.0 2023-12-04 12:36:33,627 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=294100.0, ans=0.07 2023-12-04 12:36:41,051 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=294166.6666666667, ans=0.1 2023-12-04 12:36:48,323 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=294166.6666666667, ans=0.2 2023-12-04 12:36:57,321 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.52 vs. limit=6.0 2023-12-04 12:37:03,169 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.31 vs. limit=22.5 2023-12-04 12:37:14,529 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=294300.0, ans=0.0 2023-12-04 12:37:14,800 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. 
limit=6.0 2023-12-04 12:37:20,235 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=294300.0, ans=0.1 2023-12-04 12:37:27,681 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=294366.6666666667, ans=0.125 2023-12-04 12:37:28,033 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.15 vs. limit=10.0 2023-12-04 12:37:28,451 INFO [train.py:1087] (2/4) Epoch 50, batch 300, loss[loss=0.1722, simple_loss=0.2614, pruned_loss=0.04146, over 24472.00 frames. ], tot_loss[loss=0.1561, simple_loss=0.2483, pruned_loss=0.03194, over 3763349.53 frames. ], batch size: 77, lr: 4.92e-03, grad_scale: 16.0 2023-12-04 12:37:47,035 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=294433.3333333333, ans=0.0 2023-12-04 12:37:52,381 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.144e+02 1.315e+02 1.430e+02 1.594e+02 2.206e+02, threshold=2.860e+02, percent-clipped=0.0 2023-12-04 12:37:57,163 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=294500.0, ans=0.125 2023-12-04 12:38:09,059 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=294500.0, ans=0.0 2023-12-04 12:38:10,005 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=294500.0, ans=0.125 2023-12-04 12:38:28,130 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=294633.3333333333, ans=0.125 2023-12-04 12:38:44,299 INFO [train.py:1087] (2/4) Epoch 50, batch 350, loss[loss=0.1649, simple_loss=0.2604, pruned_loss=0.03475, over 23904.00 frames. ], tot_loss[loss=0.1562, simple_loss=0.2486, pruned_loss=0.03193, over 3996854.13 frames. ], batch size: 87, lr: 4.91e-03, grad_scale: 16.0 2023-12-04 12:38:54,560 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=294700.0, ans=0.125 2023-12-04 12:39:02,769 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=294766.6666666667, ans=0.125 2023-12-04 12:39:26,347 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=294833.3333333333, ans=0.125 2023-12-04 12:39:33,786 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=294900.0, ans=0.1 2023-12-04 12:40:01,221 INFO [train.py:1087] (2/4) Epoch 50, batch 400, loss[loss=0.1417, simple_loss=0.234, pruned_loss=0.02471, over 24765.00 frames. ], tot_loss[loss=0.1559, simple_loss=0.2484, pruned_loss=0.03168, over 4178158.11 frames. ], batch size: 65, lr: 4.91e-03, grad_scale: 32.0 2023-12-04 12:40:03,181 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=295033.3333333333, ans=0.125 2023-12-04 12:40:13,581 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.79 vs. 
limit=15.0 2023-12-04 12:40:26,389 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.119e+02 1.284e+02 1.371e+02 1.478e+02 1.721e+02, threshold=2.742e+02, percent-clipped=0.0 2023-12-04 12:41:17,307 INFO [train.py:1087] (2/4) Epoch 50, batch 450, loss[loss=0.1617, simple_loss=0.2548, pruned_loss=0.03429, over 23399.00 frames. ], tot_loss[loss=0.1564, simple_loss=0.2488, pruned_loss=0.03203, over 4324809.71 frames. ], batch size: 94, lr: 4.91e-03, grad_scale: 32.0 2023-12-04 12:41:19,680 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=295366.6666666667, ans=22.5 2023-12-04 12:41:23,913 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-12-04 12:42:01,356 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=295566.6666666667, ans=0.125 2023-12-04 12:42:02,093 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=4.94 vs. limit=15.0 2023-12-04 12:42:33,065 INFO [train.py:1087] (2/4) Epoch 50, batch 500, loss[loss=0.1543, simple_loss=0.2478, pruned_loss=0.03035, over 21722.00 frames. ], tot_loss[loss=0.1564, simple_loss=0.2488, pruned_loss=0.03203, over 4412685.07 frames. ], batch size: 127, lr: 4.90e-03, grad_scale: 16.0 2023-12-04 12:42:39,076 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=295700.0, ans=0.0 2023-12-04 12:42:56,285 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=295766.6666666667, ans=0.125 2023-12-04 12:42:58,524 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.093e+02 1.256e+02 1.344e+02 1.428e+02 2.057e+02, threshold=2.688e+02, percent-clipped=0.0 2023-12-04 12:43:00,546 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=295766.6666666667, ans=0.125 2023-12-04 12:43:27,585 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.75 vs. limit=15.0 2023-12-04 12:43:49,674 INFO [train.py:1087] (2/4) Epoch 50, batch 550, loss[loss=0.1632, simple_loss=0.2565, pruned_loss=0.03499, over 23564.00 frames. ], tot_loss[loss=0.1567, simple_loss=0.2491, pruned_loss=0.03216, over 4498415.00 frames. ], batch size: 94, lr: 4.90e-03, grad_scale: 16.0 2023-12-04 12:44:27,579 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=296166.6666666667, ans=0.0 2023-12-04 12:44:41,878 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.56 vs. limit=15.0 2023-12-04 12:45:02,598 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=296300.0, ans=0.125 2023-12-04 12:45:06,858 INFO [train.py:1087] (2/4) Epoch 50, batch 600, loss[loss=0.1512, simple_loss=0.2458, pruned_loss=0.02832, over 24548.00 frames. ], tot_loss[loss=0.1562, simple_loss=0.2487, pruned_loss=0.0319, over 4587816.21 frames. 
], batch size: 66, lr: 4.90e-03, grad_scale: 16.0 2023-12-04 12:45:33,643 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.147e+02 1.304e+02 1.379e+02 1.513e+02 2.002e+02, threshold=2.759e+02, percent-clipped=0.0 2023-12-04 12:45:38,905 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.84 vs. limit=12.0 2023-12-04 12:45:42,937 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=296500.0, ans=0.0 2023-12-04 12:45:46,240 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.65 vs. limit=15.0 2023-12-04 12:46:01,212 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=296566.6666666667, ans=0.125 2023-12-04 12:46:24,126 INFO [train.py:1087] (2/4) Epoch 50, batch 650, loss[loss=0.1493, simple_loss=0.2445, pruned_loss=0.02706, over 24576.00 frames. ], tot_loss[loss=0.1561, simple_loss=0.2486, pruned_loss=0.03178, over 4649592.95 frames. ], batch size: 64, lr: 4.90e-03, grad_scale: 16.0 2023-12-04 12:46:43,314 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=296766.6666666667, ans=0.07 2023-12-04 12:47:04,104 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=296833.3333333333, ans=0.0 2023-12-04 12:47:04,330 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.12 vs. limit=15.0 2023-12-04 12:47:27,463 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.71 vs. limit=22.5 2023-12-04 12:47:42,669 INFO [train.py:1087] (2/4) Epoch 50, batch 700, loss[loss=0.1562, simple_loss=0.2451, pruned_loss=0.03366, over 24849.00 frames. ], tot_loss[loss=0.1562, simple_loss=0.2486, pruned_loss=0.03192, over 4687169.62 frames. ], batch size: 68, lr: 4.89e-03, grad_scale: 16.0 2023-12-04 12:47:44,530 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=297033.3333333333, ans=0.0 2023-12-04 12:48:09,131 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.064e+02 1.270e+02 1.347e+02 1.446e+02 1.862e+02, threshold=2.693e+02, percent-clipped=0.0 2023-12-04 12:48:11,083 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=297100.0, ans=0.0 2023-12-04 12:48:19,102 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-12-04 12:48:20,570 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.29 vs. limit=12.0 2023-12-04 12:48:28,761 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=297233.3333333333, ans=0.0 2023-12-04 12:49:01,123 INFO [train.py:1087] (2/4) Epoch 50, batch 750, loss[loss=0.1499, simple_loss=0.2419, pruned_loss=0.02894, over 24795.00 frames. ], tot_loss[loss=0.1568, simple_loss=0.249, pruned_loss=0.03231, over 4696909.18 frames. 
], batch size: 62, lr: 4.89e-03, grad_scale: 16.0 2023-12-04 12:49:08,479 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.45 vs. limit=22.5 2023-12-04 12:49:10,591 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=297366.6666666667, ans=15.0 2023-12-04 12:49:21,792 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=297433.3333333333, ans=0.125 2023-12-04 12:50:03,666 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.31 vs. limit=15.0 2023-12-04 12:50:12,290 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.67 vs. limit=15.0 2023-12-04 12:50:17,614 INFO [train.py:1087] (2/4) Epoch 50, batch 800, loss[loss=0.147, simple_loss=0.2374, pruned_loss=0.02827, over 24749.00 frames. ], tot_loss[loss=0.1568, simple_loss=0.2489, pruned_loss=0.03239, over 4698204.94 frames. ], batch size: 61, lr: 4.89e-03, grad_scale: 32.0 2023-12-04 12:50:18,238 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-12-04 12:50:42,510 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.135e+02 1.327e+02 1.412e+02 1.549e+02 1.936e+02, threshold=2.824e+02, percent-clipped=0.0 2023-12-04 12:51:26,891 INFO [train.py:1087] (2/4) Epoch 50, batch 850, loss[loss=0.1511, simple_loss=0.2456, pruned_loss=0.02827, over 24097.00 frames. ], tot_loss[loss=0.157, simple_loss=0.249, pruned_loss=0.0325, over 4694611.30 frames. ], batch size: 87, lr: 4.89e-03, grad_scale: 32.0 2023-12-04 12:51:50,591 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=298100.0, ans=15.0 2023-12-04 12:52:10,930 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=298233.3333333333, ans=0.125 2023-12-04 12:52:12,377 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=298233.3333333333, ans=0.0 2023-12-04 12:52:14,000 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.90 vs. limit=15.0 2023-12-04 12:52:22,845 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=8.15 vs. limit=12.0 2023-12-04 12:52:45,041 INFO [train.py:1087] (2/4) Epoch 51, batch 0, loss[loss=0.1491, simple_loss=0.2419, pruned_loss=0.02809, over 24508.00 frames. ], tot_loss[loss=0.1491, simple_loss=0.2419, pruned_loss=0.02809, over 24508.00 frames. ], batch size: 75, lr: 4.83e-03, grad_scale: 32.0 2023-12-04 12:52:45,042 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 12:53:01,898 INFO [train.py:1119] (2/4) Epoch 51, validation: loss=0.1517, simple_loss=0.2496, pruned_loss=0.02685, over 944034.00 frames. 
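The tot_loss values in the entries above are reported over a steadily growing frame count within each epoch (1908871 -> 2548502 -> 3052459 ... frames), so they behave as frame-weighted running averages of the per-batch loss components, while each loss[...] value covers the current batch alone. A minimal sketch of such an accumulator (illustrative names only, not the actual train.py code):

from collections import defaultdict

class RunningLoss:
    # Frame-weighted running average of loss components, matching the
    # "tot_loss[...] over N frames" pattern in the log. Illustrative only.
    def __init__(self):
        self.sums = defaultdict(float)  # sum of (per-frame loss * frames) per key
        self.frames = 0.0               # total frames accumulated so far

    def update(self, batch_losses, num_frames):
        # batch_losses holds per-frame averages for one batch, e.g.
        # {"loss": 0.1461, "simple_loss": 0.2413, "pruned_loss": 0.02542}
        for key, value in batch_losses.items():
            self.sums[key] += value * num_frames
        self.frames += num_frames

    def averages(self):
        # Reproduces the numbers printed as "tot_loss[...] over <frames> frames".
        return {k: v / self.frames for k, v in self.sums.items()}

tracker = RunningLoss()
tracker.update({"loss": 0.1461, "simple_loss": 0.2413, "pruned_loss": 0.02542}, 24559.0)
tracker.update({"loss": 0.1834, "simple_loss": 0.2661, "pruned_loss": 0.05032}, 16873.0)
print(tracker.averages(), tracker.frames)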
2023-12-04 12:53:01,899 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 12:53:35,314 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.039e+02 1.298e+02 1.401e+02 1.598e+02 2.046e+02, threshold=2.803e+02, percent-clipped=0.0 2023-12-04 12:53:46,558 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=298533.3333333333, ans=0.0 2023-12-04 12:54:00,641 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=298533.3333333333, ans=0.0 2023-12-04 12:54:16,709 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=298600.0, ans=0.1 2023-12-04 12:54:19,889 INFO [train.py:1087] (2/4) Epoch 51, batch 50, loss[loss=0.1453, simple_loss=0.2405, pruned_loss=0.02507, over 24573.00 frames. ], tot_loss[loss=0.158, simple_loss=0.2502, pruned_loss=0.03294, over 1079078.10 frames. ], batch size: 64, lr: 4.83e-03, grad_scale: 32.0 2023-12-04 12:54:22,411 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=298666.6666666667, ans=0.1 2023-12-04 12:54:44,159 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=298733.3333333333, ans=0.5 2023-12-04 12:55:00,551 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=298800.0, ans=0.0 2023-12-04 12:55:01,920 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=298800.0, ans=0.125 2023-12-04 12:55:13,210 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.94 vs. limit=22.5 2023-12-04 12:55:14,324 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=298866.6666666667, ans=0.2 2023-12-04 12:55:36,397 INFO [train.py:1087] (2/4) Epoch 51, batch 100, loss[loss=0.1605, simple_loss=0.2567, pruned_loss=0.03215, over 21616.00 frames. ], tot_loss[loss=0.156, simple_loss=0.2487, pruned_loss=0.0316, over 1913793.99 frames. ], batch size: 128, lr: 4.83e-03, grad_scale: 32.0 2023-12-04 12:56:01,017 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=299066.6666666667, ans=0.125 2023-12-04 12:56:10,758 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.079e+02 1.247e+02 1.341e+02 1.436e+02 1.925e+02, threshold=2.681e+02, percent-clipped=0.0 2023-12-04 12:56:19,378 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.61 vs. limit=15.0 2023-12-04 12:56:35,924 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=299200.0, ans=0.95 2023-12-04 12:56:53,525 INFO [train.py:1087] (2/4) Epoch 51, batch 150, loss[loss=0.1567, simple_loss=0.2521, pruned_loss=0.03067, over 24857.00 frames. ], tot_loss[loss=0.1559, simple_loss=0.2482, pruned_loss=0.03178, over 2567097.85 frames. 
], batch size: 68, lr: 4.83e-03, grad_scale: 32.0 2023-12-04 12:56:53,920 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=299333.3333333333, ans=0.125 2023-12-04 12:56:58,359 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=299333.3333333333, ans=0.125 2023-12-04 12:57:07,705 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=299400.0, ans=0.2 2023-12-04 12:57:30,722 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. limit=10.0 2023-12-04 12:57:55,878 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=299600.0, ans=0.2 2023-12-04 12:58:03,101 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=299600.0, ans=0.125 2023-12-04 12:58:08,666 INFO [train.py:1087] (2/4) Epoch 51, batch 200, loss[loss=0.1524, simple_loss=0.2466, pruned_loss=0.02909, over 24565.00 frames. ], tot_loss[loss=0.1552, simple_loss=0.2477, pruned_loss=0.03138, over 3064419.99 frames. ], batch size: 63, lr: 4.82e-03, grad_scale: 32.0 2023-12-04 12:58:21,393 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=299666.6666666667, ans=0.0 2023-12-04 12:58:24,561 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-12-04 12:58:28,954 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=299733.3333333333, ans=0.125 2023-12-04 12:58:38,978 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-12-04 12:58:43,681 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.103e+02 1.285e+02 1.368e+02 1.485e+02 1.874e+02, threshold=2.735e+02, percent-clipped=0.0 2023-12-04 12:58:51,294 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 12:59:10,863 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=12.0 2023-12-04 12:59:22,800 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=300000.0, ans=0.2 2023-12-04 12:59:24,413 INFO [train.py:1087] (2/4) Epoch 51, batch 250, loss[loss=0.1609, simple_loss=0.2481, pruned_loss=0.03689, over 24277.00 frames. ], tot_loss[loss=0.1551, simple_loss=0.2477, pruned_loss=0.03121, over 3461131.58 frames. ], batch size: 79, lr: 4.82e-03, grad_scale: 16.0 2023-12-04 12:59:24,693 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=300000.0, ans=0.0 2023-12-04 13:00:09,808 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=300200.0, ans=0.125 2023-12-04 13:00:16,218 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.44 vs. 
limit=22.5 2023-12-04 13:00:19,934 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=300200.0, ans=0.125 2023-12-04 13:00:41,671 INFO [train.py:1087] (2/4) Epoch 51, batch 300, loss[loss=0.1671, simple_loss=0.2601, pruned_loss=0.03706, over 24202.00 frames. ], tot_loss[loss=0.1555, simple_loss=0.2482, pruned_loss=0.03145, over 3761103.40 frames. ], batch size: 82, lr: 4.82e-03, grad_scale: 16.0 2023-12-04 13:00:47,420 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=300333.3333333333, ans=0.125 2023-12-04 13:01:16,432 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.141e+02 1.302e+02 1.405e+02 1.500e+02 2.125e+02, threshold=2.811e+02, percent-clipped=0.0 2023-12-04 13:01:57,782 INFO [train.py:1087] (2/4) Epoch 51, batch 350, loss[loss=0.1663, simple_loss=0.2572, pruned_loss=0.03766, over 24470.00 frames. ], tot_loss[loss=0.1551, simple_loss=0.2477, pruned_loss=0.0312, over 4008230.46 frames. ], batch size: 77, lr: 4.82e-03, grad_scale: 16.0 2023-12-04 13:02:31,656 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.40 vs. limit=22.5 2023-12-04 13:02:43,625 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=300866.6666666667, ans=0.125 2023-12-04 13:02:48,633 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.56 vs. limit=15.0 2023-12-04 13:02:49,813 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=300866.6666666667, ans=0.0 2023-12-04 13:02:51,199 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=300866.6666666667, ans=0.05 2023-12-04 13:02:51,249 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=300866.6666666667, ans=0.2 2023-12-04 13:03:05,649 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=300933.3333333333, ans=0.2 2023-12-04 13:03:11,374 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=300933.3333333333, ans=0.1 2023-12-04 13:03:14,077 INFO [train.py:1087] (2/4) Epoch 51, batch 400, loss[loss=0.1508, simple_loss=0.2451, pruned_loss=0.02827, over 24554.00 frames. ], tot_loss[loss=0.1546, simple_loss=0.2473, pruned_loss=0.03091, over 4191123.58 frames. ], batch size: 62, lr: 4.81e-03, grad_scale: 32.0 2023-12-04 13:03:45,578 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=301133.3333333333, ans=0.125 2023-12-04 13:03:49,706 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.103e+02 1.264e+02 1.361e+02 1.495e+02 2.080e+02, threshold=2.722e+02, percent-clipped=0.0 2023-12-04 13:04:23,570 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=301266.6666666667, ans=0.1 2023-12-04 13:04:30,733 INFO [train.py:1087] (2/4) Epoch 51, batch 450, loss[loss=0.148, simple_loss=0.2449, pruned_loss=0.02554, over 24559.00 frames. 
], tot_loss[loss=0.1546, simple_loss=0.2474, pruned_loss=0.03092, over 4337684.94 frames. ], batch size: 63, lr: 4.81e-03, grad_scale: 32.0 2023-12-04 13:04:36,935 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=301333.3333333333, ans=0.035 2023-12-04 13:04:59,291 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=301466.6666666667, ans=0.125 2023-12-04 13:05:03,456 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=301466.6666666667, ans=0.2 2023-12-04 13:05:18,716 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=301533.3333333333, ans=0.125 2023-12-04 13:05:19,964 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=301533.3333333333, ans=0.125 2023-12-04 13:05:46,814 INFO [train.py:1087] (2/4) Epoch 51, batch 500, loss[loss=0.1486, simple_loss=0.24, pruned_loss=0.02862, over 24596.00 frames. ], tot_loss[loss=0.155, simple_loss=0.2476, pruned_loss=0.03114, over 4445065.77 frames. ], batch size: 68, lr: 4.81e-03, grad_scale: 32.0 2023-12-04 13:05:59,688 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 13:06:15,969 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 13:06:21,503 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.103e+02 1.256e+02 1.336e+02 1.463e+02 1.740e+02, threshold=2.671e+02, percent-clipped=0.0 2023-12-04 13:06:23,574 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.86 vs. limit=15.0 2023-12-04 13:06:50,946 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=301933.3333333333, ans=0.125 2023-12-04 13:06:53,729 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=301933.3333333333, ans=0.07 2023-12-04 13:07:03,739 INFO [train.py:1087] (2/4) Epoch 51, batch 550, loss[loss=0.1643, simple_loss=0.2563, pruned_loss=0.03613, over 24159.00 frames. ], tot_loss[loss=0.1551, simple_loss=0.2478, pruned_loss=0.03121, over 4525114.12 frames. 
], batch size: 82, lr: 4.81e-03, grad_scale: 32.0 2023-12-04 13:07:20,918 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=302066.6666666667, ans=0.125 2023-12-04 13:07:27,130 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=302066.6666666667, ans=0.2 2023-12-04 13:07:27,232 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=302066.6666666667, ans=0.0 2023-12-04 13:07:49,273 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=302200.0, ans=0.125 2023-12-04 13:08:14,970 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=302266.6666666667, ans=0.125 2023-12-04 13:08:21,954 INFO [train.py:1087] (2/4) Epoch 51, batch 600, loss[loss=0.1555, simple_loss=0.2473, pruned_loss=0.03185, over 24735.00 frames. ], tot_loss[loss=0.155, simple_loss=0.2477, pruned_loss=0.03116, over 4602324.75 frames. ], batch size: 61, lr: 4.80e-03, grad_scale: 32.0 2023-12-04 13:08:58,248 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.169e+02 1.269e+02 1.388e+02 1.458e+02 2.113e+02, threshold=2.776e+02, percent-clipped=0.0 2023-12-04 13:09:02,939 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=302466.6666666667, ans=0.125 2023-12-04 13:09:03,011 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=302466.6666666667, ans=0.1 2023-12-04 13:09:15,669 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=302533.3333333333, ans=0.125 2023-12-04 13:09:41,628 INFO [train.py:1087] (2/4) Epoch 51, batch 650, loss[loss=0.1552, simple_loss=0.2487, pruned_loss=0.03083, over 24789.00 frames. ], tot_loss[loss=0.1549, simple_loss=0.2476, pruned_loss=0.03109, over 4650082.51 frames. ], batch size: 72, lr: 4.80e-03, grad_scale: 32.0 2023-12-04 13:09:43,268 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=302666.6666666667, ans=0.125 2023-12-04 13:10:02,642 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=302733.3333333333, ans=0.125 2023-12-04 13:10:04,421 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=302733.3333333333, ans=0.0 2023-12-04 13:10:09,201 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=302733.3333333333, ans=0.125 2023-12-04 13:10:49,516 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=302933.3333333333, ans=0.125 2023-12-04 13:10:54,201 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 13:10:54,393 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.68 vs. 
limit=15.0 2023-12-04 13:10:58,607 INFO [train.py:1087] (2/4) Epoch 51, batch 700, loss[loss=0.1539, simple_loss=0.2472, pruned_loss=0.03033, over 24342.00 frames. ], tot_loss[loss=0.1548, simple_loss=0.2474, pruned_loss=0.03107, over 4700880.09 frames. ], batch size: 79, lr: 4.80e-03, grad_scale: 32.0 2023-12-04 13:11:00,543 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=303000.0, ans=0.1 2023-12-04 13:11:33,732 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.156e+02 1.280e+02 1.383e+02 1.505e+02 1.927e+02, threshold=2.765e+02, percent-clipped=0.0 2023-12-04 13:11:36,205 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.65 vs. limit=22.5 2023-12-04 13:12:12,015 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=303266.6666666667, ans=0.0 2023-12-04 13:12:16,985 INFO [train.py:1087] (2/4) Epoch 51, batch 750, loss[loss=0.1511, simple_loss=0.2492, pruned_loss=0.02648, over 24562.00 frames. ], tot_loss[loss=0.1553, simple_loss=0.2479, pruned_loss=0.03136, over 4711415.68 frames. ], batch size: 66, lr: 4.80e-03, grad_scale: 32.0 2023-12-04 13:12:24,167 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=303333.3333333333, ans=0.125 2023-12-04 13:13:16,131 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=303533.3333333333, ans=0.125 2023-12-04 13:13:22,243 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=303600.0, ans=0.0 2023-12-04 13:13:28,247 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.24 vs. limit=10.0 2023-12-04 13:13:35,046 INFO [train.py:1087] (2/4) Epoch 51, batch 800, loss[loss=0.1469, simple_loss=0.2423, pruned_loss=0.02571, over 24861.00 frames. ], tot_loss[loss=0.1552, simple_loss=0.2478, pruned_loss=0.03133, over 4732325.74 frames. ], batch size: 68, lr: 4.79e-03, grad_scale: 32.0 2023-12-04 13:13:46,168 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=303666.6666666667, ans=0.125 2023-12-04 13:14:04,378 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=303800.0, ans=0.0 2023-12-04 13:14:09,426 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.091e+02 1.291e+02 1.399e+02 1.523e+02 2.071e+02, threshold=2.798e+02, percent-clipped=0.0 2023-12-04 13:14:46,119 INFO [train.py:1087] (2/4) Epoch 51, batch 850, loss[loss=0.1479, simple_loss=0.2446, pruned_loss=0.0256, over 24585.00 frames. ], tot_loss[loss=0.1552, simple_loss=0.2478, pruned_loss=0.03136, over 4747312.26 frames. ], batch size: 64, lr: 4.79e-03, grad_scale: 32.0 2023-12-04 13:15:20,572 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.57 vs. 
limit=12.0 2023-12-04 13:15:21,399 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=304133.3333333333, ans=0.0 2023-12-04 13:15:36,358 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=304200.0, ans=0.1 2023-12-04 13:16:09,166 INFO [train.py:1087] (2/4) Epoch 52, batch 0, loss[loss=0.1537, simple_loss=0.2518, pruned_loss=0.02781, over 24868.00 frames. ], tot_loss[loss=0.1537, simple_loss=0.2518, pruned_loss=0.02781, over 24868.00 frames. ], batch size: 68, lr: 4.74e-03, grad_scale: 32.0 2023-12-04 13:16:09,170 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 13:16:25,868 INFO [train.py:1119] (2/4) Epoch 52, validation: loss=0.1515, simple_loss=0.2494, pruned_loss=0.02683, over 944034.00 frames. 2023-12-04 13:16:25,869 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 13:16:32,368 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=304300.0, ans=0.125 2023-12-04 13:16:32,821 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.05 vs. limit=15.0 2023-12-04 13:16:39,814 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=304366.6666666667, ans=0.0 2023-12-04 13:16:53,097 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-12-04 13:17:01,311 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=304433.3333333333, ans=0.0 2023-12-04 13:17:11,987 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.128e+02 1.264e+02 1.362e+02 1.496e+02 2.534e+02, threshold=2.723e+02, percent-clipped=0.0 2023-12-04 13:17:12,477 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=304500.0, ans=0.025 2023-12-04 13:17:28,200 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.16 vs. limit=15.0 2023-12-04 13:17:43,846 INFO [train.py:1087] (2/4) Epoch 52, batch 50, loss[loss=0.151, simple_loss=0.2463, pruned_loss=0.02789, over 24826.00 frames. ], tot_loss[loss=0.1573, simple_loss=0.25, pruned_loss=0.0323, over 1090780.18 frames. ], batch size: 73, lr: 4.74e-03, grad_scale: 16.0 2023-12-04 13:17:49,213 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.83 vs. limit=12.0 2023-12-04 13:18:22,890 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=304766.6666666667, ans=0.125 2023-12-04 13:18:25,784 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-12-04 13:18:52,778 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.17 vs. 
limit=15.0 2023-12-04 13:19:02,924 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.07 vs. limit=22.5 2023-12-04 13:19:03,458 INFO [train.py:1087] (2/4) Epoch 52, batch 100, loss[loss=0.1482, simple_loss=0.2433, pruned_loss=0.02652, over 24542.00 frames. ], tot_loss[loss=0.1564, simple_loss=0.2491, pruned_loss=0.03181, over 1919775.28 frames. ], batch size: 62, lr: 4.74e-03, grad_scale: 16.0 2023-12-04 13:19:35,563 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=305100.0, ans=0.0 2023-12-04 13:19:41,697 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=305100.0, ans=0.1 2023-12-04 13:19:43,160 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=305100.0, ans=0.1 2023-12-04 13:19:50,781 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.119e+02 1.256e+02 1.354e+02 1.466e+02 2.724e+02, threshold=2.709e+02, percent-clipped=1.0 2023-12-04 13:20:02,138 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=305166.6666666667, ans=0.125 2023-12-04 13:20:11,141 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=305233.3333333333, ans=0.0 2023-12-04 13:20:22,234 INFO [train.py:1087] (2/4) Epoch 52, batch 150, loss[loss=0.1601, simple_loss=0.2501, pruned_loss=0.03506, over 24258.00 frames. ], tot_loss[loss=0.1558, simple_loss=0.2486, pruned_loss=0.03151, over 2562019.22 frames. ], batch size: 79, lr: 4.73e-03, grad_scale: 16.0 2023-12-04 13:20:28,989 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=305300.0, ans=0.125 2023-12-04 13:20:30,993 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=305300.0, ans=0.1 2023-12-04 13:20:33,666 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=305300.0, ans=0.1 2023-12-04 13:21:10,670 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.31 vs. limit=10.0 2023-12-04 13:21:41,577 INFO [train.py:1087] (2/4) Epoch 52, batch 200, loss[loss=0.1631, simple_loss=0.2561, pruned_loss=0.03505, over 23521.00 frames. ], tot_loss[loss=0.1557, simple_loss=0.2483, pruned_loss=0.03157, over 3067759.32 frames. ], batch size: 94, lr: 4.73e-03, grad_scale: 16.0 2023-12-04 13:21:46,557 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=305633.3333333333, ans=0.07 2023-12-04 13:22:18,962 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.57 vs. 
limit=15.0 2023-12-04 13:22:28,491 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.138e+02 1.310e+02 1.423e+02 1.525e+02 1.849e+02, threshold=2.846e+02, percent-clipped=0.0 2023-12-04 13:22:29,007 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=305833.3333333333, ans=0.125 2023-12-04 13:22:58,830 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=305966.6666666667, ans=0.125 2023-12-04 13:23:00,001 INFO [train.py:1087] (2/4) Epoch 52, batch 250, loss[loss=0.1503, simple_loss=0.2393, pruned_loss=0.03062, over 24769.00 frames. ], tot_loss[loss=0.156, simple_loss=0.2484, pruned_loss=0.03175, over 3440101.53 frames. ], batch size: 65, lr: 4.73e-03, grad_scale: 16.0 2023-12-04 13:23:03,410 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=305966.6666666667, ans=0.0 2023-12-04 13:23:06,438 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=305966.6666666667, ans=0.0 2023-12-04 13:23:12,154 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=305966.6666666667, ans=0.02 2023-12-04 13:23:21,707 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=306033.3333333333, ans=0.2 2023-12-04 13:23:51,057 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=306166.6666666667, ans=0.0 2023-12-04 13:24:01,872 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=306233.3333333333, ans=0.125 2023-12-04 13:24:18,673 INFO [train.py:1087] (2/4) Epoch 52, batch 300, loss[loss=0.1417, simple_loss=0.2371, pruned_loss=0.02315, over 24715.00 frames. ], tot_loss[loss=0.1557, simple_loss=0.2483, pruned_loss=0.03154, over 3754395.90 frames. ], batch size: 67, lr: 4.73e-03, grad_scale: 16.0 2023-12-04 13:24:32,413 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=306366.6666666667, ans=0.1 2023-12-04 13:24:43,003 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.68 vs. limit=15.0 2023-12-04 13:24:45,648 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=306366.6666666667, ans=0.0 2023-12-04 13:24:53,782 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=306433.3333333333, ans=0.0 2023-12-04 13:25:01,319 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.49 vs. 
limit=10.0 2023-12-04 13:25:05,177 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.095e+02 1.271e+02 1.354e+02 1.459e+02 1.877e+02, threshold=2.709e+02, percent-clipped=0.0 2023-12-04 13:25:07,116 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=306500.0, ans=0.1 2023-12-04 13:25:07,145 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=306500.0, ans=0.0 2023-12-04 13:25:26,701 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.37 vs. limit=22.5 2023-12-04 13:25:36,098 INFO [train.py:1087] (2/4) Epoch 52, batch 350, loss[loss=0.158, simple_loss=0.2531, pruned_loss=0.03141, over 21356.00 frames. ], tot_loss[loss=0.1555, simple_loss=0.2481, pruned_loss=0.03149, over 3982729.22 frames. ], batch size: 128, lr: 4.72e-03, grad_scale: 16.0 2023-12-04 13:26:06,300 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=306766.6666666667, ans=0.2 2023-12-04 13:26:11,336 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.98 vs. limit=15.0 2023-12-04 13:26:23,876 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=306833.3333333333, ans=0.025 2023-12-04 13:26:49,445 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=306900.0, ans=0.09899494936611666 2023-12-04 13:26:55,395 INFO [train.py:1087] (2/4) Epoch 52, batch 400, loss[loss=0.1516, simple_loss=0.2463, pruned_loss=0.02847, over 24771.00 frames. ], tot_loss[loss=0.1552, simple_loss=0.2478, pruned_loss=0.03128, over 4183236.67 frames. 
], batch size: 70, lr: 4.72e-03, grad_scale: 32.0 2023-12-04 13:27:18,625 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=307033.3333333333, ans=0.125 2023-12-04 13:27:24,157 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 13:27:24,185 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=307033.3333333333, ans=0.125 2023-12-04 13:27:31,753 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=307100.0, ans=0.1 2023-12-04 13:27:34,740 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=307100.0, ans=0.125 2023-12-04 13:27:43,121 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.137e+02 1.294e+02 1.381e+02 1.531e+02 2.056e+02, threshold=2.762e+02, percent-clipped=0.0 2023-12-04 13:28:02,737 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=307233.3333333333, ans=0.125 2023-12-04 13:28:07,140 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=307233.3333333333, ans=0.125 2023-12-04 13:28:07,351 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=307233.3333333333, ans=0.125 2023-12-04 13:28:09,427 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=307233.3333333333, ans=0.125 2023-12-04 13:28:14,826 INFO [train.py:1087] (2/4) Epoch 52, batch 450, loss[loss=0.1669, simple_loss=0.2573, pruned_loss=0.03825, over 24216.00 frames. ], tot_loss[loss=0.1552, simple_loss=0.2478, pruned_loss=0.03132, over 4325980.52 frames. ], batch size: 82, lr: 4.72e-03, grad_scale: 32.0 2023-12-04 13:28:31,270 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=307366.6666666667, ans=0.0 2023-12-04 13:28:48,390 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=307433.3333333333, ans=0.125 2023-12-04 13:28:53,077 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=307433.3333333333, ans=0.0 2023-12-04 13:29:31,017 INFO [train.py:1087] (2/4) Epoch 52, batch 500, loss[loss=0.1615, simple_loss=0.2495, pruned_loss=0.03672, over 24494.00 frames. ], tot_loss[loss=0.155, simple_loss=0.2477, pruned_loss=0.03119, over 4423383.16 frames. 
], batch size: 77, lr: 4.72e-03, grad_scale: 32.0 2023-12-04 13:29:31,274 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=307633.3333333333, ans=0.125 2023-12-04 13:29:36,093 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=307633.3333333333, ans=0.125 2023-12-04 13:29:50,239 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=307700.0, ans=0.125 2023-12-04 13:29:51,492 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=307700.0, ans=0.1 2023-12-04 13:30:12,073 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=307766.6666666667, ans=0.125 2023-12-04 13:30:17,942 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.139e+02 1.320e+02 1.466e+02 1.634e+02 2.572e+02, threshold=2.932e+02, percent-clipped=0.0 2023-12-04 13:30:19,839 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=307833.3333333333, ans=0.125 2023-12-04 13:30:26,518 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=307833.3333333333, ans=0.0 2023-12-04 13:30:28,368 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=307833.3333333333, ans=0.1 2023-12-04 13:30:40,968 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=307900.0, ans=0.125 2023-12-04 13:30:50,153 INFO [train.py:1087] (2/4) Epoch 52, batch 550, loss[loss=0.1414, simple_loss=0.2352, pruned_loss=0.0238, over 24824.00 frames. ], tot_loss[loss=0.1551, simple_loss=0.2477, pruned_loss=0.03128, over 4484296.92 frames. ], batch size: 73, lr: 4.71e-03, grad_scale: 32.0 2023-12-04 13:31:28,836 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=308100.0, ans=0.125 2023-12-04 13:31:28,968 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=308100.0, ans=0.09899494936611666 2023-12-04 13:31:36,739 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.74 vs. limit=22.5 2023-12-04 13:31:43,767 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=308166.6666666667, ans=0.2 2023-12-04 13:31:46,711 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=308166.6666666667, ans=0.0 2023-12-04 13:31:52,557 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=308233.3333333333, ans=0.1 2023-12-04 13:32:07,354 INFO [train.py:1087] (2/4) Epoch 52, batch 600, loss[loss=0.1558, simple_loss=0.2539, pruned_loss=0.0288, over 24157.00 frames. ], tot_loss[loss=0.155, simple_loss=0.2475, pruned_loss=0.03125, over 4552891.01 frames. 
], batch size: 58, lr: 4.71e-03, grad_scale: 32.0 2023-12-04 13:32:14,662 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-12-04 13:32:23,487 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=308366.6666666667, ans=0.1 2023-12-04 13:32:27,920 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=308366.6666666667, ans=0.125 2023-12-04 13:32:43,029 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=308433.3333333333, ans=0.0 2023-12-04 13:32:54,903 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.121e+02 1.254e+02 1.324e+02 1.412e+02 1.879e+02, threshold=2.648e+02, percent-clipped=0.0 2023-12-04 13:33:03,202 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=308500.0, ans=0.125 2023-12-04 13:33:23,952 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=308633.3333333333, ans=0.0 2023-12-04 13:33:25,202 INFO [train.py:1087] (2/4) Epoch 52, batch 650, loss[loss=0.1461, simple_loss=0.2326, pruned_loss=0.02979, over 24713.00 frames. ], tot_loss[loss=0.1551, simple_loss=0.2476, pruned_loss=0.03124, over 4616988.89 frames. ], batch size: 69, lr: 4.71e-03, grad_scale: 16.0 2023-12-04 13:33:27,012 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=308633.3333333333, ans=0.125 2023-12-04 13:34:26,657 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=308900.0, ans=0.1 2023-12-04 13:34:43,607 INFO [train.py:1087] (2/4) Epoch 52, batch 700, loss[loss=0.1488, simple_loss=0.2425, pruned_loss=0.02757, over 24714.00 frames. ], tot_loss[loss=0.1548, simple_loss=0.2474, pruned_loss=0.03116, over 4651730.33 frames. ], batch size: 67, lr: 4.71e-03, grad_scale: 16.0 2023-12-04 13:34:47,047 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=308966.6666666667, ans=0.125 2023-12-04 13:35:07,757 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=309033.3333333333, ans=0.1 2023-12-04 13:35:25,307 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=309100.0, ans=0.125 2023-12-04 13:35:32,663 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.080e+02 1.284e+02 1.379e+02 1.528e+02 1.943e+02, threshold=2.757e+02, percent-clipped=0.0 2023-12-04 13:35:41,010 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=309166.6666666667, ans=0.125 2023-12-04 13:36:03,289 INFO [train.py:1087] (2/4) Epoch 52, batch 750, loss[loss=0.1455, simple_loss=0.2411, pruned_loss=0.02501, over 24716.00 frames. ], tot_loss[loss=0.1543, simple_loss=0.2468, pruned_loss=0.0309, over 4686806.05 frames. 
], batch size: 74, lr: 4.70e-03, grad_scale: 16.0 2023-12-04 13:36:18,932 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=309366.6666666667, ans=10.0 2023-12-04 13:36:50,555 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=309500.0, ans=0.0 2023-12-04 13:36:55,227 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=309500.0, ans=0.125 2023-12-04 13:37:21,786 INFO [train.py:1087] (2/4) Epoch 52, batch 800, loss[loss=0.1461, simple_loss=0.2373, pruned_loss=0.02747, over 24771.00 frames. ], tot_loss[loss=0.1542, simple_loss=0.2467, pruned_loss=0.03091, over 4725783.82 frames. ], batch size: 64, lr: 4.70e-03, grad_scale: 32.0 2023-12-04 13:37:32,527 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=309633.3333333333, ans=0.5 2023-12-04 13:37:38,047 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=309700.0, ans=0.1 2023-12-04 13:37:43,504 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=309700.0, ans=0.125 2023-12-04 13:37:58,783 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=309766.6666666667, ans=0.0 2023-12-04 13:38:06,958 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.096e+02 1.283e+02 1.363e+02 1.464e+02 2.233e+02, threshold=2.727e+02, percent-clipped=0.0 2023-12-04 13:38:14,506 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.42 vs. limit=22.5 2023-12-04 13:38:33,071 INFO [train.py:1087] (2/4) Epoch 52, batch 850, loss[loss=0.1465, simple_loss=0.2406, pruned_loss=0.02614, over 24772.00 frames. ], tot_loss[loss=0.1545, simple_loss=0.2469, pruned_loss=0.03102, over 4739087.32 frames. ], batch size: 64, lr: 4.70e-03, grad_scale: 32.0 2023-12-04 13:38:41,336 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=309966.6666666667, ans=0.0 2023-12-04 13:38:56,480 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=310033.3333333333, ans=0.125 2023-12-04 13:39:14,657 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=310166.6666666667, ans=0.05 2023-12-04 13:39:18,559 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=310166.6666666667, ans=0.0 2023-12-04 13:39:28,051 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 13:39:56,186 INFO [train.py:1087] (2/4) Epoch 53, batch 0, loss[loss=0.146, simple_loss=0.2416, pruned_loss=0.02518, over 24749.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2416, pruned_loss=0.02518, over 24749.00 frames. ], batch size: 61, lr: 4.65e-03, grad_scale: 32.0 2023-12-04 13:39:56,187 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 13:40:15,190 INFO [train.py:1119] (2/4) Epoch 53, validation: loss=0.1513, simple_loss=0.249, pruned_loss=0.02677, over 944034.00 frames. 
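In each optim.py entry above the five grad-norm figures read as min, 25%, median, 75% and max over a window of recent gradient norms, and the logged threshold equals Clipping_scale times the median (for example 2.0 * 1.371e+02 = 2.742e+02). A rough sketch of that bookkeeping, assuming PyTorch; the helper name and the fixed example window are assumptions, not the icefall implementation:

import torch

def grad_norm_summary(recent_norms: torch.Tensor, clipping_scale: float = 2.0):
    # Summarize a window of recent gradient norms the way the log entries do:
    # min / 25% / median / 75% / max plus a threshold of clipping_scale * median.
    qs = torch.quantile(recent_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * qs[2]
    percent_clipped = 100.0 * (recent_norms > threshold).float().mean()
    return qs, threshold, percent_clipped

# Synthetic norms in the same range as the logged quartiles:
norms = torch.tensor([112.3, 128.4, 137.1, 147.8, 172.1])
qs, thr, pct = grad_norm_summary(norms)
print(qs.tolist(), float(thr), float(pct))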
2023-12-04 13:40:15,191 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 13:40:38,210 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-12-04 13:40:39,110 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=310333.3333333333, ans=0.125 2023-12-04 13:40:53,462 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=310400.0, ans=0.125 2023-12-04 13:40:58,235 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=310400.0, ans=0.125 2023-12-04 13:41:12,823 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.081e+02 1.286e+02 1.385e+02 1.598e+02 2.350e+02, threshold=2.770e+02, percent-clipped=0.0 2023-12-04 13:41:13,286 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=310466.6666666667, ans=0.1 2023-12-04 13:41:32,115 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=310600.0, ans=0.0 2023-12-04 13:41:33,349 INFO [train.py:1087] (2/4) Epoch 53, batch 50, loss[loss=0.1565, simple_loss=0.2534, pruned_loss=0.02981, over 24773.00 frames. ], tot_loss[loss=0.1574, simple_loss=0.2499, pruned_loss=0.03242, over 1087922.33 frames. ], batch size: 73, lr: 4.65e-03, grad_scale: 32.0 2023-12-04 13:41:45,410 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=310600.0, ans=0.2 2023-12-04 13:42:21,753 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=310800.0, ans=0.0 2023-12-04 13:42:52,849 INFO [train.py:1087] (2/4) Epoch 53, batch 100, loss[loss=0.1467, simple_loss=0.24, pruned_loss=0.02672, over 24677.00 frames. ], tot_loss[loss=0.1546, simple_loss=0.2474, pruned_loss=0.03089, over 1919690.88 frames. ], batch size: 74, lr: 4.65e-03, grad_scale: 32.0 2023-12-04 13:43:33,738 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=311066.6666666667, ans=0.125 2023-12-04 13:43:46,330 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=311133.3333333333, ans=0.0 2023-12-04 13:43:53,146 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.159e+02 1.276e+02 1.377e+02 1.537e+02 2.483e+02, threshold=2.754e+02, percent-clipped=0.0 2023-12-04 13:43:54,509 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=311200.0, ans=0.0 2023-12-04 13:44:07,776 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=311200.0, ans=0.125 2023-12-04 13:44:10,684 INFO [train.py:1087] (2/4) Epoch 53, batch 150, loss[loss=0.1727, simple_loss=0.2611, pruned_loss=0.04212, over 24291.00 frames. ], tot_loss[loss=0.1543, simple_loss=0.2473, pruned_loss=0.03067, over 2568678.92 frames. 
], batch size: 79, lr: 4.64e-03, grad_scale: 8.0 2023-12-04 13:44:15,572 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=311266.6666666667, ans=0.125 2023-12-04 13:44:25,543 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=311333.3333333333, ans=0.1 2023-12-04 13:44:54,280 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.whiten.whitening_limit, batch_count=311400.0, ans=12.0 2023-12-04 13:45:00,004 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 13:45:02,960 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=311466.6666666667, ans=0.0 2023-12-04 13:45:29,588 INFO [train.py:1087] (2/4) Epoch 53, batch 200, loss[loss=0.1564, simple_loss=0.2464, pruned_loss=0.03317, over 24555.00 frames. ], tot_loss[loss=0.1549, simple_loss=0.2475, pruned_loss=0.03115, over 3068518.25 frames. ], batch size: 63, lr: 4.64e-03, grad_scale: 8.0 2023-12-04 13:45:30,267 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.66 vs. limit=15.0 2023-12-04 13:45:53,762 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.64 vs. limit=12.0 2023-12-04 13:46:02,399 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=311733.3333333333, ans=0.0 2023-12-04 13:46:04,131 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=311733.3333333333, ans=0.125 2023-12-04 13:46:21,861 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=311800.0, ans=0.1 2023-12-04 13:46:33,502 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.106e+02 1.293e+02 1.379e+02 1.475e+02 1.829e+02, threshold=2.758e+02, percent-clipped=0.0 2023-12-04 13:46:41,110 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.16 vs. limit=15.0 2023-12-04 13:46:48,250 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=311866.6666666667, ans=0.2 2023-12-04 13:46:50,938 INFO [train.py:1087] (2/4) Epoch 53, batch 250, loss[loss=0.1476, simple_loss=0.2374, pruned_loss=0.02887, over 24757.00 frames. ], tot_loss[loss=0.1549, simple_loss=0.2475, pruned_loss=0.03111, over 3438274.96 frames. ], batch size: 66, lr: 4.64e-03, grad_scale: 8.0 2023-12-04 13:48:03,709 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=312200.0, ans=0.125 2023-12-04 13:48:04,253 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.01 vs. limit=6.0 2023-12-04 13:48:13,337 INFO [train.py:1087] (2/4) Epoch 53, batch 300, loss[loss=0.1535, simple_loss=0.2512, pruned_loss=0.02791, over 21063.00 frames. ], tot_loss[loss=0.1547, simple_loss=0.2474, pruned_loss=0.03099, over 3752170.34 frames. 
], batch size: 127, lr: 4.64e-03, grad_scale: 8.0 2023-12-04 13:48:17,149 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=312266.6666666667, ans=0.125 2023-12-04 13:48:26,197 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.36 vs. limit=15.0 2023-12-04 13:48:38,035 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=312333.3333333333, ans=0.125 2023-12-04 13:48:39,390 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=312333.3333333333, ans=0.1 2023-12-04 13:49:17,864 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.128e+02 1.305e+02 1.404e+02 1.545e+02 2.475e+02, threshold=2.807e+02, percent-clipped=0.0 2023-12-04 13:49:30,661 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=312533.3333333333, ans=0.125 2023-12-04 13:49:35,550 INFO [train.py:1087] (2/4) Epoch 53, batch 350, loss[loss=0.1619, simple_loss=0.2521, pruned_loss=0.03588, over 24294.00 frames. ], tot_loss[loss=0.1553, simple_loss=0.2478, pruned_loss=0.0314, over 3976787.99 frames. ], batch size: 79, lr: 4.63e-03, grad_scale: 8.0 2023-12-04 13:49:44,956 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=312600.0, ans=0.125 2023-12-04 13:49:50,031 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=312600.0, ans=0.1 2023-12-04 13:49:53,009 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=312666.6666666667, ans=0.125 2023-12-04 13:49:54,111 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=312666.6666666667, ans=0.125 2023-12-04 13:50:20,924 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=312733.3333333333, ans=0.1 2023-12-04 13:50:55,626 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=312866.6666666667, ans=0.125 2023-12-04 13:50:58,407 INFO [train.py:1087] (2/4) Epoch 53, batch 400, loss[loss=0.1907, simple_loss=0.2709, pruned_loss=0.05524, over 16918.00 frames. ], tot_loss[loss=0.1551, simple_loss=0.2478, pruned_loss=0.03122, over 4151067.71 frames. 
], batch size: 176, lr: 4.63e-03, grad_scale: 16.0 2023-12-04 13:51:13,211 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=313000.0, ans=0.1 2023-12-04 13:51:28,290 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=313000.0, ans=0.2 2023-12-04 13:51:33,302 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=313066.6666666667, ans=0.125 2023-12-04 13:51:51,411 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=313133.3333333333, ans=0.1 2023-12-04 13:52:00,442 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.138e+02 1.298e+02 1.363e+02 1.477e+02 1.902e+02, threshold=2.727e+02, percent-clipped=0.0 2023-12-04 13:52:12,759 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=313200.0, ans=0.125 2023-12-04 13:52:19,385 INFO [train.py:1087] (2/4) Epoch 53, batch 450, loss[loss=0.1485, simple_loss=0.2426, pruned_loss=0.02716, over 24570.00 frames. ], tot_loss[loss=0.1554, simple_loss=0.248, pruned_loss=0.03143, over 4293838.79 frames. ], batch size: 65, lr: 4.63e-03, grad_scale: 16.0 2023-12-04 13:52:24,873 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=15.0 2023-12-04 13:52:32,586 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.31 vs. limit=22.5 2023-12-04 13:53:39,426 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=313600.0, ans=0.0 2023-12-04 13:53:40,433 INFO [train.py:1087] (2/4) Epoch 53, batch 500, loss[loss=0.1647, simple_loss=0.2498, pruned_loss=0.03978, over 24458.00 frames. ], tot_loss[loss=0.1548, simple_loss=0.2473, pruned_loss=0.03119, over 4418402.41 frames. 
], batch size: 77, lr: 4.63e-03, grad_scale: 16.0 2023-12-04 13:53:49,008 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=313600.0, ans=0.125 2023-12-04 13:53:53,841 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=313600.0, ans=0.125 2023-12-04 13:54:02,983 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=313666.6666666667, ans=0.1 2023-12-04 13:54:15,852 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=313733.3333333333, ans=0.04949747468305833 2023-12-04 13:54:41,102 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=313800.0, ans=0.1 2023-12-04 13:54:41,105 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=313800.0, ans=0.2 2023-12-04 13:54:43,777 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.149e+02 1.289e+02 1.379e+02 1.489e+02 2.033e+02, threshold=2.759e+02, percent-clipped=0.0 2023-12-04 13:54:48,633 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=313866.6666666667, ans=0.0 2023-12-04 13:54:49,980 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=313866.6666666667, ans=0.125 2023-12-04 13:55:00,537 INFO [train.py:1087] (2/4) Epoch 53, batch 550, loss[loss=0.1543, simple_loss=0.2474, pruned_loss=0.03057, over 24010.00 frames. ], tot_loss[loss=0.1549, simple_loss=0.2475, pruned_loss=0.03116, over 4493174.60 frames. ], batch size: 87, lr: 4.62e-03, grad_scale: 16.0 2023-12-04 13:55:17,821 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=314000.0, ans=0.125 2023-12-04 13:55:20,873 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=314000.0, ans=0.125 2023-12-04 13:55:25,161 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=314000.0, ans=0.125 2023-12-04 13:55:49,637 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=314133.3333333333, ans=0.125 2023-12-04 13:56:07,492 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=314200.0, ans=0.0 2023-12-04 13:56:09,318 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-12-04 13:56:20,862 INFO [train.py:1087] (2/4) Epoch 53, batch 600, loss[loss=0.1585, simple_loss=0.2525, pruned_loss=0.03227, over 23563.00 frames. ], tot_loss[loss=0.155, simple_loss=0.2473, pruned_loss=0.03133, over 4543038.91 frames. 
], batch size: 94, lr: 4.62e-03, grad_scale: 16.0 2023-12-04 13:56:28,793 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=314266.6666666667, ans=0.0 2023-12-04 13:56:51,069 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=314400.0, ans=0.2 2023-12-04 13:56:57,968 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.54 vs. limit=15.0 2023-12-04 13:57:21,018 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.145e+02 1.322e+02 1.398e+02 1.473e+02 2.127e+02, threshold=2.796e+02, percent-clipped=0.0 2023-12-04 13:57:27,516 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=314533.3333333333, ans=0.07 2023-12-04 13:57:35,610 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=314533.3333333333, ans=0.125 2023-12-04 13:57:38,342 INFO [train.py:1087] (2/4) Epoch 53, batch 650, loss[loss=0.15, simple_loss=0.2463, pruned_loss=0.0269, over 24554.00 frames. ], tot_loss[loss=0.1547, simple_loss=0.2471, pruned_loss=0.0311, over 4605738.96 frames. ], batch size: 66, lr: 4.62e-03, grad_scale: 16.0 2023-12-04 13:57:38,752 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=314600.0, ans=0.1 2023-12-04 13:57:46,607 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=314600.0, ans=0.0 2023-12-04 13:57:51,074 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=314600.0, ans=0.0 2023-12-04 13:58:45,536 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=314866.6666666667, ans=0.125 2023-12-04 13:58:50,107 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=314866.6666666667, ans=0.0 2023-12-04 13:58:54,169 INFO [train.py:1087] (2/4) Epoch 53, batch 700, loss[loss=0.1683, simple_loss=0.2576, pruned_loss=0.03956, over 24024.00 frames. ], tot_loss[loss=0.1543, simple_loss=0.2468, pruned_loss=0.03084, over 4661764.50 frames. ], batch size: 87, lr: 4.62e-03, grad_scale: 16.0 2023-12-04 13:59:35,178 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.37 vs. limit=15.0 2023-12-04 13:59:53,625 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.088e+02 1.311e+02 1.391e+02 1.533e+02 2.138e+02, threshold=2.782e+02, percent-clipped=0.0 2023-12-04 14:00:02,860 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=315200.0, ans=0.0 2023-12-04 14:00:09,970 INFO [train.py:1087] (2/4) Epoch 53, batch 750, loss[loss=0.165, simple_loss=0.2576, pruned_loss=0.03618, over 24758.00 frames. ], tot_loss[loss=0.1542, simple_loss=0.2466, pruned_loss=0.03091, over 4692759.33 frames. 
], batch size: 70, lr: 4.62e-03, grad_scale: 16.0 2023-12-04 14:00:16,875 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=315266.6666666667, ans=0.125 2023-12-04 14:00:33,334 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=315333.3333333333, ans=0.125 2023-12-04 14:00:37,698 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=315333.3333333333, ans=0.05 2023-12-04 14:01:01,332 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=315466.6666666667, ans=0.0 2023-12-04 14:01:01,887 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.36 vs. limit=22.5 2023-12-04 14:01:25,400 INFO [train.py:1087] (2/4) Epoch 53, batch 800, loss[loss=0.1484, simple_loss=0.2413, pruned_loss=0.02776, over 24615.00 frames. ], tot_loss[loss=0.1541, simple_loss=0.2466, pruned_loss=0.03078, over 4730016.36 frames. ], batch size: 68, lr: 4.61e-03, grad_scale: 32.0 2023-12-04 14:01:30,601 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.09 vs. limit=22.5 2023-12-04 14:02:01,377 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.01 vs. limit=15.0 2023-12-04 14:02:18,626 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=315800.0, ans=0.0 2023-12-04 14:02:19,640 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.282e+02 1.351e+02 1.470e+02 2.116e+02, threshold=2.703e+02, percent-clipped=0.0 2023-12-04 14:02:22,727 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=315866.6666666667, ans=0.1 2023-12-04 14:02:29,305 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=315866.6666666667, ans=0.125 2023-12-04 14:02:34,112 INFO [train.py:1087] (2/4) Epoch 53, batch 850, loss[loss=0.15, simple_loss=0.2473, pruned_loss=0.02633, over 21654.00 frames. ], tot_loss[loss=0.1539, simple_loss=0.2464, pruned_loss=0.03066, over 4744899.41 frames. 
], batch size: 128, lr: 4.61e-03, grad_scale: 16.0 2023-12-04 14:02:37,097 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=315933.3333333333, ans=0.0 2023-12-04 14:02:56,508 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=316000.0, ans=0.125 2023-12-04 14:02:57,909 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=316000.0, ans=0.0 2023-12-04 14:02:57,945 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=316000.0, ans=0.125 2023-12-04 14:03:31,989 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=316200.0, ans=0.125 2023-12-04 14:03:50,894 INFO [train.py:1087] (2/4) Epoch 54, batch 0, loss[loss=0.1463, simple_loss=0.2386, pruned_loss=0.02704, over 24746.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.2386, pruned_loss=0.02704, over 24746.00 frames. ], batch size: 61, lr: 4.56e-03, grad_scale: 32.0 2023-12-04 14:03:50,895 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 14:04:07,109 INFO [train.py:1119] (2/4) Epoch 54, validation: loss=0.1516, simple_loss=0.249, pruned_loss=0.02707, over 944034.00 frames. 2023-12-04 14:04:07,110 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 14:04:34,319 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=316300.0, ans=0.125 2023-12-04 14:04:37,146 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=316366.6666666667, ans=0.125 2023-12-04 14:05:05,471 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=316433.3333333333, ans=0.0 2023-12-04 14:05:15,774 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.105e+02 1.288e+02 1.377e+02 1.512e+02 2.121e+02, threshold=2.755e+02, percent-clipped=0.0 2023-12-04 14:05:19,979 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.25 vs. limit=15.0 2023-12-04 14:05:23,543 INFO [train.py:1087] (2/4) Epoch 54, batch 50, loss[loss=0.1477, simple_loss=0.2398, pruned_loss=0.02778, over 24769.00 frames. ], tot_loss[loss=0.1534, simple_loss=0.247, pruned_loss=0.02987, over 1095381.45 frames. ], batch size: 64, lr: 4.56e-03, grad_scale: 32.0 2023-12-04 14:05:51,788 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=316633.3333333333, ans=0.125 2023-12-04 14:06:29,573 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=316833.3333333333, ans=0.125 2023-12-04 14:06:33,838 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=316833.3333333333, ans=0.0 2023-12-04 14:06:35,613 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.38 vs. limit=10.0 2023-12-04 14:06:40,754 INFO [train.py:1087] (2/4) Epoch 54, batch 100, loss[loss=0.1621, simple_loss=0.2538, pruned_loss=0.03516, over 24558.00 frames. 
], tot_loss[loss=0.154, simple_loss=0.2473, pruned_loss=0.03037, over 1925491.34 frames. ], batch size: 63, lr: 4.56e-03, grad_scale: 32.0 2023-12-04 14:07:08,524 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=316966.6666666667, ans=0.125 2023-12-04 14:07:41,740 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=317166.6666666667, ans=0.0 2023-12-04 14:07:47,096 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=317166.6666666667, ans=0.2 2023-12-04 14:07:51,168 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.075e+02 1.240e+02 1.325e+02 1.426e+02 1.923e+02, threshold=2.651e+02, percent-clipped=0.0 2023-12-04 14:07:58,968 INFO [train.py:1087] (2/4) Epoch 54, batch 150, loss[loss=0.1497, simple_loss=0.2377, pruned_loss=0.03088, over 24787.00 frames. ], tot_loss[loss=0.1539, simple_loss=0.247, pruned_loss=0.03043, over 2564677.29 frames. ], batch size: 71, lr: 4.56e-03, grad_scale: 32.0 2023-12-04 14:08:03,296 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.96 vs. limit=22.5 2023-12-04 14:08:04,501 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=317233.3333333333, ans=0.0 2023-12-04 14:08:41,799 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=317366.6666666667, ans=0.125 2023-12-04 14:09:02,450 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=317500.0, ans=0.125 2023-12-04 14:09:17,737 INFO [train.py:1087] (2/4) Epoch 54, batch 200, loss[loss=0.1675, simple_loss=0.2581, pruned_loss=0.03846, over 24486.00 frames. ], tot_loss[loss=0.1545, simple_loss=0.2473, pruned_loss=0.03079, over 3053111.95 frames. ], batch size: 77, lr: 4.56e-03, grad_scale: 32.0 2023-12-04 14:09:25,893 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=317566.6666666667, ans=0.125 2023-12-04 14:09:45,870 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=317633.3333333333, ans=0.0 2023-12-04 14:09:47,478 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=317633.3333333333, ans=0.0 2023-12-04 14:10:01,303 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=317700.0, ans=0.125 2023-12-04 14:10:15,071 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-12-04 14:10:16,127 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=317766.6666666667, ans=0.125 2023-12-04 14:10:28,796 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.142e+02 1.325e+02 1.399e+02 1.551e+02 2.197e+02, threshold=2.799e+02, percent-clipped=0.0 2023-12-04 14:10:36,674 INFO [train.py:1087] (2/4) Epoch 54, batch 250, loss[loss=0.1492, simple_loss=0.2417, pruned_loss=0.02838, over 24574.00 frames. 
], tot_loss[loss=0.1546, simple_loss=0.247, pruned_loss=0.03104, over 3442949.77 frames. ], batch size: 65, lr: 4.55e-03, grad_scale: 32.0 2023-12-04 14:11:02,833 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=317966.6666666667, ans=0.2 2023-12-04 14:11:30,052 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=318100.0, ans=0.09899494936611666 2023-12-04 14:11:54,196 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=318166.6666666667, ans=0.125 2023-12-04 14:11:58,182 INFO [train.py:1087] (2/4) Epoch 54, batch 300, loss[loss=0.1533, simple_loss=0.2455, pruned_loss=0.03059, over 24784.00 frames. ], tot_loss[loss=0.1547, simple_loss=0.2472, pruned_loss=0.03111, over 3747428.97 frames. ], batch size: 72, lr: 4.55e-03, grad_scale: 16.0 2023-12-04 14:12:34,922 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=318366.6666666667, ans=0.2 2023-12-04 14:12:55,830 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=318433.3333333333, ans=0.0 2023-12-04 14:13:15,574 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.067e+02 1.258e+02 1.370e+02 1.479e+02 2.899e+02, threshold=2.739e+02, percent-clipped=1.0 2023-12-04 14:13:19,252 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=318500.0, ans=0.05 2023-12-04 14:13:22,259 INFO [train.py:1087] (2/4) Epoch 54, batch 350, loss[loss=0.1558, simple_loss=0.2449, pruned_loss=0.03339, over 24707.00 frames. ], tot_loss[loss=0.1551, simple_loss=0.2476, pruned_loss=0.03136, over 3981547.80 frames. ], batch size: 74, lr: 4.55e-03, grad_scale: 16.0 2023-12-04 14:13:55,769 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 14:14:02,178 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=318700.0, ans=0.125 2023-12-04 14:14:11,551 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=318766.6666666667, ans=0.025 2023-12-04 14:14:11,991 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.50 vs. limit=15.0 2023-12-04 14:14:21,161 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=318766.6666666667, ans=0.125 2023-12-04 14:14:43,945 INFO [train.py:1087] (2/4) Epoch 54, batch 400, loss[loss=0.1467, simple_loss=0.2377, pruned_loss=0.02784, over 24755.00 frames. ], tot_loss[loss=0.1543, simple_loss=0.2468, pruned_loss=0.03089, over 4185086.62 frames. ], batch size: 63, lr: 4.55e-03, grad_scale: 32.0 2023-12-04 14:14:59,021 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.22 vs. 
limit=12.0 2023-12-04 14:15:04,943 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=318966.6666666667, ans=0.0 2023-12-04 14:15:32,479 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=319100.0, ans=0.0 2023-12-04 14:15:59,051 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.117e+02 1.256e+02 1.328e+02 1.463e+02 1.976e+02, threshold=2.656e+02, percent-clipped=0.0 2023-12-04 14:16:01,567 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-12-04 14:16:05,811 INFO [train.py:1087] (2/4) Epoch 54, batch 450, loss[loss=0.1458, simple_loss=0.2434, pruned_loss=0.02407, over 24860.00 frames. ], tot_loss[loss=0.1543, simple_loss=0.2469, pruned_loss=0.03087, over 4317248.71 frames. ], batch size: 68, lr: 4.54e-03, grad_scale: 32.0 2023-12-04 14:16:11,866 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=319233.3333333333, ans=0.0 2023-12-04 14:16:16,841 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=319233.3333333333, ans=0.0 2023-12-04 14:16:48,457 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.46 vs. limit=15.0 2023-12-04 14:17:12,371 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=319500.0, ans=0.125 2023-12-04 14:17:14,050 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=319500.0, ans=0.05 2023-12-04 14:17:21,495 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=319500.0, ans=0.2 2023-12-04 14:17:27,371 INFO [train.py:1087] (2/4) Epoch 54, batch 500, loss[loss=0.1516, simple_loss=0.242, pruned_loss=0.03062, over 24754.00 frames. ], tot_loss[loss=0.1546, simple_loss=0.2472, pruned_loss=0.03101, over 4421244.05 frames. ], batch size: 70, lr: 4.54e-03, grad_scale: 32.0 2023-12-04 14:17:36,732 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=319566.6666666667, ans=0.125 2023-12-04 14:17:48,257 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=319633.3333333333, ans=0.125 2023-12-04 14:18:33,189 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.48 vs. limit=22.5 2023-12-04 14:18:34,141 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=319833.3333333333, ans=0.09899494936611666 2023-12-04 14:18:39,571 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.118e+02 1.327e+02 1.410e+02 1.630e+02 2.218e+02, threshold=2.821e+02, percent-clipped=0.0 2023-12-04 14:18:45,419 INFO [train.py:1087] (2/4) Epoch 54, batch 550, loss[loss=0.1538, simple_loss=0.2466, pruned_loss=0.03048, over 24730.00 frames. ], tot_loss[loss=0.1542, simple_loss=0.2468, pruned_loss=0.0308, over 4515482.30 frames. 
], batch size: 67, lr: 4.54e-03, grad_scale: 32.0 2023-12-04 14:18:53,299 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=319900.0, ans=0.125 2023-12-04 14:18:59,005 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=319966.6666666667, ans=0.2 2023-12-04 14:19:03,939 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=319966.6666666667, ans=0.1 2023-12-04 14:19:13,443 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=319966.6666666667, ans=0.2 2023-12-04 14:19:18,218 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 14:19:32,319 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=320033.3333333333, ans=0.09899494936611666 2023-12-04 14:19:33,871 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=320033.3333333333, ans=0.1 2023-12-04 14:19:36,578 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=320100.0, ans=0.2 2023-12-04 14:19:41,483 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=320100.0, ans=0.0 2023-12-04 14:19:44,950 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.88 vs. limit=15.0 2023-12-04 14:19:56,622 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=320166.6666666667, ans=0.0 2023-12-04 14:20:07,480 INFO [train.py:1087] (2/4) Epoch 54, batch 600, loss[loss=0.1951, simple_loss=0.2769, pruned_loss=0.0566, over 16576.00 frames. ], tot_loss[loss=0.1546, simple_loss=0.2471, pruned_loss=0.03105, over 4578789.41 frames. ], batch size: 177, lr: 4.54e-03, grad_scale: 32.0 2023-12-04 14:20:15,192 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=320233.3333333333, ans=0.0 2023-12-04 14:20:18,735 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.out_whiten.whitening_limit, batch_count=320233.3333333333, ans=8.0 2023-12-04 14:20:43,305 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=22.5 2023-12-04 14:21:17,899 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.127e+02 1.296e+02 1.398e+02 1.549e+02 2.118e+02, threshold=2.797e+02, percent-clipped=0.0 2023-12-04 14:21:23,700 INFO [train.py:1087] (2/4) Epoch 54, batch 650, loss[loss=0.1588, simple_loss=0.2523, pruned_loss=0.03261, over 23889.00 frames. ], tot_loss[loss=0.1543, simple_loss=0.2469, pruned_loss=0.03089, over 4646901.30 frames. ], batch size: 87, lr: 4.53e-03, grad_scale: 32.0 2023-12-04 14:21:26,151 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.94 vs. 
limit=12.0 2023-12-04 14:21:41,108 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=320633.3333333333, ans=0.0 2023-12-04 14:21:57,880 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=320700.0, ans=0.125 2023-12-04 14:22:02,210 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=320700.0, ans=0.0 2023-12-04 14:22:11,532 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=320766.6666666667, ans=0.125 2023-12-04 14:22:38,516 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=320900.0, ans=0.0 2023-12-04 14:22:39,831 INFO [train.py:1087] (2/4) Epoch 54, batch 700, loss[loss=0.1468, simple_loss=0.2409, pruned_loss=0.02629, over 24762.00 frames. ], tot_loss[loss=0.1546, simple_loss=0.2471, pruned_loss=0.03111, over 4668253.51 frames. ], batch size: 70, lr: 4.53e-03, grad_scale: 32.0 2023-12-04 14:23:33,470 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=321100.0, ans=0.95 2023-12-04 14:23:39,462 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=321166.6666666667, ans=0.0 2023-12-04 14:23:44,334 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.36 vs. limit=15.0 2023-12-04 14:23:48,423 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.61 vs. limit=12.0 2023-12-04 14:23:49,056 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.069e+02 1.277e+02 1.367e+02 1.520e+02 2.046e+02, threshold=2.734e+02, percent-clipped=0.0 2023-12-04 14:23:54,021 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=321233.3333333333, ans=0.125 2023-12-04 14:23:55,354 INFO [train.py:1087] (2/4) Epoch 54, batch 750, loss[loss=0.1574, simple_loss=0.2533, pruned_loss=0.03078, over 24738.00 frames. ], tot_loss[loss=0.1546, simple_loss=0.247, pruned_loss=0.0311, over 4688807.49 frames. ], batch size: 63, lr: 4.53e-03, grad_scale: 32.0 2023-12-04 14:24:19,526 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.96 vs. limit=10.0 2023-12-04 14:24:20,713 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=321300.0, ans=0.1 2023-12-04 14:24:50,182 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=321433.3333333333, ans=0.125 2023-12-04 14:24:56,143 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=321500.0, ans=0.2 2023-12-04 14:25:09,680 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=321500.0, ans=0.1 2023-12-04 14:25:12,330 INFO [train.py:1087] (2/4) Epoch 54, batch 800, loss[loss=0.1517, simple_loss=0.2465, pruned_loss=0.0285, over 24559.00 frames. 
], tot_loss[loss=0.1541, simple_loss=0.2466, pruned_loss=0.03083, over 4726231.25 frames. ], batch size: 66, lr: 4.53e-03, grad_scale: 32.0 2023-12-04 14:25:51,757 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.13 vs. limit=15.0 2023-12-04 14:26:02,185 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=321766.6666666667, ans=0.0 2023-12-04 14:26:13,017 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=321833.3333333333, ans=0.0 2023-12-04 14:26:16,610 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.129e+02 1.254e+02 1.363e+02 1.450e+02 1.749e+02, threshold=2.726e+02, percent-clipped=0.0 2023-12-04 14:26:21,932 INFO [train.py:1087] (2/4) Epoch 54, batch 850, loss[loss=0.1518, simple_loss=0.2433, pruned_loss=0.03012, over 24195.00 frames. ], tot_loss[loss=0.1545, simple_loss=0.2469, pruned_loss=0.03101, over 4735587.56 frames. ], batch size: 82, lr: 4.53e-03, grad_scale: 32.0 2023-12-04 14:26:28,759 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=321900.0, ans=0.125 2023-12-04 14:26:47,580 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=322033.3333333333, ans=0.07 2023-12-04 14:27:17,415 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=322166.6666666667, ans=0.125 2023-12-04 14:27:45,149 INFO [train.py:1087] (2/4) Epoch 55, batch 0, loss[loss=0.1414, simple_loss=0.2375, pruned_loss=0.02261, over 24709.00 frames. ], tot_loss[loss=0.1414, simple_loss=0.2375, pruned_loss=0.02261, over 24709.00 frames. ], batch size: 69, lr: 4.48e-03, grad_scale: 32.0 2023-12-04 14:27:45,150 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 14:28:01,824 INFO [train.py:1119] (2/4) Epoch 55, validation: loss=0.1514, simple_loss=0.2492, pruned_loss=0.02683, over 944034.00 frames. 2023-12-04 14:28:01,825 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 14:28:09,567 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 14:28:38,315 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.79 vs. limit=22.5 2023-12-04 14:28:39,157 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=322333.3333333333, ans=0.1 2023-12-04 14:28:52,837 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.39 vs. limit=12.0 2023-12-04 14:29:03,847 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=322466.6666666667, ans=0.2 2023-12-04 14:29:07,693 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.16 vs. limit=22.5 2023-12-04 14:29:15,494 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.58 vs. 
limit=15.0 2023-12-04 14:29:16,111 INFO [train.py:1087] (2/4) Epoch 55, batch 50, loss[loss=0.1597, simple_loss=0.2567, pruned_loss=0.03138, over 24841.00 frames. ], tot_loss[loss=0.1554, simple_loss=0.2478, pruned_loss=0.0315, over 1089394.97 frames. ], batch size: 68, lr: 4.48e-03, grad_scale: 32.0 2023-12-04 14:29:19,671 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.077e+02 1.275e+02 1.387e+02 1.532e+02 2.766e+02, threshold=2.774e+02, percent-clipped=1.0 2023-12-04 14:29:24,445 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=322533.3333333333, ans=0.0 2023-12-04 14:29:35,124 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.98 vs. limit=22.5 2023-12-04 14:29:37,736 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.64 vs. limit=22.5 2023-12-04 14:30:05,662 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.16 vs. limit=15.0 2023-12-04 14:30:31,030 INFO [train.py:1087] (2/4) Epoch 55, batch 100, loss[loss=0.1577, simple_loss=0.2503, pruned_loss=0.03257, over 24764.00 frames. ], tot_loss[loss=0.155, simple_loss=0.2477, pruned_loss=0.03119, over 1914234.40 frames. ], batch size: 70, lr: 4.48e-03, grad_scale: 32.0 2023-12-04 14:30:59,697 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-12-04 14:31:01,483 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.00 vs. limit=15.0 2023-12-04 14:31:12,159 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=323000.0, ans=0.125 2023-12-04 14:31:13,387 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=323000.0, ans=0.1 2023-12-04 14:31:43,780 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=323133.3333333333, ans=0.0 2023-12-04 14:31:46,368 INFO [train.py:1087] (2/4) Epoch 55, batch 150, loss[loss=0.16, simple_loss=0.249, pruned_loss=0.03554, over 24510.00 frames. ], tot_loss[loss=0.1539, simple_loss=0.2465, pruned_loss=0.0307, over 2557138.11 frames. ], batch size: 75, lr: 4.47e-03, grad_scale: 16.0 2023-12-04 14:31:50,750 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.117e+02 1.281e+02 1.343e+02 1.455e+02 3.000e+02, threshold=2.685e+02, percent-clipped=1.0 2023-12-04 14:31:55,456 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 14:32:11,122 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=323266.6666666667, ans=0.0 2023-12-04 14:32:26,147 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.23 vs. limit=12.0 2023-12-04 14:33:02,209 INFO [train.py:1087] (2/4) Epoch 55, batch 200, loss[loss=0.1628, simple_loss=0.254, pruned_loss=0.03577, over 24283.00 frames. ], tot_loss[loss=0.1538, simple_loss=0.2465, pruned_loss=0.03061, over 3066863.72 frames. 
], batch size: 79, lr: 4.47e-03, grad_scale: 16.0 2023-12-04 14:33:05,747 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.92 vs. limit=15.0 2023-12-04 14:33:09,416 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=323533.3333333333, ans=0.09899494936611666 2023-12-04 14:33:24,184 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.60 vs. limit=6.0 2023-12-04 14:33:30,462 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=323600.0, ans=0.125 2023-12-04 14:33:57,280 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=323733.3333333333, ans=0.2 2023-12-04 14:34:18,212 INFO [train.py:1087] (2/4) Epoch 55, batch 250, loss[loss=0.1502, simple_loss=0.2392, pruned_loss=0.03062, over 24020.00 frames. ], tot_loss[loss=0.154, simple_loss=0.2466, pruned_loss=0.03072, over 3455937.51 frames. ], batch size: 87, lr: 4.47e-03, grad_scale: 16.0 2023-12-04 14:34:22,535 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.126e+02 1.264e+02 1.356e+02 1.493e+02 1.911e+02, threshold=2.712e+02, percent-clipped=0.0 2023-12-04 14:34:45,795 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=323933.3333333333, ans=0.07 2023-12-04 14:35:35,496 INFO [train.py:1087] (2/4) Epoch 55, batch 300, loss[loss=0.1476, simple_loss=0.2426, pruned_loss=0.02636, over 24714.00 frames. ], tot_loss[loss=0.1544, simple_loss=0.2469, pruned_loss=0.03094, over 3754887.84 frames. ], batch size: 74, lr: 4.47e-03, grad_scale: 16.0 2023-12-04 14:35:44,446 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=324200.0, ans=0.0 2023-12-04 14:35:45,859 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=324200.0, ans=0.125 2023-12-04 14:35:47,582 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=324200.0, ans=0.125 2023-12-04 14:35:51,967 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=324266.6666666667, ans=0.125 2023-12-04 14:36:28,595 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=324400.0, ans=0.2 2023-12-04 14:36:51,507 INFO [train.py:1087] (2/4) Epoch 55, batch 350, loss[loss=0.1462, simple_loss=0.2393, pruned_loss=0.02654, over 24725.00 frames. ], tot_loss[loss=0.1546, simple_loss=0.2472, pruned_loss=0.03099, over 3975845.83 frames. 
], batch size: 67, lr: 4.47e-03, grad_scale: 16.0 2023-12-04 14:36:55,917 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.238e+02 1.351e+02 1.486e+02 2.118e+02, threshold=2.702e+02, percent-clipped=0.0 2023-12-04 14:37:26,122 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=324666.6666666667, ans=0.125 2023-12-04 14:37:26,215 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=324666.6666666667, ans=0.05 2023-12-04 14:37:37,199 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=324733.3333333333, ans=0.0 2023-12-04 14:38:02,315 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.03 vs. limit=12.0 2023-12-04 14:38:09,146 INFO [train.py:1087] (2/4) Epoch 55, batch 400, loss[loss=0.1508, simple_loss=0.2429, pruned_loss=0.02935, over 24758.00 frames. ], tot_loss[loss=0.154, simple_loss=0.2467, pruned_loss=0.03066, over 4163901.22 frames. ], batch size: 65, lr: 4.46e-03, grad_scale: 32.0 2023-12-04 14:38:09,648 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=324866.6666666667, ans=0.2 2023-12-04 14:38:47,204 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.81 vs. limit=15.0 2023-12-04 14:39:11,310 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=325133.3333333333, ans=0.09899494936611666 2023-12-04 14:39:15,989 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=325133.3333333333, ans=0.125 2023-12-04 14:39:25,761 INFO [train.py:1087] (2/4) Epoch 55, batch 450, loss[loss=0.1626, simple_loss=0.2577, pruned_loss=0.03374, over 24790.00 frames. ], tot_loss[loss=0.1541, simple_loss=0.2468, pruned_loss=0.03066, over 4313176.49 frames. ], batch size: 62, lr: 4.46e-03, grad_scale: 32.0 2023-12-04 14:39:27,577 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=325200.0, ans=0.125 2023-12-04 14:39:27,592 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=325200.0, ans=0.0 2023-12-04 14:39:30,012 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.072e+02 1.294e+02 1.407e+02 1.497e+02 1.971e+02, threshold=2.814e+02, percent-clipped=0.0 2023-12-04 14:39:43,724 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=325266.6666666667, ans=0.125 2023-12-04 14:39:48,122 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten.whitening_limit, batch_count=325266.6666666667, ans=15.0 2023-12-04 14:39:51,320 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.22 vs. 
limit=15.0 2023-12-04 14:39:56,234 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=325333.3333333333, ans=0.0 2023-12-04 14:40:00,921 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.48 vs. limit=10.0 2023-12-04 14:40:13,466 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=325400.0, ans=0.125 2023-12-04 14:40:21,360 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.18 vs. limit=15.0 2023-12-04 14:40:22,458 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=325400.0, ans=0.09899494936611666 2023-12-04 14:40:22,469 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=325400.0, ans=0.125 2023-12-04 14:40:34,731 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=325466.6666666667, ans=0.1 2023-12-04 14:40:41,069 INFO [train.py:1087] (2/4) Epoch 55, batch 500, loss[loss=0.1527, simple_loss=0.2502, pruned_loss=0.02765, over 24758.00 frames. ], tot_loss[loss=0.1537, simple_loss=0.2463, pruned_loss=0.0305, over 4431797.09 frames. ], batch size: 71, lr: 4.46e-03, grad_scale: 32.0 2023-12-04 14:40:42,918 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=325533.3333333333, ans=0.125 2023-12-04 14:40:53,492 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.33 vs. limit=15.0 2023-12-04 14:41:08,148 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=325600.0, ans=0.2 2023-12-04 14:41:10,836 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 14:41:17,368 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=325666.6666666667, ans=0.1 2023-12-04 14:41:44,511 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=325800.0, ans=0.1 2023-12-04 14:41:55,211 INFO [train.py:1087] (2/4) Epoch 55, batch 550, loss[loss=0.1532, simple_loss=0.249, pruned_loss=0.02872, over 24692.00 frames. ], tot_loss[loss=0.1537, simple_loss=0.2465, pruned_loss=0.03048, over 4511639.19 frames. ], batch size: 74, lr: 4.46e-03, grad_scale: 32.0 2023-12-04 14:42:00,040 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.114e+02 1.269e+02 1.374e+02 1.501e+02 2.077e+02, threshold=2.748e+02, percent-clipped=0.0 2023-12-04 14:42:00,890 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.17 vs. 
limit=22.5 2023-12-04 14:42:17,525 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=325933.3333333333, ans=0.0 2023-12-04 14:42:38,316 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=326000.0, ans=0.125 2023-12-04 14:43:12,529 INFO [train.py:1087] (2/4) Epoch 55, batch 600, loss[loss=0.1522, simple_loss=0.2422, pruned_loss=0.03112, over 24744.00 frames. ], tot_loss[loss=0.1537, simple_loss=0.2465, pruned_loss=0.03044, over 4576796.88 frames. ], batch size: 63, lr: 4.45e-03, grad_scale: 32.0 2023-12-04 14:43:14,375 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=326200.0, ans=0.125 2023-12-04 14:44:09,799 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=326400.0, ans=0.0 2023-12-04 14:44:11,590 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.34 vs. limit=22.5 2023-12-04 14:44:14,621 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=326466.6666666667, ans=0.125 2023-12-04 14:44:23,374 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=326466.6666666667, ans=0.0 2023-12-04 14:44:29,677 INFO [train.py:1087] (2/4) Epoch 55, batch 650, loss[loss=0.1514, simple_loss=0.2451, pruned_loss=0.02883, over 24772.00 frames. ], tot_loss[loss=0.1543, simple_loss=0.2469, pruned_loss=0.03082, over 4606986.49 frames. ], batch size: 64, lr: 4.45e-03, grad_scale: 32.0 2023-12-04 14:44:34,254 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.100e+02 1.282e+02 1.381e+02 1.479e+02 1.836e+02, threshold=2.762e+02, percent-clipped=0.0 2023-12-04 14:44:51,960 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=326600.0, ans=0.0 2023-12-04 14:45:03,438 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=326666.6666666667, ans=0.1 2023-12-04 14:45:07,965 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=326666.6666666667, ans=0.0 2023-12-04 14:45:14,943 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=326733.3333333333, ans=0.0 2023-12-04 14:45:16,905 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.51 vs. limit=6.0 2023-12-04 14:45:18,830 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=326733.3333333333, ans=0.0 2023-12-04 14:45:39,424 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=326800.0, ans=0.07 2023-12-04 14:45:46,482 INFO [train.py:1087] (2/4) Epoch 55, batch 700, loss[loss=0.1636, simple_loss=0.2499, pruned_loss=0.03864, over 23507.00 frames. ], tot_loss[loss=0.1544, simple_loss=0.2469, pruned_loss=0.03091, over 4660991.69 frames. 
], batch size: 94, lr: 4.45e-03, grad_scale: 32.0 2023-12-04 14:46:17,580 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.22 vs. limit=15.0 2023-12-04 14:46:26,566 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=327000.0, ans=0.125 2023-12-04 14:46:27,907 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=327000.0, ans=0.125 2023-12-04 14:46:30,519 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=327066.6666666667, ans=0.1 2023-12-04 14:46:44,984 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=327066.6666666667, ans=0.1 2023-12-04 14:47:01,639 INFO [train.py:1087] (2/4) Epoch 55, batch 750, loss[loss=0.1513, simple_loss=0.2467, pruned_loss=0.02792, over 24721.00 frames. ], tot_loss[loss=0.1543, simple_loss=0.2468, pruned_loss=0.03086, over 4699681.95 frames. ], batch size: 69, lr: 4.45e-03, grad_scale: 32.0 2023-12-04 14:47:06,270 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.109e+02 1.251e+02 1.375e+02 1.484e+02 2.107e+02, threshold=2.750e+02, percent-clipped=0.0 2023-12-04 14:47:10,424 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.64 vs. limit=10.0 2023-12-04 14:47:25,411 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.81 vs. limit=15.0 2023-12-04 14:47:34,157 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=327333.3333333333, ans=0.125 2023-12-04 14:47:56,801 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.16 vs. limit=15.0 2023-12-04 14:48:14,518 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=327466.6666666667, ans=0.125 2023-12-04 14:48:17,111 INFO [train.py:1087] (2/4) Epoch 55, batch 800, loss[loss=0.1497, simple_loss=0.2413, pruned_loss=0.02904, over 24139.00 frames. ], tot_loss[loss=0.1539, simple_loss=0.2464, pruned_loss=0.03073, over 4733642.60 frames. ], batch size: 58, lr: 4.45e-03, grad_scale: 32.0 2023-12-04 14:48:21,874 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=327533.3333333333, ans=0.1 2023-12-04 14:48:30,148 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=327600.0, ans=0.125 2023-12-04 14:48:37,764 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.30 vs. 
limit=15.0 2023-12-04 14:48:52,556 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=327666.6666666667, ans=0.125 2023-12-04 14:48:56,493 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=327666.6666666667, ans=0.1 2023-12-04 14:49:22,355 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=327800.0, ans=0.125 2023-12-04 14:49:25,750 INFO [train.py:1087] (2/4) Epoch 55, batch 850, loss[loss=0.1497, simple_loss=0.245, pruned_loss=0.02722, over 24758.00 frames. ], tot_loss[loss=0.1541, simple_loss=0.2464, pruned_loss=0.03087, over 4743549.77 frames. ], batch size: 71, lr: 4.44e-03, grad_scale: 32.0 2023-12-04 14:49:30,091 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.086e+02 1.287e+02 1.437e+02 1.562e+02 2.084e+02, threshold=2.874e+02, percent-clipped=0.0 2023-12-04 14:50:13,094 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=328066.6666666667, ans=0.125 2023-12-04 14:50:47,365 INFO [train.py:1087] (2/4) Epoch 56, batch 0, loss[loss=0.1628, simple_loss=0.2521, pruned_loss=0.03678, over 23999.00 frames. ], tot_loss[loss=0.1628, simple_loss=0.2521, pruned_loss=0.03678, over 23999.00 frames. ], batch size: 87, lr: 4.40e-03, grad_scale: 32.0 2023-12-04 14:50:47,366 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 14:51:03,637 INFO [train.py:1119] (2/4) Epoch 56, validation: loss=0.1512, simple_loss=0.2487, pruned_loss=0.0268, over 944034.00 frames. 2023-12-04 14:51:03,638 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 14:51:45,088 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.18 vs. limit=22.5 2023-12-04 14:51:45,170 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.59 vs. limit=15.0 2023-12-04 14:51:57,639 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=328366.6666666667, ans=0.125 2023-12-04 14:51:57,818 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=328366.6666666667, ans=10.0 2023-12-04 14:52:17,904 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=328433.3333333333, ans=0.125 2023-12-04 14:52:20,355 INFO [train.py:1087] (2/4) Epoch 56, batch 50, loss[loss=0.1549, simple_loss=0.248, pruned_loss=0.03095, over 24556.00 frames. ], tot_loss[loss=0.1541, simple_loss=0.2465, pruned_loss=0.03084, over 1080474.76 frames. 
], batch size: 66, lr: 4.40e-03, grad_scale: 16.0 2023-12-04 14:52:20,684 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=328500.0, ans=0.0 2023-12-04 14:52:35,596 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.093e+02 1.319e+02 1.474e+02 1.645e+02 2.645e+02, threshold=2.949e+02, percent-clipped=0.0 2023-12-04 14:53:16,930 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=328700.0, ans=0.5 2023-12-04 14:53:36,416 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=328833.3333333333, ans=0.0 2023-12-04 14:53:37,589 INFO [train.py:1087] (2/4) Epoch 56, batch 100, loss[loss=0.1586, simple_loss=0.2548, pruned_loss=0.03122, over 23472.00 frames. ], tot_loss[loss=0.1535, simple_loss=0.2464, pruned_loss=0.03035, over 1906508.40 frames. ], batch size: 94, lr: 4.40e-03, grad_scale: 16.0 2023-12-04 14:53:41,326 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=328833.3333333333, ans=0.0 2023-12-04 14:53:44,299 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=328833.3333333333, ans=0.07 2023-12-04 14:53:49,935 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=328833.3333333333, ans=0.035 2023-12-04 14:54:16,764 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=328966.6666666667, ans=0.125 2023-12-04 14:54:18,964 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=328966.6666666667, ans=0.125 2023-12-04 14:54:28,125 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=329033.3333333333, ans=0.125 2023-12-04 14:54:34,268 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.17 vs. limit=22.5 2023-12-04 14:54:42,301 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.35 vs. limit=15.0 2023-12-04 14:54:53,374 INFO [train.py:1087] (2/4) Epoch 56, batch 150, loss[loss=0.1569, simple_loss=0.253, pruned_loss=0.03044, over 24072.00 frames. ], tot_loss[loss=0.1529, simple_loss=0.2458, pruned_loss=0.03002, over 2566685.70 frames. ], batch size: 87, lr: 4.39e-03, grad_scale: 16.0 2023-12-04 14:55:04,690 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.44 vs. limit=15.0 2023-12-04 14:55:05,567 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=329166.6666666667, ans=0.1 2023-12-04 14:55:09,764 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.250e+02 1.328e+02 1.469e+02 1.847e+02, threshold=2.656e+02, percent-clipped=0.0 2023-12-04 14:55:19,410 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.94 vs. 
limit=22.5 2023-12-04 14:55:50,834 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=329366.6666666667, ans=0.125 2023-12-04 14:55:59,996 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=329433.3333333333, ans=0.125 2023-12-04 14:56:09,803 INFO [train.py:1087] (2/4) Epoch 56, batch 200, loss[loss=0.1677, simple_loss=0.262, pruned_loss=0.03669, over 23385.00 frames. ], tot_loss[loss=0.153, simple_loss=0.2458, pruned_loss=0.03009, over 3072960.47 frames. ], batch size: 94, lr: 4.39e-03, grad_scale: 16.0 2023-12-04 14:56:47,958 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.54 vs. limit=15.0 2023-12-04 14:56:50,753 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.30 vs. limit=15.0 2023-12-04 14:57:18,823 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.50 vs. limit=15.0 2023-12-04 14:57:19,908 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=329766.6666666667, ans=0.125 2023-12-04 14:57:25,440 INFO [train.py:1087] (2/4) Epoch 56, batch 250, loss[loss=0.1582, simple_loss=0.2495, pruned_loss=0.03347, over 21202.00 frames. ], tot_loss[loss=0.1531, simple_loss=0.246, pruned_loss=0.03006, over 3470292.41 frames. ], batch size: 127, lr: 4.39e-03, grad_scale: 16.0 2023-12-04 14:57:39,905 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.149e+02 1.298e+02 1.398e+02 1.522e+02 2.172e+02, threshold=2.796e+02, percent-clipped=0.0 2023-12-04 14:57:50,580 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=329900.0, ans=0.2 2023-12-04 14:57:56,404 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=329966.6666666667, ans=0.125 2023-12-04 14:58:36,964 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=330100.0, ans=15.0 2023-12-04 14:58:41,388 INFO [train.py:1087] (2/4) Epoch 56, batch 300, loss[loss=0.1444, simple_loss=0.2369, pruned_loss=0.02601, over 24800.00 frames. ], tot_loss[loss=0.1528, simple_loss=0.2459, pruned_loss=0.02986, over 3775319.94 frames. ], batch size: 62, lr: 4.39e-03, grad_scale: 16.0 2023-12-04 14:59:04,018 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=330233.3333333333, ans=6.0 2023-12-04 14:59:31,100 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=330366.6666666667, ans=0.125 2023-12-04 14:59:55,187 INFO [train.py:1087] (2/4) Epoch 56, batch 350, loss[loss=0.1535, simple_loss=0.2462, pruned_loss=0.03043, over 24306.00 frames. ], tot_loss[loss=0.153, simple_loss=0.2459, pruned_loss=0.03001, over 4000176.66 frames. 
], batch size: 79, lr: 4.39e-03, grad_scale: 16.0 2023-12-04 15:00:11,340 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.168e+02 1.289e+02 1.367e+02 1.457e+02 1.920e+02, threshold=2.735e+02, percent-clipped=0.0 2023-12-04 15:00:20,589 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=330566.6666666667, ans=0.0 2023-12-04 15:00:51,606 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 15:00:53,253 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=330700.0, ans=0.125 2023-12-04 15:01:11,842 INFO [train.py:1087] (2/4) Epoch 56, batch 400, loss[loss=0.1514, simple_loss=0.2456, pruned_loss=0.0286, over 23521.00 frames. ], tot_loss[loss=0.1532, simple_loss=0.2461, pruned_loss=0.03017, over 4173932.19 frames. ], batch size: 94, lr: 4.38e-03, grad_scale: 32.0 2023-12-04 15:01:16,344 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=330833.3333333333, ans=0.0 2023-12-04 15:01:36,620 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=330900.0, ans=0.125 2023-12-04 15:01:52,277 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=330966.6666666667, ans=0.125 2023-12-04 15:02:00,665 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=331033.3333333333, ans=0.1 2023-12-04 15:02:11,183 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=331100.0, ans=0.1 2023-12-04 15:02:28,092 INFO [train.py:1087] (2/4) Epoch 56, batch 450, loss[loss=0.153, simple_loss=0.2464, pruned_loss=0.02985, over 24744.00 frames. ], tot_loss[loss=0.1534, simple_loss=0.2463, pruned_loss=0.03022, over 4322059.98 frames. ], batch size: 66, lr: 4.38e-03, grad_scale: 32.0 2023-12-04 15:02:34,387 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=331166.6666666667, ans=0.0 2023-12-04 15:02:42,982 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.090e+02 1.275e+02 1.353e+02 1.494e+02 1.982e+02, threshold=2.705e+02, percent-clipped=0.0 2023-12-04 15:02:43,354 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=331233.3333333333, ans=0.2 2023-12-04 15:02:51,521 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=331233.3333333333, ans=0.125 2023-12-04 15:03:07,695 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=22.5 2023-12-04 15:03:23,479 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.53 vs. 
limit=15.0 2023-12-04 15:03:26,001 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 15:03:28,265 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.45 vs. limit=10.0 2023-12-04 15:03:46,057 INFO [train.py:1087] (2/4) Epoch 56, batch 500, loss[loss=0.1466, simple_loss=0.2394, pruned_loss=0.02685, over 24145.00 frames. ], tot_loss[loss=0.1534, simple_loss=0.2461, pruned_loss=0.03033, over 4426858.91 frames. ], batch size: 82, lr: 4.38e-03, grad_scale: 16.0 2023-12-04 15:03:46,440 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=331500.0, ans=0.125 2023-12-04 15:03:56,848 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=331500.0, ans=0.2 2023-12-04 15:03:58,870 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=331500.0, ans=0.125 2023-12-04 15:04:24,743 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=331633.3333333333, ans=0.125 2023-12-04 15:04:27,088 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.82 vs. limit=15.0 2023-12-04 15:05:01,811 INFO [train.py:1087] (2/4) Epoch 56, batch 550, loss[loss=0.1427, simple_loss=0.2372, pruned_loss=0.02406, over 24712.00 frames. ], tot_loss[loss=0.1533, simple_loss=0.2461, pruned_loss=0.03029, over 4501177.18 frames. ], batch size: 74, lr: 4.38e-03, grad_scale: 16.0 2023-12-04 15:05:05,700 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=331833.3333333333, ans=0.1 2023-12-04 15:05:10,001 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=331833.3333333333, ans=0.1 2023-12-04 15:05:17,803 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=331900.0, ans=0.125 2023-12-04 15:05:18,661 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.086e+02 1.304e+02 1.422e+02 1.549e+02 2.533e+02, threshold=2.844e+02, percent-clipped=0.0 2023-12-04 15:05:44,069 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=331966.6666666667, ans=0.1 2023-12-04 15:05:45,349 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=331966.6666666667, ans=0.015 2023-12-04 15:06:08,059 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.88 vs. limit=22.5 2023-12-04 15:06:19,652 INFO [train.py:1087] (2/4) Epoch 56, batch 600, loss[loss=0.1409, simple_loss=0.2366, pruned_loss=0.02261, over 24620.00 frames. ], tot_loss[loss=0.1533, simple_loss=0.2459, pruned_loss=0.03028, over 4560347.92 frames. ], batch size: 68, lr: 4.38e-03, grad_scale: 16.0 2023-12-04 15:06:27,884 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=11.02 vs. 
limit=12.0 2023-12-04 15:06:35,326 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=332233.3333333333, ans=10.0 2023-12-04 15:06:41,720 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=332233.3333333333, ans=0.0 2023-12-04 15:06:43,224 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=332233.3333333333, ans=0.0 2023-12-04 15:06:44,901 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=332233.3333333333, ans=0.1 2023-12-04 15:06:50,833 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=332300.0, ans=0.0 2023-12-04 15:06:59,896 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=332300.0, ans=0.1 2023-12-04 15:07:14,113 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.61 vs. limit=22.5 2023-12-04 15:07:27,135 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=332433.3333333333, ans=0.0 2023-12-04 15:07:36,846 INFO [train.py:1087] (2/4) Epoch 56, batch 650, loss[loss=0.1629, simple_loss=0.2528, pruned_loss=0.03652, over 24279.00 frames. ], tot_loss[loss=0.1533, simple_loss=0.2459, pruned_loss=0.03032, over 4612653.63 frames. ], batch size: 79, lr: 4.37e-03, grad_scale: 16.0 2023-12-04 15:07:49,950 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.62 vs. limit=15.0 2023-12-04 15:07:53,877 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.153e+02 1.324e+02 1.446e+02 1.648e+02 2.050e+02, threshold=2.891e+02, percent-clipped=0.0 2023-12-04 15:08:31,306 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.78 vs. limit=12.0 2023-12-04 15:08:35,381 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=332700.0, ans=0.125 2023-12-04 15:08:49,797 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.40 vs. limit=22.5 2023-12-04 15:08:53,237 INFO [train.py:1087] (2/4) Epoch 56, batch 700, loss[loss=0.1589, simple_loss=0.2521, pruned_loss=0.03284, over 24511.00 frames. ], tot_loss[loss=0.1532, simple_loss=0.246, pruned_loss=0.03022, over 4653088.28 frames. 
], batch size: 77, lr: 4.37e-03, grad_scale: 16.0 2023-12-04 15:08:56,561 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=332833.3333333333, ans=0.125 2023-12-04 15:09:02,719 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=332833.3333333333, ans=0.125 2023-12-04 15:09:34,320 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=332966.6666666667, ans=0.125 2023-12-04 15:09:45,628 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.43 vs. limit=15.0 2023-12-04 15:09:55,302 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.56 vs. limit=12.0 2023-12-04 15:10:07,896 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=333100.0, ans=0.125 2023-12-04 15:10:10,774 INFO [train.py:1087] (2/4) Epoch 56, batch 750, loss[loss=0.1949, simple_loss=0.2768, pruned_loss=0.05654, over 17400.00 frames. ], tot_loss[loss=0.1531, simple_loss=0.2458, pruned_loss=0.03025, over 4690862.79 frames. ], batch size: 176, lr: 4.37e-03, grad_scale: 16.0 2023-12-04 15:10:19,520 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=333166.6666666667, ans=0.0 2023-12-04 15:10:28,021 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.050e+02 1.279e+02 1.376e+02 1.521e+02 2.298e+02, threshold=2.753e+02, percent-clipped=0.0 2023-12-04 15:10:34,375 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-12-04 15:10:45,177 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=333300.0, ans=0.09899494936611666 2023-12-04 15:10:46,422 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=333300.0, ans=0.125 2023-12-04 15:10:53,198 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.16 vs. limit=22.5 2023-12-04 15:11:01,364 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=333366.6666666667, ans=0.0 2023-12-04 15:11:19,851 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=333433.3333333333, ans=0.125 2023-12-04 15:11:25,198 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=333500.0, ans=0.0 2023-12-04 15:11:26,378 INFO [train.py:1087] (2/4) Epoch 56, batch 800, loss[loss=0.1467, simple_loss=0.2375, pruned_loss=0.0279, over 24849.00 frames. ], tot_loss[loss=0.1531, simple_loss=0.2457, pruned_loss=0.0303, over 4710749.89 frames. 
], batch size: 68, lr: 4.37e-03, grad_scale: 32.0 2023-12-04 15:11:46,583 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=333566.6666666667, ans=0.125 2023-12-04 15:11:50,785 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=333566.6666666667, ans=0.0 2023-12-04 15:12:03,172 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=333633.3333333333, ans=0.2 2023-12-04 15:12:09,085 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.50 vs. limit=15.0 2023-12-04 15:12:36,228 INFO [train.py:1087] (2/4) Epoch 56, batch 850, loss[loss=0.1669, simple_loss=0.257, pruned_loss=0.03846, over 23946.00 frames. ], tot_loss[loss=0.1537, simple_loss=0.2462, pruned_loss=0.03058, over 4741052.63 frames. ], batch size: 87, lr: 4.36e-03, grad_scale: 32.0 2023-12-04 15:12:39,312 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=333833.3333333333, ans=0.0 2023-12-04 15:12:45,896 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=333833.3333333333, ans=0.125 2023-12-04 15:12:50,934 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.136e+02 1.290e+02 1.407e+02 1.531e+02 2.076e+02, threshold=2.814e+02, percent-clipped=0.0 2023-12-04 15:13:31,123 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=334100.0, ans=0.125 2023-12-04 15:13:59,687 INFO [train.py:1087] (2/4) Epoch 57, batch 0, loss[loss=0.1846, simple_loss=0.2679, pruned_loss=0.05068, over 17024.00 frames. ], tot_loss[loss=0.1846, simple_loss=0.2679, pruned_loss=0.05068, over 17024.00 frames. ], batch size: 176, lr: 4.32e-03, grad_scale: 32.0 2023-12-04 15:13:59,690 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 15:14:16,498 INFO [train.py:1119] (2/4) Epoch 57, validation: loss=0.1509, simple_loss=0.2484, pruned_loss=0.02671, over 944034.00 frames. 2023-12-04 15:14:16,499 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 15:14:36,190 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.72 vs. limit=22.5 2023-12-04 15:14:42,933 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=334200.0, ans=0.125 2023-12-04 15:14:43,063 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=334200.0, ans=0.0 2023-12-04 15:14:48,585 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.93 vs. 
limit=15.0 2023-12-04 15:15:00,062 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=334266.6666666667, ans=0.025 2023-12-04 15:15:14,711 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=334333.3333333333, ans=0.2 2023-12-04 15:15:16,304 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.33 vs. limit=15.0 2023-12-04 15:15:29,241 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=334400.0, ans=0.125 2023-12-04 15:15:29,571 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=334400.0, ans=0.125 2023-12-04 15:15:34,471 INFO [train.py:1087] (2/4) Epoch 57, batch 50, loss[loss=0.1511, simple_loss=0.2469, pruned_loss=0.02765, over 24547.00 frames. ], tot_loss[loss=0.1567, simple_loss=0.2489, pruned_loss=0.03225, over 1060901.69 frames. ], batch size: 62, lr: 4.32e-03, grad_scale: 32.0 2023-12-04 15:15:57,894 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.141e+02 1.283e+02 1.393e+02 1.536e+02 2.601e+02, threshold=2.787e+02, percent-clipped=0.0 2023-12-04 15:16:31,030 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=334666.6666666667, ans=0.2 2023-12-04 15:16:45,709 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=334733.3333333333, ans=0.125 2023-12-04 15:16:50,068 INFO [train.py:1087] (2/4) Epoch 57, batch 100, loss[loss=0.1576, simple_loss=0.2485, pruned_loss=0.03336, over 23801.00 frames. ], tot_loss[loss=0.1551, simple_loss=0.2478, pruned_loss=0.03126, over 1891684.09 frames. ], batch size: 57, lr: 4.32e-03, grad_scale: 32.0 2023-12-04 15:17:13,498 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.75 vs. limit=22.5 2023-12-04 15:17:42,163 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=335000.0, ans=0.125 2023-12-04 15:17:53,529 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.51 vs. limit=10.0 2023-12-04 15:18:08,210 INFO [train.py:1087] (2/4) Epoch 57, batch 150, loss[loss=0.146, simple_loss=0.2401, pruned_loss=0.02589, over 24776.00 frames. ], tot_loss[loss=0.1543, simple_loss=0.2472, pruned_loss=0.03064, over 2534811.07 frames. 
], batch size: 70, lr: 4.32e-03, grad_scale: 32.0 2023-12-04 15:18:32,020 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=335200.0, ans=10.0 2023-12-04 15:18:33,147 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.084e+02 1.254e+02 1.319e+02 1.488e+02 2.110e+02, threshold=2.637e+02, percent-clipped=0.0 2023-12-04 15:18:44,681 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=335266.6666666667, ans=0.125 2023-12-04 15:18:50,450 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=335266.6666666667, ans=0.0 2023-12-04 15:18:59,156 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=335333.3333333333, ans=0.125 2023-12-04 15:19:19,595 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=335400.0, ans=0.125 2023-12-04 15:19:25,650 INFO [train.py:1087] (2/4) Epoch 57, batch 200, loss[loss=0.1604, simple_loss=0.2586, pruned_loss=0.03107, over 24865.00 frames. ], tot_loss[loss=0.1532, simple_loss=0.2465, pruned_loss=0.03, over 3056390.07 frames. ], batch size: 68, lr: 4.32e-03, grad_scale: 32.0 2023-12-04 15:19:26,427 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.67 vs. limit=15.0 2023-12-04 15:20:18,180 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=335666.6666666667, ans=15.0 2023-12-04 15:20:42,498 INFO [train.py:1087] (2/4) Epoch 57, batch 250, loss[loss=0.151, simple_loss=0.2424, pruned_loss=0.02976, over 24572.00 frames. ], tot_loss[loss=0.1532, simple_loss=0.2463, pruned_loss=0.03007, over 3462968.19 frames. ], batch size: 65, lr: 4.31e-03, grad_scale: 32.0 2023-12-04 15:21:08,083 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.140e+02 1.273e+02 1.365e+02 1.487e+02 1.960e+02, threshold=2.731e+02, percent-clipped=0.0 2023-12-04 15:21:08,343 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=335866.6666666667, ans=0.1 2023-12-04 15:21:25,965 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=335933.3333333333, ans=0.0 2023-12-04 15:22:01,362 INFO [train.py:1087] (2/4) Epoch 57, batch 300, loss[loss=0.1638, simple_loss=0.256, pruned_loss=0.03578, over 16571.00 frames. ], tot_loss[loss=0.153, simple_loss=0.2461, pruned_loss=0.02997, over 3768535.56 frames. ], batch size: 177, lr: 4.31e-03, grad_scale: 16.0 2023-12-04 15:22:05,330 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.46 vs. 
limit=15.0 2023-12-04 15:22:07,942 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=336133.3333333333, ans=0.125 2023-12-04 15:22:12,475 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=336133.3333333333, ans=0.2 2023-12-04 15:22:51,062 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=336333.3333333333, ans=0.125 2023-12-04 15:22:51,120 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=336333.3333333333, ans=0.0 2023-12-04 15:23:19,048 INFO [train.py:1087] (2/4) Epoch 57, batch 350, loss[loss=0.1591, simple_loss=0.2519, pruned_loss=0.03316, over 23951.00 frames. ], tot_loss[loss=0.1534, simple_loss=0.2464, pruned_loss=0.03017, over 3981399.33 frames. ], batch size: 87, lr: 4.31e-03, grad_scale: 16.0 2023-12-04 15:23:31,271 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.28 vs. limit=10.0 2023-12-04 15:23:34,419 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.91 vs. limit=10.0 2023-12-04 15:23:35,566 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=336533.3333333333, ans=0.125 2023-12-04 15:23:41,897 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.53 vs. limit=15.0 2023-12-04 15:23:45,587 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.137e+02 1.295e+02 1.376e+02 1.514e+02 1.932e+02, threshold=2.753e+02, percent-clipped=0.0 2023-12-04 15:24:35,365 INFO [train.py:1087] (2/4) Epoch 57, batch 400, loss[loss=0.1527, simple_loss=0.2451, pruned_loss=0.03021, over 24563.00 frames. ], tot_loss[loss=0.1528, simple_loss=0.2459, pruned_loss=0.02987, over 4173497.39 frames. ], batch size: 63, lr: 4.31e-03, grad_scale: 32.0 2023-12-04 15:24:41,840 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=336800.0, ans=0.05 2023-12-04 15:25:15,293 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=336933.3333333333, ans=0.125 2023-12-04 15:25:15,847 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.21 vs. limit=15.0 2023-12-04 15:25:42,206 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=337066.6666666667, ans=0.1 2023-12-04 15:25:48,193 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=337066.6666666667, ans=0.0 2023-12-04 15:25:53,928 INFO [train.py:1087] (2/4) Epoch 57, batch 450, loss[loss=0.1426, simple_loss=0.2387, pruned_loss=0.02329, over 24166.00 frames. ], tot_loss[loss=0.1528, simple_loss=0.2457, pruned_loss=0.0299, over 4319562.42 frames. 
], batch size: 58, lr: 4.30e-03, grad_scale: 32.0 2023-12-04 15:26:05,090 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.91 vs. limit=15.0 2023-12-04 15:26:07,892 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=337200.0, ans=0.125 2023-12-04 15:26:07,913 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=337200.0, ans=0.125 2023-12-04 15:26:16,305 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=337200.0, ans=0.0 2023-12-04 15:26:20,986 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.099e+02 1.284e+02 1.352e+02 1.480e+02 2.036e+02, threshold=2.705e+02, percent-clipped=0.0 2023-12-04 15:26:27,185 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=337266.6666666667, ans=0.0 2023-12-04 15:27:00,084 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=337400.0, ans=0.95 2023-12-04 15:27:08,541 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=337400.0, ans=0.1 2023-12-04 15:27:12,856 INFO [train.py:1087] (2/4) Epoch 57, batch 500, loss[loss=0.1453, simple_loss=0.2383, pruned_loss=0.02617, over 24548.00 frames. ], tot_loss[loss=0.153, simple_loss=0.2458, pruned_loss=0.03005, over 4415809.00 frames. ], batch size: 62, lr: 4.30e-03, grad_scale: 32.0 2023-12-04 15:27:17,709 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=337466.6666666667, ans=0.1 2023-12-04 15:27:40,755 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=337533.3333333333, ans=0.125 2023-12-04 15:27:40,832 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=337533.3333333333, ans=0.125 2023-12-04 15:27:53,385 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.11 vs. limit=10.0 2023-12-04 15:27:59,317 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.18 vs. limit=15.0 2023-12-04 15:28:01,882 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=337666.6666666667, ans=0.125 2023-12-04 15:28:02,036 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=337666.6666666667, ans=0.125 2023-12-04 15:28:12,215 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=337666.6666666667, ans=0.125 2023-12-04 15:28:31,653 INFO [train.py:1087] (2/4) Epoch 57, batch 550, loss[loss=0.1418, simple_loss=0.2368, pruned_loss=0.02338, over 24762.00 frames. ], tot_loss[loss=0.153, simple_loss=0.2458, pruned_loss=0.03008, over 4505233.38 frames. 
], batch size: 64, lr: 4.30e-03, grad_scale: 16.0 2023-12-04 15:28:42,823 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=337800.0, ans=0.0 2023-12-04 15:28:58,052 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.09 vs. limit=22.5 2023-12-04 15:28:58,571 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.115e+02 1.296e+02 1.366e+02 1.475e+02 1.863e+02, threshold=2.731e+02, percent-clipped=0.0 2023-12-04 15:29:00,908 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.82 vs. limit=22.5 2023-12-04 15:29:43,559 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=338066.6666666667, ans=0.125 2023-12-04 15:29:48,229 INFO [train.py:1087] (2/4) Epoch 57, batch 600, loss[loss=0.1484, simple_loss=0.2417, pruned_loss=0.02761, over 24554.00 frames. ], tot_loss[loss=0.1527, simple_loss=0.2455, pruned_loss=0.02988, over 4579662.89 frames. ], batch size: 63, lr: 4.30e-03, grad_scale: 16.0 2023-12-04 15:29:57,683 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.22 vs. limit=15.0 2023-12-04 15:30:09,609 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.38 vs. limit=6.0 2023-12-04 15:30:50,245 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=338400.0, ans=10.0 2023-12-04 15:31:06,684 INFO [train.py:1087] (2/4) Epoch 57, batch 650, loss[loss=0.1543, simple_loss=0.2509, pruned_loss=0.02886, over 24550.00 frames. ], tot_loss[loss=0.1529, simple_loss=0.2458, pruned_loss=0.03002, over 4617393.97 frames. ], batch size: 63, lr: 4.30e-03, grad_scale: 16.0 2023-12-04 15:31:34,225 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.138e+02 1.254e+02 1.345e+02 1.461e+02 1.814e+02, threshold=2.691e+02, percent-clipped=0.0 2023-12-04 15:31:35,911 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=338600.0, ans=0.1 2023-12-04 15:31:37,956 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.93 vs. limit=10.0 2023-12-04 15:31:38,866 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=338600.0, ans=0.125 2023-12-04 15:31:48,947 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-12-04 15:32:14,027 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.82 vs. limit=15.0 2023-12-04 15:32:17,677 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=338733.3333333333, ans=0.2 2023-12-04 15:32:23,353 INFO [train.py:1087] (2/4) Epoch 57, batch 700, loss[loss=0.1426, simple_loss=0.2359, pruned_loss=0.02464, over 24725.00 frames. 
], tot_loss[loss=0.153, simple_loss=0.2458, pruned_loss=0.03013, over 4643781.01 frames. ], batch size: 67, lr: 4.29e-03, grad_scale: 16.0 2023-12-04 15:32:45,013 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=338866.6666666667, ans=0.035 2023-12-04 15:32:49,604 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 15:33:29,066 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=339066.6666666667, ans=0.0 2023-12-04 15:33:41,806 INFO [train.py:1087] (2/4) Epoch 57, batch 750, loss[loss=0.1519, simple_loss=0.2425, pruned_loss=0.03067, over 24788.00 frames. ], tot_loss[loss=0.1524, simple_loss=0.2453, pruned_loss=0.02977, over 4699508.66 frames. ], batch size: 62, lr: 4.29e-03, grad_scale: 16.0 2023-12-04 15:34:09,570 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.164e+02 1.274e+02 1.366e+02 1.466e+02 1.796e+02, threshold=2.732e+02, percent-clipped=0.0 2023-12-04 15:34:30,709 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=339333.3333333333, ans=0.125 2023-12-04 15:34:31,933 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=339333.3333333333, ans=0.125 2023-12-04 15:34:58,949 INFO [train.py:1087] (2/4) Epoch 57, batch 800, loss[loss=0.1515, simple_loss=0.2427, pruned_loss=0.0301, over 24739.00 frames. ], tot_loss[loss=0.1522, simple_loss=0.2449, pruned_loss=0.02968, over 4727624.13 frames. ], batch size: 63, lr: 4.29e-03, grad_scale: 32.0 2023-12-04 15:35:10,262 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.84 vs. limit=12.0 2023-12-04 15:35:24,778 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=339533.3333333333, ans=0.0 2023-12-04 15:36:09,399 INFO [train.py:1087] (2/4) Epoch 57, batch 850, loss[loss=0.1601, simple_loss=0.2544, pruned_loss=0.03288, over 24545.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.2448, pruned_loss=0.02966, over 4752683.79 frames. ], batch size: 62, lr: 4.29e-03, grad_scale: 32.0 2023-12-04 15:36:33,913 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.047e+02 1.261e+02 1.353e+02 1.424e+02 1.970e+02, threshold=2.706e+02, percent-clipped=0.0 2023-12-04 15:36:37,557 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.71 vs. limit=15.0 2023-12-04 15:36:55,077 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-04 15:37:22,784 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=340100.0, ans=0.0 2023-12-04 15:37:36,190 INFO [train.py:1087] (2/4) Epoch 58, batch 0, loss[loss=0.1476, simple_loss=0.2426, pruned_loss=0.02631, over 24548.00 frames. ], tot_loss[loss=0.1476, simple_loss=0.2426, pruned_loss=0.02631, over 24548.00 frames. 
], batch size: 66, lr: 4.25e-03, grad_scale: 32.0 2023-12-04 15:37:36,191 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 15:37:52,583 INFO [train.py:1119] (2/4) Epoch 58, validation: loss=0.1514, simple_loss=0.2484, pruned_loss=0.02714, over 944034.00 frames. 2023-12-04 15:37:52,584 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 15:38:21,078 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=340166.6666666667, ans=0.025 2023-12-04 15:38:33,852 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.60 vs. limit=15.0 2023-12-04 15:38:56,346 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=340366.6666666667, ans=0.0 2023-12-04 15:39:09,602 INFO [train.py:1087] (2/4) Epoch 58, batch 50, loss[loss=0.1511, simple_loss=0.2419, pruned_loss=0.03017, over 24765.00 frames. ], tot_loss[loss=0.1542, simple_loss=0.247, pruned_loss=0.03067, over 1097598.62 frames. ], batch size: 70, lr: 4.25e-03, grad_scale: 32.0 2023-12-04 15:39:29,722 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.64 vs. limit=15.0 2023-12-04 15:39:44,476 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.141e+02 1.274e+02 1.368e+02 1.509e+02 2.414e+02, threshold=2.736e+02, percent-clipped=0.0 2023-12-04 15:39:52,923 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.32 vs. limit=15.0 2023-12-04 15:40:07,937 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=340633.3333333333, ans=0.1 2023-12-04 15:40:25,858 INFO [train.py:1087] (2/4) Epoch 58, batch 100, loss[loss=0.1619, simple_loss=0.255, pruned_loss=0.03441, over 24442.00 frames. ], tot_loss[loss=0.1544, simple_loss=0.2473, pruned_loss=0.03071, over 1919284.67 frames. ], batch size: 77, lr: 4.24e-03, grad_scale: 16.0 2023-12-04 15:40:42,976 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.49 vs. limit=6.0 2023-12-04 15:40:56,547 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=340900.0, ans=0.0 2023-12-04 15:41:00,766 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=340900.0, ans=0.125 2023-12-04 15:41:13,186 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=340966.6666666667, ans=0.07 2023-12-04 15:41:15,174 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=340966.6666666667, ans=0.125 2023-12-04 15:41:19,809 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=340966.6666666667, ans=0.2 2023-12-04 15:41:23,117 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.36 vs. 
limit=6.0 2023-12-04 15:41:28,314 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=341033.3333333333, ans=0.09899494936611666 2023-12-04 15:41:41,290 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=341100.0, ans=0.125 2023-12-04 15:41:42,206 INFO [train.py:1087] (2/4) Epoch 58, batch 150, loss[loss=0.1559, simple_loss=0.2518, pruned_loss=0.02997, over 24521.00 frames. ], tot_loss[loss=0.1541, simple_loss=0.2471, pruned_loss=0.03059, over 2557923.41 frames. ], batch size: 75, lr: 4.24e-03, grad_scale: 16.0 2023-12-04 15:42:19,233 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.107e+02 1.290e+02 1.368e+02 1.495e+02 2.252e+02, threshold=2.736e+02, percent-clipped=0.0 2023-12-04 15:42:27,510 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=341300.0, ans=0.2 2023-12-04 15:42:30,999 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=341300.0, ans=0.0 2023-12-04 15:42:43,780 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=341366.6666666667, ans=0.0 2023-12-04 15:42:47,249 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=341366.6666666667, ans=0.125 2023-12-04 15:42:52,969 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=341366.6666666667, ans=0.125 2023-12-04 15:42:52,980 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=341366.6666666667, ans=0.95 2023-12-04 15:42:59,690 INFO [train.py:1087] (2/4) Epoch 58, batch 200, loss[loss=0.1484, simple_loss=0.2401, pruned_loss=0.02831, over 24576.00 frames. ], tot_loss[loss=0.1538, simple_loss=0.2468, pruned_loss=0.03041, over 3058527.62 frames. ], batch size: 65, lr: 4.24e-03, grad_scale: 16.0 2023-12-04 15:43:09,365 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=341433.3333333333, ans=0.125 2023-12-04 15:43:13,881 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=341500.0, ans=0.125 2023-12-04 15:43:20,139 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=341500.0, ans=0.035 2023-12-04 15:43:30,330 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=341566.6666666667, ans=0.1 2023-12-04 15:43:32,050 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.08 vs. limit=15.0 2023-12-04 15:43:42,446 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. 
limit=15.0 2023-12-04 15:44:05,360 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=341700.0, ans=0.125 2023-12-04 15:44:05,534 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=341700.0, ans=0.125 2023-12-04 15:44:17,131 INFO [train.py:1087] (2/4) Epoch 58, batch 250, loss[loss=0.1475, simple_loss=0.2424, pruned_loss=0.02631, over 24753.00 frames. ], tot_loss[loss=0.1531, simple_loss=0.2462, pruned_loss=0.03005, over 3451245.87 frames. ], batch size: 66, lr: 4.24e-03, grad_scale: 16.0 2023-12-04 15:44:54,077 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.072e+02 1.304e+02 1.398e+02 1.543e+02 2.127e+02, threshold=2.795e+02, percent-clipped=0.0 2023-12-04 15:45:02,236 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=341966.6666666667, ans=0.125 2023-12-04 15:45:11,561 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.23 vs. limit=15.0 2023-12-04 15:45:19,389 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.65 vs. limit=15.0 2023-12-04 15:45:21,620 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=342033.3333333333, ans=0.2 2023-12-04 15:45:24,840 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=342033.3333333333, ans=0.0 2023-12-04 15:45:29,233 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.01 vs. limit=15.0 2023-12-04 15:45:34,526 INFO [train.py:1087] (2/4) Epoch 58, batch 300, loss[loss=0.1546, simple_loss=0.2452, pruned_loss=0.03197, over 24224.00 frames. ], tot_loss[loss=0.153, simple_loss=0.2461, pruned_loss=0.02998, over 3760666.41 frames. ], batch size: 82, lr: 4.24e-03, grad_scale: 16.0 2023-12-04 15:45:36,564 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.70 vs. limit=15.0 2023-12-04 15:45:37,645 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=342100.0, ans=0.1 2023-12-04 15:46:00,775 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.76 vs. 
limit=15.0 2023-12-04 15:46:03,480 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=342233.3333333333, ans=0.2 2023-12-04 15:46:09,367 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=342233.3333333333, ans=0.125 2023-12-04 15:46:30,502 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=342300.0, ans=0.125 2023-12-04 15:46:33,380 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=342300.0, ans=0.0 2023-12-04 15:46:49,324 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=342433.3333333333, ans=0.0 2023-12-04 15:46:50,648 INFO [train.py:1087] (2/4) Epoch 58, batch 350, loss[loss=0.1439, simple_loss=0.2389, pruned_loss=0.02444, over 24770.00 frames. ], tot_loss[loss=0.153, simple_loss=0.2459, pruned_loss=0.03006, over 3986205.45 frames. ], batch size: 70, lr: 4.23e-03, grad_scale: 16.0 2023-12-04 15:47:14,994 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 15:47:16,265 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=342500.0, ans=0.125 2023-12-04 15:47:25,554 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.108e+02 1.239e+02 1.325e+02 1.427e+02 1.705e+02, threshold=2.650e+02, percent-clipped=0.0 2023-12-04 15:47:32,493 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=342633.3333333333, ans=0.04949747468305833 2023-12-04 15:48:01,570 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.97 vs. limit=22.5 2023-12-04 15:48:03,419 INFO [train.py:1087] (2/4) Epoch 58, batch 400, loss[loss=0.1505, simple_loss=0.2431, pruned_loss=0.02894, over 24801.00 frames. ], tot_loss[loss=0.1533, simple_loss=0.2459, pruned_loss=0.03036, over 4154248.66 frames. ], batch size: 62, lr: 4.23e-03, grad_scale: 32.0 2023-12-04 15:48:11,651 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=342766.6666666667, ans=0.0 2023-12-04 15:48:46,552 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=342966.6666666667, ans=0.125 2023-12-04 15:49:05,399 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=343033.3333333333, ans=0.125 2023-12-04 15:49:14,456 INFO [train.py:1087] (2/4) Epoch 58, batch 450, loss[loss=0.156, simple_loss=0.2511, pruned_loss=0.03044, over 24694.00 frames. ], tot_loss[loss=0.1529, simple_loss=0.2456, pruned_loss=0.03012, over 4311848.43 frames. 
], batch size: 69, lr: 4.23e-03, grad_scale: 16.0 2023-12-04 15:49:16,082 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 15:49:29,319 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=343166.6666666667, ans=0.025 2023-12-04 15:49:49,386 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.75 vs. limit=15.0 2023-12-04 15:49:50,152 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=343233.3333333333, ans=0.125 2023-12-04 15:49:51,088 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.088e+02 1.271e+02 1.365e+02 1.481e+02 1.907e+02, threshold=2.729e+02, percent-clipped=0.0 2023-12-04 15:49:59,750 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=343300.0, ans=0.0 2023-12-04 15:50:08,997 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=343366.6666666667, ans=0.0 2023-12-04 15:50:10,701 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=343366.6666666667, ans=0.125 2023-12-04 15:50:13,438 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=343366.6666666667, ans=0.0 2023-12-04 15:50:23,676 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=343433.3333333333, ans=0.0 2023-12-04 15:50:25,082 INFO [train.py:1087] (2/4) Epoch 58, batch 500, loss[loss=0.1429, simple_loss=0.2396, pruned_loss=0.02305, over 24612.00 frames. ], tot_loss[loss=0.1533, simple_loss=0.2461, pruned_loss=0.03028, over 4403829.03 frames. ], batch size: 68, lr: 4.23e-03, grad_scale: 8.0 2023-12-04 15:50:36,220 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=343433.3333333333, ans=0.0 2023-12-04 15:50:45,510 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=343500.0, ans=0.07 2023-12-04 15:50:52,837 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=343566.6666666667, ans=0.0 2023-12-04 15:50:57,361 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 15:51:19,877 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=343700.0, ans=0.1 2023-12-04 15:51:35,269 INFO [train.py:1087] (2/4) Epoch 58, batch 550, loss[loss=0.143, simple_loss=0.2398, pruned_loss=0.0231, over 24613.00 frames. ], tot_loss[loss=0.1536, simple_loss=0.2464, pruned_loss=0.03043, over 4465722.25 frames. ], batch size: 68, lr: 4.23e-03, grad_scale: 8.0 2023-12-04 15:51:39,634 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=343766.6666666667, ans=0.125 2023-12-04 15:51:44,158 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.82 vs. 
limit=22.5 2023-12-04 15:52:01,724 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=343900.0, ans=0.0 2023-12-04 15:52:11,271 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.128e+02 1.282e+02 1.355e+02 1.434e+02 2.119e+02, threshold=2.711e+02, percent-clipped=0.0 2023-12-04 15:52:17,623 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 15:52:27,859 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=343966.6666666667, ans=0.2 2023-12-04 15:52:28,115 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=343966.6666666667, ans=0.125 2023-12-04 15:52:39,169 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=344033.3333333333, ans=0.0 2023-12-04 15:52:45,364 INFO [train.py:1087] (2/4) Epoch 58, batch 600, loss[loss=0.1517, simple_loss=0.2454, pruned_loss=0.02899, over 24864.00 frames. ], tot_loss[loss=0.1534, simple_loss=0.2463, pruned_loss=0.03026, over 4540797.21 frames. ], batch size: 68, lr: 4.22e-03, grad_scale: 8.0 2023-12-04 15:52:49,923 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=344100.0, ans=0.125 2023-12-04 15:53:39,732 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=344300.0, ans=0.125 2023-12-04 15:53:41,648 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.72 vs. limit=15.0 2023-12-04 15:53:55,584 INFO [train.py:1087] (2/4) Epoch 58, batch 650, loss[loss=0.158, simple_loss=0.2511, pruned_loss=0.03252, over 24540.00 frames. ], tot_loss[loss=0.1533, simple_loss=0.2462, pruned_loss=0.03023, over 4595886.85 frames. ], batch size: 75, lr: 4.22e-03, grad_scale: 8.0 2023-12-04 15:53:55,894 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=344433.3333333333, ans=0.0 2023-12-04 15:53:59,252 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.40 vs. 
limit=6.0 2023-12-04 15:54:05,124 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=344433.3333333333, ans=0.2 2023-12-04 15:54:09,621 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=344500.0, ans=0.0 2023-12-04 15:54:30,565 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=344566.6666666667, ans=0.125 2023-12-04 15:54:31,380 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.115e+02 1.273e+02 1.359e+02 1.478e+02 1.827e+02, threshold=2.719e+02, percent-clipped=0.0 2023-12-04 15:54:48,827 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=344633.3333333333, ans=0.0 2023-12-04 15:54:53,593 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=344700.0, ans=0.125 2023-12-04 15:55:01,956 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=344700.0, ans=0.125 2023-12-04 15:55:05,470 INFO [train.py:1087] (2/4) Epoch 58, batch 700, loss[loss=0.1894, simple_loss=0.2738, pruned_loss=0.05245, over 17176.00 frames. ], tot_loss[loss=0.1534, simple_loss=0.2461, pruned_loss=0.03032, over 4635219.62 frames. ], batch size: 177, lr: 4.22e-03, grad_scale: 8.0 2023-12-04 15:55:41,192 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=344900.0, ans=0.125 2023-12-04 15:55:41,666 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.10 vs. limit=12.0 2023-12-04 15:55:43,202 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.14 vs. limit=22.5 2023-12-04 15:55:51,570 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=344966.6666666667, ans=0.0 2023-12-04 15:56:01,219 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=345033.3333333333, ans=0.2 2023-12-04 15:56:02,648 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=345033.3333333333, ans=0.1 2023-12-04 15:56:13,664 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.34 vs. limit=22.5 2023-12-04 15:56:14,410 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=345100.0, ans=0.125 2023-12-04 15:56:15,844 INFO [train.py:1087] (2/4) Epoch 58, batch 750, loss[loss=0.1629, simple_loss=0.2618, pruned_loss=0.03194, over 21035.00 frames. ], tot_loss[loss=0.1533, simple_loss=0.246, pruned_loss=0.03032, over 4668701.48 frames. ], batch size: 127, lr: 4.22e-03, grad_scale: 8.0 2023-12-04 15:56:16,211 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=345100.0, ans=0.125 2023-12-04 15:56:16,629 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.48 vs. 
limit=15.0 2023-12-04 15:56:46,758 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=345233.3333333333, ans=0.0 2023-12-04 15:56:47,253 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.97 vs. limit=15.0 2023-12-04 15:56:53,205 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.255e+02 1.327e+02 1.411e+02 2.341e+02, threshold=2.654e+02, percent-clipped=0.0 2023-12-04 15:57:27,038 INFO [train.py:1087] (2/4) Epoch 58, batch 800, loss[loss=0.1564, simple_loss=0.2534, pruned_loss=0.02966, over 24770.00 frames. ], tot_loss[loss=0.1532, simple_loss=0.2458, pruned_loss=0.03027, over 4698720.35 frames. ], batch size: 70, lr: 4.22e-03, grad_scale: 16.0 2023-12-04 15:57:35,641 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=345433.3333333333, ans=0.0 2023-12-04 15:57:45,673 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=345500.0, ans=0.0 2023-12-04 15:57:57,901 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=345566.6666666667, ans=0.125 2023-12-04 15:58:28,589 INFO [train.py:1087] (2/4) Epoch 58, batch 850, loss[loss=0.1471, simple_loss=0.2424, pruned_loss=0.02589, over 24786.00 frames. ], tot_loss[loss=0.1536, simple_loss=0.2463, pruned_loss=0.03049, over 4722365.26 frames. ], batch size: 71, lr: 4.21e-03, grad_scale: 16.0 2023-12-04 15:58:34,939 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=345766.6666666667, ans=0.125 2023-12-04 15:58:40,883 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=345833.3333333333, ans=0.0 2023-12-04 15:58:41,019 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=345833.3333333333, ans=0.0 2023-12-04 15:58:43,631 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.70 vs. limit=15.0 2023-12-04 15:58:47,364 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=345833.3333333333, ans=0.125 2023-12-04 15:59:00,494 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.124e+02 1.280e+02 1.372e+02 1.495e+02 2.014e+02, threshold=2.743e+02, percent-clipped=0.0 2023-12-04 15:59:44,938 INFO [train.py:1087] (2/4) Epoch 59, batch 0, loss[loss=0.1526, simple_loss=0.2449, pruned_loss=0.03012, over 24321.00 frames. ], tot_loss[loss=0.1526, simple_loss=0.2449, pruned_loss=0.03012, over 24321.00 frames. ], batch size: 79, lr: 4.18e-03, grad_scale: 32.0 2023-12-04 15:59:44,940 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 15:59:56,478 INFO [zipformer.py:1876] (2/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.8187, 5.6187, 5.6751, 5.3688], device='cuda:2') 2023-12-04 16:00:01,550 INFO [train.py:1119] (2/4) Epoch 59, validation: loss=0.151, simple_loss=0.2482, pruned_loss=0.02689, over 944034.00 frames. 
2023-12-04 16:00:01,551 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 16:00:05,702 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=346066.6666666667, ans=0.09899494936611666 2023-12-04 16:00:17,283 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=346133.3333333333, ans=0.1 2023-12-04 16:00:20,309 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=346133.3333333333, ans=0.125 2023-12-04 16:00:43,036 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=346266.6666666667, ans=0.125 2023-12-04 16:00:57,370 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=346333.3333333333, ans=0.125 2023-12-04 16:01:04,835 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.63 vs. limit=10.0 2023-12-04 16:01:05,507 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=346333.3333333333, ans=0.125 2023-12-04 16:01:10,439 INFO [train.py:1087] (2/4) Epoch 59, batch 50, loss[loss=0.1544, simple_loss=0.2502, pruned_loss=0.02934, over 24699.00 frames. ], tot_loss[loss=0.1526, simple_loss=0.2458, pruned_loss=0.0297, over 1094183.47 frames. ], batch size: 69, lr: 4.17e-03, grad_scale: 32.0 2023-12-04 16:01:44,848 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=346533.3333333333, ans=0.125 2023-12-04 16:01:46,035 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=346533.3333333333, ans=0.1 2023-12-04 16:01:53,329 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.144e+02 1.301e+02 1.399e+02 1.506e+02 1.775e+02, threshold=2.799e+02, percent-clipped=0.0 2023-12-04 16:01:55,504 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.64 vs. limit=10.0 2023-12-04 16:02:22,202 INFO [train.py:1087] (2/4) Epoch 59, batch 100, loss[loss=0.1554, simple_loss=0.2505, pruned_loss=0.03019, over 24241.00 frames. ], tot_loss[loss=0.1524, simple_loss=0.2457, pruned_loss=0.02954, over 1912099.98 frames. ], batch size: 82, lr: 4.17e-03, grad_scale: 16.0 2023-12-04 16:02:30,102 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=346733.3333333333, ans=0.0 2023-12-04 16:02:53,451 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-12-04 16:03:01,482 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=346866.6666666667, ans=0.125 2023-12-04 16:03:31,124 INFO [train.py:1087] (2/4) Epoch 59, batch 150, loss[loss=0.1446, simple_loss=0.2381, pruned_loss=0.02556, over 24567.00 frames. ], tot_loss[loss=0.1526, simple_loss=0.2458, pruned_loss=0.02971, over 2559469.23 frames. 
], batch size: 66, lr: 4.17e-03, grad_scale: 8.0 2023-12-04 16:03:35,377 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=347066.6666666667, ans=0.125 2023-12-04 16:04:01,535 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.08 vs. limit=22.5 2023-12-04 16:04:06,850 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=347200.0, ans=0.025 2023-12-04 16:04:14,550 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.66 vs. limit=15.0 2023-12-04 16:04:16,217 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.125e+02 1.284e+02 1.386e+02 1.532e+02 2.042e+02, threshold=2.772e+02, percent-clipped=0.0 2023-12-04 16:04:39,895 INFO [train.py:1087] (2/4) Epoch 59, batch 200, loss[loss=0.1507, simple_loss=0.2454, pruned_loss=0.028, over 24585.00 frames. ], tot_loss[loss=0.152, simple_loss=0.2451, pruned_loss=0.0294, over 3077665.73 frames. ], batch size: 68, lr: 4.17e-03, grad_scale: 8.0 2023-12-04 16:04:41,902 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.44 vs. limit=15.0 2023-12-04 16:05:09,889 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=347533.3333333333, ans=0.0 2023-12-04 16:05:29,196 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=347600.0, ans=0.0 2023-12-04 16:05:31,842 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=347600.0, ans=0.125 2023-12-04 16:05:44,314 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=347666.6666666667, ans=0.1 2023-12-04 16:05:45,411 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 16:05:48,968 INFO [train.py:1087] (2/4) Epoch 59, batch 250, loss[loss=0.15, simple_loss=0.2404, pruned_loss=0.02977, over 24462.00 frames. ], tot_loss[loss=0.1527, simple_loss=0.2457, pruned_loss=0.02979, over 3451295.76 frames. ], batch size: 75, lr: 4.17e-03, grad_scale: 8.0 2023-12-04 16:05:53,245 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=347733.3333333333, ans=0.0 2023-12-04 16:06:16,729 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.45 vs. limit=10.0 2023-12-04 16:06:22,078 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.39 vs. 
limit=12.0 2023-12-04 16:06:33,874 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.115e+02 1.290e+02 1.373e+02 1.506e+02 1.799e+02, threshold=2.747e+02, percent-clipped=0.0 2023-12-04 16:06:43,791 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=348000.0, ans=0.1 2023-12-04 16:06:57,408 INFO [train.py:1087] (2/4) Epoch 59, batch 300, loss[loss=0.1506, simple_loss=0.2428, pruned_loss=0.02917, over 24460.00 frames. ], tot_loss[loss=0.1527, simple_loss=0.2456, pruned_loss=0.02989, over 3756701.24 frames. ], batch size: 77, lr: 4.16e-03, grad_scale: 8.0 2023-12-04 16:07:24,769 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=348200.0, ans=0.2 2023-12-04 16:07:58,956 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=348333.3333333333, ans=0.1 2023-12-04 16:08:05,355 INFO [train.py:1087] (2/4) Epoch 59, batch 350, loss[loss=0.1504, simple_loss=0.2396, pruned_loss=0.03065, over 24489.00 frames. ], tot_loss[loss=0.153, simple_loss=0.2457, pruned_loss=0.03014, over 3990004.20 frames. ], batch size: 77, lr: 4.16e-03, grad_scale: 8.0 2023-12-04 16:08:11,011 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=348400.0, ans=0.125 2023-12-04 16:08:19,076 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=348466.6666666667, ans=0.2 2023-12-04 16:08:30,266 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=348466.6666666667, ans=0.09899494936611666 2023-12-04 16:08:39,600 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=348533.3333333333, ans=0.0 2023-12-04 16:08:45,021 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=15.0 2023-12-04 16:08:51,067 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.058e+02 1.283e+02 1.381e+02 1.500e+02 2.104e+02, threshold=2.763e+02, percent-clipped=0.0 2023-12-04 16:09:03,032 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=348666.6666666667, ans=0.0 2023-12-04 16:09:15,586 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.68 vs. limit=10.0 2023-12-04 16:09:15,968 INFO [train.py:1087] (2/4) Epoch 59, batch 400, loss[loss=0.1482, simple_loss=0.2391, pruned_loss=0.02862, over 24556.00 frames. ], tot_loss[loss=0.1523, simple_loss=0.2452, pruned_loss=0.02973, over 4174636.58 frames. ], batch size: 62, lr: 4.16e-03, grad_scale: 16.0 2023-12-04 16:09:22,794 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=348733.3333333333, ans=0.125 2023-12-04 16:09:35,813 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.49 vs. 
limit=15.0 2023-12-04 16:09:52,801 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=348866.6666666667, ans=0.125 2023-12-04 16:09:58,193 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=348933.3333333333, ans=0.0 2023-12-04 16:10:25,370 INFO [train.py:1087] (2/4) Epoch 59, batch 450, loss[loss=0.1483, simple_loss=0.2379, pruned_loss=0.02939, over 24852.00 frames. ], tot_loss[loss=0.1529, simple_loss=0.2457, pruned_loss=0.03001, over 4292576.75 frames. ], batch size: 68, lr: 4.16e-03, grad_scale: 16.0 2023-12-04 16:10:41,781 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.74 vs. limit=15.0 2023-12-04 16:10:54,990 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=349200.0, ans=0.0 2023-12-04 16:11:09,786 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.084e+02 1.274e+02 1.345e+02 1.464e+02 2.678e+02, threshold=2.690e+02, percent-clipped=0.0 2023-12-04 16:11:13,138 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.63 vs. limit=15.0 2023-12-04 16:11:22,839 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=349333.3333333333, ans=0.0 2023-12-04 16:11:34,068 INFO [train.py:1087] (2/4) Epoch 59, batch 500, loss[loss=0.1451, simple_loss=0.2348, pruned_loss=0.02765, over 24736.00 frames. ], tot_loss[loss=0.1525, simple_loss=0.2452, pruned_loss=0.02986, over 4414683.10 frames. ], batch size: 61, lr: 4.16e-03, grad_scale: 16.0 2023-12-04 16:11:45,537 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.82 vs. limit=15.0 2023-12-04 16:12:07,891 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=349533.3333333333, ans=0.0 2023-12-04 16:12:29,312 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=349666.6666666667, ans=0.2 2023-12-04 16:12:41,930 INFO [train.py:1087] (2/4) Epoch 59, batch 550, loss[loss=0.1504, simple_loss=0.2467, pruned_loss=0.02708, over 24803.00 frames. ], tot_loss[loss=0.1525, simple_loss=0.2454, pruned_loss=0.02976, over 4516665.80 frames. ], batch size: 72, lr: 4.15e-03, grad_scale: 16.0 2023-12-04 16:13:18,844 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=349866.6666666667, ans=0.125 2023-12-04 16:13:28,807 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.073e+02 1.263e+02 1.357e+02 1.457e+02 1.883e+02, threshold=2.715e+02, percent-clipped=0.0 2023-12-04 16:13:41,485 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=350000.0, ans=0.125 2023-12-04 16:13:52,090 INFO [train.py:1087] (2/4) Epoch 59, batch 600, loss[loss=0.1577, simple_loss=0.2512, pruned_loss=0.03206, over 24500.00 frames. ], tot_loss[loss=0.1525, simple_loss=0.2453, pruned_loss=0.02985, over 4591936.64 frames. 
], batch size: 75, lr: 4.15e-03, grad_scale: 16.0 2023-12-04 16:14:04,162 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=350133.3333333333, ans=0.125 2023-12-04 16:14:11,083 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.06 vs. limit=12.0 2023-12-04 16:14:22,463 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=350200.0, ans=0.0 2023-12-04 16:14:25,168 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=350200.0, ans=0.0 2023-12-04 16:15:01,655 INFO [train.py:1087] (2/4) Epoch 59, batch 650, loss[loss=0.1522, simple_loss=0.2457, pruned_loss=0.02934, over 24545.00 frames. ], tot_loss[loss=0.153, simple_loss=0.2457, pruned_loss=0.0301, over 4620213.78 frames. ], batch size: 66, lr: 4.15e-03, grad_scale: 16.0 2023-12-04 16:15:29,254 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=350533.3333333333, ans=0.125 2023-12-04 16:15:32,031 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=350533.3333333333, ans=0.0 2023-12-04 16:15:41,168 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=350533.3333333333, ans=0.0 2023-12-04 16:15:47,224 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.096e+02 1.290e+02 1.349e+02 1.494e+02 3.039e+02, threshold=2.697e+02, percent-clipped=1.0 2023-12-04 16:16:00,985 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=350666.6666666667, ans=0.125 2023-12-04 16:16:12,218 INFO [train.py:1087] (2/4) Epoch 59, batch 700, loss[loss=0.1454, simple_loss=0.2372, pruned_loss=0.02682, over 24716.00 frames. ], tot_loss[loss=0.1532, simple_loss=0.2458, pruned_loss=0.03027, over 4663589.16 frames. ], batch size: 74, lr: 4.15e-03, grad_scale: 16.0 2023-12-04 16:16:12,578 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=350733.3333333333, ans=0.125 2023-12-04 16:16:40,261 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=350866.6666666667, ans=0.125 2023-12-04 16:17:04,275 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 16:17:16,417 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=351000.0, ans=0.125 2023-12-04 16:17:20,284 INFO [train.py:1087] (2/4) Epoch 59, batch 750, loss[loss=0.1447, simple_loss=0.2354, pruned_loss=0.02696, over 24545.00 frames. ], tot_loss[loss=0.1528, simple_loss=0.2455, pruned_loss=0.03005, over 4706034.78 frames. 
], batch size: 66, lr: 4.15e-03, grad_scale: 16.0 2023-12-04 16:17:34,801 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=351133.3333333333, ans=0.125 2023-12-04 16:17:34,883 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=351133.3333333333, ans=0.125 2023-12-04 16:17:59,519 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=351200.0, ans=0.125 2023-12-04 16:18:06,840 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.102e+02 1.253e+02 1.352e+02 1.498e+02 1.873e+02, threshold=2.703e+02, percent-clipped=0.0 2023-12-04 16:18:19,388 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=351333.3333333333, ans=0.125 2023-12-04 16:18:21,817 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=351333.3333333333, ans=0.125 2023-12-04 16:18:29,374 INFO [train.py:1087] (2/4) Epoch 59, batch 800, loss[loss=0.1495, simple_loss=0.2411, pruned_loss=0.02892, over 24566.00 frames. ], tot_loss[loss=0.1528, simple_loss=0.2454, pruned_loss=0.03011, over 4722210.08 frames. ], batch size: 63, lr: 4.15e-03, grad_scale: 32.0 2023-12-04 16:18:35,549 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=351400.0, ans=0.0 2023-12-04 16:18:53,034 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=351466.6666666667, ans=0.125 2023-12-04 16:19:06,197 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=351533.3333333333, ans=0.0 2023-12-04 16:19:08,629 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=351600.0, ans=0.125 2023-12-04 16:19:19,743 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=351666.6666666667, ans=0.1 2023-12-04 16:19:32,094 INFO [train.py:1087] (2/4) Epoch 59, batch 850, loss[loss=0.1613, simple_loss=0.2542, pruned_loss=0.03421, over 24545.00 frames. ], tot_loss[loss=0.1527, simple_loss=0.2454, pruned_loss=0.03005, over 4747566.22 frames. ], batch size: 62, lr: 4.14e-03, grad_scale: 16.0 2023-12-04 16:19:32,971 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.49 vs. 
limit=12.0 2023-12-04 16:19:36,122 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=351733.3333333333, ans=0.2 2023-12-04 16:19:39,490 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=351733.3333333333, ans=0.125 2023-12-04 16:19:40,765 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=351733.3333333333, ans=0.125 2023-12-04 16:19:41,784 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=351733.3333333333, ans=0.0 2023-12-04 16:19:57,446 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=351866.6666666667, ans=0.125 2023-12-04 16:20:09,862 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=351933.3333333333, ans=0.0 2023-12-04 16:20:12,975 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.060e+02 1.281e+02 1.392e+02 1.518e+02 2.446e+02, threshold=2.783e+02, percent-clipped=0.0 2023-12-04 16:20:18,082 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-04 16:20:24,222 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=352000.0, ans=0.0 2023-12-04 16:20:43,000 INFO [train.py:1087] (2/4) Epoch 60, batch 0, loss[loss=0.1626, simple_loss=0.2506, pruned_loss=0.03734, over 20918.00 frames. ], tot_loss[loss=0.1626, simple_loss=0.2506, pruned_loss=0.03734, over 20918.00 frames. ], batch size: 50, lr: 4.11e-03, grad_scale: 32.0 2023-12-04 16:20:43,002 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 16:20:59,316 INFO [train.py:1119] (2/4) Epoch 60, validation: loss=0.1512, simple_loss=0.2484, pruned_loss=0.027, over 944034.00 frames. 2023-12-04 16:20:59,319 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 16:21:05,549 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.23 vs. 
limit=15.0 2023-12-04 16:21:16,739 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=352100.0, ans=0.0 2023-12-04 16:21:17,512 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=352100.0, ans=0.0 2023-12-04 16:21:26,015 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=352166.6666666667, ans=0.0 2023-12-04 16:21:36,874 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=352166.6666666667, ans=0.0 2023-12-04 16:21:44,722 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=352233.3333333333, ans=0.125 2023-12-04 16:21:45,971 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=352233.3333333333, ans=0.035 2023-12-04 16:21:47,335 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=352233.3333333333, ans=0.2 2023-12-04 16:21:47,369 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=352233.3333333333, ans=0.125 2023-12-04 16:21:53,156 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-12-04 16:21:57,588 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.60 vs. limit=15.0 2023-12-04 16:22:08,442 INFO [train.py:1087] (2/4) Epoch 60, batch 50, loss[loss=0.1624, simple_loss=0.2544, pruned_loss=0.03518, over 24481.00 frames. ], tot_loss[loss=0.1537, simple_loss=0.2463, pruned_loss=0.03049, over 1089572.13 frames. ], batch size: 77, lr: 4.10e-03, grad_scale: 32.0 2023-12-04 16:22:08,741 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=352366.6666666667, ans=0.125 2023-12-04 16:22:10,213 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=352366.6666666667, ans=0.125 2023-12-04 16:22:10,229 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=352366.6666666667, ans=0.0 2023-12-04 16:22:22,098 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=352433.3333333333, ans=0.125 2023-12-04 16:22:30,387 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=352433.3333333333, ans=15.0 2023-12-04 16:23:02,003 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.142e+02 1.270e+02 1.378e+02 1.472e+02 2.466e+02, threshold=2.756e+02, percent-clipped=0.0 2023-12-04 16:23:10,523 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.37 vs. limit=15.0 2023-12-04 16:23:12,019 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.58 vs. 
limit=15.0 2023-12-04 16:23:17,113 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=352700.0, ans=15.0 2023-12-04 16:23:17,458 INFO [train.py:1087] (2/4) Epoch 60, batch 100, loss[loss=0.1494, simple_loss=0.2436, pruned_loss=0.02765, over 24785.00 frames. ], tot_loss[loss=0.1515, simple_loss=0.2444, pruned_loss=0.02935, over 1925151.88 frames. ], batch size: 71, lr: 4.10e-03, grad_scale: 32.0 2023-12-04 16:23:27,230 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=352700.0, ans=0.0 2023-12-04 16:23:29,006 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=352700.0, ans=0.05 2023-12-04 16:23:30,764 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.00 vs. limit=15.0 2023-12-04 16:24:25,937 INFO [train.py:1087] (2/4) Epoch 60, batch 150, loss[loss=0.1543, simple_loss=0.2503, pruned_loss=0.02915, over 24845.00 frames. ], tot_loss[loss=0.152, simple_loss=0.245, pruned_loss=0.02947, over 2560947.20 frames. ], batch size: 68, lr: 4.10e-03, grad_scale: 16.0 2023-12-04 16:24:28,728 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=353033.3333333333, ans=0.125 2023-12-04 16:24:46,311 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=353100.0, ans=0.025 2023-12-04 16:24:52,011 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=353100.0, ans=0.2 2023-12-04 16:24:54,694 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=353166.6666666667, ans=0.025 2023-12-04 16:25:02,898 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.50 vs. limit=10.0 2023-12-04 16:25:22,238 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.071e+02 1.281e+02 1.355e+02 1.539e+02 2.129e+02, threshold=2.710e+02, percent-clipped=0.0 2023-12-04 16:25:33,818 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=353300.0, ans=0.0 2023-12-04 16:25:36,387 INFO [train.py:1087] (2/4) Epoch 60, batch 200, loss[loss=0.1441, simple_loss=0.2396, pruned_loss=0.02431, over 24715.00 frames. ], tot_loss[loss=0.1523, simple_loss=0.2452, pruned_loss=0.02968, over 3072404.64 frames. ], batch size: 74, lr: 4.10e-03, grad_scale: 16.0 2023-12-04 16:25:44,396 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=353366.6666666667, ans=0.125 2023-12-04 16:25:51,036 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=353433.3333333333, ans=0.125 2023-12-04 16:26:44,646 INFO [train.py:1087] (2/4) Epoch 60, batch 250, loss[loss=0.1453, simple_loss=0.2335, pruned_loss=0.02854, over 24753.00 frames. ], tot_loss[loss=0.1519, simple_loss=0.2448, pruned_loss=0.02946, over 3455818.40 frames. 
], batch size: 64, lr: 4.10e-03, grad_scale: 16.0 2023-12-04 16:27:32,627 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=353900.0, ans=0.04949747468305833 2023-12-04 16:27:35,406 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.47 vs. limit=10.0 2023-12-04 16:27:40,354 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.098e+02 1.267e+02 1.344e+02 1.454e+02 1.870e+02, threshold=2.689e+02, percent-clipped=0.0 2023-12-04 16:27:54,531 INFO [train.py:1087] (2/4) Epoch 60, batch 300, loss[loss=0.1574, simple_loss=0.247, pruned_loss=0.03391, over 24802.00 frames. ], tot_loss[loss=0.1514, simple_loss=0.2445, pruned_loss=0.02912, over 3754574.03 frames. ], batch size: 62, lr: 4.09e-03, grad_scale: 16.0 2023-12-04 16:28:05,545 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=354033.3333333333, ans=0.1 2023-12-04 16:28:13,653 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=354100.0, ans=0.125 2023-12-04 16:28:44,409 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=354233.3333333333, ans=0.125 2023-12-04 16:28:50,881 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=354300.0, ans=0.125 2023-12-04 16:28:57,101 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=354300.0, ans=0.125 2023-12-04 16:29:02,429 INFO [train.py:1087] (2/4) Epoch 60, batch 350, loss[loss=0.1566, simple_loss=0.2488, pruned_loss=0.03221, over 24553.00 frames. ], tot_loss[loss=0.1514, simple_loss=0.2446, pruned_loss=0.02911, over 3990840.93 frames. ], batch size: 62, lr: 4.09e-03, grad_scale: 16.0 2023-12-04 16:29:54,439 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.77 vs. limit=22.5 2023-12-04 16:29:57,641 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.098e+02 1.285e+02 1.349e+02 1.453e+02 2.187e+02, threshold=2.698e+02, percent-clipped=0.0 2023-12-04 16:30:11,193 INFO [train.py:1087] (2/4) Epoch 60, batch 400, loss[loss=0.1569, simple_loss=0.2491, pruned_loss=0.03237, over 24868.00 frames. ], tot_loss[loss=0.1518, simple_loss=0.245, pruned_loss=0.02934, over 4177024.57 frames. ], batch size: 68, lr: 4.09e-03, grad_scale: 32.0 2023-12-04 16:31:00,482 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=354900.0, ans=0.125 2023-12-04 16:31:02,343 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=354900.0, ans=0.015 2023-12-04 16:31:02,537 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=354900.0, ans=0.125 2023-12-04 16:31:12,001 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=354966.6666666667, ans=0.1 2023-12-04 16:31:19,412 INFO [train.py:1087] (2/4) Epoch 60, batch 450, loss[loss=0.1485, simple_loss=0.2416, pruned_loss=0.02774, over 24746.00 frames. 
], tot_loss[loss=0.1519, simple_loss=0.2448, pruned_loss=0.02947, over 4314921.63 frames. ], batch size: 63, lr: 4.09e-03, grad_scale: 32.0 2023-12-04 16:31:29,069 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.52 vs. limit=15.0 2023-12-04 16:31:41,369 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=355100.0, ans=0.125 2023-12-04 16:31:47,385 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=355166.6666666667, ans=0.125 2023-12-04 16:31:58,916 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=355233.3333333333, ans=0.015 2023-12-04 16:32:02,273 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.61 vs. limit=10.0 2023-12-04 16:32:13,256 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.147e+02 1.286e+02 1.410e+02 1.537e+02 2.012e+02, threshold=2.820e+02, percent-clipped=0.0 2023-12-04 16:32:28,599 INFO [train.py:1087] (2/4) Epoch 60, batch 500, loss[loss=0.1574, simple_loss=0.2478, pruned_loss=0.03352, over 24503.00 frames. ], tot_loss[loss=0.1523, simple_loss=0.2451, pruned_loss=0.02976, over 4415464.34 frames. ], batch size: 75, lr: 4.09e-03, grad_scale: 32.0 2023-12-04 16:32:36,921 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=355366.6666666667, ans=0.125 2023-12-04 16:33:09,844 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=355566.6666666667, ans=0.0 2023-12-04 16:33:37,221 INFO [train.py:1087] (2/4) Epoch 60, batch 550, loss[loss=0.1587, simple_loss=0.2505, pruned_loss=0.03349, over 24745.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.2449, pruned_loss=0.02969, over 4507003.93 frames. 
], batch size: 63, lr: 4.09e-03, grad_scale: 32.0 2023-12-04 16:33:38,973 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=355700.0, ans=0.0 2023-12-04 16:33:40,181 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=355700.0, ans=0.125 2023-12-04 16:33:43,416 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=355700.0, ans=0.05 2023-12-04 16:33:56,687 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=355766.6666666667, ans=0.125 2023-12-04 16:34:07,439 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=355833.3333333333, ans=0.125 2023-12-04 16:34:26,243 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=355900.0, ans=0.0 2023-12-04 16:34:32,276 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.128e+02 1.271e+02 1.366e+02 1.477e+02 1.822e+02, threshold=2.733e+02, percent-clipped=0.0 2023-12-04 16:34:35,547 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=355966.6666666667, ans=0.125 2023-12-04 16:34:38,771 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.79 vs. limit=6.0 2023-12-04 16:34:45,821 INFO [train.py:1087] (2/4) Epoch 60, batch 600, loss[loss=0.1501, simple_loss=0.2409, pruned_loss=0.02968, over 24790.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.245, pruned_loss=0.02963, over 4567529.91 frames. ], batch size: 71, lr: 4.08e-03, grad_scale: 32.0 2023-12-04 16:34:57,458 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=356033.3333333333, ans=0.0 2023-12-04 16:34:59,961 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=356100.0, ans=0.125 2023-12-04 16:35:06,265 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=356100.0, ans=0.125 2023-12-04 16:35:13,859 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.11 vs. limit=15.0 2023-12-04 16:35:17,679 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.29 vs. limit=15.0 2023-12-04 16:35:24,491 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.61 vs. limit=15.0 2023-12-04 16:35:31,868 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=356233.3333333333, ans=0.0 2023-12-04 16:35:32,276 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.84 vs. limit=10.0 2023-12-04 16:35:56,919 INFO [train.py:1087] (2/4) Epoch 60, batch 650, loss[loss=0.1482, simple_loss=0.2433, pruned_loss=0.02658, over 24724.00 frames. 
], tot_loss[loss=0.1521, simple_loss=0.2451, pruned_loss=0.02953, over 4637942.28 frames. ], batch size: 67, lr: 4.08e-03, grad_scale: 32.0 2023-12-04 16:36:23,114 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=356500.0, ans=0.04949747468305833 2023-12-04 16:36:34,752 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-12-04 16:36:41,639 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.93 vs. limit=22.5 2023-12-04 16:36:45,517 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.45 vs. limit=15.0 2023-12-04 16:36:52,430 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.038e+02 1.271e+02 1.373e+02 1.481e+02 1.948e+02, threshold=2.747e+02, percent-clipped=0.0 2023-12-04 16:36:55,849 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.16 vs. limit=22.5 2023-12-04 16:37:07,586 INFO [train.py:1087] (2/4) Epoch 60, batch 700, loss[loss=0.1531, simple_loss=0.2465, pruned_loss=0.0298, over 24502.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.2449, pruned_loss=0.02959, over 4685316.94 frames. ], batch size: 75, lr: 4.08e-03, grad_scale: 32.0 2023-12-04 16:37:11,088 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.90 vs. limit=12.0 2023-12-04 16:37:45,397 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=356833.3333333333, ans=0.0 2023-12-04 16:37:50,839 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=356900.0, ans=0.125 2023-12-04 16:38:16,519 INFO [train.py:1087] (2/4) Epoch 60, batch 750, loss[loss=0.1533, simple_loss=0.2485, pruned_loss=0.02901, over 24459.00 frames. ], tot_loss[loss=0.1522, simple_loss=0.2451, pruned_loss=0.02967, over 4712889.87 frames. ], batch size: 77, lr: 4.08e-03, grad_scale: 32.0 2023-12-04 16:38:26,923 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=357033.3333333333, ans=0.125 2023-12-04 16:38:34,693 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=357100.0, ans=0.04949747468305833 2023-12-04 16:39:12,776 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.101e+02 1.281e+02 1.351e+02 1.479e+02 1.994e+02, threshold=2.703e+02, percent-clipped=0.0 2023-12-04 16:39:27,488 INFO [train.py:1087] (2/4) Epoch 60, batch 800, loss[loss=0.1516, simple_loss=0.2469, pruned_loss=0.02811, over 24281.00 frames. ], tot_loss[loss=0.1524, simple_loss=0.2453, pruned_loss=0.02974, over 4728635.04 frames. 
], batch size: 79, lr: 4.08e-03, grad_scale: 32.0 2023-12-04 16:39:47,726 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=357433.3333333333, ans=0.125 2023-12-04 16:40:01,831 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=357500.0, ans=0.2 2023-12-04 16:40:30,170 INFO [train.py:1087] (2/4) Epoch 60, batch 850, loss[loss=0.1456, simple_loss=0.2382, pruned_loss=0.02651, over 24621.00 frames. ], tot_loss[loss=0.152, simple_loss=0.2449, pruned_loss=0.02954, over 4756219.06 frames. ], batch size: 68, lr: 4.07e-03, grad_scale: 32.0 2023-12-04 16:40:34,285 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.46 vs. limit=15.0 2023-12-04 16:40:47,625 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=357766.6666666667, ans=0.125 2023-12-04 16:40:53,565 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=357833.3333333333, ans=0.125 2023-12-04 16:40:57,063 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=357833.3333333333, ans=0.125 2023-12-04 16:41:03,032 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=357833.3333333333, ans=0.125 2023-12-04 16:41:10,318 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=357900.0, ans=0.125 2023-12-04 16:41:11,417 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=357900.0, ans=0.125 2023-12-04 16:41:18,299 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.077e+02 1.265e+02 1.348e+02 1.477e+02 2.020e+02, threshold=2.695e+02, percent-clipped=0.0 2023-12-04 16:41:45,593 INFO [train.py:1087] (2/4) Epoch 61, batch 0, loss[loss=0.1454, simple_loss=0.2374, pruned_loss=0.0267, over 24549.00 frames. ], tot_loss[loss=0.1454, simple_loss=0.2374, pruned_loss=0.0267, over 24549.00 frames. ], batch size: 66, lr: 4.04e-03, grad_scale: 32.0 2023-12-04 16:41:45,596 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 16:42:02,325 INFO [train.py:1119] (2/4) Epoch 61, validation: loss=0.1508, simple_loss=0.248, pruned_loss=0.0268, over 944034.00 frames. 2023-12-04 16:42:02,326 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 16:42:10,394 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=358000.0, ans=0.025 2023-12-04 16:42:17,249 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=358066.6666666667, ans=0.125 2023-12-04 16:42:36,715 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=358133.3333333333, ans=0.0 2023-12-04 16:42:43,641 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.35 vs. 
limit=10.0 2023-12-04 16:42:52,440 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=358200.0, ans=0.125 2023-12-04 16:42:56,820 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.88 vs. limit=15.0 2023-12-04 16:43:04,266 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=358266.6666666667, ans=0.125 2023-12-04 16:43:12,524 INFO [train.py:1087] (2/4) Epoch 61, batch 50, loss[loss=0.1826, simple_loss=0.2651, pruned_loss=0.05003, over 17180.00 frames. ], tot_loss[loss=0.1534, simple_loss=0.2464, pruned_loss=0.0302, over 1070424.85 frames. ], batch size: 177, lr: 4.04e-03, grad_scale: 32.0 2023-12-04 16:43:14,431 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=358333.3333333333, ans=0.125 2023-12-04 16:43:38,582 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=358466.6666666667, ans=0.125 2023-12-04 16:43:48,082 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.79 vs. limit=15.0 2023-12-04 16:44:03,632 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=358533.3333333333, ans=0.125 2023-12-04 16:44:06,313 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=358600.0, ans=0.125 2023-12-04 16:44:13,608 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.089e+02 1.266e+02 1.367e+02 1.459e+02 2.673e+02, threshold=2.733e+02, percent-clipped=0.0 2023-12-04 16:44:20,725 INFO [train.py:1087] (2/4) Epoch 61, batch 100, loss[loss=0.1628, simple_loss=0.2511, pruned_loss=0.03729, over 24297.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.2455, pruned_loss=0.0293, over 1904050.78 frames. ], batch size: 79, lr: 4.04e-03, grad_scale: 32.0 2023-12-04 16:44:29,979 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=358666.6666666667, ans=0.125 2023-12-04 16:44:33,945 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=358733.3333333333, ans=0.125 2023-12-04 16:44:38,754 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.96 vs. limit=15.0 2023-12-04 16:44:49,465 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=358800.0, ans=0.025 2023-12-04 16:44:57,425 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=358800.0, ans=0.1 2023-12-04 16:45:30,337 INFO [train.py:1087] (2/4) Epoch 61, batch 150, loss[loss=0.1527, simple_loss=0.2452, pruned_loss=0.03006, over 24760.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.2455, pruned_loss=0.02934, over 2547509.68 frames. ], batch size: 64, lr: 4.03e-03, grad_scale: 32.0 2023-12-04 16:45:32,364 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.80 vs. 
limit=15.0 2023-12-04 16:45:40,878 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=359000.0, ans=0.125 2023-12-04 16:46:33,179 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.104e+02 1.244e+02 1.317e+02 1.412e+02 1.866e+02, threshold=2.633e+02, percent-clipped=0.0 2023-12-04 16:46:39,781 INFO [train.py:1087] (2/4) Epoch 61, batch 200, loss[loss=0.1448, simple_loss=0.241, pruned_loss=0.02429, over 21165.00 frames. ], tot_loss[loss=0.152, simple_loss=0.2451, pruned_loss=0.02945, over 3049139.62 frames. ], batch size: 127, lr: 4.03e-03, grad_scale: 32.0 2023-12-04 16:47:25,167 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.63 vs. limit=15.0 2023-12-04 16:47:29,198 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=5.01 vs. limit=15.0 2023-12-04 16:47:48,469 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=359666.6666666667, ans=0.0 2023-12-04 16:47:49,347 INFO [train.py:1087] (2/4) Epoch 61, batch 250, loss[loss=0.1513, simple_loss=0.2449, pruned_loss=0.02881, over 24570.00 frames. ], tot_loss[loss=0.1528, simple_loss=0.2458, pruned_loss=0.02989, over 3427835.21 frames. ], batch size: 64, lr: 4.03e-03, grad_scale: 32.0 2023-12-04 16:48:17,766 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=359800.0, ans=0.125 2023-12-04 16:48:35,333 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=359866.6666666667, ans=0.2 2023-12-04 16:48:41,668 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=359866.6666666667, ans=0.125 2023-12-04 16:48:51,443 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.112e+02 1.284e+02 1.364e+02 1.504e+02 1.772e+02, threshold=2.728e+02, percent-clipped=0.0 2023-12-04 16:48:51,973 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=359933.3333333333, ans=0.0 2023-12-04 16:48:53,292 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=359933.3333333333, ans=0.125 2023-12-04 16:48:56,497 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.07 vs. limit=6.0 2023-12-04 16:48:59,330 INFO [train.py:1087] (2/4) Epoch 61, batch 300, loss[loss=0.1569, simple_loss=0.2524, pruned_loss=0.03063, over 24570.00 frames. ], tot_loss[loss=0.1526, simple_loss=0.2455, pruned_loss=0.02984, over 3743590.09 frames. ], batch size: 65, lr: 4.03e-03, grad_scale: 32.0 2023-12-04 16:49:17,849 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=360066.6666666667, ans=0.125 2023-12-04 16:50:00,842 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.84 vs. 
limit=22.5 2023-12-04 16:50:03,253 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=360266.6666666667, ans=0.1 2023-12-04 16:50:07,331 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=360266.6666666667, ans=0.09899494936611666 2023-12-04 16:50:09,571 INFO [train.py:1087] (2/4) Epoch 61, batch 350, loss[loss=0.1553, simple_loss=0.2457, pruned_loss=0.03241, over 24557.00 frames. ], tot_loss[loss=0.1522, simple_loss=0.2452, pruned_loss=0.02963, over 3982994.58 frames. ], batch size: 63, lr: 4.03e-03, grad_scale: 16.0 2023-12-04 16:50:17,056 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=360333.3333333333, ans=0.0 2023-12-04 16:50:20,646 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=360333.3333333333, ans=0.0 2023-12-04 16:50:37,122 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=360466.6666666667, ans=0.1 2023-12-04 16:50:42,231 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=360466.6666666667, ans=0.0 2023-12-04 16:50:59,884 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=360533.3333333333, ans=0.0 2023-12-04 16:51:02,312 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=9.508e-02 2023-12-04 16:51:14,388 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.133e+02 1.266e+02 1.361e+02 1.516e+02 2.406e+02, threshold=2.722e+02, percent-clipped=0.0 2023-12-04 16:51:19,705 INFO [train.py:1087] (2/4) Epoch 61, batch 400, loss[loss=0.1433, simple_loss=0.235, pruned_loss=0.02579, over 24771.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.245, pruned_loss=0.02961, over 4164925.94 frames. ], batch size: 70, lr: 4.02e-03, grad_scale: 32.0 2023-12-04 16:51:26,686 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=360666.6666666667, ans=0.125 2023-12-04 16:51:38,813 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.54 vs. limit=22.5 2023-12-04 16:51:49,122 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.76 vs. limit=15.0 2023-12-04 16:51:54,733 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.61 vs. limit=15.0 2023-12-04 16:52:12,858 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.55 vs. limit=15.0 2023-12-04 16:52:21,361 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=360933.3333333333, ans=0.0 2023-12-04 16:52:21,780 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.10 vs. 
limit=15.0 2023-12-04 16:52:29,204 INFO [train.py:1087] (2/4) Epoch 61, batch 450, loss[loss=0.1614, simple_loss=0.2519, pruned_loss=0.03545, over 23583.00 frames. ], tot_loss[loss=0.1519, simple_loss=0.2448, pruned_loss=0.0295, over 4318821.41 frames. ], batch size: 94, lr: 4.02e-03, grad_scale: 16.0 2023-12-04 16:52:34,098 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.45 vs. limit=6.0 2023-12-04 16:52:36,135 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=361000.0, ans=0.09899494936611666 2023-12-04 16:52:43,135 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.67 vs. limit=22.5 2023-12-04 16:52:55,502 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=361133.3333333333, ans=0.05 2023-12-04 16:53:05,104 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.42 vs. limit=6.0 2023-12-04 16:53:23,503 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=361266.6666666667, ans=0.125 2023-12-04 16:53:34,669 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.144e+02 1.242e+02 1.346e+02 1.491e+02 2.478e+02, threshold=2.691e+02, percent-clipped=0.0 2023-12-04 16:53:40,294 INFO [train.py:1087] (2/4) Epoch 61, batch 500, loss[loss=0.1446, simple_loss=0.2391, pruned_loss=0.02504, over 24722.00 frames. ], tot_loss[loss=0.1522, simple_loss=0.245, pruned_loss=0.02964, over 4428867.79 frames. ], batch size: 69, lr: 4.02e-03, grad_scale: 16.0 2023-12-04 16:53:59,498 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=361400.0, ans=0.0 2023-12-04 16:54:09,427 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.44 vs. limit=15.0 2023-12-04 16:54:27,302 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=361533.3333333333, ans=0.125 2023-12-04 16:54:31,181 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=361533.3333333333, ans=0.125 2023-12-04 16:54:49,775 INFO [train.py:1087] (2/4) Epoch 61, batch 550, loss[loss=0.163, simple_loss=0.2524, pruned_loss=0.03675, over 24308.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.2452, pruned_loss=0.0295, over 4524105.94 frames. ], batch size: 79, lr: 4.02e-03, grad_scale: 16.0 2023-12-04 16:54:57,042 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.79 vs. 
limit=15.0 2023-12-04 16:55:01,195 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=361666.6666666667, ans=0.1 2023-12-04 16:55:04,534 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=361733.3333333333, ans=0.125 2023-12-04 16:55:13,727 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=361733.3333333333, ans=0.0 2023-12-04 16:55:14,850 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=361733.3333333333, ans=0.125 2023-12-04 16:55:20,303 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 16:55:44,381 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=361866.6666666667, ans=0.04949747468305833 2023-12-04 16:55:57,445 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.074e+02 1.276e+02 1.374e+02 1.486e+02 2.079e+02, threshold=2.748e+02, percent-clipped=0.0 2023-12-04 16:56:00,071 INFO [train.py:1087] (2/4) Epoch 61, batch 600, loss[loss=0.1517, simple_loss=0.2442, pruned_loss=0.02961, over 24767.00 frames. ], tot_loss[loss=0.1519, simple_loss=0.2449, pruned_loss=0.0294, over 4586014.39 frames. ], batch size: 70, lr: 4.02e-03, grad_scale: 8.0 2023-12-04 16:56:39,372 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=362133.3333333333, ans=0.125 2023-12-04 16:56:47,222 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=362200.0, ans=0.125 2023-12-04 16:57:09,916 INFO [train.py:1087] (2/4) Epoch 61, batch 650, loss[loss=0.1583, simple_loss=0.249, pruned_loss=0.03382, over 24433.00 frames. ], tot_loss[loss=0.1516, simple_loss=0.2447, pruned_loss=0.02932, over 4649285.70 frames. ], batch size: 77, lr: 4.01e-03, grad_scale: 8.0 2023-12-04 16:57:10,525 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=362333.3333333333, ans=0.125 2023-12-04 16:57:20,961 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=362333.3333333333, ans=0.1 2023-12-04 16:57:26,606 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.80 vs. limit=15.0 2023-12-04 16:57:27,603 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=362400.0, ans=0.125 2023-12-04 16:57:36,554 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=362466.6666666667, ans=0.0 2023-12-04 16:58:14,574 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.47 vs. 
limit=10.0 2023-12-04 16:58:16,580 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.291e+02 1.410e+02 1.493e+02 2.580e+02, threshold=2.821e+02, percent-clipped=0.0 2023-12-04 16:58:18,326 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=362666.6666666667, ans=0.09899494936611666 2023-12-04 16:58:19,711 INFO [train.py:1087] (2/4) Epoch 61, batch 700, loss[loss=0.1553, simple_loss=0.2462, pruned_loss=0.03222, over 24782.00 frames. ], tot_loss[loss=0.1518, simple_loss=0.2447, pruned_loss=0.0294, over 4686514.64 frames. ], batch size: 73, lr: 4.01e-03, grad_scale: 8.0 2023-12-04 16:58:34,139 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=362733.3333333333, ans=0.125 2023-12-04 16:59:29,994 INFO [train.py:1087] (2/4) Epoch 61, batch 750, loss[loss=0.1647, simple_loss=0.2589, pruned_loss=0.03527, over 23946.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.2449, pruned_loss=0.02962, over 4721422.43 frames. ], batch size: 87, lr: 4.01e-03, grad_scale: 8.0 2023-12-04 16:59:36,235 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=363000.0, ans=0.125 2023-12-04 16:59:55,432 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=363066.6666666667, ans=0.125 2023-12-04 16:59:59,283 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=363133.3333333333, ans=0.125 2023-12-04 17:00:04,494 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=363133.3333333333, ans=0.1 2023-12-04 17:00:16,541 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=363200.0, ans=0.125 2023-12-04 17:00:29,967 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=363266.6666666667, ans=0.125 2023-12-04 17:00:36,607 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.146e+02 1.303e+02 1.435e+02 1.618e+02 2.179e+02, threshold=2.869e+02, percent-clipped=0.0 2023-12-04 17:00:39,248 INFO [train.py:1087] (2/4) Epoch 61, batch 800, loss[loss=0.1524, simple_loss=0.242, pruned_loss=0.0314, over 24607.00 frames. ], tot_loss[loss=0.1517, simple_loss=0.2447, pruned_loss=0.02936, over 4760720.09 frames. ], batch size: 68, lr: 4.01e-03, grad_scale: 16.0 2023-12-04 17:00:40,912 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=363333.3333333333, ans=0.125 2023-12-04 17:00:47,675 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=363333.3333333333, ans=0.1 2023-12-04 17:01:27,699 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=363533.3333333333, ans=0.2 2023-12-04 17:01:32,535 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=363600.0, ans=0.1 2023-12-04 17:01:34,112 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.54 vs. 
limit=15.0 2023-12-04 17:01:41,754 INFO [train.py:1087] (2/4) Epoch 61, batch 850, loss[loss=0.1491, simple_loss=0.2471, pruned_loss=0.02548, over 24552.00 frames. ], tot_loss[loss=0.1514, simple_loss=0.2445, pruned_loss=0.02919, over 4783219.01 frames. ], batch size: 62, lr: 4.01e-03, grad_scale: 16.0 2023-12-04 17:02:32,675 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=363933.3333333333, ans=0.04949747468305833 2023-12-04 17:02:48,240 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=363966.6666666667, ans=0.125 2023-12-04 17:03:02,157 INFO [train.py:1087] (2/4) Epoch 62, batch 0, loss[loss=0.1403, simple_loss=0.2341, pruned_loss=0.02331, over 24726.00 frames. ], tot_loss[loss=0.1403, simple_loss=0.2341, pruned_loss=0.02331, over 24726.00 frames. ], batch size: 67, lr: 3.97e-03, grad_scale: 32.0 2023-12-04 17:03:02,158 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 17:03:18,758 INFO [train.py:1119] (2/4) Epoch 62, validation: loss=0.1507, simple_loss=0.2477, pruned_loss=0.02683, over 944034.00 frames. 2023-12-04 17:03:18,761 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 17:03:22,814 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.085e+02 1.310e+02 1.390e+02 1.540e+02 2.512e+02, threshold=2.781e+02, percent-clipped=0.0 2023-12-04 17:03:54,116 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 17:04:02,859 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=364166.6666666667, ans=0.125 2023-12-04 17:04:28,797 INFO [train.py:1087] (2/4) Epoch 62, batch 50, loss[loss=0.1622, simple_loss=0.2524, pruned_loss=0.03597, over 21488.00 frames. ], tot_loss[loss=0.1525, simple_loss=0.2453, pruned_loss=0.02986, over 1081141.03 frames. ], batch size: 127, lr: 3.97e-03, grad_scale: 32.0 2023-12-04 17:04:30,573 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=364300.0, ans=0.1 2023-12-04 17:05:09,453 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=364500.0, ans=0.0 2023-12-04 17:05:33,753 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=364566.6666666667, ans=0.0 2023-12-04 17:05:34,424 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.30 vs. limit=22.5 2023-12-04 17:05:35,217 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=364566.6666666667, ans=0.0 2023-12-04 17:05:38,084 INFO [train.py:1087] (2/4) Epoch 62, batch 100, loss[loss=0.1557, simple_loss=0.2502, pruned_loss=0.03057, over 23494.00 frames. ], tot_loss[loss=0.1523, simple_loss=0.2451, pruned_loss=0.02976, over 1898153.34 frames. 
], batch size: 94, lr: 3.97e-03, grad_scale: 32.0 2023-12-04 17:05:41,924 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=364633.3333333333, ans=0.0 2023-12-04 17:05:42,761 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.257e+02 1.334e+02 1.479e+02 2.248e+02, threshold=2.667e+02, percent-clipped=0.0 2023-12-04 17:05:48,468 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=364633.3333333333, ans=0.1 2023-12-04 17:05:48,478 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=364633.3333333333, ans=0.2 2023-12-04 17:06:15,870 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=364766.6666666667, ans=0.125 2023-12-04 17:06:27,726 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.05 vs. limit=22.5 2023-12-04 17:06:29,911 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=364833.3333333333, ans=0.125 2023-12-04 17:06:29,948 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=364833.3333333333, ans=0.2 2023-12-04 17:06:35,818 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=364900.0, ans=0.05 2023-12-04 17:06:42,597 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=364900.0, ans=0.1 2023-12-04 17:06:48,108 INFO [train.py:1087] (2/4) Epoch 62, batch 150, loss[loss=0.1435, simple_loss=0.2385, pruned_loss=0.02419, over 24598.00 frames. ], tot_loss[loss=0.1522, simple_loss=0.245, pruned_loss=0.02972, over 2543663.94 frames. ], batch size: 68, lr: 3.97e-03, grad_scale: 32.0 2023-12-04 17:06:54,200 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.56 vs. limit=15.0 2023-12-04 17:07:46,463 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.24 vs. limit=22.5 2023-12-04 17:07:58,187 INFO [train.py:1087] (2/4) Epoch 62, batch 200, loss[loss=0.1634, simple_loss=0.2527, pruned_loss=0.03699, over 24492.00 frames. ], tot_loss[loss=0.153, simple_loss=0.2456, pruned_loss=0.03018, over 3032764.84 frames. 
], batch size: 77, lr: 3.97e-03, grad_scale: 16.0 2023-12-04 17:07:58,477 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=365300.0, ans=0.0 2023-12-04 17:08:03,394 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.115e+02 1.286e+02 1.429e+02 1.574e+02 2.172e+02, threshold=2.858e+02, percent-clipped=0.0 2023-12-04 17:08:31,153 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=365433.3333333333, ans=0.0 2023-12-04 17:08:33,815 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=365433.3333333333, ans=0.125 2023-12-04 17:09:06,087 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=365633.3333333333, ans=0.125 2023-12-04 17:09:07,630 INFO [train.py:1087] (2/4) Epoch 62, batch 250, loss[loss=0.1425, simple_loss=0.2312, pruned_loss=0.02695, over 24568.00 frames. ], tot_loss[loss=0.1525, simple_loss=0.2451, pruned_loss=0.02991, over 3431108.21 frames. ], batch size: 64, lr: 3.96e-03, grad_scale: 16.0 2023-12-04 17:09:09,285 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=365633.3333333333, ans=0.2 2023-12-04 17:09:36,439 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=365766.6666666667, ans=0.2 2023-12-04 17:09:47,628 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=365766.6666666667, ans=10.0 2023-12-04 17:10:14,027 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=365900.0, ans=0.125 2023-12-04 17:10:18,236 INFO [train.py:1087] (2/4) Epoch 62, batch 300, loss[loss=0.1535, simple_loss=0.245, pruned_loss=0.03099, over 24610.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.2448, pruned_loss=0.02971, over 3731311.05 frames. ], batch size: 68, lr: 3.96e-03, grad_scale: 16.0 2023-12-04 17:10:23,652 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.157e+02 1.294e+02 1.414e+02 1.524e+02 2.154e+02, threshold=2.827e+02, percent-clipped=0.0 2023-12-04 17:10:39,128 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=366033.3333333333, ans=0.1 2023-12-04 17:11:12,212 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=366233.3333333333, ans=0.0 2023-12-04 17:11:20,770 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=366233.3333333333, ans=0.125 2023-12-04 17:11:23,249 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=366233.3333333333, ans=0.0 2023-12-04 17:11:25,107 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.81 vs. limit=15.0 2023-12-04 17:11:27,236 INFO [train.py:1087] (2/4) Epoch 62, batch 350, loss[loss=0.1603, simple_loss=0.2533, pruned_loss=0.03366, over 24810.00 frames. ], tot_loss[loss=0.1523, simple_loss=0.2449, pruned_loss=0.02985, over 3970917.81 frames. 
], batch size: 62, lr: 3.96e-03, grad_scale: 16.0 2023-12-04 17:11:50,924 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=366366.6666666667, ans=0.125 2023-12-04 17:11:59,951 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=366433.3333333333, ans=0.04949747468305833 2023-12-04 17:12:08,130 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=366500.0, ans=0.0 2023-12-04 17:12:37,919 INFO [train.py:1087] (2/4) Epoch 62, batch 400, loss[loss=0.1432, simple_loss=0.2372, pruned_loss=0.02455, over 24717.00 frames. ], tot_loss[loss=0.1518, simple_loss=0.2447, pruned_loss=0.02947, over 4158123.04 frames. ], batch size: 69, lr: 3.96e-03, grad_scale: 32.0 2023-12-04 17:12:39,742 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=366633.3333333333, ans=0.05 2023-12-04 17:12:43,082 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.130e+02 1.289e+02 1.370e+02 1.489e+02 1.950e+02, threshold=2.740e+02, percent-clipped=0.0 2023-12-04 17:12:43,461 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=366633.3333333333, ans=0.1 2023-12-04 17:13:12,577 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=366766.6666666667, ans=0.0 2023-12-04 17:13:30,295 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=366833.3333333333, ans=0.1 2023-12-04 17:13:49,169 INFO [train.py:1087] (2/4) Epoch 62, batch 450, loss[loss=0.1425, simple_loss=0.2307, pruned_loss=0.02712, over 24763.00 frames. ], tot_loss[loss=0.1517, simple_loss=0.2446, pruned_loss=0.02942, over 4292878.64 frames. ], batch size: 65, lr: 3.96e-03, grad_scale: 32.0 2023-12-04 17:14:27,022 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 17:14:32,079 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=367166.6666666667, ans=0.125 2023-12-04 17:14:34,748 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=367166.6666666667, ans=0.125 2023-12-04 17:14:57,645 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=367300.0, ans=0.125 2023-12-04 17:14:58,527 INFO [train.py:1087] (2/4) Epoch 62, batch 500, loss[loss=0.1558, simple_loss=0.2481, pruned_loss=0.03172, over 22754.00 frames. ], tot_loss[loss=0.1515, simple_loss=0.2444, pruned_loss=0.02927, over 4406652.96 frames. 
], batch size: 106, lr: 3.96e-03, grad_scale: 32.0 2023-12-04 17:15:01,637 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=367300.0, ans=0.5 2023-12-04 17:15:04,485 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.096e+02 1.254e+02 1.390e+02 1.611e+02 2.032e+02, threshold=2.780e+02, percent-clipped=0.0 2023-12-04 17:15:12,994 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=367366.6666666667, ans=0.125 2023-12-04 17:15:17,597 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.41 vs. limit=15.0 2023-12-04 17:15:25,055 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=367433.3333333333, ans=0.125 2023-12-04 17:15:36,733 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=367433.3333333333, ans=0.1 2023-12-04 17:16:08,875 INFO [train.py:1087] (2/4) Epoch 62, batch 550, loss[loss=0.1609, simple_loss=0.2525, pruned_loss=0.03467, over 24796.00 frames. ], tot_loss[loss=0.1517, simple_loss=0.2446, pruned_loss=0.02939, over 4487497.47 frames. ], batch size: 62, lr: 3.95e-03, grad_scale: 32.0 2023-12-04 17:16:09,608 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-12-04 17:16:12,573 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=367633.3333333333, ans=0.2 2023-12-04 17:16:23,814 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=367700.0, ans=0.0 2023-12-04 17:16:25,318 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=367700.0, ans=0.07 2023-12-04 17:16:29,949 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=367700.0, ans=0.125 2023-12-04 17:16:47,735 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=367766.6666666667, ans=0.2 2023-12-04 17:17:01,423 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.38 vs. limit=10.0 2023-12-04 17:17:05,077 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=367900.0, ans=0.2 2023-12-04 17:17:19,972 INFO [train.py:1087] (2/4) Epoch 62, batch 600, loss[loss=0.1433, simple_loss=0.2353, pruned_loss=0.02568, over 24750.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.2448, pruned_loss=0.02966, over 4529145.72 frames. 
], batch size: 65, lr: 3.95e-03, grad_scale: 32.0 2023-12-04 17:17:25,847 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.120e+02 1.279e+02 1.362e+02 1.462e+02 1.926e+02, threshold=2.724e+02, percent-clipped=0.0 2023-12-04 17:17:34,028 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=368033.3333333333, ans=0.125 2023-12-04 17:18:05,910 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=368166.6666666667, ans=0.125 2023-12-04 17:18:30,241 INFO [train.py:1087] (2/4) Epoch 62, batch 650, loss[loss=0.1734, simple_loss=0.262, pruned_loss=0.04243, over 24457.00 frames. ], tot_loss[loss=0.1517, simple_loss=0.2446, pruned_loss=0.02944, over 4611680.57 frames. ], batch size: 75, lr: 3.95e-03, grad_scale: 32.0 2023-12-04 17:18:45,186 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=368366.6666666667, ans=0.0 2023-12-04 17:18:57,636 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=368433.3333333333, ans=0.0 2023-12-04 17:19:08,622 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=368433.3333333333, ans=0.125 2023-12-04 17:19:30,346 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=368566.6666666667, ans=0.125 2023-12-04 17:19:31,473 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=368566.6666666667, ans=0.035 2023-12-04 17:19:40,827 INFO [train.py:1087] (2/4) Epoch 62, batch 700, loss[loss=0.1514, simple_loss=0.2464, pruned_loss=0.02813, over 24751.00 frames. ], tot_loss[loss=0.1516, simple_loss=0.2446, pruned_loss=0.02935, over 4663522.26 frames. ], batch size: 61, lr: 3.95e-03, grad_scale: 32.0 2023-12-04 17:19:42,497 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=368633.3333333333, ans=0.05 2023-12-04 17:19:46,059 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.235e+02 1.327e+02 1.428e+02 1.711e+02, threshold=2.654e+02, percent-clipped=0.0 2023-12-04 17:20:26,697 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=368833.3333333333, ans=0.0 2023-12-04 17:20:45,621 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=368900.0, ans=0.0 2023-12-04 17:20:50,495 INFO [train.py:1087] (2/4) Epoch 62, batch 750, loss[loss=0.1496, simple_loss=0.2446, pruned_loss=0.02732, over 24604.00 frames. ], tot_loss[loss=0.1519, simple_loss=0.2446, pruned_loss=0.02957, over 4703159.68 frames. 
], batch size: 68, lr: 3.95e-03, grad_scale: 16.0 2023-12-04 17:21:01,763 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=368966.6666666667, ans=0.0 2023-12-04 17:21:35,912 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=369166.6666666667, ans=0.125 2023-12-04 17:21:36,995 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=369166.6666666667, ans=0.125 2023-12-04 17:21:38,996 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=369166.6666666667, ans=0.02 2023-12-04 17:21:58,783 INFO [train.py:1087] (2/4) Epoch 62, batch 800, loss[loss=0.1493, simple_loss=0.2444, pruned_loss=0.02705, over 24774.00 frames. ], tot_loss[loss=0.1517, simple_loss=0.2443, pruned_loss=0.02948, over 4747522.27 frames. ], batch size: 65, lr: 3.94e-03, grad_scale: 32.0 2023-12-04 17:22:00,161 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=369300.0, ans=0.125 2023-12-04 17:22:05,817 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.079e+02 1.274e+02 1.367e+02 1.528e+02 1.973e+02, threshold=2.734e+02, percent-clipped=0.0 2023-12-04 17:22:18,168 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=369366.6666666667, ans=0.125 2023-12-04 17:22:35,061 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=369500.0, ans=0.5 2023-12-04 17:22:37,933 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.57 vs. limit=15.0 2023-12-04 17:22:51,196 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 17:22:53,443 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=369566.6666666667, ans=0.0 2023-12-04 17:22:55,946 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=369566.6666666667, ans=0.0 2023-12-04 17:23:00,455 INFO [train.py:1087] (2/4) Epoch 62, batch 850, loss[loss=0.1439, simple_loss=0.2387, pruned_loss=0.02458, over 24727.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.2448, pruned_loss=0.02965, over 4766029.67 frames. 
], batch size: 67, lr: 3.94e-03, grad_scale: 32.0 2023-12-04 17:23:08,060 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=369633.3333333333, ans=0.2 2023-12-04 17:23:10,349 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=369633.3333333333, ans=0.125 2023-12-04 17:23:23,385 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=369766.6666666667, ans=0.2 2023-12-04 17:23:41,209 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=369833.3333333333, ans=0.125 2023-12-04 17:23:52,003 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=369900.0, ans=0.95 2023-12-04 17:24:05,462 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=369933.3333333333, ans=0.0 2023-12-04 17:24:17,808 INFO [train.py:1087] (2/4) Epoch 63, batch 0, loss[loss=0.1572, simple_loss=0.2539, pruned_loss=0.03025, over 24194.00 frames. ], tot_loss[loss=0.1572, simple_loss=0.2539, pruned_loss=0.03025, over 24194.00 frames. ], batch size: 82, lr: 3.91e-03, grad_scale: 32.0 2023-12-04 17:24:17,809 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 17:24:29,187 INFO [zipformer.py:1876] (2/4) name=encoder.encoders.4.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([2.7945, 5.0604, 4.3789, 4.8918], device='cuda:2') 2023-12-04 17:24:33,804 INFO [train.py:1119] (2/4) Epoch 63, validation: loss=0.1507, simple_loss=0.2477, pruned_loss=0.02691, over 944034.00 frames. 2023-12-04 17:24:33,805 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 17:24:40,033 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=369933.3333333333, ans=0.1 2023-12-04 17:24:45,077 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=369933.3333333333, ans=10.0 2023-12-04 17:24:47,335 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.085e+02 1.268e+02 1.369e+02 1.467e+02 2.072e+02, threshold=2.738e+02, percent-clipped=0.0 2023-12-04 17:25:43,159 INFO [train.py:1087] (2/4) Epoch 63, batch 50, loss[loss=0.1504, simple_loss=0.2449, pruned_loss=0.02794, over 24806.00 frames. ], tot_loss[loss=0.1506, simple_loss=0.2435, pruned_loss=0.02889, over 1077944.61 frames. ], batch size: 72, lr: 3.91e-03, grad_scale: 32.0 2023-12-04 17:25:49,506 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=370266.6666666667, ans=0.0 2023-12-04 17:26:03,048 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=370333.3333333333, ans=0.0 2023-12-04 17:26:16,270 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-12-04 17:26:52,598 INFO [train.py:1087] (2/4) Epoch 63, batch 100, loss[loss=0.1486, simple_loss=0.2473, pruned_loss=0.02494, over 21352.00 frames. ], tot_loss[loss=0.1503, simple_loss=0.2436, pruned_loss=0.0285, over 1902084.74 frames. 
], batch size: 128, lr: 3.91e-03, grad_scale: 32.0 2023-12-04 17:27:03,026 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.35 vs. limit=15.0 2023-12-04 17:27:07,285 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.073e+02 1.263e+02 1.363e+02 1.431e+02 2.144e+02, threshold=2.726e+02, percent-clipped=0.0 2023-12-04 17:27:09,224 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.65 vs. limit=15.0 2023-12-04 17:27:12,794 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=370666.6666666667, ans=0.015 2023-12-04 17:27:16,144 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=370666.6666666667, ans=0.07 2023-12-04 17:27:16,259 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=370666.6666666667, ans=0.125 2023-12-04 17:27:17,654 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=370666.6666666667, ans=0.1 2023-12-04 17:27:49,584 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 17:27:52,023 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=370866.6666666667, ans=0.95 2023-12-04 17:27:55,650 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.78 vs. limit=6.0 2023-12-04 17:28:03,093 INFO [train.py:1087] (2/4) Epoch 63, batch 150, loss[loss=0.1492, simple_loss=0.2398, pruned_loss=0.02928, over 24285.00 frames. ], tot_loss[loss=0.1503, simple_loss=0.2434, pruned_loss=0.02858, over 2551549.73 frames. ], batch size: 79, lr: 3.90e-03, grad_scale: 32.0 2023-12-04 17:28:06,233 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=370933.3333333333, ans=0.125 2023-12-04 17:28:34,519 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=371066.6666666667, ans=0.125 2023-12-04 17:28:44,963 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 17:28:46,743 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.22 vs. limit=22.5 2023-12-04 17:29:05,108 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.61 vs. limit=15.0 2023-12-04 17:29:13,697 INFO [train.py:1087] (2/4) Epoch 63, batch 200, loss[loss=0.1621, simple_loss=0.2554, pruned_loss=0.03444, over 24483.00 frames. ], tot_loss[loss=0.1505, simple_loss=0.2436, pruned_loss=0.02868, over 3033526.73 frames. 
], batch size: 75, lr: 3.90e-03, grad_scale: 32.0 2023-12-04 17:29:26,807 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.071e+02 1.284e+02 1.367e+02 1.487e+02 1.776e+02, threshold=2.735e+02, percent-clipped=0.0 2023-12-04 17:29:33,491 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=371333.3333333333, ans=0.0 2023-12-04 17:29:34,900 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=371333.3333333333, ans=0.1 2023-12-04 17:30:21,838 INFO [train.py:1087] (2/4) Epoch 63, batch 250, loss[loss=0.1433, simple_loss=0.2346, pruned_loss=0.02594, over 24751.00 frames. ], tot_loss[loss=0.1512, simple_loss=0.2442, pruned_loss=0.0291, over 3426398.94 frames. ], batch size: 66, lr: 3.90e-03, grad_scale: 32.0 2023-12-04 17:30:34,775 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=371666.6666666667, ans=0.95 2023-12-04 17:30:37,345 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=371666.6666666667, ans=0.125 2023-12-04 17:30:45,408 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=371666.6666666667, ans=0.2 2023-12-04 17:30:55,847 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.65 vs. limit=15.0 2023-12-04 17:31:10,773 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.92 vs. limit=22.5 2023-12-04 17:31:30,988 INFO [train.py:1087] (2/4) Epoch 63, batch 300, loss[loss=0.1452, simple_loss=0.2354, pruned_loss=0.02746, over 24775.00 frames. ], tot_loss[loss=0.1507, simple_loss=0.244, pruned_loss=0.02869, over 3743062.65 frames. ], batch size: 64, lr: 3.90e-03, grad_scale: 32.0 2023-12-04 17:31:45,105 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.111e+02 1.246e+02 1.351e+02 1.452e+02 1.759e+02, threshold=2.702e+02, percent-clipped=0.0 2023-12-04 17:32:04,716 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=372066.6666666667, ans=0.125 2023-12-04 17:32:08,456 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.55 vs. limit=15.0 2023-12-04 17:32:12,927 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.99 vs. limit=10.0 2023-12-04 17:32:13,859 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=372133.3333333333, ans=0.95 2023-12-04 17:32:15,342 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=372133.3333333333, ans=0.125 2023-12-04 17:32:16,607 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=372133.3333333333, ans=0.125 2023-12-04 17:32:16,763 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.88 vs. 
limit=15.0 2023-12-04 17:32:17,823 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=372133.3333333333, ans=0.125 2023-12-04 17:32:33,599 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=372200.0, ans=0.125 2023-12-04 17:32:40,444 INFO [train.py:1087] (2/4) Epoch 63, batch 350, loss[loss=0.1487, simple_loss=0.2383, pruned_loss=0.02955, over 24811.00 frames. ], tot_loss[loss=0.1508, simple_loss=0.2441, pruned_loss=0.02875, over 3971144.16 frames. ], batch size: 62, lr: 3.90e-03, grad_scale: 32.0 2023-12-04 17:32:40,750 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=372266.6666666667, ans=0.0 2023-12-04 17:33:12,462 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.98 vs. limit=10.0 2023-12-04 17:33:25,335 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=372466.6666666667, ans=0.1 2023-12-04 17:33:34,206 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=372533.3333333333, ans=0.125 2023-12-04 17:33:36,809 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=372533.3333333333, ans=0.0 2023-12-04 17:33:45,148 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=372533.3333333333, ans=0.125 2023-12-04 17:33:46,781 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.57 vs. limit=22.5 2023-12-04 17:33:48,540 INFO [train.py:1087] (2/4) Epoch 63, batch 400, loss[loss=0.1578, simple_loss=0.249, pruned_loss=0.03335, over 24104.00 frames. ], tot_loss[loss=0.1514, simple_loss=0.2445, pruned_loss=0.02922, over 4151187.96 frames. ], batch size: 87, lr: 3.90e-03, grad_scale: 32.0 2023-12-04 17:33:52,656 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=372600.0, ans=0.05 2023-12-04 17:33:58,326 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=372600.0, ans=0.0 2023-12-04 17:34:02,379 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.145e+02 1.304e+02 1.404e+02 1.512e+02 1.872e+02, threshold=2.808e+02, percent-clipped=0.0 2023-12-04 17:34:18,135 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.03 vs. limit=12.0 2023-12-04 17:34:36,711 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=372800.0, ans=0.0 2023-12-04 17:34:39,830 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=372800.0, ans=0.1 2023-12-04 17:34:40,644 INFO [scaling.py:1022] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.81 vs. 
limit=5.0 2023-12-04 17:34:43,165 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=372866.6666666667, ans=0.95 2023-12-04 17:34:57,690 INFO [train.py:1087] (2/4) Epoch 63, batch 450, loss[loss=0.1569, simple_loss=0.2497, pruned_loss=0.03207, over 22972.00 frames. ], tot_loss[loss=0.1513, simple_loss=0.2444, pruned_loss=0.02907, over 4297696.11 frames. ], batch size: 106, lr: 3.89e-03, grad_scale: 32.0 2023-12-04 17:35:13,409 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=373000.0, ans=0.2 2023-12-04 17:35:25,477 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 17:35:30,146 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.74 vs. limit=10.0 2023-12-04 17:36:05,900 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=373200.0, ans=0.0 2023-12-04 17:36:08,736 INFO [train.py:1087] (2/4) Epoch 63, batch 500, loss[loss=0.141, simple_loss=0.235, pruned_loss=0.02354, over 24750.00 frames. ], tot_loss[loss=0.151, simple_loss=0.2441, pruned_loss=0.02896, over 4415319.48 frames. ], batch size: 65, lr: 3.89e-03, grad_scale: 32.0 2023-12-04 17:36:21,362 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=373333.3333333333, ans=0.0 2023-12-04 17:36:26,480 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.037e+02 1.250e+02 1.321e+02 1.428e+02 2.145e+02, threshold=2.643e+02, percent-clipped=0.0 2023-12-04 17:36:39,796 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 17:36:47,298 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=373400.0, ans=0.125 2023-12-04 17:36:47,522 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=373400.0, ans=0.125 2023-12-04 17:37:21,410 INFO [train.py:1087] (2/4) Epoch 63, batch 550, loss[loss=0.1616, simple_loss=0.2507, pruned_loss=0.03624, over 24472.00 frames. ], tot_loss[loss=0.1509, simple_loss=0.2439, pruned_loss=0.02892, over 4514820.64 frames. ], batch size: 77, lr: 3.89e-03, grad_scale: 32.0 2023-12-04 17:37:30,916 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=373600.0, ans=0.125 2023-12-04 17:37:57,655 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.55 vs. limit=6.0 2023-12-04 17:38:27,388 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-12-04 17:38:29,453 INFO [train.py:1087] (2/4) Epoch 63, batch 600, loss[loss=0.1768, simple_loss=0.259, pruned_loss=0.0473, over 17613.00 frames. ], tot_loss[loss=0.1509, simple_loss=0.2439, pruned_loss=0.02899, over 4555074.58 frames. 
], batch size: 177, lr: 3.89e-03, grad_scale: 32.0 2023-12-04 17:38:31,213 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=373933.3333333333, ans=0.125 2023-12-04 17:38:43,603 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.107e+02 1.277e+02 1.352e+02 1.461e+02 2.062e+02, threshold=2.704e+02, percent-clipped=0.0 2023-12-04 17:38:45,380 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=374000.0, ans=0.125 2023-12-04 17:38:46,721 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=374000.0, ans=0.07 2023-12-04 17:38:50,293 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=374000.0, ans=0.2 2023-12-04 17:39:01,521 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.69 vs. limit=15.0 2023-12-04 17:39:08,721 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=374133.3333333333, ans=0.0 2023-12-04 17:39:11,484 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=374133.3333333333, ans=0.125 2023-12-04 17:39:24,530 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-12-04 17:39:28,045 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=374200.0, ans=0.0 2023-12-04 17:39:30,696 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=374200.0, ans=0.125 2023-12-04 17:39:36,775 INFO [train.py:1087] (2/4) Epoch 63, batch 650, loss[loss=0.143, simple_loss=0.2424, pruned_loss=0.0218, over 24849.00 frames. ], tot_loss[loss=0.151, simple_loss=0.244, pruned_loss=0.02899, over 4628300.34 frames. ], batch size: 68, lr: 3.89e-03, grad_scale: 32.0 2023-12-04 17:39:37,564 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.49 vs. limit=22.5 2023-12-04 17:39:38,292 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=374266.6666666667, ans=0.0 2023-12-04 17:39:40,718 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.65 vs. 
limit=12.0 2023-12-04 17:39:42,760 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 17:39:47,973 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=374266.6666666667, ans=0.0 2023-12-04 17:39:55,879 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=374333.3333333333, ans=0.2 2023-12-04 17:40:00,377 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=374333.3333333333, ans=0.1 2023-12-04 17:40:06,044 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=374400.0, ans=0.0 2023-12-04 17:40:18,153 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.08 vs. limit=15.0 2023-12-04 17:40:22,715 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=374466.6666666667, ans=0.2 2023-12-04 17:40:28,613 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.29 vs. limit=22.5 2023-12-04 17:40:33,417 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=374533.3333333333, ans=0.125 2023-12-04 17:40:36,568 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=374533.3333333333, ans=0.0 2023-12-04 17:40:44,435 INFO [train.py:1087] (2/4) Epoch 63, batch 700, loss[loss=0.1569, simple_loss=0.2521, pruned_loss=0.03089, over 24172.00 frames. ], tot_loss[loss=0.151, simple_loss=0.2441, pruned_loss=0.02891, over 4657640.47 frames. ], batch size: 82, lr: 3.89e-03, grad_scale: 32.0 2023-12-04 17:40:58,397 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.094e+02 1.326e+02 1.433e+02 1.543e+02 2.132e+02, threshold=2.865e+02, percent-clipped=0.0 2023-12-04 17:41:26,989 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=374800.0, ans=0.125 2023-12-04 17:41:36,608 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=374800.0, ans=0.0 2023-12-04 17:41:39,055 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=374866.6666666667, ans=0.025 2023-12-04 17:41:52,553 INFO [train.py:1087] (2/4) Epoch 63, batch 750, loss[loss=0.1566, simple_loss=0.2486, pruned_loss=0.03225, over 23715.00 frames. ], tot_loss[loss=0.1514, simple_loss=0.2444, pruned_loss=0.02914, over 4673460.98 frames. ], batch size: 57, lr: 3.88e-03, grad_scale: 32.0 2023-12-04 17:41:54,624 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. 
limit=6.0 2023-12-04 17:42:31,390 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=375133.3333333333, ans=0.0 2023-12-04 17:42:35,394 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=375133.3333333333, ans=0.0 2023-12-04 17:42:59,296 INFO [train.py:1087] (2/4) Epoch 63, batch 800, loss[loss=0.1766, simple_loss=0.2624, pruned_loss=0.04537, over 17250.00 frames. ], tot_loss[loss=0.1514, simple_loss=0.2445, pruned_loss=0.02918, over 4699896.56 frames. ], batch size: 177, lr: 3.88e-03, grad_scale: 32.0 2023-12-04 17:43:11,286 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=375333.3333333333, ans=0.2 2023-12-04 17:43:13,168 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.147e+02 1.287e+02 1.371e+02 1.480e+02 1.982e+02, threshold=2.741e+02, percent-clipped=0.0 2023-12-04 17:43:34,404 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=375400.0, ans=0.07 2023-12-04 17:43:45,910 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=375466.6666666667, ans=0.125 2023-12-04 17:43:51,284 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.07 vs. limit=22.5 2023-12-04 17:44:00,215 INFO [train.py:1087] (2/4) Epoch 63, batch 850, loss[loss=0.1643, simple_loss=0.2534, pruned_loss=0.03762, over 24508.00 frames. ], tot_loss[loss=0.1513, simple_loss=0.2443, pruned_loss=0.02913, over 4735114.27 frames. ], batch size: 75, lr: 3.88e-03, grad_scale: 16.0 2023-12-04 17:44:22,066 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.05 vs. limit=15.0 2023-12-04 17:44:44,893 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=375800.0, ans=0.0 2023-12-04 17:44:48,907 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=375866.6666666667, ans=0.035 2023-12-04 17:45:07,820 INFO [train.py:1087] (2/4) Epoch 64, batch 0, loss[loss=0.1384, simple_loss=0.2406, pruned_loss=0.01807, over 24703.00 frames. ], tot_loss[loss=0.1384, simple_loss=0.2406, pruned_loss=0.01807, over 24703.00 frames. ], batch size: 74, lr: 3.85e-03, grad_scale: 32.0 2023-12-04 17:45:07,821 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 17:45:23,276 INFO [train.py:1119] (2/4) Epoch 64, validation: loss=0.1503, simple_loss=0.2474, pruned_loss=0.02664, over 944034.00 frames. 2023-12-04 17:45:23,277 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 17:45:25,194 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.62 vs. limit=15.0 2023-12-04 17:45:28,534 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=375900.0, ans=0.0 2023-12-04 17:45:37,507 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.91 vs. 
limit=10.0 2023-12-04 17:45:43,831 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=375966.6666666667, ans=0.0 2023-12-04 17:45:44,661 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.120e+02 1.332e+02 1.453e+02 1.597e+02 2.099e+02, threshold=2.906e+02, percent-clipped=0.0 2023-12-04 17:46:03,386 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=376100.0, ans=0.2 2023-12-04 17:46:06,500 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.54 vs. limit=15.0 2023-12-04 17:46:30,837 INFO [train.py:1087] (2/4) Epoch 64, batch 50, loss[loss=0.1734, simple_loss=0.2643, pruned_loss=0.04122, over 24504.00 frames. ], tot_loss[loss=0.1515, simple_loss=0.2449, pruned_loss=0.02903, over 1079090.87 frames. ], batch size: 75, lr: 3.85e-03, grad_scale: 32.0 2023-12-04 17:46:41,384 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=376233.3333333333, ans=0.125 2023-12-04 17:46:55,463 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=376366.6666666667, ans=0.0 2023-12-04 17:47:04,878 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.07 vs. limit=15.0 2023-12-04 17:47:36,059 INFO [train.py:1087] (2/4) Epoch 64, batch 100, loss[loss=0.1768, simple_loss=0.2608, pruned_loss=0.04637, over 16947.00 frames. ], tot_loss[loss=0.1513, simple_loss=0.2445, pruned_loss=0.0291, over 1911593.60 frames. ], batch size: 177, lr: 3.85e-03, grad_scale: 32.0 2023-12-04 17:47:41,432 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=376566.6666666667, ans=0.125 2023-12-04 17:47:57,751 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.095e+02 1.310e+02 1.417e+02 1.571e+02 2.494e+02, threshold=2.835e+02, percent-clipped=0.0 2023-12-04 17:48:01,393 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.73 vs. limit=10.0 2023-12-04 17:48:14,577 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=376766.6666666667, ans=0.0 2023-12-04 17:48:19,298 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=376766.6666666667, ans=0.05 2023-12-04 17:48:21,807 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=376766.6666666667, ans=0.125 2023-12-04 17:48:42,167 INFO [train.py:1087] (2/4) Epoch 64, batch 150, loss[loss=0.1475, simple_loss=0.2403, pruned_loss=0.02738, over 24567.00 frames. ], tot_loss[loss=0.1516, simple_loss=0.2446, pruned_loss=0.02926, over 2555261.93 frames. ], batch size: 66, lr: 3.84e-03, grad_scale: 16.0 2023-12-04 17:48:42,834 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.95 vs. limit=22.5 2023-12-04 17:48:42,872 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.22 vs. 
limit=15.0 2023-12-04 17:49:00,480 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=376966.6666666667, ans=0.1 2023-12-04 17:49:06,286 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=376966.6666666667, ans=0.2 2023-12-04 17:49:12,730 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-04 17:49:16,978 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.45 vs. limit=15.0 2023-12-04 17:49:17,574 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=377033.3333333333, ans=0.125 2023-12-04 17:49:50,768 INFO [train.py:1087] (2/4) Epoch 64, batch 200, loss[loss=0.1382, simple_loss=0.2301, pruned_loss=0.02316, over 24788.00 frames. ], tot_loss[loss=0.1514, simple_loss=0.2443, pruned_loss=0.02919, over 3063651.48 frames. ], batch size: 70, lr: 3.84e-03, grad_scale: 16.0 2023-12-04 17:49:57,567 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=377233.3333333333, ans=0.125 2023-12-04 17:50:13,823 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.134e+02 1.291e+02 1.424e+02 1.631e+02 2.278e+02, threshold=2.847e+02, percent-clipped=0.0 2023-12-04 17:50:21,063 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.88 vs. limit=15.0 2023-12-04 17:50:57,983 INFO [train.py:1087] (2/4) Epoch 64, batch 250, loss[loss=0.1578, simple_loss=0.2523, pruned_loss=0.03162, over 24312.00 frames. ], tot_loss[loss=0.1513, simple_loss=0.2442, pruned_loss=0.0292, over 3453104.76 frames. ], batch size: 79, lr: 3.84e-03, grad_scale: 16.0 2023-12-04 17:51:09,863 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.70 vs. limit=15.0 2023-12-04 17:51:21,022 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-04 17:51:21,042 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=377633.3333333333, ans=0.125 2023-12-04 17:51:21,466 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.03 vs. limit=15.0 2023-12-04 17:51:34,105 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.02 vs. limit=10.0 2023-12-04 17:51:39,107 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-12-04 17:51:50,943 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=377833.3333333333, ans=0.1 2023-12-04 17:52:05,763 INFO [train.py:1087] (2/4) Epoch 64, batch 300, loss[loss=0.1572, simple_loss=0.2507, pruned_loss=0.03187, over 24786.00 frames. ], tot_loss[loss=0.1514, simple_loss=0.2442, pruned_loss=0.02927, over 3746551.46 frames. 
], batch size: 62, lr: 3.84e-03, grad_scale: 16.0 2023-12-04 17:52:28,302 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.095e+02 1.270e+02 1.343e+02 1.484e+02 1.945e+02, threshold=2.686e+02, percent-clipped=0.0 2023-12-04 17:53:11,554 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.57 vs. limit=15.0 2023-12-04 17:53:13,685 INFO [train.py:1087] (2/4) Epoch 64, batch 350, loss[loss=0.154, simple_loss=0.2483, pruned_loss=0.02979, over 24162.00 frames. ], tot_loss[loss=0.1514, simple_loss=0.2443, pruned_loss=0.0292, over 3976296.76 frames. ], batch size: 82, lr: 3.84e-03, grad_scale: 16.0 2023-12-04 17:53:14,002 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=378233.3333333333, ans=0.125 2023-12-04 17:53:16,372 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=378233.3333333333, ans=0.0 2023-12-04 17:53:36,001 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378300.0, ans=0.1 2023-12-04 17:53:48,382 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=378366.6666666667, ans=0.1 2023-12-04 17:53:51,497 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-12-04 17:53:58,837 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378433.3333333333, ans=0.1 2023-12-04 17:54:12,542 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.30 vs. limit=15.0 2023-12-04 17:54:18,491 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=378500.0, ans=0.04949747468305833 2023-12-04 17:54:22,728 INFO [train.py:1087] (2/4) Epoch 64, batch 400, loss[loss=0.1534, simple_loss=0.2453, pruned_loss=0.0308, over 24359.00 frames. ], tot_loss[loss=0.1512, simple_loss=0.2442, pruned_loss=0.02911, over 4166491.62 frames. ], batch size: 79, lr: 3.84e-03, grad_scale: 16.0 2023-12-04 17:54:47,356 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.121e+02 1.264e+02 1.356e+02 1.518e+02 1.714e+02, threshold=2.711e+02, percent-clipped=0.0 2023-12-04 17:54:51,812 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=15.0 2023-12-04 17:54:56,442 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=378700.0, ans=0.0 2023-12-04 17:55:06,212 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.66 vs. limit=15.0 2023-12-04 17:55:29,797 INFO [train.py:1087] (2/4) Epoch 64, batch 450, loss[loss=0.1401, simple_loss=0.2345, pruned_loss=0.02283, over 24758.00 frames. ], tot_loss[loss=0.1509, simple_loss=0.2439, pruned_loss=0.02894, over 4322311.22 frames. 
], batch size: 70, lr: 3.83e-03, grad_scale: 16.0 2023-12-04 17:55:40,734 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378900.0, ans=0.1 2023-12-04 17:55:50,407 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378966.6666666667, ans=0.1 2023-12-04 17:55:50,423 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=378966.6666666667, ans=0.125 2023-12-04 17:55:50,806 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.83 vs. limit=22.5 2023-12-04 17:56:05,887 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=379033.3333333333, ans=0.125 2023-12-04 17:56:13,900 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=379100.0, ans=0.07 2023-12-04 17:56:38,622 INFO [train.py:1087] (2/4) Epoch 64, batch 500, loss[loss=0.1605, simple_loss=0.2576, pruned_loss=0.03173, over 23398.00 frames. ], tot_loss[loss=0.1509, simple_loss=0.244, pruned_loss=0.02894, over 4441744.94 frames. ], batch size: 94, lr: 3.83e-03, grad_scale: 16.0 2023-12-04 17:56:40,480 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=379233.3333333333, ans=0.0 2023-12-04 17:56:40,761 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.51 vs. limit=15.0 2023-12-04 17:56:47,025 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.22 vs. limit=15.0 2023-12-04 17:57:01,421 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.261e+02 1.354e+02 1.492e+02 2.080e+02, threshold=2.707e+02, percent-clipped=0.0 2023-12-04 17:57:07,142 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=379366.6666666667, ans=0.95 2023-12-04 17:57:08,889 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=379366.6666666667, ans=0.0 2023-12-04 17:57:08,895 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=379366.6666666667, ans=0.125 2023-12-04 17:57:24,681 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=379433.3333333333, ans=0.0 2023-12-04 17:57:37,680 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=379500.0, ans=0.0 2023-12-04 17:57:45,062 INFO [train.py:1087] (2/4) Epoch 64, batch 550, loss[loss=0.1384, simple_loss=0.2333, pruned_loss=0.02176, over 24677.00 frames. ], tot_loss[loss=0.1514, simple_loss=0.2445, pruned_loss=0.02919, over 4514354.54 frames. 
], batch size: 74, lr: 3.83e-03, grad_scale: 16.0 2023-12-04 17:57:55,307 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=379566.6666666667, ans=0.2 2023-12-04 17:57:57,919 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=379566.6666666667, ans=0.0 2023-12-04 17:58:21,163 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=379700.0, ans=0.0 2023-12-04 17:58:21,328 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 17:58:44,871 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=379833.3333333333, ans=0.125 2023-12-04 17:58:54,876 INFO [train.py:1087] (2/4) Epoch 64, batch 600, loss[loss=0.1527, simple_loss=0.2429, pruned_loss=0.03128, over 24728.00 frames. ], tot_loss[loss=0.1511, simple_loss=0.2442, pruned_loss=0.029, over 4583295.70 frames. ], batch size: 67, lr: 3.83e-03, grad_scale: 16.0 2023-12-04 17:58:56,591 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=379900.0, ans=0.1 2023-12-04 17:59:10,332 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=379966.6666666667, ans=0.0 2023-12-04 17:59:20,036 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.164e+02 1.313e+02 1.387e+02 1.529e+02 2.312e+02, threshold=2.774e+02, percent-clipped=0.0 2023-12-04 17:59:31,032 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=380033.3333333333, ans=0.5 2023-12-04 17:59:49,131 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=380166.6666666667, ans=0.0 2023-12-04 18:00:00,800 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.06 vs. limit=22.5 2023-12-04 18:00:03,409 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.25 vs. limit=15.0 2023-12-04 18:00:04,029 INFO [train.py:1087] (2/4) Epoch 64, batch 650, loss[loss=0.1539, simple_loss=0.2496, pruned_loss=0.02909, over 24801.00 frames. ], tot_loss[loss=0.1511, simple_loss=0.2442, pruned_loss=0.02903, over 4618717.32 frames. ], batch size: 73, lr: 3.83e-03, grad_scale: 16.0 2023-12-04 18:00:04,523 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=380233.3333333333, ans=0.5 2023-12-04 18:00:14,630 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=380233.3333333333, ans=0.1 2023-12-04 18:00:51,264 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=380433.3333333333, ans=0.125 2023-12-04 18:01:01,125 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.74 vs. 
limit=15.0 2023-12-04 18:01:02,480 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=4.95 vs. limit=15.0 2023-12-04 18:01:05,012 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=380500.0, ans=0.125 2023-12-04 18:01:13,767 INFO [train.py:1087] (2/4) Epoch 64, batch 700, loss[loss=0.157, simple_loss=0.2488, pruned_loss=0.03259, over 24565.00 frames. ], tot_loss[loss=0.1508, simple_loss=0.2439, pruned_loss=0.02888, over 4670722.77 frames. ], batch size: 64, lr: 3.83e-03, grad_scale: 16.0 2023-12-04 18:01:27,486 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=380633.3333333333, ans=0.0 2023-12-04 18:01:31,483 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 18:01:35,749 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.16 vs. limit=22.5 2023-12-04 18:01:37,723 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.134e+02 1.273e+02 1.345e+02 1.468e+02 2.516e+02, threshold=2.691e+02, percent-clipped=0.0 2023-12-04 18:01:38,105 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=380633.3333333333, ans=0.2 2023-12-04 18:01:56,580 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=380766.6666666667, ans=0.125 2023-12-04 18:02:18,075 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=380833.3333333333, ans=0.125 2023-12-04 18:02:22,231 INFO [train.py:1087] (2/4) Epoch 64, batch 750, loss[loss=0.1461, simple_loss=0.239, pruned_loss=0.02658, over 24776.00 frames. ], tot_loss[loss=0.1511, simple_loss=0.2442, pruned_loss=0.02899, over 4710497.60 frames. ], batch size: 70, lr: 3.82e-03, grad_scale: 16.0 2023-12-04 18:02:36,870 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=380966.6666666667, ans=0.07 2023-12-04 18:02:44,926 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=380966.6666666667, ans=0.07 2023-12-04 18:02:55,506 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=381033.3333333333, ans=0.04949747468305833 2023-12-04 18:02:57,809 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=381033.3333333333, ans=0.125 2023-12-04 18:02:59,343 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=381033.3333333333, ans=0.5 2023-12-04 18:02:59,418 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=381033.3333333333, ans=0.125 2023-12-04 18:03:14,993 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=381100.0, ans=0.2 2023-12-04 18:03:30,771 INFO [train.py:1087] (2/4) Epoch 64, batch 800, loss[loss=0.15, simple_loss=0.239, pruned_loss=0.03056, over 24469.00 frames. 
], tot_loss[loss=0.1507, simple_loss=0.2438, pruned_loss=0.0288, over 4731886.94 frames. ], batch size: 77, lr: 3.82e-03, grad_scale: 32.0 2023-12-04 18:03:36,344 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=381233.3333333333, ans=0.125 2023-12-04 18:03:41,725 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-12-04 18:03:49,060 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.37 vs. limit=15.0 2023-12-04 18:03:54,546 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.26 vs. limit=12.0 2023-12-04 18:03:55,034 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.099e+02 1.286e+02 1.344e+02 1.464e+02 1.844e+02, threshold=2.688e+02, percent-clipped=0.0 2023-12-04 18:04:06,060 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=381366.6666666667, ans=0.125 2023-12-04 18:04:19,628 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=381433.3333333333, ans=0.0 2023-12-04 18:04:29,194 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.76 vs. limit=15.0 2023-12-04 18:04:34,010 INFO [train.py:1087] (2/4) Epoch 64, batch 850, loss[loss=0.1459, simple_loss=0.2367, pruned_loss=0.02758, over 24753.00 frames. ], tot_loss[loss=0.1507, simple_loss=0.2439, pruned_loss=0.02878, over 4761727.69 frames. ], batch size: 66, lr: 3.82e-03, grad_scale: 32.0 2023-12-04 18:04:43,131 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.98 vs. limit=22.5 2023-12-04 18:04:46,059 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=381633.3333333333, ans=0.1 2023-12-04 18:05:46,382 INFO [train.py:1087] (2/4) Epoch 65, batch 0, loss[loss=0.1372, simple_loss=0.2325, pruned_loss=0.02091, over 24568.00 frames. ], tot_loss[loss=0.1372, simple_loss=0.2325, pruned_loss=0.02091, over 24568.00 frames. ], batch size: 65, lr: 3.79e-03, grad_scale: 32.0 2023-12-04 18:05:46,383 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 18:06:02,513 INFO [train.py:1119] (2/4) Epoch 65, validation: loss=0.1513, simple_loss=0.2479, pruned_loss=0.02732, over 944034.00 frames. 2023-12-04 18:06:02,514 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 18:06:34,377 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.289e+02 1.379e+02 1.555e+02 2.140e+02, threshold=2.758e+02, percent-clipped=0.0 2023-12-04 18:06:40,119 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=382000.0, ans=0.125 2023-12-04 18:06:45,193 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=382066.6666666667, ans=0.125 2023-12-04 18:07:11,780 INFO [train.py:1087] (2/4) Epoch 65, batch 50, loss[loss=0.1522, simple_loss=0.2453, pruned_loss=0.02952, over 23375.00 frames. 
], tot_loss[loss=0.1532, simple_loss=0.2454, pruned_loss=0.03048, over 1093973.77 frames. ], batch size: 94, lr: 3.79e-03, grad_scale: 32.0 2023-12-04 18:07:21,272 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 18:07:22,377 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=382200.0, ans=0.0 2023-12-04 18:07:39,735 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=382333.3333333333, ans=0.035 2023-12-04 18:07:51,065 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=382400.0, ans=0.0 2023-12-04 18:08:18,640 INFO [train.py:1087] (2/4) Epoch 65, batch 100, loss[loss=0.1567, simple_loss=0.2491, pruned_loss=0.03217, over 24538.00 frames. ], tot_loss[loss=0.1525, simple_loss=0.245, pruned_loss=0.03001, over 1894993.75 frames. ], batch size: 62, lr: 3.79e-03, grad_scale: 32.0 2023-12-04 18:08:19,040 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=382533.3333333333, ans=0.125 2023-12-04 18:08:21,614 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=382533.3333333333, ans=0.5 2023-12-04 18:08:22,801 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=382533.3333333333, ans=0.1 2023-12-04 18:08:49,047 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.114e+02 1.282e+02 1.383e+02 1.510e+02 1.893e+02, threshold=2.766e+02, percent-clipped=0.0 2023-12-04 18:08:52,001 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=382666.6666666667, ans=0.0 2023-12-04 18:09:06,137 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=382733.3333333333, ans=0.09899494936611666 2023-12-04 18:09:10,747 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=382733.3333333333, ans=10.0 2023-12-04 18:09:14,890 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.41 vs. limit=15.0 2023-12-04 18:09:18,602 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.24 vs. limit=12.0 2023-12-04 18:09:20,757 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=382800.0, ans=0.0 2023-12-04 18:09:25,499 INFO [train.py:1087] (2/4) Epoch 65, batch 150, loss[loss=0.1619, simple_loss=0.2541, pruned_loss=0.03483, over 24008.00 frames. ], tot_loss[loss=0.1517, simple_loss=0.2446, pruned_loss=0.02941, over 2542525.72 frames. 
], batch size: 87, lr: 3.78e-03, grad_scale: 32.0 2023-12-04 18:09:25,872 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=382866.6666666667, ans=0.0 2023-12-04 18:09:45,823 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=382933.3333333333, ans=0.1 2023-12-04 18:09:52,936 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=383000.0, ans=0.0 2023-12-04 18:10:16,734 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=12.0 2023-12-04 18:10:18,697 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=383133.3333333333, ans=0.2 2023-12-04 18:10:29,904 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=383133.3333333333, ans=0.125 2023-12-04 18:10:33,462 INFO [train.py:1087] (2/4) Epoch 65, batch 200, loss[loss=0.1458, simple_loss=0.239, pruned_loss=0.02627, over 24791.00 frames. ], tot_loss[loss=0.1514, simple_loss=0.2446, pruned_loss=0.02907, over 3059699.77 frames. ], batch size: 73, lr: 3.78e-03, grad_scale: 32.0 2023-12-04 18:10:38,938 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=383200.0, ans=0.2 2023-12-04 18:10:54,861 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=383266.6666666667, ans=0.125 2023-12-04 18:10:58,995 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=383266.6666666667, ans=0.0 2023-12-04 18:11:00,375 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=383333.3333333333, ans=0.0 2023-12-04 18:11:05,116 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.093e+02 1.275e+02 1.374e+02 1.477e+02 1.891e+02, threshold=2.748e+02, percent-clipped=0.0 2023-12-04 18:11:15,566 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=383400.0, ans=0.5 2023-12-04 18:11:36,427 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=383466.6666666667, ans=0.0 2023-12-04 18:11:42,657 INFO [train.py:1087] (2/4) Epoch 65, batch 250, loss[loss=0.1418, simple_loss=0.2371, pruned_loss=0.02326, over 24787.00 frames. ], tot_loss[loss=0.1507, simple_loss=0.244, pruned_loss=0.02867, over 3454044.09 frames. 
], batch size: 72, lr: 3.78e-03, grad_scale: 32.0 2023-12-04 18:11:45,657 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=383533.3333333333, ans=0.5 2023-12-04 18:11:52,109 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=383533.3333333333, ans=0.125 2023-12-04 18:11:53,773 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=383533.3333333333, ans=0.125 2023-12-04 18:12:02,306 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.34 vs. limit=22.5 2023-12-04 18:12:17,040 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=383666.6666666667, ans=0.1 2023-12-04 18:12:21,097 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=383666.6666666667, ans=0.125 2023-12-04 18:12:26,298 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=383733.3333333333, ans=0.95 2023-12-04 18:12:44,962 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=383800.0, ans=0.125 2023-12-04 18:12:47,086 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.76 vs. limit=15.0 2023-12-04 18:12:50,917 INFO [train.py:1087] (2/4) Epoch 65, batch 300, loss[loss=0.1539, simple_loss=0.2472, pruned_loss=0.03037, over 24742.00 frames. ], tot_loss[loss=0.1505, simple_loss=0.2438, pruned_loss=0.02866, over 3747913.50 frames. ], batch size: 63, lr: 3.78e-03, grad_scale: 32.0 2023-12-04 18:12:51,955 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=22.5 2023-12-04 18:13:03,102 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=383933.3333333333, ans=0.1 2023-12-04 18:13:04,314 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=383933.3333333333, ans=0.0 2023-12-04 18:13:08,149 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=383933.3333333333, ans=0.125 2023-12-04 18:13:10,195 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.22 vs. limit=15.0 2023-12-04 18:13:14,113 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.76 vs. 
limit=15.0 2023-12-04 18:13:22,448 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.084e+02 1.249e+02 1.342e+02 1.427e+02 2.138e+02, threshold=2.684e+02, percent-clipped=0.0 2023-12-04 18:13:30,918 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=384000.0, ans=0.04949747468305833 2023-12-04 18:13:42,653 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 18:13:59,751 INFO [train.py:1087] (2/4) Epoch 65, batch 350, loss[loss=0.1513, simple_loss=0.2407, pruned_loss=0.03094, over 24803.00 frames. ], tot_loss[loss=0.1509, simple_loss=0.244, pruned_loss=0.02888, over 3985000.81 frames. ], batch size: 72, lr: 3.78e-03, grad_scale: 32.0 2023-12-04 18:15:02,231 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=384466.6666666667, ans=0.0 2023-12-04 18:15:03,695 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 18:15:06,446 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=384466.6666666667, ans=0.0 2023-12-04 18:15:08,614 INFO [train.py:1087] (2/4) Epoch 65, batch 400, loss[loss=0.1693, simple_loss=0.2559, pruned_loss=0.04135, over 23995.00 frames. ], tot_loss[loss=0.151, simple_loss=0.2441, pruned_loss=0.02901, over 4170067.13 frames. ], batch size: 87, lr: 3.78e-03, grad_scale: 32.0 2023-12-04 18:15:40,241 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=384666.6666666667, ans=10.0 2023-12-04 18:15:41,151 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.068e+02 1.269e+02 1.375e+02 1.530e+02 1.934e+02, threshold=2.749e+02, percent-clipped=0.0 2023-12-04 18:15:42,809 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 18:16:13,381 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=384800.0, ans=0.125 2023-12-04 18:16:18,091 INFO [train.py:1087] (2/4) Epoch 65, batch 450, loss[loss=0.1625, simple_loss=0.2532, pruned_loss=0.03589, over 23883.00 frames. ], tot_loss[loss=0.151, simple_loss=0.2441, pruned_loss=0.02896, over 4303506.51 frames. ], batch size: 87, lr: 3.77e-03, grad_scale: 32.0 2023-12-04 18:16:19,723 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=384866.6666666667, ans=0.125 2023-12-04 18:16:28,731 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=384866.6666666667, ans=0.0 2023-12-04 18:16:31,380 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=384933.3333333333, ans=0.125 2023-12-04 18:16:48,877 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=385000.0, ans=0.1 2023-12-04 18:16:49,258 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.28 vs. 
limit=22.5 2023-12-04 18:16:50,159 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=385000.0, ans=0.125 2023-12-04 18:17:14,141 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=385133.3333333333, ans=0.0 2023-12-04 18:17:26,527 INFO [train.py:1087] (2/4) Epoch 65, batch 500, loss[loss=0.1419, simple_loss=0.2366, pruned_loss=0.02365, over 24712.00 frames. ], tot_loss[loss=0.1511, simple_loss=0.2442, pruned_loss=0.02901, over 4420681.64 frames. ], batch size: 69, lr: 3.77e-03, grad_scale: 32.0 2023-12-04 18:17:41,281 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=385266.6666666667, ans=0.125 2023-12-04 18:17:46,564 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=385266.6666666667, ans=0.1 2023-12-04 18:17:56,125 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=385333.3333333333, ans=0.125 2023-12-04 18:17:57,021 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.140e+02 1.262e+02 1.332e+02 1.433e+02 2.134e+02, threshold=2.663e+02, percent-clipped=0.0 2023-12-04 18:18:07,166 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=385400.0, ans=0.0 2023-12-04 18:18:14,873 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=385400.0, ans=0.125 2023-12-04 18:18:26,533 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 18:18:34,271 INFO [train.py:1087] (2/4) Epoch 65, batch 550, loss[loss=0.1431, simple_loss=0.239, pruned_loss=0.0236, over 24548.00 frames. ], tot_loss[loss=0.151, simple_loss=0.244, pruned_loss=0.029, over 4524916.06 frames. ], batch size: 66, lr: 3.77e-03, grad_scale: 32.0 2023-12-04 18:18:40,243 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.45 vs. limit=15.0 2023-12-04 18:18:51,141 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.18 vs. limit=15.0 2023-12-04 18:19:09,111 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=385666.6666666667, ans=0.0 2023-12-04 18:19:10,531 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=385666.6666666667, ans=0.0 2023-12-04 18:19:10,886 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.73 vs. 
limit=22.5 2023-12-04 18:19:28,504 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=385733.3333333333, ans=0.5 2023-12-04 18:19:29,865 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=385800.0, ans=0.2 2023-12-04 18:19:42,807 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=385866.6666666667, ans=0.125 2023-12-04 18:19:43,713 INFO [train.py:1087] (2/4) Epoch 65, batch 600, loss[loss=0.1534, simple_loss=0.2449, pruned_loss=0.03093, over 24787.00 frames. ], tot_loss[loss=0.1506, simple_loss=0.2437, pruned_loss=0.02876, over 4590147.39 frames. ], batch size: 62, lr: 3.77e-03, grad_scale: 32.0 2023-12-04 18:19:44,694 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0 2023-12-04 18:19:50,195 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-12-04 18:20:17,312 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.122e+02 1.252e+02 1.364e+02 1.482e+02 2.086e+02, threshold=2.729e+02, percent-clipped=0.0 2023-12-04 18:20:24,215 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=386000.0, ans=0.125 2023-12-04 18:20:45,665 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=386133.3333333333, ans=0.2 2023-12-04 18:20:54,497 INFO [train.py:1087] (2/4) Epoch 65, batch 650, loss[loss=0.151, simple_loss=0.2436, pruned_loss=0.02915, over 24734.00 frames. ], tot_loss[loss=0.1508, simple_loss=0.2438, pruned_loss=0.02885, over 4623524.85 frames. ], batch size: 63, lr: 3.77e-03, grad_scale: 32.0 2023-12-04 18:20:59,342 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.71 vs. limit=15.0 2023-12-04 18:21:09,365 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=386266.6666666667, ans=0.0 2023-12-04 18:21:14,802 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.23 vs. limit=6.0 2023-12-04 18:21:22,048 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.58 vs. limit=15.0 2023-12-04 18:21:41,875 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=386400.0, ans=0.125 2023-12-04 18:21:55,225 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=386466.6666666667, ans=0.125 2023-12-04 18:21:56,170 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=386466.6666666667, ans=0.125 2023-12-04 18:22:04,474 INFO [train.py:1087] (2/4) Epoch 65, batch 700, loss[loss=0.1558, simple_loss=0.2504, pruned_loss=0.03063, over 24094.00 frames. ], tot_loss[loss=0.1508, simple_loss=0.2439, pruned_loss=0.02883, over 4669929.05 frames. 
], batch size: 58, lr: 3.77e-03, grad_scale: 32.0 2023-12-04 18:22:34,320 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=386666.6666666667, ans=0.0 2023-12-04 18:22:36,824 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.292e+02 1.390e+02 1.496e+02 2.025e+02, threshold=2.779e+02, percent-clipped=0.0 2023-12-04 18:22:38,733 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=386666.6666666667, ans=0.125 2023-12-04 18:22:55,098 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=386733.3333333333, ans=0.0 2023-12-04 18:23:14,517 INFO [train.py:1087] (2/4) Epoch 65, batch 750, loss[loss=0.1441, simple_loss=0.2365, pruned_loss=0.02585, over 24779.00 frames. ], tot_loss[loss=0.1511, simple_loss=0.244, pruned_loss=0.0291, over 4678837.39 frames. ], batch size: 71, lr: 3.76e-03, grad_scale: 32.0 2023-12-04 18:23:17,033 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=386866.6666666667, ans=0.125 2023-12-04 18:23:21,146 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.77 vs. limit=10.0 2023-12-04 18:23:24,789 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=386866.6666666667, ans=0.0 2023-12-04 18:23:32,870 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=386933.3333333333, ans=0.125 2023-12-04 18:23:48,707 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=387000.0, ans=0.0 2023-12-04 18:23:58,851 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=387066.6666666667, ans=0.125 2023-12-04 18:24:22,879 INFO [train.py:1087] (2/4) Epoch 65, batch 800, loss[loss=0.1564, simple_loss=0.2511, pruned_loss=0.03089, over 24452.00 frames. ], tot_loss[loss=0.151, simple_loss=0.2439, pruned_loss=0.02903, over 4702968.74 frames. ], batch size: 77, lr: 3.76e-03, grad_scale: 32.0 2023-12-04 18:24:32,009 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 18:24:52,393 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.108e+02 1.265e+02 1.343e+02 1.484e+02 1.892e+02, threshold=2.686e+02, percent-clipped=0.0 2023-12-04 18:25:01,443 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.60 vs. limit=22.5 2023-12-04 18:25:04,343 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=387400.0, ans=0.125 2023-12-04 18:25:05,508 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=387400.0, ans=0.125 2023-12-04 18:25:24,788 INFO [train.py:1087] (2/4) Epoch 65, batch 850, loss[loss=0.1412, simple_loss=0.2332, pruned_loss=0.02461, over 24764.00 frames. ], tot_loss[loss=0.1506, simple_loss=0.2437, pruned_loss=0.02874, over 4730821.16 frames. 
], batch size: 65, lr: 3.76e-03, grad_scale: 32.0 2023-12-04 18:25:26,788 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.21 vs. limit=6.0 2023-12-04 18:25:27,921 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.88 vs. limit=12.0 2023-12-04 18:25:48,670 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=387666.6666666667, ans=0.2 2023-12-04 18:25:48,856 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=387666.6666666667, ans=0.1 2023-12-04 18:25:59,051 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=387666.6666666667, ans=0.2 2023-12-04 18:26:37,637 INFO [train.py:1087] (2/4) Epoch 66, batch 0, loss[loss=0.1423, simple_loss=0.2377, pruned_loss=0.02345, over 24685.00 frames. ], tot_loss[loss=0.1423, simple_loss=0.2377, pruned_loss=0.02345, over 24685.00 frames. ], batch size: 74, lr: 3.73e-03, grad_scale: 32.0 2023-12-04 18:26:37,638 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 18:26:49,642 INFO [zipformer.py:1876] (2/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.9130, 3.2648, 3.3538, 3.0602], device='cuda:2') 2023-12-04 18:26:53,165 INFO [train.py:1119] (2/4) Epoch 66, validation: loss=0.1505, simple_loss=0.2474, pruned_loss=0.02677, over 944034.00 frames. 2023-12-04 18:26:53,167 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 18:26:57,495 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=387833.3333333333, ans=0.1 2023-12-04 18:27:07,703 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=387900.0, ans=0.125 2023-12-04 18:27:13,562 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=387900.0, ans=0.0 2023-12-04 18:27:32,104 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.103e+02 1.283e+02 1.412e+02 1.512e+02 1.954e+02, threshold=2.825e+02, percent-clipped=0.0 2023-12-04 18:27:38,964 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=388033.3333333333, ans=0.0 2023-12-04 18:27:50,389 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=388100.0, ans=0.04949747468305833 2023-12-04 18:28:03,093 INFO [train.py:1087] (2/4) Epoch 66, batch 50, loss[loss=0.141, simple_loss=0.2299, pruned_loss=0.02606, over 24554.00 frames. ], tot_loss[loss=0.1493, simple_loss=0.2426, pruned_loss=0.02802, over 1097801.57 frames. ], batch size: 66, lr: 3.73e-03, grad_scale: 32.0 2023-12-04 18:28:13,887 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=388166.6666666667, ans=0.125 2023-12-04 18:28:20,853 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.45 vs. 
limit=15.0 2023-12-04 18:28:41,806 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=388300.0, ans=0.125 2023-12-04 18:28:43,076 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=388366.6666666667, ans=0.0 2023-12-04 18:28:43,143 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=388366.6666666667, ans=0.125 2023-12-04 18:29:10,084 INFO [train.py:1087] (2/4) Epoch 66, batch 100, loss[loss=0.158, simple_loss=0.2501, pruned_loss=0.03295, over 24509.00 frames. ], tot_loss[loss=0.1502, simple_loss=0.2437, pruned_loss=0.02838, over 1927104.24 frames. ], batch size: 75, lr: 3.73e-03, grad_scale: 32.0 2023-12-04 18:29:19,308 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=388500.0, ans=0.125 2023-12-04 18:29:29,747 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=388566.6666666667, ans=0.125 2023-12-04 18:29:29,887 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=388566.6666666667, ans=0.125 2023-12-04 18:29:48,530 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.055e+02 1.248e+02 1.303e+02 1.427e+02 1.853e+02, threshold=2.607e+02, percent-clipped=0.0 2023-12-04 18:30:05,888 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.56 vs. limit=22.5 2023-12-04 18:30:19,472 INFO [train.py:1087] (2/4) Epoch 66, batch 150, loss[loss=0.1559, simple_loss=0.2482, pruned_loss=0.03177, over 21494.00 frames. ], tot_loss[loss=0.1509, simple_loss=0.2444, pruned_loss=0.02868, over 2569129.20 frames. ], batch size: 127, lr: 3.73e-03, grad_scale: 32.0 2023-12-04 18:30:19,790 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=388833.3333333333, ans=0.2 2023-12-04 18:30:22,623 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=388833.3333333333, ans=0.1 2023-12-04 18:30:38,353 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=388900.0, ans=0.0 2023-12-04 18:30:41,881 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 18:30:53,120 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=388966.6666666667, ans=0.125 2023-12-04 18:30:59,615 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=388966.6666666667, ans=0.125 2023-12-04 18:31:10,017 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=389033.3333333333, ans=0.0 2023-12-04 18:31:30,914 INFO [train.py:1087] (2/4) Epoch 66, batch 200, loss[loss=0.1481, simple_loss=0.2473, pruned_loss=0.02449, over 21409.00 frames. ], tot_loss[loss=0.1503, simple_loss=0.2437, pruned_loss=0.0284, over 3081822.38 frames. 
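Note on the Clipping_scale entries from optim.py: each one summarizes the distribution of recent gradient norms (min, 25%, median, 75%, max) together with a clipping threshold and the fraction of clipped steps. In the values logged above the threshold consistently equals clipping_scale times the median (for example 2.0 * 1.364e+02 ~ 2.729e+02), so a reasonable reading, sketched below with hypothetical class and method names rather than the actual ScaledAdam code, is that the optimizer keeps a window of recent norms and derives the threshold from their median.

from collections import deque
import torch

class GradNormClipStats:
    """Sketch of the bookkeeping behind the 'grad-norm quartiles ...,
    threshold=..., percent-clipped=...' log lines; not the real optim.py."""

    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)
        self.num_clipped = 0
        self.num_steps = 0

    def update(self, grad_norm: float) -> float:
        """Record one step's gradient norm and return the clip threshold."""
        self.norms.append(grad_norm)
        t = torch.tensor(list(self.norms))
        threshold = self.clipping_scale * t.median().item()
        self.num_steps += 1
        if grad_norm > threshold:
            self.num_clipped += 1
        return threshold

    def summary(self) -> str:
        t = torch.tensor(list(self.norms))
        q = torch.quantile(t, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])).tolist()
        pct = 100.0 * self.num_clipped / max(1, self.num_steps)
        return ("grad-norm quartiles "
                + " ".join(f"{v:.3e}" for v in q)
                + f", threshold={self.clipping_scale * q[2]:.3e}"
                + f", percent-clipped={pct:.1f}")

stats = GradNormClipStats(clipping_scale=2.0)
for g in [112.0, 125.0, 136.0, 148.0, 209.0]:
    stats.update(g)
print(stats.summary())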
], batch size: 127, lr: 3.72e-03, grad_scale: 32.0 2023-12-04 18:31:50,719 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=389233.3333333333, ans=0.0 2023-12-04 18:32:09,663 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.154e+02 1.275e+02 1.356e+02 1.462e+02 1.717e+02, threshold=2.712e+02, percent-clipped=0.0 2023-12-04 18:32:17,442 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.52 vs. limit=15.0 2023-12-04 18:32:39,993 INFO [train.py:1087] (2/4) Epoch 66, batch 250, loss[loss=0.1566, simple_loss=0.2505, pruned_loss=0.03136, over 24301.00 frames. ], tot_loss[loss=0.1506, simple_loss=0.244, pruned_loss=0.02864, over 3470830.73 frames. ], batch size: 79, lr: 3.72e-03, grad_scale: 32.0 2023-12-04 18:32:52,739 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=389566.6666666667, ans=0.125 2023-12-04 18:33:15,044 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=389633.3333333333, ans=0.1 2023-12-04 18:33:15,469 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.70 vs. limit=15.0 2023-12-04 18:33:16,149 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=389633.3333333333, ans=0.0 2023-12-04 18:33:16,923 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.45 vs. limit=15.0 2023-12-04 18:33:32,246 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=389700.0, ans=0.0 2023-12-04 18:33:38,854 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=389766.6666666667, ans=0.0 2023-12-04 18:33:48,614 INFO [train.py:1087] (2/4) Epoch 66, batch 300, loss[loss=0.1517, simple_loss=0.2385, pruned_loss=0.0325, over 24313.00 frames. ], tot_loss[loss=0.1505, simple_loss=0.2437, pruned_loss=0.02865, over 3759145.53 frames. ], batch size: 79, lr: 3.72e-03, grad_scale: 32.0 2023-12-04 18:33:57,311 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=389833.3333333333, ans=0.2 2023-12-04 18:34:01,090 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=389900.0, ans=0.1 2023-12-04 18:34:02,481 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=389900.0, ans=0.0 2023-12-04 18:34:06,995 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.94 vs. 
limit=15.0 2023-12-04 18:34:27,123 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.086e+02 1.271e+02 1.351e+02 1.464e+02 1.947e+02, threshold=2.703e+02, percent-clipped=0.0 2023-12-04 18:34:39,371 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=390033.3333333333, ans=0.125 2023-12-04 18:34:40,774 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=390033.3333333333, ans=0.0 2023-12-04 18:34:56,335 INFO [train.py:1087] (2/4) Epoch 66, batch 350, loss[loss=0.1417, simple_loss=0.2373, pruned_loss=0.02304, over 24802.00 frames. ], tot_loss[loss=0.1509, simple_loss=0.2439, pruned_loss=0.02896, over 3973537.74 frames. ], batch size: 73, lr: 3.72e-03, grad_scale: 32.0 2023-12-04 18:35:18,481 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=390233.3333333333, ans=0.0 2023-12-04 18:36:04,983 INFO [train.py:1087] (2/4) Epoch 66, batch 400, loss[loss=0.1523, simple_loss=0.2384, pruned_loss=0.03309, over 24499.00 frames. ], tot_loss[loss=0.1508, simple_loss=0.2437, pruned_loss=0.02891, over 4153204.15 frames. ], batch size: 75, lr: 3.72e-03, grad_scale: 32.0 2023-12-04 18:36:43,271 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.065e+02 1.262e+02 1.332e+02 1.446e+02 1.853e+02, threshold=2.665e+02, percent-clipped=0.0 2023-12-04 18:36:50,836 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-12-04 18:37:14,348 INFO [train.py:1087] (2/4) Epoch 66, batch 450, loss[loss=0.1515, simple_loss=0.2447, pruned_loss=0.0292, over 24499.00 frames. ], tot_loss[loss=0.1505, simple_loss=0.2434, pruned_loss=0.02875, over 4314581.18 frames. ], batch size: 75, lr: 3.72e-03, grad_scale: 32.0 2023-12-04 18:37:14,808 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=390833.3333333333, ans=0.125 2023-12-04 18:37:35,210 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=390900.0, ans=0.1 2023-12-04 18:38:06,393 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=391033.3333333333, ans=0.1 2023-12-04 18:38:24,160 INFO [train.py:1087] (2/4) Epoch 66, batch 500, loss[loss=0.1567, simple_loss=0.2514, pruned_loss=0.03096, over 24138.00 frames. ], tot_loss[loss=0.1505, simple_loss=0.2435, pruned_loss=0.02871, over 4430019.30 frames. 
], batch size: 58, lr: 3.72e-03, grad_scale: 32.0 2023-12-04 18:38:28,503 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=391166.6666666667, ans=0.1 2023-12-04 18:39:02,976 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.093e+02 1.271e+02 1.374e+02 1.457e+02 1.879e+02, threshold=2.748e+02, percent-clipped=0.0 2023-12-04 18:39:12,742 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=391366.6666666667, ans=0.1 2023-12-04 18:39:15,233 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=391366.6666666667, ans=0.125 2023-12-04 18:39:32,562 INFO [train.py:1087] (2/4) Epoch 66, batch 550, loss[loss=0.1475, simple_loss=0.2433, pruned_loss=0.02588, over 24717.00 frames. ], tot_loss[loss=0.1505, simple_loss=0.2435, pruned_loss=0.02873, over 4518071.23 frames. ], batch size: 69, lr: 3.71e-03, grad_scale: 32.0 2023-12-04 18:40:24,743 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=391700.0, ans=0.5 2023-12-04 18:40:41,249 INFO [train.py:1087] (2/4) Epoch 66, batch 600, loss[loss=0.1578, simple_loss=0.2497, pruned_loss=0.03298, over 24152.00 frames. ], tot_loss[loss=0.1505, simple_loss=0.2435, pruned_loss=0.02874, over 4581993.12 frames. ], batch size: 82, lr: 3.71e-03, grad_scale: 32.0 2023-12-04 18:40:44,388 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=391833.3333333333, ans=0.125 2023-12-04 18:41:20,747 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.105e+02 1.262e+02 1.356e+02 1.449e+02 2.093e+02, threshold=2.711e+02, percent-clipped=0.0 2023-12-04 18:41:27,932 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=392033.3333333333, ans=0.5 2023-12-04 18:41:42,394 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=392100.0, ans=0.1 2023-12-04 18:41:51,267 INFO [train.py:1087] (2/4) Epoch 66, batch 650, loss[loss=0.1534, simple_loss=0.2474, pruned_loss=0.0297, over 24041.00 frames. ], tot_loss[loss=0.1503, simple_loss=0.2433, pruned_loss=0.02864, over 4631110.38 frames. ], batch size: 87, lr: 3.71e-03, grad_scale: 64.0 2023-12-04 18:42:21,061 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=392300.0, ans=0.125 2023-12-04 18:42:35,643 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=392366.6666666667, ans=0.0 2023-12-04 18:42:55,759 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.19 vs. limit=15.0 2023-12-04 18:43:02,095 INFO [train.py:1087] (2/4) Epoch 66, batch 700, loss[loss=0.1554, simple_loss=0.2511, pruned_loss=0.02983, over 22971.00 frames. ], tot_loss[loss=0.1504, simple_loss=0.2436, pruned_loss=0.02861, over 4674136.76 frames. 
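Note on the Whitening entries: each reports a per-module statistic ("metric") against a limit; the modules are meant to keep activations close to a whitened (decorrelated, equal-variance) distribution, and large metric values relative to the limit indicate variance concentrating in a few directions. The function below is only a plausible stand-in for that metric, assuming it measures the anisotropy of the feature covariance within each channel group; it is not the actual scaling.py implementation.

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """Anisotropy of the feature covariance per channel group.

    Returns ~1.0 for perfectly whitened features (covariance proportional
    to the identity) and grows as a few directions dominate.
    x: (num_frames, num_channels)
    """
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    group = num_channels // num_groups
    metrics = []
    for g in range(num_groups):
        xg = x[:, g * group:(g + 1) * group]
        xg = xg - xg.mean(dim=0, keepdim=True)
        cov = (xg.T @ xg) / num_frames              # (group, group)
        eigs = torch.linalg.eigvalsh(cov).clamp(min=0)
        # Ratio of mean-square to squared-mean of the eigenvalues:
        # 1.0 when all eigenvalues are equal, large when a few dominate.
        metrics.append(((eigs ** 2).mean() / (eigs.mean() ** 2 + 1e-20)).item())
    return max(metrics)

# Nearly isotropic random features give a metric close to 1, well below
# typical limits such as 10.0 or 15.0 seen in the log above.
torch.manual_seed(0)
print(whitening_metric(torch.randn(1000, 192), num_groups=1))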
], batch size: 106, lr: 3.71e-03, grad_scale: 32.0 2023-12-04 18:43:21,617 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=392566.6666666667, ans=0.125 2023-12-04 18:43:22,750 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=392566.6666666667, ans=0.125 2023-12-04 18:43:43,662 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.081e+02 1.264e+02 1.339e+02 1.470e+02 1.975e+02, threshold=2.678e+02, percent-clipped=0.0 2023-12-04 18:43:44,034 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=392700.0, ans=0.125 2023-12-04 18:43:56,885 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.65 vs. limit=10.0 2023-12-04 18:43:56,890 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.93 vs. limit=15.0 2023-12-04 18:44:13,096 INFO [train.py:1087] (2/4) Epoch 66, batch 750, loss[loss=0.1585, simple_loss=0.2515, pruned_loss=0.03273, over 24805.00 frames. ], tot_loss[loss=0.1501, simple_loss=0.2432, pruned_loss=0.02849, over 4697327.68 frames. ], batch size: 62, lr: 3.71e-03, grad_scale: 32.0 2023-12-04 18:44:25,176 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=392833.3333333333, ans=0.125 2023-12-04 18:44:25,198 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=392833.3333333333, ans=0.125 2023-12-04 18:44:31,261 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.13 vs. limit=15.0 2023-12-04 18:44:53,697 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.95 vs. limit=22.5 2023-12-04 18:45:20,363 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=393100.0, ans=0.125 2023-12-04 18:45:22,776 INFO [train.py:1087] (2/4) Epoch 66, batch 800, loss[loss=0.1493, simple_loss=0.2445, pruned_loss=0.02709, over 24550.00 frames. ], tot_loss[loss=0.15, simple_loss=0.2433, pruned_loss=0.0284, over 4722346.52 frames. ], batch size: 63, lr: 3.71e-03, grad_scale: 32.0 2023-12-04 18:45:27,916 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=393166.6666666667, ans=10.0 2023-12-04 18:45:41,976 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 18:45:46,868 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=393233.3333333333, ans=0.125 2023-12-04 18:46:00,575 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.146e+02 1.263e+02 1.349e+02 1.436e+02 2.203e+02, threshold=2.698e+02, percent-clipped=0.0 2023-12-04 18:46:15,391 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.58 vs. 
limit=15.0 2023-12-04 18:46:18,927 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=393433.3333333333, ans=0.1 2023-12-04 18:46:23,942 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=393433.3333333333, ans=0.125 2023-12-04 18:46:27,433 INFO [train.py:1087] (2/4) Epoch 66, batch 850, loss[loss=0.1431, simple_loss=0.2401, pruned_loss=0.02304, over 24760.00 frames. ], tot_loss[loss=0.1504, simple_loss=0.2436, pruned_loss=0.02865, over 4744989.87 frames. ], batch size: 66, lr: 3.70e-03, grad_scale: 32.0 2023-12-04 18:46:32,781 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=393500.0, ans=0.125 2023-12-04 18:46:35,109 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=393500.0, ans=0.0 2023-12-04 18:46:45,124 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=393566.6666666667, ans=0.125 2023-12-04 18:46:50,312 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=393566.6666666667, ans=0.1 2023-12-04 18:47:10,217 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.12 vs. limit=15.0 2023-12-04 18:47:17,626 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=393766.6666666667, ans=0.1 2023-12-04 18:47:21,080 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=393766.6666666667, ans=0.0 2023-12-04 18:47:45,717 INFO [train.py:1087] (2/4) Epoch 67, batch 0, loss[loss=0.1455, simple_loss=0.2387, pruned_loss=0.02614, over 24788.00 frames. ], tot_loss[loss=0.1455, simple_loss=0.2387, pruned_loss=0.02614, over 24788.00 frames. ], batch size: 70, lr: 3.68e-03, grad_scale: 32.0 2023-12-04 18:47:45,719 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 18:48:02,843 INFO [train.py:1119] (2/4) Epoch 67, validation: loss=0.1507, simple_loss=0.2474, pruned_loss=0.02701, over 944034.00 frames. 2023-12-04 18:48:02,844 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 18:48:04,715 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=393800.0, ans=0.0 2023-12-04 18:48:13,991 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=393800.0, ans=0.125 2023-12-04 18:48:32,706 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=393933.3333333333, ans=0.1 2023-12-04 18:48:39,124 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=393933.3333333333, ans=0.125 2023-12-04 18:48:50,412 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.049e+02 1.306e+02 1.374e+02 1.528e+02 2.366e+02, threshold=2.749e+02, percent-clipped=0.0 2023-12-04 18:49:14,004 INFO [train.py:1087] (2/4) Epoch 67, batch 50, loss[loss=0.1551, simple_loss=0.2485, pruned_loss=0.03086, over 23917.00 frames. 
], tot_loss[loss=0.1491, simple_loss=0.2425, pruned_loss=0.02786, over 1092167.67 frames. ], batch size: 87, lr: 3.67e-03, grad_scale: 32.0 2023-12-04 18:49:14,500 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=394133.3333333333, ans=0.125 2023-12-04 18:49:18,324 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=394133.3333333333, ans=0.125 2023-12-04 18:49:22,446 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=394133.3333333333, ans=0.125 2023-12-04 18:49:42,970 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.33 vs. limit=22.5 2023-12-04 18:50:03,952 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=394333.3333333333, ans=0.1 2023-12-04 18:50:10,430 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=394400.0, ans=0.1 2023-12-04 18:50:10,495 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=394400.0, ans=0.125 2023-12-04 18:50:11,970 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=394400.0, ans=0.125 2023-12-04 18:50:16,047 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 18:50:19,003 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs. limit=6.0 2023-12-04 18:50:23,076 INFO [train.py:1087] (2/4) Epoch 67, batch 100, loss[loss=0.157, simple_loss=0.2536, pruned_loss=0.03017, over 22712.00 frames. ], tot_loss[loss=0.1489, simple_loss=0.2421, pruned_loss=0.02782, over 1924545.17 frames. 
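Note on the per-batch loss entries: loss[...] reports the current batch and tot_loss[..., over N frames] a frame-weighted running average over everything seen so far in the epoch. The logged numbers are also consistent with the total loss being a weighted combination of the two transducer terms, roughly loss ~ 0.5 * simple_loss + pruned_loss (for example 0.5 * 0.2425 + 0.0279 ~ 0.1491 for the running averages above). The snippet below sketches the frame-weighted accumulation with hypothetical names; it is not the MetricsTracker used by train.py.

class FrameWeightedAverage:
    """Running, frame-weighted average of loss components, mimicking the
    tot_loss[...] entries in the log (sketch only)."""

    def __init__(self):
        self.sums = {"loss": 0.0, "simple_loss": 0.0, "pruned_loss": 0.0}
        self.frames = 0.0

    def update(self, frames: float, **losses: float) -> None:
        self.frames += frames
        for k, v in losses.items():
            self.sums[k] += v * frames

    def averages(self) -> dict:
        return {k: s / max(self.frames, 1.0) for k, s in self.sums.items()}

tot = FrameWeightedAverage()
# two hypothetical batches
tot.update(24787.0, loss=0.1534, simple_loss=0.2449, pruned_loss=0.03093)
tot.update(24734.0, loss=0.1510, simple_loss=0.2436, pruned_loss=0.02915)
print(tot.averages(), "over", tot.frames, "frames")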
], batch size: 106, lr: 3.67e-03, grad_scale: 32.0 2023-12-04 18:50:25,455 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=394466.6666666667, ans=0.125 2023-12-04 18:50:30,628 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=394466.6666666667, ans=0.0 2023-12-04 18:50:39,970 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 18:50:40,118 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=394533.3333333333, ans=0.0 2023-12-04 18:51:11,872 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.107e+02 1.252e+02 1.351e+02 1.505e+02 2.091e+02, threshold=2.703e+02, percent-clipped=0.0 2023-12-04 18:51:17,511 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=394666.6666666667, ans=0.125 2023-12-04 18:51:18,977 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=394733.3333333333, ans=0.0 2023-12-04 18:51:21,498 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=394733.3333333333, ans=0.0 2023-12-04 18:51:29,444 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-12-04 18:51:30,400 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=394733.3333333333, ans=0.125 2023-12-04 18:51:32,801 INFO [train.py:1087] (2/4) Epoch 67, batch 150, loss[loss=0.1562, simple_loss=0.2472, pruned_loss=0.0326, over 24550.00 frames. ], tot_loss[loss=0.149, simple_loss=0.2425, pruned_loss=0.02777, over 2571273.73 frames. ], batch size: 66, lr: 3.67e-03, grad_scale: 32.0 2023-12-04 18:51:51,947 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=394866.6666666667, ans=0.1 2023-12-04 18:51:58,437 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=394866.6666666667, ans=0.2 2023-12-04 18:52:00,022 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=394933.3333333333, ans=0.0 2023-12-04 18:52:17,054 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=395000.0, ans=0.0 2023-12-04 18:52:20,573 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.34 vs. limit=22.5 2023-12-04 18:52:27,825 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=395066.6666666667, ans=0.0 2023-12-04 18:52:29,163 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=395066.6666666667, ans=0.125 2023-12-04 18:52:29,502 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.05 vs. 
limit=10.0 2023-12-04 18:52:42,186 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=12.0 2023-12-04 18:52:42,577 INFO [train.py:1087] (2/4) Epoch 67, batch 200, loss[loss=0.1557, simple_loss=0.2501, pruned_loss=0.03065, over 24159.00 frames. ], tot_loss[loss=0.1496, simple_loss=0.243, pruned_loss=0.02814, over 3069326.14 frames. ], batch size: 82, lr: 3.67e-03, grad_scale: 32.0 2023-12-04 18:52:55,554 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=395200.0, ans=0.125 2023-12-04 18:53:28,318 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=395333.3333333333, ans=0.125 2023-12-04 18:53:30,530 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.095e+02 1.241e+02 1.329e+02 1.420e+02 2.138e+02, threshold=2.658e+02, percent-clipped=0.0 2023-12-04 18:53:42,541 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.79 vs. limit=15.0 2023-12-04 18:53:46,941 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=395400.0, ans=0.1 2023-12-04 18:53:53,695 INFO [train.py:1087] (2/4) Epoch 67, batch 250, loss[loss=0.1545, simple_loss=0.2446, pruned_loss=0.03218, over 24570.00 frames. ], tot_loss[loss=0.15, simple_loss=0.2432, pruned_loss=0.02837, over 3473021.81 frames. ], batch size: 64, lr: 3.67e-03, grad_scale: 32.0 2023-12-04 18:53:54,327 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.27 vs. limit=15.0 2023-12-04 18:54:25,082 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=395600.0, ans=0.2 2023-12-04 18:54:25,124 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=395600.0, ans=0.125 2023-12-04 18:54:29,525 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=395600.0, ans=0.125 2023-12-04 18:54:30,019 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.57 vs. limit=15.0 2023-12-04 18:54:39,412 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=395666.6666666667, ans=0.125 2023-12-04 18:54:46,326 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=395666.6666666667, ans=0.2 2023-12-04 18:55:04,108 INFO [train.py:1087] (2/4) Epoch 67, batch 300, loss[loss=0.1625, simple_loss=0.2546, pruned_loss=0.03518, over 23922.00 frames. ], tot_loss[loss=0.1499, simple_loss=0.2432, pruned_loss=0.02831, over 3778808.75 frames. ], batch size: 87, lr: 3.67e-03, grad_scale: 32.0 2023-12-04 18:55:53,465 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.050e+02 1.265e+02 1.360e+02 1.434e+02 1.941e+02, threshold=2.720e+02, percent-clipped=0.0 2023-12-04 18:56:01,149 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.63 vs. 
limit=12.0 2023-12-04 18:56:10,010 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.48 vs. limit=22.5 2023-12-04 18:56:15,644 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-12-04 18:56:16,017 INFO [train.py:1087] (2/4) Epoch 67, batch 350, loss[loss=0.143, simple_loss=0.2338, pruned_loss=0.02614, over 24790.00 frames. ], tot_loss[loss=0.1493, simple_loss=0.2425, pruned_loss=0.02807, over 4023139.27 frames. ], batch size: 72, lr: 3.66e-03, grad_scale: 32.0 2023-12-04 18:56:38,065 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.42 vs. limit=10.0 2023-12-04 18:56:39,009 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=396200.0, ans=0.125 2023-12-04 18:56:58,725 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=396333.3333333333, ans=0.0 2023-12-04 18:56:58,814 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=396333.3333333333, ans=0.2 2023-12-04 18:57:07,082 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.09 vs. limit=15.0 2023-12-04 18:57:15,412 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=396400.0, ans=10.0 2023-12-04 18:57:28,713 INFO [train.py:1087] (2/4) Epoch 67, batch 400, loss[loss=0.1454, simple_loss=0.2364, pruned_loss=0.02724, over 24754.00 frames. ], tot_loss[loss=0.1495, simple_loss=0.2428, pruned_loss=0.02812, over 4202072.23 frames. ], batch size: 65, lr: 3.66e-03, grad_scale: 32.0 2023-12-04 18:57:41,999 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=396533.3333333333, ans=0.0 2023-12-04 18:57:42,153 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=396533.3333333333, ans=0.0 2023-12-04 18:58:05,566 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.70 vs. limit=6.0 2023-12-04 18:58:06,168 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=396600.0, ans=0.1 2023-12-04 18:58:18,875 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.095e+02 1.226e+02 1.325e+02 1.425e+02 1.742e+02, threshold=2.649e+02, percent-clipped=0.0 2023-12-04 18:58:29,355 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=396733.3333333333, ans=0.2 2023-12-04 18:58:41,478 INFO [train.py:1087] (2/4) Epoch 67, batch 450, loss[loss=0.1616, simple_loss=0.2493, pruned_loss=0.03689, over 24522.00 frames. ], tot_loss[loss=0.1502, simple_loss=0.2434, pruned_loss=0.02855, over 4312290.28 frames. 
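Note on the epoch-boundary entries: at "batch 0" of each epoch the script pauses training, computes the loss over a fixed validation set (944034 frames in this run), and reports the peak GPU memory allocated so far. The sketch below shows that bookkeeping; the model and valid_loader objects and the model.loss(...) call are hypothetical stand-ins, while torch.cuda.max_memory_allocated is the standard PyTorch query behind the "Maximum memory allocated" line.

import torch

@torch.no_grad()
def compute_validation_loss(model, valid_loader, device):
    """Frame-weighted validation loss, mirroring the
    'Computing validation loss' / 'Epoch N, validation: ...' lines."""
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    for batch in valid_loader:
        loss, num_frames = model.loss(batch, device=device)  # assumed API
        tot_loss += loss.item() * num_frames
        tot_frames += num_frames
    model.train()
    return tot_loss / max(tot_frames, 1.0), tot_frames

def log_peak_memory(device) -> None:
    # Source of the 'Maximum memory allocated so far is ...MB' line.
    mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"Maximum memory allocated so far is {mb}MB")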
], batch size: 75, lr: 3.66e-03, grad_scale: 16.0 2023-12-04 18:59:10,938 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 18:59:40,493 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=397066.6666666667, ans=0.07 2023-12-04 18:59:40,801 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.49 vs. limit=15.0 2023-12-04 18:59:51,814 INFO [train.py:1087] (2/4) Epoch 67, batch 500, loss[loss=0.1516, simple_loss=0.2492, pruned_loss=0.02694, over 24612.00 frames. ], tot_loss[loss=0.1501, simple_loss=0.2434, pruned_loss=0.02847, over 4429075.67 frames. ], batch size: 68, lr: 3.66e-03, grad_scale: 16.0 2023-12-04 18:59:56,983 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=397133.3333333333, ans=0.125 2023-12-04 18:59:58,298 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=397133.3333333333, ans=0.125 2023-12-04 19:00:15,520 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=397200.0, ans=0.125 2023-12-04 19:00:36,311 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=397333.3333333333, ans=0.1 2023-12-04 19:00:37,729 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=397333.3333333333, ans=0.2 2023-12-04 19:00:41,465 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.088e+02 1.259e+02 1.344e+02 1.490e+02 1.984e+02, threshold=2.687e+02, percent-clipped=0.0 2023-12-04 19:00:51,735 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.13 vs. limit=15.0 2023-12-04 19:00:59,326 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.06 vs. limit=15.0 2023-12-04 19:01:01,445 INFO [train.py:1087] (2/4) Epoch 67, batch 550, loss[loss=0.1482, simple_loss=0.2442, pruned_loss=0.02613, over 21505.00 frames. ], tot_loss[loss=0.1503, simple_loss=0.2433, pruned_loss=0.02863, over 4511010.66 frames. ], batch size: 129, lr: 3.66e-03, grad_scale: 16.0 2023-12-04 19:01:10,435 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=397466.6666666667, ans=0.0 2023-12-04 19:02:12,018 INFO [train.py:1087] (2/4) Epoch 67, batch 600, loss[loss=0.1626, simple_loss=0.2533, pruned_loss=0.03599, over 24445.00 frames. ], tot_loss[loss=0.1505, simple_loss=0.2435, pruned_loss=0.02875, over 4570272.09 frames. 
], batch size: 77, lr: 3.66e-03, grad_scale: 16.0 2023-12-04 19:02:17,759 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=397800.0, ans=0.5 2023-12-04 19:02:50,680 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=397933.3333333333, ans=0.0 2023-12-04 19:03:00,023 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.065e+02 1.244e+02 1.311e+02 1.453e+02 1.975e+02, threshold=2.621e+02, percent-clipped=0.0 2023-12-04 19:03:06,260 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=398066.6666666667, ans=0.125 2023-12-04 19:03:21,699 INFO [train.py:1087] (2/4) Epoch 67, batch 650, loss[loss=0.1413, simple_loss=0.2381, pruned_loss=0.02221, over 24851.00 frames. ], tot_loss[loss=0.1504, simple_loss=0.2434, pruned_loss=0.02872, over 4629985.66 frames. ], batch size: 68, lr: 3.66e-03, grad_scale: 16.0 2023-12-04 19:03:29,955 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=398133.3333333333, ans=0.1 2023-12-04 19:03:35,498 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.04 vs. limit=10.0 2023-12-04 19:03:38,927 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=398200.0, ans=0.125 2023-12-04 19:03:49,243 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=398266.6666666667, ans=0.0 2023-12-04 19:03:53,240 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=398266.6666666667, ans=0.0 2023-12-04 19:03:57,136 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=398266.6666666667, ans=0.1 2023-12-04 19:04:01,548 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.04 vs. limit=15.0 2023-12-04 19:04:31,943 INFO [train.py:1087] (2/4) Epoch 67, batch 700, loss[loss=0.1777, simple_loss=0.2621, pruned_loss=0.04665, over 17577.00 frames. ], tot_loss[loss=0.1504, simple_loss=0.2433, pruned_loss=0.02873, over 4667589.91 frames. 
], batch size: 177, lr: 3.65e-03, grad_scale: 16.0 2023-12-04 19:04:56,418 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=398533.3333333333, ans=0.125 2023-12-04 19:04:59,710 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=398600.0, ans=0.125 2023-12-04 19:05:11,174 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=398600.0, ans=0.2 2023-12-04 19:05:22,005 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.089e+02 1.250e+02 1.327e+02 1.419e+02 2.097e+02, threshold=2.655e+02, percent-clipped=0.0 2023-12-04 19:05:27,981 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=398733.3333333333, ans=0.0 2023-12-04 19:05:35,961 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 19:05:39,128 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=398733.3333333333, ans=0.125 2023-12-04 19:05:43,130 INFO [train.py:1087] (2/4) Epoch 67, batch 750, loss[loss=0.1584, simple_loss=0.2501, pruned_loss=0.03333, over 23815.00 frames. ], tot_loss[loss=0.1506, simple_loss=0.2434, pruned_loss=0.02893, over 4687993.68 frames. ], batch size: 95, lr: 3.65e-03, grad_scale: 16.0 2023-12-04 19:05:51,144 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=398800.0, ans=0.1 2023-12-04 19:05:51,880 INFO [scaling.py:1022] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.46 vs. limit=8.0 2023-12-04 19:06:03,060 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=398866.6666666667, ans=0.125 2023-12-04 19:06:08,018 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=398866.6666666667, ans=0.125 2023-12-04 19:06:14,826 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=398933.3333333333, ans=0.07 2023-12-04 19:06:23,760 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=399000.0, ans=0.125 2023-12-04 19:06:50,605 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=399066.6666666667, ans=0.1 2023-12-04 19:06:52,754 INFO [train.py:1087] (2/4) Epoch 67, batch 800, loss[loss=0.1468, simple_loss=0.2439, pruned_loss=0.02487, over 24772.00 frames. ], tot_loss[loss=0.1505, simple_loss=0.2435, pruned_loss=0.02873, over 4717770.50 frames. ], batch size: 64, lr: 3.65e-03, grad_scale: 32.0 2023-12-04 19:07:04,846 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=399133.3333333333, ans=0.125 2023-12-04 19:07:06,483 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.60 vs. 
limit=15.0 2023-12-04 19:07:12,569 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=399200.0, ans=0.125 2023-12-04 19:07:19,332 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.82 vs. limit=15.0 2023-12-04 19:07:21,012 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=399266.6666666667, ans=0.0 2023-12-04 19:07:38,275 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.095e+02 1.255e+02 1.337e+02 1.436e+02 1.884e+02, threshold=2.675e+02, percent-clipped=0.0 2023-12-04 19:07:56,379 INFO [train.py:1087] (2/4) Epoch 67, batch 850, loss[loss=0.1558, simple_loss=0.2478, pruned_loss=0.03194, over 22886.00 frames. ], tot_loss[loss=0.1502, simple_loss=0.2432, pruned_loss=0.02856, over 4741423.83 frames. ], batch size: 106, lr: 3.65e-03, grad_scale: 32.0 2023-12-04 19:08:12,443 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=399533.3333333333, ans=0.07 2023-12-04 19:09:08,796 INFO [train.py:1087] (2/4) Epoch 68, batch 0, loss[loss=0.1517, simple_loss=0.2446, pruned_loss=0.02945, over 24554.00 frames. ], tot_loss[loss=0.1517, simple_loss=0.2446, pruned_loss=0.02945, over 24554.00 frames. ], batch size: 63, lr: 3.62e-03, grad_scale: 32.0 2023-12-04 19:09:08,798 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 19:09:25,584 INFO [train.py:1119] (2/4) Epoch 68, validation: loss=0.151, simple_loss=0.2474, pruned_loss=0.02727, over 944034.00 frames. 2023-12-04 19:09:25,585 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 19:10:01,185 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=399900.0, ans=0.035 2023-12-04 19:10:02,551 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=399900.0, ans=0.05 2023-12-04 19:10:25,400 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.105e+02 1.238e+02 1.333e+02 1.431e+02 2.439e+02, threshold=2.665e+02, percent-clipped=0.0 2023-12-04 19:10:34,167 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=400033.3333333333, ans=0.125 2023-12-04 19:10:40,294 INFO [train.py:1087] (2/4) Epoch 68, batch 50, loss[loss=0.1523, simple_loss=0.2485, pruned_loss=0.02807, over 21413.00 frames. ], tot_loss[loss=0.1518, simple_loss=0.2448, pruned_loss=0.02938, over 1067043.92 frames. ], batch size: 127, lr: 3.62e-03, grad_scale: 32.0 2023-12-04 19:10:56,404 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=400166.6666666667, ans=0.125 2023-12-04 19:11:29,146 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=400300.0, ans=0.2 2023-12-04 19:11:31,597 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=400300.0, ans=0.0 2023-12-04 19:11:49,754 INFO [train.py:1087] (2/4) Epoch 68, batch 100, loss[loss=0.1392, simple_loss=0.2329, pruned_loss=0.02273, over 24564.00 frames. ], tot_loss[loss=0.1503, simple_loss=0.2438, pruned_loss=0.02839, over 1878945.82 frames. 
], batch size: 66, lr: 3.62e-03, grad_scale: 16.0 2023-12-04 19:11:57,768 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.07 vs. limit=15.0 2023-12-04 19:12:46,690 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.140e+02 1.320e+02 1.380e+02 1.502e+02 2.105e+02, threshold=2.761e+02, percent-clipped=0.0 2023-12-04 19:12:58,640 INFO [train.py:1087] (2/4) Epoch 68, batch 150, loss[loss=0.1517, simple_loss=0.2389, pruned_loss=0.03219, over 24470.00 frames. ], tot_loss[loss=0.1506, simple_loss=0.2438, pruned_loss=0.02869, over 2518053.41 frames. ], batch size: 77, lr: 3.62e-03, grad_scale: 16.0 2023-12-04 19:13:14,746 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=400833.3333333333, ans=15.0 2023-12-04 19:13:30,470 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.14 vs. limit=15.0 2023-12-04 19:13:52,128 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=400966.6666666667, ans=0.1 2023-12-04 19:14:07,696 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=401100.0, ans=0.0 2023-12-04 19:14:08,809 INFO [train.py:1087] (2/4) Epoch 68, batch 200, loss[loss=0.1504, simple_loss=0.246, pruned_loss=0.02738, over 24715.00 frames. ], tot_loss[loss=0.1496, simple_loss=0.243, pruned_loss=0.02808, over 3036760.77 frames. ], batch size: 69, lr: 3.61e-03, grad_scale: 16.0 2023-12-04 19:14:18,846 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.23 vs. limit=15.0 2023-12-04 19:14:26,209 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=401166.6666666667, ans=0.1 2023-12-04 19:14:43,698 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.85 vs. limit=6.0 2023-12-04 19:15:00,586 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=401300.0, ans=0.125 2023-12-04 19:15:07,825 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.128e+02 1.273e+02 1.370e+02 1.496e+02 1.748e+02, threshold=2.741e+02, percent-clipped=0.0 2023-12-04 19:15:19,791 INFO [train.py:1087] (2/4) Epoch 68, batch 250, loss[loss=0.1474, simple_loss=0.2407, pruned_loss=0.02706, over 24550.00 frames. ], tot_loss[loss=0.1503, simple_loss=0.2436, pruned_loss=0.02856, over 3426541.28 frames. ], batch size: 66, lr: 3.61e-03, grad_scale: 16.0 2023-12-04 19:15:28,297 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=401433.3333333333, ans=0.0 2023-12-04 19:15:30,985 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=401433.3333333333, ans=0.0 2023-12-04 19:15:36,734 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.89 vs. 
limit=15.0 2023-12-04 19:15:45,185 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=401500.0, ans=0.1 2023-12-04 19:15:56,652 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.59 vs. limit=22.5 2023-12-04 19:16:29,597 INFO [train.py:1087] (2/4) Epoch 68, batch 300, loss[loss=0.1606, simple_loss=0.253, pruned_loss=0.03407, over 22736.00 frames. ], tot_loss[loss=0.1503, simple_loss=0.2436, pruned_loss=0.02853, over 3731087.23 frames. ], batch size: 106, lr: 3.61e-03, grad_scale: 16.0 2023-12-04 19:16:40,287 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=401766.6666666667, ans=0.025 2023-12-04 19:16:46,721 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=401833.3333333333, ans=0.07 2023-12-04 19:17:10,254 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.39 vs. limit=15.0 2023-12-04 19:17:25,425 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.065e+02 1.247e+02 1.339e+02 1.460e+02 2.398e+02, threshold=2.677e+02, percent-clipped=0.0 2023-12-04 19:17:36,420 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=402100.0, ans=0.0 2023-12-04 19:17:37,283 INFO [train.py:1087] (2/4) Epoch 68, batch 350, loss[loss=0.1494, simple_loss=0.2447, pruned_loss=0.02702, over 24719.00 frames. ], tot_loss[loss=0.1503, simple_loss=0.2436, pruned_loss=0.0285, over 3974051.22 frames. ], batch size: 67, lr: 3.61e-03, grad_scale: 16.0 2023-12-04 19:17:55,330 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=402166.6666666667, ans=0.1 2023-12-04 19:18:12,382 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=402233.3333333333, ans=0.1 2023-12-04 19:18:39,252 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=402366.6666666667, ans=0.035 2023-12-04 19:18:47,027 INFO [train.py:1087] (2/4) Epoch 68, batch 400, loss[loss=0.1613, simple_loss=0.2578, pruned_loss=0.03233, over 24794.00 frames. ], tot_loss[loss=0.15, simple_loss=0.2434, pruned_loss=0.02834, over 4168147.36 frames. ], batch size: 62, lr: 3.61e-03, grad_scale: 32.0 2023-12-04 19:19:09,411 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=402500.0, ans=0.125 2023-12-04 19:19:12,861 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.22 vs. limit=15.0 2023-12-04 19:19:21,891 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=402566.6666666667, ans=0.0 2023-12-04 19:19:30,335 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.01 vs. 
limit=10.0 2023-12-04 19:19:33,761 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=402633.3333333333, ans=0.125 2023-12-04 19:19:39,945 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=402633.3333333333, ans=0.125 2023-12-04 19:19:45,408 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.111e+02 1.227e+02 1.311e+02 1.473e+02 1.957e+02, threshold=2.622e+02, percent-clipped=0.0 2023-12-04 19:19:45,757 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=402700.0, ans=0.1 2023-12-04 19:19:49,894 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=402700.0, ans=0.125 2023-12-04 19:19:57,923 INFO [train.py:1087] (2/4) Epoch 68, batch 450, loss[loss=0.1433, simple_loss=0.2356, pruned_loss=0.02551, over 24183.00 frames. ], tot_loss[loss=0.1499, simple_loss=0.2431, pruned_loss=0.0283, over 4299667.37 frames. ], batch size: 82, lr: 3.61e-03, grad_scale: 32.0 2023-12-04 19:20:56,766 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=403033.3333333333, ans=0.0 2023-12-04 19:21:06,547 INFO [train.py:1087] (2/4) Epoch 68, batch 500, loss[loss=0.1527, simple_loss=0.2473, pruned_loss=0.02904, over 24696.00 frames. ], tot_loss[loss=0.1495, simple_loss=0.2426, pruned_loss=0.02818, over 4433232.14 frames. ], batch size: 74, lr: 3.61e-03, grad_scale: 32.0 2023-12-04 19:21:42,156 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=403233.3333333333, ans=0.125 2023-12-04 19:21:43,496 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=403233.3333333333, ans=0.125 2023-12-04 19:22:00,566 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.42 vs. limit=15.0 2023-12-04 19:22:04,101 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=403366.6666666667, ans=0.2 2023-12-04 19:22:04,943 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.145e+02 1.270e+02 1.394e+02 1.461e+02 1.900e+02, threshold=2.787e+02, percent-clipped=0.0 2023-12-04 19:22:05,324 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=403366.6666666667, ans=0.0 2023-12-04 19:22:11,886 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=403366.6666666667, ans=0.125 2023-12-04 19:22:12,053 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=403366.6666666667, ans=0.125 2023-12-04 19:22:16,552 INFO [train.py:1087] (2/4) Epoch 68, batch 550, loss[loss=0.1542, simple_loss=0.2494, pruned_loss=0.02952, over 24842.00 frames. ], tot_loss[loss=0.1496, simple_loss=0.2428, pruned_loss=0.02819, over 4503227.80 frames. 
], batch size: 68, lr: 3.60e-03, grad_scale: 16.0 2023-12-04 19:22:25,266 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=403433.3333333333, ans=0.125 2023-12-04 19:22:33,151 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=403500.0, ans=0.125 2023-12-04 19:22:36,029 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=403500.0, ans=0.0 2023-12-04 19:22:49,898 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=403566.6666666667, ans=0.09899494936611666 2023-12-04 19:23:12,758 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=403700.0, ans=0.0 2023-12-04 19:23:15,733 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-12-04 19:23:20,919 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 19:23:25,811 INFO [train.py:1087] (2/4) Epoch 68, batch 600, loss[loss=0.15, simple_loss=0.2441, pruned_loss=0.0279, over 24556.00 frames. ], tot_loss[loss=0.149, simple_loss=0.2422, pruned_loss=0.02789, over 4593689.33 frames. ], batch size: 66, lr: 3.60e-03, grad_scale: 16.0 2023-12-04 19:23:48,382 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.31 vs. limit=22.5 2023-12-04 19:24:12,874 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=403966.6666666667, ans=0.0 2023-12-04 19:24:25,072 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.102e+02 1.276e+02 1.333e+02 1.458e+02 1.844e+02, threshold=2.667e+02, percent-clipped=0.0 2023-12-04 19:24:25,893 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.35 vs. limit=15.0 2023-12-04 19:24:32,272 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.85 vs. limit=15.0 2023-12-04 19:24:35,495 INFO [train.py:1087] (2/4) Epoch 68, batch 650, loss[loss=0.1465, simple_loss=0.2427, pruned_loss=0.02516, over 24789.00 frames. ], tot_loss[loss=0.1492, simple_loss=0.2424, pruned_loss=0.02799, over 4654092.00 frames. ], batch size: 71, lr: 3.60e-03, grad_scale: 16.0 2023-12-04 19:25:01,758 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys.whitening_limit, batch_count=404166.6666666667, ans=6.0 2023-12-04 19:25:09,805 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.72 vs. limit=6.0 2023-12-04 19:25:12,456 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.67 vs. 
limit=12.0 2023-12-04 19:25:30,264 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=404366.6666666667, ans=0.0 2023-12-04 19:25:43,959 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=404433.3333333333, ans=0.0 2023-12-04 19:25:44,715 INFO [train.py:1087] (2/4) Epoch 68, batch 700, loss[loss=0.1687, simple_loss=0.264, pruned_loss=0.03675, over 24289.00 frames. ], tot_loss[loss=0.1495, simple_loss=0.2429, pruned_loss=0.0281, over 4705117.58 frames. ], batch size: 79, lr: 3.60e-03, grad_scale: 16.0 2023-12-04 19:25:58,243 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=404500.0, ans=0.2 2023-12-04 19:26:17,071 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=404566.6666666667, ans=0.125 2023-12-04 19:26:39,635 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=404700.0, ans=0.2 2023-12-04 19:26:41,764 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.090e+02 1.275e+02 1.354e+02 1.444e+02 2.220e+02, threshold=2.709e+02, percent-clipped=0.0 2023-12-04 19:26:42,125 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=404700.0, ans=0.0 2023-12-04 19:26:49,916 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-12-04 19:26:53,534 INFO [train.py:1087] (2/4) Epoch 68, batch 750, loss[loss=0.146, simple_loss=0.2399, pruned_loss=0.02604, over 24735.00 frames. ], tot_loss[loss=0.1496, simple_loss=0.2428, pruned_loss=0.02816, over 4730963.68 frames. ], batch size: 63, lr: 3.60e-03, grad_scale: 16.0 2023-12-04 19:26:53,900 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=404766.6666666667, ans=0.2 2023-12-04 19:28:01,085 INFO [train.py:1087] (2/4) Epoch 68, batch 800, loss[loss=0.1423, simple_loss=0.2318, pruned_loss=0.02643, over 23458.00 frames. ], tot_loss[loss=0.1494, simple_loss=0.2427, pruned_loss=0.02805, over 4766626.64 frames. ], batch size: 94, lr: 3.60e-03, grad_scale: 32.0 2023-12-04 19:28:02,637 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=405100.0, ans=0.125 2023-12-04 19:28:12,994 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=405100.0, ans=0.0 2023-12-04 19:28:22,656 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=405166.6666666667, ans=0.125 2023-12-04 19:28:26,331 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=405233.3333333333, ans=0.125 2023-12-04 19:28:33,891 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.62 vs. 
limit=15.0 2023-12-04 19:28:34,709 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=405233.3333333333, ans=0.125 2023-12-04 19:28:50,054 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-12-04 19:28:53,977 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.053e+02 1.256e+02 1.383e+02 1.523e+02 1.980e+02, threshold=2.767e+02, percent-clipped=0.0 2023-12-04 19:29:03,553 INFO [train.py:1087] (2/4) Epoch 68, batch 850, loss[loss=0.1536, simple_loss=0.2477, pruned_loss=0.02974, over 24156.00 frames. ], tot_loss[loss=0.1497, simple_loss=0.2429, pruned_loss=0.02827, over 4757506.77 frames. ], batch size: 58, lr: 3.60e-03, grad_scale: 32.0 2023-12-04 19:29:12,194 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=405433.3333333333, ans=0.125 2023-12-04 19:29:20,997 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.12 vs. limit=12.0 2023-12-04 19:29:34,081 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.48 vs. limit=15.0 2023-12-04 19:29:39,949 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=405633.3333333333, ans=0.2 2023-12-04 19:29:42,734 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.56 vs. limit=15.0 2023-12-04 19:29:44,817 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-12-04 19:30:17,990 INFO [train.py:1087] (2/4) Epoch 69, batch 0, loss[loss=0.1455, simple_loss=0.241, pruned_loss=0.02503, over 24567.00 frames. ], tot_loss[loss=0.1455, simple_loss=0.241, pruned_loss=0.02503, over 24567.00 frames. ], batch size: 64, lr: 3.57e-03, grad_scale: 32.0 2023-12-04 19:30:17,992 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 19:30:34,640 INFO [train.py:1119] (2/4) Epoch 69, validation: loss=0.151, simple_loss=0.2473, pruned_loss=0.02734, over 944034.00 frames. 2023-12-04 19:30:34,642 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 19:30:45,950 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.51 vs. 
limit=15.0 2023-12-04 19:30:55,107 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=405800.0, ans=0.125 2023-12-04 19:31:12,529 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=405866.6666666667, ans=0.0 2023-12-04 19:31:32,029 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=406000.0, ans=0.2 2023-12-04 19:31:33,691 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=406000.0, ans=0.2 2023-12-04 19:31:36,190 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=406000.0, ans=0.125 2023-12-04 19:31:38,951 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.167e+02 1.273e+02 1.368e+02 1.465e+02 2.041e+02, threshold=2.736e+02, percent-clipped=0.0 2023-12-04 19:31:43,025 INFO [train.py:1087] (2/4) Epoch 69, batch 50, loss[loss=0.1458, simple_loss=0.2401, pruned_loss=0.02582, over 24699.00 frames. ], tot_loss[loss=0.1479, simple_loss=0.2418, pruned_loss=0.02702, over 1093823.51 frames. ], batch size: 74, lr: 3.57e-03, grad_scale: 32.0 2023-12-04 19:32:09,983 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=406200.0, ans=0.125 2023-12-04 19:32:51,509 INFO [train.py:1087] (2/4) Epoch 69, batch 100, loss[loss=0.1539, simple_loss=0.2457, pruned_loss=0.03107, over 24280.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.242, pruned_loss=0.02753, over 1934618.38 frames. ], batch size: 79, lr: 3.57e-03, grad_scale: 32.0 2023-12-04 19:33:05,874 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=406466.6666666667, ans=0.125 2023-12-04 19:33:39,362 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.07 vs. limit=15.0 2023-12-04 19:33:40,349 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=406600.0, ans=0.125 2023-12-04 19:33:56,367 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.284e+02 1.376e+02 1.523e+02 2.091e+02, threshold=2.753e+02, percent-clipped=0.0 2023-12-04 19:34:00,316 INFO [train.py:1087] (2/4) Epoch 69, batch 150, loss[loss=0.145, simple_loss=0.2415, pruned_loss=0.02432, over 24749.00 frames. ], tot_loss[loss=0.1497, simple_loss=0.2431, pruned_loss=0.02814, over 2556084.91 frames. ], batch size: 66, lr: 3.56e-03, grad_scale: 32.0 2023-12-04 19:34:00,963 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=15.0 2023-12-04 19:34:13,266 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.77 vs. 
limit=15.0 2023-12-04 19:34:19,112 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=406800.0, ans=0.125 2023-12-04 19:34:39,184 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=406866.6666666667, ans=0.0 2023-12-04 19:34:44,800 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.71 vs. limit=22.5 2023-12-04 19:34:44,926 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-12-04 19:35:08,914 INFO [train.py:1087] (2/4) Epoch 69, batch 200, loss[loss=0.1359, simple_loss=0.2373, pruned_loss=0.01727, over 24588.00 frames. ], tot_loss[loss=0.1495, simple_loss=0.2428, pruned_loss=0.02813, over 3050163.92 frames. ], batch size: 64, lr: 3.56e-03, grad_scale: 32.0 2023-12-04 19:35:13,795 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=407066.6666666667, ans=0.2 2023-12-04 19:35:22,335 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.86 vs. limit=15.0 2023-12-04 19:35:31,525 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=407133.3333333333, ans=0.2 2023-12-04 19:35:56,794 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=407266.6666666667, ans=0.125 2023-12-04 19:36:13,671 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=407333.3333333333, ans=0.2 2023-12-04 19:36:14,379 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.124e+02 1.284e+02 1.362e+02 1.487e+02 2.201e+02, threshold=2.723e+02, percent-clipped=0.0 2023-12-04 19:36:18,336 INFO [train.py:1087] (2/4) Epoch 69, batch 250, loss[loss=0.155, simple_loss=0.2508, pruned_loss=0.02954, over 22984.00 frames. ], tot_loss[loss=0.1499, simple_loss=0.2431, pruned_loss=0.02834, over 3436426.34 frames. ], batch size: 106, lr: 3.56e-03, grad_scale: 32.0 2023-12-04 19:36:18,950 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.30 vs. limit=10.0 2023-12-04 19:36:29,261 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 19:36:54,754 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 19:37:10,498 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=407600.0, ans=0.2 2023-12-04 19:37:13,229 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=407666.6666666667, ans=0.0 2023-12-04 19:37:24,307 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=407666.6666666667, ans=22.5 2023-12-04 19:37:27,280 INFO [train.py:1087] (2/4) Epoch 69, batch 300, loss[loss=0.1543, simple_loss=0.2475, pruned_loss=0.0305, over 24788.00 frames. 
], tot_loss[loss=0.1495, simple_loss=0.2429, pruned_loss=0.02809, over 3744996.11 frames. ], batch size: 62, lr: 3.56e-03, grad_scale: 32.0 2023-12-04 19:37:28,818 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=407733.3333333333, ans=0.2 2023-12-04 19:37:35,922 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.97 vs. limit=22.5 2023-12-04 19:37:54,798 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=407866.6666666667, ans=0.2 2023-12-04 19:38:02,571 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=407866.6666666667, ans=0.0 2023-12-04 19:38:18,807 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=407933.3333333333, ans=0.1 2023-12-04 19:38:26,957 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=408000.0, ans=0.1 2023-12-04 19:38:30,199 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.056e+02 1.268e+02 1.359e+02 1.465e+02 1.790e+02, threshold=2.718e+02, percent-clipped=0.0 2023-12-04 19:38:34,703 INFO [train.py:1087] (2/4) Epoch 69, batch 350, loss[loss=0.153, simple_loss=0.2403, pruned_loss=0.03289, over 24502.00 frames. ], tot_loss[loss=0.1494, simple_loss=0.2428, pruned_loss=0.02801, over 3994949.54 frames. ], batch size: 75, lr: 3.56e-03, grad_scale: 32.0 2023-12-04 19:38:53,568 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=408133.3333333333, ans=0.0 2023-12-04 19:38:56,193 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=408133.3333333333, ans=0.125 2023-12-04 19:39:01,144 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=408200.0, ans=0.1 2023-12-04 19:39:01,289 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=408200.0, ans=0.04949747468305833 2023-12-04 19:39:38,921 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 19:39:43,585 INFO [train.py:1087] (2/4) Epoch 69, batch 400, loss[loss=0.1362, simple_loss=0.2267, pruned_loss=0.02285, over 24741.00 frames. ], tot_loss[loss=0.1498, simple_loss=0.2431, pruned_loss=0.02825, over 4171284.34 frames. ], batch size: 63, lr: 3.56e-03, grad_scale: 32.0 2023-12-04 19:40:24,546 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=408600.0, ans=0.0 2023-12-04 19:40:42,079 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=408666.6666666667, ans=0.125 2023-12-04 19:40:49,274 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.060e+02 1.255e+02 1.353e+02 1.467e+02 1.842e+02, threshold=2.706e+02, percent-clipped=0.0 2023-12-04 19:40:53,285 INFO [train.py:1087] (2/4) Epoch 69, batch 450, loss[loss=0.1523, simple_loss=0.2464, pruned_loss=0.02914, over 24837.00 frames. ], tot_loss[loss=0.1497, simple_loss=0.2431, pruned_loss=0.02817, over 4312109.61 frames. 
], batch size: 68, lr: 3.56e-03, grad_scale: 32.0 2023-12-04 19:40:54,989 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=408733.3333333333, ans=0.2 2023-12-04 19:41:03,475 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-12-04 19:41:32,566 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=408866.6666666667, ans=0.1 2023-12-04 19:41:45,280 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=408933.3333333333, ans=0.125 2023-12-04 19:41:49,145 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=409000.0, ans=0.0 2023-12-04 19:42:02,469 INFO [train.py:1087] (2/4) Epoch 69, batch 500, loss[loss=0.1465, simple_loss=0.2392, pruned_loss=0.02695, over 24760.00 frames. ], tot_loss[loss=0.1496, simple_loss=0.2429, pruned_loss=0.02816, over 4430899.52 frames. ], batch size: 66, lr: 3.55e-03, grad_scale: 32.0 2023-12-04 19:42:05,509 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=409066.6666666667, ans=0.035 2023-12-04 19:42:13,776 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=409066.6666666667, ans=0.0 2023-12-04 19:42:31,966 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=409200.0, ans=0.125 2023-12-04 19:43:01,932 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=409333.3333333333, ans=0.125 2023-12-04 19:43:07,056 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.142e+02 1.242e+02 1.345e+02 1.487e+02 1.908e+02, threshold=2.690e+02, percent-clipped=0.0 2023-12-04 19:43:10,224 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=409400.0, ans=0.125 2023-12-04 19:43:11,088 INFO [train.py:1087] (2/4) Epoch 69, batch 550, loss[loss=0.1517, simple_loss=0.242, pruned_loss=0.03072, over 24550.00 frames. ], tot_loss[loss=0.1496, simple_loss=0.243, pruned_loss=0.02813, over 4513783.82 frames. ], batch size: 64, lr: 3.55e-03, grad_scale: 32.0 2023-12-04 19:43:32,480 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.24 vs. limit=22.5 2023-12-04 19:43:40,535 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.12 vs. 
limit=15.0 2023-12-04 19:43:41,480 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=409533.3333333333, ans=0.2 2023-12-04 19:43:51,917 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=409600.0, ans=0.125 2023-12-04 19:44:05,394 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=409600.0, ans=0.125 2023-12-04 19:44:10,527 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=409666.6666666667, ans=0.0 2023-12-04 19:44:20,576 INFO [train.py:1087] (2/4) Epoch 69, batch 600, loss[loss=0.1506, simple_loss=0.2413, pruned_loss=0.02999, over 24787.00 frames. ], tot_loss[loss=0.1494, simple_loss=0.2428, pruned_loss=0.02802, over 4578219.63 frames. ], batch size: 71, lr: 3.55e-03, grad_scale: 32.0 2023-12-04 19:44:29,338 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.15 vs. limit=6.0 2023-12-04 19:44:51,995 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.37 vs. limit=6.0 2023-12-04 19:45:02,035 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=409933.3333333333, ans=0.125 2023-12-04 19:45:25,243 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.138e+02 1.291e+02 1.390e+02 1.536e+02 2.156e+02, threshold=2.780e+02, percent-clipped=0.0 2023-12-04 19:45:29,251 INFO [train.py:1087] (2/4) Epoch 69, batch 650, loss[loss=0.1428, simple_loss=0.2401, pruned_loss=0.02274, over 24556.00 frames. ], tot_loss[loss=0.1494, simple_loss=0.2428, pruned_loss=0.02795, over 4632759.59 frames. ], batch size: 63, lr: 3.55e-03, grad_scale: 32.0 2023-12-04 19:45:31,062 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=410066.6666666667, ans=0.125 2023-12-04 19:46:06,751 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=410200.0, ans=0.0 2023-12-04 19:46:06,879 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=410200.0, ans=0.04949747468305833 2023-12-04 19:46:15,658 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=410266.6666666667, ans=0.1 2023-12-04 19:46:17,173 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=410266.6666666667, ans=0.2 2023-12-04 19:46:22,486 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=410333.3333333333, ans=0.0 2023-12-04 19:46:26,419 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 19:46:36,763 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=410400.0, ans=0.125 2023-12-04 19:46:37,724 INFO [train.py:1087] (2/4) Epoch 69, batch 700, loss[loss=0.1553, simple_loss=0.2485, pruned_loss=0.03109, over 24570.00 frames. 
], tot_loss[loss=0.1499, simple_loss=0.2431, pruned_loss=0.02831, over 4646422.67 frames. ], batch size: 65, lr: 3.55e-03, grad_scale: 32.0 2023-12-04 19:46:38,220 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=410400.0, ans=0.0 2023-12-04 19:46:59,777 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.73 vs. limit=12.0 2023-12-04 19:47:02,160 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=410466.6666666667, ans=0.2 2023-12-04 19:47:11,003 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=410533.3333333333, ans=0.2 2023-12-04 19:47:12,697 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.53 vs. limit=15.0 2023-12-04 19:47:41,422 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.061e+02 1.290e+02 1.377e+02 1.496e+02 2.401e+02, threshold=2.755e+02, percent-clipped=0.0 2023-12-04 19:47:46,305 INFO [train.py:1087] (2/4) Epoch 69, batch 750, loss[loss=0.1587, simple_loss=0.2517, pruned_loss=0.03281, over 24539.00 frames. ], tot_loss[loss=0.1497, simple_loss=0.243, pruned_loss=0.02823, over 4687147.18 frames. ], batch size: 62, lr: 3.55e-03, grad_scale: 32.0 2023-12-04 19:47:52,746 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.88 vs. limit=12.0 2023-12-04 19:48:12,541 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=410866.6666666667, ans=0.1 2023-12-04 19:48:13,009 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.87 vs. limit=15.0 2023-12-04 19:48:16,500 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=410866.6666666667, ans=0.2 2023-12-04 19:48:46,808 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.72 vs. limit=6.0 2023-12-04 19:48:54,617 INFO [train.py:1087] (2/4) Epoch 69, batch 800, loss[loss=0.1514, simple_loss=0.2467, pruned_loss=0.02802, over 24773.00 frames. ], tot_loss[loss=0.1498, simple_loss=0.243, pruned_loss=0.02824, over 4729783.33 frames. 
], batch size: 64, lr: 3.55e-03, grad_scale: 32.0 2023-12-04 19:49:03,816 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=411066.6666666667, ans=0.125 2023-12-04 19:49:24,886 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=411200.0, ans=0.1 2023-12-04 19:49:26,144 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=411200.0, ans=0.125 2023-12-04 19:49:41,371 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=411266.6666666667, ans=0.125 2023-12-04 19:49:42,706 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=411266.6666666667, ans=0.2 2023-12-04 19:49:49,467 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=411333.3333333333, ans=0.125 2023-12-04 19:49:52,928 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.121e+02 1.292e+02 1.387e+02 1.504e+02 2.186e+02, threshold=2.774e+02, percent-clipped=0.0 2023-12-04 19:49:56,667 INFO [train.py:1087] (2/4) Epoch 69, batch 850, loss[loss=0.1437, simple_loss=0.2395, pruned_loss=0.02394, over 24860.00 frames. ], tot_loss[loss=0.1497, simple_loss=0.2428, pruned_loss=0.02831, over 4753491.11 frames. ], batch size: 68, lr: 3.54e-03, grad_scale: 32.0 2023-12-04 19:50:19,758 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=411533.3333333333, ans=0.0 2023-12-04 19:50:21,052 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=411533.3333333333, ans=0.1 2023-12-04 19:50:43,972 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.38 vs. limit=15.0 2023-12-04 19:51:06,652 INFO [train.py:1087] (2/4) Epoch 70, batch 0, loss[loss=0.1592, simple_loss=0.2528, pruned_loss=0.03285, over 23495.00 frames. ], tot_loss[loss=0.1592, simple_loss=0.2528, pruned_loss=0.03285, over 23495.00 frames. ], batch size: 94, lr: 3.52e-03, grad_scale: 32.0 2023-12-04 19:51:06,660 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 19:51:20,012 INFO [zipformer.py:1876] (2/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.2492, 3.5375, 3.5284, 3.8952], device='cuda:2') 2023-12-04 19:51:22,678 INFO [train.py:1119] (2/4) Epoch 70, validation: loss=0.1509, simple_loss=0.2473, pruned_loss=0.02724, over 944034.00 frames. 
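The loss figures reported in these entries can be cross-checked directly from the logged numbers. They are consistent with the reported loss being 0.5 * simple_loss + pruned_loss (e.g. the Epoch 70 validation entry just above: 0.5 * 0.2473 + 0.02724 = 0.15089 ≈ 0.1509), and the clipping threshold printed by optim.py tracks the grad-norm quartiles as roughly Clipping_scale times the logged median. The short sketch below re-derives both relations from values that appear in this log; it is an illustrative sanity check only, not the icefall training code, whose actual weighting and clipping logic live in train.py and optim.py.

import math

# Illustrative re-computations from numbers copied out of this log
# (a sketch, not the icefall implementation).

# (1) Reported "loss" is numerically consistent with
#     loss ~= 0.5 * simple_loss + pruned_loss
#     Values from the Epoch 70 validation entry above:
loss, simple_loss, pruned_loss = 0.1509, 0.2473, 0.02724
assert math.isclose(0.5 * simple_loss + pruned_loss, loss, abs_tol=5e-4)

# (2) The clipping threshold logged by optim.py is close to
#     Clipping_scale * (median of the logged grad-norm quartiles).
#     Values from a nearby entry: "Clipping_scale=2.0, grad-norm quartiles
#     1.097e+02 1.274e+02 1.365e+02 1.527e+02 2.483e+02, threshold=2.729e+02"
clipping_scale, median_grad_norm, threshold = 2.0, 1.365e2, 2.729e2
assert math.isclose(clipping_scale * median_grad_norm, threshold, rel_tol=1e-2)

The same decomposition holds for the per-batch loss[...] and tot_loss[...] entries throughout this section (e.g. 0.5 * 0.2436 + 0.02853 = 0.15033 ≈ 0.1503), with tot_loss reported as a running, frames-weighted aggregate over the batches seen so far in the epoch.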
2023-12-04 19:51:22,679 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 19:51:29,173 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=411700.0, ans=0.0 2023-12-04 19:51:31,984 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=411700.0, ans=0.125 2023-12-04 19:51:38,249 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=411766.6666666667, ans=0.2 2023-12-04 19:51:45,339 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=411766.6666666667, ans=0.125 2023-12-04 19:52:01,297 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=411900.0, ans=0.1 2023-12-04 19:52:06,472 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=411900.0, ans=0.1 2023-12-04 19:52:07,683 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=411900.0, ans=10.0 2023-12-04 19:52:11,993 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.79 vs. limit=15.0 2023-12-04 19:52:17,034 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=411966.6666666667, ans=0.0 2023-12-04 19:52:22,907 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 19:52:29,717 INFO [train.py:1087] (2/4) Epoch 70, batch 50, loss[loss=0.1729, simple_loss=0.2627, pruned_loss=0.04148, over 24474.00 frames. ], tot_loss[loss=0.1489, simple_loss=0.2422, pruned_loss=0.02774, over 1101441.95 frames. ], batch size: 75, lr: 3.52e-03, grad_scale: 16.0 2023-12-04 19:52:30,739 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.41 vs. limit=15.0 2023-12-04 19:52:35,037 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.097e+02 1.274e+02 1.365e+02 1.527e+02 2.483e+02, threshold=2.729e+02, percent-clipped=0.0 2023-12-04 19:52:35,811 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.99 vs. limit=22.5 2023-12-04 19:53:05,750 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=412166.6666666667, ans=0.0 2023-12-04 19:53:09,441 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=412233.3333333333, ans=0.0 2023-12-04 19:53:32,996 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=412300.0, ans=0.1 2023-12-04 19:53:36,472 INFO [train.py:1087] (2/4) Epoch 70, batch 100, loss[loss=0.1434, simple_loss=0.2371, pruned_loss=0.02481, over 24779.00 frames. ], tot_loss[loss=0.1488, simple_loss=0.2424, pruned_loss=0.02763, over 1929469.86 frames. 
], batch size: 71, lr: 3.51e-03, grad_scale: 16.0 2023-12-04 19:53:40,639 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=412366.6666666667, ans=0.125 2023-12-04 19:53:41,800 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=412366.6666666667, ans=0.1 2023-12-04 19:53:52,108 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=412433.3333333333, ans=0.2 2023-12-04 19:53:53,216 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=412433.3333333333, ans=0.0 2023-12-04 19:54:03,180 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=412500.0, ans=0.04949747468305833 2023-12-04 19:54:23,503 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=412566.6666666667, ans=0.1 2023-12-04 19:54:26,003 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=412566.6666666667, ans=0.125 2023-12-04 19:54:42,256 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=412633.3333333333, ans=0.125 2023-12-04 19:54:42,655 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.71 vs. limit=12.0 2023-12-04 19:54:44,495 INFO [train.py:1087] (2/4) Epoch 70, batch 150, loss[loss=0.1379, simple_loss=0.2352, pruned_loss=0.02028, over 24778.00 frames. ], tot_loss[loss=0.1493, simple_loss=0.2425, pruned_loss=0.02805, over 2571246.05 frames. ], batch size: 70, lr: 3.51e-03, grad_scale: 16.0 2023-12-04 19:54:49,705 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.081e+02 1.259e+02 1.327e+02 1.407e+02 1.863e+02, threshold=2.654e+02, percent-clipped=0.0 2023-12-04 19:54:56,328 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.41 vs. limit=15.0 2023-12-04 19:54:57,116 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=412766.6666666667, ans=0.125 2023-12-04 19:55:11,853 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=412833.3333333333, ans=0.2 2023-12-04 19:55:14,346 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=412833.3333333333, ans=0.125 2023-12-04 19:55:19,778 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=412833.3333333333, ans=0.2 2023-12-04 19:55:24,943 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=412900.0, ans=0.125 2023-12-04 19:55:31,906 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.82 vs. 
limit=15.0 2023-12-04 19:55:34,840 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=412900.0, ans=22.5 2023-12-04 19:55:48,384 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=412966.6666666667, ans=0.0 2023-12-04 19:55:51,974 INFO [train.py:1087] (2/4) Epoch 70, batch 200, loss[loss=0.1442, simple_loss=0.237, pruned_loss=0.02572, over 24787.00 frames. ], tot_loss[loss=0.1492, simple_loss=0.2425, pruned_loss=0.02792, over 3059551.80 frames. ], batch size: 73, lr: 3.51e-03, grad_scale: 16.0 2023-12-04 19:55:59,625 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.84 vs. limit=6.0 2023-12-04 19:56:04,611 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-12-04 19:56:18,270 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=413166.6666666667, ans=0.125 2023-12-04 19:56:38,455 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.72 vs. limit=15.0 2023-12-04 19:56:50,496 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=413300.0, ans=0.0 2023-12-04 19:56:59,694 INFO [train.py:1087] (2/4) Epoch 70, batch 250, loss[loss=0.1501, simple_loss=0.2432, pruned_loss=0.02853, over 24572.00 frames. ], tot_loss[loss=0.149, simple_loss=0.2423, pruned_loss=0.02785, over 3455382.39 frames. ], batch size: 64, lr: 3.51e-03, grad_scale: 16.0 2023-12-04 19:57:04,767 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.085e+02 1.288e+02 1.394e+02 1.582e+02 1.956e+02, threshold=2.788e+02, percent-clipped=0.0 2023-12-04 19:57:07,602 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 19:57:29,729 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=413500.0, ans=0.1 2023-12-04 19:57:31,406 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=413500.0, ans=0.125 2023-12-04 19:57:35,197 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=413500.0, ans=0.125 2023-12-04 19:57:42,723 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=413566.6666666667, ans=0.5 2023-12-04 19:57:44,444 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.35 vs. 
limit=15.0 2023-12-04 19:57:56,761 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=413633.3333333333, ans=0.1 2023-12-04 19:57:56,987 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=413633.3333333333, ans=0.0 2023-12-04 19:58:06,302 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=413700.0, ans=0.0 2023-12-04 19:58:07,212 INFO [train.py:1087] (2/4) Epoch 70, batch 300, loss[loss=0.1505, simple_loss=0.2407, pruned_loss=0.03015, over 24754.00 frames. ], tot_loss[loss=0.1489, simple_loss=0.2423, pruned_loss=0.02773, over 3763904.92 frames. ], batch size: 66, lr: 3.51e-03, grad_scale: 16.0 2023-12-04 19:58:26,646 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=413766.6666666667, ans=0.125 2023-12-04 19:58:51,608 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.43 vs. limit=22.5 2023-12-04 19:58:55,739 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.35 vs. limit=15.0 2023-12-04 19:58:58,301 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.96 vs. limit=15.0 2023-12-04 19:59:13,299 INFO [train.py:1087] (2/4) Epoch 70, batch 350, loss[loss=0.1511, simple_loss=0.2413, pruned_loss=0.03045, over 24713.00 frames. ], tot_loss[loss=0.1489, simple_loss=0.2425, pruned_loss=0.02767, over 3997704.61 frames. ], batch size: 74, lr: 3.51e-03, grad_scale: 16.0 2023-12-04 19:59:18,495 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.128e+02 1.237e+02 1.319e+02 1.438e+02 1.651e+02, threshold=2.639e+02, percent-clipped=0.0 2023-12-04 20:00:17,591 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=414300.0, ans=0.0 2023-12-04 20:00:20,385 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=414300.0, ans=0.0 2023-12-04 20:00:22,155 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.12 vs. limit=15.0 2023-12-04 20:00:22,635 INFO [train.py:1087] (2/4) Epoch 70, batch 400, loss[loss=0.1418, simple_loss=0.2342, pruned_loss=0.02474, over 24569.00 frames. ], tot_loss[loss=0.1489, simple_loss=0.2423, pruned_loss=0.02774, over 4182578.02 frames. ], batch size: 64, lr: 3.51e-03, grad_scale: 32.0 2023-12-04 20:00:26,772 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=414366.6666666667, ans=0.125 2023-12-04 20:00:26,972 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=414366.6666666667, ans=0.125 2023-12-04 20:00:45,489 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=414433.3333333333, ans=0.0 2023-12-04 20:00:54,998 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.10 vs. 
limit=15.0 2023-12-04 20:01:02,365 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=414566.6666666667, ans=0.125 2023-12-04 20:01:19,772 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=414633.3333333333, ans=0.125 2023-12-04 20:01:27,212 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.13 vs. limit=15.0 2023-12-04 20:01:31,522 INFO [train.py:1087] (2/4) Epoch 70, batch 450, loss[loss=0.1614, simple_loss=0.2534, pruned_loss=0.03474, over 23701.00 frames. ], tot_loss[loss=0.1491, simple_loss=0.2424, pruned_loss=0.02792, over 4330964.92 frames. ], batch size: 95, lr: 3.50e-03, grad_scale: 32.0 2023-12-04 20:01:36,542 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.116e+02 1.278e+02 1.347e+02 1.512e+02 1.947e+02, threshold=2.695e+02, percent-clipped=0.0 2023-12-04 20:01:42,185 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=414700.0, ans=0.0 2023-12-04 20:01:54,979 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=414766.6666666667, ans=0.2 2023-12-04 20:02:01,713 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-12-04 20:02:06,129 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.41 vs. limit=6.0 2023-12-04 20:02:17,833 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.11 vs. limit=22.5 2023-12-04 20:02:18,582 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=414900.0, ans=0.04949747468305833 2023-12-04 20:02:29,021 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=414966.6666666667, ans=0.1 2023-12-04 20:02:39,591 INFO [train.py:1087] (2/4) Epoch 70, batch 500, loss[loss=0.1537, simple_loss=0.2483, pruned_loss=0.02953, over 24545.00 frames. ], tot_loss[loss=0.1492, simple_loss=0.2425, pruned_loss=0.028, over 4435205.14 frames. ], batch size: 63, lr: 3.50e-03, grad_scale: 16.0 2023-12-04 20:02:41,412 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=415033.3333333333, ans=10.0 2023-12-04 20:03:00,326 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=415100.0, ans=0.0 2023-12-04 20:03:11,111 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.64 vs. limit=15.0 2023-12-04 20:03:41,395 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=415300.0, ans=0.0 2023-12-04 20:03:47,578 INFO [train.py:1087] (2/4) Epoch 70, batch 550, loss[loss=0.1497, simple_loss=0.2445, pruned_loss=0.02745, over 24704.00 frames. ], tot_loss[loss=0.1495, simple_loss=0.2427, pruned_loss=0.0281, over 4526011.32 frames. 
], batch size: 69, lr: 3.50e-03, grad_scale: 16.0 2023-12-04 20:03:55,055 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.105e+02 1.240e+02 1.315e+02 1.412e+02 1.948e+02, threshold=2.630e+02, percent-clipped=0.0 2023-12-04 20:04:11,105 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=415433.3333333333, ans=0.125 2023-12-04 20:04:21,418 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=415500.0, ans=0.2 2023-12-04 20:04:29,090 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=415566.6666666667, ans=0.1 2023-12-04 20:04:31,927 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=415566.6666666667, ans=0.125 2023-12-04 20:04:35,496 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=415566.6666666667, ans=0.125 2023-12-04 20:04:56,346 INFO [train.py:1087] (2/4) Epoch 70, batch 600, loss[loss=0.1458, simple_loss=0.2408, pruned_loss=0.02536, over 24573.00 frames. ], tot_loss[loss=0.1491, simple_loss=0.2423, pruned_loss=0.02789, over 4600260.91 frames. ], batch size: 65, lr: 3.50e-03, grad_scale: 16.0 2023-12-04 20:05:05,756 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=415700.0, ans=0.125 2023-12-04 20:05:27,320 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.93 vs. limit=15.0 2023-12-04 20:05:30,926 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=415833.3333333333, ans=0.2 2023-12-04 20:05:55,496 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=415966.6666666667, ans=0.1 2023-12-04 20:06:04,813 INFO [train.py:1087] (2/4) Epoch 70, batch 650, loss[loss=0.1463, simple_loss=0.2386, pruned_loss=0.02703, over 24604.00 frames. ], tot_loss[loss=0.1489, simple_loss=0.2421, pruned_loss=0.02782, over 4649650.27 frames. ], batch size: 68, lr: 3.50e-03, grad_scale: 16.0 2023-12-04 20:06:11,812 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.073e+02 1.223e+02 1.306e+02 1.426e+02 1.890e+02, threshold=2.613e+02, percent-clipped=0.0 2023-12-04 20:06:12,340 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=416033.3333333333, ans=0.125 2023-12-04 20:06:15,495 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.78 vs. 
limit=6.0 2023-12-04 20:06:16,095 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=416033.3333333333, ans=0.125 2023-12-04 20:06:22,751 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=416100.0, ans=0.2 2023-12-04 20:06:44,722 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=416233.3333333333, ans=0.125 2023-12-04 20:07:13,100 INFO [train.py:1087] (2/4) Epoch 70, batch 700, loss[loss=0.1548, simple_loss=0.2483, pruned_loss=0.0306, over 24508.00 frames. ], tot_loss[loss=0.1492, simple_loss=0.2425, pruned_loss=0.02798, over 4678224.51 frames. ], batch size: 75, lr: 3.50e-03, grad_scale: 16.0 2023-12-04 20:07:16,040 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=416366.6666666667, ans=0.2 2023-12-04 20:07:38,780 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=416500.0, ans=0.1 2023-12-04 20:07:51,288 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=416500.0, ans=0.125 2023-12-04 20:07:52,491 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=416566.6666666667, ans=0.1 2023-12-04 20:07:54,907 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=416566.6666666667, ans=0.125 2023-12-04 20:08:07,338 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.41 vs. limit=15.0 2023-12-04 20:08:16,629 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=416633.3333333333, ans=0.2 2023-12-04 20:08:19,693 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.58 vs. limit=15.0 2023-12-04 20:08:19,985 INFO [train.py:1087] (2/4) Epoch 70, batch 750, loss[loss=0.1441, simple_loss=0.2358, pruned_loss=0.0262, over 24773.00 frames. ], tot_loss[loss=0.1492, simple_loss=0.2424, pruned_loss=0.02799, over 4704309.04 frames. 
], batch size: 64, lr: 3.50e-03, grad_scale: 16.0 2023-12-04 20:08:24,681 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=416700.0, ans=0.0 2023-12-04 20:08:27,500 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.147e+02 1.247e+02 1.315e+02 1.424e+02 2.210e+02, threshold=2.629e+02, percent-clipped=0.0 2023-12-04 20:08:46,116 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=416833.3333333333, ans=0.125 2023-12-04 20:08:55,943 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=416833.3333333333, ans=0.125 2023-12-04 20:09:07,451 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=416900.0, ans=0.125 2023-12-04 20:09:11,273 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=416900.0, ans=0.125 2023-12-04 20:09:28,685 INFO [train.py:1087] (2/4) Epoch 70, batch 800, loss[loss=0.1408, simple_loss=0.2349, pruned_loss=0.0234, over 24755.00 frames. ], tot_loss[loss=0.149, simple_loss=0.2421, pruned_loss=0.02792, over 4710704.99 frames. ], batch size: 66, lr: 3.49e-03, grad_scale: 32.0 2023-12-04 20:09:40,396 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=417033.3333333333, ans=0.0 2023-12-04 20:09:46,149 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=417100.0, ans=0.1 2023-12-04 20:09:50,085 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=417100.0, ans=0.0 2023-12-04 20:10:24,787 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=417300.0, ans=0.04949747468305833 2023-12-04 20:10:30,563 INFO [train.py:1087] (2/4) Epoch 70, batch 850, loss[loss=0.1696, simple_loss=0.2598, pruned_loss=0.03974, over 17048.00 frames. ], tot_loss[loss=0.1495, simple_loss=0.2427, pruned_loss=0.02818, over 4705313.27 frames. ], batch size: 177, lr: 3.49e-03, grad_scale: 32.0 2023-12-04 20:10:36,605 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.137e+02 1.298e+02 1.372e+02 1.484e+02 2.066e+02, threshold=2.743e+02, percent-clipped=0.0 2023-12-04 20:10:46,405 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=417433.3333333333, ans=0.125 2023-12-04 20:10:51,360 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=417433.3333333333, ans=0.125 2023-12-04 20:11:00,323 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.35 vs. limit=15.0 2023-12-04 20:11:12,724 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.58 vs. limit=15.0 2023-12-04 20:11:40,516 INFO [train.py:1087] (2/4) Epoch 71, batch 0, loss[loss=0.1462, simple_loss=0.2434, pruned_loss=0.02451, over 24694.00 frames. ], tot_loss[loss=0.1462, simple_loss=0.2434, pruned_loss=0.02451, over 24694.00 frames. 
], batch size: 74, lr: 3.47e-03, grad_scale: 32.0 2023-12-04 20:11:40,518 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 20:11:56,070 INFO [train.py:1119] (2/4) Epoch 71, validation: loss=0.1506, simple_loss=0.247, pruned_loss=0.02716, over 944034.00 frames. 2023-12-04 20:11:56,071 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 20:12:14,097 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=417733.3333333333, ans=0.125 2023-12-04 20:12:14,748 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=5.25 vs. limit=15.0 2023-12-04 20:12:33,773 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.79 vs. limit=15.0 2023-12-04 20:12:42,098 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.76 vs. limit=10.0 2023-12-04 20:12:42,936 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 20:12:48,253 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=417933.3333333333, ans=0.125 2023-12-04 20:12:48,487 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.95 vs. limit=15.0 2023-12-04 20:13:02,937 INFO [train.py:1087] (2/4) Epoch 71, batch 50, loss[loss=0.1398, simple_loss=0.2356, pruned_loss=0.02206, over 24746.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2423, pruned_loss=0.0274, over 1079077.29 frames. ], batch size: 63, lr: 3.47e-03, grad_scale: 32.0 2023-12-04 20:13:14,155 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-12-04 20:13:16,523 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.105e+02 1.267e+02 1.344e+02 1.443e+02 2.088e+02, threshold=2.688e+02, percent-clipped=0.0 2023-12-04 20:13:25,730 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=418066.6666666667, ans=0.125 2023-12-04 20:13:38,047 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=418133.3333333333, ans=0.125 2023-12-04 20:13:39,285 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 20:14:09,099 INFO [train.py:1087] (2/4) Epoch 71, batch 100, loss[loss=0.1505, simple_loss=0.2461, pruned_loss=0.02744, over 24769.00 frames. ], tot_loss[loss=0.1491, simple_loss=0.2428, pruned_loss=0.02766, over 1925687.98 frames. ], batch size: 65, lr: 3.46e-03, grad_scale: 32.0 2023-12-04 20:14:15,531 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=418333.3333333333, ans=0.0 2023-12-04 20:14:30,790 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.73 vs. 
limit=6.0 2023-12-04 20:14:43,379 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=418466.6666666667, ans=0.125 2023-12-04 20:14:56,081 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.80 vs. limit=15.0 2023-12-04 20:15:02,184 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.78 vs. limit=15.0 2023-12-04 20:15:09,643 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=418600.0, ans=0.0 2023-12-04 20:15:13,626 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=418666.6666666667, ans=0.1 2023-12-04 20:15:14,821 INFO [train.py:1087] (2/4) Epoch 71, batch 150, loss[loss=0.1411, simple_loss=0.231, pruned_loss=0.02556, over 24766.00 frames. ], tot_loss[loss=0.1488, simple_loss=0.2423, pruned_loss=0.02759, over 2589424.72 frames. ], batch size: 65, lr: 3.46e-03, grad_scale: 32.0 2023-12-04 20:15:28,856 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.112e+02 1.240e+02 1.368e+02 1.477e+02 1.934e+02, threshold=2.736e+02, percent-clipped=0.0 2023-12-04 20:15:30,909 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=418733.3333333333, ans=0.125 2023-12-04 20:15:33,564 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=418733.3333333333, ans=0.125 2023-12-04 20:15:35,040 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=418733.3333333333, ans=0.0 2023-12-04 20:15:39,099 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.69 vs. limit=22.5 2023-12-04 20:15:54,552 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=418866.6666666667, ans=0.0 2023-12-04 20:16:22,817 INFO [train.py:1087] (2/4) Epoch 71, batch 200, loss[loss=0.165, simple_loss=0.2544, pruned_loss=0.03775, over 23818.00 frames. ], tot_loss[loss=0.1493, simple_loss=0.2428, pruned_loss=0.02786, over 3071977.42 frames. ], batch size: 57, lr: 3.46e-03, grad_scale: 32.0 2023-12-04 20:16:23,678 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.81 vs. limit=15.0 2023-12-04 20:16:34,759 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=419066.6666666667, ans=0.125 2023-12-04 20:16:40,718 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=419066.6666666667, ans=0.125 2023-12-04 20:16:56,105 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=419133.3333333333, ans=0.125 2023-12-04 20:17:31,531 INFO [train.py:1087] (2/4) Epoch 71, batch 250, loss[loss=0.1402, simple_loss=0.2351, pruned_loss=0.02264, over 24600.00 frames. ], tot_loss[loss=0.1491, simple_loss=0.2425, pruned_loss=0.02781, over 3450489.92 frames. 
], batch size: 68, lr: 3.46e-03, grad_scale: 32.0 2023-12-04 20:17:32,290 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.05 vs. limit=12.0 2023-12-04 20:17:39,844 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=419333.3333333333, ans=0.2 2023-12-04 20:17:44,473 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.090e+02 1.265e+02 1.365e+02 1.452e+02 1.844e+02, threshold=2.730e+02, percent-clipped=0.0 2023-12-04 20:18:03,640 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.03 vs. limit=15.0 2023-12-04 20:18:28,460 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=419600.0, ans=0.0 2023-12-04 20:18:39,742 INFO [train.py:1087] (2/4) Epoch 71, batch 300, loss[loss=0.1632, simple_loss=0.254, pruned_loss=0.0362, over 24460.00 frames. ], tot_loss[loss=0.1492, simple_loss=0.2426, pruned_loss=0.02795, over 3743780.59 frames. ], batch size: 77, lr: 3.46e-03, grad_scale: 32.0 2023-12-04 20:18:59,654 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 20:19:08,006 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-12-04 20:19:19,251 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=419866.6666666667, ans=0.125 2023-12-04 20:19:28,778 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=12.0 2023-12-04 20:19:30,815 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=419866.6666666667, ans=0.125 2023-12-04 20:19:33,692 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=419933.3333333333, ans=0.0 2023-12-04 20:19:35,084 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=419933.3333333333, ans=0.0 2023-12-04 20:19:46,792 INFO [train.py:1087] (2/4) Epoch 71, batch 350, loss[loss=0.149, simple_loss=0.2437, pruned_loss=0.02708, over 24855.00 frames. ], tot_loss[loss=0.1493, simple_loss=0.2427, pruned_loss=0.02799, over 3980839.99 frames. ], batch size: 68, lr: 3.46e-03, grad_scale: 16.0 2023-12-04 20:20:02,818 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.098e+02 1.263e+02 1.346e+02 1.433e+02 1.740e+02, threshold=2.692e+02, percent-clipped=0.0 2023-12-04 20:20:03,234 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=420066.6666666667, ans=0.05 2023-12-04 20:20:05,729 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=420066.6666666667, ans=0.0 2023-12-04 20:20:13,901 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.91 vs. 
limit=12.0 2023-12-04 20:20:26,504 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=420200.0, ans=0.125 2023-12-04 20:20:32,748 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=420200.0, ans=0.125 2023-12-04 20:20:50,343 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=420266.6666666667, ans=0.0 2023-12-04 20:20:55,238 INFO [train.py:1087] (2/4) Epoch 71, batch 400, loss[loss=0.1378, simple_loss=0.2318, pruned_loss=0.0219, over 24608.00 frames. ], tot_loss[loss=0.1494, simple_loss=0.2428, pruned_loss=0.02795, over 4173547.89 frames. ], batch size: 68, lr: 3.46e-03, grad_scale: 32.0 2023-12-04 20:20:56,785 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=420333.3333333333, ans=0.0 2023-12-04 20:20:56,880 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=420333.3333333333, ans=10.0 2023-12-04 20:21:09,023 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=420400.0, ans=0.125 2023-12-04 20:21:17,463 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.07 vs. limit=15.0 2023-12-04 20:21:19,874 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=420400.0, ans=0.2 2023-12-04 20:21:40,516 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=420533.3333333333, ans=0.04949747468305833 2023-12-04 20:21:42,870 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=420533.3333333333, ans=0.125 2023-12-04 20:21:44,596 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2023-12-04 20:21:48,434 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=420600.0, ans=0.125 2023-12-04 20:22:04,171 INFO [train.py:1087] (2/4) Epoch 71, batch 450, loss[loss=0.139, simple_loss=0.2313, pruned_loss=0.0233, over 24763.00 frames. ], tot_loss[loss=0.1493, simple_loss=0.2428, pruned_loss=0.0279, over 4323359.70 frames. ], batch size: 65, lr: 3.45e-03, grad_scale: 32.0 2023-12-04 20:22:12,165 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=420666.6666666667, ans=0.0 2023-12-04 20:22:18,181 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.143e+02 1.263e+02 1.348e+02 1.441e+02 1.914e+02, threshold=2.695e+02, percent-clipped=0.0 2023-12-04 20:22:34,562 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2023-12-04 20:22:37,191 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.22 vs. 
limit=6.0 2023-12-04 20:23:01,866 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=420933.3333333333, ans=0.0 2023-12-04 20:23:11,324 INFO [train.py:1087] (2/4) Epoch 71, batch 500, loss[loss=0.1583, simple_loss=0.2548, pruned_loss=0.03092, over 24544.00 frames. ], tot_loss[loss=0.1496, simple_loss=0.2432, pruned_loss=0.02801, over 4426631.22 frames. ], batch size: 66, lr: 3.45e-03, grad_scale: 32.0 2023-12-04 20:23:16,818 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=421000.0, ans=0.125 2023-12-04 20:23:21,814 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.91 vs. limit=15.0 2023-12-04 20:23:25,256 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=421066.6666666667, ans=0.0 2023-12-04 20:23:25,347 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=421066.6666666667, ans=0.125 2023-12-04 20:23:53,747 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=421200.0, ans=0.0 2023-12-04 20:24:19,567 INFO [train.py:1087] (2/4) Epoch 71, batch 550, loss[loss=0.1431, simple_loss=0.2385, pruned_loss=0.02381, over 24715.00 frames. ], tot_loss[loss=0.1496, simple_loss=0.2429, pruned_loss=0.0281, over 4519326.74 frames. ], batch size: 74, lr: 3.45e-03, grad_scale: 32.0 2023-12-04 20:24:30,224 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=421333.3333333333, ans=0.125 2023-12-04 20:24:34,965 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.112e+02 1.289e+02 1.391e+02 1.501e+02 1.847e+02, threshold=2.783e+02, percent-clipped=0.0 2023-12-04 20:24:43,428 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=421400.0, ans=0.015 2023-12-04 20:25:17,763 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.69 vs. limit=22.5 2023-12-04 20:25:27,733 INFO [train.py:1087] (2/4) Epoch 71, batch 600, loss[loss=0.1477, simple_loss=0.2425, pruned_loss=0.02646, over 23442.00 frames. ], tot_loss[loss=0.1494, simple_loss=0.2429, pruned_loss=0.02796, over 4589695.90 frames. ], batch size: 94, lr: 3.45e-03, grad_scale: 32.0 2023-12-04 20:25:48,684 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=421733.3333333333, ans=0.0 2023-12-04 20:26:10,993 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=421866.6666666667, ans=0.0 2023-12-04 20:26:29,118 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=421933.3333333333, ans=0.0 2023-12-04 20:26:35,181 INFO [train.py:1087] (2/4) Epoch 71, batch 650, loss[loss=0.1489, simple_loss=0.2397, pruned_loss=0.029, over 24745.00 frames. ], tot_loss[loss=0.1488, simple_loss=0.2422, pruned_loss=0.0277, over 4649517.31 frames. 
], batch size: 63, lr: 3.45e-03, grad_scale: 32.0 2023-12-04 20:26:51,251 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.134e+02 1.308e+02 1.425e+02 1.537e+02 2.318e+02, threshold=2.850e+02, percent-clipped=0.0 2023-12-04 20:27:00,874 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=422066.6666666667, ans=0.125 2023-12-04 20:27:21,525 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=422200.0, ans=0.0 2023-12-04 20:27:24,743 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=422200.0, ans=0.0 2023-12-04 20:27:42,326 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5 2023-12-04 20:27:44,292 INFO [train.py:1087] (2/4) Epoch 71, batch 700, loss[loss=0.1437, simple_loss=0.2406, pruned_loss=0.02341, over 24819.00 frames. ], tot_loss[loss=0.1487, simple_loss=0.2421, pruned_loss=0.02761, over 4693038.69 frames. ], batch size: 72, lr: 3.45e-03, grad_scale: 16.0 2023-12-04 20:28:04,657 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=422400.0, ans=0.2 2023-12-04 20:28:04,710 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=422400.0, ans=0.125 2023-12-04 20:28:20,938 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=422466.6666666667, ans=0.125 2023-12-04 20:28:28,665 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=422533.3333333333, ans=0.0 2023-12-04 20:28:53,148 INFO [train.py:1087] (2/4) Epoch 71, batch 750, loss[loss=0.1424, simple_loss=0.2337, pruned_loss=0.02552, over 24721.00 frames. ], tot_loss[loss=0.1491, simple_loss=0.2425, pruned_loss=0.02782, over 4727904.26 frames. ], batch size: 69, lr: 3.45e-03, grad_scale: 8.0 2023-12-04 20:29:02,078 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=422666.6666666667, ans=0.1 2023-12-04 20:29:07,515 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=422733.3333333333, ans=0.125 2023-12-04 20:29:08,841 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=422733.3333333333, ans=0.0 2023-12-04 20:29:10,974 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.165e+02 1.267e+02 1.335e+02 1.436e+02 1.922e+02, threshold=2.670e+02, percent-clipped=0.0 2023-12-04 20:29:12,583 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=422733.3333333333, ans=0.2 2023-12-04 20:29:19,251 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=422800.0, ans=0.1 2023-12-04 20:29:19,489 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.67 vs. 
limit=15.0 2023-12-04 20:29:58,153 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.71 vs. limit=10.0 2023-12-04 20:30:01,119 INFO [train.py:1087] (2/4) Epoch 71, batch 800, loss[loss=0.1501, simple_loss=0.2419, pruned_loss=0.02918, over 24745.00 frames. ], tot_loss[loss=0.1488, simple_loss=0.2423, pruned_loss=0.02763, over 4755845.74 frames. ], batch size: 70, lr: 3.45e-03, grad_scale: 16.0 2023-12-04 20:30:01,457 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=423000.0, ans=0.0 2023-12-04 20:30:21,195 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=423066.6666666667, ans=0.0 2023-12-04 20:31:04,110 INFO [train.py:1087] (2/4) Epoch 71, batch 850, loss[loss=0.1458, simple_loss=0.2445, pruned_loss=0.02352, over 24798.00 frames. ], tot_loss[loss=0.1489, simple_loss=0.2423, pruned_loss=0.02771, over 4768699.80 frames. ], batch size: 72, lr: 3.44e-03, grad_scale: 16.0 2023-12-04 20:31:17,263 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=423400.0, ans=0.125 2023-12-04 20:31:19,241 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.091e+02 1.244e+02 1.335e+02 1.450e+02 1.789e+02, threshold=2.670e+02, percent-clipped=0.0 2023-12-04 20:31:52,844 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=423600.0, ans=0.125 2023-12-04 20:31:53,205 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.19 vs. limit=22.5 2023-12-04 20:32:11,652 INFO [train.py:1087] (2/4) Epoch 72, batch 0, loss[loss=0.1461, simple_loss=0.2386, pruned_loss=0.02682, over 24753.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.2386, pruned_loss=0.02682, over 24753.00 frames. ], batch size: 66, lr: 3.42e-03, grad_scale: 32.0 2023-12-04 20:32:11,654 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 20:32:27,128 INFO [train.py:1119] (2/4) Epoch 72, validation: loss=0.1505, simple_loss=0.2467, pruned_loss=0.02715, over 944034.00 frames. 2023-12-04 20:32:27,129 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 20:32:28,583 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 20:32:52,931 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=423766.6666666667, ans=0.0 2023-12-04 20:33:09,470 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.48 vs. limit=15.0 2023-12-04 20:33:15,664 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=423833.3333333333, ans=0.05 2023-12-04 20:33:34,136 INFO [train.py:1087] (2/4) Epoch 72, batch 50, loss[loss=0.1493, simple_loss=0.2465, pruned_loss=0.02602, over 24745.00 frames. ], tot_loss[loss=0.1498, simple_loss=0.2436, pruned_loss=0.028, over 1060561.28 frames. 
], batch size: 61, lr: 3.42e-03, grad_scale: 32.0 2023-12-04 20:33:36,935 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=423966.6666666667, ans=0.125 2023-12-04 20:33:43,222 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=423966.6666666667, ans=0.95 2023-12-04 20:33:47,388 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=12.0 2023-12-04 20:33:52,385 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=424033.3333333333, ans=0.1 2023-12-04 20:33:52,817 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.85 vs. limit=22.5 2023-12-04 20:33:58,346 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.099e+02 1.316e+02 1.426e+02 1.600e+02 2.159e+02, threshold=2.851e+02, percent-clipped=0.0 2023-12-04 20:34:20,716 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=424166.6666666667, ans=0.1 2023-12-04 20:34:40,866 INFO [train.py:1087] (2/4) Epoch 72, batch 100, loss[loss=0.1608, simple_loss=0.255, pruned_loss=0.03328, over 22785.00 frames. ], tot_loss[loss=0.1493, simple_loss=0.243, pruned_loss=0.02778, over 1898091.34 frames. ], batch size: 106, lr: 3.42e-03, grad_scale: 32.0 2023-12-04 20:35:08,455 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=424433.3333333333, ans=0.125 2023-12-04 20:35:19,026 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=424433.3333333333, ans=0.0 2023-12-04 20:35:25,121 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=424500.0, ans=0.125 2023-12-04 20:35:48,536 INFO [train.py:1087] (2/4) Epoch 72, batch 150, loss[loss=0.1483, simple_loss=0.242, pruned_loss=0.02734, over 24578.00 frames. ], tot_loss[loss=0.1493, simple_loss=0.2431, pruned_loss=0.02775, over 2536317.61 frames. ], batch size: 65, lr: 3.41e-03, grad_scale: 32.0 2023-12-04 20:36:00,287 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=12.0 2023-12-04 20:36:11,996 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=424700.0, ans=0.0 2023-12-04 20:36:12,716 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.054e+02 1.281e+02 1.343e+02 1.454e+02 1.783e+02, threshold=2.686e+02, percent-clipped=0.0 2023-12-04 20:36:22,902 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=424766.6666666667, ans=0.125 2023-12-04 20:36:26,563 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=424766.6666666667, ans=0.125 2023-12-04 20:36:30,728 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.90 vs. 
limit=15.0 2023-12-04 20:36:52,992 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-12-04 20:36:56,258 INFO [train.py:1087] (2/4) Epoch 72, batch 200, loss[loss=0.1445, simple_loss=0.2379, pruned_loss=0.02561, over 24569.00 frames. ], tot_loss[loss=0.149, simple_loss=0.2427, pruned_loss=0.02766, over 3037941.51 frames. ], batch size: 64, lr: 3.41e-03, grad_scale: 32.0 2023-12-04 20:37:35,928 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=425166.6666666667, ans=0.07 2023-12-04 20:37:36,222 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.50 vs. limit=15.0 2023-12-04 20:38:04,437 INFO [train.py:1087] (2/4) Epoch 72, batch 250, loss[loss=0.1678, simple_loss=0.2564, pruned_loss=0.03959, over 17102.00 frames. ], tot_loss[loss=0.1493, simple_loss=0.2426, pruned_loss=0.02798, over 3413082.72 frames. ], batch size: 177, lr: 3.41e-03, grad_scale: 32.0 2023-12-04 20:38:18,094 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=425366.6666666667, ans=0.2 2023-12-04 20:38:25,640 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.21 vs. limit=15.0 2023-12-04 20:38:28,887 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.288e+02 1.353e+02 1.422e+02 1.761e+02, threshold=2.705e+02, percent-clipped=0.0 2023-12-04 20:38:35,627 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=425433.3333333333, ans=0.0 2023-12-04 20:38:51,029 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=425500.0, ans=0.125 2023-12-04 20:38:53,586 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=425500.0, ans=0.125 2023-12-04 20:38:56,608 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.34 vs. limit=15.0 2023-12-04 20:39:12,480 INFO [train.py:1087] (2/4) Epoch 72, batch 300, loss[loss=0.1475, simple_loss=0.2465, pruned_loss=0.02424, over 24767.00 frames. ], tot_loss[loss=0.1491, simple_loss=0.2427, pruned_loss=0.02779, over 3731683.64 frames. ], batch size: 70, lr: 3.41e-03, grad_scale: 32.0 2023-12-04 20:39:16,797 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=425633.3333333333, ans=0.125 2023-12-04 20:39:32,442 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=425700.0, ans=0.2 2023-12-04 20:39:40,063 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=425766.6666666667, ans=0.1 2023-12-04 20:39:41,695 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.91 vs. 
limit=10.0 2023-12-04 20:39:43,156 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=425766.6666666667, ans=10.0 2023-12-04 20:39:56,699 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.76 vs. limit=15.0 2023-12-04 20:40:12,109 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.83 vs. limit=15.0 2023-12-04 20:40:19,158 INFO [train.py:1087] (2/4) Epoch 72, batch 350, loss[loss=0.1418, simple_loss=0.2345, pruned_loss=0.02454, over 24581.00 frames. ], tot_loss[loss=0.1493, simple_loss=0.2427, pruned_loss=0.02797, over 3975867.13 frames. ], batch size: 65, lr: 3.41e-03, grad_scale: 32.0 2023-12-04 20:40:44,212 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.112e+02 1.286e+02 1.389e+02 1.472e+02 1.982e+02, threshold=2.777e+02, percent-clipped=0.0 2023-12-04 20:40:51,076 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=426100.0, ans=0.0 2023-12-04 20:41:27,078 INFO [train.py:1087] (2/4) Epoch 72, batch 400, loss[loss=0.1466, simple_loss=0.2368, pruned_loss=0.02817, over 24854.00 frames. ], tot_loss[loss=0.1495, simple_loss=0.2428, pruned_loss=0.02811, over 4169207.19 frames. ], batch size: 68, lr: 3.41e-03, grad_scale: 32.0 2023-12-04 20:41:49,360 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=426366.6666666667, ans=0.125 2023-12-04 20:42:01,996 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=426433.3333333333, ans=0.0 2023-12-04 20:42:09,825 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=426500.0, ans=0.2 2023-12-04 20:42:15,159 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=426500.0, ans=0.125 2023-12-04 20:42:21,814 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=426566.6666666667, ans=0.125 2023-12-04 20:42:30,501 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=426566.6666666667, ans=0.125 2023-12-04 20:42:35,248 INFO [train.py:1087] (2/4) Epoch 72, batch 450, loss[loss=0.1413, simple_loss=0.2328, pruned_loss=0.0249, over 24807.00 frames. ], tot_loss[loss=0.1492, simple_loss=0.2424, pruned_loss=0.02801, over 4304693.02 frames. 
], batch size: 72, lr: 3.41e-03, grad_scale: 32.0 2023-12-04 20:42:56,725 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=426700.0, ans=0.05 2023-12-04 20:43:04,158 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.146e+02 1.260e+02 1.358e+02 1.461e+02 1.966e+02, threshold=2.717e+02, percent-clipped=0.0 2023-12-04 20:43:21,763 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=426833.3333333333, ans=0.125 2023-12-04 20:43:31,793 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=426900.0, ans=0.0 2023-12-04 20:43:46,709 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=426966.6666666667, ans=0.025 2023-12-04 20:43:47,682 INFO [train.py:1087] (2/4) Epoch 72, batch 500, loss[loss=0.1484, simple_loss=0.2437, pruned_loss=0.02653, over 24780.00 frames. ], tot_loss[loss=0.1491, simple_loss=0.2423, pruned_loss=0.02794, over 4420617.43 frames. ], batch size: 71, lr: 3.41e-03, grad_scale: 32.0 2023-12-04 20:43:52,053 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=426966.6666666667, ans=0.125 2023-12-04 20:44:02,539 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=427033.3333333333, ans=0.125 2023-12-04 20:44:04,813 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=427033.3333333333, ans=0.125 2023-12-04 20:44:29,559 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.71 vs. limit=15.0 2023-12-04 20:44:39,491 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=427166.6666666667, ans=0.04949747468305833 2023-12-04 20:44:40,645 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=427233.3333333333, ans=0.0 2023-12-04 20:44:54,309 INFO [train.py:1087] (2/4) Epoch 72, batch 550, loss[loss=0.1388, simple_loss=0.2335, pruned_loss=0.0221, over 24763.00 frames. ], tot_loss[loss=0.1495, simple_loss=0.2425, pruned_loss=0.02821, over 4487306.82 frames. ], batch size: 64, lr: 3.40e-03, grad_scale: 32.0 2023-12-04 20:44:55,834 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=427300.0, ans=0.125 2023-12-04 20:45:18,714 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.096e+02 1.261e+02 1.376e+02 1.500e+02 2.224e+02, threshold=2.751e+02, percent-clipped=0.0 2023-12-04 20:45:51,779 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=427566.6666666667, ans=0.125 2023-12-04 20:45:58,777 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.38 vs. limit=15.0 2023-12-04 20:46:01,924 INFO [train.py:1087] (2/4) Epoch 72, batch 600, loss[loss=0.1428, simple_loss=0.2404, pruned_loss=0.02264, over 24786.00 frames. 
], tot_loss[loss=0.1495, simple_loss=0.2426, pruned_loss=0.0282, over 4564015.74 frames. ], batch size: 72, lr: 3.40e-03, grad_scale: 32.0 2023-12-04 20:46:38,975 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=7.650e-03 2023-12-04 20:46:45,580 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=427833.3333333333, ans=0.125 2023-12-04 20:47:10,169 INFO [train.py:1087] (2/4) Epoch 72, batch 650, loss[loss=0.1611, simple_loss=0.2575, pruned_loss=0.03229, over 22821.00 frames. ], tot_loss[loss=0.1494, simple_loss=0.2426, pruned_loss=0.02813, over 4613472.13 frames. ], batch size: 106, lr: 3.40e-03, grad_scale: 32.0 2023-12-04 20:47:10,585 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=427966.6666666667, ans=0.0 2023-12-04 20:47:15,535 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=427966.6666666667, ans=0.125 2023-12-04 20:47:15,695 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=427966.6666666667, ans=0.2 2023-12-04 20:47:19,953 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=427966.6666666667, ans=0.1 2023-12-04 20:47:34,544 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.137e+02 1.250e+02 1.369e+02 1.461e+02 1.871e+02, threshold=2.739e+02, percent-clipped=0.0 2023-12-04 20:48:16,626 INFO [train.py:1087] (2/4) Epoch 72, batch 700, loss[loss=0.1381, simple_loss=0.2339, pruned_loss=0.02114, over 24789.00 frames. ], tot_loss[loss=0.1496, simple_loss=0.2429, pruned_loss=0.02812, over 4635739.29 frames. ], batch size: 72, lr: 3.40e-03, grad_scale: 32.0 2023-12-04 20:48:23,519 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=428300.0, ans=0.125 2023-12-04 20:48:26,094 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=428300.0, ans=0.125 2023-12-04 20:48:36,440 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=428366.6666666667, ans=0.125 2023-12-04 20:48:46,944 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=2.701e-03 2023-12-04 20:48:48,173 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=428433.3333333333, ans=0.2 2023-12-04 20:48:52,029 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=428433.3333333333, ans=0.125 2023-12-04 20:49:05,849 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=428500.0, ans=0.1 2023-12-04 20:49:07,599 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=428500.0, ans=0.0 2023-12-04 20:49:23,502 INFO [train.py:1087] (2/4) Epoch 72, batch 750, loss[loss=0.1501, simple_loss=0.2438, pruned_loss=0.02822, over 24721.00 frames. ], tot_loss[loss=0.1497, simple_loss=0.243, pruned_loss=0.02822, over 4684456.81 frames. 
], batch size: 63, lr: 3.40e-03, grad_scale: 32.0 2023-12-04 20:49:47,389 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.106e+02 1.245e+02 1.331e+02 1.423e+02 2.357e+02, threshold=2.662e+02, percent-clipped=0.0 2023-12-04 20:49:55,318 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=428766.6666666667, ans=0.05 2023-12-04 20:49:59,770 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=428766.6666666667, ans=0.125 2023-12-04 20:50:09,718 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=428833.3333333333, ans=0.125 2023-12-04 20:50:29,935 INFO [train.py:1087] (2/4) Epoch 72, batch 800, loss[loss=0.1523, simple_loss=0.2461, pruned_loss=0.02921, over 21369.00 frames. ], tot_loss[loss=0.1494, simple_loss=0.2426, pruned_loss=0.02804, over 4712267.83 frames. ], batch size: 127, lr: 3.40e-03, grad_scale: 32.0 2023-12-04 20:50:34,399 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=428966.6666666667, ans=0.1 2023-12-04 20:50:35,605 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=428966.6666666667, ans=0.125 2023-12-04 20:50:36,713 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=428966.6666666667, ans=0.125 2023-12-04 20:50:56,537 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=429100.0, ans=0.1 2023-12-04 20:51:08,172 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=429166.6666666667, ans=0.0 2023-12-04 20:51:29,022 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=429300.0, ans=0.1 2023-12-04 20:51:29,965 INFO [train.py:1087] (2/4) Epoch 72, batch 850, loss[loss=0.1395, simple_loss=0.2355, pruned_loss=0.02178, over 24797.00 frames. ], tot_loss[loss=0.149, simple_loss=0.2423, pruned_loss=0.02788, over 4736474.70 frames. ], batch size: 72, lr: 3.40e-03, grad_scale: 32.0 2023-12-04 20:51:39,982 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=429300.0, ans=0.0 2023-12-04 20:51:51,962 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.084e+02 1.290e+02 1.355e+02 1.497e+02 2.215e+02, threshold=2.711e+02, percent-clipped=0.0 2023-12-04 20:52:04,829 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.02 vs. limit=10.0 2023-12-04 20:52:43,348 INFO [train.py:1087] (2/4) Epoch 73, batch 0, loss[loss=0.1383, simple_loss=0.2316, pruned_loss=0.02247, over 24748.00 frames. ], tot_loss[loss=0.1383, simple_loss=0.2316, pruned_loss=0.02247, over 24748.00 frames. ], batch size: 66, lr: 3.37e-03, grad_scale: 32.0 2023-12-04 20:52:43,349 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 20:52:58,768 INFO [train.py:1119] (2/4) Epoch 73, validation: loss=0.1503, simple_loss=0.2466, pruned_loss=0.02702, over 944034.00 frames. 
2023-12-04 20:52:58,769 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 20:53:23,190 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=429666.6666666667, ans=0.09899494936611666 2023-12-04 20:53:28,290 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=429733.3333333333, ans=0.1 2023-12-04 20:53:48,081 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.35 vs. limit=6.0 2023-12-04 20:54:00,434 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=429866.6666666667, ans=0.125 2023-12-04 20:54:05,234 INFO [train.py:1087] (2/4) Epoch 73, batch 50, loss[loss=0.1465, simple_loss=0.2422, pruned_loss=0.02544, over 24773.00 frames. ], tot_loss[loss=0.1479, simple_loss=0.2413, pruned_loss=0.02722, over 1075968.75 frames. ], batch size: 64, lr: 3.37e-03, grad_scale: 32.0 2023-12-04 20:54:10,683 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=429933.3333333333, ans=0.0 2023-12-04 20:54:18,810 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.59 vs. limit=15.0 2023-12-04 20:54:27,178 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=430000.0, ans=0.125 2023-12-04 20:54:36,078 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.072e+02 1.287e+02 1.367e+02 1.508e+02 1.992e+02, threshold=2.734e+02, percent-clipped=0.0 2023-12-04 20:55:04,252 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.60 vs. limit=15.0 2023-12-04 20:55:11,517 INFO [train.py:1087] (2/4) Epoch 73, batch 100, loss[loss=0.1492, simple_loss=0.2407, pruned_loss=0.02884, over 24429.00 frames. ], tot_loss[loss=0.1484, simple_loss=0.2419, pruned_loss=0.02748, over 1916880.78 frames. ], batch size: 77, lr: 3.37e-03, grad_scale: 32.0 2023-12-04 20:55:36,129 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.56 vs. limit=15.0 2023-12-04 20:56:08,142 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=430533.3333333333, ans=0.125 2023-12-04 20:56:18,503 INFO [train.py:1087] (2/4) Epoch 73, batch 150, loss[loss=0.143, simple_loss=0.2382, pruned_loss=0.02388, over 24550.00 frames. ], tot_loss[loss=0.1487, simple_loss=0.2422, pruned_loss=0.02764, over 2561082.17 frames. 
], batch size: 66, lr: 3.37e-03, grad_scale: 32.0 2023-12-04 20:56:26,892 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=430600.0, ans=0.125 2023-12-04 20:56:49,241 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.122e+02 1.311e+02 1.423e+02 1.594e+02 2.186e+02, threshold=2.846e+02, percent-clipped=0.0 2023-12-04 20:57:19,322 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=430866.6666666667, ans=0.0 2023-12-04 20:57:24,495 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=430933.3333333333, ans=0.0 2023-12-04 20:57:25,441 INFO [train.py:1087] (2/4) Epoch 73, batch 200, loss[loss=0.1461, simple_loss=0.2403, pruned_loss=0.02598, over 22906.00 frames. ], tot_loss[loss=0.1487, simple_loss=0.2419, pruned_loss=0.02778, over 3061295.87 frames. ], batch size: 106, lr: 3.37e-03, grad_scale: 32.0 2023-12-04 20:57:33,225 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=430933.3333333333, ans=0.2 2023-12-04 20:57:37,390 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=431000.0, ans=0.1 2023-12-04 20:57:53,206 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=431066.6666666667, ans=0.125 2023-12-04 20:58:26,616 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=431200.0, ans=0.1 2023-12-04 20:58:28,994 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=431200.0, ans=0.2 2023-12-04 20:58:32,515 INFO [train.py:1087] (2/4) Epoch 73, batch 250, loss[loss=0.1484, simple_loss=0.244, pruned_loss=0.02639, over 24732.00 frames. ], tot_loss[loss=0.1491, simple_loss=0.2424, pruned_loss=0.02791, over 3458745.93 frames. ], batch size: 67, lr: 3.36e-03, grad_scale: 32.0 2023-12-04 20:58:44,242 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=431333.3333333333, ans=0.0 2023-12-04 20:59:02,554 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.106e+02 1.285e+02 1.373e+02 1.461e+02 1.891e+02, threshold=2.747e+02, percent-clipped=0.0 2023-12-04 20:59:08,656 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=431400.0, ans=0.09899494936611666 2023-12-04 20:59:26,281 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=1.559e-02 2023-12-04 20:59:38,327 INFO [train.py:1087] (2/4) Epoch 73, batch 300, loss[loss=0.1515, simple_loss=0.2417, pruned_loss=0.03067, over 24557.00 frames. ], tot_loss[loss=0.1488, simple_loss=0.2421, pruned_loss=0.02775, over 3773570.42 frames. 
], batch size: 66, lr: 3.36e-03, grad_scale: 32.0 2023-12-04 20:59:43,226 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=431600.0, ans=0.0 2023-12-04 20:59:44,621 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=431600.0, ans=0.125 2023-12-04 20:59:52,265 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=431666.6666666667, ans=0.125 2023-12-04 21:00:28,012 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 21:00:43,862 INFO [train.py:1087] (2/4) Epoch 73, batch 350, loss[loss=0.1618, simple_loss=0.2579, pruned_loss=0.03291, over 24147.00 frames. ], tot_loss[loss=0.1484, simple_loss=0.2417, pruned_loss=0.02753, over 4007717.44 frames. ], batch size: 58, lr: 3.36e-03, grad_scale: 32.0 2023-12-04 21:01:11,605 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=432066.6666666667, ans=0.1 2023-12-04 21:01:14,881 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.084e+02 1.268e+02 1.341e+02 1.476e+02 1.875e+02, threshold=2.682e+02, percent-clipped=0.0 2023-12-04 21:01:36,812 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=432200.0, ans=0.125 2023-12-04 21:01:50,821 INFO [train.py:1087] (2/4) Epoch 73, batch 400, loss[loss=0.1419, simple_loss=0.2367, pruned_loss=0.02354, over 24558.00 frames. ], tot_loss[loss=0.1483, simple_loss=0.2417, pruned_loss=0.02745, over 4177009.33 frames. ], batch size: 66, lr: 3.36e-03, grad_scale: 32.0 2023-12-04 21:02:05,968 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=6.21 vs. limit=12.0 2023-12-04 21:02:49,131 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=432533.3333333333, ans=0.125 2023-12-04 21:02:55,975 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=432600.0, ans=0.0 2023-12-04 21:02:56,760 INFO [train.py:1087] (2/4) Epoch 73, batch 450, loss[loss=0.1442, simple_loss=0.2396, pruned_loss=0.02441, over 24074.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2419, pruned_loss=0.02755, over 4305003.50 frames. ], batch size: 87, lr: 3.36e-03, grad_scale: 32.0 2023-12-04 21:02:58,392 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=432600.0, ans=0.125 2023-12-04 21:03:03,529 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=432600.0, ans=0.125 2023-12-04 21:03:03,893 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.94 vs. limit=6.0 2023-12-04 21:03:19,102 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.59 vs. limit=12.0 2023-12-04 21:03:19,188 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.00 vs. 
limit=10.0 2023-12-04 21:03:26,799 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.109e+02 1.266e+02 1.338e+02 1.464e+02 1.805e+02, threshold=2.676e+02, percent-clipped=0.0 2023-12-04 21:04:01,998 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.34 vs. limit=22.5 2023-12-04 21:04:02,534 INFO [train.py:1087] (2/4) Epoch 73, batch 500, loss[loss=0.1613, simple_loss=0.2511, pruned_loss=0.03571, over 23424.00 frames. ], tot_loss[loss=0.149, simple_loss=0.2423, pruned_loss=0.0278, over 4399869.56 frames. ], batch size: 94, lr: 3.36e-03, grad_scale: 32.0 2023-12-04 21:04:04,062 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=432933.3333333333, ans=0.1 2023-12-04 21:04:16,165 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=433000.0, ans=0.2 2023-12-04 21:04:33,080 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=433066.6666666667, ans=0.125 2023-12-04 21:04:33,218 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=433066.6666666667, ans=0.2 2023-12-04 21:04:41,968 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-12-04 21:05:05,132 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.01 vs. limit=15.0 2023-12-04 21:05:08,174 INFO [train.py:1087] (2/4) Epoch 73, batch 550, loss[loss=0.1366, simple_loss=0.2322, pruned_loss=0.02044, over 24798.00 frames. ], tot_loss[loss=0.1492, simple_loss=0.2424, pruned_loss=0.028, over 4465505.06 frames. ], batch size: 62, lr: 3.36e-03, grad_scale: 32.0 2023-12-04 21:05:36,272 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=433400.0, ans=0.2 2023-12-04 21:05:39,653 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.111e+02 1.258e+02 1.337e+02 1.412e+02 2.063e+02, threshold=2.674e+02, percent-clipped=0.0 2023-12-04 21:05:46,700 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=433400.0, ans=0.2 2023-12-04 21:06:13,299 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=433533.3333333333, ans=0.1 2023-12-04 21:06:15,604 INFO [train.py:1087] (2/4) Epoch 73, batch 600, loss[loss=0.1428, simple_loss=0.2398, pruned_loss=0.02293, over 24542.00 frames. ], tot_loss[loss=0.1486, simple_loss=0.2421, pruned_loss=0.02762, over 4554192.27 frames. 
], batch size: 62, lr: 3.36e-03, grad_scale: 32.0 2023-12-04 21:06:35,505 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=433666.6666666667, ans=0.2 2023-12-04 21:06:37,841 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=433666.6666666667, ans=0.125 2023-12-04 21:06:44,245 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=433733.3333333333, ans=0.125 2023-12-04 21:06:47,201 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.27 vs. limit=22.5 2023-12-04 21:07:04,096 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=433800.0, ans=0.0 2023-12-04 21:07:05,826 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.58 vs. limit=15.0 2023-12-04 21:07:09,182 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=433866.6666666667, ans=0.0 2023-12-04 21:07:16,699 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=433866.6666666667, ans=0.2 2023-12-04 21:07:17,968 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=433866.6666666667, ans=0.07 2023-12-04 21:07:19,462 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.18 vs. limit=15.0 2023-12-04 21:07:22,674 INFO [train.py:1087] (2/4) Epoch 73, batch 650, loss[loss=0.1384, simple_loss=0.2314, pruned_loss=0.02273, over 24792.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2419, pruned_loss=0.02758, over 4618376.11 frames. ], batch size: 62, lr: 3.35e-03, grad_scale: 32.0 2023-12-04 21:07:25,909 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.36 vs. limit=10.0 2023-12-04 21:07:54,269 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.143e+02 1.262e+02 1.372e+02 1.499e+02 1.934e+02, threshold=2.743e+02, percent-clipped=0.0 2023-12-04 21:07:59,827 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=434066.6666666667, ans=0.125 2023-12-04 21:08:10,566 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=434133.3333333333, ans=0.2 2023-12-04 21:08:30,739 INFO [train.py:1087] (2/4) Epoch 73, batch 700, loss[loss=0.1473, simple_loss=0.2395, pruned_loss=0.02755, over 24000.00 frames. ], tot_loss[loss=0.149, simple_loss=0.2422, pruned_loss=0.02783, over 4647814.33 frames. 
], batch size: 87, lr: 3.35e-03, grad_scale: 32.0 2023-12-04 21:08:51,521 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=434333.3333333333, ans=0.125 2023-12-04 21:09:04,240 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=434400.0, ans=0.125 2023-12-04 21:09:15,656 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=434466.6666666667, ans=0.2 2023-12-04 21:09:16,993 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=14.68 vs. limit=15.0 2023-12-04 21:09:37,086 INFO [train.py:1087] (2/4) Epoch 73, batch 750, loss[loss=0.1483, simple_loss=0.2448, pruned_loss=0.02588, over 24763.00 frames. ], tot_loss[loss=0.1492, simple_loss=0.2423, pruned_loss=0.02807, over 4663465.14 frames. ], batch size: 70, lr: 3.35e-03, grad_scale: 32.0 2023-12-04 21:09:39,594 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.13 vs. limit=6.0 2023-12-04 21:09:49,293 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=434666.6666666667, ans=0.0 2023-12-04 21:09:51,272 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.31 vs. limit=15.0 2023-12-04 21:09:59,816 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=434666.6666666667, ans=0.125 2023-12-04 21:10:07,606 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.279e+02 1.357e+02 1.516e+02 2.377e+02, threshold=2.715e+02, percent-clipped=0.0 2023-12-04 21:10:22,334 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=434800.0, ans=0.95 2023-12-04 21:10:37,590 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=434866.6666666667, ans=0.125 2023-12-04 21:10:40,108 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=434866.6666666667, ans=0.125 2023-12-04 21:10:41,733 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.62 vs. limit=22.5 2023-12-04 21:10:43,664 INFO [train.py:1087] (2/4) Epoch 73, batch 800, loss[loss=0.1446, simple_loss=0.2365, pruned_loss=0.02637, over 24581.00 frames. ], tot_loss[loss=0.1489, simple_loss=0.242, pruned_loss=0.02786, over 4683022.63 frames. ], batch size: 64, lr: 3.35e-03, grad_scale: 32.0 2023-12-04 21:11:02,712 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.78 vs. limit=12.0 2023-12-04 21:11:08,723 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.80 vs. 
limit=15.0 2023-12-04 21:11:25,453 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=435133.3333333333, ans=0.2 2023-12-04 21:11:28,962 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=435133.3333333333, ans=0.1 2023-12-04 21:11:43,655 INFO [train.py:1087] (2/4) Epoch 73, batch 850, loss[loss=0.1465, simple_loss=0.2419, pruned_loss=0.02548, over 24732.00 frames. ], tot_loss[loss=0.1491, simple_loss=0.2423, pruned_loss=0.02796, over 4711490.50 frames. ], batch size: 63, lr: 3.35e-03, grad_scale: 32.0 2023-12-04 21:12:02,120 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=435333.3333333333, ans=10.0 2023-12-04 21:12:10,271 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.092e+02 1.277e+02 1.358e+02 1.521e+02 2.220e+02, threshold=2.716e+02, percent-clipped=0.0 2023-12-04 21:12:12,054 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=435400.0, ans=15.0 2023-12-04 21:12:21,348 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.97 vs. limit=15.0 2023-12-04 21:12:47,976 INFO [train.py:1087] (2/4) Epoch 74, batch 0, loss[loss=0.1437, simple_loss=0.2405, pruned_loss=0.02349, over 24854.00 frames. ], tot_loss[loss=0.1437, simple_loss=0.2405, pruned_loss=0.02349, over 24854.00 frames. ], batch size: 68, lr: 3.33e-03, grad_scale: 32.0 2023-12-04 21:12:47,977 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 21:12:57,080 INFO [zipformer.py:1876] (2/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.6364, 4.3071, 4.3786, 4.3345], device='cuda:2') 2023-12-04 21:13:02,606 INFO [train.py:1119] (2/4) Epoch 74, validation: loss=0.1507, simple_loss=0.2468, pruned_loss=0.02733, over 944034.00 frames. 2023-12-04 21:13:02,607 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 21:13:03,627 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.81 vs. limit=10.0 2023-12-04 21:13:07,797 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=435566.6666666667, ans=0.0 2023-12-04 21:13:20,063 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=435633.3333333333, ans=0.125 2023-12-04 21:13:32,881 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=435700.0, ans=0.5 2023-12-04 21:13:39,441 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.90 vs. limit=15.0 2023-12-04 21:13:39,537 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=435700.0, ans=15.0 2023-12-04 21:13:41,556 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 21:13:44,437 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.92 vs. 
limit=15.0 2023-12-04 21:13:48,082 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=435766.6666666667, ans=0.0 2023-12-04 21:13:57,171 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.75 vs. limit=15.0 2023-12-04 21:14:07,795 INFO [train.py:1087] (2/4) Epoch 74, batch 50, loss[loss=0.1499, simple_loss=0.2438, pruned_loss=0.02797, over 24550.00 frames. ], tot_loss[loss=0.1508, simple_loss=0.2434, pruned_loss=0.02906, over 1091069.38 frames. ], batch size: 66, lr: 3.32e-03, grad_scale: 64.0 2023-12-04 21:14:11,933 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=435900.0, ans=0.125 2023-12-04 21:14:12,208 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=435900.0, ans=0.125 2023-12-04 21:14:44,575 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.109e+02 1.267e+02 1.359e+02 1.516e+02 1.990e+02, threshold=2.718e+02, percent-clipped=0.0 2023-12-04 21:14:47,745 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=436100.0, ans=0.0 2023-12-04 21:15:03,555 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=436166.6666666667, ans=0.05 2023-12-04 21:15:03,556 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=436166.6666666667, ans=0.0 2023-12-04 21:15:13,992 INFO [train.py:1087] (2/4) Epoch 74, batch 100, loss[loss=0.1434, simple_loss=0.2411, pruned_loss=0.02287, over 24602.00 frames. ], tot_loss[loss=0.1497, simple_loss=0.2426, pruned_loss=0.02841, over 1893182.34 frames. ], batch size: 68, lr: 3.32e-03, grad_scale: 64.0 2023-12-04 21:15:29,509 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=436300.0, ans=0.125 2023-12-04 21:15:40,026 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=436366.6666666667, ans=0.125 2023-12-04 21:15:44,338 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.18 vs. limit=15.0 2023-12-04 21:15:45,008 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=436366.6666666667, ans=0.0 2023-12-04 21:16:04,295 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=436433.3333333333, ans=0.1 2023-12-04 21:16:19,478 INFO [train.py:1087] (2/4) Epoch 74, batch 150, loss[loss=0.1534, simple_loss=0.2477, pruned_loss=0.02956, over 24569.00 frames. ], tot_loss[loss=0.1488, simple_loss=0.2421, pruned_loss=0.02779, over 2550731.50 frames. 
], batch size: 63, lr: 3.32e-03, grad_scale: 32.0 2023-12-04 21:16:33,612 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=436633.3333333333, ans=0.125 2023-12-04 21:16:36,017 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=436633.3333333333, ans=0.0 2023-12-04 21:16:56,848 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=436700.0, ans=0.2 2023-12-04 21:16:57,627 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.082e+02 1.269e+02 1.335e+02 1.449e+02 2.080e+02, threshold=2.669e+02, percent-clipped=0.0 2023-12-04 21:17:06,473 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=436766.6666666667, ans=15.0 2023-12-04 21:17:25,280 INFO [train.py:1087] (2/4) Epoch 74, batch 200, loss[loss=0.1495, simple_loss=0.2455, pruned_loss=0.02675, over 24680.00 frames. ], tot_loss[loss=0.1487, simple_loss=0.242, pruned_loss=0.02771, over 3040506.97 frames. ], batch size: 74, lr: 3.32e-03, grad_scale: 32.0 2023-12-04 21:17:45,883 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=436966.6666666667, ans=0.09899494936611666 2023-12-04 21:17:46,379 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.68 vs. limit=10.0 2023-12-04 21:17:53,542 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=437033.3333333333, ans=0.0 2023-12-04 21:18:04,120 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=437100.0, ans=15.0 2023-12-04 21:18:31,447 INFO [train.py:1087] (2/4) Epoch 74, batch 250, loss[loss=0.145, simple_loss=0.2413, pruned_loss=0.02438, over 24789.00 frames. ], tot_loss[loss=0.149, simple_loss=0.2422, pruned_loss=0.02791, over 3431139.75 frames. ], batch size: 71, lr: 3.32e-03, grad_scale: 32.0 2023-12-04 21:19:03,694 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=437366.6666666667, ans=0.125 2023-12-04 21:19:09,604 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.079e+02 1.279e+02 1.357e+02 1.509e+02 2.105e+02, threshold=2.714e+02, percent-clipped=0.0 2023-12-04 21:19:37,271 INFO [train.py:1087] (2/4) Epoch 74, batch 300, loss[loss=0.1526, simple_loss=0.245, pruned_loss=0.03017, over 24790.00 frames. ], tot_loss[loss=0.1487, simple_loss=0.2418, pruned_loss=0.02774, over 3736480.35 frames. ], batch size: 62, lr: 3.32e-03, grad_scale: 32.0 2023-12-04 21:20:12,628 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=437700.0, ans=0.125 2023-12-04 21:20:21,524 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=437766.6666666667, ans=0.125 2023-12-04 21:20:37,014 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=437833.3333333333, ans=0.2 2023-12-04 21:20:42,674 INFO [train.py:1087] (2/4) Epoch 74, batch 350, loss[loss=0.1455, simple_loss=0.2403, pruned_loss=0.02536, over 24781.00 frames. 
], tot_loss[loss=0.1489, simple_loss=0.2422, pruned_loss=0.02785, over 3975592.37 frames. ], batch size: 71, lr: 3.32e-03, grad_scale: 32.0 2023-12-04 21:20:59,377 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.81 vs. limit=15.0 2023-12-04 21:21:21,353 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.136e+02 1.265e+02 1.357e+02 1.512e+02 2.007e+02, threshold=2.714e+02, percent-clipped=0.0 2023-12-04 21:21:32,075 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=438100.0, ans=0.125 2023-12-04 21:21:49,779 INFO [train.py:1087] (2/4) Epoch 74, batch 400, loss[loss=0.1589, simple_loss=0.2482, pruned_loss=0.03486, over 24464.00 frames. ], tot_loss[loss=0.1484, simple_loss=0.2419, pruned_loss=0.02746, over 4182596.54 frames. ], batch size: 77, lr: 3.32e-03, grad_scale: 32.0 2023-12-04 21:22:56,781 INFO [train.py:1087] (2/4) Epoch 74, batch 450, loss[loss=0.1466, simple_loss=0.2424, pruned_loss=0.02537, over 24783.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2419, pruned_loss=0.02754, over 4310760.81 frames. ], batch size: 73, lr: 3.31e-03, grad_scale: 32.0 2023-12-04 21:23:31,897 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.34 vs. limit=22.5 2023-12-04 21:23:34,850 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.047e+02 1.260e+02 1.354e+02 1.483e+02 1.828e+02, threshold=2.707e+02, percent-clipped=0.0 2023-12-04 21:24:00,459 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 21:24:02,987 INFO [train.py:1087] (2/4) Epoch 74, batch 500, loss[loss=0.1419, simple_loss=0.2348, pruned_loss=0.02452, over 24747.00 frames. ], tot_loss[loss=0.1486, simple_loss=0.242, pruned_loss=0.02758, over 4415880.51 frames. ], batch size: 61, lr: 3.31e-03, grad_scale: 32.0 2023-12-04 21:24:05,070 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=438900.0, ans=0.2 2023-12-04 21:24:11,627 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=438900.0, ans=0.1 2023-12-04 21:24:17,795 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=438966.6666666667, ans=0.0 2023-12-04 21:24:23,690 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=438966.6666666667, ans=0.125 2023-12-04 21:24:28,564 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=439033.3333333333, ans=0.1 2023-12-04 21:24:29,133 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.74 vs. limit=10.0 2023-12-04 21:24:34,126 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.31 vs. limit=15.0 2023-12-04 21:24:37,420 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.26 vs. 
limit=15.0 2023-12-04 21:24:50,599 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 21:24:50,623 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 21:24:57,184 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=439166.6666666667, ans=0.2 2023-12-04 21:25:01,292 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.17 vs. limit=12.0 2023-12-04 21:25:01,965 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=439166.6666666667, ans=0.2 2023-12-04 21:25:07,891 INFO [train.py:1087] (2/4) Epoch 74, batch 550, loss[loss=0.1453, simple_loss=0.237, pruned_loss=0.02682, over 24744.00 frames. ], tot_loss[loss=0.1482, simple_loss=0.2417, pruned_loss=0.0274, over 4516595.14 frames. ], batch size: 70, lr: 3.31e-03, grad_scale: 32.0 2023-12-04 21:25:25,484 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=439300.0, ans=0.125 2023-12-04 21:25:41,992 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=5.72 vs. limit=15.0 2023-12-04 21:25:46,153 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.133e+02 1.258e+02 1.337e+02 1.431e+02 1.814e+02, threshold=2.673e+02, percent-clipped=0.0 2023-12-04 21:25:58,037 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=439433.3333333333, ans=0.2 2023-12-04 21:26:03,360 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=439500.0, ans=0.125 2023-12-04 21:26:12,230 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=439500.0, ans=0.125 2023-12-04 21:26:12,276 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=439500.0, ans=0.07 2023-12-04 21:26:14,514 INFO [train.py:1087] (2/4) Epoch 74, batch 600, loss[loss=0.1553, simple_loss=0.2504, pruned_loss=0.03006, over 24259.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2418, pruned_loss=0.02757, over 4583883.81 frames. ], batch size: 79, lr: 3.31e-03, grad_scale: 32.0 2023-12-04 21:26:17,498 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=439566.6666666667, ans=0.125 2023-12-04 21:26:25,537 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.62 vs. limit=22.5 2023-12-04 21:26:38,180 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.37 vs. limit=10.0 2023-12-04 21:26:41,924 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=439700.0, ans=0.125 2023-12-04 21:27:21,263 INFO [train.py:1087] (2/4) Epoch 74, batch 650, loss[loss=0.1431, simple_loss=0.2357, pruned_loss=0.02526, over 24608.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2418, pruned_loss=0.02762, over 4630475.48 frames. 
], batch size: 68, lr: 3.31e-03, grad_scale: 32.0 2023-12-04 21:27:52,272 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=440033.3333333333, ans=0.125 2023-12-04 21:27:52,691 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=5.17 vs. limit=15.0 2023-12-04 21:27:59,966 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.287e+02 1.377e+02 1.543e+02 1.843e+02, threshold=2.755e+02, percent-clipped=0.0 2023-12-04 21:28:25,837 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=440166.6666666667, ans=0.95 2023-12-04 21:28:28,094 INFO [train.py:1087] (2/4) Epoch 74, batch 700, loss[loss=0.1446, simple_loss=0.2406, pruned_loss=0.02436, over 24694.00 frames. ], tot_loss[loss=0.1483, simple_loss=0.2418, pruned_loss=0.02744, over 4678480.85 frames. ], batch size: 74, lr: 3.31e-03, grad_scale: 32.0 2023-12-04 21:28:52,081 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.79 vs. limit=22.5 2023-12-04 21:28:52,924 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=440366.6666666667, ans=0.1 2023-12-04 21:29:01,991 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=440366.6666666667, ans=0.125 2023-12-04 21:29:06,984 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=440433.3333333333, ans=0.2 2023-12-04 21:29:12,222 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=440433.3333333333, ans=0.0 2023-12-04 21:29:21,630 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=440500.0, ans=0.125 2023-12-04 21:29:24,111 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 21:29:34,200 INFO [train.py:1087] (2/4) Epoch 74, batch 750, loss[loss=0.1465, simple_loss=0.2382, pruned_loss=0.02738, over 24850.00 frames. ], tot_loss[loss=0.1484, simple_loss=0.2417, pruned_loss=0.02753, over 4711282.94 frames. ], batch size: 68, lr: 3.31e-03, grad_scale: 32.0 2023-12-04 21:30:12,276 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.019e+02 1.271e+02 1.351e+02 1.450e+02 1.892e+02, threshold=2.703e+02, percent-clipped=0.0 2023-12-04 21:30:25,807 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-12-04 21:30:34,597 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=440833.3333333333, ans=0.0 2023-12-04 21:30:39,688 INFO [train.py:1087] (2/4) Epoch 74, batch 800, loss[loss=0.1751, simple_loss=0.2618, pruned_loss=0.04417, over 16826.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2417, pruned_loss=0.02768, over 4727902.46 frames. 
], batch size: 177, lr: 3.31e-03, grad_scale: 32.0 2023-12-04 21:30:45,186 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.88 vs. limit=10.0 2023-12-04 21:30:59,516 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=440966.6666666667, ans=0.125 2023-12-04 21:31:00,694 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=440966.6666666667, ans=0.125 2023-12-04 21:31:06,659 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=441033.3333333333, ans=0.125 2023-12-04 21:31:18,362 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=441100.0, ans=0.05 2023-12-04 21:31:19,355 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=441100.0, ans=0.2 2023-12-04 21:31:19,458 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=441100.0, ans=0.125 2023-12-04 21:31:26,744 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.10 vs. limit=10.0 2023-12-04 21:31:40,143 INFO [train.py:1087] (2/4) Epoch 74, batch 850, loss[loss=0.14, simple_loss=0.2348, pruned_loss=0.02267, over 24552.00 frames. ], tot_loss[loss=0.1484, simple_loss=0.2416, pruned_loss=0.02756, over 4756217.01 frames. ], batch size: 62, lr: 3.30e-03, grad_scale: 32.0 2023-12-04 21:32:14,707 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.116e+02 1.298e+02 1.361e+02 1.454e+02 1.998e+02, threshold=2.723e+02, percent-clipped=0.0 2023-12-04 21:32:16,074 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=441433.3333333333, ans=0.125 2023-12-04 21:32:23,310 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=441433.3333333333, ans=0.125 2023-12-04 21:32:50,931 INFO [train.py:1087] (2/4) Epoch 75, batch 0, loss[loss=0.1436, simple_loss=0.2377, pruned_loss=0.0247, over 24856.00 frames. ], tot_loss[loss=0.1436, simple_loss=0.2377, pruned_loss=0.0247, over 24856.00 frames. ], batch size: 68, lr: 3.28e-03, grad_scale: 32.0 2023-12-04 21:32:50,932 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 21:33:05,652 INFO [train.py:1119] (2/4) Epoch 75, validation: loss=0.1512, simple_loss=0.247, pruned_loss=0.02763, over 944034.00 frames. 
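The loss figures reported by train.py throughout this section are consistent with the total loss being a weighted combination of the simple and pruned transducer losses, with a 0.5 weight on the simple term (for the validation record just above, 0.5 * 0.247 + 0.02763 ≈ 0.1512). The snippet below is a minimal check of that relationship against a few values copied from this log; the 0.5 weight is inferred from the logged numbers, not taken from the training code.

# Sketch only: verify loss ≈ 0.5 * simple_loss + pruned_loss for values copied from this log.
logged = [
    (0.1512, 0.2470, 0.02763),  # Epoch 75, validation
    (0.1424, 0.2366, 0.02408),  # Epoch 75, batch 0
    (0.1484, 0.2416, 0.02756),  # Epoch 74, batch 850, tot_loss
]
SIMPLE_LOSS_WEIGHT = 0.5  # assumed weight, inferred from the logged numbers

for loss, simple_loss, pruned_loss in logged:
    reconstructed = SIMPLE_LOSS_WEIGHT * simple_loss + pruned_loss
    # the logged values carry ~4 significant digits, so allow a small rounding slack
    assert abs(reconstructed - loss) < 1e-3, (loss, reconstructed)
    print(f"loss={loss:.4f}  0.5*simple_loss+pruned_loss={reconstructed:.4f}")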
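The optim.py records ("Clipping_scale=2.0, grad-norm quartiles ... threshold=...") in this section all show a threshold equal to the clipping scale times the middle quartile, i.e. roughly twice the median gradient norm (2.0 * 1.372e+02 ≈ 2.743e+02). A minimal sketch of that relationship, assuming the five logged values are the 0/25/50/75/100 percentiles of recent gradient norms (an assumption on my part, not stated in the log):

# Sketch only: reconstruct the logged clipping threshold from the quartile record at 21:07:54.
clipping_scale = 2.0
grad_norm_quartiles = [1.143e2, 1.262e2, 1.372e2, 1.499e2, 1.934e2]  # copied from the log
median_grad_norm = grad_norm_quartiles[2]        # middle entry = median
threshold = clipping_scale * median_grad_norm
print(f"threshold={threshold:.3e}")  # 2.744e+02 vs. the logged threshold=2.743e+02 (rounding)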
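The scaling.py "ScheduledFloat" records report, for a named sub-module parameter, the value ("ans") in effect at the current batch_count; by this point in training most skip rates have decayed to 0.0 and the bypass scale_min values sit at 0.2. The sketch below is a hypothetical, self-contained illustration of such a batch-count-keyed piecewise-linear schedule; the breakpoints are invented for illustration, and the real schedules are defined in scaling.py rather than visible in this log.

# Hypothetical sketch of a piecewise-linear, batch-count-keyed schedule like the
# "ScheduledFloat ... batch_count=..., ans=..." records above. Breakpoints are illustrative only.
def scheduled_float(batch_count, points):
    """Piecewise-linear interpolation of a (batch_count, value) schedule."""
    points = sorted(points)
    if batch_count <= points[0][0]:
        return points[0][1]
    if batch_count >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= batch_count <= x1:
            return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)

# Example: a skip rate decaying linearly from 0.5 to 0.0 over the first 20k batches has
# long since reached 0.0 by batch_count ~ 435000, matching the ans=0.0 entries in this section.
print(scheduled_float(435000.0, [(0.0, 0.5), (20000.0, 0.0)]))  # -> 0.0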
2023-12-04 21:33:05,653 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 21:33:10,840 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=441533.3333333333, ans=0.125 2023-12-04 21:33:18,527 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=441600.0, ans=0.125 2023-12-04 21:33:23,701 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=441600.0, ans=0.125 2023-12-04 21:34:02,875 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.08 vs. limit=15.0 2023-12-04 21:34:11,147 INFO [train.py:1087] (2/4) Epoch 75, batch 50, loss[loss=0.1601, simple_loss=0.255, pruned_loss=0.03257, over 23457.00 frames. ], tot_loss[loss=0.1503, simple_loss=0.2442, pruned_loss=0.02821, over 1083283.89 frames. ], batch size: 94, lr: 3.28e-03, grad_scale: 32.0 2023-12-04 21:34:36,735 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=442000.0, ans=0.125 2023-12-04 21:34:38,136 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=442000.0, ans=0.0 2023-12-04 21:34:55,352 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.136e+02 1.309e+02 1.399e+02 1.562e+02 2.287e+02, threshold=2.798e+02, percent-clipped=0.0 2023-12-04 21:34:58,243 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=442066.6666666667, ans=0.0 2023-12-04 21:35:07,459 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=442133.3333333333, ans=0.2 2023-12-04 21:35:07,699 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.34 vs. limit=15.0 2023-12-04 21:35:15,755 INFO [train.py:1087] (2/4) Epoch 75, batch 100, loss[loss=0.1473, simple_loss=0.2409, pruned_loss=0.02689, over 24559.00 frames. ], tot_loss[loss=0.1481, simple_loss=0.242, pruned_loss=0.02711, over 1927836.79 frames. ], batch size: 62, lr: 3.28e-03, grad_scale: 32.0 2023-12-04 21:35:17,308 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=442200.0, ans=0.125 2023-12-04 21:35:18,946 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=442200.0, ans=0.0 2023-12-04 21:35:44,221 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=442333.3333333333, ans=0.0 2023-12-04 21:35:53,969 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=442400.0, ans=0.125 2023-12-04 21:36:11,675 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=442466.6666666667, ans=0.125 2023-12-04 21:36:21,994 INFO [train.py:1087] (2/4) Epoch 75, batch 150, loss[loss=0.1607, simple_loss=0.2528, pruned_loss=0.03427, over 24174.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.242, pruned_loss=0.02753, over 2566826.20 frames. 
], batch size: 82, lr: 3.28e-03, grad_scale: 32.0 2023-12-04 21:36:32,186 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=442533.3333333333, ans=0.1 2023-12-04 21:36:42,589 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.58 vs. limit=6.0 2023-12-04 21:36:43,062 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=442600.0, ans=0.125 2023-12-04 21:37:03,515 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=442733.3333333333, ans=0.125 2023-12-04 21:37:05,694 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.108e+02 1.287e+02 1.378e+02 1.496e+02 1.803e+02, threshold=2.756e+02, percent-clipped=0.0 2023-12-04 21:37:06,079 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=442733.3333333333, ans=0.0 2023-12-04 21:37:20,106 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=442800.0, ans=0.125 2023-12-04 21:37:25,761 INFO [train.py:1087] (2/4) Epoch 75, batch 200, loss[loss=0.1473, simple_loss=0.2432, pruned_loss=0.02565, over 23455.00 frames. ], tot_loss[loss=0.1489, simple_loss=0.2422, pruned_loss=0.02784, over 3054053.17 frames. ], batch size: 94, lr: 3.28e-03, grad_scale: 32.0 2023-12-04 21:37:27,345 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=442866.6666666667, ans=0.2 2023-12-04 21:38:24,194 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=443133.3333333333, ans=0.09899494936611666 2023-12-04 21:38:29,998 INFO [train.py:1087] (2/4) Epoch 75, batch 250, loss[loss=0.1421, simple_loss=0.238, pruned_loss=0.02315, over 24866.00 frames. ], tot_loss[loss=0.1492, simple_loss=0.2424, pruned_loss=0.02795, over 3429968.04 frames. ], batch size: 68, lr: 3.27e-03, grad_scale: 32.0 2023-12-04 21:38:32,895 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=443200.0, ans=0.125 2023-12-04 21:38:52,578 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=443266.6666666667, ans=0.125 2023-12-04 21:39:04,904 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.57 vs. limit=15.0 2023-12-04 21:39:12,907 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.106e+02 1.284e+02 1.356e+02 1.481e+02 1.863e+02, threshold=2.712e+02, percent-clipped=0.0 2023-12-04 21:39:31,586 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=443533.3333333333, ans=0.05 2023-12-04 21:39:33,070 INFO [train.py:1087] (2/4) Epoch 75, batch 300, loss[loss=0.1573, simple_loss=0.2485, pruned_loss=0.03302, over 24335.00 frames. ], tot_loss[loss=0.1486, simple_loss=0.242, pruned_loss=0.0276, over 3748279.78 frames. 
], batch size: 79, lr: 3.27e-03, grad_scale: 32.0 2023-12-04 21:39:33,325 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=443533.3333333333, ans=0.0 2023-12-04 21:39:48,286 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=443600.0, ans=0.125 2023-12-04 21:39:53,313 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-12-04 21:40:24,447 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=443800.0, ans=0.0 2023-12-04 21:40:37,646 INFO [train.py:1087] (2/4) Epoch 75, batch 350, loss[loss=0.1464, simple_loss=0.2429, pruned_loss=0.02495, over 24809.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.242, pruned_loss=0.02752, over 3985993.64 frames. ], batch size: 72, lr: 3.27e-03, grad_scale: 32.0 2023-12-04 21:40:50,020 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=443933.3333333333, ans=0.0 2023-12-04 21:40:54,033 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=443933.3333333333, ans=0.125 2023-12-04 21:41:03,137 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.17 vs. limit=6.0 2023-12-04 21:41:12,853 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=444000.0, ans=0.125 2023-12-04 21:41:13,978 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=444000.0, ans=0.1 2023-12-04 21:41:21,266 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.100e+02 1.270e+02 1.334e+02 1.449e+02 1.837e+02, threshold=2.668e+02, percent-clipped=0.0 2023-12-04 21:41:42,459 INFO [train.py:1087] (2/4) Epoch 75, batch 400, loss[loss=0.1589, simple_loss=0.2524, pruned_loss=0.03272, over 24782.00 frames. ], tot_loss[loss=0.1487, simple_loss=0.2422, pruned_loss=0.02762, over 4174051.20 frames. ], batch size: 62, lr: 3.27e-03, grad_scale: 32.0 2023-12-04 21:41:51,444 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=444200.0, ans=0.125 2023-12-04 21:42:27,389 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=444400.0, ans=0.0 2023-12-04 21:42:28,845 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 21:42:47,176 INFO [train.py:1087] (2/4) Epoch 75, batch 450, loss[loss=0.1497, simple_loss=0.2425, pruned_loss=0.02848, over 24082.00 frames. ], tot_loss[loss=0.1492, simple_loss=0.2426, pruned_loss=0.02786, over 4305500.21 frames. ], batch size: 87, lr: 3.27e-03, grad_scale: 32.0 2023-12-04 21:42:47,469 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=444533.3333333333, ans=0.1 2023-12-04 21:43:00,158 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.87 vs. 
limit=22.5 2023-12-04 21:43:10,803 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=444666.6666666667, ans=0.05 2023-12-04 21:43:23,092 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=444666.6666666667, ans=0.125 2023-12-04 21:43:26,159 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=444733.3333333333, ans=0.125 2023-12-04 21:43:30,951 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.121e+02 1.270e+02 1.368e+02 1.485e+02 2.151e+02, threshold=2.736e+02, percent-clipped=0.0 2023-12-04 21:43:32,687 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=444733.3333333333, ans=0.125 2023-12-04 21:43:35,429 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.98 vs. limit=22.5 2023-12-04 21:43:36,483 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.42 vs. limit=15.0 2023-12-04 21:43:37,490 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=444800.0, ans=0.125 2023-12-04 21:43:49,709 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=444866.6666666667, ans=0.09899494936611666 2023-12-04 21:43:50,521 INFO [train.py:1087] (2/4) Epoch 75, batch 500, loss[loss=0.145, simple_loss=0.2428, pruned_loss=0.02362, over 24685.00 frames. ], tot_loss[loss=0.1488, simple_loss=0.2421, pruned_loss=0.02772, over 4422692.63 frames. ], batch size: 74, lr: 3.27e-03, grad_scale: 32.0 2023-12-04 21:44:04,033 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=444933.3333333333, ans=0.1 2023-12-04 21:44:22,933 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=445000.0, ans=0.2 2023-12-04 21:44:28,473 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=445000.0, ans=0.125 2023-12-04 21:44:34,210 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=445066.6666666667, ans=0.0 2023-12-04 21:44:35,342 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=445066.6666666667, ans=0.125 2023-12-04 21:44:43,796 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.48 vs. limit=22.5 2023-12-04 21:44:44,746 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=445133.3333333333, ans=0.0 2023-12-04 21:44:54,578 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=445133.3333333333, ans=0.2 2023-12-04 21:44:56,532 INFO [train.py:1087] (2/4) Epoch 75, batch 550, loss[loss=0.148, simple_loss=0.2378, pruned_loss=0.02913, over 24574.00 frames. 
], tot_loss[loss=0.1488, simple_loss=0.2423, pruned_loss=0.02763, over 4508324.68 frames. ], batch size: 64, lr: 3.27e-03, grad_scale: 32.0 2023-12-04 21:44:56,804 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=445200.0, ans=0.125 2023-12-04 21:44:59,409 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=445200.0, ans=0.0 2023-12-04 21:45:17,709 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=445266.6666666667, ans=0.1 2023-12-04 21:45:24,706 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=445333.3333333333, ans=0.0 2023-12-04 21:45:27,026 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=445333.3333333333, ans=0.125 2023-12-04 21:45:40,183 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.137e+02 1.270e+02 1.369e+02 1.475e+02 1.946e+02, threshold=2.737e+02, percent-clipped=0.0 2023-12-04 21:45:41,644 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.76 vs. limit=6.0 2023-12-04 21:45:49,588 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=445466.6666666667, ans=0.04949747468305833 2023-12-04 21:45:55,763 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=445466.6666666667, ans=0.125 2023-12-04 21:46:02,144 INFO [train.py:1087] (2/4) Epoch 75, batch 600, loss[loss=0.1584, simple_loss=0.2523, pruned_loss=0.0323, over 24777.00 frames. ], tot_loss[loss=0.1484, simple_loss=0.2417, pruned_loss=0.02752, over 4577332.30 frames. ], batch size: 73, lr: 3.27e-03, grad_scale: 32.0 2023-12-04 21:46:06,328 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=445533.3333333333, ans=0.0 2023-12-04 21:46:25,192 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.49 vs. limit=6.0 2023-12-04 21:46:48,879 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=445733.3333333333, ans=0.04949747468305833 2023-12-04 21:47:07,608 INFO [train.py:1087] (2/4) Epoch 75, batch 650, loss[loss=0.1515, simple_loss=0.2446, pruned_loss=0.02923, over 24570.00 frames. ], tot_loss[loss=0.1487, simple_loss=0.2419, pruned_loss=0.02772, over 4616240.31 frames. 
], batch size: 65, lr: 3.27e-03, grad_scale: 32.0 2023-12-04 21:47:09,227 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=445866.6666666667, ans=0.07 2023-12-04 21:47:22,182 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=445933.3333333333, ans=0.0 2023-12-04 21:47:22,363 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=445933.3333333333, ans=0.125 2023-12-04 21:47:51,535 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.131e+02 1.294e+02 1.373e+02 1.463e+02 1.900e+02, threshold=2.746e+02, percent-clipped=0.0 2023-12-04 21:48:04,898 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=446133.3333333333, ans=0.125 2023-12-04 21:48:12,772 INFO [train.py:1087] (2/4) Epoch 75, batch 700, loss[loss=0.1518, simple_loss=0.2439, pruned_loss=0.02982, over 24663.00 frames. ], tot_loss[loss=0.1482, simple_loss=0.2415, pruned_loss=0.02741, over 4670408.17 frames. ], batch size: 74, lr: 3.26e-03, grad_scale: 32.0 2023-12-04 21:48:14,853 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=446200.0, ans=0.2 2023-12-04 21:48:23,541 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=446200.0, ans=0.035 2023-12-04 21:48:43,114 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=446333.3333333333, ans=0.125 2023-12-04 21:49:05,388 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=446466.6666666667, ans=0.0 2023-12-04 21:49:17,445 INFO [train.py:1087] (2/4) Epoch 75, batch 750, loss[loss=0.1449, simple_loss=0.2375, pruned_loss=0.02618, over 24775.00 frames. ], tot_loss[loss=0.1484, simple_loss=0.2418, pruned_loss=0.02749, over 4698363.50 frames. ], batch size: 71, lr: 3.26e-03, grad_scale: 16.0 2023-12-04 21:49:17,775 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=446533.3333333333, ans=0.125 2023-12-04 21:49:22,875 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.67 vs. limit=10.0 2023-12-04 21:49:24,783 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=446533.3333333333, ans=0.125 2023-12-04 21:49:31,263 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=446600.0, ans=0.125 2023-12-04 21:50:03,649 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.085e+02 1.259e+02 1.344e+02 1.448e+02 1.875e+02, threshold=2.689e+02, percent-clipped=0.0 2023-12-04 21:50:21,841 INFO [train.py:1087] (2/4) Epoch 75, batch 800, loss[loss=0.1485, simple_loss=0.2406, pruned_loss=0.02821, over 24761.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2418, pruned_loss=0.0276, over 4713262.98 frames. ], batch size: 70, lr: 3.26e-03, grad_scale: 32.0 2023-12-04 21:50:54,637 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.76 vs. 
limit=15.0 2023-12-04 21:51:04,371 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=447066.6666666667, ans=0.125 2023-12-04 21:51:09,970 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=447133.3333333333, ans=10.0 2023-12-04 21:51:20,974 INFO [train.py:1087] (2/4) Epoch 75, batch 850, loss[loss=0.1607, simple_loss=0.2529, pruned_loss=0.03431, over 24491.00 frames. ], tot_loss[loss=0.1488, simple_loss=0.2421, pruned_loss=0.02776, over 4716520.87 frames. ], batch size: 77, lr: 3.26e-03, grad_scale: 32.0 2023-12-04 21:51:26,767 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=447200.0, ans=0.09899494936611666 2023-12-04 21:51:37,092 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=447266.6666666667, ans=0.09899494936611666 2023-12-04 21:51:45,900 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=447333.3333333333, ans=0.1 2023-12-04 21:51:51,575 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=447333.3333333333, ans=0.2 2023-12-04 21:51:54,262 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.96 vs. limit=22.5 2023-12-04 21:52:00,230 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.158e+02 1.286e+02 1.370e+02 1.501e+02 2.135e+02, threshold=2.740e+02, percent-clipped=0.0 2023-12-04 21:52:29,082 INFO [train.py:1087] (2/4) Epoch 76, batch 0, loss[loss=0.1424, simple_loss=0.2366, pruned_loss=0.02408, over 24766.00 frames. ], tot_loss[loss=0.1424, simple_loss=0.2366, pruned_loss=0.02408, over 24766.00 frames. ], batch size: 70, lr: 3.24e-03, grad_scale: 32.0 2023-12-04 21:52:29,083 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 21:52:43,479 INFO [train.py:1119] (2/4) Epoch 76, validation: loss=0.1514, simple_loss=0.2471, pruned_loss=0.02786, over 944034.00 frames. 2023-12-04 21:52:43,480 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 21:52:43,818 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=447500.0, ans=0.0 2023-12-04 21:52:44,990 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=447500.0, ans=0.1 2023-12-04 21:52:51,545 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.86 vs. limit=12.0 2023-12-04 21:52:52,535 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=447500.0, ans=0.2 2023-12-04 21:52:54,775 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=447566.6666666667, ans=0.0 2023-12-04 21:53:44,875 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.09 vs. 
limit=15.0 2023-12-04 21:53:45,882 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=447833.3333333333, ans=0.0 2023-12-04 21:53:47,235 INFO [train.py:1087] (2/4) Epoch 76, batch 50, loss[loss=0.1411, simple_loss=0.2391, pruned_loss=0.02159, over 24803.00 frames. ], tot_loss[loss=0.1476, simple_loss=0.2403, pruned_loss=0.02742, over 1080410.46 frames. ], batch size: 72, lr: 3.24e-03, grad_scale: 16.0 2023-12-04 21:53:57,725 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.77 vs. limit=22.5 2023-12-04 21:54:10,554 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=447966.6666666667, ans=0.0 2023-12-04 21:54:10,994 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.whiten.whitening_limit, batch_count=447966.6666666667, ans=12.0 2023-12-04 21:54:27,811 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=448033.3333333333, ans=0.05 2023-12-04 21:54:29,373 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-12-04 21:54:38,799 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.125e+02 1.295e+02 1.384e+02 1.507e+02 2.552e+02, threshold=2.768e+02, percent-clipped=0.0 2023-12-04 21:54:49,815 INFO [train.py:1087] (2/4) Epoch 76, batch 100, loss[loss=0.1457, simple_loss=0.2419, pruned_loss=0.02471, over 23363.00 frames. ], tot_loss[loss=0.1483, simple_loss=0.2421, pruned_loss=0.02724, over 1909037.06 frames. ], batch size: 94, lr: 3.24e-03, grad_scale: 16.0 2023-12-04 21:54:59,982 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.61 vs. limit=15.0 2023-12-04 21:55:04,084 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.87 vs. limit=12.0 2023-12-04 21:55:04,722 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=448233.3333333333, ans=0.1 2023-12-04 21:55:07,060 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=448233.3333333333, ans=0.2 2023-12-04 21:55:19,417 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=448300.0, ans=0.0 2023-12-04 21:55:48,755 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.01 vs. limit=6.0 2023-12-04 21:55:54,159 INFO [train.py:1087] (2/4) Epoch 76, batch 150, loss[loss=0.1486, simple_loss=0.2425, pruned_loss=0.02739, over 24703.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2421, pruned_loss=0.02749, over 2552112.84 frames. ], batch size: 74, lr: 3.23e-03, grad_scale: 16.0 2023-12-04 21:56:06,750 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=448566.6666666667, ans=0.1 2023-12-04 21:56:22,113 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.25 vs. 
limit=15.0 2023-12-04 21:56:24,010 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=448633.3333333333, ans=0.125 2023-12-04 21:56:26,771 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=448633.3333333333, ans=0.1 2023-12-04 21:56:41,589 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=448700.0, ans=0.1 2023-12-04 21:56:47,483 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.122e+02 1.239e+02 1.337e+02 1.421e+02 1.871e+02, threshold=2.674e+02, percent-clipped=0.0 2023-12-04 21:56:57,550 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.38 vs. limit=15.0 2023-12-04 21:56:59,066 INFO [train.py:1087] (2/4) Epoch 76, batch 200, loss[loss=0.1522, simple_loss=0.2439, pruned_loss=0.03024, over 23487.00 frames. ], tot_loss[loss=0.1477, simple_loss=0.2411, pruned_loss=0.02719, over 3052974.77 frames. ], batch size: 94, lr: 3.23e-03, grad_scale: 16.0 2023-12-04 21:57:19,996 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=448900.0, ans=0.0 2023-12-04 21:57:26,737 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=448966.6666666667, ans=0.125 2023-12-04 21:57:32,072 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.15 vs. limit=6.0 2023-12-04 21:57:36,080 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=448966.6666666667, ans=0.04949747468305833 2023-12-04 21:57:44,830 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.95 vs. limit=15.0 2023-12-04 21:57:59,361 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.37 vs. limit=10.0 2023-12-04 21:58:03,440 INFO [train.py:1087] (2/4) Epoch 76, batch 250, loss[loss=0.1382, simple_loss=0.2329, pruned_loss=0.02172, over 24727.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.2405, pruned_loss=0.02705, over 3455342.77 frames. 
], batch size: 67, lr: 3.23e-03, grad_scale: 16.0 2023-12-04 21:58:04,871 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=449166.6666666667, ans=0.1 2023-12-04 21:58:05,033 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=449166.6666666667, ans=0.1 2023-12-04 21:58:06,332 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=449166.6666666667, ans=0.125 2023-12-04 21:58:33,028 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=449300.0, ans=0.125 2023-12-04 21:58:55,907 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.125e+02 1.278e+02 1.372e+02 1.483e+02 1.898e+02, threshold=2.745e+02, percent-clipped=0.0 2023-12-04 21:59:06,999 INFO [train.py:1087] (2/4) Epoch 76, batch 300, loss[loss=0.1533, simple_loss=0.2423, pruned_loss=0.03211, over 24296.00 frames. ], tot_loss[loss=0.1478, simple_loss=0.2409, pruned_loss=0.0273, over 3762069.80 frames. ], batch size: 79, lr: 3.23e-03, grad_scale: 16.0 2023-12-04 21:59:13,288 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=449500.0, ans=0.125 2023-12-04 21:59:15,618 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=449500.0, ans=0.0 2023-12-04 21:59:23,965 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=449566.6666666667, ans=0.125 2023-12-04 22:00:02,561 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=449766.6666666667, ans=0.1 2023-12-04 22:00:09,312 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=449766.6666666667, ans=0.125 2023-12-04 22:00:11,276 INFO [train.py:1087] (2/4) Epoch 76, batch 350, loss[loss=0.1386, simple_loss=0.234, pruned_loss=0.02164, over 24772.00 frames. ], tot_loss[loss=0.1482, simple_loss=0.2412, pruned_loss=0.02757, over 3977538.80 frames. ], batch size: 70, lr: 3.23e-03, grad_scale: 16.0 2023-12-04 22:00:38,065 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=449966.6666666667, ans=0.07 2023-12-04 22:00:40,881 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=449966.6666666667, ans=0.125 2023-12-04 22:01:04,961 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.128e+02 1.318e+02 1.386e+02 1.479e+02 1.987e+02, threshold=2.773e+02, percent-clipped=0.0 2023-12-04 22:01:09,955 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=450100.0, ans=0.125 2023-12-04 22:01:16,011 INFO [train.py:1087] (2/4) Epoch 76, batch 400, loss[loss=0.1377, simple_loss=0.2326, pruned_loss=0.02138, over 24764.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2415, pruned_loss=0.02774, over 4164720.95 frames. 
], batch size: 70, lr: 3.23e-03, grad_scale: 32.0 2023-12-04 22:01:21,694 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=450166.6666666667, ans=0.125 2023-12-04 22:01:25,175 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=450166.6666666667, ans=10.0 2023-12-04 22:01:26,912 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=450166.6666666667, ans=0.025 2023-12-04 22:01:37,030 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=450233.3333333333, ans=0.125 2023-12-04 22:01:58,591 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=450366.6666666667, ans=0.125 2023-12-04 22:02:03,474 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=450366.6666666667, ans=0.0 2023-12-04 22:02:21,001 INFO [train.py:1087] (2/4) Epoch 76, batch 450, loss[loss=0.1415, simple_loss=0.2324, pruned_loss=0.0253, over 24573.00 frames. ], tot_loss[loss=0.1483, simple_loss=0.2413, pruned_loss=0.02768, over 4295650.67 frames. ], batch size: 64, lr: 3.23e-03, grad_scale: 32.0 2023-12-04 22:02:42,899 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=450566.6666666667, ans=0.0 2023-12-04 22:02:45,104 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.77 vs. limit=15.0 2023-12-04 22:02:46,257 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=450633.3333333333, ans=0.125 2023-12-04 22:03:08,770 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=450700.0, ans=0.0 2023-12-04 22:03:13,399 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.145e+02 1.288e+02 1.371e+02 1.459e+02 1.960e+02, threshold=2.742e+02, percent-clipped=0.0 2023-12-04 22:03:25,618 INFO [train.py:1087] (2/4) Epoch 76, batch 500, loss[loss=0.1494, simple_loss=0.2418, pruned_loss=0.02853, over 24763.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2415, pruned_loss=0.02774, over 4414242.98 frames. ], batch size: 70, lr: 3.23e-03, grad_scale: 32.0 2023-12-04 22:03:42,349 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=450900.0, ans=0.1 2023-12-04 22:04:22,837 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=451100.0, ans=0.2 2023-12-04 22:04:28,669 INFO [train.py:1087] (2/4) Epoch 76, batch 550, loss[loss=0.1536, simple_loss=0.2458, pruned_loss=0.0307, over 24476.00 frames. ], tot_loss[loss=0.1488, simple_loss=0.2418, pruned_loss=0.0279, over 4493733.37 frames. 
], batch size: 75, lr: 3.22e-03, grad_scale: 32.0 2023-12-04 22:04:30,326 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=451166.6666666667, ans=0.125 2023-12-04 22:04:46,925 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=451233.3333333333, ans=0.2 2023-12-04 22:05:02,496 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=451300.0, ans=0.0 2023-12-04 22:05:05,691 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.74 vs. limit=15.0 2023-12-04 22:05:21,737 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.076e+02 1.266e+02 1.377e+02 1.479e+02 2.187e+02, threshold=2.754e+02, percent-clipped=0.0 2023-12-04 22:05:24,436 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=451433.3333333333, ans=0.125 2023-12-04 22:05:27,119 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2023-12-04 22:05:31,841 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=451500.0, ans=0.125 2023-12-04 22:05:32,883 INFO [train.py:1087] (2/4) Epoch 76, batch 600, loss[loss=0.1713, simple_loss=0.2538, pruned_loss=0.04438, over 16569.00 frames. ], tot_loss[loss=0.149, simple_loss=0.2421, pruned_loss=0.02793, over 4536138.68 frames. ], batch size: 176, lr: 3.22e-03, grad_scale: 32.0 2023-12-04 22:05:47,726 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=451566.6666666667, ans=0.125 2023-12-04 22:05:54,965 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.07 vs. limit=6.0 2023-12-04 22:05:55,899 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=451566.6666666667, ans=0.1 2023-12-04 22:06:37,344 INFO [train.py:1087] (2/4) Epoch 76, batch 650, loss[loss=0.1388, simple_loss=0.2319, pruned_loss=0.0229, over 24801.00 frames. ], tot_loss[loss=0.1489, simple_loss=0.2421, pruned_loss=0.02783, over 4608587.45 frames. ], batch size: 73, lr: 3.22e-03, grad_scale: 32.0 2023-12-04 22:06:38,897 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=451833.3333333333, ans=0.05 2023-12-04 22:06:41,236 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=451833.3333333333, ans=0.125 2023-12-04 22:06:51,239 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.97 vs. limit=15.0 2023-12-04 22:06:52,143 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=451900.0, ans=0.125 2023-12-04 22:07:06,815 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.83 vs. 
limit=22.5 2023-12-04 22:07:09,957 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.32 vs. limit=22.5 2023-12-04 22:07:13,297 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=451966.6666666667, ans=0.125 2023-12-04 22:07:14,401 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 22:07:15,493 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=452033.3333333333, ans=0.125 2023-12-04 22:07:22,791 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=452033.3333333333, ans=0.04949747468305833 2023-12-04 22:07:25,118 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=452033.3333333333, ans=0.2 2023-12-04 22:07:28,157 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=15.0 2023-12-04 22:07:29,897 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.170e+02 1.268e+02 1.338e+02 1.426e+02 2.110e+02, threshold=2.676e+02, percent-clipped=0.0 2023-12-04 22:07:40,673 INFO [train.py:1087] (2/4) Epoch 76, batch 700, loss[loss=0.1418, simple_loss=0.2321, pruned_loss=0.02573, over 24764.00 frames. ], tot_loss[loss=0.1489, simple_loss=0.2421, pruned_loss=0.02787, over 4651273.23 frames. ], batch size: 70, lr: 3.22e-03, grad_scale: 16.0 2023-12-04 22:07:52,204 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=452233.3333333333, ans=0.125 2023-12-04 22:08:02,067 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.42 vs. limit=15.0 2023-12-04 22:08:09,932 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=452300.0, ans=0.1 2023-12-04 22:08:15,726 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.07 vs. limit=6.0 2023-12-04 22:08:43,692 INFO [train.py:1087] (2/4) Epoch 76, batch 750, loss[loss=0.1554, simple_loss=0.2473, pruned_loss=0.03178, over 24482.00 frames. ], tot_loss[loss=0.1488, simple_loss=0.2419, pruned_loss=0.02781, over 4694737.43 frames. ], batch size: 75, lr: 3.22e-03, grad_scale: 16.0 2023-12-04 22:09:14,243 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=452633.3333333333, ans=0.035 2023-12-04 22:09:36,903 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.131e+02 1.251e+02 1.333e+02 1.395e+02 1.598e+02, threshold=2.666e+02, percent-clipped=0.0 2023-12-04 22:09:46,926 INFO [train.py:1087] (2/4) Epoch 76, batch 800, loss[loss=0.1468, simple_loss=0.2416, pruned_loss=0.02597, over 24808.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2417, pruned_loss=0.02765, over 4716565.10 frames. 
], batch size: 72, lr: 3.22e-03, grad_scale: 32.0 2023-12-04 22:09:53,564 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=452833.3333333333, ans=0.1 2023-12-04 22:10:10,121 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.01 vs. limit=15.0 2023-12-04 22:10:30,835 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=453033.3333333333, ans=0.1 2023-12-04 22:10:37,527 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=453100.0, ans=0.0 2023-12-04 22:10:41,778 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=453100.0, ans=0.0 2023-12-04 22:10:43,674 INFO [train.py:1087] (2/4) Epoch 76, batch 850, loss[loss=0.1582, simple_loss=0.2502, pruned_loss=0.03314, over 23998.00 frames. ], tot_loss[loss=0.1489, simple_loss=0.242, pruned_loss=0.02786, over 4740818.39 frames. ], batch size: 87, lr: 3.22e-03, grad_scale: 32.0 2023-12-04 22:10:50,822 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=453166.6666666667, ans=0.125 2023-12-04 22:11:00,979 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=453233.3333333333, ans=0.0 2023-12-04 22:11:23,009 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=453366.6666666667, ans=0.125 2023-12-04 22:11:33,616 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.179e+02 1.275e+02 1.393e+02 1.516e+02 2.346e+02, threshold=2.786e+02, percent-clipped=0.0 2023-12-04 22:11:45,018 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=453466.6666666667, ans=0.125 2023-12-04 22:11:53,990 INFO [train.py:1087] (2/4) Epoch 77, batch 0, loss[loss=0.136, simple_loss=0.2323, pruned_loss=0.01986, over 24791.00 frames. ], tot_loss[loss=0.136, simple_loss=0.2323, pruned_loss=0.01986, over 24791.00 frames. ], batch size: 71, lr: 3.20e-03, grad_scale: 32.0 2023-12-04 22:11:53,992 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 22:12:07,720 INFO [train.py:1119] (2/4) Epoch 77, validation: loss=0.1509, simple_loss=0.2467, pruned_loss=0.02756, over 944034.00 frames. 2023-12-04 22:12:07,721 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 22:12:26,969 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=453533.3333333333, ans=0.1 2023-12-04 22:12:32,798 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=453600.0, ans=0.125 2023-12-04 22:12:42,096 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=453600.0, ans=0.0 2023-12-04 22:12:43,337 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=453600.0, ans=0.1 2023-12-04 22:12:44,689 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.48 vs. 
limit=15.0 2023-12-04 22:12:50,083 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=453666.6666666667, ans=0.125 2023-12-04 22:12:57,217 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=453733.3333333333, ans=0.2 2023-12-04 22:13:04,225 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=453733.3333333333, ans=0.0 2023-12-04 22:13:09,710 INFO [train.py:1087] (2/4) Epoch 77, batch 50, loss[loss=0.1441, simple_loss=0.2417, pruned_loss=0.02329, over 24571.00 frames. ], tot_loss[loss=0.1468, simple_loss=0.2407, pruned_loss=0.02643, over 1080425.69 frames. ], batch size: 64, lr: 3.19e-03, grad_scale: 16.0 2023-12-04 22:13:39,529 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.37 vs. limit=22.5 2023-12-04 22:13:53,037 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=454000.0, ans=0.125 2023-12-04 22:13:54,289 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=454000.0, ans=0.125 2023-12-04 22:13:58,967 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=454066.6666666667, ans=0.0 2023-12-04 22:14:09,210 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.144e+02 1.276e+02 1.348e+02 1.437e+02 2.275e+02, threshold=2.696e+02, percent-clipped=0.0 2023-12-04 22:14:11,598 INFO [train.py:1087] (2/4) Epoch 77, batch 100, loss[loss=0.1508, simple_loss=0.245, pruned_loss=0.02836, over 24264.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2408, pruned_loss=0.02664, over 1916842.53 frames. ], batch size: 79, lr: 3.19e-03, grad_scale: 16.0 2023-12-04 22:14:17,625 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=454133.3333333333, ans=0.125 2023-12-04 22:14:25,563 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=454200.0, ans=0.125 2023-12-04 22:14:42,800 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=454266.6666666667, ans=0.0 2023-12-04 22:14:47,364 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=454333.3333333333, ans=0.125 2023-12-04 22:14:53,981 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=454333.3333333333, ans=0.2 2023-12-04 22:15:08,655 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=454400.0, ans=0.125 2023-12-04 22:15:12,884 INFO [train.py:1087] (2/4) Epoch 77, batch 150, loss[loss=0.1419, simple_loss=0.2324, pruned_loss=0.02576, over 24563.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.241, pruned_loss=0.0268, over 2562289.89 frames. 
], batch size: 66, lr: 3.19e-03, grad_scale: 16.0 2023-12-04 22:15:48,403 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=454600.0, ans=0.125 2023-12-04 22:15:49,652 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 22:15:51,181 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=454666.6666666667, ans=0.125 2023-12-04 22:16:03,930 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 22:16:09,172 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.70 vs. limit=15.0 2023-12-04 22:16:13,578 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.072e+02 1.230e+02 1.302e+02 1.377e+02 1.981e+02, threshold=2.604e+02, percent-clipped=0.0 2023-12-04 22:16:16,000 INFO [train.py:1087] (2/4) Epoch 77, batch 200, loss[loss=0.1386, simple_loss=0.2312, pruned_loss=0.02295, over 24767.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2406, pruned_loss=0.0267, over 3059570.04 frames. ], batch size: 65, lr: 3.19e-03, grad_scale: 16.0 2023-12-04 22:16:23,141 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.43 vs. limit=15.0 2023-12-04 22:16:48,280 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=454933.3333333333, ans=0.125 2023-12-04 22:17:16,350 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=455066.6666666667, ans=0.2 2023-12-04 22:17:19,421 INFO [train.py:1087] (2/4) Epoch 77, batch 250, loss[loss=0.1564, simple_loss=0.2494, pruned_loss=0.03175, over 24801.00 frames. ], tot_loss[loss=0.1472, simple_loss=0.2409, pruned_loss=0.02674, over 3458847.74 frames. ], batch size: 62, lr: 3.19e-03, grad_scale: 16.0 2023-12-04 22:17:43,612 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=455266.6666666667, ans=0.0 2023-12-04 22:17:56,347 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=455333.3333333333, ans=0.0 2023-12-04 22:18:04,924 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=455333.3333333333, ans=0.05 2023-12-04 22:18:19,578 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.137e+02 1.256e+02 1.393e+02 1.475e+02 1.772e+02, threshold=2.785e+02, percent-clipped=0.0 2023-12-04 22:18:22,521 INFO [train.py:1087] (2/4) Epoch 77, batch 300, loss[loss=0.1373, simple_loss=0.2327, pruned_loss=0.021, over 24864.00 frames. ], tot_loss[loss=0.1471, simple_loss=0.2407, pruned_loss=0.02676, over 3770868.40 frames. 
], batch size: 68, lr: 3.19e-03, grad_scale: 16.0 2023-12-04 22:18:34,012 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=455533.3333333333, ans=0.125 2023-12-04 22:18:38,046 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=455533.3333333333, ans=0.2 2023-12-04 22:18:44,032 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=455533.3333333333, ans=0.0 2023-12-04 22:18:50,352 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 22:18:50,811 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-12-04 22:18:53,174 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-12-04 22:19:05,838 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 22:19:23,114 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 22:19:25,332 INFO [train.py:1087] (2/4) Epoch 77, batch 350, loss[loss=0.1552, simple_loss=0.2459, pruned_loss=0.03223, over 24500.00 frames. ], tot_loss[loss=0.1477, simple_loss=0.2413, pruned_loss=0.02707, over 3998926.51 frames. ], batch size: 77, lr: 3.19e-03, grad_scale: 16.0 2023-12-04 22:19:41,153 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=455866.6666666667, ans=0.1 2023-12-04 22:20:05,939 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.97 vs. limit=10.0 2023-12-04 22:20:09,535 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=456000.0, ans=0.0 2023-12-04 22:20:24,166 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.88 vs. limit=15.0 2023-12-04 22:20:27,039 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.086e+02 1.291e+02 1.396e+02 1.522e+02 1.787e+02, threshold=2.793e+02, percent-clipped=0.0 2023-12-04 22:20:27,673 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.69 vs. limit=6.0 2023-12-04 22:20:29,064 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.25 vs. limit=22.5 2023-12-04 22:20:29,534 INFO [train.py:1087] (2/4) Epoch 77, batch 400, loss[loss=0.144, simple_loss=0.2348, pruned_loss=0.02659, over 24726.00 frames. ], tot_loss[loss=0.148, simple_loss=0.2414, pruned_loss=0.02724, over 4178694.31 frames. 
], batch size: 67, lr: 3.19e-03, grad_scale: 32.0 2023-12-04 22:20:54,881 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=456266.6666666667, ans=0.05 2023-12-04 22:21:32,184 INFO [train.py:1087] (2/4) Epoch 77, batch 450, loss[loss=0.1421, simple_loss=0.2385, pruned_loss=0.0228, over 24605.00 frames. ], tot_loss[loss=0.1475, simple_loss=0.241, pruned_loss=0.02702, over 4319713.23 frames. ], batch size: 68, lr: 3.18e-03, grad_scale: 32.0 2023-12-04 22:21:32,803 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.70 vs. limit=6.0 2023-12-04 22:22:08,165 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=456600.0, ans=0.125 2023-12-04 22:22:13,179 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=456666.6666666667, ans=0.0 2023-12-04 22:22:32,336 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.092e+02 1.262e+02 1.346e+02 1.442e+02 1.753e+02, threshold=2.693e+02, percent-clipped=0.0 2023-12-04 22:22:35,726 INFO [train.py:1087] (2/4) Epoch 77, batch 500, loss[loss=0.1378, simple_loss=0.2321, pruned_loss=0.02172, over 24763.00 frames. ], tot_loss[loss=0.1474, simple_loss=0.2407, pruned_loss=0.02701, over 4441762.73 frames. ], batch size: 70, lr: 3.18e-03, grad_scale: 32.0 2023-12-04 22:23:02,795 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=456933.3333333333, ans=0.125 2023-12-04 22:23:17,385 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=457000.0, ans=0.2 2023-12-04 22:23:26,241 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=457066.6666666667, ans=0.125 2023-12-04 22:23:37,781 INFO [train.py:1087] (2/4) Epoch 77, batch 550, loss[loss=0.1518, simple_loss=0.2474, pruned_loss=0.02807, over 24750.00 frames. ], tot_loss[loss=0.1478, simple_loss=0.2413, pruned_loss=0.02719, over 4513821.22 frames. ], batch size: 70, lr: 3.18e-03, grad_scale: 16.0 2023-12-04 22:23:39,312 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=457133.3333333333, ans=0.125 2023-12-04 22:24:07,472 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=457266.6666666667, ans=0.015 2023-12-04 22:24:35,317 INFO [scaling.py:1022] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.45 vs. limit=8.0 2023-12-04 22:24:40,637 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.140e+02 1.270e+02 1.347e+02 1.504e+02 2.289e+02, threshold=2.694e+02, percent-clipped=0.0 2023-12-04 22:24:41,834 INFO [train.py:1087] (2/4) Epoch 77, batch 600, loss[loss=0.1498, simple_loss=0.2442, pruned_loss=0.02773, over 24802.00 frames. ], tot_loss[loss=0.1476, simple_loss=0.2411, pruned_loss=0.02701, over 4585222.98 frames. 
], batch size: 62, lr: 3.18e-03, grad_scale: 16.0 2023-12-04 22:24:43,523 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=457466.6666666667, ans=0.0 2023-12-04 22:25:01,295 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=457533.3333333333, ans=0.0 2023-12-04 22:25:17,292 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=457600.0, ans=0.1 2023-12-04 22:25:18,380 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=457666.6666666667, ans=0.125 2023-12-04 22:25:26,743 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=457666.6666666667, ans=0.2 2023-12-04 22:25:33,444 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=457733.3333333333, ans=0.125 2023-12-04 22:25:39,287 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=457733.3333333333, ans=0.2 2023-12-04 22:25:44,104 INFO [train.py:1087] (2/4) Epoch 77, batch 650, loss[loss=0.1454, simple_loss=0.2388, pruned_loss=0.02599, over 24547.00 frames. ], tot_loss[loss=0.1478, simple_loss=0.2412, pruned_loss=0.02722, over 4637797.92 frames. ], batch size: 63, lr: 3.18e-03, grad_scale: 16.0 2023-12-04 22:25:53,397 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.82 vs. limit=22.5 2023-12-04 22:26:00,462 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.83 vs. limit=15.0 2023-12-04 22:26:11,541 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=457933.3333333333, ans=0.1 2023-12-04 22:26:38,047 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=458066.6666666667, ans=0.125 2023-12-04 22:26:44,058 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.091e+02 1.277e+02 1.365e+02 1.551e+02 2.272e+02, threshold=2.729e+02, percent-clipped=0.0 2023-12-04 22:26:45,290 INFO [train.py:1087] (2/4) Epoch 77, batch 700, loss[loss=0.1556, simple_loss=0.2473, pruned_loss=0.03199, over 23488.00 frames. ], tot_loss[loss=0.1478, simple_loss=0.2412, pruned_loss=0.02721, over 4654732.29 frames. ], batch size: 94, lr: 3.18e-03, grad_scale: 16.0 2023-12-04 22:26:47,858 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=458133.3333333333, ans=0.125 2023-12-04 22:27:02,601 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=458200.0, ans=0.125 2023-12-04 22:27:10,849 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.78 vs. 
limit=6.0 2023-12-04 22:27:17,265 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=458266.6666666667, ans=0.125 2023-12-04 22:27:45,588 INFO [train.py:1087] (2/4) Epoch 77, batch 750, loss[loss=0.1484, simple_loss=0.2401, pruned_loss=0.02839, over 24712.00 frames. ], tot_loss[loss=0.1482, simple_loss=0.2413, pruned_loss=0.02756, over 4686526.17 frames. ], batch size: 69, lr: 3.18e-03, grad_scale: 16.0 2023-12-04 22:27:58,313 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=458533.3333333333, ans=0.0 2023-12-04 22:28:32,875 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=458666.6666666667, ans=0.0 2023-12-04 22:28:37,466 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=458733.3333333333, ans=0.0 2023-12-04 22:28:46,379 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.069e+02 1.260e+02 1.355e+02 1.461e+02 1.889e+02, threshold=2.710e+02, percent-clipped=0.0 2023-12-04 22:28:47,547 INFO [train.py:1087] (2/4) Epoch 77, batch 800, loss[loss=0.1421, simple_loss=0.2394, pruned_loss=0.0224, over 24776.00 frames. ], tot_loss[loss=0.1479, simple_loss=0.2411, pruned_loss=0.0273, over 4724901.23 frames. ], batch size: 70, lr: 3.18e-03, grad_scale: 32.0 2023-12-04 22:28:49,364 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.63 vs. limit=6.0 2023-12-04 22:29:03,552 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=458866.6666666667, ans=0.5 2023-12-04 22:29:13,396 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=458933.3333333333, ans=0.2 2023-12-04 22:29:34,069 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=459066.6666666667, ans=0.1 2023-12-04 22:29:43,818 INFO [train.py:1087] (2/4) Epoch 77, batch 850, loss[loss=0.1454, simple_loss=0.2416, pruned_loss=0.02462, over 23762.00 frames. ], tot_loss[loss=0.148, simple_loss=0.2413, pruned_loss=0.02735, over 4740814.47 frames. ], batch size: 57, lr: 3.18e-03, grad_scale: 32.0 2023-12-04 22:29:58,427 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=459200.0, ans=0.0 2023-12-04 22:29:59,486 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=459200.0, ans=0.2 2023-12-04 22:30:04,889 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=459266.6666666667, ans=0.125 2023-12-04 22:30:21,226 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=459333.3333333333, ans=0.2 2023-12-04 22:30:28,613 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 22:30:43,557 INFO [train.py:1087] (2/4) Epoch 78, batch 0, loss[loss=0.1364, simple_loss=0.2364, pruned_loss=0.01817, over 24795.00 frames. ], tot_loss[loss=0.1364, simple_loss=0.2364, pruned_loss=0.01817, over 24795.00 frames. 
], batch size: 73, lr: 3.15e-03, grad_scale: 32.0 2023-12-04 22:30:43,557 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 22:30:51,953 INFO [zipformer.py:1876] (2/4) name=encoder.encoders.3.encoder.layers.3.self_attn_weights, attn_weights_entropy = tensor([1.9346, 2.3004, 2.5076, 2.4276, 2.2120, 2.3866, 2.3872, 2.2616], device='cuda:2') 2023-12-04 22:30:56,956 INFO [train.py:1119] (2/4) Epoch 78, validation: loss=0.1512, simple_loss=0.2469, pruned_loss=0.02777, over 944034.00 frames. 2023-12-04 22:30:56,957 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 22:31:01,567 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.151e+02 1.318e+02 1.400e+02 1.529e+02 2.377e+02, threshold=2.801e+02, percent-clipped=0.0 2023-12-04 22:31:02,178 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.96 vs. limit=15.0 2023-12-04 22:31:04,200 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=459433.3333333333, ans=0.125 2023-12-04 22:31:16,957 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=459500.0, ans=0.0 2023-12-04 22:31:35,691 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.54 vs. limit=15.0 2023-12-04 22:31:54,364 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-12-04 22:31:58,206 INFO [train.py:1087] (2/4) Epoch 78, batch 50, loss[loss=0.1746, simple_loss=0.2576, pruned_loss=0.04579, over 17218.00 frames. ], tot_loss[loss=0.1476, simple_loss=0.2408, pruned_loss=0.02723, over 1064777.54 frames. ], batch size: 177, lr: 3.15e-03, grad_scale: 32.0 2023-12-04 22:32:40,737 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=459966.6666666667, ans=0.0 2023-12-04 22:32:56,585 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=460033.3333333333, ans=0.125 2023-12-04 22:32:56,636 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=460033.3333333333, ans=0.125 2023-12-04 22:32:58,612 INFO [train.py:1087] (2/4) Epoch 78, batch 100, loss[loss=0.1529, simple_loss=0.2447, pruned_loss=0.03058, over 24502.00 frames. ], tot_loss[loss=0.1479, simple_loss=0.2412, pruned_loss=0.02732, over 1908858.74 frames. 
], batch size: 75, lr: 3.15e-03, grad_scale: 32.0 2023-12-04 22:33:03,696 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.093e+02 1.269e+02 1.352e+02 1.523e+02 1.962e+02, threshold=2.704e+02, percent-clipped=0.0 2023-12-04 22:33:10,071 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=460166.6666666667, ans=0.125 2023-12-04 22:33:18,049 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=460166.6666666667, ans=0.125 2023-12-04 22:33:18,164 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=460166.6666666667, ans=0.05 2023-12-04 22:33:30,547 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=460233.3333333333, ans=0.1 2023-12-04 22:33:32,855 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=460233.3333333333, ans=0.125 2023-12-04 22:33:53,371 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=460366.6666666667, ans=0.07 2023-12-04 22:33:56,096 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=460366.6666666667, ans=0.0 2023-12-04 22:33:59,203 INFO [train.py:1087] (2/4) Epoch 78, batch 150, loss[loss=0.1473, simple_loss=0.2415, pruned_loss=0.02661, over 24782.00 frames. ], tot_loss[loss=0.1477, simple_loss=0.2413, pruned_loss=0.02706, over 2549732.03 frames. ], batch size: 62, lr: 3.15e-03, grad_scale: 32.0 2023-12-04 22:34:09,583 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=460433.3333333333, ans=0.125 2023-12-04 22:34:17,429 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=460500.0, ans=0.125 2023-12-04 22:34:30,506 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=460566.6666666667, ans=0.0 2023-12-04 22:34:30,576 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=460566.6666666667, ans=0.2 2023-12-04 22:34:36,431 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=460633.3333333333, ans=0.0 2023-12-04 22:34:36,877 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=22.5 2023-12-04 22:34:50,950 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.16 vs. limit=10.0 2023-12-04 22:34:51,851 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=460700.0, ans=0.09899494936611666 2023-12-04 22:34:59,860 INFO [train.py:1087] (2/4) Epoch 78, batch 200, loss[loss=0.153, simple_loss=0.2471, pruned_loss=0.02942, over 24022.00 frames. ], tot_loss[loss=0.1474, simple_loss=0.2411, pruned_loss=0.02691, over 3059640.36 frames. 
], batch size: 87, lr: 3.15e-03, grad_scale: 32.0 2023-12-04 22:35:03,682 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=460766.6666666667, ans=0.2 2023-12-04 22:35:04,560 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.144e+02 1.245e+02 1.353e+02 1.456e+02 1.804e+02, threshold=2.706e+02, percent-clipped=0.0 2023-12-04 22:35:13,949 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=460833.3333333333, ans=0.125 2023-12-04 22:35:18,174 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=460833.3333333333, ans=0.125 2023-12-04 22:35:27,596 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_na.min_abs, batch_count=460900.0, ans=0.02 2023-12-04 22:35:30,010 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=460900.0, ans=0.125 2023-12-04 22:35:31,039 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=460900.0, ans=0.0 2023-12-04 22:35:37,000 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=460966.6666666667, ans=0.0 2023-12-04 22:35:47,121 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=460966.6666666667, ans=0.125 2023-12-04 22:36:01,521 INFO [train.py:1087] (2/4) Epoch 78, batch 250, loss[loss=0.155, simple_loss=0.2491, pruned_loss=0.03042, over 24025.00 frames. ], tot_loss[loss=0.1482, simple_loss=0.2415, pruned_loss=0.0274, over 3425047.81 frames. ], batch size: 87, lr: 3.15e-03, grad_scale: 32.0 2023-12-04 22:36:43,428 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=461300.0, ans=0.125 2023-12-04 22:36:51,334 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.09 vs. limit=6.0 2023-12-04 22:37:04,044 INFO [train.py:1087] (2/4) Epoch 78, batch 300, loss[loss=0.1698, simple_loss=0.2622, pruned_loss=0.03868, over 24208.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2419, pruned_loss=0.02753, over 3726379.04 frames. ], batch size: 82, lr: 3.15e-03, grad_scale: 16.0 2023-12-04 22:37:04,305 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=461433.3333333333, ans=0.125 2023-12-04 22:37:10,080 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.083e+02 1.267e+02 1.351e+02 1.468e+02 1.970e+02, threshold=2.702e+02, percent-clipped=0.0 2023-12-04 22:37:29,637 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=461566.6666666667, ans=0.09899494936611666 2023-12-04 22:37:33,474 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=461566.6666666667, ans=0.125 2023-12-04 22:37:33,741 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.37 vs. 
limit=10.0 2023-12-04 22:37:43,686 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=461633.3333333333, ans=0.0 2023-12-04 22:37:46,190 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.73 vs. limit=15.0 2023-12-04 22:37:59,600 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=461700.0, ans=0.125 2023-12-04 22:38:03,198 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=461700.0, ans=0.125 2023-12-04 22:38:05,863 INFO [train.py:1087] (2/4) Epoch 78, batch 350, loss[loss=0.1403, simple_loss=0.2303, pruned_loss=0.02517, over 24559.00 frames. ], tot_loss[loss=0.1486, simple_loss=0.2418, pruned_loss=0.02768, over 3958916.26 frames. ], batch size: 66, lr: 3.15e-03, grad_scale: 16.0 2023-12-04 22:38:07,243 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=461766.6666666667, ans=0.0 2023-12-04 22:38:10,201 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=461766.6666666667, ans=0.1 2023-12-04 22:38:23,664 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=461833.3333333333, ans=0.125 2023-12-04 22:38:27,601 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-12-04 22:38:43,735 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=461966.6666666667, ans=0.125 2023-12-04 22:38:43,974 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-12-04 22:38:44,966 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=461966.6666666667, ans=0.0 2023-12-04 22:38:57,824 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=462033.3333333333, ans=0.1 2023-12-04 22:39:07,201 INFO [train.py:1087] (2/4) Epoch 78, batch 400, loss[loss=0.152, simple_loss=0.246, pruned_loss=0.02898, over 24715.00 frames. ], tot_loss[loss=0.1483, simple_loss=0.2417, pruned_loss=0.02749, over 4140203.33 frames. ], batch size: 69, lr: 3.15e-03, grad_scale: 32.0 2023-12-04 22:39:13,453 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.151e+02 1.316e+02 1.421e+02 1.551e+02 2.009e+02, threshold=2.843e+02, percent-clipped=0.0 2023-12-04 22:39:38,149 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=462233.3333333333, ans=0.2 2023-12-04 22:39:39,518 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.56 vs. 
limit=22.5 2023-12-04 22:39:53,489 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=462300.0, ans=0.0 2023-12-04 22:40:09,001 INFO [train.py:1087] (2/4) Epoch 78, batch 450, loss[loss=0.1331, simple_loss=0.2294, pruned_loss=0.01835, over 24781.00 frames. ], tot_loss[loss=0.1478, simple_loss=0.2413, pruned_loss=0.02717, over 4293223.43 frames. ], batch size: 71, lr: 3.14e-03, grad_scale: 32.0 2023-12-04 22:40:12,868 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=462433.3333333333, ans=0.125 2023-12-04 22:40:23,845 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=462500.0, ans=0.1 2023-12-04 22:40:23,987 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=462500.0, ans=0.125 2023-12-04 22:40:25,054 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=462500.0, ans=0.2 2023-12-04 22:40:26,397 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=462500.0, ans=0.1 2023-12-04 22:40:37,065 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=462566.6666666667, ans=0.125 2023-12-04 22:40:38,365 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=462566.6666666667, ans=0.125 2023-12-04 22:41:01,239 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=462700.0, ans=0.0 2023-12-04 22:41:03,866 INFO [scaling.py:1022] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.56 vs. limit=5.0 2023-12-04 22:41:11,613 INFO [train.py:1087] (2/4) Epoch 78, batch 500, loss[loss=0.1672, simple_loss=0.2565, pruned_loss=0.03892, over 17335.00 frames. ], tot_loss[loss=0.1479, simple_loss=0.2412, pruned_loss=0.02727, over 4398218.21 frames. ], batch size: 177, lr: 3.14e-03, grad_scale: 32.0 2023-12-04 22:41:17,372 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.128e+02 1.273e+02 1.355e+02 1.434e+02 2.086e+02, threshold=2.711e+02, percent-clipped=0.0 2023-12-04 22:41:34,775 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 22:42:02,897 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.94 vs. limit=6.0 2023-12-04 22:42:12,109 INFO [train.py:1087] (2/4) Epoch 78, batch 550, loss[loss=0.1358, simple_loss=0.2287, pruned_loss=0.02144, over 24839.00 frames. ], tot_loss[loss=0.1478, simple_loss=0.2411, pruned_loss=0.02721, over 4502364.70 frames. 
], batch size: 68, lr: 3.14e-03, grad_scale: 16.0 2023-12-04 22:42:25,083 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=463166.6666666667, ans=0.1 2023-12-04 22:42:43,924 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=463233.3333333333, ans=0.0 2023-12-04 22:42:51,968 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.43 vs. limit=15.0 2023-12-04 22:42:57,271 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=463300.0, ans=0.0 2023-12-04 22:43:03,503 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=463366.6666666667, ans=0.125 2023-12-04 22:43:12,737 INFO [train.py:1087] (2/4) Epoch 78, batch 600, loss[loss=0.1438, simple_loss=0.2423, pruned_loss=0.02259, over 24763.00 frames. ], tot_loss[loss=0.1477, simple_loss=0.2408, pruned_loss=0.02723, over 4575795.29 frames. ], batch size: 70, lr: 3.14e-03, grad_scale: 16.0 2023-12-04 22:43:14,188 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=463433.3333333333, ans=0.0 2023-12-04 22:43:20,558 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.076e+02 1.286e+02 1.376e+02 1.492e+02 2.367e+02, threshold=2.752e+02, percent-clipped=0.0 2023-12-04 22:44:01,389 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=463700.0, ans=0.0 2023-12-04 22:44:14,354 INFO [train.py:1087] (2/4) Epoch 78, batch 650, loss[loss=0.1363, simple_loss=0.2298, pruned_loss=0.02138, over 24720.00 frames. ], tot_loss[loss=0.1477, simple_loss=0.2409, pruned_loss=0.02723, over 4624276.51 frames. ], batch size: 67, lr: 3.14e-03, grad_scale: 16.0 2023-12-04 22:44:41,196 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=463900.0, ans=0.125 2023-12-04 22:44:43,431 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=463900.0, ans=0.125 2023-12-04 22:44:44,727 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=463900.0, ans=0.125 2023-12-04 22:45:15,113 INFO [train.py:1087] (2/4) Epoch 78, batch 700, loss[loss=0.1546, simple_loss=0.2488, pruned_loss=0.03022, over 22910.00 frames. ], tot_loss[loss=0.1478, simple_loss=0.241, pruned_loss=0.02731, over 4652827.32 frames. 
], batch size: 107, lr: 3.14e-03, grad_scale: 16.0 2023-12-04 22:45:22,021 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.095e+02 1.303e+02 1.388e+02 1.509e+02 1.902e+02, threshold=2.776e+02, percent-clipped=0.0 2023-12-04 22:45:44,597 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=464233.3333333333, ans=0.035 2023-12-04 22:45:51,709 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=464300.0, ans=0.2 2023-12-04 22:46:15,853 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=464433.3333333333, ans=0.125 2023-12-04 22:46:16,300 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=464433.3333333333, ans=22.5 2023-12-04 22:46:16,930 INFO [train.py:1087] (2/4) Epoch 78, batch 750, loss[loss=0.1534, simple_loss=0.2473, pruned_loss=0.02974, over 24600.00 frames. ], tot_loss[loss=0.1476, simple_loss=0.2408, pruned_loss=0.02717, over 4697796.50 frames. ], batch size: 68, lr: 3.14e-03, grad_scale: 16.0 2023-12-04 22:46:23,359 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.91 vs. limit=15.0 2023-12-04 22:47:11,372 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=464700.0, ans=0.5 2023-12-04 22:47:17,389 INFO [train.py:1087] (2/4) Epoch 78, batch 800, loss[loss=0.1434, simple_loss=0.2352, pruned_loss=0.02582, over 24555.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.2405, pruned_loss=0.027, over 4742261.50 frames. ], batch size: 66, lr: 3.14e-03, grad_scale: 32.0 2023-12-04 22:47:19,930 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=464766.6666666667, ans=0.2 2023-12-04 22:47:24,941 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.094e+02 1.273e+02 1.372e+02 1.455e+02 1.988e+02, threshold=2.744e+02, percent-clipped=0.0 2023-12-04 22:47:39,460 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=464833.3333333333, ans=0.0 2023-12-04 22:48:02,871 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=465033.3333333333, ans=0.0 2023-12-04 22:48:04,468 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-12-04 22:48:13,423 INFO [train.py:1087] (2/4) Epoch 78, batch 850, loss[loss=0.1545, simple_loss=0.2499, pruned_loss=0.0296, over 23390.00 frames. ], tot_loss[loss=0.1475, simple_loss=0.2409, pruned_loss=0.0271, over 4764819.17 frames. ], batch size: 94, lr: 3.14e-03, grad_scale: 32.0 2023-12-04 22:48:26,618 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=465166.6666666667, ans=10.0 2023-12-04 22:48:53,756 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.71 vs. limit=22.5 2023-12-04 22:49:18,208 INFO [train.py:1087] (2/4) Epoch 79, batch 0, loss[loss=0.1466, simple_loss=0.2404, pruned_loss=0.02637, over 23338.00 frames. 
], tot_loss[loss=0.1466, simple_loss=0.2404, pruned_loss=0.02637, over 23338.00 frames. ], batch size: 94, lr: 3.11e-03, grad_scale: 32.0 2023-12-04 22:49:18,209 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 22:49:27,085 INFO [zipformer.py:1876] (2/4) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.3492, 3.6794, 3.6446, 4.6088, 4.5237, 3.6863, 4.1908, 4.0042], device='cuda:2') 2023-12-04 22:49:31,407 INFO [train.py:1119] (2/4) Epoch 79, validation: loss=0.1512, simple_loss=0.2468, pruned_loss=0.02781, over 944034.00 frames. 2023-12-04 22:49:31,408 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 22:49:44,324 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.122e+02 1.283e+02 1.371e+02 1.514e+02 2.003e+02, threshold=2.742e+02, percent-clipped=0.0 2023-12-04 22:50:06,869 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.04 vs. limit=6.0 2023-12-04 22:50:07,955 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.71 vs. limit=22.5 2023-12-04 22:50:11,757 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=465600.0, ans=0.125 2023-12-04 22:50:32,086 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=465733.3333333333, ans=0.0 2023-12-04 22:50:32,848 INFO [train.py:1087] (2/4) Epoch 79, batch 50, loss[loss=0.1443, simple_loss=0.237, pruned_loss=0.0258, over 22839.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2418, pruned_loss=0.02761, over 1071868.38 frames. ], batch size: 106, lr: 3.11e-03, grad_scale: 32.0 2023-12-04 22:50:41,211 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 22:51:03,641 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=465866.6666666667, ans=0.2 2023-12-04 22:51:18,679 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=465933.3333333333, ans=0.125 2023-12-04 22:51:21,022 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=466000.0, ans=10.0 2023-12-04 22:51:24,834 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=466000.0, ans=0.125 2023-12-04 22:51:28,175 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=466000.0, ans=0.2 2023-12-04 22:51:29,481 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=466000.0, ans=0.0 2023-12-04 22:51:32,896 INFO [train.py:1087] (2/4) Epoch 79, batch 100, loss[loss=0.1401, simple_loss=0.2378, pruned_loss=0.02115, over 24710.00 frames. ], tot_loss[loss=0.1476, simple_loss=0.2411, pruned_loss=0.02706, over 1906089.14 frames. 
], batch size: 69, lr: 3.11e-03, grad_scale: 32.0 2023-12-04 22:51:42,042 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=466066.6666666667, ans=0.0 2023-12-04 22:51:46,213 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.313e+02 1.387e+02 1.492e+02 1.840e+02, threshold=2.775e+02, percent-clipped=0.0 2023-12-04 22:51:49,285 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.90 vs. limit=22.5 2023-12-04 22:51:51,284 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=466133.3333333333, ans=0.125 2023-12-04 22:52:00,025 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=466200.0, ans=0.025 2023-12-04 22:52:08,733 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=466266.6666666667, ans=0.2 2023-12-04 22:52:09,776 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=466266.6666666667, ans=0.0 2023-12-04 22:52:26,263 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=466333.3333333333, ans=0.125 2023-12-04 22:52:30,214 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.65 vs. limit=15.0 2023-12-04 22:52:31,093 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=466333.3333333333, ans=0.125 2023-12-04 22:52:33,017 INFO [train.py:1087] (2/4) Epoch 79, batch 150, loss[loss=0.1383, simple_loss=0.2352, pruned_loss=0.02073, over 24608.00 frames. ], tot_loss[loss=0.1478, simple_loss=0.2411, pruned_loss=0.02723, over 2559541.88 frames. ], batch size: 68, lr: 3.11e-03, grad_scale: 32.0 2023-12-04 22:52:38,896 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.81 vs. limit=22.5 2023-12-04 22:52:48,444 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=466466.6666666667, ans=0.125 2023-12-04 22:52:54,226 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=466466.6666666667, ans=0.0 2023-12-04 22:53:18,497 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=466600.0, ans=0.125 2023-12-04 22:53:27,377 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=466666.6666666667, ans=0.125 2023-12-04 22:53:28,341 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=466666.6666666667, ans=0.0 2023-12-04 22:53:30,966 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-12-04 22:53:33,909 INFO [train.py:1087] (2/4) Epoch 79, batch 200, loss[loss=0.1392, simple_loss=0.2365, pruned_loss=0.02092, over 24849.00 frames. 
], tot_loss[loss=0.1477, simple_loss=0.2413, pruned_loss=0.0271, over 3052784.01 frames. ], batch size: 68, lr: 3.11e-03, grad_scale: 32.0 2023-12-04 22:53:43,637 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=466733.3333333333, ans=0.125 2023-12-04 22:53:46,903 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.118e+02 1.258e+02 1.330e+02 1.443e+02 2.425e+02, threshold=2.659e+02, percent-clipped=0.0 2023-12-04 22:53:55,559 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=466800.0, ans=0.125 2023-12-04 22:54:10,802 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=466933.3333333333, ans=0.125 2023-12-04 22:54:30,835 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=467000.0, ans=0.125 2023-12-04 22:54:35,249 INFO [train.py:1087] (2/4) Epoch 79, batch 250, loss[loss=0.1506, simple_loss=0.2419, pruned_loss=0.02958, over 24279.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2421, pruned_loss=0.02743, over 3425889.21 frames. ], batch size: 82, lr: 3.11e-03, grad_scale: 32.0 2023-12-04 22:54:39,094 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=467066.6666666667, ans=0.2 2023-12-04 22:54:55,587 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=467133.3333333333, ans=0.125 2023-12-04 22:54:58,019 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=467200.0, ans=0.0 2023-12-04 22:55:17,640 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=467266.6666666667, ans=0.125 2023-12-04 22:55:19,179 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.83 vs. limit=15.0 2023-12-04 22:55:19,366 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-12-04 22:55:26,855 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=467333.3333333333, ans=0.125 2023-12-04 22:55:29,436 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=467333.3333333333, ans=0.2 2023-12-04 22:55:29,654 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.09 vs. limit=15.0 2023-12-04 22:55:36,048 INFO [train.py:1087] (2/4) Epoch 79, batch 300, loss[loss=0.1585, simple_loss=0.2522, pruned_loss=0.03239, over 23580.00 frames. ], tot_loss[loss=0.1483, simple_loss=0.2417, pruned_loss=0.02742, over 3714166.68 frames. 
], batch size: 94, lr: 3.11e-03, grad_scale: 16.0 2023-12-04 22:55:48,728 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=467466.6666666667, ans=0.0 2023-12-04 22:55:50,633 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.098e+02 1.277e+02 1.357e+02 1.469e+02 1.891e+02, threshold=2.714e+02, percent-clipped=0.0 2023-12-04 22:55:55,568 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=467466.6666666667, ans=0.05 2023-12-04 22:56:01,594 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=467533.3333333333, ans=0.2 2023-12-04 22:56:35,546 INFO [train.py:1087] (2/4) Epoch 79, batch 350, loss[loss=0.1517, simple_loss=0.2457, pruned_loss=0.02884, over 24196.00 frames. ], tot_loss[loss=0.1489, simple_loss=0.2422, pruned_loss=0.02782, over 3955573.86 frames. ], batch size: 82, lr: 3.11e-03, grad_scale: 16.0 2023-12-04 22:56:35,802 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=467733.3333333333, ans=0.0 2023-12-04 22:56:43,123 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=467733.3333333333, ans=0.2 2023-12-04 22:56:52,762 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.10 vs. limit=15.0 2023-12-04 22:57:34,572 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=468000.0, ans=0.2 2023-12-04 22:57:36,760 INFO [train.py:1087] (2/4) Epoch 79, batch 400, loss[loss=0.1559, simple_loss=0.2437, pruned_loss=0.03405, over 24297.00 frames. ], tot_loss[loss=0.1487, simple_loss=0.2421, pruned_loss=0.02766, over 4131636.08 frames. ], batch size: 79, lr: 3.11e-03, grad_scale: 32.0 2023-12-04 22:57:47,134 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.63 vs. limit=10.0 2023-12-04 22:57:47,710 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=468133.3333333333, ans=0.125 2023-12-04 22:57:49,236 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.16 vs. limit=12.0 2023-12-04 22:57:50,851 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.119e+02 1.294e+02 1.402e+02 1.517e+02 2.343e+02, threshold=2.805e+02, percent-clipped=0.0 2023-12-04 22:57:56,586 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=468133.3333333333, ans=0.125 2023-12-04 22:58:03,924 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=468200.0, ans=0.125 2023-12-04 22:58:26,189 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=468333.3333333333, ans=0.0 2023-12-04 22:58:37,207 INFO [train.py:1087] (2/4) Epoch 79, batch 450, loss[loss=0.1534, simple_loss=0.2395, pruned_loss=0.03362, over 24513.00 frames. ], tot_loss[loss=0.1484, simple_loss=0.2418, pruned_loss=0.0275, over 4277675.49 frames. 
], batch size: 75, lr: 3.10e-03, grad_scale: 32.0 2023-12-04 22:58:37,501 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=468400.0, ans=0.2 2023-12-04 22:58:40,166 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.44 vs. limit=15.0 2023-12-04 22:58:55,057 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=468466.6666666667, ans=0.1 2023-12-04 22:58:58,886 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.63 vs. limit=22.5 2023-12-04 22:59:05,309 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.44 vs. limit=10.0 2023-12-04 22:59:13,429 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=468600.0, ans=0.125 2023-12-04 22:59:20,110 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=468600.0, ans=0.1 2023-12-04 22:59:35,409 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.71 vs. limit=15.0 2023-12-04 22:59:36,835 INFO [train.py:1087] (2/4) Epoch 79, batch 500, loss[loss=0.1472, simple_loss=0.2365, pruned_loss=0.02889, over 24783.00 frames. ], tot_loss[loss=0.148, simple_loss=0.2414, pruned_loss=0.02726, over 4390947.57 frames. ], batch size: 65, lr: 3.10e-03, grad_scale: 32.0 2023-12-04 22:59:51,129 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.129e+02 1.271e+02 1.358e+02 1.492e+02 2.025e+02, threshold=2.716e+02, percent-clipped=0.0 2023-12-04 23:00:01,146 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=468866.6666666667, ans=0.0 2023-12-04 23:00:25,055 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=469000.0, ans=0.125 2023-12-04 23:00:29,435 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:00:29,555 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=469000.0, ans=0.2 2023-12-04 23:00:32,955 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:00:36,461 INFO [train.py:1087] (2/4) Epoch 79, batch 550, loss[loss=0.1401, simple_loss=0.2334, pruned_loss=0.02338, over 24782.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.2409, pruned_loss=0.02688, over 4503773.36 frames. 
], batch size: 72, lr: 3.10e-03, grad_scale: 32.0 2023-12-04 23:00:45,523 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=469066.6666666667, ans=0.1 2023-12-04 23:00:55,140 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=469133.3333333333, ans=0.04949747468305833 2023-12-04 23:00:57,321 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=469133.3333333333, ans=0.125 2023-12-04 23:01:36,443 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=469400.0, ans=0.125 2023-12-04 23:01:37,308 INFO [train.py:1087] (2/4) Epoch 79, batch 600, loss[loss=0.1397, simple_loss=0.2328, pruned_loss=0.02333, over 24697.00 frames. ], tot_loss[loss=0.1474, simple_loss=0.2407, pruned_loss=0.02707, over 4575602.31 frames. ], batch size: 74, lr: 3.10e-03, grad_scale: 32.0 2023-12-04 23:01:37,674 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=469400.0, ans=0.0 2023-12-04 23:01:52,265 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.071e+02 1.255e+02 1.326e+02 1.396e+02 1.933e+02, threshold=2.652e+02, percent-clipped=0.0 2023-12-04 23:02:03,168 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=469533.3333333333, ans=0.125 2023-12-04 23:02:23,230 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=469600.0, ans=0.125 2023-12-04 23:02:32,037 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0 2023-12-04 23:02:38,291 INFO [train.py:1087] (2/4) Epoch 79, batch 650, loss[loss=0.1423, simple_loss=0.2373, pruned_loss=0.02366, over 24740.00 frames. ], tot_loss[loss=0.1477, simple_loss=0.241, pruned_loss=0.02722, over 4635373.84 frames. ], batch size: 66, lr: 3.10e-03, grad_scale: 32.0 2023-12-04 23:03:04,263 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=469866.6666666667, ans=0.125 2023-12-04 23:03:14,452 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:03:32,025 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=470000.0, ans=0.125 2023-12-04 23:03:38,857 INFO [train.py:1087] (2/4) Epoch 79, batch 700, loss[loss=0.1401, simple_loss=0.2358, pruned_loss=0.02215, over 24710.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.2406, pruned_loss=0.02697, over 4693679.13 frames. ], batch size: 69, lr: 3.10e-03, grad_scale: 32.0 2023-12-04 23:03:42,939 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.51 vs. 
limit=22.5 2023-12-04 23:03:52,641 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:03:53,561 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.122e+02 1.295e+02 1.372e+02 1.521e+02 2.112e+02, threshold=2.745e+02, percent-clipped=0.0 2023-12-04 23:04:38,913 INFO [train.py:1087] (2/4) Epoch 79, batch 750, loss[loss=0.1428, simple_loss=0.2363, pruned_loss=0.02468, over 24460.00 frames. ], tot_loss[loss=0.1472, simple_loss=0.2406, pruned_loss=0.02689, over 4731794.63 frames. ], batch size: 72, lr: 3.10e-03, grad_scale: 16.0 2023-12-04 23:04:39,064 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=470400.0, ans=0.1 2023-12-04 23:04:56,955 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=470466.6666666667, ans=0.0 2023-12-04 23:05:15,313 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=470600.0, ans=0.09899494936611666 2023-12-04 23:05:17,562 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=470600.0, ans=0.1 2023-12-04 23:05:38,473 INFO [train.py:1087] (2/4) Epoch 79, batch 800, loss[loss=0.1676, simple_loss=0.2572, pruned_loss=0.03904, over 24199.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.2407, pruned_loss=0.02701, over 4737222.85 frames. ], batch size: 58, lr: 3.10e-03, grad_scale: 32.0 2023-12-04 23:05:49,284 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.69 vs. limit=6.0 2023-12-04 23:05:50,080 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=470800.0, ans=0.0 2023-12-04 23:05:53,743 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=470800.0, ans=0.125 2023-12-04 23:05:54,529 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.097e+02 1.232e+02 1.327e+02 1.422e+02 1.884e+02, threshold=2.655e+02, percent-clipped=0.0 2023-12-04 23:06:34,417 INFO [train.py:1087] (2/4) Epoch 79, batch 850, loss[loss=0.1466, simple_loss=0.2442, pruned_loss=0.0245, over 24818.00 frames. ], tot_loss[loss=0.1477, simple_loss=0.241, pruned_loss=0.02717, over 4752341.44 frames. ], batch size: 72, lr: 3.10e-03, grad_scale: 16.0 2023-12-04 23:06:47,568 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:06:52,268 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.98 vs. limit=15.0 2023-12-04 23:06:55,075 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=471200.0, ans=0.125 2023-12-04 23:06:56,085 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=471200.0, ans=0.1 2023-12-04 23:06:59,236 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=471200.0, ans=0.2 2023-12-04 23:06:59,782 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.26 vs. 
limit=15.0 2023-12-04 23:07:31,223 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=471366.6666666667, ans=0.125 2023-12-04 23:07:38,919 INFO [train.py:1087] (2/4) Epoch 80, batch 0, loss[loss=0.1424, simple_loss=0.2456, pruned_loss=0.01957, over 24762.00 frames. ], tot_loss[loss=0.1424, simple_loss=0.2456, pruned_loss=0.01957, over 24762.00 frames. ], batch size: 72, lr: 3.07e-03, grad_scale: 32.0 2023-12-04 23:07:38,920 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 23:07:52,529 INFO [train.py:1119] (2/4) Epoch 80, validation: loss=0.151, simple_loss=0.2466, pruned_loss=0.02767, over 944034.00 frames. 2023-12-04 23:07:52,530 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 23:07:56,203 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=471366.6666666667, ans=0.2 2023-12-04 23:07:56,636 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.02 vs. limit=15.0 2023-12-04 23:08:00,835 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=471366.6666666667, ans=0.125 2023-12-04 23:08:03,210 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=471433.3333333333, ans=0.1 2023-12-04 23:08:07,751 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=471433.3333333333, ans=0.1 2023-12-04 23:08:15,517 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.119e+02 1.302e+02 1.425e+02 1.549e+02 2.145e+02, threshold=2.850e+02, percent-clipped=0.0 2023-12-04 23:08:50,255 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.35 vs. limit=15.0 2023-12-04 23:08:50,939 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=471633.3333333333, ans=0.125 2023-12-04 23:08:52,201 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=471700.0, ans=0.125 2023-12-04 23:08:53,210 INFO [train.py:1087] (2/4) Epoch 80, batch 50, loss[loss=0.1447, simple_loss=0.2382, pruned_loss=0.02557, over 24168.00 frames. ], tot_loss[loss=0.1491, simple_loss=0.2427, pruned_loss=0.02773, over 1069519.97 frames. ], batch size: 58, lr: 3.07e-03, grad_scale: 32.0 2023-12-04 23:09:15,325 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=471766.6666666667, ans=0.125 2023-12-04 23:09:40,438 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=471900.0, ans=0.125 2023-12-04 23:09:52,314 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=471966.6666666667, ans=0.125 2023-12-04 23:09:54,313 INFO [train.py:1087] (2/4) Epoch 80, batch 100, loss[loss=0.144, simple_loss=0.2399, pruned_loss=0.024, over 21449.00 frames. ], tot_loss[loss=0.1478, simple_loss=0.2419, pruned_loss=0.02682, over 1902489.96 frames. 
], batch size: 128, lr: 3.07e-03, grad_scale: 32.0 2023-12-04 23:10:12,631 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=472100.0, ans=0.0 2023-12-04 23:10:12,644 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=472100.0, ans=0.125 2023-12-04 23:10:18,043 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.112e+02 1.286e+02 1.368e+02 1.525e+02 1.879e+02, threshold=2.737e+02, percent-clipped=0.0 2023-12-04 23:10:29,181 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.35 vs. limit=15.0 2023-12-04 23:10:37,589 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=472233.3333333333, ans=0.07 2023-12-04 23:10:41,190 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.76 vs. limit=6.0 2023-12-04 23:10:47,382 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=472300.0, ans=0.1 2023-12-04 23:10:54,309 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=472300.0, ans=0.2 2023-12-04 23:10:56,399 INFO [train.py:1087] (2/4) Epoch 80, batch 150, loss[loss=0.1353, simple_loss=0.2243, pruned_loss=0.02313, over 24790.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2407, pruned_loss=0.02666, over 2557900.45 frames. ], batch size: 62, lr: 3.07e-03, grad_scale: 32.0 2023-12-04 23:11:10,130 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=472433.3333333333, ans=0.125 2023-12-04 23:11:19,756 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=472500.0, ans=0.0 2023-12-04 23:11:28,281 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=472500.0, ans=0.1 2023-12-04 23:11:31,864 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=472566.6666666667, ans=0.2 2023-12-04 23:11:38,710 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=472566.6666666667, ans=0.1 2023-12-04 23:11:57,562 INFO [train.py:1087] (2/4) Epoch 80, batch 200, loss[loss=0.1475, simple_loss=0.2411, pruned_loss=0.02699, over 24733.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.241, pruned_loss=0.02683, over 3061047.63 frames. ], batch size: 61, lr: 3.07e-03, grad_scale: 32.0 2023-12-04 23:12:11,839 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.21 vs. 
limit=15.0 2023-12-04 23:12:18,773 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=472766.6666666667, ans=0.2 2023-12-04 23:12:18,879 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=472766.6666666667, ans=0.0 2023-12-04 23:12:21,637 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.119e+02 1.292e+02 1.417e+02 1.507e+02 2.188e+02, threshold=2.834e+02, percent-clipped=0.0 2023-12-04 23:12:27,754 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=472833.3333333333, ans=0.125 2023-12-04 23:12:41,156 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=472900.0, ans=0.1 2023-12-04 23:12:42,493 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.14 vs. limit=15.0 2023-12-04 23:12:43,541 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=472900.0, ans=0.125 2023-12-04 23:12:58,254 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=473033.3333333333, ans=0.0 2023-12-04 23:12:59,024 INFO [train.py:1087] (2/4) Epoch 80, batch 250, loss[loss=0.143, simple_loss=0.2396, pruned_loss=0.02323, over 24762.00 frames. ], tot_loss[loss=0.1474, simple_loss=0.241, pruned_loss=0.02689, over 3438651.64 frames. ], batch size: 61, lr: 3.07e-03, grad_scale: 32.0 2023-12-04 23:13:08,751 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=473033.3333333333, ans=0.125 2023-12-04 23:13:09,297 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.80 vs. limit=6.0 2023-12-04 23:13:12,274 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=473100.0, ans=0.1 2023-12-04 23:13:18,687 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=473100.0, ans=0.09899494936611666 2023-12-04 23:13:34,046 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.68 vs. limit=15.0 2023-12-04 23:13:40,779 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=473233.3333333333, ans=0.125 2023-12-04 23:13:44,230 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=473233.3333333333, ans=0.2 2023-12-04 23:13:51,706 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=473300.0, ans=0.05 2023-12-04 23:13:59,755 INFO [train.py:1087] (2/4) Epoch 80, batch 300, loss[loss=0.1493, simple_loss=0.239, pruned_loss=0.02981, over 24741.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2407, pruned_loss=0.02664, over 3747445.27 frames. 
], batch size: 61, lr: 3.07e-03, grad_scale: 32.0 2023-12-04 23:14:02,768 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=473366.6666666667, ans=0.125 2023-12-04 23:14:16,158 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=473433.3333333333, ans=0.0 2023-12-04 23:14:22,725 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.053e+02 1.246e+02 1.309e+02 1.433e+02 1.869e+02, threshold=2.619e+02, percent-clipped=0.0 2023-12-04 23:14:32,923 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=473500.0, ans=0.125 2023-12-04 23:14:45,164 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:15:01,685 INFO [train.py:1087] (2/4) Epoch 80, batch 350, loss[loss=0.1777, simple_loss=0.2594, pruned_loss=0.04807, over 16715.00 frames. ], tot_loss[loss=0.1472, simple_loss=0.2407, pruned_loss=0.02685, over 3969554.03 frames. ], batch size: 177, lr: 3.07e-03, grad_scale: 32.0 2023-12-04 23:15:23,680 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.54 vs. limit=12.0 2023-12-04 23:15:26,705 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=473833.3333333333, ans=0.1 2023-12-04 23:15:38,814 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=473900.0, ans=0.0 2023-12-04 23:15:50,849 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=473966.6666666667, ans=0.125 2023-12-04 23:15:52,071 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:15:53,179 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=473966.6666666667, ans=0.1 2023-12-04 23:16:02,803 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.17 vs. limit=15.0 2023-12-04 23:16:03,403 INFO [train.py:1087] (2/4) Epoch 80, batch 400, loss[loss=0.1447, simple_loss=0.2386, pruned_loss=0.02538, over 24523.00 frames. ], tot_loss[loss=0.1471, simple_loss=0.2408, pruned_loss=0.0267, over 4155852.64 frames. 
], batch size: 75, lr: 3.07e-03, grad_scale: 32.0 2023-12-04 23:16:04,869 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=474033.3333333333, ans=0.2 2023-12-04 23:16:26,874 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.069e+02 1.273e+02 1.337e+02 1.434e+02 1.847e+02, threshold=2.674e+02, percent-clipped=0.0 2023-12-04 23:16:28,303 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=474166.6666666667, ans=0.125 2023-12-04 23:16:34,106 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=474166.6666666667, ans=0.125 2023-12-04 23:16:55,197 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=474300.0, ans=0.125 2023-12-04 23:16:57,981 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=474300.0, ans=0.125 2023-12-04 23:17:04,791 INFO [train.py:1087] (2/4) Epoch 80, batch 450, loss[loss=0.1515, simple_loss=0.2462, pruned_loss=0.02844, over 20876.00 frames. ], tot_loss[loss=0.1471, simple_loss=0.2408, pruned_loss=0.0267, over 4309635.82 frames. ], batch size: 127, lr: 3.07e-03, grad_scale: 32.0 2023-12-04 23:17:05,049 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=474366.6666666667, ans=0.125 2023-12-04 23:17:10,839 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=474366.6666666667, ans=0.0 2023-12-04 23:17:22,695 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=474433.3333333333, ans=0.125 2023-12-04 23:17:38,076 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=474500.0, ans=0.0 2023-12-04 23:17:41,515 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=474566.6666666667, ans=0.125 2023-12-04 23:17:42,025 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.46 vs. limit=22.5 2023-12-04 23:18:06,326 INFO [train.py:1087] (2/4) Epoch 80, batch 500, loss[loss=0.1461, simple_loss=0.2394, pruned_loss=0.02642, over 24783.00 frames. ], tot_loss[loss=0.1471, simple_loss=0.2407, pruned_loss=0.02679, over 4438336.71 frames. ], batch size: 62, lr: 3.06e-03, grad_scale: 32.0 2023-12-04 23:18:11,382 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=474700.0, ans=0.125 2023-12-04 23:18:13,583 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=474700.0, ans=0.0 2023-12-04 23:18:19,617 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=474766.6666666667, ans=0.0 2023-12-04 23:18:23,342 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.71 vs. 
limit=15.0 2023-12-04 23:18:28,523 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.131e+02 1.264e+02 1.348e+02 1.436e+02 1.913e+02, threshold=2.696e+02, percent-clipped=0.0 2023-12-04 23:18:32,302 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=474833.3333333333, ans=0.125 2023-12-04 23:18:43,944 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:19:01,908 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=474966.6666666667, ans=0.1 2023-12-04 23:19:06,468 INFO [train.py:1087] (2/4) Epoch 80, batch 550, loss[loss=0.1377, simple_loss=0.2342, pruned_loss=0.02058, over 24786.00 frames. ], tot_loss[loss=0.1468, simple_loss=0.2405, pruned_loss=0.02658, over 4532064.95 frames. ], batch size: 71, lr: 3.06e-03, grad_scale: 32.0 2023-12-04 23:19:40,064 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=475166.6666666667, ans=0.2 2023-12-04 23:19:41,047 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=475166.6666666667, ans=0.1 2023-12-04 23:19:42,362 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:19:43,376 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=475233.3333333333, ans=0.125 2023-12-04 23:20:08,001 INFO [train.py:1087] (2/4) Epoch 80, batch 600, loss[loss=0.1454, simple_loss=0.2424, pruned_loss=0.02419, over 24748.00 frames. ], tot_loss[loss=0.1472, simple_loss=0.2408, pruned_loss=0.02677, over 4595483.37 frames. ], batch size: 63, lr: 3.06e-03, grad_scale: 32.0 2023-12-04 23:20:21,566 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=475433.3333333333, ans=0.125 2023-12-04 23:20:31,887 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.148e+02 1.302e+02 1.395e+02 1.504e+02 2.408e+02, threshold=2.791e+02, percent-clipped=0.0 2023-12-04 23:20:52,061 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=475566.6666666667, ans=0.5 2023-12-04 23:21:10,056 INFO [train.py:1087] (2/4) Epoch 80, batch 650, loss[loss=0.1364, simple_loss=0.2335, pruned_loss=0.01966, over 24603.00 frames. ], tot_loss[loss=0.1468, simple_loss=0.2404, pruned_loss=0.02659, over 4658051.65 frames. ], batch size: 68, lr: 3.06e-03, grad_scale: 32.0 2023-12-04 23:21:10,743 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.42 vs. 
limit=22.5 2023-12-04 23:21:34,829 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=475833.3333333333, ans=0.125 2023-12-04 23:21:46,858 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=475900.0, ans=0.2 2023-12-04 23:21:49,299 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=475900.0, ans=0.125 2023-12-04 23:21:58,487 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=475966.6666666667, ans=0.2 2023-12-04 23:22:11,482 INFO [train.py:1087] (2/4) Epoch 80, batch 700, loss[loss=0.176, simple_loss=0.2583, pruned_loss=0.04684, over 16781.00 frames. ], tot_loss[loss=0.1466, simple_loss=0.2402, pruned_loss=0.02653, over 4681758.93 frames. ], batch size: 177, lr: 3.06e-03, grad_scale: 32.0 2023-12-04 23:22:17,451 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=476033.3333333333, ans=0.125 2023-12-04 23:22:31,989 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=476100.0, ans=0.2 2023-12-04 23:22:33,962 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.145e+02 1.276e+02 1.339e+02 1.452e+02 1.981e+02, threshold=2.679e+02, percent-clipped=0.0 2023-12-04 23:22:36,856 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=476166.6666666667, ans=0.09899494936611666 2023-12-04 23:22:41,723 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=476166.6666666667, ans=0.0 2023-12-04 23:22:53,096 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.59 vs. limit=15.0 2023-12-04 23:22:53,988 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=476233.3333333333, ans=0.0 2023-12-04 23:23:02,890 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=476300.0, ans=0.0 2023-12-04 23:23:11,812 INFO [train.py:1087] (2/4) Epoch 80, batch 750, loss[loss=0.1351, simple_loss=0.2293, pruned_loss=0.02043, over 24703.00 frames. ], tot_loss[loss=0.1469, simple_loss=0.2404, pruned_loss=0.02672, over 4703828.17 frames. ], batch size: 74, lr: 3.06e-03, grad_scale: 32.0 2023-12-04 23:23:52,974 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=476566.6666666667, ans=0.07 2023-12-04 23:23:54,072 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=476566.6666666667, ans=10.0 2023-12-04 23:24:10,590 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=476633.3333333333, ans=0.025 2023-12-04 23:24:10,898 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.85 vs. limit=15.0 2023-12-04 23:24:12,650 INFO [train.py:1087] (2/4) Epoch 80, batch 800, loss[loss=0.1471, simple_loss=0.2403, pruned_loss=0.02699, over 24596.00 frames. 
], tot_loss[loss=0.147, simple_loss=0.2405, pruned_loss=0.02677, over 4721872.38 frames. ], batch size: 68, lr: 3.06e-03, grad_scale: 32.0 2023-12-04 23:24:26,717 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=476766.6666666667, ans=0.0 2023-12-04 23:24:32,223 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=476766.6666666667, ans=0.125 2023-12-04 23:24:35,156 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.046e+02 1.250e+02 1.373e+02 1.517e+02 1.879e+02, threshold=2.745e+02, percent-clipped=0.0 2023-12-04 23:24:38,794 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=476833.3333333333, ans=0.125 2023-12-04 23:25:07,126 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=476966.6666666667, ans=0.1 2023-12-04 23:25:09,017 INFO [train.py:1087] (2/4) Epoch 80, batch 850, loss[loss=0.136, simple_loss=0.2282, pruned_loss=0.02195, over 24563.00 frames. ], tot_loss[loss=0.1469, simple_loss=0.2404, pruned_loss=0.02668, over 4745104.81 frames. ], batch size: 66, lr: 3.06e-03, grad_scale: 32.0 2023-12-04 23:25:10,299 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=477033.3333333333, ans=0.125 2023-12-04 23:25:22,540 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=477100.0, ans=0.02 2023-12-04 23:25:23,768 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=477100.0, ans=0.125 2023-12-04 23:25:28,158 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=477100.0, ans=0.1 2023-12-04 23:25:33,394 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=477166.6666666667, ans=0.125 2023-12-04 23:25:54,143 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=477300.0, ans=0.125 2023-12-04 23:26:14,797 INFO [train.py:1087] (2/4) Epoch 81, batch 0, loss[loss=0.1423, simple_loss=0.2393, pruned_loss=0.02271, over 24715.00 frames. ], tot_loss[loss=0.1423, simple_loss=0.2393, pruned_loss=0.02271, over 24715.00 frames. ], batch size: 67, lr: 3.04e-03, grad_scale: 32.0 2023-12-04 23:26:14,798 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 23:26:28,301 INFO [train.py:1119] (2/4) Epoch 81, validation: loss=0.151, simple_loss=0.2464, pruned_loss=0.02775, over 944034.00 frames. 
2023-12-04 23:26:28,302 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 23:26:56,774 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.126e+02 1.252e+02 1.352e+02 1.455e+02 2.085e+02, threshold=2.704e+02, percent-clipped=0.0 2023-12-04 23:27:07,763 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=477533.3333333333, ans=0.0 2023-12-04 23:27:12,240 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=477533.3333333333, ans=0.0 2023-12-04 23:27:28,246 INFO [train.py:1087] (2/4) Epoch 81, batch 50, loss[loss=0.1444, simple_loss=0.2364, pruned_loss=0.02618, over 24785.00 frames. ], tot_loss[loss=0.1483, simple_loss=0.2417, pruned_loss=0.02745, over 1085610.71 frames. ], batch size: 62, lr: 3.04e-03, grad_scale: 32.0 2023-12-04 23:27:29,702 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=477666.6666666667, ans=0.1 2023-12-04 23:27:32,476 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.77 vs. limit=12.0 2023-12-04 23:27:49,440 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.21 vs. limit=22.5 2023-12-04 23:28:28,258 INFO [train.py:1087] (2/4) Epoch 81, batch 100, loss[loss=0.1463, simple_loss=0.2368, pruned_loss=0.02792, over 24759.00 frames. ], tot_loss[loss=0.1479, simple_loss=0.2415, pruned_loss=0.02715, over 1914327.29 frames. ], batch size: 65, lr: 3.03e-03, grad_scale: 32.0 2023-12-04 23:28:45,017 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=478066.6666666667, ans=0.125 2023-12-04 23:28:48,731 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=478066.6666666667, ans=0.125 2023-12-04 23:28:56,569 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.160e+02 1.303e+02 1.380e+02 1.472e+02 1.883e+02, threshold=2.760e+02, percent-clipped=0.0 2023-12-04 23:29:27,230 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=478333.3333333333, ans=0.05 2023-12-04 23:29:28,110 INFO [train.py:1087] (2/4) Epoch 81, batch 150, loss[loss=0.1394, simple_loss=0.2308, pruned_loss=0.024, over 24756.00 frames. ], tot_loss[loss=0.1475, simple_loss=0.241, pruned_loss=0.02703, over 2566732.62 frames. ], batch size: 65, lr: 3.03e-03, grad_scale: 32.0 2023-12-04 23:29:29,502 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=478333.3333333333, ans=0.125 2023-12-04 23:29:47,584 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=478400.0, ans=0.125 2023-12-04 23:30:23,811 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=478600.0, ans=0.1 2023-12-04 23:30:28,451 INFO [train.py:1087] (2/4) Epoch 81, batch 200, loss[loss=0.155, simple_loss=0.2482, pruned_loss=0.0309, over 21599.00 frames. ], tot_loss[loss=0.1479, simple_loss=0.2415, pruned_loss=0.02716, over 3062848.33 frames. 
], batch size: 127, lr: 3.03e-03, grad_scale: 32.0 2023-12-04 23:30:37,136 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=478666.6666666667, ans=0.0 2023-12-04 23:30:43,322 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=478733.3333333333, ans=0.2 2023-12-04 23:30:50,452 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=478733.3333333333, ans=0.0 2023-12-04 23:30:57,195 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.124e+02 1.273e+02 1.365e+02 1.482e+02 1.977e+02, threshold=2.729e+02, percent-clipped=0.0 2023-12-04 23:31:12,114 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.36 vs. limit=15.0 2023-12-04 23:31:20,498 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.31 vs. limit=15.0 2023-12-04 23:31:28,019 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=479000.0, ans=0.1 2023-12-04 23:31:28,918 INFO [train.py:1087] (2/4) Epoch 81, batch 250, loss[loss=0.1413, simple_loss=0.2357, pruned_loss=0.02342, over 24574.00 frames. ], tot_loss[loss=0.1472, simple_loss=0.2407, pruned_loss=0.02692, over 3463930.13 frames. ], batch size: 64, lr: 3.03e-03, grad_scale: 32.0 2023-12-04 23:31:39,604 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=479066.6666666667, ans=0.1 2023-12-04 23:32:01,842 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=479133.3333333333, ans=0.1 2023-12-04 23:32:08,861 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=479200.0, ans=0.125 2023-12-04 23:32:12,173 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=479200.0, ans=0.2 2023-12-04 23:32:15,749 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=479266.6666666667, ans=0.0 2023-12-04 23:32:29,522 INFO [train.py:1087] (2/4) Epoch 81, batch 300, loss[loss=0.1423, simple_loss=0.2378, pruned_loss=0.02337, over 24557.00 frames. ], tot_loss[loss=0.1472, simple_loss=0.2409, pruned_loss=0.02678, over 3775161.98 frames. ], batch size: 66, lr: 3.03e-03, grad_scale: 32.0 2023-12-04 23:32:53,062 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=479466.6666666667, ans=0.0 2023-12-04 23:32:55,517 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=479466.6666666667, ans=0.125 2023-12-04 23:32:57,846 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.265e+02 1.364e+02 1.472e+02 1.855e+02, threshold=2.728e+02, percent-clipped=0.0 2023-12-04 23:33:08,080 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.76 vs. 
limit=15.0 2023-12-04 23:33:29,019 INFO [train.py:1087] (2/4) Epoch 81, batch 350, loss[loss=0.158, simple_loss=0.242, pruned_loss=0.03704, over 24500.00 frames. ], tot_loss[loss=0.1469, simple_loss=0.2403, pruned_loss=0.02677, over 4013459.83 frames. ], batch size: 75, lr: 3.03e-03, grad_scale: 32.0 2023-12-04 23:33:34,260 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=479666.6666666667, ans=0.1 2023-12-04 23:33:41,232 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=479733.3333333333, ans=0.125 2023-12-04 23:33:43,405 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=479733.3333333333, ans=0.0 2023-12-04 23:33:53,027 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.09 vs. limit=15.0 2023-12-04 23:33:59,868 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=479800.0, ans=0.2 2023-12-04 23:34:22,004 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=479933.3333333333, ans=0.0 2023-12-04 23:34:26,988 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.68 vs. limit=15.0 2023-12-04 23:34:32,771 INFO [train.py:1087] (2/4) Epoch 81, batch 400, loss[loss=0.1416, simple_loss=0.236, pruned_loss=0.02356, over 24486.00 frames. ], tot_loss[loss=0.1468, simple_loss=0.2402, pruned_loss=0.02669, over 4189019.26 frames. ], batch size: 77, lr: 3.03e-03, grad_scale: 32.0 2023-12-04 23:34:50,899 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:34:56,211 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.89 vs. limit=22.5 2023-12-04 23:35:02,674 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.125e+02 1.269e+02 1.346e+02 1.459e+02 1.874e+02, threshold=2.692e+02, percent-clipped=0.0 2023-12-04 23:35:12,312 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=480200.0, ans=0.125 2023-12-04 23:35:20,897 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=480266.6666666667, ans=0.2 2023-12-04 23:35:30,059 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=480266.6666666667, ans=0.2 2023-12-04 23:35:34,510 INFO [train.py:1087] (2/4) Epoch 81, batch 450, loss[loss=0.1537, simple_loss=0.2486, pruned_loss=0.02939, over 23542.00 frames. ], tot_loss[loss=0.1468, simple_loss=0.2403, pruned_loss=0.02665, over 4331533.45 frames. 
], batch size: 94, lr: 3.03e-03, grad_scale: 32.0 2023-12-04 23:35:39,450 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=480333.3333333333, ans=0.125 2023-12-04 23:35:41,831 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=480333.3333333333, ans=0.125 2023-12-04 23:35:44,006 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=480333.3333333333, ans=0.0 2023-12-04 23:35:47,591 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=480400.0, ans=0.125 2023-12-04 23:36:03,570 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=480466.6666666667, ans=0.0 2023-12-04 23:36:21,740 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=480600.0, ans=0.0 2023-12-04 23:36:21,783 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=480600.0, ans=0.05 2023-12-04 23:36:34,914 INFO [train.py:1087] (2/4) Epoch 81, batch 500, loss[loss=0.1481, simple_loss=0.2418, pruned_loss=0.02721, over 24711.00 frames. ], tot_loss[loss=0.1468, simple_loss=0.2402, pruned_loss=0.02666, over 4438036.10 frames. ], batch size: 74, lr: 3.03e-03, grad_scale: 32.0 2023-12-04 23:36:40,397 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.70 vs. limit=22.5 2023-12-04 23:36:44,111 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.31 vs. limit=15.0 2023-12-04 23:36:48,295 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=480733.3333333333, ans=0.0 2023-12-04 23:36:56,319 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=480733.3333333333, ans=0.0 2023-12-04 23:37:03,571 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.098e+02 1.241e+02 1.323e+02 1.432e+02 2.037e+02, threshold=2.646e+02, percent-clipped=0.0 2023-12-04 23:37:03,907 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=480800.0, ans=0.1 2023-12-04 23:37:20,725 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.73 vs. limit=15.0 2023-12-04 23:37:33,444 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=22.5 2023-12-04 23:37:35,450 INFO [train.py:1087] (2/4) Epoch 81, batch 550, loss[loss=0.1467, simple_loss=0.2433, pruned_loss=0.02506, over 22843.00 frames. ], tot_loss[loss=0.1468, simple_loss=0.2403, pruned_loss=0.02663, over 4519366.07 frames. ], batch size: 106, lr: 3.03e-03, grad_scale: 32.0 2023-12-04 23:37:50,816 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.20 vs. 
limit=15.0 2023-12-04 23:38:06,496 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=481133.3333333333, ans=0.0 2023-12-04 23:38:12,545 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=2.98 vs. limit=12.0 2023-12-04 23:38:34,637 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=481266.6666666667, ans=0.125 2023-12-04 23:38:34,705 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=481266.6666666667, ans=0.125 2023-12-04 23:38:37,232 INFO [train.py:1087] (2/4) Epoch 81, batch 600, loss[loss=0.1554, simple_loss=0.2469, pruned_loss=0.03198, over 24737.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2404, pruned_loss=0.02676, over 4583035.98 frames. ], batch size: 61, lr: 3.02e-03, grad_scale: 32.0 2023-12-04 23:38:37,556 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=481333.3333333333, ans=0.0 2023-12-04 23:38:42,202 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=481333.3333333333, ans=0.125 2023-12-04 23:38:49,793 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=481400.0, ans=0.0 2023-12-04 23:38:57,154 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=481400.0, ans=0.125 2023-12-04 23:39:03,222 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=481466.6666666667, ans=0.0 2023-12-04 23:39:04,425 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=481466.6666666667, ans=0.0 2023-12-04 23:39:06,413 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.122e+02 1.246e+02 1.329e+02 1.442e+02 1.824e+02, threshold=2.657e+02, percent-clipped=0.0 2023-12-04 23:39:31,369 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.32 vs. limit=22.5 2023-12-04 23:39:38,085 INFO [train.py:1087] (2/4) Epoch 81, batch 650, loss[loss=0.1438, simple_loss=0.2416, pruned_loss=0.02298, over 24853.00 frames. ], tot_loss[loss=0.1471, simple_loss=0.2407, pruned_loss=0.02679, over 4623515.87 frames. ], batch size: 68, lr: 3.02e-03, grad_scale: 32.0 2023-12-04 23:39:52,477 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.21 vs. 
limit=15.0 2023-12-04 23:40:03,638 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=481800.0, ans=0.125 2023-12-04 23:40:12,141 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=481800.0, ans=0.125 2023-12-04 23:40:33,895 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:40:35,928 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=481933.3333333333, ans=0.035 2023-12-04 23:40:39,250 INFO [train.py:1087] (2/4) Epoch 81, batch 700, loss[loss=0.155, simple_loss=0.249, pruned_loss=0.0305, over 23609.00 frames. ], tot_loss[loss=0.1469, simple_loss=0.2405, pruned_loss=0.02663, over 4653864.38 frames. ], batch size: 94, lr: 3.02e-03, grad_scale: 32.0 2023-12-04 23:41:02,222 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:41:06,125 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=482133.3333333333, ans=0.125 2023-12-04 23:41:08,948 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.089e+02 1.264e+02 1.382e+02 1.480e+02 1.897e+02, threshold=2.765e+02, percent-clipped=0.0 2023-12-04 23:41:13,850 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=482133.3333333333, ans=0.0 2023-12-04 23:41:18,773 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=482200.0, ans=0.125 2023-12-04 23:41:29,929 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=482266.6666666667, ans=0.125 2023-12-04 23:41:41,145 INFO [train.py:1087] (2/4) Epoch 81, batch 750, loss[loss=0.1399, simple_loss=0.2366, pruned_loss=0.02158, over 24870.00 frames. ], tot_loss[loss=0.1469, simple_loss=0.2405, pruned_loss=0.02665, over 4680622.63 frames. ], batch size: 68, lr: 3.02e-03, grad_scale: 32.0 2023-12-04 23:41:47,799 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=482333.3333333333, ans=0.0 2023-12-04 23:41:53,935 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.85 vs. limit=15.0 2023-12-04 23:42:26,311 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=482533.3333333333, ans=0.0 2023-12-04 23:42:27,493 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=482533.3333333333, ans=0.125 2023-12-04 23:42:38,038 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:42:41,428 INFO [train.py:1087] (2/4) Epoch 81, batch 800, loss[loss=0.1416, simple_loss=0.2337, pruned_loss=0.02473, over 24566.00 frames. ], tot_loss[loss=0.1468, simple_loss=0.2404, pruned_loss=0.02663, over 4706374.85 frames. 
], batch size: 62, lr: 3.02e-03, grad_scale: 32.0 2023-12-04 23:42:49,875 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=482666.6666666667, ans=0.125 2023-12-04 23:42:59,559 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=482733.3333333333, ans=0.125 2023-12-04 23:43:00,534 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=482733.3333333333, ans=0.125 2023-12-04 23:43:08,907 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.107e+02 1.262e+02 1.349e+02 1.462e+02 1.885e+02, threshold=2.698e+02, percent-clipped=0.0 2023-12-04 23:43:37,364 INFO [train.py:1087] (2/4) Epoch 81, batch 850, loss[loss=0.1329, simple_loss=0.2274, pruned_loss=0.0192, over 24576.00 frames. ], tot_loss[loss=0.1469, simple_loss=0.2405, pruned_loss=0.02667, over 4736336.01 frames. ], batch size: 65, lr: 3.02e-03, grad_scale: 32.0 2023-12-04 23:43:46,213 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=483000.0, ans=0.125 2023-12-04 23:43:52,020 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=483066.6666666667, ans=0.125 2023-12-04 23:43:53,467 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.61 vs. limit=15.0 2023-12-04 23:43:57,321 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=483066.6666666667, ans=0.125 2023-12-04 23:44:01,668 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=483133.3333333333, ans=0.0 2023-12-04 23:44:02,871 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=483133.3333333333, ans=0.125 2023-12-04 23:44:07,143 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=483133.3333333333, ans=0.0 2023-12-04 23:44:09,277 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=483200.0, ans=0.2 2023-12-04 23:44:42,163 INFO [train.py:1087] (2/4) Epoch 82, batch 0, loss[loss=0.1349, simple_loss=0.23, pruned_loss=0.01989, over 24705.00 frames. ], tot_loss[loss=0.1349, simple_loss=0.23, pruned_loss=0.01989, over 24705.00 frames. ], batch size: 69, lr: 3.00e-03, grad_scale: 32.0 2023-12-04 23:44:42,165 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-04 23:44:55,458 INFO [train.py:1119] (2/4) Epoch 82, validation: loss=0.1511, simple_loss=0.2466, pruned_loss=0.02783, over 944034.00 frames. 2023-12-04 23:44:55,459 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-04 23:45:15,360 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.08 vs. 
limit=22.5 2023-12-04 23:45:30,354 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.104e+02 1.295e+02 1.378e+02 1.508e+02 2.453e+02, threshold=2.755e+02, percent-clipped=0.0 2023-12-04 23:45:44,876 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=483566.6666666667, ans=0.125 2023-12-04 23:45:56,416 INFO [train.py:1087] (2/4) Epoch 82, batch 50, loss[loss=0.1345, simple_loss=0.2281, pruned_loss=0.0204, over 24735.00 frames. ], tot_loss[loss=0.1466, simple_loss=0.2404, pruned_loss=0.02638, over 1076372.90 frames. ], batch size: 67, lr: 3.00e-03, grad_scale: 32.0 2023-12-04 23:46:24,765 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=483766.6666666667, ans=0.2 2023-12-04 23:46:30,925 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=483766.6666666667, ans=0.2 2023-12-04 23:46:36,889 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=483833.3333333333, ans=0.1 2023-12-04 23:46:38,068 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=483833.3333333333, ans=0.0 2023-12-04 23:46:43,942 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=483900.0, ans=0.09899494936611666 2023-12-04 23:46:57,130 INFO [train.py:1087] (2/4) Epoch 82, batch 100, loss[loss=0.1379, simple_loss=0.2313, pruned_loss=0.02224, over 24782.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2405, pruned_loss=0.02677, over 1896764.78 frames. ], batch size: 73, lr: 3.00e-03, grad_scale: 16.0 2023-12-04 23:47:01,716 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=483966.6666666667, ans=0.125 2023-12-04 23:47:04,538 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=483966.6666666667, ans=0.0 2023-12-04 23:47:34,848 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.080e+02 1.246e+02 1.323e+02 1.441e+02 1.748e+02, threshold=2.646e+02, percent-clipped=0.0 2023-12-04 23:47:36,278 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=484166.6666666667, ans=0.125 2023-12-04 23:47:37,405 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=484166.6666666667, ans=0.125 2023-12-04 23:47:43,167 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=484166.6666666667, ans=0.2 2023-12-04 23:47:56,826 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=484300.0, ans=0.125 2023-12-04 23:47:58,392 INFO [train.py:1087] (2/4) Epoch 82, batch 150, loss[loss=0.1513, simple_loss=0.2445, pruned_loss=0.02902, over 24119.00 frames. ], tot_loss[loss=0.1469, simple_loss=0.2403, pruned_loss=0.02678, over 2526114.75 frames. ], batch size: 58, lr: 3.00e-03, grad_scale: 16.0 2023-12-04 23:48:10,621 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.57 vs. 
limit=10.0 2023-12-04 23:48:12,856 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=484366.6666666667, ans=0.2 2023-12-04 23:48:14,404 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=484366.6666666667, ans=10.0 2023-12-04 23:48:21,194 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2023-12-04 23:48:59,557 INFO [train.py:1087] (2/4) Epoch 82, batch 200, loss[loss=0.1442, simple_loss=0.238, pruned_loss=0.02523, over 24557.00 frames. ], tot_loss[loss=0.1466, simple_loss=0.24, pruned_loss=0.02655, over 3029991.52 frames. ], batch size: 62, lr: 3.00e-03, grad_scale: 8.0 2023-12-04 23:49:13,047 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=484700.0, ans=0.0 2023-12-04 23:49:31,482 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.17 vs. limit=22.5 2023-12-04 23:49:36,790 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.146e+02 1.311e+02 1.433e+02 1.635e+02 2.108e+02, threshold=2.866e+02, percent-clipped=0.0 2023-12-04 23:50:00,367 INFO [train.py:1087] (2/4) Epoch 82, batch 250, loss[loss=0.1388, simple_loss=0.2329, pruned_loss=0.0223, over 24576.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.2399, pruned_loss=0.02637, over 3434862.35 frames. ], batch size: 65, lr: 2.99e-03, grad_scale: 8.0 2023-12-04 23:50:05,276 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=484966.6666666667, ans=0.1 2023-12-04 23:50:53,060 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=485233.3333333333, ans=0.07 2023-12-04 23:50:55,505 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=485233.3333333333, ans=0.2 2023-12-04 23:51:01,008 INFO [train.py:1087] (2/4) Epoch 82, batch 300, loss[loss=0.1579, simple_loss=0.2512, pruned_loss=0.03236, over 24556.00 frames. ], tot_loss[loss=0.1462, simple_loss=0.2397, pruned_loss=0.02638, over 3752358.39 frames. 
], batch size: 63, lr: 2.99e-03, grad_scale: 8.0 2023-12-04 23:51:05,827 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=485300.0, ans=0.0 2023-12-04 23:51:07,394 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=485300.0, ans=0.0 2023-12-04 23:51:12,060 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=485366.6666666667, ans=0.125 2023-12-04 23:51:38,091 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.131e+02 1.293e+02 1.389e+02 1.563e+02 1.990e+02, threshold=2.777e+02, percent-clipped=0.0 2023-12-04 23:51:40,777 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:51:56,837 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=485566.6666666667, ans=0.125 2023-12-04 23:52:00,130 INFO [train.py:1087] (2/4) Epoch 82, batch 350, loss[loss=0.1464, simple_loss=0.2417, pruned_loss=0.02557, over 24333.00 frames. ], tot_loss[loss=0.1467, simple_loss=0.2402, pruned_loss=0.0266, over 3980434.04 frames. ], batch size: 79, lr: 2.99e-03, grad_scale: 8.0 2023-12-04 23:52:13,284 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=485700.0, ans=0.0 2023-12-04 23:52:23,222 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.05 vs. limit=12.0 2023-12-04 23:52:27,627 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=485766.6666666667, ans=0.125 2023-12-04 23:52:38,340 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=485833.3333333333, ans=0.0 2023-12-04 23:52:38,641 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. limit=6.0 2023-12-04 23:52:54,637 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=485900.0, ans=0.125 2023-12-04 23:52:58,067 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=485900.0, ans=0.125 2023-12-04 23:53:01,248 INFO [train.py:1087] (2/4) Epoch 82, batch 400, loss[loss=0.1473, simple_loss=0.2379, pruned_loss=0.02836, over 24681.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2405, pruned_loss=0.02668, over 4154807.76 frames. ], batch size: 74, lr: 2.99e-03, grad_scale: 16.0 2023-12-04 23:53:06,234 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=485966.6666666667, ans=0.125 2023-12-04 23:53:09,775 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=485966.6666666667, ans=0.1 2023-12-04 23:53:20,762 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=486033.3333333333, ans=0.0 2023-12-04 23:53:27,569 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.73 vs. 
limit=10.0 2023-12-04 23:53:32,031 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=486100.0, ans=0.125 2023-12-04 23:53:38,969 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=486166.6666666667, ans=0.125 2023-12-04 23:53:39,809 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.142e+02 1.288e+02 1.362e+02 1.482e+02 1.838e+02, threshold=2.724e+02, percent-clipped=0.0 2023-12-04 23:53:42,968 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.85 vs. limit=12.0 2023-12-04 23:54:03,634 INFO [train.py:1087] (2/4) Epoch 82, batch 450, loss[loss=0.1395, simple_loss=0.2381, pruned_loss=0.02043, over 24716.00 frames. ], tot_loss[loss=0.1471, simple_loss=0.2405, pruned_loss=0.02683, over 4300003.18 frames. ], batch size: 74, lr: 2.99e-03, grad_scale: 16.0 2023-12-04 23:54:08,460 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=486300.0, ans=0.125 2023-12-04 23:54:17,863 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=486366.6666666667, ans=0.125 2023-12-04 23:54:21,335 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=486366.6666666667, ans=0.125 2023-12-04 23:54:21,807 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.95 vs. limit=15.0 2023-12-04 23:54:40,265 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=486500.0, ans=0.125 2023-12-04 23:54:40,298 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=486500.0, ans=0.125 2023-12-04 23:54:41,379 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=486500.0, ans=0.125 2023-12-04 23:54:50,771 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=486566.6666666667, ans=0.0 2023-12-04 23:55:03,900 INFO [train.py:1087] (2/4) Epoch 82, batch 500, loss[loss=0.1586, simple_loss=0.2482, pruned_loss=0.03448, over 24216.00 frames. ], tot_loss[loss=0.1471, simple_loss=0.2406, pruned_loss=0.02686, over 4415386.49 frames. ], batch size: 82, lr: 2.99e-03, grad_scale: 16.0 2023-12-04 23:55:08,841 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=486633.3333333333, ans=0.0 2023-12-04 23:55:39,555 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=486833.3333333333, ans=10.0 2023-12-04 23:55:40,896 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=486833.3333333333, ans=0.125 2023-12-04 23:55:41,002 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.30 vs. 
limit=22.5 2023-12-04 23:55:41,585 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.108e+02 1.266e+02 1.337e+02 1.455e+02 2.001e+02, threshold=2.675e+02, percent-clipped=0.0 2023-12-04 23:55:41,838 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=486833.3333333333, ans=0.07 2023-12-04 23:55:56,209 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=22.5 2023-12-04 23:56:04,002 INFO [train.py:1087] (2/4) Epoch 82, batch 550, loss[loss=0.1411, simple_loss=0.2359, pruned_loss=0.02315, over 23568.00 frames. ], tot_loss[loss=0.1469, simple_loss=0.2405, pruned_loss=0.02671, over 4502267.03 frames. ], batch size: 95, lr: 2.99e-03, grad_scale: 16.0 2023-12-04 23:56:05,876 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=486966.6666666667, ans=0.95 2023-12-04 23:56:07,458 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=486966.6666666667, ans=0.0 2023-12-04 23:56:11,366 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.45 vs. limit=10.0 2023-12-04 23:56:17,624 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=487033.3333333333, ans=0.1 2023-12-04 23:56:35,212 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=487100.0, ans=0.2 2023-12-04 23:56:44,881 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.58 vs. limit=15.0 2023-12-04 23:56:54,928 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=487233.3333333333, ans=0.0 2023-12-04 23:56:59,196 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:57:00,902 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.00 vs. limit=6.0 2023-12-04 23:57:04,685 INFO [train.py:1087] (2/4) Epoch 82, batch 600, loss[loss=0.1554, simple_loss=0.2506, pruned_loss=0.0301, over 23350.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.2409, pruned_loss=0.02683, over 4561043.50 frames. ], batch size: 94, lr: 2.99e-03, grad_scale: 16.0 2023-12-04 23:57:20,872 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=487366.6666666667, ans=0.125 2023-12-04 23:57:33,683 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=487433.3333333333, ans=0.0 2023-12-04 23:57:41,668 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.120e+02 1.269e+02 1.328e+02 1.424e+02 1.653e+02, threshold=2.655e+02, percent-clipped=0.0 2023-12-04 23:57:48,852 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.05 vs. 
limit=15.0 2023-12-04 23:58:03,984 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.97 vs. limit=12.0 2023-12-04 23:58:04,391 INFO [train.py:1087] (2/4) Epoch 82, batch 650, loss[loss=0.141, simple_loss=0.2412, pruned_loss=0.02041, over 24680.00 frames. ], tot_loss[loss=0.1466, simple_loss=0.2404, pruned_loss=0.02641, over 4623246.63 frames. ], batch size: 74, lr: 2.99e-03, grad_scale: 16.0 2023-12-04 23:58:30,966 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:58:31,512 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.96 vs. limit=12.0 2023-12-04 23:58:32,190 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=487766.6666666667, ans=0.125 2023-12-04 23:58:52,347 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=487900.0, ans=0.2 2023-12-04 23:59:00,692 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=487900.0, ans=15.0 2023-12-04 23:59:04,989 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=487966.6666666667, ans=0.0 2023-12-04 23:59:05,940 INFO [train.py:1087] (2/4) Epoch 82, batch 700, loss[loss=0.1533, simple_loss=0.2468, pruned_loss=0.02988, over 24737.00 frames. ], tot_loss[loss=0.1467, simple_loss=0.2405, pruned_loss=0.02641, over 4657581.29 frames. ], batch size: 61, lr: 2.99e-03, grad_scale: 16.0 2023-12-04 23:59:13,328 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=487966.6666666667, ans=0.015 2023-12-04 23:59:16,882 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=488033.3333333333, ans=0.0 2023-12-04 23:59:18,385 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-12-04 23:59:32,759 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=488100.0, ans=0.1 2023-12-04 23:59:41,917 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=488166.6666666667, ans=0.95 2023-12-04 23:59:43,846 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.120e+02 1.242e+02 1.310e+02 1.427e+02 1.837e+02, threshold=2.620e+02, percent-clipped=0.0 2023-12-04 23:59:44,284 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=488166.6666666667, ans=0.125 2023-12-04 23:59:46,492 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=488166.6666666667, ans=0.0 2023-12-04 23:59:49,877 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=488166.6666666667, ans=0.0 2023-12-04 23:59:54,926 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.21 vs. 
limit=15.0 2023-12-05 00:00:07,473 INFO [train.py:1087] (2/4) Epoch 82, batch 750, loss[loss=0.1351, simple_loss=0.2283, pruned_loss=0.02097, over 24707.00 frames. ], tot_loss[loss=0.1468, simple_loss=0.2405, pruned_loss=0.02653, over 4673586.96 frames. ], batch size: 74, lr: 2.98e-03, grad_scale: 16.0 2023-12-05 00:00:09,344 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.51 vs. limit=15.0 2023-12-05 00:00:24,544 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.57 vs. limit=15.0 2023-12-05 00:01:04,655 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=488566.6666666667, ans=0.035 2023-12-05 00:01:07,136 INFO [train.py:1087] (2/4) Epoch 82, batch 800, loss[loss=0.1481, simple_loss=0.2475, pruned_loss=0.02434, over 22922.00 frames. ], tot_loss[loss=0.1467, simple_loss=0.2404, pruned_loss=0.0265, over 4717637.59 frames. ], batch size: 106, lr: 2.98e-03, grad_scale: 32.0 2023-12-05 00:01:32,999 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=488766.6666666667, ans=0.1 2023-12-05 00:01:38,697 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.44 vs. limit=15.0 2023-12-05 00:01:42,593 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.108e+02 1.299e+02 1.380e+02 1.506e+02 1.865e+02, threshold=2.761e+02, percent-clipped=0.0 2023-12-05 00:01:49,322 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=488833.3333333333, ans=0.125 2023-12-05 00:01:58,830 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=488900.0, ans=0.1 2023-12-05 00:02:02,896 INFO [train.py:1087] (2/4) Epoch 82, batch 850, loss[loss=0.1383, simple_loss=0.2332, pruned_loss=0.0217, over 24564.00 frames. ], tot_loss[loss=0.1462, simple_loss=0.2399, pruned_loss=0.02626, over 4757929.96 frames. ], batch size: 62, lr: 2.98e-03, grad_scale: 32.0 2023-12-05 00:02:03,395 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.58 vs. limit=22.5 2023-12-05 00:02:06,476 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-12-05 00:03:07,199 INFO [train.py:1087] (2/4) Epoch 83, batch 0, loss[loss=0.1341, simple_loss=0.2279, pruned_loss=0.02019, over 24611.00 frames. ], tot_loss[loss=0.1341, simple_loss=0.2279, pruned_loss=0.02019, over 24611.00 frames. ], batch size: 68, lr: 2.96e-03, grad_scale: 32.0 2023-12-05 00:03:07,200 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-05 00:03:21,102 INFO [train.py:1119] (2/4) Epoch 83, validation: loss=0.1508, simple_loss=0.2463, pruned_loss=0.02768, over 944034.00 frames. 
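[Editor's note, a hedged aside on the optim.py lines above: each "Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=..." entry prints five gradient-norm statistics plus a clipping threshold, and throughout this section the threshold equals Clipping_scale times the middle value (for example 2.0 x 1.323e+02 = 2.646e+02), which suggests the threshold is derived from a running median of recent gradient norms. The sketch below is a minimal, assumed reconstruction of that bookkeeping only, not the actual icefall optim.py code; the class name GradNormStats, the window size, the interpretation of the five values as min/25%/50%/75%/max, and the quantile formula are all illustrative.]

# Hedged illustration (not the icefall optim.py implementation): one way the
# "grad-norm quartiles ... threshold ... percent-clipped" log lines could be
# produced from a sliding window of recent per-step gradient norms.
from collections import deque

class GradNormStats:
    def __init__(self, clipping_scale: float = 2.0, window: int = 200):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)  # most recent gradient norms

    def update(self, grad_norm: float) -> None:
        self.norms.append(grad_norm)

    def summary(self) -> str:
        xs = sorted(self.norms)
        n = len(xs)
        # Five statistics, taken here as min, 25%, 50%, 75%, max (assumption).
        quartiles = [xs[min(n - 1, int(q * n))] for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
        median = quartiles[2]
        # Observed in the log: threshold = Clipping_scale * median.
        threshold = self.clipping_scale * median
        clipped = sum(1 for x in self.norms if x > threshold)
        percent_clipped = 100.0 * clipped / n
        return ("grad-norm quartiles "
                + " ".join(f"{q:.3e}" for q in quartiles)
                + f", threshold={threshold:.3e}, percent-clipped={percent_clipped:.1f}")

# Example: feed in a few synthetic norms and print one summary line.
stats = GradNormStats(clipping_scale=2.0)
for v in [110.0, 124.0, 132.0, 144.0, 180.0, 205.0]:
    stats.update(v)
print(stats.summary())

[Only the logging side is sketched here; an actual optimizer would additionally decide whether to rescale the gradient when its norm exceeds the threshold, which is what the percent-clipped figure tracks. End of editor's note; the log continues below.]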
2023-12-05 00:03:21,103 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-05 00:03:27,262 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=489266.6666666667, ans=0.0 2023-12-05 00:03:43,912 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=489333.3333333333, ans=0.0 2023-12-05 00:04:04,726 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.101e+02 1.317e+02 1.403e+02 1.518e+02 2.335e+02, threshold=2.805e+02, percent-clipped=0.0 2023-12-05 00:04:15,245 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.12 vs. limit=22.5 2023-12-05 00:04:21,590 INFO [train.py:1087] (2/4) Epoch 83, batch 50, loss[loss=0.1481, simple_loss=0.2483, pruned_loss=0.02396, over 21401.00 frames. ], tot_loss[loss=0.1459, simple_loss=0.2399, pruned_loss=0.026, over 1089114.91 frames. ], batch size: 127, lr: 2.96e-03, grad_scale: 32.0 2023-12-05 00:04:35,444 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.44 vs. limit=15.0 2023-12-05 00:04:52,880 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=489733.3333333333, ans=0.125 2023-12-05 00:05:12,987 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=489866.6666666667, ans=0.125 2023-12-05 00:05:15,155 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=489866.6666666667, ans=0.0 2023-12-05 00:05:20,973 INFO [train.py:1087] (2/4) Epoch 83, batch 100, loss[loss=0.1478, simple_loss=0.2406, pruned_loss=0.02752, over 24138.00 frames. ], tot_loss[loss=0.1458, simple_loss=0.24, pruned_loss=0.02584, over 1919332.42 frames. ], batch size: 82, lr: 2.96e-03, grad_scale: 32.0 2023-12-05 00:05:34,278 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=490000.0, ans=0.0 2023-12-05 00:05:38,115 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=490000.0, ans=0.0 2023-12-05 00:05:48,232 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.05 vs. limit=15.0 2023-12-05 00:06:04,453 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.049e+02 1.256e+02 1.358e+02 1.501e+02 1.956e+02, threshold=2.715e+02, percent-clipped=0.0 2023-12-05 00:06:20,842 INFO [train.py:1087] (2/4) Epoch 83, batch 150, loss[loss=0.1376, simple_loss=0.2287, pruned_loss=0.02324, over 24759.00 frames. ], tot_loss[loss=0.1459, simple_loss=0.2397, pruned_loss=0.02602, over 2577075.33 frames. ], batch size: 66, lr: 2.96e-03, grad_scale: 32.0 2023-12-05 00:06:31,857 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=490333.3333333333, ans=0.09899494936611666 2023-12-05 00:06:50,488 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.85 vs. 
limit=15.0 2023-12-05 00:06:53,599 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=490400.0, ans=0.125 2023-12-05 00:07:21,714 INFO [train.py:1087] (2/4) Epoch 83, batch 200, loss[loss=0.139, simple_loss=0.2347, pruned_loss=0.02167, over 24769.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.2399, pruned_loss=0.02639, over 3067043.24 frames. ], batch size: 65, lr: 2.96e-03, grad_scale: 16.0 2023-12-05 00:07:31,411 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=490600.0, ans=0.0 2023-12-05 00:07:34,278 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=490666.6666666667, ans=0.125 2023-12-05 00:07:44,110 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=490666.6666666667, ans=0.2 2023-12-05 00:07:45,735 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=490733.3333333333, ans=0.125 2023-12-05 00:08:07,064 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.119e+02 1.244e+02 1.327e+02 1.425e+02 1.700e+02, threshold=2.653e+02, percent-clipped=0.0 2023-12-05 00:08:13,070 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=490866.6666666667, ans=0.0 2023-12-05 00:08:18,845 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.23 vs. limit=15.0 2023-12-05 00:08:23,351 INFO [train.py:1087] (2/4) Epoch 83, batch 250, loss[loss=0.1605, simple_loss=0.2525, pruned_loss=0.0342, over 24300.00 frames. ], tot_loss[loss=0.1468, simple_loss=0.2404, pruned_loss=0.02663, over 3454483.42 frames. ], batch size: 79, lr: 2.96e-03, grad_scale: 16.0 2023-12-05 00:08:32,144 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.59 vs. limit=15.0 2023-12-05 00:08:45,609 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.91 vs. limit=15.0 2023-12-05 00:08:56,925 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=491066.6666666667, ans=0.125 2023-12-05 00:09:08,645 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.06 vs. limit=15.0 2023-12-05 00:09:10,462 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=491200.0, ans=0.125 2023-12-05 00:09:23,696 INFO [train.py:1087] (2/4) Epoch 83, batch 300, loss[loss=0.1611, simple_loss=0.2541, pruned_loss=0.03407, over 24200.00 frames. ], tot_loss[loss=0.1468, simple_loss=0.2405, pruned_loss=0.02658, over 3767902.88 frames. ], batch size: 82, lr: 2.96e-03, grad_scale: 16.0 2023-12-05 00:09:37,771 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.85 vs. 
limit=15.0 2023-12-05 00:09:42,004 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=491333.3333333333, ans=0.0 2023-12-05 00:09:47,629 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=491400.0, ans=0.2 2023-12-05 00:10:08,637 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.140e+02 1.277e+02 1.353e+02 1.463e+02 1.799e+02, threshold=2.707e+02, percent-clipped=0.0 2023-12-05 00:10:10,183 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=491466.6666666667, ans=0.0 2023-12-05 00:10:23,401 INFO [train.py:1087] (2/4) Epoch 83, batch 350, loss[loss=0.1471, simple_loss=0.2438, pruned_loss=0.02519, over 24765.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2407, pruned_loss=0.02668, over 3997640.29 frames. ], batch size: 65, lr: 2.96e-03, grad_scale: 16.0 2023-12-05 00:10:33,925 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=491600.0, ans=0.125 2023-12-05 00:10:47,573 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=491733.3333333333, ans=0.1 2023-12-05 00:11:25,199 INFO [train.py:1087] (2/4) Epoch 83, batch 400, loss[loss=0.1579, simple_loss=0.2498, pruned_loss=0.03302, over 17345.00 frames. ], tot_loss[loss=0.1471, simple_loss=0.2409, pruned_loss=0.02669, over 4166839.40 frames. ], batch size: 177, lr: 2.95e-03, grad_scale: 32.0 2023-12-05 00:11:25,370 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=491933.3333333333, ans=0.125 2023-12-05 00:11:32,632 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-12-05 00:11:43,749 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=492000.0, ans=0.125 2023-12-05 00:11:52,039 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.62 vs. limit=6.0 2023-12-05 00:11:52,699 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=492066.6666666667, ans=0.125 2023-12-05 00:12:05,076 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=492133.3333333333, ans=10.0 2023-12-05 00:12:09,422 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.140e+02 1.296e+02 1.398e+02 1.505e+02 2.033e+02, threshold=2.797e+02, percent-clipped=0.0 2023-12-05 00:12:23,892 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=492200.0, ans=0.1 2023-12-05 00:12:25,969 INFO [train.py:1087] (2/4) Epoch 83, batch 450, loss[loss=0.1394, simple_loss=0.2312, pruned_loss=0.0238, over 24544.00 frames. ], tot_loss[loss=0.1468, simple_loss=0.2405, pruned_loss=0.02659, over 4319322.83 frames. ], batch size: 62, lr: 2.95e-03, grad_scale: 32.0 2023-12-05 00:12:32,618 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.57 vs. 
limit=6.0 2023-12-05 00:12:33,467 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.16 vs. limit=15.0 2023-12-05 00:12:35,827 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.10 vs. limit=15.0 2023-12-05 00:12:37,208 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.94 vs. limit=15.0 2023-12-05 00:12:41,576 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-05 00:12:46,535 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.75 vs. limit=10.0 2023-12-05 00:13:14,575 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=492533.3333333333, ans=0.125 2023-12-05 00:13:18,093 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=492533.3333333333, ans=0.09899494936611666 2023-12-05 00:13:22,059 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.34 vs. limit=15.0 2023-12-05 00:13:25,969 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=492600.0, ans=0.0 2023-12-05 00:13:27,558 INFO [train.py:1087] (2/4) Epoch 83, batch 500, loss[loss=0.1646, simple_loss=0.2487, pruned_loss=0.04025, over 24279.00 frames. ], tot_loss[loss=0.1469, simple_loss=0.2404, pruned_loss=0.02667, over 4423740.07 frames. ], batch size: 79, lr: 2.95e-03, grad_scale: 32.0 2023-12-05 00:13:35,299 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.79 vs. limit=22.5 2023-12-05 00:13:44,295 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=492666.6666666667, ans=0.09899494936611666 2023-12-05 00:13:54,690 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=492733.3333333333, ans=0.0 2023-12-05 00:14:05,675 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=492800.0, ans=0.0 2023-12-05 00:14:11,606 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-05 00:14:12,435 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.097e+02 1.267e+02 1.360e+02 1.459e+02 2.108e+02, threshold=2.720e+02, percent-clipped=0.0 2023-12-05 00:14:12,673 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=492800.0, ans=0.5 2023-12-05 00:14:19,803 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=492866.6666666667, ans=0.125 2023-12-05 00:14:27,101 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.23 vs. 
limit=15.0 2023-12-05 00:14:27,536 INFO [train.py:1087] (2/4) Epoch 83, batch 550, loss[loss=0.1418, simple_loss=0.2347, pruned_loss=0.02452, over 24745.00 frames. ], tot_loss[loss=0.1467, simple_loss=0.2402, pruned_loss=0.02656, over 4519887.14 frames. ], batch size: 63, lr: 2.95e-03, grad_scale: 32.0 2023-12-05 00:14:27,822 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=492933.3333333333, ans=0.09899494936611666 2023-12-05 00:14:30,335 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=492933.3333333333, ans=0.1 2023-12-05 00:14:34,433 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-05 00:14:41,028 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=493000.0, ans=0.1 2023-12-05 00:15:01,980 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=493066.6666666667, ans=0.0 2023-12-05 00:15:03,117 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=493133.3333333333, ans=0.125 2023-12-05 00:15:13,012 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=493133.3333333333, ans=0.125 2023-12-05 00:15:28,166 INFO [train.py:1087] (2/4) Epoch 83, batch 600, loss[loss=0.1583, simple_loss=0.2515, pruned_loss=0.03253, over 24022.00 frames. ], tot_loss[loss=0.1467, simple_loss=0.2402, pruned_loss=0.02662, over 4578046.74 frames. ], batch size: 87, lr: 2.95e-03, grad_scale: 16.0 2023-12-05 00:15:33,815 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=22.5 2023-12-05 00:15:39,546 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-05 00:16:03,164 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=493400.0, ans=0.1 2023-12-05 00:16:06,675 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=493466.6666666667, ans=0.2 2023-12-05 00:16:15,478 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.085e+02 1.284e+02 1.384e+02 1.504e+02 1.924e+02, threshold=2.768e+02, percent-clipped=0.0 2023-12-05 00:16:30,191 INFO [train.py:1087] (2/4) Epoch 83, batch 650, loss[loss=0.142, simple_loss=0.2332, pruned_loss=0.02539, over 24539.00 frames. ], tot_loss[loss=0.1468, simple_loss=0.2403, pruned_loss=0.02668, over 4604735.77 frames. ], batch size: 62, lr: 2.95e-03, grad_scale: 16.0 2023-12-05 00:16:30,903 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.63 vs. 
limit=15.0 2023-12-05 00:16:39,677 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=493600.0, ans=0.125 2023-12-05 00:16:46,055 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=493666.6666666667, ans=0.125 2023-12-05 00:17:03,563 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=493733.3333333333, ans=0.125 2023-12-05 00:17:18,087 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=493866.6666666667, ans=0.1 2023-12-05 00:17:30,600 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=493933.3333333333, ans=0.125 2023-12-05 00:17:31,468 INFO [train.py:1087] (2/4) Epoch 83, batch 700, loss[loss=0.1523, simple_loss=0.2457, pruned_loss=0.02941, over 22841.00 frames. ], tot_loss[loss=0.1469, simple_loss=0.2404, pruned_loss=0.02676, over 4643722.85 frames. ], batch size: 106, lr: 2.95e-03, grad_scale: 16.0 2023-12-05 00:17:40,277 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=493933.3333333333, ans=0.2 2023-12-05 00:17:41,322 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=493933.3333333333, ans=0.125 2023-12-05 00:17:53,222 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=494000.0, ans=0.125 2023-12-05 00:18:17,567 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.105e+02 1.272e+02 1.376e+02 1.523e+02 1.930e+02, threshold=2.752e+02, percent-clipped=0.0 2023-12-05 00:18:24,053 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.52 vs. limit=15.0 2023-12-05 00:18:33,254 INFO [train.py:1087] (2/4) Epoch 83, batch 750, loss[loss=0.1415, simple_loss=0.2377, pruned_loss=0.0226, over 23337.00 frames. ], tot_loss[loss=0.1467, simple_loss=0.2403, pruned_loss=0.02656, over 4670394.58 frames. ], batch size: 94, lr: 2.95e-03, grad_scale: 16.0 2023-12-05 00:18:45,333 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=494333.3333333333, ans=0.125 2023-12-05 00:19:01,008 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=494400.0, ans=0.125 2023-12-05 00:19:33,755 INFO [train.py:1087] (2/4) Epoch 83, batch 800, loss[loss=0.1466, simple_loss=0.2364, pruned_loss=0.02839, over 24762.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.2397, pruned_loss=0.02625, over 4722377.57 frames. ], batch size: 66, lr: 2.95e-03, grad_scale: 32.0 2023-12-05 00:19:39,659 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=4.91 vs. 
limit=15.0 2023-12-05 00:19:41,676 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=494600.0, ans=0.125 2023-12-05 00:19:58,762 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=494733.3333333333, ans=0.1 2023-12-05 00:20:03,438 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.65 vs. limit=15.0 2023-12-05 00:20:06,357 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=494733.3333333333, ans=0.0 2023-12-05 00:20:17,144 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.277e+02 1.353e+02 1.486e+02 1.861e+02, threshold=2.707e+02, percent-clipped=0.0 2023-12-05 00:20:21,793 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=494866.6666666667, ans=0.125 2023-12-05 00:20:30,208 INFO [train.py:1087] (2/4) Epoch 83, batch 850, loss[loss=0.1563, simple_loss=0.2497, pruned_loss=0.03144, over 24319.00 frames. ], tot_loss[loss=0.1464, simple_loss=0.2399, pruned_loss=0.02644, over 4753823.92 frames. ], batch size: 79, lr: 2.95e-03, grad_scale: 32.0 2023-12-05 00:20:33,557 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=494933.3333333333, ans=0.125 2023-12-05 00:20:33,584 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=494933.3333333333, ans=0.0 2023-12-05 00:20:51,314 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=495066.6666666667, ans=0.125 2023-12-05 00:21:00,087 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=495066.6666666667, ans=0.125 2023-12-05 00:21:03,506 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.62 vs. limit=15.0 2023-12-05 00:21:04,426 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=495133.3333333333, ans=0.09899494936611666 2023-12-05 00:21:06,823 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=495133.3333333333, ans=0.2 2023-12-05 00:21:10,948 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=495133.3333333333, ans=0.125 2023-12-05 00:21:13,417 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-12-05 00:21:16,476 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.61 vs. limit=15.0 2023-12-05 00:21:30,955 INFO [train.py:1087] (2/4) Epoch 84, batch 0, loss[loss=0.1409, simple_loss=0.2374, pruned_loss=0.02214, over 24802.00 frames. ], tot_loss[loss=0.1409, simple_loss=0.2374, pruned_loss=0.02214, over 24802.00 frames. 
], batch size: 62, lr: 2.93e-03, grad_scale: 32.0 2023-12-05 00:21:30,956 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-05 00:21:44,643 INFO [train.py:1119] (2/4) Epoch 84, validation: loss=0.1501, simple_loss=0.2459, pruned_loss=0.0271, over 944034.00 frames. 2023-12-05 00:21:44,644 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-05 00:21:57,805 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=495300.0, ans=0.0 2023-12-05 00:22:35,876 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.76 vs. limit=22.5 2023-12-05 00:22:36,153 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.099e+02 1.295e+02 1.385e+02 1.530e+02 2.174e+02, threshold=2.770e+02, percent-clipped=0.0 2023-12-05 00:22:42,074 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-05 00:22:45,480 INFO [train.py:1087] (2/4) Epoch 84, batch 50, loss[loss=0.1554, simple_loss=0.2468, pruned_loss=0.03204, over 24294.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2423, pruned_loss=0.02734, over 1091361.29 frames. ], batch size: 79, lr: 2.93e-03, grad_scale: 32.0 2023-12-05 00:23:24,579 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=495766.6666666667, ans=0.125 2023-12-05 00:23:41,876 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=495833.3333333333, ans=0.125 2023-12-05 00:23:45,195 INFO [train.py:1087] (2/4) Epoch 84, batch 100, loss[loss=0.1516, simple_loss=0.2442, pruned_loss=0.02954, over 22819.00 frames. ], tot_loss[loss=0.1469, simple_loss=0.241, pruned_loss=0.02637, over 1917331.24 frames. ], batch size: 106, lr: 2.93e-03, grad_scale: 32.0 2023-12-05 00:24:19,103 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=496033.3333333333, ans=0.0 2023-12-05 00:24:21,590 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.24 vs. limit=12.0 2023-12-05 00:24:31,914 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.60 vs. limit=22.5 2023-12-05 00:24:38,156 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.134e+02 1.257e+02 1.321e+02 1.447e+02 1.790e+02, threshold=2.642e+02, percent-clipped=0.0 2023-12-05 00:24:46,811 INFO [train.py:1087] (2/4) Epoch 84, batch 150, loss[loss=0.1533, simple_loss=0.2402, pruned_loss=0.03316, over 24520.00 frames. ], tot_loss[loss=0.1466, simple_loss=0.2404, pruned_loss=0.02639, over 2546398.06 frames. 
], batch size: 75, lr: 2.92e-03, grad_scale: 32.0 2023-12-05 00:24:47,171 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=496233.3333333333, ans=0.05 2023-12-05 00:24:49,347 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=496233.3333333333, ans=0.2 2023-12-05 00:24:52,948 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=496233.3333333333, ans=0.2 2023-12-05 00:25:12,779 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=496366.6666666667, ans=0.125 2023-12-05 00:25:24,868 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=496433.3333333333, ans=0.125 2023-12-05 00:25:24,993 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=496433.3333333333, ans=0.125 2023-12-05 00:25:43,894 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=496500.0, ans=0.04949747468305833 2023-12-05 00:25:48,167 INFO [train.py:1087] (2/4) Epoch 84, batch 200, loss[loss=0.148, simple_loss=0.2427, pruned_loss=0.02666, over 23966.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2404, pruned_loss=0.02679, over 3026789.24 frames. ], batch size: 87, lr: 2.92e-03, grad_scale: 16.0 2023-12-05 00:25:56,285 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.97 vs. limit=22.5 2023-12-05 00:25:57,007 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=496566.6666666667, ans=0.0 2023-12-05 00:25:58,193 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=496566.6666666667, ans=0.1 2023-12-05 00:26:10,784 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.89 vs. limit=10.0 2023-12-05 00:26:17,486 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=496700.0, ans=0.125 2023-12-05 00:26:17,848 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.91 vs. limit=10.0 2023-12-05 00:26:18,614 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=496700.0, ans=0.125 2023-12-05 00:26:24,342 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=496766.6666666667, ans=0.125 2023-12-05 00:26:40,410 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.157e+02 1.262e+02 1.348e+02 1.464e+02 1.857e+02, threshold=2.696e+02, percent-clipped=0.0 2023-12-05 00:26:48,363 INFO [train.py:1087] (2/4) Epoch 84, batch 250, loss[loss=0.142, simple_loss=0.2364, pruned_loss=0.02376, over 23833.00 frames. ], tot_loss[loss=0.1466, simple_loss=0.2401, pruned_loss=0.02652, over 3435303.12 frames. 
], batch size: 95, lr: 2.92e-03, grad_scale: 16.0 2023-12-05 00:27:31,307 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.04 vs. limit=15.0 2023-12-05 00:27:38,176 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=497166.6666666667, ans=0.04949747468305833 2023-12-05 00:27:46,565 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=497166.6666666667, ans=0.0 2023-12-05 00:27:48,462 INFO [train.py:1087] (2/4) Epoch 84, batch 300, loss[loss=0.1744, simple_loss=0.2628, pruned_loss=0.04304, over 21263.00 frames. ], tot_loss[loss=0.1469, simple_loss=0.2402, pruned_loss=0.02675, over 3732836.70 frames. ], batch size: 51, lr: 2.92e-03, grad_scale: 16.0 2023-12-05 00:28:13,180 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=497366.6666666667, ans=0.125 2023-12-05 00:28:28,622 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=497433.3333333333, ans=0.1 2023-12-05 00:28:42,158 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.298e+02 1.372e+02 1.505e+02 1.725e+02, threshold=2.744e+02, percent-clipped=0.0 2023-12-05 00:28:49,127 INFO [train.py:1087] (2/4) Epoch 84, batch 350, loss[loss=0.1413, simple_loss=0.2383, pruned_loss=0.02217, over 24600.00 frames. ], tot_loss[loss=0.1474, simple_loss=0.2408, pruned_loss=0.027, over 3936495.94 frames. ], batch size: 68, lr: 2.92e-03, grad_scale: 16.0 2023-12-05 00:29:00,232 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=497566.6666666667, ans=0.125 2023-12-05 00:29:12,507 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.17 vs. limit=15.0 2023-12-05 00:29:18,933 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=497700.0, ans=0.0 2023-12-05 00:29:30,569 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=497766.6666666667, ans=0.125 2023-12-05 00:29:38,780 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=497833.3333333333, ans=0.125 2023-12-05 00:29:42,467 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=497833.3333333333, ans=0.125 2023-12-05 00:29:50,398 INFO [train.py:1087] (2/4) Epoch 84, batch 400, loss[loss=0.1417, simple_loss=0.2375, pruned_loss=0.02295, over 24747.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.2409, pruned_loss=0.02691, over 4136489.84 frames. ], batch size: 66, lr: 2.92e-03, grad_scale: 32.0 2023-12-05 00:30:22,312 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=498033.3333333333, ans=0.125 2023-12-05 00:30:31,890 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.49 vs. 
limit=22.5 2023-12-05 00:30:39,927 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. limit=6.0 2023-12-05 00:30:43,585 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.111e+02 1.306e+02 1.394e+02 1.478e+02 1.847e+02, threshold=2.788e+02, percent-clipped=0.0 2023-12-05 00:30:50,924 INFO [train.py:1087] (2/4) Epoch 84, batch 450, loss[loss=0.1454, simple_loss=0.2409, pruned_loss=0.02499, over 22977.00 frames. ], tot_loss[loss=0.1467, simple_loss=0.2405, pruned_loss=0.02651, over 4293537.91 frames. ], batch size: 106, lr: 2.92e-03, grad_scale: 32.0 2023-12-05 00:31:18,644 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=498366.6666666667, ans=0.1 2023-12-05 00:31:22,502 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=498366.6666666667, ans=0.0 2023-12-05 00:31:35,306 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=498433.3333333333, ans=0.125 2023-12-05 00:31:36,403 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=498433.3333333333, ans=0.2 2023-12-05 00:31:51,007 INFO [train.py:1087] (2/4) Epoch 84, batch 500, loss[loss=0.1437, simple_loss=0.2378, pruned_loss=0.02481, over 23407.00 frames. ], tot_loss[loss=0.1469, simple_loss=0.2405, pruned_loss=0.02667, over 4413516.83 frames. ], batch size: 94, lr: 2.92e-03, grad_scale: 32.0 2023-12-05 00:32:10,639 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=498633.3333333333, ans=0.125 2023-12-05 00:32:12,793 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=498633.3333333333, ans=0.1 2023-12-05 00:32:27,983 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=498766.6666666667, ans=0.125 2023-12-05 00:32:45,954 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.131e+02 1.250e+02 1.315e+02 1.443e+02 1.623e+02, threshold=2.631e+02, percent-clipped=0.0 2023-12-05 00:32:49,064 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.24 vs. limit=6.0 2023-12-05 00:32:52,072 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=498900.0, ans=0.0 2023-12-05 00:32:53,032 INFO [train.py:1087] (2/4) Epoch 84, batch 550, loss[loss=0.1441, simple_loss=0.2386, pruned_loss=0.02482, over 24721.00 frames. ], tot_loss[loss=0.1468, simple_loss=0.2406, pruned_loss=0.02649, over 4506339.27 frames. ], batch size: 61, lr: 2.92e-03, grad_scale: 32.0 2023-12-05 00:32:53,453 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=498900.0, ans=0.0 2023-12-05 00:32:56,204 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.70 vs. 
limit=10.0 2023-12-05 00:32:59,823 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=498900.0, ans=0.1 2023-12-05 00:33:12,068 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=498966.6666666667, ans=0.125 2023-12-05 00:33:19,097 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=499033.3333333333, ans=0.0 2023-12-05 00:33:46,381 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.24 vs. limit=15.0 2023-12-05 00:33:53,640 INFO [train.py:1087] (2/4) Epoch 84, batch 600, loss[loss=0.1368, simple_loss=0.2286, pruned_loss=0.02244, over 24718.00 frames. ], tot_loss[loss=0.1466, simple_loss=0.2403, pruned_loss=0.02645, over 4580237.17 frames. ], batch size: 67, lr: 2.92e-03, grad_scale: 32.0 2023-12-05 00:34:07,206 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=499300.0, ans=0.0 2023-12-05 00:34:11,761 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.69 vs. limit=15.0 2023-12-05 00:34:26,067 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.whiten.whitening_limit, batch_count=499366.6666666667, ans=12.0 2023-12-05 00:34:28,235 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=499366.6666666667, ans=0.125 2023-12-05 00:34:39,231 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.88 vs. limit=15.0 2023-12-05 00:34:47,819 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.125e+02 1.265e+02 1.379e+02 1.489e+02 1.749e+02, threshold=2.757e+02, percent-clipped=0.0 2023-12-05 00:34:55,397 INFO [train.py:1087] (2/4) Epoch 84, batch 650, loss[loss=0.1419, simple_loss=0.2372, pruned_loss=0.02326, over 21527.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2405, pruned_loss=0.02676, over 4630198.51 frames. ], batch size: 128, lr: 2.91e-03, grad_scale: 32.0 2023-12-05 00:35:10,830 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=499633.3333333333, ans=0.2 2023-12-05 00:35:25,842 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=499700.0, ans=10.0 2023-12-05 00:35:34,352 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=499766.6666666667, ans=0.0 2023-12-05 00:35:35,545 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=499766.6666666667, ans=0.125 2023-12-05 00:35:56,153 INFO [train.py:1087] (2/4) Epoch 84, batch 700, loss[loss=0.1431, simple_loss=0.2392, pruned_loss=0.02351, over 24705.00 frames. ], tot_loss[loss=0.1465, simple_loss=0.2402, pruned_loss=0.02646, over 4682983.22 frames. ], batch size: 69, lr: 2.91e-03, grad_scale: 32.0 2023-12-05 00:35:59,580 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.45 vs. 
limit=15.0 2023-12-05 00:36:01,239 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=499900.0, ans=0.2 2023-12-05 00:36:02,254 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=499900.0, ans=0.0 2023-12-05 00:36:09,405 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=499966.6666666667, ans=0.125 2023-12-05 00:36:09,665 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.33 vs. limit=15.0 2023-12-05 00:36:11,861 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=499966.6666666667, ans=0.2 2023-12-05 00:36:24,060 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=500033.3333333333, ans=0.0 2023-12-05 00:36:27,845 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=500033.3333333333, ans=0.125 2023-12-05 00:36:32,142 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.37 vs. limit=15.0 2023-12-05 00:36:41,139 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=500100.0, ans=0.1 2023-12-05 00:36:49,267 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.098e+02 1.278e+02 1.395e+02 1.500e+02 1.937e+02, threshold=2.790e+02, percent-clipped=0.0 2023-12-05 00:36:56,328 INFO [train.py:1087] (2/4) Epoch 84, batch 750, loss[loss=0.1418, simple_loss=0.237, pruned_loss=0.02328, over 22778.00 frames. ], tot_loss[loss=0.1464, simple_loss=0.2402, pruned_loss=0.02634, over 4713814.75 frames. ], batch size: 106, lr: 2.91e-03, grad_scale: 32.0 2023-12-05 00:37:00,671 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=500233.3333333333, ans=0.0 2023-12-05 00:37:15,112 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=500300.0, ans=0.0 2023-12-05 00:37:25,961 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=500366.6666666667, ans=0.2 2023-12-05 00:37:31,873 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=500433.3333333333, ans=0.2 2023-12-05 00:37:54,326 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.69 vs. limit=15.0 2023-12-05 00:37:57,173 INFO [train.py:1087] (2/4) Epoch 84, batch 800, loss[loss=0.1508, simple_loss=0.2429, pruned_loss=0.02933, over 24362.00 frames. ], tot_loss[loss=0.1466, simple_loss=0.2404, pruned_loss=0.02642, over 4707602.20 frames. 
], batch size: 79, lr: 2.91e-03, grad_scale: 32.0 2023-12-05 00:38:00,309 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=500566.6666666667, ans=0.1 2023-12-05 00:38:02,948 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=500566.6666666667, ans=0.125 2023-12-05 00:38:04,384 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.84 vs. limit=15.0 2023-12-05 00:38:16,552 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=500633.3333333333, ans=0.125 2023-12-05 00:38:16,557 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=500633.3333333333, ans=0.125 2023-12-05 00:38:17,582 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=500633.3333333333, ans=0.125 2023-12-05 00:38:25,082 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=500700.0, ans=0.125 2023-12-05 00:38:25,164 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=500700.0, ans=0.0 2023-12-05 00:38:30,408 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=500700.0, ans=0.1 2023-12-05 00:38:31,692 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=500766.6666666667, ans=0.09899494936611666 2023-12-05 00:38:39,396 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=500766.6666666667, ans=0.125 2023-12-05 00:38:43,909 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=500833.3333333333, ans=0.2 2023-12-05 00:38:44,992 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=500833.3333333333, ans=0.1 2023-12-05 00:38:45,077 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=500833.3333333333, ans=0.05 2023-12-05 00:38:48,043 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.143e+02 1.297e+02 1.393e+02 1.552e+02 2.763e+02, threshold=2.785e+02, percent-clipped=0.0 2023-12-05 00:38:48,917 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.13 vs. limit=22.5 2023-12-05 00:38:54,459 INFO [train.py:1087] (2/4) Epoch 84, batch 850, loss[loss=0.1733, simple_loss=0.2624, pruned_loss=0.04216, over 16108.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2407, pruned_loss=0.02662, over 4714826.38 frames. 
], batch size: 176, lr: 2.91e-03, grad_scale: 32.0 2023-12-05 00:39:02,150 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=500900.0, ans=0.035 2023-12-05 00:39:37,191 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=501166.6666666667, ans=0.125 2023-12-05 00:39:59,481 INFO [train.py:1087] (2/4) Epoch 85, batch 0, loss[loss=0.1502, simple_loss=0.2416, pruned_loss=0.02937, over 22168.00 frames. ], tot_loss[loss=0.1502, simple_loss=0.2416, pruned_loss=0.02937, over 22168.00 frames. ], batch size: 53, lr: 2.89e-03, grad_scale: 32.0 2023-12-05 00:39:59,482 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-05 00:40:12,999 INFO [train.py:1119] (2/4) Epoch 85, validation: loss=0.1507, simple_loss=0.2462, pruned_loss=0.02756, over 944034.00 frames. 2023-12-05 00:40:13,000 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-05 00:40:14,677 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.55 vs. limit=15.0 2023-12-05 00:40:25,742 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=501266.6666666667, ans=0.025 2023-12-05 00:40:28,030 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=501266.6666666667, ans=0.0 2023-12-05 00:40:59,723 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-05 00:41:09,947 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=501466.6666666667, ans=0.025 2023-12-05 00:41:12,005 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.093e+02 1.320e+02 1.398e+02 1.650e+02 2.233e+02, threshold=2.795e+02, percent-clipped=0.0 2023-12-05 00:41:13,208 INFO [train.py:1087] (2/4) Epoch 85, batch 50, loss[loss=0.1373, simple_loss=0.2334, pruned_loss=0.02061, over 24726.00 frames. ], tot_loss[loss=0.1459, simple_loss=0.2401, pruned_loss=0.02584, over 1083515.12 frames. ], batch size: 69, lr: 2.89e-03, grad_scale: 32.0 2023-12-05 00:41:19,360 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=501533.3333333333, ans=0.125 2023-12-05 00:41:33,048 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.39 vs. limit=15.0 2023-12-05 00:42:01,408 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=501800.0, ans=0.125 2023-12-05 00:42:11,736 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=501866.6666666667, ans=0.1 2023-12-05 00:42:12,709 INFO [train.py:1087] (2/4) Epoch 85, batch 100, loss[loss=0.1515, simple_loss=0.2414, pruned_loss=0.03075, over 24568.00 frames. ], tot_loss[loss=0.1455, simple_loss=0.2396, pruned_loss=0.0257, over 1922024.92 frames. 
], batch size: 64, lr: 2.89e-03, grad_scale: 32.0 2023-12-05 00:42:15,317 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=501866.6666666667, ans=0.2 2023-12-05 00:42:18,301 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=501866.6666666667, ans=0.0 2023-12-05 00:42:44,275 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.20 vs. limit=10.0 2023-12-05 00:42:47,912 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0 2023-12-05 00:42:51,815 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=502066.6666666667, ans=0.125 2023-12-05 00:42:53,337 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.56 vs. limit=15.0 2023-12-05 00:43:01,298 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=502133.3333333333, ans=0.0 2023-12-05 00:43:05,831 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=502133.3333333333, ans=0.125 2023-12-05 00:43:08,257 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=502133.3333333333, ans=0.0 2023-12-05 00:43:11,810 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.083e+02 1.240e+02 1.330e+02 1.421e+02 1.906e+02, threshold=2.659e+02, percent-clipped=0.0 2023-12-05 00:43:13,042 INFO [train.py:1087] (2/4) Epoch 85, batch 150, loss[loss=0.1397, simple_loss=0.2386, pruned_loss=0.02041, over 24758.00 frames. ], tot_loss[loss=0.1469, simple_loss=0.2406, pruned_loss=0.02654, over 2545158.63 frames. ], batch size: 71, lr: 2.89e-03, grad_scale: 32.0 2023-12-05 00:43:14,296 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=502200.0, ans=0.1 2023-12-05 00:43:19,140 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=502200.0, ans=0.1 2023-12-05 00:43:31,263 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-12-05 00:44:02,628 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=502466.6666666667, ans=0.0 2023-12-05 00:44:08,589 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=502466.6666666667, ans=0.125 2023-12-05 00:44:10,097 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.00 vs. limit=15.0 2023-12-05 00:44:13,202 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=502533.3333333333, ans=0.1 2023-12-05 00:44:14,071 INFO [train.py:1087] (2/4) Epoch 85, batch 200, loss[loss=0.1555, simple_loss=0.2507, pruned_loss=0.03015, over 21499.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.2408, pruned_loss=0.02686, over 3046804.53 frames. 
], batch size: 128, lr: 2.89e-03, grad_scale: 32.0 2023-12-05 00:44:26,727 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=502600.0, ans=0.1 2023-12-05 00:44:42,883 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=15.0 2023-12-05 00:44:50,916 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=502733.3333333333, ans=0.125 2023-12-05 00:45:15,837 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.129e+02 1.274e+02 1.339e+02 1.453e+02 1.938e+02, threshold=2.677e+02, percent-clipped=0.0 2023-12-05 00:45:16,233 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=502866.6666666667, ans=0.125 2023-12-05 00:45:16,981 INFO [train.py:1087] (2/4) Epoch 85, batch 250, loss[loss=0.1378, simple_loss=0.2343, pruned_loss=0.02068, over 24790.00 frames. ], tot_loss[loss=0.1472, simple_loss=0.2407, pruned_loss=0.02685, over 3426199.18 frames. ], batch size: 72, lr: 2.89e-03, grad_scale: 32.0 2023-12-05 00:45:27,746 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=502933.3333333333, ans=0.0 2023-12-05 00:45:30,047 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=502933.3333333333, ans=0.125 2023-12-05 00:45:37,020 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=502933.3333333333, ans=0.2 2023-12-05 00:45:38,300 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=502933.3333333333, ans=0.125 2023-12-05 00:45:41,159 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=503000.0, ans=0.125 2023-12-05 00:45:54,164 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=503066.6666666667, ans=0.0 2023-12-05 00:46:11,243 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.73 vs. limit=22.5 2023-12-05 00:46:18,514 INFO [train.py:1087] (2/4) Epoch 85, batch 300, loss[loss=0.1448, simple_loss=0.2394, pruned_loss=0.02511, over 21319.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2406, pruned_loss=0.02667, over 3731952.21 frames. ], batch size: 127, lr: 2.89e-03, grad_scale: 32.0 2023-12-05 00:46:22,254 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=503200.0, ans=0.125 2023-12-05 00:46:23,385 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=503200.0, ans=0.05 2023-12-05 00:46:32,711 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=503266.6666666667, ans=10.0 2023-12-05 00:46:35,201 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.69 vs. 
limit=15.0 2023-12-05 00:46:36,073 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=503266.6666666667, ans=0.125 2023-12-05 00:46:56,179 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-05 00:47:04,582 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=503400.0, ans=0.1 2023-12-05 00:47:11,866 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=503466.6666666667, ans=0.0 2023-12-05 00:47:14,226 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=503466.6666666667, ans=0.125 2023-12-05 00:47:16,786 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=503466.6666666667, ans=15.0 2023-12-05 00:47:17,382 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.102e+02 1.225e+02 1.286e+02 1.410e+02 1.666e+02, threshold=2.573e+02, percent-clipped=0.0 2023-12-05 00:47:18,599 INFO [train.py:1087] (2/4) Epoch 85, batch 350, loss[loss=0.1481, simple_loss=0.2436, pruned_loss=0.02632, over 21310.00 frames. ], tot_loss[loss=0.1462, simple_loss=0.24, pruned_loss=0.02626, over 3981445.48 frames. ], batch size: 128, lr: 2.89e-03, grad_scale: 32.0 2023-12-05 00:47:36,805 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=503600.0, ans=0.2 2023-12-05 00:47:41,663 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=503600.0, ans=0.04949747468305833 2023-12-05 00:47:43,811 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=503666.6666666667, ans=0.1 2023-12-05 00:47:49,746 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=503666.6666666667, ans=0.125 2023-12-05 00:48:18,537 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=503800.0, ans=0.1 2023-12-05 00:48:20,735 INFO [train.py:1087] (2/4) Epoch 85, batch 400, loss[loss=0.1402, simple_loss=0.2362, pruned_loss=0.02205, over 24795.00 frames. ], tot_loss[loss=0.1465, simple_loss=0.2403, pruned_loss=0.0264, over 4164093.88 frames. ], batch size: 73, lr: 2.89e-03, grad_scale: 32.0 2023-12-05 00:48:31,720 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=503933.3333333333, ans=0.1 2023-12-05 00:48:39,517 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.64 vs. 
limit=15.0 2023-12-05 00:48:41,439 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-05 00:48:53,930 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=504000.0, ans=0.125 2023-12-05 00:49:12,666 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=504133.3333333333, ans=22.5 2023-12-05 00:49:21,849 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.086e+02 1.308e+02 1.368e+02 1.432e+02 1.757e+02, threshold=2.737e+02, percent-clipped=0.0 2023-12-05 00:49:22,975 INFO [train.py:1087] (2/4) Epoch 85, batch 450, loss[loss=0.1543, simple_loss=0.2482, pruned_loss=0.03024, over 21505.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2407, pruned_loss=0.02664, over 4296109.55 frames. ], batch size: 127, lr: 2.88e-03, grad_scale: 32.0 2023-12-05 00:49:42,994 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=504266.6666666667, ans=0.0 2023-12-05 00:49:43,002 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=504266.6666666667, ans=0.125 2023-12-05 00:49:44,171 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=504266.6666666667, ans=0.05 2023-12-05 00:50:00,476 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=504400.0, ans=0.2 2023-12-05 00:50:02,860 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=504400.0, ans=0.125 2023-12-05 00:50:21,807 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=504466.6666666667, ans=0.1 2023-12-05 00:50:25,099 INFO [train.py:1087] (2/4) Epoch 85, batch 500, loss[loss=0.1528, simple_loss=0.2444, pruned_loss=0.0306, over 23910.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.2409, pruned_loss=0.02688, over 4403785.81 frames. 
], batch size: 87, lr: 2.88e-03, grad_scale: 32.0 2023-12-05 00:50:25,357 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=504533.3333333333, ans=0.1 2023-12-05 00:50:31,187 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=504533.3333333333, ans=0.125 2023-12-05 00:50:58,735 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=504666.6666666667, ans=0.125 2023-12-05 00:51:09,232 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=504733.3333333333, ans=0.95 2023-12-05 00:51:20,312 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=504800.0, ans=0.125 2023-12-05 00:51:24,563 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.116e+02 1.256e+02 1.341e+02 1.457e+02 1.904e+02, threshold=2.682e+02, percent-clipped=0.0 2023-12-05 00:51:25,729 INFO [train.py:1087] (2/4) Epoch 85, batch 550, loss[loss=0.1442, simple_loss=0.2383, pruned_loss=0.02505, over 24456.00 frames. ], tot_loss[loss=0.1469, simple_loss=0.2405, pruned_loss=0.02666, over 4482088.49 frames. ], batch size: 77, lr: 2.88e-03, grad_scale: 32.0 2023-12-05 00:52:24,055 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=505133.3333333333, ans=0.04949747468305833 2023-12-05 00:52:25,328 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=505133.3333333333, ans=0.125 2023-12-05 00:52:26,419 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=505200.0, ans=0.125 2023-12-05 00:52:27,385 INFO [train.py:1087] (2/4) Epoch 85, batch 600, loss[loss=0.1335, simple_loss=0.2255, pruned_loss=0.02072, over 24599.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.2397, pruned_loss=0.02641, over 4551980.14 frames. ], batch size: 68, lr: 2.88e-03, grad_scale: 32.0 2023-12-05 00:53:08,059 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.30 vs. limit=22.5 2023-12-05 00:53:28,368 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.130e+02 1.257e+02 1.345e+02 1.467e+02 2.022e+02, threshold=2.691e+02, percent-clipped=0.0 2023-12-05 00:53:29,574 INFO [train.py:1087] (2/4) Epoch 85, batch 650, loss[loss=0.1456, simple_loss=0.2383, pruned_loss=0.02647, over 24008.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2394, pruned_loss=0.02631, over 4613205.76 frames. 
], batch size: 87, lr: 2.88e-03, grad_scale: 32.0 2023-12-05 00:53:49,075 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=505600.0, ans=0.125 2023-12-05 00:53:54,972 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=505666.6666666667, ans=0.0 2023-12-05 00:54:00,822 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=505666.6666666667, ans=0.125 2023-12-05 00:54:13,422 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=505733.3333333333, ans=0.125 2023-12-05 00:54:28,758 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.70 vs. limit=15.0 2023-12-05 00:54:31,746 INFO [train.py:1087] (2/4) Epoch 85, batch 700, loss[loss=0.15, simple_loss=0.2431, pruned_loss=0.02841, over 24599.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.2394, pruned_loss=0.02637, over 4653428.45 frames. ], batch size: 68, lr: 2.88e-03, grad_scale: 32.0 2023-12-05 00:54:40,217 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=505866.6666666667, ans=0.0 2023-12-05 00:54:47,326 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=505933.3333333333, ans=0.1 2023-12-05 00:54:50,105 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=505933.3333333333, ans=0.125 2023-12-05 00:55:06,058 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=506000.0, ans=0.0 2023-12-05 00:55:21,240 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=506133.3333333333, ans=0.0 2023-12-05 00:55:31,825 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.090e+02 1.262e+02 1.357e+02 1.466e+02 1.953e+02, threshold=2.715e+02, percent-clipped=0.0 2023-12-05 00:55:33,014 INFO [train.py:1087] (2/4) Epoch 85, batch 750, loss[loss=0.1505, simple_loss=0.2438, pruned_loss=0.02858, over 24730.00 frames. ], tot_loss[loss=0.1462, simple_loss=0.2397, pruned_loss=0.02638, over 4688665.87 frames. ], batch size: 61, lr: 2.88e-03, grad_scale: 32.0 2023-12-05 00:55:46,428 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=506266.6666666667, ans=0.125 2023-12-05 00:56:10,008 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=506400.0, ans=0.125 2023-12-05 00:56:32,891 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.87 vs. limit=22.5 2023-12-05 00:56:33,328 INFO [train.py:1087] (2/4) Epoch 85, batch 800, loss[loss=0.1534, simple_loss=0.2461, pruned_loss=0.03032, over 24042.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.2396, pruned_loss=0.02632, over 4717852.23 frames. 
], batch size: 87, lr: 2.88e-03, grad_scale: 32.0 2023-12-05 00:56:38,740 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=506533.3333333333, ans=0.02 2023-12-05 00:56:46,547 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=506600.0, ans=0.0 2023-12-05 00:57:07,229 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=506666.6666666667, ans=0.125 2023-12-05 00:57:19,333 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=506733.3333333333, ans=0.125 2023-12-05 00:57:32,461 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.125e+02 1.281e+02 1.341e+02 1.448e+02 1.887e+02, threshold=2.683e+02, percent-clipped=0.0 2023-12-05 00:57:33,626 INFO [train.py:1087] (2/4) Epoch 85, batch 850, loss[loss=0.144, simple_loss=0.2364, pruned_loss=0.02581, over 24576.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.2397, pruned_loss=0.02641, over 4744624.89 frames. ], batch size: 65, lr: 2.88e-03, grad_scale: 32.0 2023-12-05 00:57:40,190 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=506866.6666666667, ans=0.125 2023-12-05 00:57:48,814 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=506933.3333333333, ans=0.125 2023-12-05 00:57:56,274 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=507000.0, ans=0.0 2023-12-05 00:57:59,298 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=507000.0, ans=0.0 2023-12-05 00:58:09,067 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=507066.6666666667, ans=0.95 2023-12-05 00:58:18,063 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=507133.3333333333, ans=0.125 2023-12-05 00:58:32,920 INFO [train.py:1087] (2/4) Epoch 86, batch 0, loss[loss=0.1312, simple_loss=0.224, pruned_loss=0.01921, over 24753.00 frames. ], tot_loss[loss=0.1312, simple_loss=0.224, pruned_loss=0.01921, over 24753.00 frames. ], batch size: 66, lr: 2.86e-03, grad_scale: 32.0 2023-12-05 00:58:32,921 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-05 00:58:46,429 INFO [train.py:1119] (2/4) Epoch 86, validation: loss=0.151, simple_loss=0.2462, pruned_loss=0.02793, over 944034.00 frames. 
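The entries above are of a few recurring kinds: loss summaries from train.py, "ScheduledFloat" reports from scaling.py of hyperparameters whose value depends on the global batch count, "Whitening" diagnostics comparing a metric against its limit, and "Clipping_scale" reports from optim.py giving a five-point summary (min, 25%, 50%, 75%, max) of recent gradient norms together with the clipping threshold. In every such entry the printed threshold equals clipping_scale times the middle (median) value, e.g. 2.0 × 1.385e+02 = 2.770e+02 in the first report of this epoch. The sketch below reproduces only that bookkeeping under that assumption; the class name and interface are hypothetical and are not the actual icefall optim.py implementation.

```python
import torch
from collections import deque


class GradNormClipper:
    """Illustrative only: clip to clipping_scale * median of recent grad norms,
    matching the relation visible in the "Clipping_scale=2.0, grad-norm
    quartiles ... threshold=..." log lines (percent-clipped stays 0.0 while
    norms remain below the threshold)."""

    def __init__(self, clipping_scale: float = 2.0, history: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=history)  # recent total gradient norms

    def __call__(self, parameters):
        params = [p for p in parameters if p.grad is not None]
        total_norm = torch.norm(
            torch.stack([p.grad.detach().norm(2) for p in params]), 2
        ).item()
        self.norms.append(total_norm)

        hist = torch.tensor(list(self.norms))
        # five-point summary (min, 25%, 50%, 75%, max), as printed in the log
        quartiles = torch.quantile(hist, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.clipping_scale * quartiles[2].item()  # scale * median

        clipped = total_norm > threshold
        if clipped:
            for p in params:
                p.grad.mul_(threshold / total_norm)
        return quartiles, threshold, clipped
```

The "ScheduledFloat: name=..., batch_count=..., ans=..." lines report the current value of such a scheduled quantity (dropout probabilities, skip rates, balancer limits, whitening limits) at the given batch count. A minimal sketch of one plausible form, a piecewise-linear schedule over batch count, is shown below; it is an illustration of what the log is reporting, not the real scaling.py class.

```python
class ScheduledValue:
    """Hypothetical piecewise-linear schedule over the global batch count.

    Example: ScheduledValue((0, 0.3), (20000, 0.1)) starts at 0.3, decays
    linearly to 0.1 by batch 20000, then stays constant -- the kind of value
    printed as "ans=..." in the ScheduledFloat entries above."""

    def __init__(self, *points):
        # points: (batch_count, value) pairs; kept sorted by batch_count
        self.points = sorted(points)

    def value(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)


# Usage sketch: at batch_count=507166 a schedule like this would print a value
# between its breakpoints, analogous to the "ans=0.125" entries above.
print(ScheduledValue((0, 0.3), (20000, 0.1)).value(507166))
```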
2023-12-05 00:58:46,430 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-05 00:58:55,860 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-05 00:58:58,281 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=507233.3333333333, ans=0.125 2023-12-05 00:59:12,468 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=507300.0, ans=0.125 2023-12-05 00:59:23,346 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=507366.6666666667, ans=0.125 2023-12-05 00:59:31,439 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-05 00:59:40,213 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-12-05 00:59:45,524 INFO [train.py:1087] (2/4) Epoch 86, batch 50, loss[loss=0.1529, simple_loss=0.2445, pruned_loss=0.03062, over 24185.00 frames. ], tot_loss[loss=0.1466, simple_loss=0.2403, pruned_loss=0.02646, over 1098571.20 frames. ], batch size: 82, lr: 2.86e-03, grad_scale: 32.0 2023-12-05 00:59:50,554 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.112e+02 1.300e+02 1.379e+02 1.470e+02 1.879e+02, threshold=2.758e+02, percent-clipped=0.0 2023-12-05 01:00:00,758 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.51 vs. limit=15.0 2023-12-05 01:00:11,082 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=12.0 2023-12-05 01:00:33,134 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.24 vs. limit=22.5 2023-12-05 01:00:44,852 INFO [train.py:1087] (2/4) Epoch 86, batch 100, loss[loss=0.1525, simple_loss=0.2447, pruned_loss=0.03011, over 24567.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.2401, pruned_loss=0.02626, over 1930531.34 frames. ], batch size: 64, lr: 2.86e-03, grad_scale: 32.0 2023-12-05 01:00:47,231 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=507833.3333333333, ans=0.125 2023-12-05 01:01:14,026 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.82 vs. limit=15.0 2023-12-05 01:01:17,168 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=507966.6666666667, ans=0.0 2023-12-05 01:01:32,718 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.33 vs. 
limit=12.0 2023-12-05 01:01:34,609 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=508100.0, ans=0.1 2023-12-05 01:01:36,983 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=508100.0, ans=0.125 2023-12-05 01:01:41,188 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.14 vs. limit=22.5 2023-12-05 01:01:45,063 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-12-05 01:01:45,398 INFO [train.py:1087] (2/4) Epoch 86, batch 150, loss[loss=0.1411, simple_loss=0.2388, pruned_loss=0.02175, over 24757.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2398, pruned_loss=0.02605, over 2580845.35 frames. ], batch size: 66, lr: 2.86e-03, grad_scale: 32.0 2023-12-05 01:01:49,936 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.111e+02 1.280e+02 1.367e+02 1.450e+02 1.965e+02, threshold=2.735e+02, percent-clipped=0.0 2023-12-05 01:01:59,086 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=508233.3333333333, ans=0.0 2023-12-05 01:01:59,261 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-05 01:02:15,809 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=508300.0, ans=0.125 2023-12-05 01:02:20,238 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=508366.6666666667, ans=0.0 2023-12-05 01:02:35,713 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=508433.3333333333, ans=0.0 2023-12-05 01:02:45,681 INFO [train.py:1087] (2/4) Epoch 86, batch 200, loss[loss=0.1367, simple_loss=0.2283, pruned_loss=0.02259, over 24573.00 frames. ], tot_loss[loss=0.1455, simple_loss=0.239, pruned_loss=0.02598, over 3074381.14 frames. ], batch size: 65, lr: 2.86e-03, grad_scale: 32.0 2023-12-05 01:02:47,195 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=508500.0, ans=0.125 2023-12-05 01:03:05,770 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=508566.6666666667, ans=0.025 2023-12-05 01:03:28,209 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=508700.0, ans=0.1 2023-12-05 01:03:40,733 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=508766.6666666667, ans=0.05 2023-12-05 01:03:41,841 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=508766.6666666667, ans=0.0 2023-12-05 01:03:45,990 INFO [train.py:1087] (2/4) Epoch 86, batch 250, loss[loss=0.1546, simple_loss=0.2458, pruned_loss=0.03169, over 24568.00 frames. ], tot_loss[loss=0.1456, simple_loss=0.2393, pruned_loss=0.0259, over 3468209.60 frames. 
], batch size: 65, lr: 2.85e-03, grad_scale: 32.0 2023-12-05 01:03:50,704 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.128e+02 1.297e+02 1.391e+02 1.499e+02 1.858e+02, threshold=2.782e+02, percent-clipped=0.0 2023-12-05 01:03:53,360 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=508833.3333333333, ans=0.1 2023-12-05 01:04:29,737 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=509033.3333333333, ans=0.2 2023-12-05 01:04:32,070 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=509033.3333333333, ans=10.0 2023-12-05 01:04:36,068 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=509100.0, ans=0.125 2023-12-05 01:04:38,543 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=509100.0, ans=0.125 2023-12-05 01:04:39,057 INFO [scaling.py:1022] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.56 vs. limit=5.0 2023-12-05 01:04:41,112 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.11 vs. limit=15.0 2023-12-05 01:04:47,159 INFO [train.py:1087] (2/4) Epoch 86, batch 300, loss[loss=0.1315, simple_loss=0.2255, pruned_loss=0.01877, over 24565.00 frames. ], tot_loss[loss=0.1456, simple_loss=0.2393, pruned_loss=0.02591, over 3760266.44 frames. ], batch size: 63, lr: 2.85e-03, grad_scale: 32.0 2023-12-05 01:04:50,094 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=509166.6666666667, ans=0.1 2023-12-05 01:04:52,452 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=509166.6666666667, ans=0.125 2023-12-05 01:05:01,105 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.86 vs. limit=22.5 2023-12-05 01:05:07,890 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=509233.3333333333, ans=0.0 2023-12-05 01:05:10,075 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=509300.0, ans=0.1 2023-12-05 01:05:12,822 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=509300.0, ans=0.125 2023-12-05 01:05:26,902 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=509366.6666666667, ans=10.0 2023-12-05 01:05:34,182 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.43 vs. limit=15.0 2023-12-05 01:05:41,273 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.82 vs. limit=6.0 2023-12-05 01:05:47,224 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.14 vs. 
limit=15.0 2023-12-05 01:05:47,679 INFO [train.py:1087] (2/4) Epoch 86, batch 350, loss[loss=0.1426, simple_loss=0.2343, pruned_loss=0.02544, over 24748.00 frames. ], tot_loss[loss=0.1455, simple_loss=0.2392, pruned_loss=0.02589, over 4008787.34 frames. ], batch size: 63, lr: 2.85e-03, grad_scale: 32.0 2023-12-05 01:05:49,717 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.72 vs. limit=22.5 2023-12-05 01:05:52,639 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.117e+02 1.308e+02 1.400e+02 1.542e+02 2.076e+02, threshold=2.800e+02, percent-clipped=0.0 2023-12-05 01:05:54,058 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=509500.0, ans=0.125 2023-12-05 01:05:58,382 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=509500.0, ans=0.025 2023-12-05 01:06:14,389 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.09 vs. limit=22.5 2023-12-05 01:06:17,464 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=509633.3333333333, ans=0.125 2023-12-05 01:06:22,113 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=509633.3333333333, ans=0.125 2023-12-05 01:06:25,486 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=509700.0, ans=10.0 2023-12-05 01:06:41,073 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=509766.6666666667, ans=0.0 2023-12-05 01:06:49,020 INFO [train.py:1087] (2/4) Epoch 86, batch 400, loss[loss=0.1482, simple_loss=0.247, pruned_loss=0.02467, over 24696.00 frames. ], tot_loss[loss=0.1459, simple_loss=0.2396, pruned_loss=0.02611, over 4182187.54 frames. ], batch size: 69, lr: 2.85e-03, grad_scale: 64.0 2023-12-05 01:06:50,559 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=509833.3333333333, ans=0.125 2023-12-05 01:06:51,689 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=509833.3333333333, ans=0.125 2023-12-05 01:07:03,591 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-05 01:07:06,853 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-05 01:07:07,886 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=509900.0, ans=0.2 2023-12-05 01:07:09,459 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.12 vs. limit=15.0 2023-12-05 01:07:44,256 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=510100.0, ans=0.125 2023-12-05 01:07:50,437 INFO [train.py:1087] (2/4) Epoch 86, batch 450, loss[loss=0.1416, simple_loss=0.2361, pruned_loss=0.02354, over 24289.00 frames. 
], tot_loss[loss=0.1458, simple_loss=0.2397, pruned_loss=0.026, over 4330736.82 frames. ], batch size: 79, lr: 2.85e-03, grad_scale: 64.0 2023-12-05 01:07:55,004 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.161e+02 1.256e+02 1.336e+02 1.478e+02 2.181e+02, threshold=2.672e+02, percent-clipped=0.0 2023-12-05 01:08:34,192 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.92 vs. limit=12.0 2023-12-05 01:08:41,090 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.66 vs. limit=22.5 2023-12-05 01:08:42,293 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.52 vs. limit=15.0 2023-12-05 01:08:51,087 INFO [train.py:1087] (2/4) Epoch 86, batch 500, loss[loss=0.1493, simple_loss=0.2482, pruned_loss=0.02518, over 24605.00 frames. ], tot_loss[loss=0.1459, simple_loss=0.2397, pruned_loss=0.02604, over 4436470.42 frames. ], batch size: 68, lr: 2.85e-03, grad_scale: 64.0 2023-12-05 01:09:00,121 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=510500.0, ans=0.125 2023-12-05 01:09:09,592 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=510566.6666666667, ans=0.125 2023-12-05 01:09:22,697 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=510633.3333333333, ans=0.0 2023-12-05 01:09:30,500 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.50 vs. limit=22.5 2023-12-05 01:09:52,310 INFO [train.py:1087] (2/4) Epoch 86, batch 550, loss[loss=0.1317, simple_loss=0.2261, pruned_loss=0.01867, over 24843.00 frames. ], tot_loss[loss=0.1462, simple_loss=0.2397, pruned_loss=0.02636, over 4487561.14 frames. ], batch size: 68, lr: 2.85e-03, grad_scale: 64.0 2023-12-05 01:09:57,411 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.129e+02 1.286e+02 1.391e+02 1.520e+02 1.971e+02, threshold=2.782e+02, percent-clipped=0.0 2023-12-05 01:10:10,579 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.66 vs. limit=6.0 2023-12-05 01:10:11,449 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=510900.0, ans=0.0 2023-12-05 01:10:21,057 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=510966.6666666667, ans=0.125 2023-12-05 01:10:42,453 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=511100.0, ans=0.125 2023-12-05 01:10:52,967 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=511166.6666666667, ans=0.0 2023-12-05 01:10:53,879 INFO [train.py:1087] (2/4) Epoch 86, batch 600, loss[loss=0.1419, simple_loss=0.2385, pruned_loss=0.02264, over 24562.00 frames. ], tot_loss[loss=0.1458, simple_loss=0.2394, pruned_loss=0.02613, over 4562794.50 frames. 
], batch size: 66, lr: 2.85e-03, grad_scale: 64.0 2023-12-05 01:11:24,682 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=511300.0, ans=0.125 2023-12-05 01:11:28,260 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=511300.0, ans=0.2 2023-12-05 01:11:40,528 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=511366.6666666667, ans=0.0 2023-12-05 01:11:42,827 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=511433.3333333333, ans=0.04949747468305833 2023-12-05 01:11:56,647 INFO [train.py:1087] (2/4) Epoch 86, batch 650, loss[loss=0.1466, simple_loss=0.236, pruned_loss=0.02858, over 24762.00 frames. ], tot_loss[loss=0.1459, simple_loss=0.2393, pruned_loss=0.02622, over 4624732.25 frames. ], batch size: 65, lr: 2.85e-03, grad_scale: 32.0 2023-12-05 01:12:02,416 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.119e+02 1.278e+02 1.366e+02 1.457e+02 1.861e+02, threshold=2.732e+02, percent-clipped=0.0 2023-12-05 01:12:58,728 INFO [train.py:1087] (2/4) Epoch 86, batch 700, loss[loss=0.1407, simple_loss=0.2378, pruned_loss=0.02186, over 24571.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2395, pruned_loss=0.02627, over 4663416.38 frames. ], batch size: 64, lr: 2.85e-03, grad_scale: 32.0 2023-12-05 01:13:18,557 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.10 vs. limit=15.0 2023-12-05 01:13:19,342 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=511900.0, ans=0.0 2023-12-05 01:13:30,240 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.23 vs. limit=22.5 2023-12-05 01:13:33,059 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.37 vs. limit=15.0 2023-12-05 01:13:43,453 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.88 vs. limit=15.0 2023-12-05 01:13:58,665 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.87 vs. limit=10.0 2023-12-05 01:14:00,329 INFO [train.py:1087] (2/4) Epoch 86, batch 750, loss[loss=0.1419, simple_loss=0.2374, pruned_loss=0.02323, over 24851.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.2397, pruned_loss=0.02642, over 4695527.01 frames. 
], batch size: 68, lr: 2.85e-03, grad_scale: 32.0 2023-12-05 01:14:07,284 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.140e+02 1.276e+02 1.356e+02 1.468e+02 2.035e+02, threshold=2.712e+02, percent-clipped=0.0 2023-12-05 01:14:12,385 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=512233.3333333333, ans=0.125 2023-12-05 01:14:14,751 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=512233.3333333333, ans=0.125 2023-12-05 01:14:25,939 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=512300.0, ans=0.2 2023-12-05 01:14:42,092 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=512366.6666666667, ans=0.0 2023-12-05 01:15:01,791 INFO [train.py:1087] (2/4) Epoch 86, batch 800, loss[loss=0.1512, simple_loss=0.244, pruned_loss=0.02922, over 24714.00 frames. ], tot_loss[loss=0.1464, simple_loss=0.24, pruned_loss=0.02645, over 4711003.98 frames. ], batch size: 74, lr: 2.84e-03, grad_scale: 32.0 2023-12-05 01:15:25,168 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=512633.3333333333, ans=0.125 2023-12-05 01:15:29,804 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=512633.3333333333, ans=0.125 2023-12-05 01:15:37,577 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.87 vs. limit=10.0 2023-12-05 01:15:38,328 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=512700.0, ans=0.125 2023-12-05 01:15:46,020 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=512700.0, ans=0.125 2023-12-05 01:15:59,072 INFO [train.py:1087] (2/4) Epoch 86, batch 850, loss[loss=0.1492, simple_loss=0.2426, pruned_loss=0.02793, over 24507.00 frames. ], tot_loss[loss=0.1467, simple_loss=0.2401, pruned_loss=0.02659, over 4724030.81 frames. ], batch size: 75, lr: 2.84e-03, grad_scale: 32.0 2023-12-05 01:16:05,595 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.087e+02 1.258e+02 1.364e+02 1.461e+02 1.903e+02, threshold=2.728e+02, percent-clipped=0.0 2023-12-05 01:16:06,805 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=512833.3333333333, ans=0.5 2023-12-05 01:16:07,143 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.96 vs. limit=15.0 2023-12-05 01:16:11,496 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=512900.0, ans=0.5 2023-12-05 01:16:16,032 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.21 vs. limit=15.0 2023-12-05 01:16:17,443 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.21 vs. 
limit=15.0 2023-12-05 01:16:17,871 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=512900.0, ans=0.125 2023-12-05 01:16:25,021 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.78 vs. limit=15.0 2023-12-05 01:16:37,440 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=513033.3333333333, ans=0.125 2023-12-05 01:16:41,835 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=513100.0, ans=0.025 2023-12-05 01:16:58,342 INFO [train.py:1087] (2/4) Epoch 87, batch 0, loss[loss=0.1717, simple_loss=0.2617, pruned_loss=0.04089, over 17153.00 frames. ], tot_loss[loss=0.1717, simple_loss=0.2617, pruned_loss=0.04089, over 17153.00 frames. ], batch size: 176, lr: 2.83e-03, grad_scale: 32.0 2023-12-05 01:16:58,344 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-05 01:17:07,940 INFO [zipformer.py:1876] (2/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.5122, 5.3074, 4.7127, 4.8925], device='cuda:2') 2023-12-05 01:17:11,952 INFO [train.py:1119] (2/4) Epoch 87, validation: loss=0.1509, simple_loss=0.246, pruned_loss=0.02789, over 944034.00 frames. 2023-12-05 01:17:11,953 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-05 01:17:12,179 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=513133.3333333333, ans=0.1 2023-12-05 01:17:30,310 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=513200.0, ans=0.0 2023-12-05 01:17:41,078 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.70 vs. limit=12.0 2023-12-05 01:17:50,622 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=513333.3333333333, ans=0.125 2023-12-05 01:17:53,193 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.57 vs. limit=12.0 2023-12-05 01:17:53,790 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=513333.3333333333, ans=0.0 2023-12-05 01:17:58,558 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=513400.0, ans=0.025 2023-12-05 01:18:05,044 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.62 vs. limit=10.0 2023-12-05 01:18:11,615 INFO [train.py:1087] (2/4) Epoch 87, batch 50, loss[loss=0.1349, simple_loss=0.2298, pruned_loss=0.02, over 24553.00 frames. ], tot_loss[loss=0.1448, simple_loss=0.239, pruned_loss=0.02527, over 1081286.27 frames. 
], batch size: 66, lr: 2.83e-03, grad_scale: 32.0 2023-12-05 01:18:19,109 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=513466.6666666667, ans=0.2 2023-12-05 01:18:22,540 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=513533.3333333333, ans=0.0 2023-12-05 01:18:24,580 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.166e+02 1.275e+02 1.354e+02 1.461e+02 2.447e+02, threshold=2.709e+02, percent-clipped=0.0 2023-12-05 01:18:45,649 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.79 vs. limit=15.0 2023-12-05 01:18:46,293 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=513666.6666666667, ans=0.035 2023-12-05 01:19:06,705 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=513733.3333333333, ans=0.1 2023-12-05 01:19:11,035 INFO [train.py:1087] (2/4) Epoch 87, batch 100, loss[loss=0.1441, simple_loss=0.24, pruned_loss=0.02412, over 24594.00 frames. ], tot_loss[loss=0.1451, simple_loss=0.2392, pruned_loss=0.02553, over 1903136.13 frames. ], batch size: 68, lr: 2.82e-03, grad_scale: 32.0 2023-12-05 01:19:17,777 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.12 vs. limit=15.0 2023-12-05 01:19:23,298 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=513866.6666666667, ans=0.1 2023-12-05 01:19:28,994 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=513866.6666666667, ans=0.0 2023-12-05 01:19:43,124 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=513933.3333333333, ans=0.2 2023-12-05 01:19:57,900 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.86 vs. limit=15.0 2023-12-05 01:20:09,939 INFO [train.py:1087] (2/4) Epoch 87, batch 150, loss[loss=0.1426, simple_loss=0.2367, pruned_loss=0.02424, over 24573.00 frames. ], tot_loss[loss=0.1454, simple_loss=0.2392, pruned_loss=0.02577, over 2560499.95 frames. ], batch size: 64, lr: 2.82e-03, grad_scale: 32.0 2023-12-05 01:20:12,856 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=514133.3333333333, ans=10.0 2023-12-05 01:20:13,257 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.48 vs. 
limit=12.0 2023-12-05 01:20:24,197 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.100e+02 1.275e+02 1.353e+02 1.490e+02 2.022e+02, threshold=2.707e+02, percent-clipped=0.0 2023-12-05 01:20:30,089 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=514200.0, ans=0.125 2023-12-05 01:20:36,006 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=514266.6666666667, ans=0.1 2023-12-05 01:20:43,000 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=514266.6666666667, ans=0.125 2023-12-05 01:20:47,670 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=514333.3333333333, ans=0.125 2023-12-05 01:20:55,991 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.39 vs. limit=15.0 2023-12-05 01:21:03,890 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=514400.0, ans=0.125 2023-12-05 01:21:10,702 INFO [train.py:1087] (2/4) Epoch 87, batch 200, loss[loss=0.1347, simple_loss=0.2285, pruned_loss=0.0204, over 24816.00 frames. ], tot_loss[loss=0.1459, simple_loss=0.2395, pruned_loss=0.0262, over 3048328.89 frames. ], batch size: 72, lr: 2.82e-03, grad_scale: 32.0 2023-12-05 01:21:10,890 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=514466.6666666667, ans=0.125 2023-12-05 01:21:16,012 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=2.95 vs. limit=12.0 2023-12-05 01:21:30,902 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=514533.3333333333, ans=0.1 2023-12-05 01:21:42,684 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.17 vs. limit=12.0 2023-12-05 01:21:44,977 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.86 vs. limit=15.0 2023-12-05 01:22:07,962 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=514733.3333333333, ans=0.1 2023-12-05 01:22:08,079 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-05 01:22:11,180 INFO [train.py:1087] (2/4) Epoch 87, batch 250, loss[loss=0.1526, simple_loss=0.2472, pruned_loss=0.02902, over 24765.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.2399, pruned_loss=0.02636, over 3448910.62 frames. ], batch size: 70, lr: 2.82e-03, grad_scale: 32.0 2023-12-05 01:22:18,886 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.92 vs. 
limit=12.0 2023-12-05 01:22:21,912 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=514866.6666666667, ans=0.125 2023-12-05 01:22:23,914 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.316e+02 1.383e+02 1.508e+02 1.759e+02, threshold=2.767e+02, percent-clipped=0.0 2023-12-05 01:22:25,472 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=514866.6666666667, ans=0.0 2023-12-05 01:22:28,045 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.91 vs. limit=15.0 2023-12-05 01:22:34,473 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=514933.3333333333, ans=0.0 2023-12-05 01:22:40,051 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=514933.3333333333, ans=0.07 2023-12-05 01:23:09,060 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=515133.3333333333, ans=0.125 2023-12-05 01:23:10,272 INFO [train.py:1087] (2/4) Epoch 87, batch 300, loss[loss=0.1394, simple_loss=0.2333, pruned_loss=0.02276, over 24552.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2396, pruned_loss=0.02625, over 3762018.07 frames. ], batch size: 63, lr: 2.82e-03, grad_scale: 32.0 2023-12-05 01:23:14,825 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-05 01:23:23,113 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=515200.0, ans=0.125 2023-12-05 01:23:45,626 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=515333.3333333333, ans=0.1 2023-12-05 01:24:04,914 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.38 vs. limit=15.0 2023-12-05 01:24:10,031 INFO [train.py:1087] (2/4) Epoch 87, batch 350, loss[loss=0.1431, simple_loss=0.2321, pruned_loss=0.02707, over 24741.00 frames. ], tot_loss[loss=0.1465, simple_loss=0.2398, pruned_loss=0.02661, over 3987241.18 frames. ], batch size: 61, lr: 2.82e-03, grad_scale: 32.0 2023-12-05 01:24:22,010 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=515533.3333333333, ans=0.125 2023-12-05 01:24:24,326 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.098e+02 1.250e+02 1.331e+02 1.413e+02 1.864e+02, threshold=2.662e+02, percent-clipped=0.0 2023-12-05 01:25:10,269 INFO [train.py:1087] (2/4) Epoch 87, batch 400, loss[loss=0.15, simple_loss=0.241, pruned_loss=0.02948, over 24522.00 frames. ], tot_loss[loss=0.1467, simple_loss=0.2401, pruned_loss=0.02668, over 4179485.39 frames. ], batch size: 77, lr: 2.82e-03, grad_scale: 32.0 2023-12-05 01:25:12,075 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.38 vs. limit=15.0 2023-12-05 01:25:16,847 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.28 vs. 
limit=15.0 2023-12-05 01:25:47,070 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=516000.0, ans=0.125 2023-12-05 01:25:48,192 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=516000.0, ans=0.125 2023-12-05 01:26:11,019 INFO [train.py:1087] (2/4) Epoch 87, batch 450, loss[loss=0.1461, simple_loss=0.2371, pruned_loss=0.02753, over 24505.00 frames. ], tot_loss[loss=0.1466, simple_loss=0.24, pruned_loss=0.02658, over 4317423.78 frames. ], batch size: 75, lr: 2.82e-03, grad_scale: 32.0 2023-12-05 01:26:12,507 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=516133.3333333333, ans=0.0 2023-12-05 01:26:16,106 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=516133.3333333333, ans=0.125 2023-12-05 01:26:19,416 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=516133.3333333333, ans=0.2 2023-12-05 01:26:23,847 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.102e+02 1.247e+02 1.316e+02 1.414e+02 1.848e+02, threshold=2.632e+02, percent-clipped=0.0 2023-12-05 01:26:26,476 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=516200.0, ans=0.0 2023-12-05 01:26:49,863 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=516333.3333333333, ans=0.07 2023-12-05 01:27:11,714 INFO [train.py:1087] (2/4) Epoch 87, batch 500, loss[loss=0.1389, simple_loss=0.2383, pruned_loss=0.01978, over 24791.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.2398, pruned_loss=0.02636, over 4430972.44 frames. ], batch size: 71, lr: 2.82e-03, grad_scale: 32.0 2023-12-05 01:27:13,166 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=516466.6666666667, ans=0.0 2023-12-05 01:27:37,830 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=516600.0, ans=0.1 2023-12-05 01:28:11,483 INFO [train.py:1087] (2/4) Epoch 87, batch 550, loss[loss=0.1336, simple_loss=0.2272, pruned_loss=0.02002, over 24789.00 frames. ], tot_loss[loss=0.1466, simple_loss=0.2403, pruned_loss=0.02647, over 4520124.27 frames. ], batch size: 72, lr: 2.82e-03, grad_scale: 32.0 2023-12-05 01:28:13,401 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=516800.0, ans=0.125 2023-12-05 01:28:22,839 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=516866.6666666667, ans=0.125 2023-12-05 01:28:24,755 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.107e+02 1.296e+02 1.354e+02 1.450e+02 2.128e+02, threshold=2.708e+02, percent-clipped=0.0 2023-12-05 01:29:01,337 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.74 vs. limit=15.0 2023-12-05 01:29:11,290 INFO [train.py:1087] (2/4) Epoch 87, batch 600, loss[loss=0.1478, simple_loss=0.2438, pruned_loss=0.02589, over 22552.00 frames. 
], tot_loss[loss=0.1461, simple_loss=0.2397, pruned_loss=0.02622, over 4585963.02 frames. ], batch size: 54, lr: 2.82e-03, grad_scale: 32.0 2023-12-05 01:29:11,590 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=517133.3333333333, ans=0.125 2023-12-05 01:29:13,950 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=517133.3333333333, ans=0.125 2023-12-05 01:29:16,092 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=517133.3333333333, ans=0.2 2023-12-05 01:29:19,395 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=517133.3333333333, ans=0.2 2023-12-05 01:29:29,183 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.39 vs. limit=15.0 2023-12-05 01:29:33,675 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=517200.0, ans=0.125 2023-12-05 01:29:35,875 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=517266.6666666667, ans=0.2 2023-12-05 01:29:49,323 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=517333.3333333333, ans=0.125 2023-12-05 01:30:11,108 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=517466.6666666667, ans=0.125 2023-12-05 01:30:12,099 INFO [train.py:1087] (2/4) Epoch 87, batch 650, loss[loss=0.1715, simple_loss=0.2573, pruned_loss=0.0428, over 16523.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.2396, pruned_loss=0.02626, over 4622595.11 frames. ], batch size: 177, lr: 2.81e-03, grad_scale: 32.0 2023-12-05 01:30:20,886 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=517466.6666666667, ans=0.125 2023-12-05 01:30:21,422 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.44 vs. 
limit=10.0 2023-12-05 01:30:25,005 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.098e+02 1.241e+02 1.373e+02 1.498e+02 1.949e+02, threshold=2.746e+02, percent-clipped=0.0 2023-12-05 01:30:26,405 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=517533.3333333333, ans=0.125 2023-12-05 01:30:30,433 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=517533.3333333333, ans=0.1 2023-12-05 01:30:44,510 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=517600.0, ans=0.0 2023-12-05 01:30:47,782 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-05 01:31:07,884 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=517733.3333333333, ans=0.0 2023-12-05 01:31:09,070 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=517733.3333333333, ans=0.04949747468305833 2023-12-05 01:31:12,144 INFO [train.py:1087] (2/4) Epoch 87, batch 700, loss[loss=0.1509, simple_loss=0.2459, pruned_loss=0.02799, over 22947.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.2396, pruned_loss=0.02625, over 4673525.33 frames. ], batch size: 106, lr: 2.81e-03, grad_scale: 32.0 2023-12-05 01:31:17,075 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=517800.0, ans=0.125 2023-12-05 01:31:20,633 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=517800.0, ans=0.125 2023-12-05 01:31:47,264 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=518000.0, ans=0.0 2023-12-05 01:31:49,559 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=518000.0, ans=0.125 2023-12-05 01:31:59,008 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.18 vs. limit=6.0 2023-12-05 01:32:05,777 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=518066.6666666667, ans=0.0 2023-12-05 01:32:11,971 INFO [train.py:1087] (2/4) Epoch 87, batch 750, loss[loss=0.1419, simple_loss=0.2391, pruned_loss=0.0223, over 24543.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2397, pruned_loss=0.02621, over 4706691.19 frames. 
], batch size: 63, lr: 2.81e-03, grad_scale: 32.0 2023-12-05 01:32:15,742 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-05 01:32:16,689 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=518133.3333333333, ans=0.0 2023-12-05 01:32:24,536 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.104e+02 1.264e+02 1.322e+02 1.416e+02 1.779e+02, threshold=2.644e+02, percent-clipped=0.0 2023-12-05 01:32:27,036 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=518200.0, ans=0.125 2023-12-05 01:32:29,356 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=518200.0, ans=0.125 2023-12-05 01:32:35,276 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=518266.6666666667, ans=0.0 2023-12-05 01:32:47,764 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=518333.3333333333, ans=0.025 2023-12-05 01:33:00,666 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=518400.0, ans=0.1 2023-12-05 01:33:10,569 INFO [train.py:1087] (2/4) Epoch 87, batch 800, loss[loss=0.1346, simple_loss=0.2304, pruned_loss=0.01945, over 24786.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.2398, pruned_loss=0.02619, over 4723773.23 frames. ], batch size: 70, lr: 2.81e-03, grad_scale: 32.0 2023-12-05 01:33:23,814 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=518533.3333333333, ans=0.125 2023-12-05 01:33:32,650 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.32 vs. limit=15.0 2023-12-05 01:33:37,585 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-05 01:34:01,238 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-12-05 01:34:01,962 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=518733.3333333333, ans=0.05 2023-12-05 01:34:05,993 INFO [train.py:1087] (2/4) Epoch 87, batch 850, loss[loss=0.1323, simple_loss=0.226, pruned_loss=0.01926, over 24558.00 frames. ], tot_loss[loss=0.1464, simple_loss=0.2398, pruned_loss=0.02646, over 4737431.68 frames. ], batch size: 66, lr: 2.81e-03, grad_scale: 32.0 2023-12-05 01:34:17,654 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.150e+02 1.326e+02 1.393e+02 1.545e+02 2.140e+02, threshold=2.786e+02, percent-clipped=0.0 2023-12-05 01:34:39,893 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.01 vs. limit=15.0 2023-12-05 01:34:43,559 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=519000.0, ans=0.125 2023-12-05 01:35:10,931 INFO [train.py:1087] (2/4) Epoch 88, batch 0, loss[loss=0.1504, simple_loss=0.2418, pruned_loss=0.02948, over 24375.00 frames. 
], tot_loss[loss=0.1504, simple_loss=0.2418, pruned_loss=0.02948, over 24375.00 frames. ], batch size: 79, lr: 2.79e-03, grad_scale: 32.0 2023-12-05 01:35:10,932 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-05 01:35:19,315 INFO [zipformer.py:1876] (2/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.4082, 2.9610, 2.3203, 3.2914], device='cuda:2') 2023-12-05 01:35:24,426 INFO [train.py:1119] (2/4) Epoch 88, validation: loss=0.1507, simple_loss=0.2459, pruned_loss=0.02777, over 944034.00 frames. 2023-12-05 01:35:24,427 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-05 01:35:26,009 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=519100.0, ans=0.1 2023-12-05 01:35:39,711 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=519166.6666666667, ans=0.04949747468305833 2023-12-05 01:35:41,980 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=519166.6666666667, ans=0.125 2023-12-05 01:35:49,846 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=519233.3333333333, ans=0.1 2023-12-05 01:36:16,707 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=519366.6666666667, ans=0.125 2023-12-05 01:36:21,678 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=519366.6666666667, ans=0.125 2023-12-05 01:36:23,739 INFO [train.py:1087] (2/4) Epoch 88, batch 50, loss[loss=0.1412, simple_loss=0.2402, pruned_loss=0.02108, over 24688.00 frames. ], tot_loss[loss=0.1449, simple_loss=0.2391, pruned_loss=0.02536, over 1082177.07 frames. ], batch size: 74, lr: 2.79e-03, grad_scale: 32.0 2023-12-05 01:36:33,786 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-12-05 01:36:42,461 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.145e+02 1.338e+02 1.439e+02 1.555e+02 2.123e+02, threshold=2.877e+02, percent-clipped=0.0 2023-12-05 01:36:42,680 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=519500.0, ans=0.125 2023-12-05 01:37:15,650 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.07 vs. limit=15.0 2023-12-05 01:37:16,630 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=519700.0, ans=0.125 2023-12-05 01:37:23,393 INFO [train.py:1087] (2/4) Epoch 88, batch 100, loss[loss=0.1361, simple_loss=0.2303, pruned_loss=0.02092, over 24786.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.24, pruned_loss=0.02608, over 1890796.89 frames. 
], batch size: 73, lr: 2.79e-03, grad_scale: 32.0 2023-12-05 01:37:33,068 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=519766.6666666667, ans=0.0 2023-12-05 01:37:34,424 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=519833.3333333333, ans=0.125 2023-12-05 01:37:37,256 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.51 vs. limit=6.0 2023-12-05 01:37:59,776 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.55 vs. limit=6.0 2023-12-05 01:38:02,923 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=519966.6666666667, ans=0.125 2023-12-05 01:38:03,215 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=519966.6666666667, ans=0.125 2023-12-05 01:38:23,191 INFO [train.py:1087] (2/4) Epoch 88, batch 150, loss[loss=0.1428, simple_loss=0.2346, pruned_loss=0.02551, over 24707.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.2399, pruned_loss=0.02632, over 2528854.37 frames. ], batch size: 67, lr: 2.79e-03, grad_scale: 32.0 2023-12-05 01:38:40,788 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=520166.6666666667, ans=0.2 2023-12-05 01:38:42,589 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.102e+02 1.250e+02 1.326e+02 1.416e+02 1.687e+02, threshold=2.651e+02, percent-clipped=0.0 2023-12-05 01:38:46,801 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.79 vs. limit=15.0 2023-12-05 01:38:57,171 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff3.min_abs, batch_count=520233.3333333333, ans=0.2 2023-12-05 01:39:07,239 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=520300.0, ans=0.125 2023-12-05 01:39:13,054 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.74 vs. limit=15.0 2023-12-05 01:39:22,303 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.17 vs. limit=10.0 2023-12-05 01:39:23,929 INFO [train.py:1087] (2/4) Epoch 88, batch 200, loss[loss=0.1378, simple_loss=0.2331, pruned_loss=0.02128, over 24851.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.24, pruned_loss=0.02633, over 3029698.98 frames. ], batch size: 68, lr: 2.79e-03, grad_scale: 32.0 2023-12-05 01:39:24,186 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=520433.3333333333, ans=0.0 2023-12-05 01:39:26,060 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.95 vs. 
limit=15.0 2023-12-05 01:39:26,979 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=520433.3333333333, ans=0.125 2023-12-05 01:39:31,851 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.86 vs. limit=10.0 2023-12-05 01:39:41,305 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=520500.0, ans=0.125 2023-12-05 01:39:46,397 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=520500.0, ans=0.125 2023-12-05 01:39:54,571 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=520566.6666666667, ans=0.0 2023-12-05 01:39:57,852 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=520566.6666666667, ans=0.125 2023-12-05 01:39:59,570 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.42 vs. limit=22.5 2023-12-05 01:40:20,343 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-05 01:40:24,688 INFO [train.py:1087] (2/4) Epoch 88, batch 250, loss[loss=0.1468, simple_loss=0.2472, pruned_loss=0.02323, over 23726.00 frames. ], tot_loss[loss=0.1464, simple_loss=0.2398, pruned_loss=0.0265, over 3428887.79 frames. ], batch size: 57, lr: 2.79e-03, grad_scale: 32.0 2023-12-05 01:40:43,726 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.172e+02 1.286e+02 1.349e+02 1.447e+02 1.873e+02, threshold=2.697e+02, percent-clipped=0.0 2023-12-05 01:40:55,109 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=520900.0, ans=0.07 2023-12-05 01:41:25,472 INFO [train.py:1087] (2/4) Epoch 88, batch 300, loss[loss=0.1419, simple_loss=0.2341, pruned_loss=0.02481, over 24737.00 frames. ], tot_loss[loss=0.1462, simple_loss=0.2397, pruned_loss=0.02637, over 3728560.91 frames. ], batch size: 61, lr: 2.79e-03, grad_scale: 32.0 2023-12-05 01:41:34,913 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=521100.0, ans=0.0 2023-12-05 01:41:53,652 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=521233.3333333333, ans=0.125 2023-12-05 01:42:25,196 INFO [train.py:1087] (2/4) Epoch 88, batch 350, loss[loss=0.1488, simple_loss=0.2448, pruned_loss=0.0264, over 24719.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2396, pruned_loss=0.02616, over 3965266.22 frames. ], batch size: 67, lr: 2.79e-03, grad_scale: 32.0 2023-12-05 01:42:26,133 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.59 vs. limit=10.0 2023-12-05 01:42:31,108 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.05 vs. 
limit=15.0 2023-12-05 01:42:35,826 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=521433.3333333333, ans=0.125 2023-12-05 01:42:44,772 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.131e+02 1.283e+02 1.389e+02 1.504e+02 1.993e+02, threshold=2.777e+02, percent-clipped=0.0 2023-12-05 01:42:57,698 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=521566.6666666667, ans=0.125 2023-12-05 01:43:07,154 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=521633.3333333333, ans=0.125 2023-12-05 01:43:08,631 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.42 vs. limit=22.5 2023-12-05 01:43:23,519 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.53 vs. limit=15.0 2023-12-05 01:43:25,147 INFO [train.py:1087] (2/4) Epoch 88, batch 400, loss[loss=0.1591, simple_loss=0.2486, pruned_loss=0.03486, over 24289.00 frames. ], tot_loss[loss=0.1468, simple_loss=0.2403, pruned_loss=0.02665, over 4125978.49 frames. ], batch size: 79, lr: 2.79e-03, grad_scale: 32.0 2023-12-05 01:43:27,869 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=521766.6666666667, ans=0.125 2023-12-05 01:43:32,920 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=521766.6666666667, ans=0.0 2023-12-05 01:43:44,725 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=521833.3333333333, ans=0.125 2023-12-05 01:43:50,392 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=521900.0, ans=0.125 2023-12-05 01:43:55,395 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.85 vs. limit=6.0 2023-12-05 01:44:20,700 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=522033.3333333333, ans=0.0 2023-12-05 01:44:25,942 INFO [train.py:1087] (2/4) Epoch 88, batch 450, loss[loss=0.1369, simple_loss=0.231, pruned_loss=0.02143, over 24551.00 frames. ], tot_loss[loss=0.1465, simple_loss=0.24, pruned_loss=0.02645, over 4284825.71 frames. 
], batch size: 66, lr: 2.79e-03, grad_scale: 32.0 2023-12-05 01:44:27,259 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=522100.0, ans=0.1 2023-12-05 01:44:40,571 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=522166.6666666667, ans=0.0 2023-12-05 01:44:41,797 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=522166.6666666667, ans=0.1 2023-12-05 01:44:45,808 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.120e+02 1.274e+02 1.381e+02 1.491e+02 2.054e+02, threshold=2.762e+02, percent-clipped=0.0 2023-12-05 01:44:58,703 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=522233.3333333333, ans=0.0 2023-12-05 01:45:07,699 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=522300.0, ans=0.1 2023-12-05 01:45:07,790 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-05 01:45:22,043 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.41 vs. limit=22.5 2023-12-05 01:45:25,701 INFO [scaling.py:1022] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.68 vs. limit=5.0 2023-12-05 01:45:25,903 INFO [train.py:1087] (2/4) Epoch 88, batch 500, loss[loss=0.1314, simple_loss=0.2273, pruned_loss=0.01775, over 24854.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.2399, pruned_loss=0.0263, over 4407656.80 frames. ], batch size: 68, lr: 2.78e-03, grad_scale: 32.0 2023-12-05 01:46:20,999 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=522700.0, ans=0.5 2023-12-05 01:46:25,637 INFO [train.py:1087] (2/4) Epoch 88, batch 550, loss[loss=0.1357, simple_loss=0.2308, pruned_loss=0.02028, over 24552.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.2397, pruned_loss=0.02625, over 4505144.48 frames. 
], batch size: 66, lr: 2.78e-03, grad_scale: 32.0 2023-12-05 01:46:28,623 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=522766.6666666667, ans=0.2 2023-12-05 01:46:33,351 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=522766.6666666667, ans=0.125 2023-12-05 01:46:34,670 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=522766.6666666667, ans=0.125 2023-12-05 01:46:39,549 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=522833.3333333333, ans=0.2 2023-12-05 01:46:45,069 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.139e+02 1.255e+02 1.343e+02 1.426e+02 1.828e+02, threshold=2.686e+02, percent-clipped=0.0 2023-12-05 01:46:48,812 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=522900.0, ans=0.1 2023-12-05 01:46:59,725 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=522900.0, ans=0.0 2023-12-05 01:47:12,163 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=522966.6666666667, ans=0.04949747468305833 2023-12-05 01:47:25,515 INFO [train.py:1087] (2/4) Epoch 88, batch 600, loss[loss=0.1512, simple_loss=0.2442, pruned_loss=0.02912, over 23005.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.2396, pruned_loss=0.02631, over 4585283.48 frames. ], batch size: 106, lr: 2.78e-03, grad_scale: 32.0 2023-12-05 01:48:00,161 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.69 vs. limit=15.0 2023-12-05 01:48:25,265 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=523433.3333333333, ans=0.125 2023-12-05 01:48:26,043 INFO [train.py:1087] (2/4) Epoch 88, batch 650, loss[loss=0.1552, simple_loss=0.2467, pruned_loss=0.03186, over 23597.00 frames. ], tot_loss[loss=0.1462, simple_loss=0.2397, pruned_loss=0.02638, over 4641636.53 frames. ], batch size: 94, lr: 2.78e-03, grad_scale: 32.0 2023-12-05 01:48:31,555 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.63 vs. limit=15.0 2023-12-05 01:48:33,169 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=523433.3333333333, ans=0.125 2023-12-05 01:48:45,124 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.133e+02 1.280e+02 1.378e+02 1.457e+02 1.837e+02, threshold=2.755e+02, percent-clipped=0.0 2023-12-05 01:48:46,669 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=523500.0, ans=0.0 2023-12-05 01:48:51,139 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=523566.6666666667, ans=0.0 2023-12-05 01:48:52,731 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.81 vs. 
limit=15.0 2023-12-05 01:49:12,706 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=523700.0, ans=0.125 2023-12-05 01:49:25,642 INFO [train.py:1087] (2/4) Epoch 88, batch 700, loss[loss=0.1497, simple_loss=0.2432, pruned_loss=0.02811, over 24561.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.2399, pruned_loss=0.02633, over 4673896.51 frames. ], batch size: 63, lr: 2.78e-03, grad_scale: 32.0 2023-12-05 01:49:32,542 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=523766.6666666667, ans=0.0 2023-12-05 01:49:33,115 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.67 vs. limit=6.0 2023-12-05 01:49:52,586 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=523900.0, ans=0.125 2023-12-05 01:49:54,299 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.63 vs. limit=15.0 2023-12-05 01:50:19,933 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=524033.3333333333, ans=0.025 2023-12-05 01:50:20,904 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=524033.3333333333, ans=0.0 2023-12-05 01:50:25,574 INFO [train.py:1087] (2/4) Epoch 88, batch 750, loss[loss=0.1414, simple_loss=0.2332, pruned_loss=0.02481, over 24555.00 frames. ], tot_loss[loss=0.1462, simple_loss=0.2398, pruned_loss=0.02628, over 4697738.54 frames. ], batch size: 63, lr: 2.78e-03, grad_scale: 32.0 2023-12-05 01:50:32,666 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=524100.0, ans=0.2 2023-12-05 01:50:41,644 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=524166.6666666667, ans=0.2 2023-12-05 01:50:42,757 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=524166.6666666667, ans=0.0 2023-12-05 01:50:43,993 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.193e+02 1.285e+02 1.365e+02 1.514e+02 2.030e+02, threshold=2.729e+02, percent-clipped=0.0 2023-12-05 01:50:51,581 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=524233.3333333333, ans=0.125 2023-12-05 01:50:51,677 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=524233.3333333333, ans=0.0 2023-12-05 01:50:56,345 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=524233.3333333333, ans=0.1 2023-12-05 01:51:00,220 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=524300.0, ans=0.0 2023-12-05 01:51:24,700 INFO [train.py:1087] (2/4) Epoch 88, batch 800, loss[loss=0.1465, simple_loss=0.2406, pruned_loss=0.02615, over 24788.00 frames. ], tot_loss[loss=0.1464, simple_loss=0.24, pruned_loss=0.02637, over 4717029.91 frames. 
], batch size: 71, lr: 2.78e-03, grad_scale: 32.0 2023-12-05 01:51:54,855 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=15.0 2023-12-05 01:51:56,731 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=524633.3333333334, ans=0.125 2023-12-05 01:52:02,052 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=524633.3333333334, ans=0.125 2023-12-05 01:52:19,126 INFO [train.py:1087] (2/4) Epoch 88, batch 850, loss[loss=0.1538, simple_loss=0.2473, pruned_loss=0.03015, over 23607.00 frames. ], tot_loss[loss=0.1464, simple_loss=0.2401, pruned_loss=0.02635, over 4738977.83 frames. ], batch size: 94, lr: 2.78e-03, grad_scale: 32.0 2023-12-05 01:52:28,094 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.51 vs. limit=15.0 2023-12-05 01:52:36,076 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.077e+02 1.284e+02 1.365e+02 1.470e+02 3.303e+02, threshold=2.731e+02, percent-clipped=1.0 2023-12-05 01:52:38,893 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=524833.3333333334, ans=0.025 2023-12-05 01:52:46,584 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=524900.0, ans=0.0 2023-12-05 01:52:52,206 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=524966.6666666666, ans=0.1 2023-12-05 01:52:56,305 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=524966.6666666666, ans=0.125 2023-12-05 01:53:18,528 INFO [train.py:1087] (2/4) Epoch 89, batch 0, loss[loss=0.136, simple_loss=0.2329, pruned_loss=0.01953, over 24808.00 frames. ], tot_loss[loss=0.136, simple_loss=0.2329, pruned_loss=0.01953, over 24808.00 frames. ], batch size: 73, lr: 2.76e-03, grad_scale: 32.0 2023-12-05 01:53:18,528 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-05 01:53:26,839 INFO [zipformer.py:1876] (2/4) name=encoder.encoders.3.encoder.layers.3.self_attn_weights, attn_weights_entropy = tensor([1.9393, 2.3255, 2.5508, 2.4176, 2.2412, 2.4097, 2.4202, 2.3086], device='cuda:2') 2023-12-05 01:53:32,049 INFO [train.py:1119] (2/4) Epoch 89, validation: loss=0.1517, simple_loss=0.2464, pruned_loss=0.02848, over 944034.00 frames. 2023-12-05 01:53:32,050 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-05 01:53:38,082 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=525066.6666666666, ans=0.1 2023-12-05 01:53:51,323 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=525133.3333333334, ans=0.5 2023-12-05 01:54:00,466 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=525200.0, ans=0.125 2023-12-05 01:54:21,333 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.74 vs. 
limit=22.5 2023-12-05 01:54:30,982 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=525400.0, ans=0.1 2023-12-05 01:54:31,911 INFO [train.py:1087] (2/4) Epoch 89, batch 50, loss[loss=0.1426, simple_loss=0.2349, pruned_loss=0.02516, over 24776.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.2398, pruned_loss=0.02639, over 1073087.45 frames. ], batch size: 71, lr: 2.76e-03, grad_scale: 32.0 2023-12-05 01:54:56,622 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.154e+02 1.263e+02 1.333e+02 1.502e+02 2.119e+02, threshold=2.665e+02, percent-clipped=0.0 2023-12-05 01:54:59,974 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=525533.3333333334, ans=0.04949747468305833 2023-12-05 01:55:02,665 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.14 vs. limit=6.0 2023-12-05 01:55:07,809 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=525600.0, ans=0.1 2023-12-05 01:55:31,470 INFO [train.py:1087] (2/4) Epoch 89, batch 100, loss[loss=0.1402, simple_loss=0.233, pruned_loss=0.02372, over 24760.00 frames. ], tot_loss[loss=0.1459, simple_loss=0.2399, pruned_loss=0.02593, over 1904576.13 frames. ], batch size: 66, lr: 2.76e-03, grad_scale: 32.0 2023-12-05 01:55:35,626 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.54 vs. limit=15.0 2023-12-05 01:55:41,932 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=525733.3333333334, ans=0.0 2023-12-05 01:55:59,396 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.69 vs. limit=15.0 2023-12-05 01:56:00,554 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.60 vs. limit=15.0 2023-12-05 01:56:31,460 INFO [train.py:1087] (2/4) Epoch 89, batch 150, loss[loss=0.1512, simple_loss=0.2402, pruned_loss=0.03115, over 24764.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.2401, pruned_loss=0.02629, over 2549486.76 frames. ], batch size: 65, lr: 2.76e-03, grad_scale: 32.0 2023-12-05 01:56:44,421 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.84 vs. limit=10.0 2023-12-05 01:56:47,520 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=526133.3333333334, ans=0.125 2023-12-05 01:56:56,267 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.104e+02 1.268e+02 1.348e+02 1.503e+02 1.865e+02, threshold=2.695e+02, percent-clipped=0.0 2023-12-05 01:56:57,965 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.99 vs. limit=22.5 2023-12-05 01:57:30,791 INFO [train.py:1087] (2/4) Epoch 89, batch 200, loss[loss=0.1548, simple_loss=0.2517, pruned_loss=0.02897, over 22813.00 frames. ], tot_loss[loss=0.1457, simple_loss=0.2395, pruned_loss=0.02599, over 3054487.11 frames. 
], batch size: 106, lr: 2.76e-03, grad_scale: 64.0 2023-12-05 01:57:54,221 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=526533.3333333334, ans=0.0 2023-12-05 01:58:07,815 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=526600.0, ans=0.1 2023-12-05 01:58:24,514 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=526666.6666666666, ans=0.0 2023-12-05 01:58:31,076 INFO [train.py:1087] (2/4) Epoch 89, batch 250, loss[loss=0.143, simple_loss=0.2358, pruned_loss=0.02507, over 24601.00 frames. ], tot_loss[loss=0.1459, simple_loss=0.2395, pruned_loss=0.02613, over 3450072.12 frames. ], batch size: 68, lr: 2.76e-03, grad_scale: 64.0 2023-12-05 01:58:34,581 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=526733.3333333334, ans=0.1 2023-12-05 01:58:45,897 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=526800.0, ans=0.0 2023-12-05 01:58:55,995 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.092e+02 1.263e+02 1.358e+02 1.436e+02 1.753e+02, threshold=2.715e+02, percent-clipped=0.0 2023-12-05 01:59:11,856 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.51 vs. limit=15.0 2023-12-05 01:59:22,895 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=527000.0, ans=0.125 2023-12-05 01:59:30,780 INFO [train.py:1087] (2/4) Epoch 89, batch 300, loss[loss=0.1494, simple_loss=0.243, pruned_loss=0.02792, over 24480.00 frames. ], tot_loss[loss=0.1457, simple_loss=0.2393, pruned_loss=0.02607, over 3748137.16 frames. ], batch size: 75, lr: 2.76e-03, grad_scale: 64.0 2023-12-05 02:00:06,349 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.41 vs. limit=22.5 2023-12-05 02:00:15,038 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=527266.6666666666, ans=0.07 2023-12-05 02:00:29,584 INFO [train.py:1087] (2/4) Epoch 89, batch 350, loss[loss=0.1518, simple_loss=0.2492, pruned_loss=0.02718, over 22728.00 frames. ], tot_loss[loss=0.1459, simple_loss=0.2395, pruned_loss=0.02611, over 3978108.82 frames. ], batch size: 106, lr: 2.76e-03, grad_scale: 64.0 2023-12-05 02:00:47,691 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=527466.6666666666, ans=0.125 2023-12-05 02:00:55,289 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.138e+02 1.292e+02 1.378e+02 1.472e+02 1.847e+02, threshold=2.756e+02, percent-clipped=0.0 2023-12-05 02:01:19,651 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.74 vs. limit=15.0 2023-12-05 02:01:30,201 INFO [train.py:1087] (2/4) Epoch 89, batch 400, loss[loss=0.1512, simple_loss=0.2421, pruned_loss=0.03015, over 24740.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2397, pruned_loss=0.02613, over 4154729.50 frames. 
], batch size: 63, lr: 2.76e-03, grad_scale: 64.0 2023-12-05 02:01:37,400 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=527733.3333333334, ans=0.125 2023-12-05 02:01:49,846 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.19 vs. limit=22.5 2023-12-05 02:02:03,421 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=527866.6666666666, ans=0.125 2023-12-05 02:02:29,446 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-12-05 02:02:29,842 INFO [train.py:1087] (2/4) Epoch 89, batch 450, loss[loss=0.1396, simple_loss=0.2331, pruned_loss=0.02299, over 24565.00 frames. ], tot_loss[loss=0.1459, simple_loss=0.2397, pruned_loss=0.02606, over 4270447.27 frames. ], batch size: 65, lr: 2.75e-03, grad_scale: 64.0 2023-12-05 02:02:31,219 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=528066.6666666666, ans=0.0 2023-12-05 02:02:34,638 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=528066.6666666666, ans=0.125 2023-12-05 02:02:36,892 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=528066.6666666666, ans=0.125 2023-12-05 02:02:50,234 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.34 vs. limit=10.0 2023-12-05 02:02:54,087 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.114e+02 1.292e+02 1.369e+02 1.510e+02 2.020e+02, threshold=2.739e+02, percent-clipped=0.0 2023-12-05 02:02:54,861 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-12-05 02:02:58,200 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=528200.0, ans=0.1 2023-12-05 02:02:59,596 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.50 vs. limit=15.0 2023-12-05 02:03:04,440 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=528266.6666666666, ans=0.1 2023-12-05 02:03:06,573 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=528266.6666666666, ans=0.0 2023-12-05 02:03:09,946 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=528266.6666666666, ans=0.0 2023-12-05 02:03:13,036 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=528266.6666666666, ans=0.2 2023-12-05 02:03:19,357 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.05 vs. limit=15.0 2023-12-05 02:03:28,776 INFO [train.py:1087] (2/4) Epoch 89, batch 500, loss[loss=0.1396, simple_loss=0.2337, pruned_loss=0.0227, over 24556.00 frames. 
], tot_loss[loss=0.146, simple_loss=0.2395, pruned_loss=0.02622, over 4393815.33 frames. ], batch size: 62, lr: 2.75e-03, grad_scale: 32.0 2023-12-05 02:03:37,727 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=528400.0, ans=0.0 2023-12-05 02:03:44,415 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=528466.6666666666, ans=0.1 2023-12-05 02:03:59,105 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=528533.3333333334, ans=0.125 2023-12-05 02:04:02,113 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=15.0 2023-12-05 02:04:28,310 INFO [train.py:1087] (2/4) Epoch 89, batch 550, loss[loss=0.1385, simple_loss=0.2295, pruned_loss=0.02373, over 24603.00 frames. ], tot_loss[loss=0.1459, simple_loss=0.2395, pruned_loss=0.02617, over 4462862.61 frames. ], batch size: 68, lr: 2.75e-03, grad_scale: 32.0 2023-12-05 02:04:32,288 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=528733.3333333334, ans=0.125 2023-12-05 02:04:33,695 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=528733.3333333334, ans=0.125 2023-12-05 02:04:36,978 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=528733.3333333334, ans=0.0 2023-12-05 02:04:46,444 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=528800.0, ans=0.125 2023-12-05 02:04:54,039 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.094e+02 1.299e+02 1.416e+02 1.516e+02 2.054e+02, threshold=2.831e+02, percent-clipped=0.0 2023-12-05 02:04:59,609 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.71 vs. limit=22.5 2023-12-05 02:05:00,897 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=528866.6666666666, ans=0.2 2023-12-05 02:05:02,071 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=528866.6666666666, ans=0.125 2023-12-05 02:05:02,283 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=528866.6666666666, ans=0.125 2023-12-05 02:05:27,937 INFO [train.py:1087] (2/4) Epoch 89, batch 600, loss[loss=0.1415, simple_loss=0.237, pruned_loss=0.02302, over 24603.00 frames. ], tot_loss[loss=0.1458, simple_loss=0.2394, pruned_loss=0.02612, over 4529818.04 frames. 
], batch size: 68, lr: 2.75e-03, grad_scale: 32.0 2023-12-05 02:05:57,857 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=529200.0, ans=0.04949747468305833 2023-12-05 02:06:00,065 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=529200.0, ans=0.1 2023-12-05 02:06:02,064 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=529200.0, ans=0.125 2023-12-05 02:06:16,732 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=529333.3333333334, ans=0.5 2023-12-05 02:06:25,659 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=529333.3333333334, ans=0.125 2023-12-05 02:06:27,692 INFO [train.py:1087] (2/4) Epoch 89, batch 650, loss[loss=0.1476, simple_loss=0.2403, pruned_loss=0.0274, over 24777.00 frames. ], tot_loss[loss=0.1457, simple_loss=0.2393, pruned_loss=0.02606, over 4590118.30 frames. ], batch size: 73, lr: 2.75e-03, grad_scale: 16.0 2023-12-05 02:06:29,945 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.27 vs. limit=15.0 2023-12-05 02:06:31,804 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=529400.0, ans=0.0 2023-12-05 02:06:54,545 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=529533.3333333334, ans=0.025 2023-12-05 02:06:55,361 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.122e+02 1.271e+02 1.358e+02 1.474e+02 2.858e+02, threshold=2.716e+02, percent-clipped=1.0 2023-12-05 02:07:06,142 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=529600.0, ans=0.0 2023-12-05 02:07:09,471 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=529600.0, ans=0.1 2023-12-05 02:07:14,567 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=529666.6666666666, ans=0.125 2023-12-05 02:07:19,588 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.88 vs. limit=22.5 2023-12-05 02:07:20,524 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=529666.6666666666, ans=0.0 2023-12-05 02:07:27,037 INFO [train.py:1087] (2/4) Epoch 89, batch 700, loss[loss=0.1423, simple_loss=0.2334, pruned_loss=0.02558, over 24571.00 frames. ], tot_loss[loss=0.1452, simple_loss=0.2389, pruned_loss=0.02574, over 4644142.21 frames. 
], batch size: 64, lr: 2.75e-03, grad_scale: 16.0 2023-12-05 02:07:32,764 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=529733.3333333334, ans=0.125 2023-12-05 02:07:32,774 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=529733.3333333334, ans=0.035 2023-12-05 02:07:37,775 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=529800.0, ans=0.125 2023-12-05 02:07:43,821 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.09 vs. limit=15.0 2023-12-05 02:07:47,532 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=529800.0, ans=0.1 2023-12-05 02:08:07,094 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=529933.3333333334, ans=0.0 2023-12-05 02:08:26,385 INFO [train.py:1087] (2/4) Epoch 89, batch 750, loss[loss=0.1539, simple_loss=0.2472, pruned_loss=0.03029, over 24469.00 frames. ], tot_loss[loss=0.1455, simple_loss=0.2393, pruned_loss=0.02587, over 4679415.41 frames. ], batch size: 77, lr: 2.75e-03, grad_scale: 16.0 2023-12-05 02:08:42,280 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=530133.3333333334, ans=0.125 2023-12-05 02:08:52,911 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.095e+02 1.289e+02 1.385e+02 1.489e+02 1.926e+02, threshold=2.771e+02, percent-clipped=0.0 2023-12-05 02:09:07,526 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=530266.6666666666, ans=0.125 2023-12-05 02:09:10,973 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=530266.6666666666, ans=0.125 2023-12-05 02:09:25,200 INFO [train.py:1087] (2/4) Epoch 89, batch 800, loss[loss=0.1495, simple_loss=0.2452, pruned_loss=0.02692, over 24738.00 frames. ], tot_loss[loss=0.1454, simple_loss=0.2393, pruned_loss=0.02577, over 4717119.39 frames. ], batch size: 61, lr: 2.75e-03, grad_scale: 32.0 2023-12-05 02:09:25,619 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=530400.0, ans=0.2 2023-12-05 02:09:59,253 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=530600.0, ans=0.1 2023-12-05 02:10:19,317 INFO [train.py:1087] (2/4) Epoch 89, batch 850, loss[loss=0.1472, simple_loss=0.2385, pruned_loss=0.028, over 24476.00 frames. ], tot_loss[loss=0.1458, simple_loss=0.2394, pruned_loss=0.02613, over 4734413.10 frames. 
], batch size: 75, lr: 2.75e-03, grad_scale: 32.0 2023-12-05 02:10:22,699 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=530733.3333333334, ans=0.0 2023-12-05 02:10:27,038 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=530733.3333333334, ans=0.07 2023-12-05 02:10:44,189 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.126e+02 1.313e+02 1.399e+02 1.494e+02 1.955e+02, threshold=2.797e+02, percent-clipped=0.0 2023-12-05 02:10:50,871 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=530933.3333333334, ans=0.5 2023-12-05 02:11:17,968 INFO [train.py:1087] (2/4) Epoch 90, batch 0, loss[loss=0.1355, simple_loss=0.2307, pruned_loss=0.02009, over 24759.00 frames. ], tot_loss[loss=0.1355, simple_loss=0.2307, pruned_loss=0.02009, over 24759.00 frames. ], batch size: 66, lr: 2.73e-03, grad_scale: 32.0 2023-12-05 02:11:17,969 INFO [train.py:1110] (2/4) Computing validation loss 2023-12-05 02:11:30,172 INFO [zipformer.py:1876] (2/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.4209, 3.8561, 3.7486, 4.1008], device='cuda:2') 2023-12-05 02:11:31,298 INFO [train.py:1119] (2/4) Epoch 90, validation: loss=0.1508, simple_loss=0.2458, pruned_loss=0.02792, over 944034.00 frames. 2023-12-05 02:11:31,299 INFO [train.py:1120] (2/4) Maximum memory allocated so far is 16177MB 2023-12-05 02:11:40,569 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=531033.3333333334, ans=0.125 2023-12-05 02:11:49,830 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.51 vs. limit=15.0 2023-12-05 02:11:53,142 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=531100.0, ans=0.125 2023-12-05 02:12:11,326 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=531233.3333333334, ans=0.1 2023-12-05 02:12:22,522 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=531300.0, ans=0.1 2023-12-05 02:12:23,630 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=531300.0, ans=0.125 2023-12-05 02:12:31,028 INFO [train.py:1087] (2/4) Epoch 90, batch 50, loss[loss=0.1303, simple_loss=0.2252, pruned_loss=0.01772, over 24715.00 frames. ], tot_loss[loss=0.1451, simple_loss=0.2379, pruned_loss=0.02615, over 1081388.49 frames. 
], batch size: 69, lr: 2.73e-03, grad_scale: 32.0 2023-12-05 02:12:53,685 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=531500.0, ans=0.04949747468305833 2023-12-05 02:12:56,303 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=531500.0, ans=0.0 2023-12-05 02:13:03,791 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.134e+02 1.265e+02 1.348e+02 1.429e+02 2.050e+02, threshold=2.696e+02, percent-clipped=0.0 2023-12-05 02:13:14,757 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=531566.6666666666, ans=0.125 2023-12-05 02:13:27,123 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=531633.3333333334, ans=0.125 2023-12-05 02:13:29,631 INFO [train.py:1087] (2/4) Epoch 90, batch 100, loss[loss=0.1437, simple_loss=0.2378, pruned_loss=0.02482, over 24805.00 frames. ], tot_loss[loss=0.1459, simple_loss=0.2398, pruned_loss=0.026, over 1909841.36 frames. ], batch size: 73, lr: 2.73e-03, grad_scale: 32.0 2023-12-05 02:13:33,637 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=531700.0, ans=0.0 2023-12-05 02:13:33,761 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=531700.0, ans=0.125 2023-12-05 02:13:35,063 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=531700.0, ans=0.2 2023-12-05 02:13:48,817 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=531766.6666666666, ans=0.1 2023-12-05 02:14:00,495 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.56 vs. limit=15.0 2023-12-05 02:14:11,384 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=531900.0, ans=0.125 2023-12-05 02:14:25,639 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-12-05 02:14:28,650 INFO [train.py:1087] (2/4) Epoch 90, batch 150, loss[loss=0.1384, simple_loss=0.2309, pruned_loss=0.023, over 24765.00 frames. ], tot_loss[loss=0.1457, simple_loss=0.2396, pruned_loss=0.02591, over 2564917.52 frames. 
], batch size: 70, lr: 2.73e-03, grad_scale: 32.0 2023-12-05 02:14:41,073 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=532100.0, ans=0.125 2023-12-05 02:14:45,866 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=532100.0, ans=0.125 2023-12-05 02:14:50,568 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=532100.0, ans=0.125 2023-12-05 02:14:59,924 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=532166.6666666666, ans=0.125 2023-12-05 02:15:01,842 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.135e+02 1.270e+02 1.373e+02 1.508e+02 1.981e+02, threshold=2.747e+02, percent-clipped=0.0 2023-12-05 02:15:10,015 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=532233.3333333334, ans=0.5 2023-12-05 02:15:23,792 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=532300.0, ans=0.125 2023-12-05 02:15:27,676 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.79 vs. limit=12.0 2023-12-05 02:15:28,241 INFO [train.py:1087] (2/4) Epoch 90, batch 200, loss[loss=0.1331, simple_loss=0.2265, pruned_loss=0.01979, over 24744.00 frames. ], tot_loss[loss=0.1457, simple_loss=0.2394, pruned_loss=0.02604, over 3064050.30 frames. ], batch size: 63, lr: 2.73e-03, grad_scale: 32.0 2023-12-05 02:15:32,691 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.11 vs. limit=22.5 2023-12-05 02:15:48,349 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=532433.3333333334, ans=0.125 2023-12-05 02:16:04,672 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=532566.6666666666, ans=0.1 2023-12-05 02:16:24,653 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.95 vs. limit=10.0 2023-12-05 02:16:28,514 INFO [train.py:1087] (2/4) Epoch 90, batch 250, loss[loss=0.1403, simple_loss=0.237, pruned_loss=0.02184, over 24860.00 frames. ], tot_loss[loss=0.1459, simple_loss=0.2394, pruned_loss=0.0262, over 3440840.77 frames. ], batch size: 68, lr: 2.73e-03, grad_scale: 32.0 2023-12-05 02:16:32,299 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=532700.0, ans=0.125 2023-12-05 02:16:38,188 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=532700.0, ans=0.125 2023-12-05 02:16:46,259 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=532766.6666666666, ans=0.125 2023-12-05 02:16:46,500 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.03 vs. 
limit=10.0 2023-12-05 02:16:48,280 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=532766.6666666666, ans=0.025 2023-12-05 02:16:48,431 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=532766.6666666666, ans=0.125 2023-12-05 02:16:49,325 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=532766.6666666666, ans=0.125 2023-12-05 02:17:01,317 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.96 vs. limit=15.0 2023-12-05 02:17:01,772 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.117e+02 1.285e+02 1.409e+02 1.536e+02 2.196e+02, threshold=2.819e+02, percent-clipped=0.0 2023-12-05 02:17:05,543 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=532900.0, ans=0.125 2023-12-05 02:17:17,518 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=532966.6666666666, ans=0.125 2023-12-05 02:17:28,247 INFO [train.py:1087] (2/4) Epoch 90, batch 300, loss[loss=0.1427, simple_loss=0.2357, pruned_loss=0.02482, over 24552.00 frames. ], tot_loss[loss=0.1458, simple_loss=0.2395, pruned_loss=0.02603, over 3737051.86 frames. ], batch size: 66, lr: 2.73e-03, grad_scale: 32.0 2023-12-05 02:17:29,626 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=533033.3333333334, ans=0.1 2023-12-05 02:17:30,753 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=533033.3333333334, ans=0.125 2023-12-05 02:17:38,706 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=533100.0, ans=0.125 2023-12-05 02:17:46,222 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.04 vs. limit=15.0 2023-12-05 02:17:50,463 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=533166.6666666666, ans=0.2 2023-12-05 02:17:51,413 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=533166.6666666666, ans=0.125 2023-12-05 02:17:54,910 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=533166.6666666666, ans=0.125 2023-12-05 02:17:56,214 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=533166.6666666666, ans=0.125 2023-12-05 02:18:25,117 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=533300.0, ans=0.125 2023-12-05 02:18:29,394 INFO [train.py:1087] (2/4) Epoch 90, batch 350, loss[loss=0.1638, simple_loss=0.2532, pruned_loss=0.03725, over 16486.00 frames. ], tot_loss[loss=0.1456, simple_loss=0.2393, pruned_loss=0.02593, over 3980044.65 frames. 
], batch size: 177, lr: 2.73e-03, grad_scale: 32.0 2023-12-05 02:18:35,717 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=533366.6666666666, ans=0.0 2023-12-05 02:18:42,942 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=533433.3333333334, ans=0.07 2023-12-05 02:18:42,982 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=533433.3333333334, ans=0.0 2023-12-05 02:19:02,403 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.080e+02 1.245e+02 1.317e+02 1.420e+02 1.727e+02, threshold=2.633e+02, percent-clipped=0.0 2023-12-05 02:19:03,818 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=533566.6666666666, ans=0.125 2023-12-05 02:19:05,029 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=533566.6666666666, ans=0.1 2023-12-05 02:19:12,287 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=533566.6666666666, ans=15.0 2023-12-05 02:19:13,067 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=533566.6666666666, ans=0.0 2023-12-05 02:19:24,263 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=533633.3333333334, ans=0.125 2023-12-05 02:19:28,689 INFO [train.py:1087] (2/4) Epoch 90, batch 400, loss[loss=0.1412, simple_loss=0.2326, pruned_loss=0.02488, over 24576.00 frames. ], tot_loss[loss=0.1452, simple_loss=0.2389, pruned_loss=0.0257, over 4166601.07 frames. ], batch size: 64, lr: 2.72e-03, grad_scale: 32.0 2023-12-05 02:20:05,538 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=533900.0, ans=0.0 2023-12-05 02:20:28,284 INFO [train.py:1087] (2/4) Epoch 90, batch 450, loss[loss=0.1448, simple_loss=0.2373, pruned_loss=0.02619, over 24764.00 frames. ], tot_loss[loss=0.1453, simple_loss=0.2391, pruned_loss=0.02579, over 4305040.80 frames. ], batch size: 64, lr: 2.72e-03, grad_scale: 32.0 2023-12-05 02:20:29,846 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=534033.3333333334, ans=0.125 2023-12-05 02:20:35,919 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.04 vs. limit=15.0 2023-12-05 02:20:36,669 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=534033.3333333334, ans=0.125 2023-12-05 02:21:01,375 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.157e+02 1.287e+02 1.360e+02 1.496e+02 1.959e+02, threshold=2.719e+02, percent-clipped=0.0 2023-12-05 02:21:27,496 INFO [train.py:1087] (2/4) Epoch 90, batch 500, loss[loss=0.1656, simple_loss=0.2534, pruned_loss=0.03895, over 16838.00 frames. ], tot_loss[loss=0.1456, simple_loss=0.2392, pruned_loss=0.02603, over 4397429.89 frames. 
], batch size: 177, lr: 2.72e-03, grad_scale: 32.0 2023-12-05 02:21:47,247 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.78 vs. limit=15.0 2023-12-05 02:21:56,370 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=534500.0, ans=0.125 2023-12-05 02:22:03,523 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.74 vs. limit=10.0 2023-12-05 02:22:25,774 INFO [train.py:1087] (2/4) Epoch 90, batch 550, loss[loss=0.1327, simple_loss=0.2258, pruned_loss=0.0198, over 24756.00 frames. ], tot_loss[loss=0.1456, simple_loss=0.2394, pruned_loss=0.02593, over 4487637.87 frames. ], batch size: 70, lr: 2.72e-03, grad_scale: 32.0 2023-12-05 02:22:28,643 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=534700.0, ans=0.125 2023-12-05 02:22:39,418 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=534766.6666666666, ans=0.0 2023-12-05 02:22:55,159 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=534833.3333333334, ans=0.0 2023-12-05 02:22:58,192 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.130e+02 1.275e+02 1.357e+02 1.447e+02 1.780e+02, threshold=2.714e+02, percent-clipped=0.0 2023-12-05 02:23:13,131 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=534966.6666666666, ans=15.0 2023-12-05 02:23:24,891 INFO [train.py:1087] (2/4) Epoch 90, batch 600, loss[loss=0.1408, simple_loss=0.2331, pruned_loss=0.02427, over 24804.00 frames. ], tot_loss[loss=0.1458, simple_loss=0.2396, pruned_loss=0.02598, over 4541039.04 frames. ], batch size: 62, lr: 2.72e-03, grad_scale: 32.0 2023-12-05 02:23:28,619 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=535033.3333333334, ans=0.125 2023-12-05 02:23:41,983 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=535100.0, ans=0.125 2023-12-05 02:23:48,581 INFO [scaling.py:1118] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-05 02:23:55,629 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=535166.6666666666, ans=0.0 2023-12-05 02:23:58,080 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.64 vs. limit=22.5 2023-12-05 02:24:04,933 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=535233.3333333334, ans=0.125 2023-12-05 02:24:11,430 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.82 vs. 
limit=15.0 2023-12-05 02:24:16,991 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=535300.0, ans=0.0 2023-12-05 02:24:23,404 INFO [train.py:1087] (2/4) Epoch 90, batch 650, loss[loss=0.1469, simple_loss=0.2394, pruned_loss=0.02713, over 24858.00 frames. ], tot_loss[loss=0.1455, simple_loss=0.2392, pruned_loss=0.02593, over 4611938.52 frames. ], batch size: 68, lr: 2.72e-03, grad_scale: 32.0 2023-12-05 02:24:42,461 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=535433.3333333334, ans=0.0 2023-12-05 02:24:42,775 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.23 vs. limit=22.5 2023-12-05 02:24:49,668 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=535500.0, ans=0.0 2023-12-05 02:24:55,911 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.117e+02 1.319e+02 1.391e+02 1.531e+02 2.026e+02, threshold=2.782e+02, percent-clipped=0.0 2023-12-05 02:25:16,087 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.63 vs. limit=15.0 2023-12-05 02:25:21,887 INFO [train.py:1087] (2/4) Epoch 90, batch 700, loss[loss=0.1312, simple_loss=0.2283, pruned_loss=0.01706, over 24692.00 frames. ], tot_loss[loss=0.1451, simple_loss=0.2389, pruned_loss=0.02569, over 4655237.01 frames. ], batch size: 74, lr: 2.72e-03, grad_scale: 32.0 2023-12-05 02:25:44,588 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=535833.3333333334, ans=0.1 2023-12-05 02:25:52,871 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=535833.3333333334, ans=0.125 2023-12-05 02:26:21,100 INFO [train.py:1087] (2/4) Epoch 90, batch 750, loss[loss=0.1506, simple_loss=0.2463, pruned_loss=0.02745, over 24854.00 frames. ], tot_loss[loss=0.1451, simple_loss=0.2387, pruned_loss=0.02571, over 4688018.75 frames. ], batch size: 68, lr: 2.72e-03, grad_scale: 32.0 2023-12-05 02:26:21,342 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=536033.3333333334, ans=0.0 2023-12-05 02:26:31,686 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=536100.0, ans=0.2 2023-12-05 02:26:36,927 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.19 vs. 
limit=15.0 2023-12-05 02:26:42,104 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=536100.0, ans=0.125 2023-12-05 02:26:48,965 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=536166.6666666666, ans=0.0 2023-12-05 02:26:53,937 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.120e+02 1.283e+02 1.414e+02 1.604e+02 3.255e+02, threshold=2.828e+02, percent-clipped=0.0 2023-12-05 02:27:04,212 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=536233.3333333334, ans=0.125 2023-12-05 02:27:07,615 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=536300.0, ans=0.0 2023-12-05 02:27:19,319 INFO [train.py:1087] (2/4) Epoch 90, batch 800, loss[loss=0.141, simple_loss=0.2384, pruned_loss=0.02174, over 24705.00 frames. ], tot_loss[loss=0.1447, simple_loss=0.2386, pruned_loss=0.0254, over 4727792.85 frames. ], batch size: 69, lr: 2.72e-03, grad_scale: 32.0 2023-12-05 02:27:31,014 INFO [scaling.py:1022] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.10 vs. limit=15.0 2023-12-05 02:27:33,997 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=536433.3333333334, ans=0.0 2023-12-05 02:27:35,950 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=536433.3333333334, ans=0.0 2023-12-05 02:28:02,495 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=536633.3333333334, ans=0.0 2023-12-05 02:28:03,612 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=536633.3333333334, ans=0.2 2023-12-05 02:28:14,286 INFO [train.py:1087] (2/4) Epoch 90, batch 850, loss[loss=0.1442, simple_loss=0.2409, pruned_loss=0.02379, over 24546.00 frames. ], tot_loss[loss=0.145, simple_loss=0.2388, pruned_loss=0.02555, over 4739192.51 frames. ], batch size: 62, lr: 2.72e-03, grad_scale: 32.0 2023-12-05 02:28:29,140 INFO [scaling.py:213] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=536766.6666666666, ans=0.125 2023-12-05 02:28:44,743 INFO [optim.py:468] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.119e+02 1.273e+02 1.345e+02 1.462e+02 2.000e+02, threshold=2.691e+02, percent-clipped=1.0 2023-12-05 02:29:01,504 INFO [train.py:1352] (2/4) Done!
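The run ends normally after epoch 90, batch 850 ("Done!"). Two reading notes and a small, purely illustrative parser follow. In the train.py:1087 entries, loss[...] appears to be the loss on the current batch, while tot_loss[...] is the average accumulated over the frames seen so far in the epoch (the "over N frames" count grows within an epoch and resets at batch 0). In the optim.py:468 entries, the reported threshold is consistent with Clipping_scale times the median of the grad-norm quartiles (e.g. 2.0 * 1.365e+02 = 2.730e+02, matching threshold=2.731e+02 in the first such entry of this excerpt, up to rounding). The Python sketch below extracts the "Epoch E, batch B, ... tot_loss[...]" entries so the training curve can be plotted; the regex follows the field layout shown in this log, and the helper name iter_tot_loss and the file name train-log.txt are illustrative assumptions, not part of icefall.

import re
from typing import Iterator, NamedTuple


class TotLoss(NamedTuple):
    """One 'Epoch E, batch B' training entry from a log in the format above."""
    epoch: int
    batch: int
    loss: float
    simple_loss: float
    pruned_loss: float
    lr: float
    grad_scale: float


# Written against the field layout visible in this log, e.g.
#   "Epoch 90, batch 850, loss[...], tot_loss[loss=0.145, simple_loss=0.2388,
#    pruned_loss=0.02555, over 4739192.51 frames. ], batch size: 62,
#    lr: 2.72e-03, grad_scale: 32.0"
# DOTALL plus \s* keep it working even if an entry was re-wrapped across lines.
TOT_LOSS_RE = re.compile(
    r"Epoch (?P<epoch>\d+), batch (?P<batch>\d+),.*?"
    r"tot_loss\[loss=(?P<loss>[\d.e+-]+), simple_loss=(?P<simple>[\d.e+-]+), "
    r"pruned_loss=(?P<pruned>[\d.e+-]+), over [\d.]+ frames\.\s*\],\s*"
    r"batch size: \d+, lr: (?P<lr>[\d.e+-]+), grad_scale: (?P<gs>[\d.]+)",
    re.DOTALL,
)


def iter_tot_loss(path: str) -> Iterator[TotLoss]:
    """Yield one TotLoss record per training entry found in the log file."""
    with open(path) as f:
        text = f.read()
    for m in TOT_LOSS_RE.finditer(text):
        yield TotLoss(
            epoch=int(m["epoch"]),
            batch=int(m["batch"]),
            loss=float(m["loss"]),
            simple_loss=float(m["simple"]),
            pruned_loss=float(m["pruned"]),
            lr=float(m["lr"]),
            grad_scale=float(m["gs"]),
        )


if __name__ == "__main__":
    # Example: print the running tot_loss per logged batch, assuming the log
    # was saved as train-log.txt (an assumed file name, not produced by icefall).
    for rec in iter_tot_loss("train-log.txt"):
        print(f"epoch {rec.epoch:3d} batch {rec.batch:5d} "
              f"tot_loss {rec.loss:.4f} lr {rec.lr:.2e}")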