2023-12-03 22:41:20,221 INFO [train.py:1155] (1/4) Training started 2023-12-03 22:41:20,221 INFO [train.py:1172] (1/4) Device: cuda:1 2023-12-03 22:41:20,226 INFO [train.py:1184] (1/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '2b2ac14b326d61d79d04e53fbd69b1ff6d630411', 'k2-git-date': 'Thu Aug 24 05:58:26 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.1', 'icefall-git-branch': 'zipformer_whisper_mvq', 'icefall-git-sha1': '0d26e9c4-dirty', 'icefall-git-date': 'Tue Oct 24 09:46:03 2023', 'icefall-path': '/star-xy/softwares/icefall_development/icefall_mvq', 'k2-path': '/star-xy/softwares/k2_development/k2/k2/python/k2/__init__.py', 'lhotse-path': '/star-xy/softwares/anaconda3/envs/multi_KD/lib/python3.10/site-packages/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-1-1220091118-57c4d55446-mvd6x', 'IP address': '10.177.22.19'}, 'world_size': 4, 'master_port': 18130, 'tensorboard': True, 'num_epochs': 90, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp_baseline_960h_no_sp_enable_musan0'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'codebook_loss_scale': 0.1, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'enable_distillation': False, 'num_codebooks': 16, 'distillation_layer': 4, 'full_libri': True, 'mini_libri': False, 'manifest_dir': PosixPath('data/fbank_with_whisper_embeddings'), 'max_duration': 1000, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': False, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'vocab_size': 500} 2023-12-03 22:41:20,227 INFO [train.py:1186] (1/4) About to create model 2023-12-03 22:41:20,844 INFO [train.py:1190] (1/4) Number of model parameters: 65549011 2023-12-03 22:41:23,420 INFO [train.py:1205] (1/4) Using DDP 2023-12-03 22:41:24,022 INFO [asr_datamodule.py:434] (1/4) About to get the shuffled train-clean-100, train-clean-360 and train-other-500 cuts 2023-12-03 22:41:24,152 INFO [asr_datamodule.py:239] (1/4) Disable MUSAN 2023-12-03 22:41:24,152 INFO [asr_datamodule.py:257] (1/4) Enable SpecAugment 2023-12-03 22:41:24,152 INFO [asr_datamodule.py:258] (1/4) Time warp factor: 80 
2023-12-03 22:41:24,153 INFO [asr_datamodule.py:268] (1/4) Num frame mask: 10 2023-12-03 22:41:24,153 INFO [asr_datamodule.py:281] (1/4) About to create train dataset 2023-12-03 22:41:24,153 INFO [asr_datamodule.py:308] (1/4) Using DynamicBucketingSampler. 2023-12-03 22:41:28,417 INFO [asr_datamodule.py:323] (1/4) About to create train dataloader 2023-12-03 22:41:28,418 INFO [asr_datamodule.py:451] (1/4) About to get dev-clean cuts 2023-12-03 22:41:28,420 INFO [asr_datamodule.py:458] (1/4) About to get dev-other cuts 2023-12-03 22:41:28,422 INFO [asr_datamodule.py:354] (1/4) About to create dev dataset 2023-12-03 22:41:28,678 INFO [asr_datamodule.py:371] (1/4) About to create dev dataloader 2023-12-03 22:41:41,220 INFO [train.py:1087] (1/4) Epoch 1, batch 0, loss[loss=7.585, simple_loss=6.903, pruned_loss=6.81, over 24783.00 frames. ], tot_loss[loss=7.585, simple_loss=6.903, pruned_loss=6.81, over 24783.00 frames. ], batch size: 71, lr: 2.25e-02, grad_scale: 1.0 2023-12-03 22:41:41,220 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-03 22:41:53,591 INFO [train.py:1119] (1/4) Epoch 1, validation: loss=7.57, simple_loss=6.893, pruned_loss=6.755, over 944034.00 frames. 2023-12-03 22:41:53,591 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 14431MB 2023-12-03 22:41:57,354 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=0.0, ans=0.2 2023-12-03 22:41:58,509 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=0.0, ans=0.2 2023-12-03 22:41:59,968 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=14.35 vs. limit=4.0 2023-12-03 22:42:03,389 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.50 vs. limit=4.0 2023-12-03 22:42:04,382 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=66.66666666666667, ans=0.0985 2023-12-03 22:42:11,725 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=66.66666666666667, ans=0.04979166666666667 2023-12-03 22:42:18,283 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=509.13 vs. limit=7.6 2023-12-03 22:42:21,823 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=77.30 vs. limit=4.053333333333334 2023-12-03 22:42:24,274 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=357.51 vs. limit=7.55 2023-12-03 22:42:36,095 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.41 vs. limit=3.03 2023-12-03 22:42:50,426 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.14 vs. limit=4.1066666666666665 2023-12-03 22:42:52,705 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=266.6666666666667, ans=0.19 2023-12-03 22:42:58,021 INFO [train.py:1087] (1/4) Epoch 1, batch 50, loss[loss=1.342, simple_loss=1.192, pruned_loss=1.341, over 24581.00 frames. 
], tot_loss[loss=3.169, simple_loss=2.909, pruned_loss=2.533, over 1095873.29 frames. ], batch size: 65, lr: 2.48e-02, grad_scale: 0.25 2023-12-03 22:43:10,080 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=272.26 vs. limit=7.65 2023-12-03 22:43:14,559 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=400.0, ans=0.45 2023-12-03 22:43:22,189 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=276.41 vs. limit=5.233333333333333 2023-12-03 22:43:26,655 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=122.31 vs. limit=7.85 2023-12-03 22:43:27,655 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=466.6666666666667, ans=0.0895 2023-12-03 22:43:30,633 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=466.6666666666667, ans=0.478125 2023-12-03 22:43:32,572 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=25.21 vs. limit=7.675 2023-12-03 22:43:40,622 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=533.3333333333334, ans=0.29466666666666663 2023-12-03 22:43:42,004 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=533.3333333333334, ans=0.29466666666666663 2023-12-03 22:43:51,357 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=199.87 vs. limit=7.725 2023-12-03 22:43:54,707 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=600.0, ans=0.425 2023-12-03 22:44:02,730 INFO [train.py:1087] (1/4) Epoch 1, batch 100, loss[loss=1.192, simple_loss=1.032, pruned_loss=1.275, over 24326.00 frames. ], tot_loss[loss=2.131, simple_loss=1.928, pruned_loss=1.864, over 1909669.47 frames. ], batch size: 79, lr: 2.70e-02, grad_scale: 0.5 2023-12-03 22:44:03,666 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=159.43 vs. limit=7.75 2023-12-03 22:44:06,922 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 9.552e+01 1.661e+02 5.889e+02 5.635e+03 1.802e+05, threshold=1.178e+03, percent-clipped=0.0 2023-12-03 22:44:18,532 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=733.3333333333334, ans=0.465625 2023-12-03 22:44:21,154 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=733.3333333333334, ans=0.04770833333333334 2023-12-03 22:44:25,502 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.99 vs. 
limit=4.293333333333333 2023-12-03 22:44:27,862 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=195.69 vs. limit=7.8 2023-12-03 22:44:29,459 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=150.25 vs. limit=7.8 2023-12-03 22:44:36,320 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=33.82 vs. limit=8.1 2023-12-03 22:44:36,767 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=12.33 vs. limit=4.32 2023-12-03 22:44:40,213 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=95.73 vs. limit=7.8 2023-12-03 22:44:44,472 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=13.78 vs. limit=4.346666666666667 2023-12-03 22:44:50,704 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=45.13 vs. limit=7.825 2023-12-03 22:44:53,170 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=14.65 vs. limit=4.346666666666667 2023-12-03 22:44:54,028 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=933.3333333333334, ans=0.45625 2023-12-03 22:44:54,487 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=53.39 vs. limit=7.85 2023-12-03 22:44:57,855 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=933.3333333333334, ans=0.04666666666666667 2023-12-03 22:45:07,807 INFO [train.py:1087] (1/4) Epoch 1, batch 150, loss[loss=1.055, simple_loss=0.8996, pruned_loss=1.129, over 24774.00 frames. ], tot_loss[loss=1.702, simple_loss=1.519, pruned_loss=1.58, over 2548600.30 frames. ], batch size: 71, lr: 2.93e-02, grad_scale: 0.5 2023-12-03 22:45:08,074 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=1000.0, ans=0.09375 2023-12-03 22:45:11,132 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=35.27 vs. limit=7.875 2023-12-03 22:45:13,921 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=1000.0, ans=0.1625 2023-12-03 22:45:15,505 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=30.32 vs. limit=7.875 2023-12-03 22:45:18,437 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.29 vs. 
limit=5.5 2023-12-03 22:45:19,057 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1000.0, ans=0.046875 2023-12-03 22:45:39,215 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=32.16 vs. limit=7.925 2023-12-03 22:45:54,002 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=24.08 vs. limit=5.6 2023-12-03 22:46:04,405 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=23.02 vs. limit=7.975 2023-12-03 22:46:04,805 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=64.68 vs. limit=7.975 2023-12-03 22:46:08,296 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=28.18 vs. limit=8.45 2023-12-03 22:46:09,575 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=157.76 vs. limit=7.975 2023-12-03 22:46:11,960 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1266.6666666666667, ans=0.035 2023-12-03 22:46:13,636 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=30.73 vs. limit=8.0 2023-12-03 22:46:14,021 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.10 vs. limit=8.5 2023-12-03 22:46:14,756 INFO [train.py:1087] (1/4) Epoch 1, batch 200, loss[loss=0.9535, simple_loss=0.812, pruned_loss=0.9574, over 23497.00 frames. ], tot_loss[loss=1.459, simple_loss=1.288, pruned_loss=1.394, over 3050653.93 frames. ], batch size: 94, lr: 3.15e-02, grad_scale: 1.0 2023-12-03 22:46:18,508 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 7.195e+01 8.879e+01 1.077e+02 1.337e+02 3.280e+02, threshold=2.153e+02, percent-clipped=0.0 2023-12-03 22:46:20,450 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=10.74 vs. limit=4.533333333333333 2023-12-03 22:46:23,250 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=22.97 vs. limit=8.0 2023-12-03 22:46:25,013 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1333.3333333333333, ans=0.2866666666666667 2023-12-03 22:46:26,112 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-03 22:46:34,047 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=4.5600000000000005 2023-12-03 22:46:35,503 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.04 vs. limit=5.35 2023-12-03 22:46:36,726 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=25.68 vs. 
limit=8.025 2023-12-03 22:46:41,831 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=1466.6666666666667, ans=0.43125 2023-12-03 22:46:45,634 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1466.6666666666667, ans=0.43125 2023-12-03 22:46:47,409 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=77.39 vs. limit=8.05 2023-12-03 22:46:48,662 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=31.74 vs. limit=8.6 2023-12-03 22:46:48,668 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=13.34 vs. limit=4.586666666666667 2023-12-03 22:46:59,737 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.56 vs. limit=8.65 2023-12-03 22:47:03,476 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=33.09 vs. limit=8.075 2023-12-03 22:47:03,695 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=65.67 vs. limit=8.075 2023-12-03 22:47:07,572 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.69 vs. limit=8.7 2023-12-03 22:47:11,232 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.48 vs. limit=8.7 2023-12-03 22:47:12,661 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=31.07 vs. limit=8.7 2023-12-03 22:47:16,765 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.23 vs. limit=5.4 2023-12-03 22:47:18,055 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=14.47 vs. limit=5.8 2023-12-03 22:47:18,165 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=27.00 vs. limit=8.1 2023-12-03 22:47:21,316 INFO [train.py:1087] (1/4) Epoch 1, batch 250, loss[loss=0.9731, simple_loss=0.8198, pruned_loss=0.9613, over 24690.00 frames. ], tot_loss[loss=1.311, simple_loss=1.146, pruned_loss=1.264, over 3415440.26 frames. ], batch size: 69, lr: 3.38e-02, grad_scale: 1.0 2023-12-03 22:47:23,411 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.29 vs. 
limit=8.75 2023-12-03 22:47:25,367 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1666.6666666666667, ans=0.7666666666666666 2023-12-03 22:47:26,589 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1666.6666666666667, ans=0.23333333333333334 2023-12-03 22:47:33,377 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.93 vs. limit=4.693333333333333 2023-12-03 22:47:40,629 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=10.59 vs. limit=5.433333333333334 2023-12-03 22:47:43,503 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=170.41 vs. limit=8.15 2023-12-03 22:47:48,314 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.43 vs. limit=8.85 2023-12-03 22:47:48,724 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=19.74 vs. limit=5.45 2023-12-03 22:47:53,679 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=3.27 2023-12-03 22:47:55,104 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=15.96 vs. limit=8.175 2023-12-03 22:48:01,042 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1866.6666666666667, ans=0.4125 2023-12-03 22:48:01,047 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1866.6666666666667, ans=0.26666666666666666 2023-12-03 22:48:03,769 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.72 vs. limit=8.9 2023-12-03 22:48:04,924 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1866.6666666666667, ans=0.4125 2023-12-03 22:48:05,298 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.20 vs. limit=8.2 2023-12-03 22:48:08,118 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=137.78 vs. limit=8.2 2023-12-03 22:48:10,772 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=11.49 vs. limit=4.746666666666667 2023-12-03 22:48:14,971 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=20.14 vs. limit=8.225 2023-12-03 22:48:16,420 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.31 vs. 
limit=8.95 2023-12-03 22:48:20,366 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=11.62 vs. limit=8.225 2023-12-03 22:48:21,687 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1933.3333333333333, ans=8.225 2023-12-03 22:48:26,891 INFO [train.py:1087] (1/4) Epoch 1, batch 300, loss[loss=0.9398, simple_loss=0.7952, pruned_loss=0.8683, over 16439.00 frames. ], tot_loss[loss=1.209, simple_loss=1.047, pruned_loss=1.169, over 3714181.80 frames. ], batch size: 177, lr: 3.60e-02, grad_scale: 2.0 2023-12-03 22:48:31,064 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 7.988e+01 1.210e+02 1.529e+02 2.202e+02 3.785e+02, threshold=3.059e+02, percent-clipped=26.0 2023-12-03 22:48:33,291 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=2000.0, ans=9.0 2023-12-03 22:48:39,163 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2066.6666666666665, ans=0.12250000000000001 2023-12-03 22:48:39,626 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=131.71 vs. limit=6.033333333333333 2023-12-03 22:48:51,849 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=60.08 vs. limit=8.275 2023-12-03 22:48:58,960 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2133.3333333333335, ans=0.12 2023-12-03 22:49:00,469 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.30 vs. limit=9.1 2023-12-03 22:49:03,349 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=8.3 2023-12-03 22:49:07,949 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2200.0, ans=0.8230000000000001 2023-12-03 22:49:17,649 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=65.21 vs. limit=8.325 2023-12-03 22:49:19,994 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=74.15 vs. limit=8.35 2023-12-03 22:49:23,304 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=19.59 vs. limit=8.35 2023-12-03 22:49:25,704 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=21.92 vs. limit=8.35 2023-12-03 22:49:33,358 INFO [train.py:1087] (1/4) Epoch 1, batch 350, loss[loss=0.9373, simple_loss=0.7774, pruned_loss=0.8842, over 23945.00 frames. ], tot_loss[loss=1.137, simple_loss=0.9759, pruned_loss=1.098, over 3965945.50 frames. ], batch size: 87, lr: 3.83e-02, grad_scale: 2.0 2023-12-03 22:49:36,315 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.45 vs. 
limit=9.25 2023-12-03 22:49:37,732 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.44 vs. limit=5.583333333333333 2023-12-03 22:49:42,584 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=9.10 vs. limit=6.166666666666667 2023-12-03 22:49:43,543 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2333.3333333333335, ans=0.08541666666666667 2023-12-03 22:49:45,401 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.87 vs. limit=5.6 2023-12-03 22:49:47,208 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2400.0, ans=0.11 2023-12-03 22:49:55,594 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=16.29 vs. limit=8.4 2023-12-03 22:50:02,749 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=8.425 2023-12-03 22:50:06,034 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2466.6666666666665, ans=0.8136666666666666 2023-12-03 22:50:16,394 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=36.86 vs. limit=9.4 2023-12-03 22:50:20,009 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=18.58 vs. limit=8.45 2023-12-03 22:50:21,053 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.28 vs. limit=8.45 2023-12-03 22:50:23,523 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=24.23 vs. limit=8.475 2023-12-03 22:50:25,218 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.59 vs. limit=8.475 2023-12-03 22:50:26,285 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=2600.0, ans=0.27399999999999997 2023-12-03 22:50:34,451 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=15.30 vs. limit=8.475 2023-12-03 22:50:35,637 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2600.0, ans=0.378125 2023-12-03 22:50:38,481 INFO [train.py:1087] (1/4) Epoch 1, batch 400, loss[loss=0.9104, simple_loss=0.7539, pruned_loss=0.8272, over 24798.00 frames. ], tot_loss[loss=1.085, simple_loss=0.9239, pruned_loss=1.038, over 4144127.18 frames. 
], batch size: 73, lr: 4.05e-02, grad_scale: 4.0 2023-12-03 22:50:42,216 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 9.675e+01 1.297e+02 1.639e+02 1.987e+02 4.198e+02, threshold=3.278e+02, percent-clipped=1.0 2023-12-03 22:50:46,839 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.91 vs. limit=8.5 2023-12-03 22:50:55,016 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2733.3333333333335, ans=0.0385 2023-12-03 22:50:55,244 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.39 vs. limit=9.55 2023-12-03 22:50:58,138 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=21.96 vs. limit=8.525 2023-12-03 22:51:04,728 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=67.08 vs. limit=8.55 2023-12-03 22:51:05,996 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=11.58 vs. limit=8.55 2023-12-03 22:51:11,674 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=2800.0, ans=8.55 2023-12-03 22:51:13,001 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=28.74 vs. limit=8.55 2023-12-03 22:51:14,153 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=50.49 vs. limit=9.6 2023-12-03 22:51:18,416 INFO [scaling.py:1022] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=8.44 vs. limit=5.0 2023-12-03 22:51:21,464 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2866.6666666666665, ans=0.365625 2023-12-03 22:51:24,193 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=21.73 vs. limit=8.575 2023-12-03 22:51:25,755 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=29.96 vs. limit=9.65 2023-12-03 22:51:30,698 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.01 vs. limit=9.7 2023-12-03 22:51:32,483 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2933.3333333333335, ans=0.3625 2023-12-03 22:51:42,003 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=7.41 vs. limit=5.2 2023-12-03 22:51:42,940 INFO [train.py:1087] (1/4) Epoch 1, batch 450, loss[loss=0.9217, simple_loss=0.7702, pruned_loss=0.7868, over 24801.00 frames. ], tot_loss[loss=1.044, simple_loss=0.884, pruned_loss=0.9826, over 4302007.68 frames. 
], batch size: 72, lr: 4.28e-02, grad_scale: 4.0 2023-12-03 22:51:51,735 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=3.45 2023-12-03 22:51:52,885 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=15.88 vs. limit=5.75 2023-12-03 22:52:15,341 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.03 vs. limit=9.85 2023-12-03 22:52:18,737 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=23.31 vs. limit=8.675 2023-12-03 22:52:20,239 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=3133.3333333333335, ans=8.675 2023-12-03 22:52:20,963 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=3200.0, ans=0.268 2023-12-03 22:52:24,554 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=3200.0, ans=0.35 2023-12-03 22:52:27,456 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=63.67 vs. limit=8.7 2023-12-03 22:52:39,965 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=9.82 vs. limit=8.725 2023-12-03 22:52:45,085 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.30 vs. limit=9.95 2023-12-03 22:52:46,850 INFO [train.py:1087] (1/4) Epoch 1, batch 500, loss[loss=0.8215, simple_loss=0.7013, pruned_loss=0.6393, over 24106.00 frames. ], tot_loss[loss=1.003, simple_loss=0.8484, pruned_loss=0.9179, over 4405632.08 frames. ], batch size: 82, lr: 4.49e-02, grad_scale: 8.0 2023-12-03 22:52:49,996 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=3333.3333333333335, ans=0.08333333333333331 2023-12-03 22:52:50,976 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.289e+02 2.115e+02 2.895e+02 4.210e+02 7.644e+02, threshold=5.790e+02, percent-clipped=45.0 2023-12-03 22:52:56,040 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.09 vs. limit=5.333333333333334 2023-12-03 22:53:01,885 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=3400.0, ans=0.07250000000000001 2023-12-03 22:53:03,294 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=3.266e+01 2023-12-03 22:53:07,217 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.59 vs. 
limit=10.05 2023-12-03 22:53:07,296 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten.whitening_limit, batch_count=3400.0, ans=10.05 2023-12-03 22:53:09,435 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=3400.0, ans=0.784 2023-12-03 22:53:15,667 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=3466.6666666666665, ans=0.06666666666666671 2023-12-03 22:53:23,789 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=16.94 vs. limit=8.8 2023-12-03 22:53:26,592 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=17.23 vs. limit=8.825 2023-12-03 22:53:32,724 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=5.682e+01 2023-12-03 22:53:32,964 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.08 vs. limit=8.825 2023-12-03 22:53:35,499 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=18.71 vs. limit=8.825 2023-12-03 22:53:39,084 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.65 vs. limit=8.85 2023-12-03 22:53:49,678 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=3666.6666666666665, ans=0.7866666666666666 2023-12-03 22:53:50,633 INFO [train.py:1087] (1/4) Epoch 1, batch 550, loss[loss=0.742, simple_loss=0.6459, pruned_loss=0.532, over 24558.00 frames. ], tot_loss[loss=0.9503, simple_loss=0.8067, pruned_loss=0.8394, over 4503308.35 frames. ], batch size: 63, lr: 4.49e-02, grad_scale: 8.0 2023-12-03 22:53:52,437 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=15.66 vs. limit=8.875 2023-12-03 22:53:56,080 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.15 vs. limit=8.875 2023-12-03 22:54:19,966 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=10.35 2023-12-03 22:54:37,450 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=10.4 2023-12-03 22:54:40,518 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=3933.3333333333335, ans=0.7623333333333333 2023-12-03 22:54:44,473 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=3933.3333333333335, ans=0.04949747468305833 2023-12-03 22:54:51,187 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.78 vs. 
limit=10.45 2023-12-03 22:54:53,067 INFO [train.py:1087] (1/4) Epoch 1, batch 600, loss[loss=0.6443, simple_loss=0.5745, pruned_loss=0.4212, over 24757.00 frames. ], tot_loss[loss=0.8902, simple_loss=0.7607, pruned_loss=0.7543, over 4574828.27 frames. ], batch size: 65, lr: 4.49e-02, grad_scale: 8.0 2023-12-03 22:54:56,649 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.444e+02 4.670e+02 6.467e+02 1.067e+03 1.868e+03, threshold=1.293e+03, percent-clipped=55.0 2023-12-03 22:55:04,336 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=10.45 vs. limit=10.55 2023-12-03 22:55:14,804 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=4066.6666666666665, ans=0.309375 2023-12-03 22:55:15,092 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.28 vs. limit=9.025 2023-12-03 22:55:16,275 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.15 vs. limit=5.626666666666667 2023-12-03 22:55:26,983 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=4133.333333333333, ans=3.62 2023-12-03 22:55:27,207 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=10.65 vs. limit=10.6 2023-12-03 22:55:28,151 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=4133.333333333333, ans=0.7553333333333334 2023-12-03 22:55:29,923 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=12.28 vs. limit=6.033333333333333 2023-12-03 22:55:39,101 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=4200.0, ans=0.303125 2023-12-03 22:55:58,046 INFO [train.py:1087] (1/4) Epoch 1, batch 650, loss[loss=0.5935, simple_loss=0.5358, pruned_loss=0.3688, over 24556.00 frames. ], tot_loss[loss=0.8325, simple_loss=0.7172, pruned_loss=0.6759, over 4616931.52 frames. ], batch size: 66, lr: 4.49e-02, grad_scale: 8.0 2023-12-03 22:56:01,368 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.01 vs. limit=9.125 2023-12-03 22:56:03,284 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=4333.333333333333, ans=0.296875 2023-12-03 22:56:28,706 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=4466.666666666667, ans=0.2553333333333333 2023-12-03 22:56:30,147 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.98 vs. 
limit=9.175 2023-12-03 22:56:37,037 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=4533.333333333333, ans=0.009884057971014493 2023-12-03 22:56:47,420 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=4600.0, ans=0.284375 2023-12-03 22:56:58,860 INFO [train.py:1087] (1/4) Epoch 1, batch 700, loss[loss=0.6149, simple_loss=0.5421, pruned_loss=0.4039, over 24001.00 frames. ], tot_loss[loss=0.7746, simple_loss=0.674, pruned_loss=0.6006, over 4659342.72 frames. ], batch size: 87, lr: 4.49e-02, grad_scale: 8.0 2023-12-03 22:57:02,388 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.204e+02 4.179e+02 6.821e+02 1.321e+03 5.191e+03, threshold=1.364e+03, percent-clipped=27.0 2023-12-03 22:57:06,600 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.39 vs. limit=11.0 2023-12-03 22:57:09,896 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=4733.333333333333, ans=0.25266666666666665 2023-12-03 22:57:11,271 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=4733.333333333333, ans=0.04694444444444445 2023-12-03 22:57:12,809 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.55 vs. limit=3.71 2023-12-03 22:57:18,237 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=4733.333333333333, ans=0.278125 2023-12-03 22:57:28,548 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.79 vs. limit=6.2 2023-12-03 22:57:36,630 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=4866.666666666667, ans=0.00981159420289855 2023-12-03 22:57:41,517 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.06 vs. limit=11.15 2023-12-03 22:57:48,686 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=4933.333333333333, ans=8.083333333333332 2023-12-03 22:57:59,224 INFO [train.py:1087] (1/4) Epoch 1, batch 750, loss[loss=0.5265, simple_loss=0.494, pruned_loss=0.286, over 24781.00 frames. ], tot_loss[loss=0.7256, simple_loss=0.6379, pruned_loss=0.5374, over 4669324.39 frames. 
], batch size: 70, lr: 4.49e-02, grad_scale: 8.0 2023-12-03 22:58:22,688 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=5133.333333333333, ans=0.259375 2023-12-03 22:58:31,173 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=5133.333333333333, ans=9.425 2023-12-03 22:58:32,047 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=5133.333333333333, ans=0.259375 2023-12-03 22:58:45,447 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=5200.0, ans=0.25625 2023-12-03 22:58:59,893 INFO [train.py:1087] (1/4) Epoch 1, batch 800, loss[loss=0.4861, simple_loss=0.4614, pruned_loss=0.2543, over 24845.00 frames. ], tot_loss[loss=0.6782, simple_loss=0.6034, pruned_loss=0.4788, over 4704165.69 frames. ], batch size: 68, lr: 4.49e-02, grad_scale: 16.0 2023-12-03 22:59:00,594 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.98 vs. limit=6.333333333333333 2023-12-03 22:59:03,288 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 4.513e+02 8.468e+02 1.309e+03 3.907e+03, threshold=1.694e+03, percent-clipped=23.0 2023-12-03 22:59:12,512 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=5400.0, ans=0.246 2023-12-03 22:59:12,585 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=5400.0, ans=0.246875 2023-12-03 22:59:19,884 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.42 vs. limit=6.35 2023-12-03 22:59:22,947 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=5466.666666666667, ans=0.009681159420289855 2023-12-03 22:59:32,353 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=5533.333333333333, ans=0.24062499999999998 2023-12-03 22:59:38,214 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.47 vs. limit=7.766666666666667 2023-12-03 22:59:54,877 INFO [train.py:1087] (1/4) Epoch 1, batch 850, loss[loss=0.4883, simple_loss=0.4685, pruned_loss=0.2473, over 24109.00 frames. ], tot_loss[loss=0.6368, simple_loss=0.5734, pruned_loss=0.429, over 4726099.21 frames. ], batch size: 87, lr: 4.49e-02, grad_scale: 16.0 2023-12-03 23:00:29,895 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.97 vs. limit=6.346666666666667 2023-12-03 23:00:30,032 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.92 vs. 
limit=6.346666666666667 2023-12-03 23:00:36,019 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=5866.666666666667, ans=0.22499999999999998 2023-12-03 23:00:40,337 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=5933.333333333333, ans=0.221875 2023-12-03 23:00:59,323 INFO [train.py:1087] (1/4) Epoch 2, batch 0, loss[loss=0.4602, simple_loss=0.4476, pruned_loss=0.2243, over 24555.00 frames. ], tot_loss[loss=0.4602, simple_loss=0.4476, pruned_loss=0.2243, over 24555.00 frames. ], batch size: 66, lr: 4.40e-02, grad_scale: 32.0 2023-12-03 23:00:59,324 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-03 23:01:11,639 INFO [train.py:1119] (1/4) Epoch 2, validation: loss=0.4023, simple_loss=0.4118, pruned_loss=0.1645, over 944034.00 frames. 2023-12-03 23:01:11,640 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-03 23:01:11,823 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=5966.666666666667, ans=0.03135416666666667 2023-12-03 23:01:18,557 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-03 23:01:21,659 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 3.803e+02 6.586e+02 1.143e+03 2.069e+03, threshold=1.317e+03, percent-clipped=6.0 2023-12-03 23:01:37,165 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=6100.0, ans=0.2140625 2023-12-03 23:01:37,289 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=12.075 2023-12-03 23:01:42,082 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=6100.0, ans=0.04125 2023-12-03 23:01:48,632 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=6166.666666666667, ans=0.2109375 2023-12-03 23:01:51,980 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=6166.666666666667, ans=0.6841666666666667 2023-12-03 23:02:09,421 INFO [train.py:1087] (1/4) Epoch 2, batch 50, loss[loss=0.4497, simple_loss=0.441, pruned_loss=0.2148, over 24568.00 frames. ], tot_loss[loss=0.4585, simple_loss=0.4454, pruned_loss=0.2248, over 1078615.20 frames. ], batch size: 64, lr: 4.40e-02, grad_scale: 16.0 2023-12-03 23:02:15,778 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=9.8625 2023-12-03 23:02:29,022 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-03 23:02:39,441 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.68 vs. 
limit=6.608333333333333 2023-12-03 23:02:41,336 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=6433.333333333333, ans=0.1984375 2023-12-03 23:02:47,340 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.30 vs. limit=12.375 2023-12-03 23:02:56,469 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=6566.666666666667, ans=0.1921875 2023-12-03 23:03:08,039 INFO [train.py:1087] (1/4) Epoch 2, batch 100, loss[loss=0.4224, simple_loss=0.4188, pruned_loss=0.1969, over 24725.00 frames. ], tot_loss[loss=0.4492, simple_loss=0.4394, pruned_loss=0.2165, over 1910496.99 frames. ], batch size: 67, lr: 4.39e-02, grad_scale: 8.0 2023-12-03 23:03:12,343 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=6633.333333333333, ans=0.009427536231884057 2023-12-03 23:03:19,423 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=6700.0, ans=0.18593749999999998 2023-12-03 23:03:19,441 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=6700.0, ans=0.6655 2023-12-03 23:03:20,232 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.839e+02 4.752e+02 7.477e+02 1.066e+03 3.809e+03, threshold=1.495e+03, percent-clipped=14.0 2023-12-03 23:03:40,815 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=6.30 vs. limit=6.706666666666667 2023-12-03 23:03:51,369 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=6833.333333333333, ans=0.23166666666666666 2023-12-03 23:03:57,435 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.23 vs. limit=12.675 2023-12-03 23:04:05,965 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=10.0875 2023-12-03 23:04:07,461 INFO [train.py:1087] (1/4) Epoch 2, batch 150, loss[loss=0.4131, simple_loss=0.4189, pruned_loss=0.1818, over 24845.00 frames. ], tot_loss[loss=0.4407, simple_loss=0.4339, pruned_loss=0.2093, over 2562610.09 frames. 
], batch size: 68, lr: 4.39e-02, grad_scale: 8.0 2023-12-03 23:04:15,466 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=6966.666666666667, ans=0.17343750000000002 2023-12-03 23:04:18,601 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=7033.333333333333, ans=0.22966666666666666 2023-12-03 23:04:21,085 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=7033.333333333333, ans=0.6538333333333334 2023-12-03 23:04:38,840 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=7100.0, ans=0.6515 2023-12-03 23:04:51,643 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=7166.666666666667, ans=0.1640625 2023-12-03 23:05:04,334 INFO [train.py:1087] (1/4) Epoch 2, batch 200, loss[loss=0.3921, simple_loss=0.3948, pruned_loss=0.1778, over 23476.00 frames. ], tot_loss[loss=0.4314, simple_loss=0.4278, pruned_loss=0.2017, over 3064796.77 frames. ], batch size: 94, lr: 4.39e-02, grad_scale: 8.0 2023-12-03 23:05:04,938 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.62 vs. limit=10.2375 2023-12-03 23:05:08,135 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=7300.0, ans=0.3095 2023-12-03 23:05:15,596 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.700e+02 4.144e+02 5.908e+02 9.300e+02 2.212e+03, threshold=1.182e+03, percent-clipped=6.0 2023-12-03 23:05:59,794 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.07 vs. limit=8.783333333333333 2023-12-03 23:05:59,952 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=10.3375 2023-12-03 23:06:01,447 INFO [train.py:1087] (1/4) Epoch 2, batch 250, loss[loss=0.3974, simple_loss=0.4018, pruned_loss=0.1798, over 24560.00 frames. ], tot_loss[loss=0.4249, simple_loss=0.4238, pruned_loss=0.1966, over 3451402.77 frames. ], batch size: 63, lr: 4.39e-02, grad_scale: 8.0 2023-12-03 23:06:06,021 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=7633.333333333333, ans=0.14218750000000002 2023-12-03 23:06:12,742 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=7700.0, ans=0.6305000000000001 2023-12-03 23:06:22,011 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.37 vs. limit=13.275 2023-12-03 23:06:37,602 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=7833.333333333333, ans=0.03402777777777778 2023-12-03 23:06:43,059 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=7833.333333333333, ans=0.1328125 2023-12-03 23:06:56,219 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.84 vs. 
limit=7.16 2023-12-03 23:06:57,915 INFO [train.py:1087] (1/4) Epoch 2, batch 300, loss[loss=0.3864, simple_loss=0.3978, pruned_loss=0.1685, over 24814.00 frames. ], tot_loss[loss=0.418, simple_loss=0.4195, pruned_loss=0.1913, over 3754624.74 frames. ], batch size: 72, lr: 4.38e-02, grad_scale: 8.0 2023-12-03 23:06:58,199 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=7966.666666666667, ans=0.22033333333333333 2023-12-03 23:07:07,579 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.18 vs. limit=6.991666666666667 2023-12-03 23:07:09,115 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.824e+02 2.778e+02 5.285e+02 1.000e+03 3.398e+03, threshold=1.057e+03, percent-clipped=16.0 2023-12-03 23:07:12,578 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=8033.333333333333, ans=0.125 2023-12-03 23:07:19,287 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.27 vs. limit=10.5375 2023-12-03 23:07:25,915 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.84 vs. limit=13.575 2023-12-03 23:07:27,959 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.76 vs. limit=13.575 2023-12-03 23:07:40,688 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=8166.666666666667, ans=0.6141666666666667 2023-12-03 23:07:53,745 INFO [train.py:1087] (1/4) Epoch 2, batch 350, loss[loss=0.3745, simple_loss=0.3924, pruned_loss=0.1579, over 23312.00 frames. ], tot_loss[loss=0.4107, simple_loss=0.4151, pruned_loss=0.1858, over 3988481.73 frames. ], batch size: 56, lr: 4.38e-02, grad_scale: 8.0 2023-12-03 23:08:03,352 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=8300.0, ans=0.03208333333333334 2023-12-03 23:08:36,891 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=10.6875 2023-12-03 23:08:49,259 INFO [train.py:1087] (1/4) Epoch 2, batch 400, loss[loss=0.3639, simple_loss=0.3825, pruned_loss=0.1541, over 24767.00 frames. ], tot_loss[loss=0.4035, simple_loss=0.4107, pruned_loss=0.1805, over 4168455.16 frames. ], batch size: 61, lr: 4.38e-02, grad_scale: 16.0 2023-12-03 23:08:57,879 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=8633.333333333334, ans=0.0 2023-12-03 23:09:00,211 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 4.190e+02 6.859e+02 9.810e+02 1.648e+03, threshold=1.372e+03, percent-clipped=19.0 2023-12-03 23:09:02,130 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.06 vs. limit=14.025 2023-12-03 23:09:34,301 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.11 vs. 
limit=10.8375 2023-12-03 23:09:36,145 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=8900.0, ans=0.125 2023-12-03 23:09:43,717 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=8966.666666666666, ans=0.125 2023-12-03 23:09:44,535 INFO [train.py:1087] (1/4) Epoch 2, batch 450, loss[loss=0.3402, simple_loss=0.367, pruned_loss=0.1371, over 24848.00 frames. ], tot_loss[loss=0.3973, simple_loss=0.407, pruned_loss=0.1762, over 4298586.74 frames. ], batch size: 68, lr: 4.38e-02, grad_scale: 16.0 2023-12-03 23:10:06,616 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=9100.0, ans=0.125 2023-12-03 23:10:08,622 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=9100.0, ans=0.125 2023-12-03 23:10:08,753 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=9100.0, ans=0.025 2023-12-03 23:10:25,122 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=9166.666666666666, ans=0.07 2023-12-03 23:10:26,517 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.55 vs. limit=10.9375 2023-12-03 23:10:35,535 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=9233.333333333334, ans=0.125 2023-12-03 23:10:40,252 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=9300.0, ans=0.008847826086956521 2023-12-03 23:10:41,087 INFO [train.py:1087] (1/4) Epoch 2, batch 500, loss[loss=0.3543, simple_loss=0.3814, pruned_loss=0.1455, over 24754.00 frames. ], tot_loss[loss=0.3904, simple_loss=0.4029, pruned_loss=0.1716, over 4420182.18 frames. ], batch size: 65, lr: 4.38e-02, grad_scale: 16.0 2023-12-03 23:10:43,447 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=9300.0, ans=0.02791666666666667 2023-12-03 23:10:48,490 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.13 vs. 
limit=14.475 2023-12-03 23:10:51,512 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=9366.666666666666, ans=0.125 2023-12-03 23:10:52,175 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 3.311e+02 5.496e+02 8.640e+02 1.749e+03, threshold=1.099e+03, percent-clipped=4.0 2023-12-03 23:10:56,758 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=9366.666666666666, ans=0.027638888888888893 2023-12-03 23:11:07,201 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=9433.333333333334, ans=0.125 2023-12-03 23:11:08,191 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=9433.333333333334, ans=0.20566666666666666 2023-12-03 23:11:11,867 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=9433.333333333334, ans=0.5698333333333334 2023-12-03 23:11:29,099 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=9566.666666666666, ans=0.125 2023-12-03 23:11:33,465 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=9566.666666666666, ans=0.125 2023-12-03 23:11:36,805 INFO [train.py:1087] (1/4) Epoch 2, batch 550, loss[loss=0.3946, simple_loss=0.4127, pruned_loss=0.1747, over 22517.00 frames. ], tot_loss[loss=0.3841, simple_loss=0.3994, pruned_loss=0.1673, over 4495558.12 frames. ], batch size: 54, lr: 4.37e-02, grad_scale: 16.0 2023-12-03 23:11:41,623 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=9633.333333333334, ans=0.02652777777777778 2023-12-03 23:11:46,314 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=9633.333333333334, ans=0.008775362318840579 2023-12-03 23:11:57,089 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9700.0, ans=0.203 2023-12-03 23:12:10,077 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=9833.333333333334, ans=0.125 2023-12-03 23:12:32,796 INFO [train.py:1087] (1/4) Epoch 2, batch 600, loss[loss=0.3472, simple_loss=0.3802, pruned_loss=0.1412, over 24567.00 frames. ], tot_loss[loss=0.3777, simple_loss=0.3959, pruned_loss=0.1631, over 4573205.20 frames. ], batch size: 64, lr: 4.37e-02, grad_scale: 16.0 2023-12-03 23:12:36,173 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=9966.666666666666, ans=0.20033333333333334 2023-12-03 23:12:40,684 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=9966.666666666666, ans=0.025138888888888895 2023-12-03 23:12:44,643 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 3.326e+02 4.968e+02 8.215e+02 1.747e+03, threshold=9.937e+02, percent-clipped=13.0 2023-12-03 23:12:45,548 INFO [scaling.py:1022] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.61 vs. 
limit=6.006666666666667 2023-12-03 23:12:48,160 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=10033.333333333334, ans=10.0 2023-12-03 23:13:01,526 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=10100.0, ans=0.199 2023-12-03 23:13:14,213 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=10166.666666666666, ans=0.125 2023-12-03 23:13:18,624 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=10233.333333333334, ans=0.125 2023-12-03 23:13:25,346 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=10233.333333333334, ans=0.5418333333333334 2023-12-03 23:13:29,361 INFO [train.py:1087] (1/4) Epoch 2, batch 650, loss[loss=0.3625, simple_loss=0.3906, pruned_loss=0.1544, over 24801.00 frames. ], tot_loss[loss=0.3709, simple_loss=0.3922, pruned_loss=0.1587, over 4635248.05 frames. ], batch size: 62, lr: 4.37e-02, grad_scale: 16.0 2023-12-03 23:13:39,885 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=10366.666666666666, ans=0.125 2023-12-03 23:13:57,574 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=10433.333333333334, ans=0.125 2023-12-03 23:14:10,645 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=10500.0, ans=0.125 2023-12-03 23:14:24,182 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=11.4875 2023-12-03 23:14:24,553 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.78 vs. limit=4.595 2023-12-03 23:14:24,815 INFO [train.py:1087] (1/4) Epoch 2, batch 700, loss[loss=0.3498, simple_loss=0.3815, pruned_loss=0.1474, over 23994.00 frames. ], tot_loss[loss=0.3667, simple_loss=0.3904, pruned_loss=0.1562, over 4660943.73 frames. ], batch size: 87, lr: 4.36e-02, grad_scale: 16.0 2023-12-03 23:14:29,857 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=10633.333333333334, ans=0.5278333333333334 2023-12-03 23:14:36,331 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 3.790e+02 5.279e+02 7.116e+02 1.447e+03, threshold=1.056e+03, percent-clipped=14.0 2023-12-03 23:15:00,590 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.44 vs. limit=4.625 2023-12-03 23:15:02,666 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-03 23:15:07,019 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=10833.333333333334, ans=0.021527777777777774 2023-12-03 23:15:21,822 INFO [train.py:1087] (1/4) Epoch 2, batch 750, loss[loss=0.3325, simple_loss=0.3714, pruned_loss=0.1356, over 24725.00 frames. ], tot_loss[loss=0.3608, simple_loss=0.3875, pruned_loss=0.1526, over 4696990.46 frames. 
], batch size: 67, lr: 4.36e-02, grad_scale: 16.0 2023-12-03 23:15:22,346 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.64 vs. limit=11.6125 2023-12-03 23:16:08,775 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=11233.333333333334, ans=0.125 2023-12-03 23:16:17,495 INFO [scaling.py:1022] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.56 vs. limit=6.26 2023-12-03 23:16:17,673 INFO [train.py:1087] (1/4) Epoch 2, batch 800, loss[loss=0.3021, simple_loss=0.3514, pruned_loss=0.1155, over 24741.00 frames. ], tot_loss[loss=0.3549, simple_loss=0.3845, pruned_loss=0.149, over 4722881.92 frames. ], batch size: 63, lr: 4.36e-02, grad_scale: 32.0 2023-12-03 23:16:17,911 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=11300.0, ans=0.5045000000000001 2023-12-03 23:16:24,229 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=11300.0, ans=0.00841304347826087 2023-12-03 23:16:29,210 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.842e+02 3.331e+02 5.421e+02 7.994e+02 1.422e+03, threshold=1.084e+03, percent-clipped=13.0 2023-12-03 23:16:35,541 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=11366.666666666666, ans=0.01930555555555556 2023-12-03 23:16:42,814 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=11433.333333333334, ans=0.019027777777777775 2023-12-03 23:16:51,985 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=11500.0, ans=0.185 2023-12-03 23:17:09,950 INFO [train.py:1087] (1/4) Epoch 2, batch 850, loss[loss=0.3401, simple_loss=0.3796, pruned_loss=0.1424, over 24159.00 frames. ], tot_loss[loss=0.3495, simple_loss=0.3821, pruned_loss=0.1459, over 4755501.55 frames. ], batch size: 58, lr: 4.35e-02, grad_scale: 32.0 2023-12-03 23:17:10,463 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.63 vs. limit=16.225 2023-12-03 23:17:33,171 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=11766.666666666666, ans=0.01763888888888889 2023-12-03 23:17:45,452 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=11833.333333333334, ans=0.18166666666666664 2023-12-03 23:17:47,480 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=11833.333333333334, ans=0.48583333333333334 2023-12-03 23:18:03,453 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=11933.333333333334, ans=0.016944444444444443 2023-12-03 23:18:12,791 INFO [train.py:1087] (1/4) Epoch 3, batch 0, loss[loss=0.2955, simple_loss=0.3479, pruned_loss=0.114, over 24799.00 frames. ], tot_loss[loss=0.2955, simple_loss=0.3479, pruned_loss=0.114, over 24799.00 frames. 
], batch size: 62, lr: 4.14e-02, grad_scale: 32.0 2023-12-03 23:18:12,791 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-03 23:18:25,043 INFO [train.py:1119] (1/4) Epoch 3, validation: loss=0.2657, simple_loss=0.3425, pruned_loss=0.08453, over 944034.00 frames. 2023-12-03 23:18:25,045 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-03 23:18:30,676 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=11933.333333333334, ans=0.48233333333333334 2023-12-03 23:18:34,411 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.04 vs. limit=4.79 2023-12-03 23:18:42,340 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 3.023e+02 4.473e+02 6.097e+02 1.021e+03, threshold=8.947e+02, percent-clipped=0.0 2023-12-03 23:18:43,664 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=12000.0, ans=0.01666666666666667 2023-12-03 23:18:44,747 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-03 23:18:47,129 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.59 vs. limit=16.55 2023-12-03 23:18:51,025 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=12066.666666666666, ans=0.125 2023-12-03 23:18:55,656 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.59 vs. limit=16.55 2023-12-03 23:18:56,639 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.31 vs. limit=12.025 2023-12-03 23:19:06,844 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=12133.333333333334, ans=0.125 2023-12-03 23:19:12,844 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=12200.0, ans=0.125 2023-12-03 23:19:19,893 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.10 vs. limit=12.075 2023-12-03 23:19:21,247 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=12266.666666666666, ans=0.125 2023-12-03 23:19:22,056 INFO [train.py:1087] (1/4) Epoch 3, batch 50, loss[loss=0.3344, simple_loss=0.3829, pruned_loss=0.1376, over 24035.00 frames. ], tot_loss[loss=0.3213, simple_loss=0.3694, pruned_loss=0.1305, over 1094117.56 frames. ], batch size: 87, lr: 4.13e-02, grad_scale: 32.0 2023-12-03 23:19:39,791 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=12333.333333333334, ans=0.125 2023-12-03 23:19:40,306 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.73 vs. 
limit=16.75 2023-12-03 23:19:41,969 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=12333.333333333334, ans=0.17666666666666667 2023-12-03 23:19:44,491 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=12400.0, ans=0.466 2023-12-03 23:19:46,497 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.90 vs. limit=16.8 2023-12-03 23:19:58,677 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-03 23:20:14,292 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=12.2 2023-12-03 23:20:18,437 INFO [train.py:1087] (1/4) Epoch 3, batch 100, loss[loss=0.2911, simple_loss=0.3509, pruned_loss=0.1115, over 24763.00 frames. ], tot_loss[loss=0.3195, simple_loss=0.3686, pruned_loss=0.13, over 1908311.78 frames. ], batch size: 70, lr: 4.13e-02, grad_scale: 32.0 2023-12-03 23:20:34,772 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=12666.666666666666, ans=0.17333333333333334 2023-12-03 23:20:35,463 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.822e+02 2.710e+02 4.012e+02 6.179e+02 1.254e+03, threshold=8.023e+02, percent-clipped=3.0 2023-12-03 23:20:39,048 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=12666.666666666666, ans=0.125 2023-12-03 23:20:50,214 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=12733.333333333334, ans=0.013611111111111109 2023-12-03 23:21:13,998 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.90 vs. limit=4.9399999999999995 2023-12-03 23:21:14,486 INFO [train.py:1087] (1/4) Epoch 3, batch 150, loss[loss=0.2946, simple_loss=0.354, pruned_loss=0.1154, over 24710.00 frames. ], tot_loss[loss=0.3149, simple_loss=0.3665, pruned_loss=0.1274, over 2564307.69 frames. ], batch size: 69, lr: 4.13e-02, grad_scale: 32.0 2023-12-03 23:21:27,789 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=13000.0, ans=0.125 2023-12-03 23:21:35,988 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.79 vs. limit=12.375 2023-12-03 23:21:56,561 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=13133.333333333334, ans=0.125 2023-12-03 23:22:11,022 INFO [train.py:1087] (1/4) Epoch 3, batch 200, loss[loss=0.3282, simple_loss=0.3809, pruned_loss=0.1375, over 23944.00 frames. ], tot_loss[loss=0.312, simple_loss=0.3654, pruned_loss=0.1261, over 3074081.18 frames. ], batch size: 87, lr: 4.12e-02, grad_scale: 16.0 2023-12-03 23:22:21,304 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.62 vs. 
limit=17.5 2023-12-03 23:22:22,317 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=13333.333333333334, ans=0.125 2023-12-03 23:22:29,351 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.597e+02 2.686e+02 4.511e+02 7.045e+02 2.264e+03, threshold=9.022e+02, percent-clipped=20.0 2023-12-03 23:22:50,918 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=13466.666666666666, ans=0.010555555555555561 2023-12-03 23:23:01,434 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=13533.333333333334, ans=0.42633333333333334 2023-12-03 23:23:07,714 INFO [train.py:1087] (1/4) Epoch 3, batch 250, loss[loss=0.3056, simple_loss=0.3701, pruned_loss=0.1206, over 24576.00 frames. ], tot_loss[loss=0.3114, simple_loss=0.3657, pruned_loss=0.1264, over 3452264.46 frames. ], batch size: 65, lr: 4.12e-02, grad_scale: 16.0 2023-12-03 23:23:23,768 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.77 vs. limit=17.75 2023-12-03 23:23:40,340 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=13800.0, ans=0.00916666666666667 2023-12-03 23:23:44,679 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=13800.0, ans=0.00916666666666667 2023-12-03 23:24:03,783 INFO [train.py:1087] (1/4) Epoch 3, batch 300, loss[loss=0.3042, simple_loss=0.3719, pruned_loss=0.1182, over 24564.00 frames. ], tot_loss[loss=0.3079, simple_loss=0.364, pruned_loss=0.1243, over 3769094.21 frames. ], batch size: 62, lr: 4.12e-02, grad_scale: 16.0 2023-12-03 23:24:04,130 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=13933.333333333334, ans=0.125 2023-12-03 23:24:06,399 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=13933.333333333334, ans=0.0 2023-12-03 23:24:11,075 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.99 vs. 
limit=12.725 2023-12-03 23:24:12,964 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=13933.333333333334, ans=0.035 2023-12-03 23:24:21,874 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.699e+02 3.552e+02 4.986e+02 1.435e+03, threshold=7.104e+02, percent-clipped=8.0 2023-12-03 23:24:22,138 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=14000.0, ans=0.00782608695652174 2023-12-03 23:24:23,156 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=14000.0, ans=0.125 2023-12-03 23:24:23,209 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-03 23:24:23,328 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=14000.0, ans=0.125 2023-12-03 23:24:32,497 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=14066.666666666666, ans=0.125 2023-12-03 23:24:41,517 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=14133.333333333334, ans=0.15866666666666665 2023-12-03 23:25:00,009 INFO [train.py:1087] (1/4) Epoch 3, batch 350, loss[loss=0.3169, simple_loss=0.3719, pruned_loss=0.131, over 24573.00 frames. ], tot_loss[loss=0.3062, simple_loss=0.3633, pruned_loss=0.1235, over 3994341.87 frames. ], batch size: 64, lr: 4.11e-02, grad_scale: 16.0 2023-12-03 23:25:19,785 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=14333.333333333334, ans=0.125 2023-12-03 23:25:25,001 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=14400.0, ans=0.006666666666666668 2023-12-03 23:25:37,575 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.39 vs. limit=18.35 2023-12-03 23:25:47,324 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=14533.333333333334, ans=0.3913333333333333 2023-12-03 23:25:54,241 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.64 vs. limit=18.4 2023-12-03 23:25:56,776 INFO [train.py:1087] (1/4) Epoch 3, batch 400, loss[loss=0.2875, simple_loss=0.3572, pruned_loss=0.1089, over 24862.00 frames. ], tot_loss[loss=0.3056, simple_loss=0.363, pruned_loss=0.1232, over 4157621.54 frames. 
], batch size: 68, lr: 4.11e-02, grad_scale: 32.0 2023-12-03 23:25:59,162 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=14600.0, ans=0.125 2023-12-03 23:26:01,856 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=14600.0, ans=0.125 2023-12-03 23:26:09,418 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=14666.666666666666, ans=0.09899494936611666 2023-12-03 23:26:16,109 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 2.839e+02 4.294e+02 6.758e+02 1.838e+03, threshold=8.588e+02, percent-clipped=21.0 2023-12-03 23:26:32,606 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=14800.0, ans=0.15200000000000002 2023-12-03 23:26:38,752 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=14800.0, ans=0.125 2023-12-03 23:26:44,369 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=14866.666666666666, ans=0.125 2023-12-03 23:26:53,945 INFO [train.py:1087] (1/4) Epoch 3, batch 450, loss[loss=0.3148, simple_loss=0.3679, pruned_loss=0.1308, over 24776.00 frames. ], tot_loss[loss=0.3046, simple_loss=0.3628, pruned_loss=0.1226, over 4300209.61 frames. ], batch size: 64, lr: 4.10e-02, grad_scale: 32.0 2023-12-03 23:27:22,718 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=15066.666666666666, ans=0.05 2023-12-03 23:27:32,265 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=15133.333333333334, ans=0.125 2023-12-03 23:27:40,058 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=15200.0, ans=0.0 2023-12-03 23:27:48,296 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=15200.0, ans=0.125 2023-12-03 23:27:50,286 INFO [train.py:1087] (1/4) Epoch 3, batch 500, loss[loss=0.3103, simple_loss=0.3679, pruned_loss=0.1263, over 24050.00 frames. ], tot_loss[loss=0.3017, simple_loss=0.361, pruned_loss=0.1207, over 4406831.53 frames. ], batch size: 87, lr: 4.10e-02, grad_scale: 16.0 2023-12-03 23:28:08,698 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.666e+02 2.776e+02 4.021e+02 5.372e+02 1.015e+03, threshold=8.041e+02, percent-clipped=4.0 2023-12-03 23:28:11,054 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=15400.0, ans=0.05755000000000002 2023-12-03 23:28:19,431 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.12 vs. limit=13.275 2023-12-03 23:28:24,806 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.29 vs. limit=8.866666666666667 2023-12-03 23:28:26,977 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.07 vs. 
limit=13.3 2023-12-03 23:28:47,160 INFO [train.py:1087] (1/4) Epoch 3, batch 550, loss[loss=0.3975, simple_loss=0.4223, pruned_loss=0.1864, over 17271.00 frames. ], tot_loss[loss=0.3009, simple_loss=0.3608, pruned_loss=0.1202, over 4484452.43 frames. ], batch size: 178, lr: 4.10e-02, grad_scale: 16.0 2023-12-03 23:29:21,094 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=15800.0, ans=0.125 2023-12-03 23:29:38,527 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=15866.666666666666, ans=0.0005555555555555522 2023-12-03 23:29:42,462 INFO [train.py:1087] (1/4) Epoch 3, batch 600, loss[loss=0.3318, simple_loss=0.3806, pruned_loss=0.1415, over 22713.00 frames. ], tot_loss[loss=0.2993, simple_loss=0.3599, pruned_loss=0.1191, over 4553089.85 frames. ], batch size: 106, lr: 4.09e-02, grad_scale: 16.0 2023-12-03 23:29:49,635 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.86 vs. limit=10.373333333333335 2023-12-03 23:29:59,789 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=16000.0, ans=0.125 2023-12-03 23:30:02,680 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.514e+02 3.492e+02 5.338e+02 1.061e+03, threshold=6.983e+02, percent-clipped=7.0 2023-12-03 23:30:06,110 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=16066.666666666666, ans=0.125 2023-12-03 23:30:07,336 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=16066.666666666666, ans=0.0 2023-12-03 23:30:16,276 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.73 vs. limit=9.033333333333333 2023-12-03 23:30:39,889 INFO [train.py:1087] (1/4) Epoch 3, batch 650, loss[loss=0.2803, simple_loss=0.347, pruned_loss=0.1068, over 21697.00 frames. ], tot_loss[loss=0.298, simple_loss=0.359, pruned_loss=0.1183, over 4598054.84 frames. ], batch size: 52, lr: 4.09e-02, grad_scale: 16.0 2023-12-03 23:30:51,608 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=16333.333333333334, ans=0.0 2023-12-03 23:31:05,313 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=16400.0, ans=0.125 2023-12-03 23:31:10,084 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.91 vs. limit=13.65 2023-12-03 23:31:18,305 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=16466.666666666668, ans=0.125 2023-12-03 23:31:18,602 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=16466.666666666668, ans=13.675 2023-12-03 23:31:22,272 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=16466.666666666668, ans=0.125 2023-12-03 23:31:30,701 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=6.55 vs. 
limit=10.613333333333333 2023-12-03 23:31:36,835 INFO [train.py:1087] (1/4) Epoch 3, batch 700, loss[loss=0.2986, simple_loss=0.365, pruned_loss=0.1161, over 24722.00 frames. ], tot_loss[loss=0.2948, simple_loss=0.3569, pruned_loss=0.1162, over 4655093.23 frames. ], batch size: 67, lr: 4.08e-02, grad_scale: 16.0 2023-12-03 23:31:44,577 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=16600.0, ans=0.07 2023-12-03 23:31:53,765 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.77 vs. limit=20.0 2023-12-03 23:31:55,287 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.843e+02 2.863e+02 3.533e+02 4.418e+02 1.002e+03, threshold=7.066e+02, percent-clipped=8.0 2023-12-03 23:31:59,283 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=16733.333333333332, ans=0.0 2023-12-03 23:32:15,693 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=16800.0, ans=0.31200000000000006 2023-12-03 23:32:21,109 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=16866.666666666668, ans=0.125 2023-12-03 23:32:32,787 INFO [train.py:1087] (1/4) Epoch 3, batch 750, loss[loss=0.2832, simple_loss=0.347, pruned_loss=0.1097, over 24557.00 frames. ], tot_loss[loss=0.2939, simple_loss=0.3562, pruned_loss=0.1157, over 4677276.09 frames. ], batch size: 62, lr: 4.08e-02, grad_scale: 16.0 2023-12-03 23:32:39,824 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=16933.333333333332, ans=0.13066666666666668 2023-12-03 23:32:50,558 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=17000.0, ans=0.125 2023-12-03 23:32:50,579 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=17000.0, ans=0.125 2023-12-03 23:32:53,842 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=17066.666666666668, ans=0.12933333333333333 2023-12-03 23:32:56,522 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=17066.666666666668, ans=0.12933333333333333 2023-12-03 23:33:01,137 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=17066.666666666668, ans=0.30266666666666675 2023-12-03 23:33:19,123 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.62 vs. limit=13.95 2023-12-03 23:33:23,036 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=17200.0, ans=0.125 2023-12-03 23:33:28,569 INFO [train.py:1087] (1/4) Epoch 3, batch 800, loss[loss=0.2715, simple_loss=0.3382, pruned_loss=0.1024, over 24724.00 frames. ], tot_loss[loss=0.2918, simple_loss=0.3548, pruned_loss=0.1143, over 4704929.69 frames. 
], batch size: 67, lr: 4.08e-02, grad_scale: 32.0 2023-12-03 23:33:35,785 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.52 vs. limit=13.975000000000001 2023-12-03 23:33:40,928 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=17333.333333333332, ans=0.0 2023-12-03 23:33:46,739 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.831e+02 2.911e+02 4.063e+02 5.816e+02 1.939e+03, threshold=8.126e+02, percent-clipped=11.0 2023-12-03 23:33:58,946 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=17466.666666666668, ans=0.025 2023-12-03 23:34:03,398 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.72 vs. limit=9.366666666666667 2023-12-03 23:34:08,382 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=17466.666666666668, ans=0.125 2023-12-03 23:34:20,184 INFO [train.py:1087] (1/4) Epoch 3, batch 850, loss[loss=0.2574, simple_loss=0.3344, pruned_loss=0.09017, over 24803.00 frames. ], tot_loss[loss=0.2905, simple_loss=0.354, pruned_loss=0.1134, over 4729975.85 frames. ], batch size: 72, lr: 4.07e-02, grad_scale: 32.0 2023-12-03 23:34:29,520 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=7.02 vs. limit=11.066666666666666 2023-12-03 23:34:33,272 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=17666.666666666668, ans=0.125 2023-12-03 23:35:22,572 INFO [train.py:1087] (1/4) Epoch 4, batch 0, loss[loss=0.2626, simple_loss=0.3357, pruned_loss=0.09478, over 24544.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3357, pruned_loss=0.09478, over 24544.00 frames. ], batch size: 66, lr: 3.80e-02, grad_scale: 32.0 2023-12-03 23:35:22,573 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-03 23:35:34,669 INFO [train.py:1119] (1/4) Epoch 4, validation: loss=0.2286, simple_loss=0.324, pruned_loss=0.06665, over 944034.00 frames. 2023-12-03 23:35:34,670 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-03 23:35:56,707 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.18 vs. 
limit=21.025 2023-12-03 23:35:59,282 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.702e+02 2.897e+02 3.931e+02 5.307e+02 8.427e+02, threshold=7.861e+02, percent-clipped=1.0 2023-12-03 23:36:05,889 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=18033.333333333332, ans=0.125 2023-12-03 23:36:08,084 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=18100.0, ans=0.125 2023-12-03 23:36:12,296 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=18100.0, ans=0.125 2023-12-03 23:36:14,511 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=18100.0, ans=0.125 2023-12-03 23:36:24,111 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=18166.666666666668, ans=0.125 2023-12-03 23:36:28,459 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=18166.666666666668, ans=0.125 2023-12-03 23:36:29,841 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.67 vs. limit=21.175 2023-12-03 23:36:30,356 INFO [train.py:1087] (1/4) Epoch 4, batch 50, loss[loss=0.2567, simple_loss=0.327, pruned_loss=0.09324, over 24751.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3442, pruned_loss=0.1048, over 1091763.38 frames. ], batch size: 66, lr: 3.80e-02, grad_scale: 32.0 2023-12-03 23:36:35,110 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.99 vs. limit=21.175 2023-12-03 23:36:44,768 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.49 vs. limit=14.3625 2023-12-03 23:36:47,547 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=18300.0, ans=0.125 2023-12-03 23:36:55,698 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=18366.666666666668, ans=0.0 2023-12-03 23:37:05,313 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-03 23:37:09,417 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=18433.333333333332, ans=0.125 2023-12-03 23:37:21,539 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.78 vs. limit=5.775 2023-12-03 23:37:26,347 INFO [train.py:1087] (1/4) Epoch 4, batch 100, loss[loss=0.2807, simple_loss=0.3446, pruned_loss=0.1084, over 24460.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3451, pruned_loss=0.1047, over 1907559.11 frames. ], batch size: 77, lr: 3.80e-02, grad_scale: 32.0 2023-12-03 23:37:29,084 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.54 vs. 
limit=14.283333333333335 2023-12-03 23:37:29,775 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=18566.666666666668, ans=0.125 2023-12-03 23:37:33,186 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=18566.666666666668, ans=0.07 2023-12-03 23:37:36,419 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=18633.333333333332, ans=0.0 2023-12-03 23:37:39,665 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=18633.333333333332, ans=0.1136666666666667 2023-12-03 23:37:44,204 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=18633.333333333332, ans=0.0 2023-12-03 23:37:47,836 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.24 vs. limit=11.48 2023-12-03 23:37:48,638 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=18700.0, ans=0.125 2023-12-03 23:37:50,377 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 2.445e+02 3.359e+02 4.351e+02 7.950e+02, threshold=6.718e+02, percent-clipped=1.0 2023-12-03 23:37:59,045 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=18766.666666666668, ans=0.125 2023-12-03 23:38:18,023 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=18833.333333333332, ans=0.125 2023-12-03 23:38:21,118 INFO [train.py:1087] (1/4) Epoch 4, batch 150, loss[loss=0.2729, simple_loss=0.3425, pruned_loss=0.1016, over 23987.00 frames. ], tot_loss[loss=0.277, simple_loss=0.345, pruned_loss=0.1045, over 2541574.78 frames. ], batch size: 87, lr: 3.79e-02, grad_scale: 32.0 2023-12-03 23:38:26,738 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=18900.0, ans=0.125 2023-12-03 23:38:38,524 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=18966.666666666668, ans=0.125 2023-12-03 23:39:09,687 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=19166.666666666668, ans=0.0 2023-12-03 23:39:13,847 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=19166.666666666668, ans=0.10833333333333334 2023-12-03 23:39:13,870 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=19166.666666666668, ans=0.125 2023-12-03 23:39:16,828 INFO [train.py:1087] (1/4) Epoch 4, batch 200, loss[loss=0.2555, simple_loss=0.3319, pruned_loss=0.08956, over 24572.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.3436, pruned_loss=0.1031, over 3058948.91 frames. 
], batch size: 64, lr: 3.79e-02, grad_scale: 32.0 2023-12-03 23:39:23,507 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=19233.333333333332, ans=0.125 2023-12-03 23:39:27,330 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.54 vs. limit=14.7375 2023-12-03 23:39:27,868 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=19300.0, ans=0.006673913043478261 2023-12-03 23:39:36,008 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=19300.0, ans=0.125 2023-12-03 23:39:41,566 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 2.310e+02 2.925e+02 3.881e+02 7.060e+02, threshold=5.850e+02, percent-clipped=2.0 2023-12-03 23:39:42,793 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-03 23:39:42,892 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=19366.666666666668, ans=0.125 2023-12-03 23:39:45,023 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=19366.666666666668, ans=0.025 2023-12-03 23:39:52,474 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=19433.333333333332, ans=0.125 2023-12-03 23:40:11,286 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=19500.0, ans=0.125 2023-12-03 23:40:13,188 INFO [train.py:1087] (1/4) Epoch 4, batch 250, loss[loss=0.2551, simple_loss=0.3247, pruned_loss=0.0927, over 24758.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.3424, pruned_loss=0.1025, over 3449028.71 frames. ], batch size: 64, lr: 3.78e-02, grad_scale: 32.0 2023-12-03 23:40:37,581 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=19700.0, ans=0.006586956521739131 2023-12-03 23:41:08,725 INFO [train.py:1087] (1/4) Epoch 4, batch 300, loss[loss=0.2823, simple_loss=0.352, pruned_loss=0.1064, over 24788.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.3425, pruned_loss=0.1025, over 3741709.80 frames. ], batch size: 71, lr: 3.78e-02, grad_scale: 32.0 2023-12-03 23:41:32,927 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.490e+02 3.248e+02 4.505e+02 1.296e+03, threshold=6.496e+02, percent-clipped=13.0 2023-12-03 23:41:48,078 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=20100.0, ans=0.2 2023-12-03 23:41:49,453 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=20100.0, ans=0.95 2023-12-03 23:42:03,850 INFO [train.py:1087] (1/4) Epoch 4, batch 350, loss[loss=0.2752, simple_loss=0.3409, pruned_loss=0.1048, over 23741.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3428, pruned_loss=0.1029, over 3970749.17 frames. 
], batch size: 57, lr: 3.78e-02, grad_scale: 32.0 2023-12-03 23:42:20,888 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=20300.0, ans=0.125 2023-12-03 23:42:30,822 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=20366.666666666668, ans=0.0 2023-12-03 23:42:30,936 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=20366.666666666668, ans=0.125 2023-12-03 23:42:34,118 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=20366.666666666668, ans=0.2 2023-12-03 23:42:36,648 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.72 vs. limit=15.0 2023-12-03 23:42:55,375 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=20500.0, ans=0.125 2023-12-03 23:42:59,462 INFO [train.py:1087] (1/4) Epoch 4, batch 400, loss[loss=0.256, simple_loss=0.3272, pruned_loss=0.0924, over 24562.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3424, pruned_loss=0.1027, over 4158318.26 frames. ], batch size: 62, lr: 3.77e-02, grad_scale: 32.0 2023-12-03 23:43:03,425 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.15 vs. limit=15.0 2023-12-03 23:43:23,417 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.41 vs. limit=12.0 2023-12-03 23:43:23,880 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 2.420e+02 3.127e+02 4.015e+02 7.262e+02, threshold=6.254e+02, percent-clipped=2.0 2023-12-03 23:43:31,993 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=20766.666666666668, ans=0.0 2023-12-03 23:43:32,004 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=20766.666666666668, ans=0.5 2023-12-03 23:43:32,609 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.92 vs. limit=6.0 2023-12-03 23:43:50,353 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=20833.333333333332, ans=15.0 2023-12-03 23:43:55,275 INFO [train.py:1087] (1/4) Epoch 4, batch 450, loss[loss=0.2608, simple_loss=0.3357, pruned_loss=0.09294, over 24774.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3409, pruned_loss=0.1016, over 4299655.89 frames. 
], batch size: 64, lr: 3.77e-02, grad_scale: 32.0 2023-12-03 23:43:58,586 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=20900.0, ans=0.015 2023-12-03 23:44:25,669 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=21033.333333333332, ans=0.125 2023-12-03 23:44:29,758 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21100.0, ans=0.1 2023-12-03 23:44:50,923 INFO [train.py:1087] (1/4) Epoch 4, batch 500, loss[loss=0.2674, simple_loss=0.3398, pruned_loss=0.09753, over 24857.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3399, pruned_loss=0.1006, over 4432151.99 frames. ], batch size: 68, lr: 3.76e-02, grad_scale: 32.0 2023-12-03 23:44:55,419 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=21233.333333333332, ans=0.006253623188405798 2023-12-03 23:44:57,576 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=21233.333333333332, ans=0.0 2023-12-03 23:44:58,610 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=21233.333333333332, ans=0.125 2023-12-03 23:45:15,020 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.744e+02 2.343e+02 3.151e+02 5.447e+02 1.233e+03, threshold=6.302e+02, percent-clipped=16.0 2023-12-03 23:45:20,450 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=21366.666666666668, ans=0.1 2023-12-03 23:45:23,908 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.whiten.whitening_limit, batch_count=21433.333333333332, ans=12.0 2023-12-03 23:45:46,093 INFO [train.py:1087] (1/4) Epoch 4, batch 550, loss[loss=0.2533, simple_loss=0.3265, pruned_loss=0.09002, over 24516.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3392, pruned_loss=0.09963, over 4520852.98 frames. ], batch size: 75, lr: 3.76e-02, grad_scale: 32.0 2023-12-03 23:46:00,914 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=21633.333333333332, ans=0.125 2023-12-03 23:46:04,236 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=21633.333333333332, ans=0.0 2023-12-03 23:46:16,264 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=21700.0, ans=0.04949747468305833 2023-12-03 23:46:17,720 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=21700.0, ans=0.0 2023-12-03 23:46:20,546 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=21766.666666666668, ans=0.07 2023-12-03 23:46:33,429 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=21833.333333333332, ans=0.125 2023-12-03 23:46:41,301 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.13 vs. 
limit=15.0 2023-12-03 23:46:41,668 INFO [train.py:1087] (1/4) Epoch 4, batch 600, loss[loss=0.2857, simple_loss=0.3532, pruned_loss=0.1091, over 24712.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.3397, pruned_loss=0.1, over 4577888.73 frames. ], batch size: 69, lr: 3.75e-02, grad_scale: 16.0 2023-12-03 23:46:44,030 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=21900.0, ans=0.035 2023-12-03 23:46:44,539 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.86 vs. limit=10.0 2023-12-03 23:46:47,260 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=21900.0, ans=0.125 2023-12-03 23:47:02,215 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=21966.666666666668, ans=0.09899494936611666 2023-12-03 23:47:07,291 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.361e+02 2.942e+02 4.171e+02 8.666e+02, threshold=5.884e+02, percent-clipped=10.0 2023-12-03 23:47:12,765 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=22033.333333333332, ans=0.125 2023-12-03 23:47:27,800 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=22166.666666666668, ans=0.0 2023-12-03 23:47:32,012 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=22166.666666666668, ans=0.125 2023-12-03 23:47:37,233 INFO [train.py:1087] (1/4) Epoch 4, batch 650, loss[loss=0.272, simple_loss=0.3473, pruned_loss=0.09832, over 24346.00 frames. ], tot_loss[loss=0.2696, simple_loss=0.3396, pruned_loss=0.09984, over 4635145.87 frames. ], batch size: 79, lr: 3.75e-02, grad_scale: 16.0 2023-12-03 23:47:44,945 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=22233.333333333332, ans=0.04949747468305833 2023-12-03 23:48:10,008 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.64 vs. limit=15.0 2023-12-03 23:48:18,079 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=22433.333333333332, ans=0.125 2023-12-03 23:48:32,504 INFO [train.py:1087] (1/4) Epoch 4, batch 700, loss[loss=0.2441, simple_loss=0.3181, pruned_loss=0.08504, over 24730.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3387, pruned_loss=0.09899, over 4676583.58 frames. ], batch size: 67, lr: 3.74e-02, grad_scale: 16.0 2023-12-03 23:48:44,272 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=22633.333333333332, ans=0.2 2023-12-03 23:48:44,285 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=22633.333333333332, ans=0.0 2023-12-03 23:48:46,444 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=22633.333333333332, ans=0.2 2023-12-03 23:48:49,343 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.01 vs. 
limit=15.0 2023-12-03 23:48:58,814 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.819e+02 2.583e+02 3.364e+02 4.437e+02 7.652e+02, threshold=6.728e+02, percent-clipped=5.0 2023-12-03 23:49:00,693 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.24 vs. limit=10.0 2023-12-03 23:49:10,861 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=22766.666666666668, ans=0.005920289855072464 2023-12-03 23:49:25,747 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.10 vs. limit=15.0 2023-12-03 23:49:28,664 INFO [train.py:1087] (1/4) Epoch 4, batch 750, loss[loss=0.2565, simple_loss=0.3306, pruned_loss=0.09117, over 23737.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3383, pruned_loss=0.0988, over 4705599.10 frames. ], batch size: 57, lr: 3.74e-02, grad_scale: 16.0 2023-12-03 23:49:44,309 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=22966.666666666668, ans=0.125 2023-12-03 23:49:45,250 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=22966.666666666668, ans=0.2 2023-12-03 23:49:50,757 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=23033.333333333332, ans=0.125 2023-12-03 23:50:08,567 INFO [scaling.py:1022] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.18 vs. limit=8.0 2023-12-03 23:50:18,070 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.05 vs. limit=15.0 2023-12-03 23:50:23,870 INFO [train.py:1087] (1/4) Epoch 4, batch 800, loss[loss=0.2772, simple_loss=0.3472, pruned_loss=0.1036, over 24765.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.3383, pruned_loss=0.09934, over 4715519.33 frames. ], batch size: 64, lr: 3.73e-02, grad_scale: 32.0 2023-12-03 23:50:32,837 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=23233.333333333332, ans=0.0 2023-12-03 23:50:33,990 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=23233.333333333332, ans=0.125 2023-12-03 23:50:37,901 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=23300.0, ans=0.0 2023-12-03 23:50:42,243 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.29 vs. limit=15.0 2023-12-03 23:50:45,529 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.24 vs. limit=15.0 2023-12-03 23:50:47,615 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.21 vs. 
limit=22.5 2023-12-03 23:50:49,075 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.661e+02 2.403e+02 2.805e+02 3.728e+02 6.833e+02, threshold=5.611e+02, percent-clipped=1.0 2023-12-03 23:50:49,594 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.79 vs. limit=22.5 2023-12-03 23:51:09,637 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=23500.0, ans=0.125 2023-12-03 23:51:14,606 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=23500.0, ans=0.125 2023-12-03 23:51:16,404 INFO [train.py:1087] (1/4) Epoch 4, batch 850, loss[loss=0.2471, simple_loss=0.322, pruned_loss=0.08612, over 24500.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.338, pruned_loss=0.09917, over 4733194.81 frames. ], batch size: 75, lr: 3.73e-02, grad_scale: 32.0 2023-12-03 23:51:39,589 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-03 23:51:49,399 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.15 vs. limit=15.0 2023-12-03 23:52:18,342 INFO [train.py:1087] (1/4) Epoch 5, batch 0, loss[loss=0.2553, simple_loss=0.3335, pruned_loss=0.08861, over 24690.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3335, pruned_loss=0.08861, over 24690.00 frames. ], batch size: 74, lr: 3.47e-02, grad_scale: 32.0 2023-12-03 23:52:18,343 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-03 23:52:30,469 INFO [train.py:1119] (1/4) Epoch 5, validation: loss=0.2159, simple_loss=0.3139, pruned_loss=0.05896, over 944034.00 frames. 2023-12-03 23:52:30,469 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-03 23:53:01,281 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.538e+02 3.142e+02 3.972e+02 7.221e+02, threshold=6.284e+02, percent-clipped=5.0 2023-12-03 23:53:02,740 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=24066.666666666668, ans=0.07 2023-12-03 23:53:26,118 INFO [train.py:1087] (1/4) Epoch 5, batch 50, loss[loss=0.2487, simple_loss=0.3224, pruned_loss=0.08749, over 24578.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3347, pruned_loss=0.09619, over 1082164.25 frames. ], batch size: 65, lr: 3.46e-02, grad_scale: 32.0 2023-12-03 23:53:27,449 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=24200.0, ans=0.005608695652173913 2023-12-03 23:53:48,603 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.11 vs. limit=22.5 2023-12-03 23:53:49,575 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.04 vs. 
limit=10.0 2023-12-03 23:54:01,546 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=24400.0, ans=0.05 2023-12-03 23:54:04,653 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=24400.0, ans=0.125 2023-12-03 23:54:06,775 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=24400.0, ans=0.125 2023-12-03 23:54:09,415 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=24466.666666666668, ans=0.0055507246376811595 2023-12-03 23:54:13,072 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.81 vs. limit=15.0 2023-12-03 23:54:21,763 INFO [train.py:1087] (1/4) Epoch 5, batch 100, loss[loss=0.2817, simple_loss=0.3465, pruned_loss=0.1085, over 23493.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3333, pruned_loss=0.09452, over 1913225.26 frames. ], batch size: 94, lr: 3.46e-02, grad_scale: 32.0 2023-12-03 23:54:40,454 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=24600.0, ans=0.04949747468305833 2023-12-03 23:54:46,438 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0 2023-12-03 23:54:49,343 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=24666.666666666668, ans=0.2 2023-12-03 23:54:52,470 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.614e+02 2.253e+02 2.574e+02 3.314e+02 5.061e+02, threshold=5.148e+02, percent-clipped=0.0 2023-12-03 23:55:02,557 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=11.47 vs. limit=15.0 2023-12-03 23:55:17,313 INFO [train.py:1087] (1/4) Epoch 5, batch 150, loss[loss=0.2474, simple_loss=0.3239, pruned_loss=0.08547, over 24732.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3323, pruned_loss=0.0933, over 2573566.03 frames. ], batch size: 63, lr: 3.46e-02, grad_scale: 32.0 2023-12-03 23:55:28,866 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=24933.333333333332, ans=0.0054492753623188415 2023-12-03 23:55:32,284 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.92 vs. limit=15.0 2023-12-03 23:55:37,411 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=24933.333333333332, ans=0.125 2023-12-03 23:55:38,519 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=25000.0, ans=0.125 2023-12-03 23:55:51,002 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.96 vs. 
limit=15.0 2023-12-03 23:55:55,623 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=25066.666666666668, ans=10.0 2023-12-03 23:55:56,525 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=25066.666666666668, ans=0.005420289855072463 2023-12-03 23:56:01,112 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.06 vs. limit=10.0 2023-12-03 23:56:12,427 INFO [train.py:1087] (1/4) Epoch 5, batch 200, loss[loss=0.2599, simple_loss=0.3341, pruned_loss=0.09284, over 24504.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3325, pruned_loss=0.0936, over 3069212.00 frames. ], batch size: 77, lr: 3.45e-02, grad_scale: 32.0 2023-12-03 23:56:22,894 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=25266.666666666668, ans=0.125 2023-12-03 23:56:23,786 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=25266.666666666668, ans=0.1 2023-12-03 23:56:24,999 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=25266.666666666668, ans=0.125 2023-12-03 23:56:30,412 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.17 vs. limit=15.0 2023-12-03 23:56:30,922 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=25266.666666666668, ans=0.125 2023-12-03 23:56:31,901 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=25266.666666666668, ans=0.125 2023-12-03 23:56:43,921 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.577e+02 2.107e+02 2.550e+02 3.488e+02 6.968e+02, threshold=5.101e+02, percent-clipped=3.0 2023-12-03 23:57:02,796 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=25466.666666666668, ans=0.0 2023-12-03 23:57:08,971 INFO [train.py:1087] (1/4) Epoch 5, batch 250, loss[loss=0.3025, simple_loss=0.3616, pruned_loss=0.1217, over 24187.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3327, pruned_loss=0.09412, over 3447015.92 frames. ], batch size: 82, lr: 3.45e-02, grad_scale: 32.0 2023-12-03 23:57:46,057 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.40 vs. limit=22.5 2023-12-03 23:57:55,259 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=25800.0, ans=0.0 2023-12-03 23:58:04,573 INFO [train.py:1087] (1/4) Epoch 5, batch 300, loss[loss=0.2598, simple_loss=0.3382, pruned_loss=0.09075, over 24556.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3323, pruned_loss=0.0938, over 3748783.44 frames. 
], batch size: 62, lr: 3.44e-02, grad_scale: 32.0 2023-12-03 23:58:13,595 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=25866.666666666668, ans=0.2 2023-12-03 23:58:26,474 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=26000.0, ans=0.0 2023-12-03 23:58:35,989 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.564e+02 2.281e+02 3.318e+02 4.636e+02 8.422e+02, threshold=6.635e+02, percent-clipped=20.0 2023-12-03 23:58:59,492 INFO [train.py:1087] (1/4) Epoch 5, batch 350, loss[loss=0.2503, simple_loss=0.3258, pruned_loss=0.08737, over 24761.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3318, pruned_loss=0.09327, over 3983733.14 frames. ], batch size: 64, lr: 3.44e-02, grad_scale: 32.0 2023-12-03 23:59:04,535 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=26200.0, ans=0.005173913043478261 2023-12-03 23:59:16,858 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=26266.666666666668, ans=0.125 2023-12-03 23:59:17,804 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=26266.666666666668, ans=0.125 2023-12-03 23:59:24,233 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=26333.333333333332, ans=0.0 2023-12-03 23:59:29,934 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.89 vs. limit=15.0 2023-12-03 23:59:31,865 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=26400.0, ans=0.0 2023-12-03 23:59:35,963 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.41 vs. limit=15.0 2023-12-03 23:59:43,272 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=26466.666666666668, ans=0.2 2023-12-03 23:59:54,877 INFO [train.py:1087] (1/4) Epoch 5, batch 400, loss[loss=0.2434, simple_loss=0.3187, pruned_loss=0.08401, over 24714.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3319, pruned_loss=0.09335, over 4167476.61 frames. ], batch size: 74, lr: 3.43e-02, grad_scale: 32.0 2023-12-03 23:59:58,579 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.31 vs. limit=6.0 2023-12-04 00:00:29,179 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=26666.666666666668, ans=0.0 2023-12-04 00:00:30,023 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.670e+02 2.259e+02 2.965e+02 3.727e+02 6.747e+02, threshold=5.929e+02, percent-clipped=2.0 2023-12-04 00:00:30,572 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.09 vs. 
limit=15.0 2023-12-04 00:00:46,258 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=26800.0, ans=0.0 2023-12-04 00:00:49,111 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.79 vs. limit=22.5 2023-12-04 00:00:53,310 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.32 vs. limit=6.0 2023-12-04 00:00:53,791 INFO [train.py:1087] (1/4) Epoch 5, batch 450, loss[loss=0.2523, simple_loss=0.3254, pruned_loss=0.08964, over 24569.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3302, pruned_loss=0.09196, over 4331012.64 frames. ], batch size: 64, lr: 3.43e-02, grad_scale: 32.0 2023-12-04 00:00:57,162 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=26866.666666666668, ans=0.04949747468305833 2023-12-04 00:01:03,653 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=26933.333333333332, ans=0.125 2023-12-04 00:01:10,146 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=26933.333333333332, ans=0.125 2023-12-04 00:01:15,251 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=27000.0, ans=0.0 2023-12-04 00:01:19,857 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.28 vs. limit=6.0 2023-12-04 00:01:39,227 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=27133.333333333332, ans=0.0 2023-12-04 00:01:49,307 INFO [train.py:1087] (1/4) Epoch 5, batch 500, loss[loss=0.2551, simple_loss=0.327, pruned_loss=0.09158, over 24773.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3304, pruned_loss=0.09239, over 4420355.81 frames. ], batch size: 71, lr: 3.42e-02, grad_scale: 32.0 2023-12-04 00:01:56,250 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=27200.0, ans=0.125 2023-12-04 00:02:03,829 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=27266.666666666668, ans=0.1 2023-12-04 00:02:09,025 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=27266.666666666668, ans=0.0 2023-12-04 00:02:16,944 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=27333.333333333332, ans=0.0049275362318840586 2023-12-04 00:02:21,289 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.548e+02 2.293e+02 2.878e+02 3.785e+02 5.915e+02, threshold=5.756e+02, percent-clipped=0.0 2023-12-04 00:02:37,382 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.90 vs. 
limit=15.0 2023-12-04 00:02:41,145 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=27466.666666666668, ans=0.07 2023-12-04 00:02:43,317 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=27533.333333333332, ans=0.1 2023-12-04 00:02:44,174 INFO [train.py:1087] (1/4) Epoch 5, batch 550, loss[loss=0.2369, simple_loss=0.3175, pruned_loss=0.07816, over 24763.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3304, pruned_loss=0.09232, over 4502347.73 frames. ], batch size: 64, lr: 3.42e-02, grad_scale: 32.0 2023-12-04 00:02:50,517 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=27533.333333333332, ans=0.09899494936611666 2023-12-04 00:02:58,552 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=27600.0, ans=0.0 2023-12-04 00:03:15,997 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=27666.666666666668, ans=0.125 2023-12-04 00:03:17,202 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=27733.333333333332, ans=0.004840579710144928 2023-12-04 00:03:22,523 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.59 vs. limit=6.0 2023-12-04 00:03:41,069 INFO [train.py:1087] (1/4) Epoch 5, batch 600, loss[loss=0.323, simple_loss=0.3736, pruned_loss=0.1362, over 16803.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3304, pruned_loss=0.09241, over 4555861.18 frames. ], batch size: 178, lr: 3.41e-02, grad_scale: 32.0 2023-12-04 00:03:48,663 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=27866.666666666668, ans=0.07 2023-12-04 00:04:11,249 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.99 vs. limit=15.0 2023-12-04 00:04:13,050 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28000.0, ans=0.1 2023-12-04 00:04:13,739 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 2.232e+02 2.803e+02 3.888e+02 7.159e+02, threshold=5.606e+02, percent-clipped=4.0 2023-12-04 00:04:18,327 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=28066.666666666668, ans=0.125 2023-12-04 00:04:35,576 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28133.333333333332, ans=0.1 2023-12-04 00:04:37,503 INFO [train.py:1087] (1/4) Epoch 5, batch 650, loss[loss=0.2816, simple_loss=0.3455, pruned_loss=0.1088, over 24771.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3301, pruned_loss=0.09226, over 4604452.50 frames. 
], batch size: 71, lr: 3.41e-02, grad_scale: 32.0 2023-12-04 00:04:37,761 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28200.0, ans=0.1 2023-12-04 00:05:03,564 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=28333.333333333332, ans=0.95 2023-12-04 00:05:14,568 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=28400.0, ans=0.125 2023-12-04 00:05:25,737 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=28466.666666666668, ans=0.004681159420289855 2023-12-04 00:05:33,786 INFO [train.py:1087] (1/4) Epoch 5, batch 700, loss[loss=0.2635, simple_loss=0.3395, pruned_loss=0.09376, over 24480.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3288, pruned_loss=0.0912, over 4656095.16 frames. ], batch size: 75, lr: 3.40e-02, grad_scale: 32.0 2023-12-04 00:05:39,264 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28533.333333333332, ans=0.1 2023-12-04 00:05:49,509 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.34 vs. limit=15.0 2023-12-04 00:05:52,528 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=28600.0, ans=0.1 2023-12-04 00:06:06,254 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 2.233e+02 2.774e+02 3.473e+02 7.233e+02, threshold=5.549e+02, percent-clipped=3.0 2023-12-04 00:06:15,398 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.30 vs. limit=15.0 2023-12-04 00:06:29,839 INFO [train.py:1087] (1/4) Epoch 5, batch 750, loss[loss=0.2302, simple_loss=0.3041, pruned_loss=0.07818, over 24758.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3279, pruned_loss=0.09065, over 4689505.65 frames. ], batch size: 64, lr: 3.40e-02, grad_scale: 32.0 2023-12-04 00:06:31,211 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=28866.666666666668, ans=0.125 2023-12-04 00:06:33,092 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-12-04 00:06:40,984 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=28933.333333333332, ans=0.125 2023-12-04 00:06:57,081 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.47 vs. limit=22.5 2023-12-04 00:07:17,862 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=29133.333333333332, ans=0.004536231884057972 2023-12-04 00:07:25,088 INFO [train.py:1087] (1/4) Epoch 5, batch 800, loss[loss=0.2377, simple_loss=0.3139, pruned_loss=0.08076, over 24725.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3277, pruned_loss=0.09036, over 4714856.17 frames. 
], batch size: 67, lr: 3.39e-02, grad_scale: 32.0 2023-12-04 00:07:41,461 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=29266.666666666668, ans=0.0 2023-12-04 00:07:43,319 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=29266.666666666668, ans=0.125 2023-12-04 00:07:48,130 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.39 vs. limit=15.0 2023-12-04 00:07:56,524 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.574e+02 2.263e+02 2.786e+02 3.430e+02 5.590e+02, threshold=5.572e+02, percent-clipped=1.0 2023-12-04 00:08:17,888 INFO [train.py:1087] (1/4) Epoch 5, batch 850, loss[loss=0.2405, simple_loss=0.3203, pruned_loss=0.08039, over 24545.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3276, pruned_loss=0.0902, over 4728747.74 frames. ], batch size: 62, lr: 3.39e-02, grad_scale: 32.0 2023-12-04 00:08:22,099 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=29533.333333333332, ans=0.1 2023-12-04 00:08:32,401 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=29600.0, ans=0.125 2023-12-04 00:08:50,164 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=29733.333333333332, ans=0.07 2023-12-04 00:08:55,055 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 00:09:19,640 INFO [train.py:1087] (1/4) Epoch 6, batch 0, loss[loss=0.2572, simple_loss=0.3283, pruned_loss=0.09307, over 24481.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3283, pruned_loss=0.09307, over 24481.00 frames. ], batch size: 75, lr: 3.16e-02, grad_scale: 32.0 2023-12-04 00:09:19,641 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 00:09:31,933 INFO [train.py:1119] (1/4) Epoch 6, validation: loss=0.2086, simple_loss=0.3076, pruned_loss=0.05475, over 944034.00 frames. 2023-12-04 00:09:31,933 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 00:09:41,739 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.08 vs. limit=10.0 2023-12-04 00:10:00,255 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=29966.666666666668, ans=0.2 2023-12-04 00:10:11,103 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.671e+02 2.341e+02 2.934e+02 3.953e+02 6.299e+02, threshold=5.868e+02, percent-clipped=2.0 2023-12-04 00:10:13,819 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=9.50 vs. limit=15.0 2023-12-04 00:10:16,730 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=30100.0, ans=0.125 2023-12-04 00:10:27,931 INFO [train.py:1087] (1/4) Epoch 6, batch 50, loss[loss=0.2419, simple_loss=0.3213, pruned_loss=0.08121, over 24770.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3259, pruned_loss=0.08906, over 1075291.41 frames. 
], batch size: 65, lr: 3.15e-02, grad_scale: 32.0 2023-12-04 00:10:28,252 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=30166.666666666668, ans=0.125 2023-12-04 00:10:28,451 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.53 vs. limit=15.0 2023-12-04 00:10:30,518 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.40 vs. limit=15.0 2023-12-04 00:10:47,911 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=30233.333333333332, ans=0.0 2023-12-04 00:10:47,928 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=30233.333333333332, ans=0.1 2023-12-04 00:10:58,638 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=30300.0, ans=0.125 2023-12-04 00:11:02,922 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=30366.666666666668, ans=0.0042681159420289855 2023-12-04 00:11:17,493 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=30433.333333333332, ans=15.0 2023-12-04 00:11:23,792 INFO [train.py:1087] (1/4) Epoch 6, batch 100, loss[loss=0.2406, simple_loss=0.3234, pruned_loss=0.07885, over 24553.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3233, pruned_loss=0.08656, over 1912846.15 frames. ], batch size: 63, lr: 3.15e-02, grad_scale: 32.0 2023-12-04 00:11:25,377 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=30500.0, ans=0.0 2023-12-04 00:11:41,875 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=30566.666666666668, ans=0.0 2023-12-04 00:11:43,439 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.86 vs. limit=22.5 2023-12-04 00:11:57,659 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.07 vs. limit=22.5 2023-12-04 00:12:03,696 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.998e+02 2.383e+02 2.874e+02 5.461e+02, threshold=4.765e+02, percent-clipped=0.0 2023-12-04 00:12:03,978 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=30700.0, ans=0.0 2023-12-04 00:12:18,060 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=30766.666666666668, ans=0.0 2023-12-04 00:12:19,876 INFO [train.py:1087] (1/4) Epoch 6, batch 150, loss[loss=0.2352, simple_loss=0.3114, pruned_loss=0.07954, over 24752.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3243, pruned_loss=0.08787, over 2534542.07 frames. 
], batch size: 70, lr: 3.14e-02, grad_scale: 32.0 2023-12-04 00:12:25,369 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=30833.333333333332, ans=0.125 2023-12-04 00:12:31,391 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 00:12:59,110 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=31033.333333333332, ans=0.004123188405797102 2023-12-04 00:13:15,674 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=31166.666666666668, ans=0.125 2023-12-04 00:13:16,513 INFO [train.py:1087] (1/4) Epoch 6, batch 200, loss[loss=0.2284, simple_loss=0.3024, pruned_loss=0.07713, over 24558.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3235, pruned_loss=0.08721, over 3024802.34 frames. ], batch size: 63, lr: 3.14e-02, grad_scale: 32.0 2023-12-04 00:13:16,738 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=31166.666666666668, ans=0.0 2023-12-04 00:13:23,318 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=31166.666666666668, ans=0.05 2023-12-04 00:13:34,073 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=31233.333333333332, ans=0.2 2023-12-04 00:13:42,651 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=31300.0, ans=0.125 2023-12-04 00:13:50,256 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=31366.666666666668, ans=0.125 2023-12-04 00:13:51,536 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=25.27 vs. limit=22.5 2023-12-04 00:13:55,244 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 2.132e+02 2.539e+02 3.184e+02 5.927e+02, threshold=5.079e+02, percent-clipped=2.0 2023-12-04 00:14:07,060 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.96 vs. limit=15.0 2023-12-04 00:14:09,874 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=31433.333333333332, ans=0.125 2023-12-04 00:14:12,890 INFO [train.py:1087] (1/4) Epoch 6, batch 250, loss[loss=0.2484, simple_loss=0.3228, pruned_loss=0.08704, over 24479.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3226, pruned_loss=0.08617, over 3424524.46 frames. 
], batch size: 77, lr: 3.13e-02, grad_scale: 32.0 2023-12-04 00:14:16,423 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=31500.0, ans=0.2 2023-12-04 00:14:21,777 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=31500.0, ans=0.125 2023-12-04 00:14:24,951 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=31566.666666666668, ans=0.125 2023-12-04 00:14:33,638 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.48 vs. limit=15.0 2023-12-04 00:14:44,668 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=31633.333333333332, ans=0.0 2023-12-04 00:14:46,790 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=31700.0, ans=0.125 2023-12-04 00:15:07,296 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31766.666666666668, ans=0.1 2023-12-04 00:15:09,560 INFO [train.py:1087] (1/4) Epoch 6, batch 300, loss[loss=0.2381, simple_loss=0.3147, pruned_loss=0.08075, over 24752.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3223, pruned_loss=0.08611, over 3726540.46 frames. ], batch size: 63, lr: 3.13e-02, grad_scale: 32.0 2023-12-04 00:15:16,148 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=31833.333333333332, ans=0.125 2023-12-04 00:15:22,676 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=31900.0, ans=0.125 2023-12-04 00:15:48,953 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 2.044e+02 2.473e+02 2.984e+02 6.405e+02, threshold=4.945e+02, percent-clipped=3.0 2023-12-04 00:15:54,771 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=32100.0, ans=0.1 2023-12-04 00:16:05,113 INFO [train.py:1087] (1/4) Epoch 6, batch 350, loss[loss=0.2212, simple_loss=0.3005, pruned_loss=0.07091, over 24746.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3233, pruned_loss=0.08664, over 3952683.04 frames. 
], batch size: 66, lr: 3.12e-02, grad_scale: 32.0 2023-12-04 00:16:07,223 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=32166.666666666668, ans=0.125 2023-12-04 00:16:20,940 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=32233.333333333332, ans=0.0 2023-12-04 00:16:42,282 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=32366.666666666668, ans=0.0038333333333333336 2023-12-04 00:16:48,900 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=32366.666666666668, ans=0.125 2023-12-04 00:16:57,490 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=32433.333333333332, ans=0.1 2023-12-04 00:17:01,510 INFO [train.py:1087] (1/4) Epoch 6, batch 400, loss[loss=0.2516, simple_loss=0.3253, pruned_loss=0.08899, over 24160.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3232, pruned_loss=0.08667, over 4129437.07 frames. ], batch size: 82, lr: 3.12e-02, grad_scale: 32.0 2023-12-04 00:17:03,891 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=32500.0, ans=0.125 2023-12-04 00:17:31,851 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=32633.333333333332, ans=0.125 2023-12-04 00:17:36,094 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=32700.0, ans=0.2 2023-12-04 00:17:40,916 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.474e+02 2.034e+02 2.326e+02 2.944e+02 4.882e+02, threshold=4.653e+02, percent-clipped=0.0 2023-12-04 00:17:57,684 INFO [train.py:1087] (1/4) Epoch 6, batch 450, loss[loss=0.2509, simple_loss=0.324, pruned_loss=0.08888, over 24329.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3225, pruned_loss=0.08584, over 4285749.69 frames. ], batch size: 79, lr: 3.12e-02, grad_scale: 32.0 2023-12-04 00:18:28,517 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=32966.666666666664, ans=22.5 2023-12-04 00:18:31,861 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=15.0 2023-12-04 00:18:37,351 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.47 vs. limit=15.0 2023-12-04 00:18:41,328 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=33100.0, ans=0.125 2023-12-04 00:18:53,927 INFO [train.py:1087] (1/4) Epoch 6, batch 500, loss[loss=0.2402, simple_loss=0.3172, pruned_loss=0.0816, over 24862.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3212, pruned_loss=0.08498, over 4406362.63 frames. 
], batch size: 68, lr: 3.11e-02, grad_scale: 32.0 2023-12-04 00:19:05,466 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=33233.333333333336, ans=0.125 2023-12-04 00:19:28,393 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 00:19:33,504 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.594e+02 1.953e+02 2.208e+02 2.592e+02 4.594e+02, threshold=4.417e+02, percent-clipped=0.0 2023-12-04 00:19:34,018 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.85 vs. limit=15.0 2023-12-04 00:19:40,743 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=33433.333333333336, ans=0.2 2023-12-04 00:19:50,851 INFO [train.py:1087] (1/4) Epoch 6, batch 550, loss[loss=0.2294, simple_loss=0.3092, pruned_loss=0.07479, over 24795.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3214, pruned_loss=0.08498, over 4490556.16 frames. ], batch size: 71, lr: 3.11e-02, grad_scale: 32.0 2023-12-04 00:19:51,032 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=33500.0, ans=0.0035869565217391307 2023-12-04 00:20:16,592 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=33633.333333333336, ans=0.1 2023-12-04 00:20:32,016 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.77 vs. limit=15.0 2023-12-04 00:20:33,812 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=33700.0, ans=0.2 2023-12-04 00:20:40,362 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=33766.666666666664, ans=0.1 2023-12-04 00:20:45,532 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.83 vs. limit=22.5 2023-12-04 00:20:47,011 INFO [train.py:1087] (1/4) Epoch 6, batch 600, loss[loss=0.2313, simple_loss=0.3146, pruned_loss=0.07395, over 24771.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3211, pruned_loss=0.08483, over 4557959.29 frames. ], batch size: 70, lr: 3.10e-02, grad_scale: 32.0 2023-12-04 00:21:01,201 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=33900.0, ans=0.0 2023-12-04 00:21:05,581 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=33900.0, ans=0.0 2023-12-04 00:21:11,954 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=33966.666666666664, ans=0.1 2023-12-04 00:21:22,899 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=34033.333333333336, ans=0.125 2023-12-04 00:21:26,787 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.521e+02 2.169e+02 2.721e+02 3.398e+02 7.625e+02, threshold=5.441e+02, percent-clipped=15.0 2023-12-04 00:21:29,958 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.53 vs. 
limit=22.5 2023-12-04 00:21:30,541 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=34033.333333333336, ans=0.0 2023-12-04 00:21:39,019 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=34100.0, ans=0.125 2023-12-04 00:21:39,118 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=34100.0, ans=0.2 2023-12-04 00:21:40,145 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=34100.0, ans=0.0034565217391304354 2023-12-04 00:21:43,040 INFO [train.py:1087] (1/4) Epoch 6, batch 650, loss[loss=0.2396, simple_loss=0.318, pruned_loss=0.08064, over 24729.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3208, pruned_loss=0.08466, over 4610249.60 frames. ], batch size: 67, lr: 3.10e-02, grad_scale: 32.0 2023-12-04 00:21:50,390 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.28 vs. limit=15.0 2023-12-04 00:22:17,999 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.50 vs. limit=15.0 2023-12-04 00:22:26,344 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.91 vs. limit=15.0 2023-12-04 00:22:35,291 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.56 vs. limit=12.0 2023-12-04 00:22:39,146 INFO [train.py:1087] (1/4) Epoch 6, batch 700, loss[loss=0.2447, simple_loss=0.3209, pruned_loss=0.08425, over 24701.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3204, pruned_loss=0.08438, over 4645434.26 frames. ], batch size: 74, lr: 3.09e-02, grad_scale: 32.0 2023-12-04 00:22:49,027 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=34566.666666666664, ans=0.125 2023-12-04 00:23:14,833 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.40 vs. limit=15.0 2023-12-04 00:23:15,617 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=34700.0, ans=0.125 2023-12-04 00:23:18,442 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.587e+02 2.071e+02 2.301e+02 2.749e+02 5.150e+02, threshold=4.601e+02, percent-clipped=0.0 2023-12-04 00:23:19,743 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=34700.0, ans=0.0 2023-12-04 00:23:35,740 INFO [train.py:1087] (1/4) Epoch 6, batch 750, loss[loss=0.2705, simple_loss=0.3369, pruned_loss=0.1021, over 21131.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3197, pruned_loss=0.08383, over 4694605.12 frames. 
], batch size: 127, lr: 3.09e-02, grad_scale: 32.0 2023-12-04 00:23:40,358 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=34833.333333333336, ans=0.2 2023-12-04 00:23:53,072 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 00:23:56,227 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=34966.666666666664, ans=0.125 2023-12-04 00:24:16,971 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=35033.333333333336, ans=0.125 2023-12-04 00:24:18,099 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=35033.333333333336, ans=0.125 2023-12-04 00:24:21,324 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=35100.0, ans=0.0 2023-12-04 00:24:30,995 INFO [train.py:1087] (1/4) Epoch 6, batch 800, loss[loss=0.3163, simple_loss=0.3646, pruned_loss=0.134, over 17411.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3198, pruned_loss=0.08397, over 4715790.94 frames. ], batch size: 177, lr: 3.08e-02, grad_scale: 32.0 2023-12-04 00:24:54,915 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=35300.0, ans=0.5 2023-12-04 00:25:07,850 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 2.093e+02 2.608e+02 3.306e+02 8.466e+02, threshold=5.216e+02, percent-clipped=6.0 2023-12-04 00:25:08,135 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=35366.666666666664, ans=0.125 2023-12-04 00:25:15,190 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=35433.333333333336, ans=0.125 2023-12-04 00:25:23,104 INFO [train.py:1087] (1/4) Epoch 6, batch 850, loss[loss=0.2403, simple_loss=0.316, pruned_loss=0.08228, over 22974.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3199, pruned_loss=0.08391, over 4745662.32 frames. ], batch size: 106, lr: 3.08e-02, grad_scale: 32.0 2023-12-04 00:25:40,446 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=35566.666666666664, ans=0.1 2023-12-04 00:25:42,788 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.61 vs. limit=15.0 2023-12-04 00:26:25,689 INFO [train.py:1087] (1/4) Epoch 7, batch 0, loss[loss=0.2298, simple_loss=0.311, pruned_loss=0.0743, over 24558.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.311, pruned_loss=0.0743, over 24558.00 frames. ], batch size: 66, lr: 2.88e-02, grad_scale: 32.0 2023-12-04 00:26:25,689 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 00:26:37,870 INFO [train.py:1119] (1/4) Epoch 7, validation: loss=0.199, simple_loss=0.2994, pruned_loss=0.0493, over 944034.00 frames. 2023-12-04 00:26:37,870 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 00:26:40,604 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.82 vs. 
limit=22.5 2023-12-04 00:26:50,782 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=35866.666666666664, ans=0.125 2023-12-04 00:27:15,632 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=36000.0, ans=0.0 2023-12-04 00:27:21,668 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.486e+02 2.041e+02 2.433e+02 2.875e+02 3.924e+02, threshold=4.865e+02, percent-clipped=0.0 2023-12-04 00:27:33,027 INFO [train.py:1087] (1/4) Epoch 7, batch 50, loss[loss=0.2336, simple_loss=0.314, pruned_loss=0.07658, over 24695.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.315, pruned_loss=0.08013, over 1084443.64 frames. ], batch size: 74, lr: 2.88e-02, grad_scale: 32.0 2023-12-04 00:27:35,892 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=36133.333333333336, ans=0.2 2023-12-04 00:27:41,618 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=5.36 vs. limit=15.0 2023-12-04 00:28:13,644 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=36333.333333333336, ans=0.0 2023-12-04 00:28:28,808 INFO [train.py:1087] (1/4) Epoch 7, batch 100, loss[loss=0.2423, simple_loss=0.3207, pruned_loss=0.08195, over 24755.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3159, pruned_loss=0.0808, over 1896777.56 frames. ], batch size: 61, lr: 2.87e-02, grad_scale: 32.0 2023-12-04 00:28:30,710 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.51 vs. limit=15.0 2023-12-04 00:28:38,829 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36533.333333333336, ans=0.1 2023-12-04 00:28:46,021 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=36533.333333333336, ans=6.0 2023-12-04 00:29:04,025 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.75 vs. limit=12.0 2023-12-04 00:29:05,107 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.16 vs. limit=15.0 2023-12-04 00:29:05,596 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=36666.666666666664, ans=0.125 2023-12-04 00:29:12,621 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.464e+02 1.963e+02 2.300e+02 3.064e+02 5.989e+02, threshold=4.600e+02, percent-clipped=1.0 2023-12-04 00:29:23,688 INFO [train.py:1087] (1/4) Epoch 7, batch 150, loss[loss=0.2357, simple_loss=0.3152, pruned_loss=0.07813, over 24791.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3168, pruned_loss=0.0813, over 2544125.34 frames. ], batch size: 70, lr: 2.87e-02, grad_scale: 32.0 2023-12-04 00:29:27,592 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.25 vs. 
limit=15.0 2023-12-04 00:29:28,849 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.31 vs. limit=15.0 2023-12-04 00:29:28,994 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.43 vs. limit=15.0 2023-12-04 00:29:58,506 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=37000.0, ans=0.125 2023-12-04 00:30:19,461 INFO [train.py:1087] (1/4) Epoch 7, batch 200, loss[loss=0.2167, simple_loss=0.3026, pruned_loss=0.06544, over 24759.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3165, pruned_loss=0.08122, over 3045336.64 frames. ], batch size: 64, lr: 2.86e-02, grad_scale: 32.0 2023-12-04 00:30:21,277 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.76 vs. limit=6.0 2023-12-04 00:30:21,732 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=37133.333333333336, ans=0.025 2023-12-04 00:31:01,709 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.91 vs. limit=15.0 2023-12-04 00:31:04,587 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 2.199e+02 2.530e+02 3.188e+02 4.211e+02, threshold=5.060e+02, percent-clipped=0.0 2023-12-04 00:31:04,838 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=37400.0, ans=0.1 2023-12-04 00:31:15,988 INFO [train.py:1087] (1/4) Epoch 7, batch 250, loss[loss=0.2252, simple_loss=0.3084, pruned_loss=0.07102, over 24605.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.315, pruned_loss=0.0799, over 3450316.57 frames. ], batch size: 68, lr: 2.86e-02, grad_scale: 32.0 2023-12-04 00:31:25,307 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=15.0 2023-12-04 00:31:27,455 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=37533.333333333336, ans=0.04949747468305833 2023-12-04 00:31:55,012 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=37666.666666666664, ans=0.1 2023-12-04 00:32:00,654 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=37733.333333333336, ans=0.002666666666666666 2023-12-04 00:32:08,538 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=37733.333333333336, ans=0.0 2023-12-04 00:32:11,625 INFO [train.py:1087] (1/4) Epoch 7, batch 300, loss[loss=0.2351, simple_loss=0.3104, pruned_loss=0.07984, over 24765.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.314, pruned_loss=0.07907, over 3758151.31 frames. 
], batch size: 64, lr: 2.85e-02, grad_scale: 32.0 2023-12-04 00:32:25,817 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=37866.666666666664, ans=0.125 2023-12-04 00:32:37,566 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.38 vs. limit=22.5 2023-12-04 00:32:51,018 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=38000.0, ans=0.125 2023-12-04 00:32:54,964 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.970e+02 2.253e+02 2.674e+02 4.249e+02, threshold=4.505e+02, percent-clipped=0.0 2023-12-04 00:33:01,968 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=38066.666666666664, ans=0.125 2023-12-04 00:33:05,988 INFO [train.py:1087] (1/4) Epoch 7, batch 350, loss[loss=0.2224, simple_loss=0.3015, pruned_loss=0.07162, over 24759.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3139, pruned_loss=0.0791, over 3991221.68 frames. ], batch size: 64, lr: 2.85e-02, grad_scale: 32.0 2023-12-04 00:33:16,651 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.96 vs. limit=12.0 2023-12-04 00:33:38,904 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.47 vs. limit=15.0 2023-12-04 00:33:57,907 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=38400.0, ans=0.002521739130434783 2023-12-04 00:34:00,825 INFO [train.py:1087] (1/4) Epoch 7, batch 400, loss[loss=0.2456, simple_loss=0.3206, pruned_loss=0.08532, over 24778.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3149, pruned_loss=0.07991, over 4159755.87 frames. ], batch size: 62, lr: 2.84e-02, grad_scale: 32.0 2023-12-04 00:34:09,321 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=38466.666666666664, ans=0.2 2023-12-04 00:34:27,465 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=38600.0, ans=0.1 2023-12-04 00:34:44,516 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=38733.333333333336, ans=0.0024492753623188398 2023-12-04 00:34:44,851 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.30 vs. limit=12.0 2023-12-04 00:34:45,276 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.431e+02 2.109e+02 2.328e+02 2.740e+02 4.257e+02, threshold=4.655e+02, percent-clipped=0.0 2023-12-04 00:34:55,957 INFO [train.py:1087] (1/4) Epoch 7, batch 450, loss[loss=0.2336, simple_loss=0.3107, pruned_loss=0.07824, over 24505.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3145, pruned_loss=0.07938, over 4319363.79 frames. 
], batch size: 77, lr: 2.84e-02, grad_scale: 32.0 2023-12-04 00:34:58,335 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=38800.0, ans=0.0 2023-12-04 00:35:02,640 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=38800.0, ans=0.1 2023-12-04 00:35:06,812 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=38866.666666666664, ans=0.0 2023-12-04 00:35:15,019 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=38866.666666666664, ans=0.125 2023-12-04 00:35:15,413 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.15 vs. limit=22.5 2023-12-04 00:35:17,108 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=38933.333333333336, ans=0.025 2023-12-04 00:35:23,584 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=38933.333333333336, ans=0.125 2023-12-04 00:35:27,148 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-12-04 00:35:29,939 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=39000.0, ans=0.125 2023-12-04 00:35:32,022 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=39000.0, ans=0.0 2023-12-04 00:35:33,396 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.62 vs. limit=6.0 2023-12-04 00:35:45,046 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=39066.666666666664, ans=0.5 2023-12-04 00:35:51,173 INFO [train.py:1087] (1/4) Epoch 7, batch 500, loss[loss=0.2482, simple_loss=0.3252, pruned_loss=0.08559, over 24493.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3143, pruned_loss=0.07951, over 4414092.59 frames. ], batch size: 77, lr: 2.83e-02, grad_scale: 32.0 2023-12-04 00:36:15,859 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=39266.666666666664, ans=10.0 2023-12-04 00:36:19,438 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=39266.666666666664, ans=0.2 2023-12-04 00:36:28,274 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.39 vs. limit=22.5 2023-12-04 00:36:35,204 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.950e+02 2.350e+02 2.840e+02 4.583e+02, threshold=4.699e+02, percent-clipped=0.0 2023-12-04 00:36:46,551 INFO [train.py:1087] (1/4) Epoch 7, batch 550, loss[loss=0.2656, simple_loss=0.3326, pruned_loss=0.0993, over 22829.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3144, pruned_loss=0.07964, over 4500049.55 frames. 
], batch size: 106, lr: 2.83e-02, grad_scale: 32.0 2023-12-04 00:36:46,872 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=39466.666666666664, ans=0.125 2023-12-04 00:36:52,412 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=39466.666666666664, ans=0.125 2023-12-04 00:36:55,892 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.49 vs. limit=15.0 2023-12-04 00:36:58,889 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=39533.333333333336, ans=0.2 2023-12-04 00:36:59,419 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.26 vs. limit=15.0 2023-12-04 00:37:22,661 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=39666.666666666664, ans=0.1 2023-12-04 00:37:27,917 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=39666.666666666664, ans=0.0 2023-12-04 00:37:30,383 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.99 vs. limit=15.0 2023-12-04 00:37:35,351 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=39733.333333333336, ans=0.0022318840579710142 2023-12-04 00:37:41,620 INFO [train.py:1087] (1/4) Epoch 7, batch 600, loss[loss=0.2243, simple_loss=0.3065, pruned_loss=0.07104, over 24716.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3137, pruned_loss=0.07925, over 4566730.68 frames. ], batch size: 67, lr: 2.83e-02, grad_scale: 32.0 2023-12-04 00:37:48,056 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=39800.0, ans=0.0 2023-12-04 00:37:50,596 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.26 vs. limit=15.0 2023-12-04 00:37:53,447 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=39866.666666666664, ans=0.125 2023-12-04 00:37:56,964 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=39866.666666666664, ans=0.1 2023-12-04 00:37:58,089 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=39866.666666666664, ans=0.0022028985507246386 2023-12-04 00:38:26,557 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 2.019e+02 2.331e+02 2.930e+02 4.659e+02, threshold=4.662e+02, percent-clipped=0.0 2023-12-04 00:38:37,661 INFO [train.py:1087] (1/4) Epoch 7, batch 650, loss[loss=0.2322, simple_loss=0.311, pruned_loss=0.07674, over 24767.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3126, pruned_loss=0.07827, over 4637699.15 frames. ], batch size: 65, lr: 2.82e-02, grad_scale: 32.0 2023-12-04 00:38:57,587 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.84 vs. 
limit=12.0 2023-12-04 00:39:03,992 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=40266.666666666664, ans=0.125 2023-12-04 00:39:32,046 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=40400.0, ans=0.125 2023-12-04 00:39:34,257 INFO [train.py:1087] (1/4) Epoch 7, batch 700, loss[loss=0.2324, simple_loss=0.3122, pruned_loss=0.07625, over 24181.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3127, pruned_loss=0.07828, over 4681794.11 frames. ], batch size: 58, lr: 2.82e-02, grad_scale: 32.0 2023-12-04 00:39:42,350 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.79 vs. limit=22.5 2023-12-04 00:39:44,335 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.38 vs. limit=10.0 2023-12-04 00:40:01,177 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=40600.0, ans=0.1 2023-12-04 00:40:09,072 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=40666.666666666664, ans=0.125 2023-12-04 00:40:18,500 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.925e+02 2.239e+02 2.672e+02 4.779e+02, threshold=4.479e+02, percent-clipped=1.0 2023-12-04 00:40:30,508 INFO [train.py:1087] (1/4) Epoch 7, batch 750, loss[loss=0.2246, simple_loss=0.303, pruned_loss=0.07313, over 24854.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3124, pruned_loss=0.07798, over 4719292.29 frames. ], batch size: 68, lr: 2.81e-02, grad_scale: 32.0 2023-12-04 00:40:32,977 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=40800.0, ans=0.125 2023-12-04 00:40:45,192 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten.whitening_limit, batch_count=40866.666666666664, ans=15.0 2023-12-04 00:41:26,407 INFO [train.py:1087] (1/4) Epoch 7, batch 800, loss[loss=0.2362, simple_loss=0.3117, pruned_loss=0.08034, over 24743.00 frames. ], tot_loss[loss=0.234, simple_loss=0.312, pruned_loss=0.07802, over 4730075.32 frames. 
], batch size: 61, lr: 2.81e-02, grad_scale: 32.0 2023-12-04 00:41:28,698 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=41133.333333333336, ans=0.1 2023-12-04 00:41:33,017 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=41133.333333333336, ans=0.125 2023-12-04 00:41:43,116 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=41200.0, ans=0.125 2023-12-04 00:41:46,144 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=41266.666666666664, ans=0.001898550724637682 2023-12-04 00:41:50,330 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=41266.666666666664, ans=0.0 2023-12-04 00:41:55,345 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=41266.666666666664, ans=0.2 2023-12-04 00:41:57,755 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 00:42:03,743 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=41333.333333333336, ans=0.1 2023-12-04 00:42:06,609 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=41400.0, ans=0.1 2023-12-04 00:42:07,421 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.550e+02 2.055e+02 2.330e+02 2.823e+02 4.114e+02, threshold=4.659e+02, percent-clipped=0.0 2023-12-04 00:42:12,967 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=41400.0, ans=15.0 2023-12-04 00:42:17,743 INFO [train.py:1087] (1/4) Epoch 7, batch 850, loss[loss=0.2294, simple_loss=0.3102, pruned_loss=0.07424, over 24798.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3118, pruned_loss=0.07787, over 4754583.85 frames. ], batch size: 71, lr: 2.80e-02, grad_scale: 32.0 2023-12-04 00:42:23,833 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=41466.666666666664, ans=0.001855072463768117 2023-12-04 00:42:24,943 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=41466.666666666664, ans=0.0 2023-12-04 00:42:25,905 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=41466.666666666664, ans=0.0 2023-12-04 00:42:27,827 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=41533.333333333336, ans=0.125 2023-12-04 00:42:27,846 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=41533.333333333336, ans=0.125 2023-12-04 00:42:30,179 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.95 vs. 
limit=22.5 2023-12-04 00:42:36,828 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=41600.0, ans=0.125 2023-12-04 00:42:45,017 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=41600.0, ans=0.125 2023-12-04 00:42:51,169 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=41666.666666666664, ans=0.125 2023-12-04 00:43:20,709 INFO [train.py:1087] (1/4) Epoch 8, batch 0, loss[loss=0.2387, simple_loss=0.3161, pruned_loss=0.0806, over 24021.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3161, pruned_loss=0.0806, over 24021.00 frames. ], batch size: 87, lr: 2.64e-02, grad_scale: 32.0 2023-12-04 00:43:20,710 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 00:43:32,845 INFO [train.py:1119] (1/4) Epoch 8, validation: loss=0.1925, simple_loss=0.2931, pruned_loss=0.04594, over 944034.00 frames. 2023-12-04 00:43:32,845 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 00:44:03,612 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.45 vs. limit=15.0 2023-12-04 00:44:16,379 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.25 vs. limit=15.0 2023-12-04 00:44:22,024 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.593e+02 1.906e+02 2.199e+02 2.514e+02 5.001e+02, threshold=4.398e+02, percent-clipped=1.0 2023-12-04 00:44:27,264 INFO [train.py:1087] (1/4) Epoch 8, batch 50, loss[loss=0.2313, simple_loss=0.3107, pruned_loss=0.07601, over 20887.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3113, pruned_loss=0.07609, over 1100413.85 frames. ], batch size: 50, lr: 2.63e-02, grad_scale: 32.0 2023-12-04 00:44:31,164 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.21 vs. limit=15.0 2023-12-04 00:44:33,897 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=42100.0, ans=0.2 2023-12-04 00:44:35,989 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-04 00:44:43,801 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.86 vs. limit=15.0 2023-12-04 00:45:04,726 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=42300.0, ans=0.0016739130434782618 2023-12-04 00:45:21,891 INFO [train.py:1087] (1/4) Epoch 8, batch 100, loss[loss=0.3022, simple_loss=0.353, pruned_loss=0.1257, over 16698.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.311, pruned_loss=0.07641, over 1912883.01 frames. ], batch size: 178, lr: 2.63e-02, grad_scale: 32.0 2023-12-04 00:45:33,797 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=15.0 2023-12-04 00:45:40,069 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.55 vs. 
limit=10.0 2023-12-04 00:45:48,196 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=42566.666666666664, ans=0.1 2023-12-04 00:45:58,474 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.60 vs. limit=12.0 2023-12-04 00:46:02,880 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=42633.333333333336, ans=0.125 2023-12-04 00:46:08,315 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=42700.0, ans=0.1 2023-12-04 00:46:13,387 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.850e+02 2.203e+02 2.662e+02 5.574e+02, threshold=4.406e+02, percent-clipped=2.0 2023-12-04 00:46:15,732 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=42700.0, ans=0.125 2023-12-04 00:46:16,810 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=42766.666666666664, ans=0.125 2023-12-04 00:46:17,662 INFO [train.py:1087] (1/4) Epoch 8, batch 150, loss[loss=0.2395, simple_loss=0.3197, pruned_loss=0.0797, over 24804.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3094, pruned_loss=0.07484, over 2567745.41 frames. ], batch size: 62, lr: 2.62e-02, grad_scale: 16.0 2023-12-04 00:46:28,648 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=42833.333333333336, ans=0.125 2023-12-04 00:46:36,733 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-12-04 00:46:43,968 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=42900.0, ans=0.125 2023-12-04 00:46:55,725 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.47 vs. limit=15.0 2023-12-04 00:47:07,756 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=43033.333333333336, ans=0.125 2023-12-04 00:47:13,945 INFO [train.py:1087] (1/4) Epoch 8, batch 200, loss[loss=0.2218, simple_loss=0.3064, pruned_loss=0.06858, over 24745.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3089, pruned_loss=0.07484, over 3059387.59 frames. ], batch size: 63, lr: 2.62e-02, grad_scale: 16.0 2023-12-04 00:47:24,259 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.63 vs. limit=15.0 2023-12-04 00:47:34,235 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=43166.666666666664, ans=0.125 2023-12-04 00:47:44,096 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=43233.333333333336, ans=0.0014710144927536223 2023-12-04 00:47:44,774 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.12 vs. 
limit=10.0 2023-12-04 00:47:47,333 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=43300.0, ans=0.0 2023-12-04 00:47:51,580 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=43300.0, ans=0.5 2023-12-04 00:47:51,609 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=43300.0, ans=0.125 2023-12-04 00:47:57,411 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=43366.666666666664, ans=0.05 2023-12-04 00:48:04,957 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.325e+02 1.946e+02 2.243e+02 2.682e+02 4.927e+02, threshold=4.486e+02, percent-clipped=1.0 2023-12-04 00:48:09,311 INFO [train.py:1087] (1/4) Epoch 8, batch 250, loss[loss=0.2118, simple_loss=0.2896, pruned_loss=0.06707, over 24761.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3087, pruned_loss=0.07465, over 3456900.33 frames. ], batch size: 66, lr: 2.61e-02, grad_scale: 16.0 2023-12-04 00:48:16,595 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.57 vs. limit=22.5 2023-12-04 00:48:36,190 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=43566.666666666664, ans=0.125 2023-12-04 00:48:38,770 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.36 vs. limit=10.0 2023-12-04 00:48:44,050 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.47 vs. limit=15.0 2023-12-04 00:48:50,709 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.81 vs. limit=15.0 2023-12-04 00:48:59,555 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=43700.0, ans=0.0 2023-12-04 00:49:05,043 INFO [train.py:1087] (1/4) Epoch 8, batch 300, loss[loss=0.2336, simple_loss=0.3102, pruned_loss=0.07848, over 23985.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3091, pruned_loss=0.075, over 3757820.04 frames. ], batch size: 87, lr: 2.61e-02, grad_scale: 16.0 2023-12-04 00:49:06,438 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=43766.666666666664, ans=0.125 2023-12-04 00:49:30,976 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=43900.0, ans=0.0 2023-12-04 00:49:36,512 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=43900.0, ans=0.0 2023-12-04 00:49:54,607 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.44 vs. 
limit=22.5 2023-12-04 00:49:56,103 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.870e+02 2.164e+02 2.569e+02 7.767e+02, threshold=4.327e+02, percent-clipped=1.0 2023-12-04 00:50:00,876 INFO [train.py:1087] (1/4) Epoch 8, batch 350, loss[loss=0.2057, simple_loss=0.2912, pruned_loss=0.06009, over 24745.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3086, pruned_loss=0.07485, over 3996483.15 frames. ], batch size: 63, lr: 2.61e-02, grad_scale: 16.0 2023-12-04 00:50:04,007 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.36 vs. limit=10.0 2023-12-04 00:50:24,161 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.73 vs. limit=12.0 2023-12-04 00:50:31,383 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=44233.333333333336, ans=0.125 2023-12-04 00:50:39,613 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=44300.0, ans=0.1 2023-12-04 00:50:46,991 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=44366.666666666664, ans=0.125 2023-12-04 00:50:56,838 INFO [train.py:1087] (1/4) Epoch 8, batch 400, loss[loss=0.2118, simple_loss=0.2928, pruned_loss=0.06541, over 24752.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3081, pruned_loss=0.07463, over 4173167.08 frames. ], batch size: 66, lr: 2.60e-02, grad_scale: 32.0 2023-12-04 00:51:00,720 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.01 vs. limit=22.5 2023-12-04 00:51:10,962 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=44500.0, ans=0.125 2023-12-04 00:51:17,580 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.00 vs. limit=15.0 2023-12-04 00:51:20,707 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=44566.666666666664, ans=0.07 2023-12-04 00:51:21,639 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=44566.666666666664, ans=0.0 2023-12-04 00:51:34,863 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=44633.333333333336, ans=0.1 2023-12-04 00:51:44,368 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=44700.0, ans=0.025 2023-12-04 00:51:48,402 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.509e+02 1.929e+02 2.224e+02 2.693e+02 3.819e+02, threshold=4.448e+02, percent-clipped=0.0 2023-12-04 00:51:52,692 INFO [train.py:1087] (1/4) Epoch 8, batch 450, loss[loss=0.2237, simple_loss=0.3063, pruned_loss=0.07058, over 24735.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3077, pruned_loss=0.07424, over 4317208.73 frames. 
], batch size: 61, lr: 2.60e-02, grad_scale: 32.0 2023-12-04 00:52:06,682 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=44833.333333333336, ans=0.0 2023-12-04 00:52:11,625 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=44833.333333333336, ans=0.2 2023-12-04 00:52:15,873 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=44900.0, ans=0.1 2023-12-04 00:52:46,410 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.60 vs. limit=22.5 2023-12-04 00:52:48,926 INFO [train.py:1087] (1/4) Epoch 8, batch 500, loss[loss=0.2276, simple_loss=0.3063, pruned_loss=0.07447, over 24571.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3079, pruned_loss=0.07472, over 4416458.29 frames. ], batch size: 65, lr: 2.59e-02, grad_scale: 32.0 2023-12-04 00:52:51,394 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=45100.0, ans=0.125 2023-12-04 00:52:52,264 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=45100.0, ans=0.125 2023-12-04 00:53:39,030 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.796e+02 2.227e+02 2.542e+02 4.822e+02, threshold=4.453e+02, percent-clipped=1.0 2023-12-04 00:53:44,487 INFO [train.py:1087] (1/4) Epoch 8, batch 550, loss[loss=0.2157, simple_loss=0.2991, pruned_loss=0.06619, over 24708.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3078, pruned_loss=0.07479, over 4499381.70 frames. ], batch size: 69, lr: 2.59e-02, grad_scale: 32.0 2023-12-04 00:53:48,284 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=45433.333333333336, ans=0.125 2023-12-04 00:53:52,598 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=45433.333333333336, ans=0.1 2023-12-04 00:54:18,221 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=45633.333333333336, ans=0.1 2023-12-04 00:54:31,122 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=45700.0, ans=0.125 2023-12-04 00:54:39,356 INFO [train.py:1087] (1/4) Epoch 8, batch 600, loss[loss=0.2182, simple_loss=0.3024, pruned_loss=0.06702, over 24727.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3077, pruned_loss=0.07456, over 4569481.45 frames. ], batch size: 67, lr: 2.58e-02, grad_scale: 32.0 2023-12-04 00:54:41,906 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.36 vs. limit=6.0 2023-12-04 00:54:52,034 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.22 vs. limit=15.0 2023-12-04 00:55:08,059 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.88 vs. 
limit=15.0 2023-12-04 00:55:31,375 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.467e+02 1.959e+02 2.312e+02 2.796e+02 4.187e+02, threshold=4.624e+02, percent-clipped=0.0 2023-12-04 00:55:35,672 INFO [train.py:1087] (1/4) Epoch 8, batch 650, loss[loss=0.2, simple_loss=0.287, pruned_loss=0.05647, over 24550.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3075, pruned_loss=0.07435, over 4618487.26 frames. ], batch size: 62, lr: 2.58e-02, grad_scale: 16.0 2023-12-04 00:56:19,584 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=46366.666666666664, ans=0.125 2023-12-04 00:56:32,268 INFO [train.py:1087] (1/4) Epoch 8, batch 700, loss[loss=0.2332, simple_loss=0.3083, pruned_loss=0.07911, over 24483.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3074, pruned_loss=0.07464, over 4655355.71 frames. ], batch size: 77, lr: 2.58e-02, grad_scale: 16.0 2023-12-04 00:56:41,004 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 00:56:45,157 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=46500.0, ans=0.125 2023-12-04 00:56:45,229 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=46500.0, ans=0.125 2023-12-04 00:56:50,944 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=46500.0, ans=0.125 2023-12-04 00:56:52,738 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-04 00:56:57,985 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=46566.666666666664, ans=0.5 2023-12-04 00:57:13,520 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=46633.333333333336, ans=0.125 2023-12-04 00:57:24,715 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.828e+02 2.074e+02 2.357e+02 1.042e+03, threshold=4.149e+02, percent-clipped=1.0 2023-12-04 00:57:27,951 INFO [train.py:1087] (1/4) Epoch 8, batch 750, loss[loss=0.2345, simple_loss=0.3145, pruned_loss=0.07727, over 24795.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.308, pruned_loss=0.07485, over 4675066.96 frames. ], batch size: 73, lr: 2.57e-02, grad_scale: 16.0 2023-12-04 00:57:31,479 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=46766.666666666664, ans=0.0007028985507246382 2023-12-04 00:57:49,007 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=46900.0, ans=0.0 2023-12-04 00:57:50,412 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.92 vs. 
limit=15.0 2023-12-04 00:58:07,585 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=46966.666666666664, ans=0.125 2023-12-04 00:58:11,902 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=47033.333333333336, ans=0.0 2023-12-04 00:58:23,772 INFO [train.py:1087] (1/4) Epoch 8, batch 800, loss[loss=0.253, simple_loss=0.3283, pruned_loss=0.08891, over 23515.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3077, pruned_loss=0.07481, over 4692775.73 frames. ], batch size: 94, lr: 2.57e-02, grad_scale: 32.0 2023-12-04 00:58:30,032 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.12 vs. limit=15.0 2023-12-04 00:58:33,641 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=47166.666666666664, ans=0.0 2023-12-04 00:59:01,358 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.50 vs. limit=15.0 2023-12-04 00:59:11,669 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.773e+02 2.012e+02 2.288e+02 3.220e+02, threshold=4.025e+02, percent-clipped=0.0 2023-12-04 00:59:13,418 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.14 vs. limit=15.0 2023-12-04 00:59:14,711 INFO [train.py:1087] (1/4) Epoch 8, batch 850, loss[loss=0.233, simple_loss=0.3029, pruned_loss=0.08153, over 24044.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3068, pruned_loss=0.07421, over 4724629.03 frames. ], batch size: 87, lr: 2.56e-02, grad_scale: 32.0 2023-12-04 00:59:14,872 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=47433.333333333336, ans=0.2 2023-12-04 00:59:32,079 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.31 vs. limit=22.5 2023-12-04 00:59:33,195 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.57 vs. limit=15.0 2023-12-04 00:59:41,910 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=47566.666666666664, ans=0.0 2023-12-04 00:59:42,988 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=47566.666666666664, ans=0.125 2023-12-04 00:59:44,990 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=47633.333333333336, ans=0.125 2023-12-04 00:59:51,623 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.89 vs. limit=6.0 2023-12-04 01:00:16,315 INFO [train.py:1087] (1/4) Epoch 9, batch 0, loss[loss=0.2057, simple_loss=0.2946, pruned_loss=0.05841, over 24781.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2946, pruned_loss=0.05841, over 24781.00 frames. 
], batch size: 72, lr: 2.42e-02, grad_scale: 32.0 2023-12-04 01:00:16,315 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 01:00:28,728 INFO [train.py:1119] (1/4) Epoch 9, validation: loss=0.1852, simple_loss=0.2876, pruned_loss=0.04143, over 944034.00 frames. 2023-12-04 01:00:28,729 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 01:00:49,448 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=47800.0, ans=0.125 2023-12-04 01:00:58,991 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=47866.666666666664, ans=0.0 2023-12-04 01:01:23,153 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=48000.0, ans=0.125 2023-12-04 01:01:25,134 INFO [train.py:1087] (1/4) Epoch 9, batch 50, loss[loss=0.2369, simple_loss=0.3137, pruned_loss=0.08007, over 24772.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3051, pruned_loss=0.07184, over 1099111.65 frames. ], batch size: 64, lr: 2.42e-02, grad_scale: 32.0 2023-12-04 01:01:27,192 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 1.995e+02 2.275e+02 2.569e+02 4.783e+02, threshold=4.550e+02, percent-clipped=4.0 2023-12-04 01:01:34,681 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=48133.333333333336, ans=0.2 2023-12-04 01:01:36,850 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=48133.333333333336, ans=0.1 2023-12-04 01:01:48,671 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.06 vs. limit=10.0 2023-12-04 01:01:58,543 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=48266.666666666664, ans=0.1 2023-12-04 01:02:12,345 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=48333.333333333336, ans=0.0 2023-12-04 01:02:17,972 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=48333.333333333336, ans=0.0 2023-12-04 01:02:20,331 INFO [train.py:1087] (1/4) Epoch 9, batch 100, loss[loss=0.2118, simple_loss=0.2969, pruned_loss=0.06335, over 24556.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3043, pruned_loss=0.0714, over 1938260.69 frames. 
], batch size: 63, lr: 2.41e-02, grad_scale: 32.0 2023-12-04 01:02:31,806 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=48466.666666666664, ans=0.0 2023-12-04 01:02:34,001 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=48466.666666666664, ans=0.0 2023-12-04 01:02:44,282 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=48533.333333333336, ans=0.125 2023-12-04 01:02:49,716 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=48533.333333333336, ans=0.125 2023-12-04 01:02:51,756 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=48600.0, ans=0.0 2023-12-04 01:02:53,204 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 01:02:53,285 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=48600.0, ans=0.0 2023-12-04 01:02:57,148 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=48600.0, ans=0.0 2023-12-04 01:03:04,975 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.99 vs. limit=15.0 2023-12-04 01:03:14,900 INFO [train.py:1087] (1/4) Epoch 9, batch 150, loss[loss=0.2189, simple_loss=0.2997, pruned_loss=0.06905, over 24857.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3037, pruned_loss=0.07069, over 2591859.89 frames. ], batch size: 68, lr: 2.41e-02, grad_scale: 32.0 2023-12-04 01:03:17,017 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.784e+02 1.989e+02 2.181e+02 3.017e+02, threshold=3.978e+02, percent-clipped=0.0 2023-12-04 01:04:10,643 INFO [train.py:1087] (1/4) Epoch 9, batch 200, loss[loss=0.218, simple_loss=0.2993, pruned_loss=0.06832, over 24726.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3037, pruned_loss=0.07099, over 3075596.92 frames. ], batch size: 61, lr: 2.41e-02, grad_scale: 32.0 2023-12-04 01:04:15,620 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-12-04 01:04:29,633 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 01:04:36,562 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=49200.0, ans=0.125 2023-12-04 01:04:49,078 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=49266.666666666664, ans=0.1 2023-12-04 01:04:53,402 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=49333.333333333336, ans=0.125 2023-12-04 01:05:06,624 INFO [train.py:1087] (1/4) Epoch 9, batch 250, loss[loss=0.2237, simple_loss=0.306, pruned_loss=0.07071, over 24081.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3033, pruned_loss=0.07075, over 3461235.34 frames. 
], batch size: 87, lr: 2.40e-02, grad_scale: 32.0 2023-12-04 01:05:08,696 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.304e+02 1.787e+02 2.020e+02 2.356e+02 3.922e+02, threshold=4.039e+02, percent-clipped=0.0 2023-12-04 01:05:14,297 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=49400.0, ans=0.2 2023-12-04 01:05:24,888 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=49466.666666666664, ans=0.125 2023-12-04 01:05:26,586 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.48 vs. limit=6.0 2023-12-04 01:05:29,637 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 01:05:37,104 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=49533.333333333336, ans=0.1 2023-12-04 01:05:39,213 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=49600.0, ans=0.1 2023-12-04 01:05:53,173 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=49666.666666666664, ans=0.0 2023-12-04 01:05:53,252 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=49666.666666666664, ans=0.0 2023-12-04 01:05:56,380 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=49666.666666666664, ans=0.125 2023-12-04 01:05:57,623 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0 2023-12-04 01:06:02,164 INFO [train.py:1087] (1/4) Epoch 9, batch 300, loss[loss=0.2292, simple_loss=0.3103, pruned_loss=0.07405, over 21302.00 frames. ], tot_loss[loss=0.222, simple_loss=0.303, pruned_loss=0.07052, over 3748586.30 frames. ], batch size: 127, lr: 2.40e-02, grad_scale: 32.0 2023-12-04 01:06:05,996 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=49733.333333333336, ans=0.125 2023-12-04 01:06:17,081 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=49800.0, ans=0.125 2023-12-04 01:06:25,691 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=49866.666666666664, ans=0.125 2023-12-04 01:06:32,734 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.37 vs. 
limit=15.0 2023-12-04 01:06:38,345 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=49933.333333333336, ans=0.0 2023-12-04 01:06:47,181 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=50000.0, ans=0.125 2023-12-04 01:06:48,122 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=50000.0, ans=0.0 2023-12-04 01:06:49,673 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.01 vs. limit=22.5 2023-12-04 01:06:53,624 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=50000.0, ans=0.1 2023-12-04 01:06:57,514 INFO [train.py:1087] (1/4) Epoch 9, batch 350, loss[loss=0.207, simple_loss=0.2936, pruned_loss=0.06021, over 24775.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.303, pruned_loss=0.07068, over 3975732.99 frames. ], batch size: 73, lr: 2.39e-02, grad_scale: 32.0 2023-12-04 01:06:59,591 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.894e+02 2.166e+02 2.413e+02 4.274e+02, threshold=4.333e+02, percent-clipped=1.0 2023-12-04 01:07:07,482 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=50066.666666666664, ans=0.125 2023-12-04 01:07:23,808 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=50200.0, ans=0.0 2023-12-04 01:07:53,561 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.05 vs. limit=15.0 2023-12-04 01:07:53,846 INFO [train.py:1087] (1/4) Epoch 9, batch 400, loss[loss=0.2337, simple_loss=0.308, pruned_loss=0.07967, over 23741.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3027, pruned_loss=0.07044, over 4168554.36 frames. ], batch size: 57, lr: 2.39e-02, grad_scale: 32.0 2023-12-04 01:08:03,627 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=50466.666666666664, ans=0.125 2023-12-04 01:08:13,109 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=50466.666666666664, ans=0.0 2023-12-04 01:08:43,250 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=50666.666666666664, ans=0.125 2023-12-04 01:08:49,403 INFO [train.py:1087] (1/4) Epoch 9, batch 450, loss[loss=0.2205, simple_loss=0.305, pruned_loss=0.06801, over 24774.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3031, pruned_loss=0.07112, over 4284667.72 frames. ], batch size: 71, lr: 2.39e-02, grad_scale: 32.0 2023-12-04 01:08:51,476 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.453e+02 1.723e+02 1.991e+02 2.313e+02 3.429e+02, threshold=3.983e+02, percent-clipped=0.0 2023-12-04 01:08:58,574 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=50733.333333333336, ans=0.05 2023-12-04 01:08:58,709 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.87 vs. 
limit=15.0 2023-12-04 01:09:08,537 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=50800.0, ans=0.0 2023-12-04 01:09:09,872 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=50800.0, ans=0.025 2023-12-04 01:09:14,497 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=50866.666666666664, ans=0.04949747468305833 2023-12-04 01:09:30,732 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=50933.333333333336, ans=0.025 2023-12-04 01:09:31,803 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=50933.333333333336, ans=0.1 2023-12-04 01:09:41,317 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.59 vs. limit=22.5 2023-12-04 01:09:45,261 INFO [train.py:1087] (1/4) Epoch 9, batch 500, loss[loss=0.2531, simple_loss=0.3291, pruned_loss=0.08854, over 21360.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3022, pruned_loss=0.07038, over 4417000.38 frames. ], batch size: 127, lr: 2.38e-02, grad_scale: 32.0 2023-12-04 01:09:56,121 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=51133.333333333336, ans=0.125 2023-12-04 01:09:58,148 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=51133.333333333336, ans=0.125 2023-12-04 01:09:59,262 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=51133.333333333336, ans=0.125 2023-12-04 01:10:19,586 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=51266.666666666664, ans=0.1 2023-12-04 01:10:40,054 INFO [train.py:1087] (1/4) Epoch 9, batch 550, loss[loss=0.2227, simple_loss=0.3059, pruned_loss=0.06976, over 24488.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3032, pruned_loss=0.07105, over 4491698.68 frames. ], batch size: 75, lr: 2.38e-02, grad_scale: 32.0 2023-12-04 01:10:42,509 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.303e+02 1.788e+02 2.021e+02 2.273e+02 3.294e+02, threshold=4.042e+02, percent-clipped=0.0 2023-12-04 01:10:46,406 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=51400.0, ans=0.125 2023-12-04 01:10:55,881 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=51466.666666666664, ans=0.1 2023-12-04 01:11:18,122 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=51600.0, ans=0.0 2023-12-04 01:11:33,122 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=51666.666666666664, ans=0.2 2023-12-04 01:11:34,965 INFO [train.py:1087] (1/4) Epoch 9, batch 600, loss[loss=0.2429, simple_loss=0.3218, pruned_loss=0.08204, over 21485.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3028, pruned_loss=0.07065, over 4570819.19 frames. 
], batch size: 127, lr: 2.37e-02, grad_scale: 32.0 2023-12-04 01:12:26,749 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=52000.0, ans=0.0 2023-12-04 01:12:30,797 INFO [train.py:1087] (1/4) Epoch 9, batch 650, loss[loss=0.2105, simple_loss=0.2953, pruned_loss=0.06285, over 24719.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3025, pruned_loss=0.07063, over 4618544.85 frames. ], batch size: 69, lr: 2.37e-02, grad_scale: 32.0 2023-12-04 01:12:32,094 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=52066.666666666664, ans=0.0 2023-12-04 01:12:32,885 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.872e+02 2.101e+02 2.365e+02 3.683e+02, threshold=4.202e+02, percent-clipped=0.0 2023-12-04 01:12:34,207 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=52066.666666666664, ans=0.125 2023-12-04 01:12:35,295 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=52066.666666666664, ans=0.125 2023-12-04 01:12:45,256 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=52133.333333333336, ans=0.1 2023-12-04 01:12:51,730 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.42 vs. limit=15.0 2023-12-04 01:12:55,599 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=52200.0, ans=0.0 2023-12-04 01:13:20,198 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.90 vs. limit=15.0 2023-12-04 01:13:22,323 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=52333.333333333336, ans=0.2 2023-12-04 01:13:26,273 INFO [train.py:1087] (1/4) Epoch 9, batch 700, loss[loss=0.207, simple_loss=0.2829, pruned_loss=0.0655, over 24546.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.3014, pruned_loss=0.06974, over 4676180.16 frames. ], batch size: 63, lr: 2.37e-02, grad_scale: 32.0 2023-12-04 01:14:15,473 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=52666.666666666664, ans=0.125 2023-12-04 01:14:18,552 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=52666.666666666664, ans=0.0 2023-12-04 01:14:21,185 INFO [train.py:1087] (1/4) Epoch 9, batch 750, loss[loss=0.2148, simple_loss=0.2941, pruned_loss=0.06775, over 24571.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.301, pruned_loss=0.06941, over 4702963.81 frames. 
], batch size: 64, lr: 2.36e-02, grad_scale: 32.0 2023-12-04 01:14:23,656 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.379e+02 1.728e+02 1.920e+02 2.174e+02 3.625e+02, threshold=3.841e+02, percent-clipped=0.0 2023-12-04 01:14:48,702 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=52866.666666666664, ans=0.125 2023-12-04 01:14:56,900 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=52933.333333333336, ans=0.0 2023-12-04 01:14:58,967 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=52933.333333333336, ans=0.2 2023-12-04 01:15:07,504 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=53000.0, ans=0.0 2023-12-04 01:15:12,724 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=53000.0, ans=0.0 2023-12-04 01:15:15,727 INFO [train.py:1087] (1/4) Epoch 9, batch 800, loss[loss=0.2188, simple_loss=0.3055, pruned_loss=0.06608, over 24781.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.3008, pruned_loss=0.06913, over 4736437.85 frames. ], batch size: 73, lr: 2.36e-02, grad_scale: 32.0 2023-12-04 01:15:21,779 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.16 vs. limit=22.5 2023-12-04 01:15:35,109 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.76 vs. limit=6.0 2023-12-04 01:16:08,194 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=53333.333333333336, ans=0.125 2023-12-04 01:16:10,151 INFO [train.py:1087] (1/4) Epoch 9, batch 850, loss[loss=0.2153, simple_loss=0.2991, pruned_loss=0.06577, over 24782.00 frames. ], tot_loss[loss=0.22, simple_loss=0.301, pruned_loss=0.06949, over 4753896.88 frames. ], batch size: 71, lr: 2.36e-02, grad_scale: 16.0 2023-12-04 01:16:13,086 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.407e+02 1.782e+02 1.992e+02 2.397e+02 4.275e+02, threshold=3.984e+02, percent-clipped=1.0 2023-12-04 01:16:35,440 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=53533.333333333336, ans=0.125 2023-12-04 01:17:12,454 INFO [train.py:1087] (1/4) Epoch 10, batch 0, loss[loss=0.2001, simple_loss=0.2879, pruned_loss=0.05619, over 24761.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2879, pruned_loss=0.05619, over 24761.00 frames. ], batch size: 66, lr: 2.24e-02, grad_scale: 32.0 2023-12-04 01:17:12,455 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 01:17:24,572 INFO [train.py:1119] (1/4) Epoch 10, validation: loss=0.1824, simple_loss=0.2846, pruned_loss=0.0401, over 944034.00 frames. 
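Note on the loss figures reported in these entries: the per-batch loss[...] values and the running tot_loss[... over N frames] values are frame-weighted, so a batch contributes to the running average in proportion to how many feature frames it contains. The following is only an illustrative sketch of such an accumulator, written here for clarity; it is not the tracker used by train.py, and the class name RunningLoss and its methods are hypothetical.

from dataclasses import dataclass, field

@dataclass
class RunningLoss:
    """Accumulate frame-weighted sums of several loss terms (illustrative sketch)."""
    sums: dict = field(default_factory=dict)   # name -> sum(value * num_frames)
    frames: float = 0.0                        # total frames seen so far

    def update(self, losses: dict, num_frames: float) -> None:
        # Weight each per-batch value by the number of frames in that batch.
        for name, value in losses.items():
            self.sums[name] = self.sums.get(name, 0.0) + value * num_frames
        self.frames += num_frames

    def averages(self) -> dict:
        # Frame-weighted means, matching the "over N frames" convention in the log.
        return {name: s / self.frames for name, s in self.sums.items()}

# Usage with two batch-level figures taken from the surrounding log entries
# (Epoch 10, batch 0 and batch 50):
tracker = RunningLoss()
tracker.update({"loss": 0.2001, "simple_loss": 0.2879, "pruned_loss": 0.05619}, 24761)
tracker.update({"loss": 0.2023, "simple_loss": 0.2862, "pruned_loss": 0.05925}, 24701)
print(tracker.averages())  # frame-weighted means over 49462 frames

With this weighting, batches containing more frames dominate the reported tot_loss, which is why the running figure moves much more smoothly than the per-batch loss printed next to it.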
2023-12-04 01:17:24,573 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 01:17:29,980 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=53700.0, ans=0.0 2023-12-04 01:17:43,881 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=53766.666666666664, ans=0.125 2023-12-04 01:17:44,041 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=53766.666666666664, ans=0.1 2023-12-04 01:17:53,889 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=53833.333333333336, ans=0.0 2023-12-04 01:18:09,860 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=53966.666666666664, ans=0.125 2023-12-04 01:18:11,650 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=53966.666666666664, ans=0.125 2023-12-04 01:18:20,178 INFO [train.py:1087] (1/4) Epoch 10, batch 50, loss[loss=0.2023, simple_loss=0.2862, pruned_loss=0.05925, over 24701.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.3009, pruned_loss=0.06988, over 1079863.99 frames. ], batch size: 69, lr: 2.23e-02, grad_scale: 32.0 2023-12-04 01:18:29,052 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.771e+02 2.016e+02 2.500e+02 4.270e+02, threshold=4.032e+02, percent-clipped=1.0 2023-12-04 01:18:42,970 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=54166.666666666664, ans=0.09899494936611666 2023-12-04 01:18:42,976 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=54166.666666666664, ans=0.125 2023-12-04 01:18:46,409 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=54166.666666666664, ans=0.1 2023-12-04 01:18:59,295 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=54233.333333333336, ans=0.125 2023-12-04 01:18:59,732 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.11 vs. limit=22.5 2023-12-04 01:19:01,561 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=54233.333333333336, ans=0.125 2023-12-04 01:19:12,044 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=54300.0, ans=0.2 2023-12-04 01:19:15,725 INFO [train.py:1087] (1/4) Epoch 10, batch 100, loss[loss=0.2265, simple_loss=0.3088, pruned_loss=0.07205, over 24180.00 frames. ], tot_loss[loss=0.219, simple_loss=0.3, pruned_loss=0.06898, over 1897432.66 frames. ], batch size: 82, lr: 2.23e-02, grad_scale: 32.0 2023-12-04 01:19:43,685 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.51 vs. 
limit=15.0 2023-12-04 01:19:58,568 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=54633.333333333336, ans=0.2 2023-12-04 01:20:00,808 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=54633.333333333336, ans=0.125 2023-12-04 01:20:10,800 INFO [train.py:1087] (1/4) Epoch 10, batch 150, loss[loss=0.1977, simple_loss=0.2817, pruned_loss=0.05682, over 24563.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2992, pruned_loss=0.06846, over 2547918.70 frames. ], batch size: 63, lr: 2.23e-02, grad_scale: 32.0 2023-12-04 01:20:20,274 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.308e+02 1.785e+02 2.047e+02 2.385e+02 4.094e+02, threshold=4.093e+02, percent-clipped=1.0 2023-12-04 01:20:27,984 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 01:20:28,016 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=54766.666666666664, ans=0.0 2023-12-04 01:20:57,713 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.23 vs. limit=22.5 2023-12-04 01:21:05,654 INFO [train.py:1087] (1/4) Epoch 10, batch 200, loss[loss=0.2314, simple_loss=0.3018, pruned_loss=0.08053, over 21768.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2993, pruned_loss=0.06852, over 3039074.36 frames. ], batch size: 128, lr: 2.22e-02, grad_scale: 32.0 2023-12-04 01:21:11,588 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=55033.333333333336, ans=0.1 2023-12-04 01:21:22,285 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=55100.0, ans=0.05 2023-12-04 01:21:28,629 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=55166.666666666664, ans=0.125 2023-12-04 01:21:37,128 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=55166.666666666664, ans=0.1 2023-12-04 01:21:38,154 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=55233.333333333336, ans=0.125 2023-12-04 01:21:41,423 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=55233.333333333336, ans=0.0 2023-12-04 01:21:49,575 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=55300.0, ans=0.1 2023-12-04 01:21:51,470 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.38 vs. limit=15.0 2023-12-04 01:22:01,736 INFO [train.py:1087] (1/4) Epoch 10, batch 250, loss[loss=0.2108, simple_loss=0.2932, pruned_loss=0.06427, over 24729.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2992, pruned_loss=0.0682, over 3435725.90 frames. 
], batch size: 61, lr: 2.22e-02, grad_scale: 32.0 2023-12-04 01:22:09,418 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=55366.666666666664, ans=0.0 2023-12-04 01:22:10,176 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.344e+02 1.785e+02 2.066e+02 2.411e+02 3.389e+02, threshold=4.132e+02, percent-clipped=0.0 2023-12-04 01:22:11,396 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=55433.333333333336, ans=0.5 2023-12-04 01:22:13,591 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=55433.333333333336, ans=0.0 2023-12-04 01:22:15,628 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=55433.333333333336, ans=0.0 2023-12-04 01:22:16,984 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=55433.333333333336, ans=0.125 2023-12-04 01:22:18,134 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=55433.333333333336, ans=0.0 2023-12-04 01:22:29,862 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.77 vs. limit=6.0 2023-12-04 01:22:35,725 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=55566.666666666664, ans=0.0 2023-12-04 01:22:50,347 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=10.54 vs. limit=12.0 2023-12-04 01:22:55,294 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=55633.333333333336, ans=0.0 2023-12-04 01:22:57,213 INFO [train.py:1087] (1/4) Epoch 10, batch 300, loss[loss=0.1987, simple_loss=0.2864, pruned_loss=0.05555, over 24757.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.299, pruned_loss=0.06828, over 3724657.68 frames. ], batch size: 65, lr: 2.21e-02, grad_scale: 32.0 2023-12-04 01:23:02,759 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=55700.0, ans=0.125 2023-12-04 01:23:23,809 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=55833.333333333336, ans=0.2 2023-12-04 01:23:23,914 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=55833.333333333336, ans=0.0 2023-12-04 01:23:36,750 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 01:23:38,863 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 01:23:42,000 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=55966.666666666664, ans=0.1 2023-12-04 01:23:51,966 INFO [train.py:1087] (1/4) Epoch 10, batch 350, loss[loss=0.2138, simple_loss=0.2985, pruned_loss=0.06459, over 24548.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2985, pruned_loss=0.06775, over 3975573.97 frames. 
], batch size: 62, lr: 2.21e-02, grad_scale: 32.0 2023-12-04 01:23:56,917 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=56033.333333333336, ans=0.0 2023-12-04 01:24:00,755 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.739e+02 1.916e+02 2.181e+02 3.253e+02, threshold=3.831e+02, percent-clipped=0.0 2023-12-04 01:24:11,792 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=56100.0, ans=0.0 2023-12-04 01:24:15,008 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=56166.666666666664, ans=0.125 2023-12-04 01:24:28,065 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=56233.333333333336, ans=0.1 2023-12-04 01:24:33,896 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-12-04 01:24:46,836 INFO [train.py:1087] (1/4) Epoch 10, batch 400, loss[loss=0.2161, simple_loss=0.2999, pruned_loss=0.06618, over 24809.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2981, pruned_loss=0.06725, over 4168006.52 frames. ], batch size: 62, lr: 2.21e-02, grad_scale: 32.0 2023-12-04 01:25:18,028 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=56500.0, ans=0.125 2023-12-04 01:25:18,117 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=56500.0, ans=0.125 2023-12-04 01:25:19,249 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=56566.666666666664, ans=0.1 2023-12-04 01:25:25,354 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=56566.666666666664, ans=0.07 2023-12-04 01:25:42,330 INFO [train.py:1087] (1/4) Epoch 10, batch 450, loss[loss=0.1972, simple_loss=0.2814, pruned_loss=0.05646, over 24611.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2986, pruned_loss=0.06779, over 4301269.54 frames. ], batch size: 68, lr: 2.20e-02, grad_scale: 32.0 2023-12-04 01:25:43,616 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=56700.0, ans=0.125 2023-12-04 01:25:50,718 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.719e+02 1.900e+02 2.177e+02 5.504e+02, threshold=3.801e+02, percent-clipped=1.0 2023-12-04 01:25:55,659 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=56766.666666666664, ans=0.0 2023-12-04 01:26:10,549 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.37 vs. 
limit=15.0 2023-12-04 01:26:23,997 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=56900.0, ans=0.0 2023-12-04 01:26:30,366 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=56966.666666666664, ans=0.2 2023-12-04 01:26:31,406 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=56966.666666666664, ans=0.2 2023-12-04 01:26:32,367 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=56966.666666666664, ans=0.125 2023-12-04 01:26:37,598 INFO [train.py:1087] (1/4) Epoch 10, batch 500, loss[loss=0.2154, simple_loss=0.3016, pruned_loss=0.06462, over 24554.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2983, pruned_loss=0.06746, over 4424893.66 frames. ], batch size: 62, lr: 2.20e-02, grad_scale: 32.0 2023-12-04 01:26:41,308 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=57033.333333333336, ans=0.125 2023-12-04 01:26:51,919 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=57100.0, ans=0.0 2023-12-04 01:26:56,092 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=57100.0, ans=0.04949747468305833 2023-12-04 01:27:05,515 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=57166.666666666664, ans=0.0 2023-12-04 01:27:05,665 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=57166.666666666664, ans=0.125 2023-12-04 01:27:07,752 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=57166.666666666664, ans=0.1 2023-12-04 01:27:11,007 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=57233.333333333336, ans=0.125 2023-12-04 01:27:11,964 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=57233.333333333336, ans=0.2 2023-12-04 01:27:22,116 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.10 vs. limit=15.0 2023-12-04 01:27:23,224 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=57300.0, ans=6.0 2023-12-04 01:27:29,569 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=57300.0, ans=0.0 2023-12-04 01:27:33,197 INFO [train.py:1087] (1/4) Epoch 10, batch 550, loss[loss=0.2141, simple_loss=0.2977, pruned_loss=0.06522, over 24575.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.298, pruned_loss=0.06721, over 4521924.45 frames. 
], batch size: 65, lr: 2.20e-02, grad_scale: 32.0 2023-12-04 01:27:35,656 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=57366.666666666664, ans=22.5 2023-12-04 01:27:41,656 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 1.828e+02 1.979e+02 2.335e+02 3.865e+02, threshold=3.957e+02, percent-clipped=2.0 2023-12-04 01:27:48,610 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=57433.333333333336, ans=0.1 2023-12-04 01:27:56,160 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=57500.0, ans=0.1 2023-12-04 01:28:07,097 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=57566.666666666664, ans=0.04949747468305833 2023-12-04 01:28:19,707 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=57633.333333333336, ans=0.04949747468305833 2023-12-04 01:28:28,434 INFO [train.py:1087] (1/4) Epoch 10, batch 600, loss[loss=0.2015, simple_loss=0.2902, pruned_loss=0.05646, over 24773.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2981, pruned_loss=0.06722, over 4582049.13 frames. ], batch size: 70, lr: 2.19e-02, grad_scale: 32.0 2023-12-04 01:28:51,212 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=57833.333333333336, ans=0.2 2023-12-04 01:28:53,692 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.76 vs. limit=15.0 2023-12-04 01:28:53,918 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.33 vs. limit=15.0 2023-12-04 01:29:03,509 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 01:29:15,133 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=57966.666666666664, ans=0.0 2023-12-04 01:29:19,423 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=57966.666666666664, ans=0.05 2023-12-04 01:29:24,548 INFO [train.py:1087] (1/4) Epoch 10, batch 650, loss[loss=0.218, simple_loss=0.2946, pruned_loss=0.07066, over 24546.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2974, pruned_loss=0.06685, over 4640337.40 frames. ], batch size: 66, lr: 2.19e-02, grad_scale: 32.0 2023-12-04 01:29:33,305 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.737e+02 1.939e+02 2.220e+02 3.387e+02, threshold=3.878e+02, percent-clipped=0.0 2023-12-04 01:29:39,812 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.63 vs. limit=6.0 2023-12-04 01:29:57,198 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.60 vs. limit=15.0 2023-12-04 01:29:58,158 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.11 vs. 
limit=15.0 2023-12-04 01:30:06,762 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=58233.333333333336, ans=0.125 2023-12-04 01:30:09,336 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=58300.0, ans=0.0 2023-12-04 01:30:15,436 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=58300.0, ans=0.125 2023-12-04 01:30:20,539 INFO [train.py:1087] (1/4) Epoch 10, batch 700, loss[loss=0.2028, simple_loss=0.2852, pruned_loss=0.06022, over 24796.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2974, pruned_loss=0.06676, over 4670201.52 frames. ], batch size: 71, lr: 2.19e-02, grad_scale: 32.0 2023-12-04 01:30:24,270 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.96 vs. limit=15.0 2023-12-04 01:30:40,669 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.97 vs. limit=6.0 2023-12-04 01:30:50,375 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-12-04 01:31:12,294 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=58633.333333333336, ans=0.125 2023-12-04 01:31:17,318 INFO [train.py:1087] (1/4) Epoch 10, batch 750, loss[loss=0.2073, simple_loss=0.2957, pruned_loss=0.05942, over 24752.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2969, pruned_loss=0.06639, over 4713232.02 frames. ], batch size: 63, lr: 2.18e-02, grad_scale: 32.0 2023-12-04 01:31:25,741 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.393e+02 1.642e+02 1.916e+02 2.183e+02 2.930e+02, threshold=3.832e+02, percent-clipped=0.0 2023-12-04 01:31:28,018 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=58766.666666666664, ans=0.0 2023-12-04 01:31:29,635 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=13.04 vs. limit=15.0 2023-12-04 01:31:30,430 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.03 vs. limit=22.5 2023-12-04 01:31:42,076 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=58833.333333333336, ans=0.0 2023-12-04 01:31:42,583 INFO [scaling.py:1022] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.65 vs. limit=8.0 2023-12-04 01:31:45,545 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.49 vs. limit=10.0 2023-12-04 01:32:11,830 INFO [train.py:1087] (1/4) Epoch 10, batch 800, loss[loss=0.2135, simple_loss=0.2961, pruned_loss=0.06542, over 24720.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2969, pruned_loss=0.06645, over 4728539.33 frames. 
], batch size: 69, lr: 2.18e-02, grad_scale: 32.0 2023-12-04 01:32:51,409 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=59233.333333333336, ans=0.125 2023-12-04 01:33:02,537 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=59366.666666666664, ans=0.1 2023-12-04 01:33:03,322 INFO [train.py:1087] (1/4) Epoch 10, batch 850, loss[loss=0.2701, simple_loss=0.3332, pruned_loss=0.1035, over 17460.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2973, pruned_loss=0.06686, over 4736270.33 frames. ], batch size: 177, lr: 2.17e-02, grad_scale: 32.0 2023-12-04 01:33:06,807 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.53 vs. limit=12.0 2023-12-04 01:33:08,636 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 01:33:11,378 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.283e+02 1.702e+02 1.889e+02 2.161e+02 3.150e+02, threshold=3.778e+02, percent-clipped=0.0 2023-12-04 01:33:12,575 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=59433.333333333336, ans=0.0 2023-12-04 01:34:04,729 INFO [train.py:1087] (1/4) Epoch 11, batch 0, loss[loss=0.2208, simple_loss=0.2974, pruned_loss=0.07216, over 21413.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2974, pruned_loss=0.07216, over 21413.00 frames. ], batch size: 128, lr: 2.07e-02, grad_scale: 32.0 2023-12-04 01:34:04,730 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 01:34:16,790 INFO [train.py:1119] (1/4) Epoch 11, validation: loss=0.1777, simple_loss=0.28, pruned_loss=0.03772, over 944034.00 frames. 2023-12-04 01:34:16,791 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 01:34:23,752 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=5.51 vs. limit=12.0 2023-12-04 01:34:40,114 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=59800.0, ans=0.125 2023-12-04 01:34:42,085 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=59800.0, ans=0.2 2023-12-04 01:34:43,484 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.46 vs. limit=15.0 2023-12-04 01:34:48,945 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=59866.666666666664, ans=0.0 2023-12-04 01:34:55,307 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=59866.666666666664, ans=0.0 2023-12-04 01:35:12,542 INFO [train.py:1087] (1/4) Epoch 11, batch 50, loss[loss=0.207, simple_loss=0.2919, pruned_loss=0.06108, over 24807.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2977, pruned_loss=0.06646, over 1084503.93 frames. 
], batch size: 73, lr: 2.07e-02, grad_scale: 32.0 2023-12-04 01:35:20,485 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=60000.0, ans=0.0 2023-12-04 01:35:20,564 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 01:35:21,552 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=60000.0, ans=0.125 2023-12-04 01:35:26,543 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.307e+02 1.726e+02 1.897e+02 2.276e+02 3.906e+02, threshold=3.795e+02, percent-clipped=1.0 2023-12-04 01:35:31,178 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=60066.666666666664, ans=0.125 2023-12-04 01:35:44,796 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=60200.0, ans=0.125 2023-12-04 01:35:49,138 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.20 vs. limit=22.5 2023-12-04 01:35:57,705 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=60266.666666666664, ans=0.125 2023-12-04 01:36:05,663 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.97 vs. limit=10.0 2023-12-04 01:36:07,145 INFO [train.py:1087] (1/4) Epoch 11, batch 100, loss[loss=0.2114, simple_loss=0.2971, pruned_loss=0.06288, over 24555.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2963, pruned_loss=0.06536, over 1920098.32 frames. ], batch size: 62, lr: 2.07e-02, grad_scale: 32.0 2023-12-04 01:36:25,505 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.32 vs. limit=15.0 2023-12-04 01:36:30,436 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.17 vs. limit=6.0 2023-12-04 01:36:37,412 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=17.31 vs. limit=15.0 2023-12-04 01:36:56,387 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=60600.0, ans=0.0 2023-12-04 01:37:03,001 INFO [train.py:1087] (1/4) Epoch 11, batch 150, loss[loss=0.221, simple_loss=0.3056, pruned_loss=0.06822, over 24444.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.296, pruned_loss=0.06522, over 2564424.68 frames. ], batch size: 77, lr: 2.06e-02, grad_scale: 32.0 2023-12-04 01:37:14,367 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.52 vs. 
limit=10.0 2023-12-04 01:37:17,695 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.387e+02 1.734e+02 1.958e+02 2.233e+02 3.565e+02, threshold=3.917e+02, percent-clipped=0.0 2023-12-04 01:37:22,193 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=60733.333333333336, ans=0.07 2023-12-04 01:37:34,030 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=60800.0, ans=0.1 2023-12-04 01:37:39,315 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=60866.666666666664, ans=0.125 2023-12-04 01:37:55,468 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.39 vs. limit=10.0 2023-12-04 01:37:56,406 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=60933.333333333336, ans=0.1 2023-12-04 01:37:57,452 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=61000.0, ans=0.0 2023-12-04 01:37:58,244 INFO [train.py:1087] (1/4) Epoch 11, batch 200, loss[loss=0.2023, simple_loss=0.2868, pruned_loss=0.0589, over 24789.00 frames. ], tot_loss[loss=0.212, simple_loss=0.295, pruned_loss=0.06455, over 3064149.10 frames. ], batch size: 72, lr: 2.06e-02, grad_scale: 32.0 2023-12-04 01:38:26,602 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=61133.333333333336, ans=0.0 2023-12-04 01:38:28,008 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-12-04 01:38:31,773 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=61200.0, ans=0.125 2023-12-04 01:38:32,845 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=61200.0, ans=0.125 2023-12-04 01:38:41,771 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=61266.666666666664, ans=0.125 2023-12-04 01:38:54,375 INFO [train.py:1087] (1/4) Epoch 11, batch 250, loss[loss=0.2344, simple_loss=0.3076, pruned_loss=0.08061, over 24494.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2946, pruned_loss=0.06458, over 3451616.81 frames. ], batch size: 77, lr: 2.06e-02, grad_scale: 32.0 2023-12-04 01:38:57,818 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=61333.333333333336, ans=0.125 2023-12-04 01:39:02,279 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.00 vs. 
limit=6.0 2023-12-04 01:39:08,130 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.303e+02 1.716e+02 2.037e+02 2.457e+02 3.688e+02, threshold=4.073e+02, percent-clipped=0.0 2023-12-04 01:39:15,144 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=61466.666666666664, ans=0.125 2023-12-04 01:39:18,047 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 01:39:23,682 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=61466.666666666664, ans=0.125 2023-12-04 01:39:30,072 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=61533.333333333336, ans=0.1 2023-12-04 01:39:30,519 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.42 vs. limit=15.0 2023-12-04 01:39:49,380 INFO [train.py:1087] (1/4) Epoch 11, batch 300, loss[loss=0.2064, simple_loss=0.2852, pruned_loss=0.06381, over 24557.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2944, pruned_loss=0.06451, over 3756894.62 frames. ], batch size: 66, lr: 2.05e-02, grad_scale: 32.0 2023-12-04 01:39:51,064 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=61666.666666666664, ans=0.0 2023-12-04 01:39:52,466 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=61666.666666666664, ans=0.125 2023-12-04 01:40:25,053 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.72 vs. limit=10.0 2023-12-04 01:40:44,748 INFO [train.py:1087] (1/4) Epoch 11, batch 350, loss[loss=0.2048, simple_loss=0.2892, pruned_loss=0.06022, over 24791.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2935, pruned_loss=0.06403, over 3991013.88 frames. ], batch size: 71, lr: 2.05e-02, grad_scale: 32.0 2023-12-04 01:40:59,579 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.779e+02 2.043e+02 2.515e+02 5.209e+02, threshold=4.086e+02, percent-clipped=2.0 2023-12-04 01:41:11,794 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=62133.333333333336, ans=0.125 2023-12-04 01:41:15,302 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.74 vs. limit=15.0 2023-12-04 01:41:21,652 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=62200.0, ans=0.125 2023-12-04 01:41:37,510 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.03 vs. limit=22.5 2023-12-04 01:41:40,159 INFO [train.py:1087] (1/4) Epoch 11, batch 400, loss[loss=0.1904, simple_loss=0.2754, pruned_loss=0.05272, over 24557.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.294, pruned_loss=0.06433, over 4165609.90 frames. 
], batch size: 62, lr: 2.05e-02, grad_scale: 32.0 2023-12-04 01:41:40,481 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=62333.333333333336, ans=0.0 2023-12-04 01:41:55,205 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=62400.0, ans=0.125 2023-12-04 01:42:10,672 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=62466.666666666664, ans=0.0 2023-12-04 01:42:26,189 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=62600.0, ans=0.125 2023-12-04 01:42:30,769 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=62600.0, ans=0.125 2023-12-04 01:42:36,132 INFO [train.py:1087] (1/4) Epoch 11, batch 450, loss[loss=0.276, simple_loss=0.3378, pruned_loss=0.107, over 17181.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2942, pruned_loss=0.06409, over 4310540.33 frames. ], batch size: 178, lr: 2.04e-02, grad_scale: 32.0 2023-12-04 01:42:49,893 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.649e+02 1.867e+02 2.132e+02 3.251e+02, threshold=3.733e+02, percent-clipped=0.0 2023-12-04 01:42:52,338 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=62733.333333333336, ans=0.09899494936611666 2023-12-04 01:43:02,637 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=62800.0, ans=0.2 2023-12-04 01:43:04,011 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=62800.0, ans=0.125 2023-12-04 01:43:06,777 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.51 vs. limit=22.5 2023-12-04 01:43:19,271 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.41 vs. limit=15.0 2023-12-04 01:43:26,571 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=62933.333333333336, ans=0.125 2023-12-04 01:43:31,099 INFO [train.py:1087] (1/4) Epoch 11, batch 500, loss[loss=0.2141, simple_loss=0.3006, pruned_loss=0.06378, over 24868.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2947, pruned_loss=0.06456, over 4419590.78 frames. ], batch size: 68, lr: 2.04e-02, grad_scale: 32.0 2023-12-04 01:43:36,876 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=63000.0, ans=0.0 2023-12-04 01:43:59,923 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=63133.333333333336, ans=0.2 2023-12-04 01:44:25,524 INFO [train.py:1087] (1/4) Epoch 11, batch 550, loss[loss=0.2191, simple_loss=0.3042, pruned_loss=0.06697, over 24162.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2952, pruned_loss=0.06486, over 4501177.90 frames. 
], batch size: 58, lr: 2.04e-02, grad_scale: 32.0 2023-12-04 01:44:26,831 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=63333.333333333336, ans=0.125 2023-12-04 01:44:36,896 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=63400.0, ans=0.125 2023-12-04 01:44:37,841 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=63400.0, ans=0.125 2023-12-04 01:44:39,892 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.319e+02 1.675e+02 1.832e+02 2.050e+02 3.249e+02, threshold=3.664e+02, percent-clipped=0.0 2023-12-04 01:44:46,814 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=63466.666666666664, ans=0.125 2023-12-04 01:44:47,862 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=63466.666666666664, ans=0.125 2023-12-04 01:44:50,936 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=63466.666666666664, ans=0.125 2023-12-04 01:44:52,149 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=63466.666666666664, ans=0.125 2023-12-04 01:44:55,435 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=63466.666666666664, ans=0.125 2023-12-04 01:44:56,394 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=63466.666666666664, ans=0.125 2023-12-04 01:45:21,181 INFO [train.py:1087] (1/4) Epoch 11, batch 600, loss[loss=0.2175, simple_loss=0.298, pruned_loss=0.06847, over 21258.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2947, pruned_loss=0.06466, over 4572345.07 frames. ], batch size: 127, lr: 2.03e-02, grad_scale: 32.0 2023-12-04 01:45:26,945 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=63666.666666666664, ans=0.125 2023-12-04 01:45:27,345 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.66 vs. limit=12.0 2023-12-04 01:45:30,384 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.61 vs. limit=12.0 2023-12-04 01:45:47,127 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=63800.0, ans=0.125 2023-12-04 01:45:51,052 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.53 vs. limit=15.0 2023-12-04 01:45:56,131 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.28 vs. 
limit=15.0 2023-12-04 01:46:00,188 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=63866.666666666664, ans=0.1 2023-12-04 01:46:04,043 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=63866.666666666664, ans=0.0 2023-12-04 01:46:17,158 INFO [train.py:1087] (1/4) Epoch 11, batch 650, loss[loss=0.2087, simple_loss=0.292, pruned_loss=0.06275, over 24762.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2938, pruned_loss=0.06408, over 4634368.61 frames. ], batch size: 66, lr: 2.03e-02, grad_scale: 32.0 2023-12-04 01:46:31,351 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.714e+02 1.930e+02 2.208e+02 4.063e+02, threshold=3.860e+02, percent-clipped=1.0 2023-12-04 01:46:45,513 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=64133.333333333336, ans=0.125 2023-12-04 01:47:10,694 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=64266.666666666664, ans=0.125 2023-12-04 01:47:12,595 INFO [train.py:1087] (1/4) Epoch 11, batch 700, loss[loss=0.1864, simple_loss=0.275, pruned_loss=0.0489, over 24713.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2937, pruned_loss=0.06433, over 4649289.95 frames. ], batch size: 69, lr: 2.03e-02, grad_scale: 32.0 2023-12-04 01:47:41,577 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=64466.666666666664, ans=0.0 2023-12-04 01:47:48,143 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.48 vs. limit=22.5 2023-12-04 01:48:05,353 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=64600.0, ans=0.0 2023-12-04 01:48:07,711 INFO [train.py:1087] (1/4) Epoch 11, batch 750, loss[loss=0.2104, simple_loss=0.2963, pruned_loss=0.06223, over 24733.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2938, pruned_loss=0.06438, over 4676710.86 frames. 
], batch size: 61, lr: 2.02e-02, grad_scale: 32.0 2023-12-04 01:48:14,800 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=64666.666666666664, ans=0.0 2023-12-04 01:48:21,133 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=64733.333333333336, ans=0.125 2023-12-04 01:48:21,836 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.352e+02 1.768e+02 1.995e+02 2.303e+02 4.532e+02, threshold=3.990e+02, percent-clipped=1.0 2023-12-04 01:48:48,260 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=64866.666666666664, ans=0.07 2023-12-04 01:48:51,408 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=64933.333333333336, ans=0.0 2023-12-04 01:48:53,515 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=64933.333333333336, ans=0.2 2023-12-04 01:48:58,971 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=64933.333333333336, ans=0.125 2023-12-04 01:49:02,174 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=65000.0, ans=0.0 2023-12-04 01:49:02,920 INFO [train.py:1087] (1/4) Epoch 11, batch 800, loss[loss=0.1867, simple_loss=0.2686, pruned_loss=0.05243, over 24773.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2934, pruned_loss=0.06397, over 4702683.56 frames. ], batch size: 65, lr: 2.02e-02, grad_scale: 32.0 2023-12-04 01:49:03,595 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.92 vs. limit=22.5 2023-12-04 01:49:15,385 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.10 vs. limit=15.0 2023-12-04 01:49:15,822 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=65066.666666666664, ans=0.1 2023-12-04 01:49:17,736 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=65066.666666666664, ans=0.125 2023-12-04 01:49:20,854 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=65066.666666666664, ans=0.1 2023-12-04 01:49:29,795 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=65133.333333333336, ans=0.0 2023-12-04 01:49:31,703 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=65133.333333333336, ans=0.0 2023-12-04 01:49:38,726 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=65200.0, ans=0.1 2023-12-04 01:49:54,692 INFO [train.py:1087] (1/4) Epoch 11, batch 850, loss[loss=0.1973, simple_loss=0.2854, pruned_loss=0.05463, over 24783.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2935, pruned_loss=0.06386, over 4742186.88 frames. 
], batch size: 71, lr: 2.02e-02, grad_scale: 32.0 2023-12-04 01:50:04,994 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.96 vs. limit=6.0 2023-12-04 01:50:07,562 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.337e+02 1.714e+02 1.902e+02 2.245e+02 3.869e+02, threshold=3.803e+02, percent-clipped=0.0 2023-12-04 01:50:37,760 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=65600.0, ans=0.1 2023-12-04 01:50:56,260 INFO [train.py:1087] (1/4) Epoch 12, batch 0, loss[loss=0.1933, simple_loss=0.2832, pruned_loss=0.05168, over 24783.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2832, pruned_loss=0.05168, over 24783.00 frames. ], batch size: 62, lr: 1.93e-02, grad_scale: 32.0 2023-12-04 01:50:56,261 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 01:51:08,730 INFO [train.py:1119] (1/4) Epoch 12, validation: loss=0.1762, simple_loss=0.2782, pruned_loss=0.03709, over 944034.00 frames. 2023-12-04 01:51:08,731 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 01:51:21,822 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=65700.0, ans=0.0 2023-12-04 01:51:33,125 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=65766.66666666667, ans=0.125 2023-12-04 01:51:50,013 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=65833.33333333333, ans=0.07 2023-12-04 01:51:56,090 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=65900.0, ans=0.125 2023-12-04 01:52:03,488 INFO [train.py:1087] (1/4) Epoch 12, batch 50, loss[loss=0.2063, simple_loss=0.2914, pruned_loss=0.06056, over 24769.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.294, pruned_loss=0.06406, over 1083805.98 frames. ], batch size: 65, lr: 1.93e-02, grad_scale: 32.0 2023-12-04 01:52:12,326 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=65966.66666666667, ans=0.125 2023-12-04 01:52:15,431 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=66033.33333333333, ans=0.125 2023-12-04 01:52:19,952 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=66033.33333333333, ans=0.2 2023-12-04 01:52:22,116 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=66033.33333333333, ans=0.125 2023-12-04 01:52:22,784 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.364e+02 1.664e+02 1.876e+02 2.080e+02 3.636e+02, threshold=3.752e+02, percent-clipped=0.0 2023-12-04 01:52:58,353 INFO [train.py:1087] (1/4) Epoch 12, batch 100, loss[loss=0.2042, simple_loss=0.2836, pruned_loss=0.06242, over 24691.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2925, pruned_loss=0.06355, over 1898666.64 frames. 
], batch size: 74, lr: 1.92e-02, grad_scale: 32.0 2023-12-04 01:53:03,899 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=66300.0, ans=0.5 2023-12-04 01:53:15,956 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=15.0 2023-12-04 01:53:20,018 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=66433.33333333333, ans=0.025 2023-12-04 01:53:23,851 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=9.20 vs. limit=15.0 2023-12-04 01:53:45,589 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=66566.66666666667, ans=0.125 2023-12-04 01:53:46,171 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=66566.66666666667, ans=15.0 2023-12-04 01:53:52,741 INFO [train.py:1087] (1/4) Epoch 12, batch 150, loss[loss=0.2471, simple_loss=0.3119, pruned_loss=0.09117, over 16771.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2919, pruned_loss=0.06279, over 2525747.45 frames. ], batch size: 177, lr: 1.92e-02, grad_scale: 64.0 2023-12-04 01:54:13,085 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.311e+02 1.585e+02 1.776e+02 2.069e+02 2.709e+02, threshold=3.552e+02, percent-clipped=0.0 2023-12-04 01:54:19,280 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.52 vs. limit=15.0 2023-12-04 01:54:30,209 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 01:54:31,331 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=66833.33333333333, ans=0.0 2023-12-04 01:54:48,683 INFO [train.py:1087] (1/4) Epoch 12, batch 200, loss[loss=0.2045, simple_loss=0.29, pruned_loss=0.0595, over 24575.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2916, pruned_loss=0.06245, over 3035678.27 frames. ], batch size: 64, lr: 1.92e-02, grad_scale: 64.0 2023-12-04 01:54:58,587 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=67033.33333333333, ans=0.2 2023-12-04 01:55:08,756 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 01:55:08,839 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=67033.33333333333, ans=0.125 2023-12-04 01:55:10,239 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.35 vs. limit=15.0 2023-12-04 01:55:43,627 INFO [train.py:1087] (1/4) Epoch 12, batch 250, loss[loss=0.2059, simple_loss=0.2893, pruned_loss=0.06129, over 24797.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2914, pruned_loss=0.06254, over 3434730.19 frames. 
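In the Clipping_scale entries emitted by optim.py, the reported threshold tracks twice the median of the grad-norm quartiles (for example 2.0 * 1.776e+02 = 3.552e+02 in the entry just above), so the clipping threshold appears to be clipping_scale times a running median of recent gradient norms. The sketch below reproduces that report format on a buffer of recent norms; the function and buffer names are hypothetical, and the real optimizer logic may maintain these statistics differently.

    import numpy as np

    def clipping_stats(recent_grad_norms, clipping_scale: float = 2.0):
        # recent_grad_norms: per-batch gradient norms from some recent window.
        norms = np.asarray(recent_grad_norms, dtype=np.float64)
        # min / 25% / 50% / 75% / max, matching the five logged quartile values.
        quartiles = np.quantile(norms, [0.0, 0.25, 0.5, 0.75, 1.0])
        threshold = clipping_scale * quartiles[2]           # 2.0 * median
        percent_clipped = 100.0 * float(np.mean(norms > threshold))
        return quartiles, threshold, percent_clipped

With a median of 1.776e+02 this yields the 3.552e+02 threshold and the percent-clipped=0.0 reported above.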
], batch size: 72, lr: 1.91e-02, grad_scale: 64.0 2023-12-04 01:55:44,922 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=67300.0, ans=0.0 2023-12-04 01:55:47,085 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=67300.0, ans=0.125 2023-12-04 01:56:03,353 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.359e+02 1.629e+02 1.883e+02 2.332e+02 4.483e+02, threshold=3.767e+02, percent-clipped=5.0 2023-12-04 01:56:11,518 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 01:56:17,731 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=67500.0, ans=0.0 2023-12-04 01:56:17,810 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=67500.0, ans=0.0 2023-12-04 01:56:25,029 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=67500.0, ans=0.0 2023-12-04 01:56:26,077 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=67566.66666666667, ans=0.2 2023-12-04 01:56:26,228 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=67566.66666666667, ans=0.1 2023-12-04 01:56:38,736 INFO [train.py:1087] (1/4) Epoch 12, batch 300, loss[loss=0.1861, simple_loss=0.2734, pruned_loss=0.04941, over 24791.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2905, pruned_loss=0.06173, over 3752129.40 frames. ], batch size: 62, lr: 1.91e-02, grad_scale: 64.0 2023-12-04 01:56:39,075 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=67633.33333333333, ans=0.2 2023-12-04 01:57:01,986 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.09 vs. limit=15.0 2023-12-04 01:57:13,454 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.49 vs. limit=6.0 2023-12-04 01:57:33,588 INFO [train.py:1087] (1/4) Epoch 12, batch 350, loss[loss=0.1984, simple_loss=0.2852, pruned_loss=0.05573, over 24738.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2898, pruned_loss=0.06093, over 4002172.15 frames. ], batch size: 63, lr: 1.91e-02, grad_scale: 64.0 2023-12-04 01:57:40,694 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.01 vs. limit=22.5 2023-12-04 01:57:43,700 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.55 vs. limit=15.0 2023-12-04 01:57:53,621 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.303e+02 1.644e+02 1.847e+02 2.070e+02 4.257e+02, threshold=3.695e+02, percent-clipped=1.0 2023-12-04 01:58:27,848 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=68300.0, ans=0.0 2023-12-04 01:58:28,590 INFO [train.py:1087] (1/4) Epoch 12, batch 400, loss[loss=0.2133, simple_loss=0.2967, pruned_loss=0.06492, over 24504.00 frames. 
], tot_loss[loss=0.2056, simple_loss=0.2896, pruned_loss=0.06083, over 4194162.17 frames. ], batch size: 77, lr: 1.90e-02, grad_scale: 64.0 2023-12-04 01:58:28,934 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=68300.0, ans=0.125 2023-12-04 01:58:44,816 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 01:58:44,882 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=68366.66666666667, ans=0.1 2023-12-04 01:59:17,526 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=68566.66666666667, ans=0.1 2023-12-04 01:59:23,527 INFO [train.py:1087] (1/4) Epoch 12, batch 450, loss[loss=0.2242, simple_loss=0.3062, pruned_loss=0.07114, over 21406.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2899, pruned_loss=0.06087, over 4333797.49 frames. ], batch size: 127, lr: 1.90e-02, grad_scale: 64.0 2023-12-04 01:59:34,422 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=68700.0, ans=0.125 2023-12-04 01:59:43,214 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.320e+02 1.590e+02 1.776e+02 1.985e+02 3.300e+02, threshold=3.553e+02, percent-clipped=0.0 2023-12-04 01:59:50,975 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=68766.66666666667, ans=0.125 2023-12-04 02:00:04,428 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.23 vs. limit=15.0 2023-12-04 02:00:18,731 INFO [train.py:1087] (1/4) Epoch 12, batch 500, loss[loss=0.2102, simple_loss=0.2948, pruned_loss=0.06279, over 24758.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2901, pruned_loss=0.0608, over 4429896.83 frames. ], batch size: 66, lr: 1.90e-02, grad_scale: 64.0 2023-12-04 02:00:32,687 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=69033.33333333333, ans=0.125 2023-12-04 02:00:39,109 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=69100.0, ans=0.125 2023-12-04 02:00:48,018 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.53 vs. limit=15.0 2023-12-04 02:01:01,069 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.17 vs. limit=15.0 2023-12-04 02:01:02,752 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=69233.33333333333, ans=0.125 2023-12-04 02:01:08,535 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.24 vs. limit=15.0 2023-12-04 02:01:12,515 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=69300.0, ans=0.1 2023-12-04 02:01:13,284 INFO [train.py:1087] (1/4) Epoch 12, batch 550, loss[loss=0.2197, simple_loss=0.3066, pruned_loss=0.06637, over 24715.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2906, pruned_loss=0.06115, over 4513842.53 frames. 
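Each ScheduledFloat line from scaling.py reports the value that a named hyper-parameter (a skip rate at 0.0, a balancer probability at 0.125, a dropout_p at 0.1, a bypass scale_min at 0.2, and so on) takes at the current batch_count. A reasonable mental model is a piecewise-linear schedule over batch_count; the sketch below implements that interpolation. The breakpoints in the comment are made-up examples, not the schedules actually configured for this run.

    from bisect import bisect_right

    def scheduled_float(batch_count: float, breakpoints) -> float:
        # breakpoints: sorted (batch_count, value) pairs, e.g.
        # [(0.0, 0.3), (20000.0, 0.1)] -- illustrative numbers only.
        xs = [b for b, _ in breakpoints]
        ys = [v for _, v in breakpoints]
        if batch_count <= xs[0]:
            return ys[0]
        if batch_count >= xs[-1]:
            return ys[-1]
        i = bisect_right(xs, batch_count)
        t = (batch_count - xs[i - 1]) / (xs[i] - xs[i - 1])
        return ys[i - 1] + t * (ys[i] - ys[i - 1])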
], batch size: 69, lr: 1.90e-02, grad_scale: 32.0 2023-12-04 02:01:23,332 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=69300.0, ans=0.125 2023-12-04 02:01:35,170 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.408e+02 1.724e+02 2.043e+02 2.317e+02 3.835e+02, threshold=4.086e+02, percent-clipped=1.0 2023-12-04 02:01:35,669 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.08 vs. limit=6.0 2023-12-04 02:01:58,868 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=69566.66666666667, ans=0.125 2023-12-04 02:02:00,357 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0 2023-12-04 02:02:01,266 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.94 vs. limit=10.0 2023-12-04 02:02:09,168 INFO [train.py:1087] (1/4) Epoch 12, batch 600, loss[loss=0.1888, simple_loss=0.2748, pruned_loss=0.0514, over 24767.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2901, pruned_loss=0.06109, over 4582173.39 frames. ], batch size: 65, lr: 1.89e-02, grad_scale: 32.0 2023-12-04 02:02:28,267 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=69700.0, ans=0.5 2023-12-04 02:02:29,593 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=69700.0, ans=0.125 2023-12-04 02:02:32,721 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=69766.66666666667, ans=0.125 2023-12-04 02:02:47,481 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=69833.33333333333, ans=0.1 2023-12-04 02:02:50,016 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=69833.33333333333, ans=0.0 2023-12-04 02:02:55,951 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=69900.0, ans=0.09899494936611666 2023-12-04 02:03:01,741 INFO [scaling.py:1022] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.78 vs. limit=5.0 2023-12-04 02:03:04,490 INFO [train.py:1087] (1/4) Epoch 12, batch 650, loss[loss=0.2175, simple_loss=0.2983, pruned_loss=0.06837, over 24746.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2903, pruned_loss=0.06137, over 4634107.51 frames. ], batch size: 65, lr: 1.89e-02, grad_scale: 32.0 2023-12-04 02:03:07,159 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.46 vs. 
limit=10.0 2023-12-04 02:03:18,679 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=70033.33333333333, ans=0.125 2023-12-04 02:03:25,724 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.683e+02 1.982e+02 2.241e+02 4.584e+02, threshold=3.964e+02, percent-clipped=1.0 2023-12-04 02:03:31,377 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:03:32,401 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:03:36,050 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=70100.0, ans=0.0 2023-12-04 02:03:39,144 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=70166.66666666667, ans=0.0 2023-12-04 02:03:45,410 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=70166.66666666667, ans=0.0 2023-12-04 02:03:54,894 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=70233.33333333333, ans=0.0 2023-12-04 02:04:00,249 INFO [train.py:1087] (1/4) Epoch 12, batch 700, loss[loss=0.2019, simple_loss=0.2845, pruned_loss=0.05967, over 24785.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2899, pruned_loss=0.06102, over 4678682.13 frames. ], batch size: 70, lr: 1.89e-02, grad_scale: 32.0 2023-12-04 02:04:02,487 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=70300.0, ans=0.125 2023-12-04 02:04:10,322 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=70366.66666666667, ans=0.125 2023-12-04 02:04:17,714 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=70366.66666666667, ans=0.0 2023-12-04 02:04:17,720 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=70366.66666666667, ans=0.2 2023-12-04 02:04:18,844 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=70366.66666666667, ans=0.125 2023-12-04 02:04:21,895 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=70433.33333333333, ans=0.0 2023-12-04 02:04:40,364 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=70500.0, ans=0.125 2023-12-04 02:04:45,575 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=70566.66666666667, ans=0.035 2023-12-04 02:04:54,514 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=70633.33333333333, ans=0.125 2023-12-04 02:04:55,260 INFO [train.py:1087] (1/4) Epoch 12, batch 750, loss[loss=0.1977, simple_loss=0.2823, pruned_loss=0.0566, over 24783.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2893, pruned_loss=0.06076, over 4706261.55 frames. 
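The grad_scale value attached to each loss entry (32.0 in the surrounding entries, briefly 64.0 a few hundred batches earlier in this epoch) behaves like the dynamic loss-scaling factor used for mixed-precision training: it is grown periodically and halved whenever a scaled gradient overflows. A minimal, generic sketch using PyTorch's GradScaler follows; the model, optimizer and loss_fn arguments are placeholders, and this is not the actual training loop from train.py.

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=32.0)   # dynamic loss scale

    def train_step(model, optimizer, features, targets, loss_fn):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():                    # fp16 forward pass
            loss = loss_fn(model(features), targets)
        scaler.scale(loss).backward()                      # backward on the scaled loss
        scaler.step(optimizer)                             # skipped if grads are inf/nan
        scaler.update()                                    # grows or halves the scale
        return loss.detach(), scaler.get_scale()           # get_scale() is what the log calls grad_scale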
], batch size: 71, lr: 1.88e-02, grad_scale: 32.0 2023-12-04 02:05:10,785 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=70700.0, ans=0.125 2023-12-04 02:05:13,288 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=70700.0, ans=0.2 2023-12-04 02:05:16,172 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.388e+02 1.641e+02 1.850e+02 2.061e+02 3.864e+02, threshold=3.700e+02, percent-clipped=0.0 2023-12-04 02:05:17,476 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=70766.66666666667, ans=0.0 2023-12-04 02:05:19,639 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=70766.66666666667, ans=0.125 2023-12-04 02:05:29,834 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=70833.33333333333, ans=0.125 2023-12-04 02:05:33,070 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=70833.33333333333, ans=0.0 2023-12-04 02:05:47,149 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=70900.0, ans=0.1 2023-12-04 02:05:49,394 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=70966.66666666667, ans=0.1 2023-12-04 02:05:50,306 INFO [train.py:1087] (1/4) Epoch 12, batch 800, loss[loss=0.2099, simple_loss=0.294, pruned_loss=0.06291, over 24484.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2888, pruned_loss=0.06041, over 4739851.94 frames. ], batch size: 75, lr: 1.88e-02, grad_scale: 32.0 2023-12-04 02:06:23,754 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=71166.66666666667, ans=0.0 2023-12-04 02:06:26,814 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=71166.66666666667, ans=0.2 2023-12-04 02:06:35,606 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=71233.33333333333, ans=0.125 2023-12-04 02:06:35,736 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=12.0 2023-12-04 02:06:41,336 INFO [train.py:1087] (1/4) Epoch 12, batch 850, loss[loss=0.1996, simple_loss=0.2881, pruned_loss=0.05558, over 24713.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2885, pruned_loss=0.06017, over 4762572.80 frames. ], batch size: 74, lr: 1.88e-02, grad_scale: 32.0 2023-12-04 02:06:44,834 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.56 vs. 
limit=6.0 2023-12-04 02:07:00,560 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.337e+02 1.606e+02 1.785e+02 2.120e+02 3.462e+02, threshold=3.569e+02, percent-clipped=0.0 2023-12-04 02:07:02,803 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=71433.33333333333, ans=0.09899494936611666 2023-12-04 02:07:13,279 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.07 vs. limit=6.0 2023-12-04 02:07:42,593 INFO [train.py:1087] (1/4) Epoch 13, batch 0, loss[loss=0.1892, simple_loss=0.2788, pruned_loss=0.04976, over 24492.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.2788, pruned_loss=0.04976, over 24492.00 frames. ], batch size: 75, lr: 1.80e-02, grad_scale: 32.0 2023-12-04 02:07:42,594 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 02:07:54,986 INFO [train.py:1119] (1/4) Epoch 13, validation: loss=0.173, simple_loss=0.2751, pruned_loss=0.03551, over 944034.00 frames. 2023-12-04 02:07:54,987 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 02:07:58,512 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=71600.0, ans=0.125 2023-12-04 02:07:59,049 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.35 vs. limit=6.0 2023-12-04 02:07:59,899 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.16 vs. limit=10.0 2023-12-04 02:08:07,752 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=71666.66666666667, ans=0.0 2023-12-04 02:08:10,609 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.76 vs. limit=6.0 2023-12-04 02:08:14,919 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=71666.66666666667, ans=0.0 2023-12-04 02:08:49,944 INFO [train.py:1087] (1/4) Epoch 13, batch 50, loss[loss=0.2081, simple_loss=0.2914, pruned_loss=0.06237, over 24715.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2889, pruned_loss=0.06046, over 1083898.12 frames. 
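At the start of each epoch the log switches briefly to "Computing validation loss", then reports a validation loss over 944034 frames together with the peak CUDA memory (16610MB, unchanged from epoch to epoch here). That memory figure is consistent with querying the allocator's high-water mark; a small sketch of such a report, with an assumed helper name, is below.

    import torch

    def peak_memory_mb(device: torch.device) -> int:
        # Peak bytes ever allocated on this device since process start
        # (or since the last reset), converted to whole megabytes.
        return torch.cuda.max_memory_allocated(device) // (1024 * 1024)

    # e.g. logging.info(f"Maximum memory allocated so far is {peak_memory_mb(device)}MB")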
], batch size: 69, lr: 1.80e-02, grad_scale: 32.0 2023-12-04 02:08:50,260 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=71933.33333333333, ans=0.0 2023-12-04 02:08:52,627 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=71933.33333333333, ans=0.0 2023-12-04 02:09:11,106 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=72066.66666666667, ans=0.125 2023-12-04 02:09:16,453 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.399e+02 1.642e+02 1.871e+02 2.165e+02 4.330e+02, threshold=3.741e+02, percent-clipped=2.0 2023-12-04 02:09:16,763 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=72066.66666666667, ans=0.2 2023-12-04 02:09:26,688 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.83 vs. limit=15.0 2023-12-04 02:09:44,744 INFO [train.py:1087] (1/4) Epoch 13, batch 100, loss[loss=0.205, simple_loss=0.289, pruned_loss=0.06046, over 23964.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2878, pruned_loss=0.05958, over 1909610.97 frames. ], batch size: 87, lr: 1.80e-02, grad_scale: 32.0 2023-12-04 02:09:49,194 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=72266.66666666667, ans=0.125 2023-12-04 02:09:53,399 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=72266.66666666667, ans=0.0 2023-12-04 02:09:53,436 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=72266.66666666667, ans=0.125 2023-12-04 02:09:56,611 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=72333.33333333333, ans=0.0 2023-12-04 02:10:03,403 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=72333.33333333333, ans=0.0 2023-12-04 02:10:07,482 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=72400.0, ans=0.125 2023-12-04 02:10:10,802 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.61 vs. limit=22.5 2023-12-04 02:10:10,816 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.32 vs. 
limit=12.0 2023-12-04 02:10:19,697 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=72466.66666666667, ans=0.125 2023-12-04 02:10:24,923 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=72466.66666666667, ans=0.2 2023-12-04 02:10:29,560 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=72533.33333333333, ans=0.0 2023-12-04 02:10:33,709 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=72533.33333333333, ans=0.0 2023-12-04 02:10:38,242 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.88 vs. limit=6.0 2023-12-04 02:10:38,679 INFO [train.py:1087] (1/4) Epoch 13, batch 150, loss[loss=0.1963, simple_loss=0.2789, pruned_loss=0.05684, over 24764.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2886, pruned_loss=0.06049, over 2535813.50 frames. ], batch size: 64, lr: 1.79e-02, grad_scale: 32.0 2023-12-04 02:10:40,000 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=72600.0, ans=0.0 2023-12-04 02:10:42,101 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=72600.0, ans=0.5 2023-12-04 02:10:50,927 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=72666.66666666667, ans=0.125 2023-12-04 02:11:04,798 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.351e+02 1.578e+02 1.747e+02 1.985e+02 3.078e+02, threshold=3.494e+02, percent-clipped=0.0 2023-12-04 02:11:11,412 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=72800.0, ans=0.125 2023-12-04 02:11:32,889 INFO [train.py:1087] (1/4) Epoch 13, batch 200, loss[loss=0.1889, simple_loss=0.2717, pruned_loss=0.05304, over 24733.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2875, pruned_loss=0.0594, over 3056492.80 frames. ], batch size: 67, lr: 1.79e-02, grad_scale: 32.0 2023-12-04 02:11:40,896 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=72933.33333333333, ans=0.125 2023-12-04 02:11:49,127 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.03 vs. limit=22.5 2023-12-04 02:11:54,100 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=73066.66666666667, ans=0.125 2023-12-04 02:11:59,549 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=73066.66666666667, ans=0.0 2023-12-04 02:12:00,534 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73066.66666666667, ans=0.1 2023-12-04 02:12:02,496 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=73066.66666666667, ans=0.125 2023-12-04 02:12:15,143 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.77 vs. 
limit=22.5 2023-12-04 02:12:27,855 INFO [train.py:1087] (1/4) Epoch 13, batch 250, loss[loss=0.2123, simple_loss=0.2914, pruned_loss=0.0666, over 24055.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2884, pruned_loss=0.06036, over 3431483.21 frames. ], batch size: 87, lr: 1.79e-02, grad_scale: 32.0 2023-12-04 02:12:39,242 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=73333.33333333333, ans=0.0 2023-12-04 02:12:44,794 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=73333.33333333333, ans=0.125 2023-12-04 02:12:54,007 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=73400.0, ans=0.2 2023-12-04 02:12:55,134 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=73400.0, ans=0.0 2023-12-04 02:12:55,827 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.329e+02 1.605e+02 1.874e+02 2.151e+02 3.142e+02, threshold=3.748e+02, percent-clipped=0.0 2023-12-04 02:13:00,510 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73466.66666666667, ans=0.1 2023-12-04 02:13:09,768 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=73466.66666666667, ans=0.2 2023-12-04 02:13:13,382 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=73533.33333333333, ans=0.0 2023-12-04 02:13:16,576 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=73533.33333333333, ans=0.0 2023-12-04 02:13:20,051 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=73533.33333333333, ans=0.125 2023-12-04 02:13:23,424 INFO [train.py:1087] (1/4) Epoch 13, batch 300, loss[loss=0.2014, simple_loss=0.2833, pruned_loss=0.05972, over 24445.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2883, pruned_loss=0.06028, over 3723910.82 frames. ], batch size: 77, lr: 1.78e-02, grad_scale: 16.0 2023-12-04 02:13:52,058 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=73733.33333333333, ans=0.0 2023-12-04 02:14:14,772 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.97 vs. limit=22.5 2023-12-04 02:14:17,921 INFO [train.py:1087] (1/4) Epoch 13, batch 350, loss[loss=0.2069, simple_loss=0.294, pruned_loss=0.0599, over 24451.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2878, pruned_loss=0.05999, over 3965296.47 frames. ], batch size: 77, lr: 1.78e-02, grad_scale: 16.0 2023-12-04 02:14:23,810 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=73933.33333333333, ans=0.125 2023-12-04 02:14:38,356 INFO [scaling.py:1022] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.79 vs. 
limit=5.0 2023-12-04 02:14:39,835 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=74066.66666666667, ans=0.125 2023-12-04 02:14:43,040 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=74066.66666666667, ans=0.125 2023-12-04 02:14:44,780 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.629e+02 1.793e+02 1.917e+02 2.643e+02, threshold=3.587e+02, percent-clipped=0.0 2023-12-04 02:14:49,845 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=22.5 2023-12-04 02:15:04,856 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=74200.0, ans=0.2 2023-12-04 02:15:12,035 INFO [train.py:1087] (1/4) Epoch 13, batch 400, loss[loss=0.2209, simple_loss=0.3011, pruned_loss=0.07036, over 20918.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2877, pruned_loss=0.05982, over 4158816.65 frames. ], batch size: 50, lr: 1.78e-02, grad_scale: 32.0 2023-12-04 02:15:31,713 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=74333.33333333333, ans=0.125 2023-12-04 02:15:33,713 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=74400.0, ans=0.0 2023-12-04 02:15:36,976 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=74400.0, ans=0.125 2023-12-04 02:15:39,995 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=74400.0, ans=0.125 2023-12-04 02:15:40,145 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=74400.0, ans=0.125 2023-12-04 02:15:44,468 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=74466.66666666667, ans=0.0 2023-12-04 02:16:02,999 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=74533.33333333333, ans=10.0 2023-12-04 02:16:04,944 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=74533.33333333333, ans=0.0 2023-12-04 02:16:04,961 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=74533.33333333333, ans=0.125 2023-12-04 02:16:06,901 INFO [train.py:1087] (1/4) Epoch 13, batch 450, loss[loss=0.1885, simple_loss=0.2799, pruned_loss=0.04858, over 24779.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2874, pruned_loss=0.05955, over 4300018.40 frames. 
], batch size: 73, lr: 1.78e-02, grad_scale: 32.0 2023-12-04 02:16:09,253 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=74600.0, ans=0.125 2023-12-04 02:16:10,371 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=74600.0, ans=0.0 2023-12-04 02:16:34,977 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.339e+02 1.725e+02 1.922e+02 2.219e+02 4.035e+02, threshold=3.844e+02, percent-clipped=3.0 2023-12-04 02:16:40,731 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=74800.0, ans=0.125 2023-12-04 02:16:52,501 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=74866.66666666667, ans=0.125 2023-12-04 02:16:59,757 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:17:02,973 INFO [train.py:1087] (1/4) Epoch 13, batch 500, loss[loss=0.2075, simple_loss=0.2929, pruned_loss=0.06108, over 24113.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2873, pruned_loss=0.05938, over 4419202.56 frames. ], batch size: 87, lr: 1.77e-02, grad_scale: 32.0 2023-12-04 02:17:09,494 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=74933.33333333333, ans=0.125 2023-12-04 02:17:18,341 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.85 vs. limit=10.0 2023-12-04 02:17:28,391 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=75066.66666666667, ans=0.0 2023-12-04 02:17:47,806 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=75200.0, ans=0.1 2023-12-04 02:17:47,817 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=75200.0, ans=0.125 2023-12-04 02:17:57,589 INFO [train.py:1087] (1/4) Epoch 13, batch 550, loss[loss=0.1907, simple_loss=0.2773, pruned_loss=0.05206, over 24695.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2875, pruned_loss=0.05967, over 4496921.43 frames. ], batch size: 74, lr: 1.77e-02, grad_scale: 32.0 2023-12-04 02:18:02,808 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=75266.66666666667, ans=0.1 2023-12-04 02:18:08,578 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=75333.33333333333, ans=0.0 2023-12-04 02:18:11,725 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=75333.33333333333, ans=0.0 2023-12-04 02:18:25,349 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.288e+02 1.590e+02 1.790e+02 2.037e+02 3.276e+02, threshold=3.580e+02, percent-clipped=0.0 2023-12-04 02:18:49,230 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.99 vs. 
limit=22.5 2023-12-04 02:18:49,341 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.59 vs. limit=15.0 2023-12-04 02:18:53,011 INFO [train.py:1087] (1/4) Epoch 13, batch 600, loss[loss=0.1999, simple_loss=0.2797, pruned_loss=0.06009, over 24718.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2872, pruned_loss=0.05951, over 4557655.33 frames. ], batch size: 67, lr: 1.77e-02, grad_scale: 32.0 2023-12-04 02:18:57,987 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.66 vs. limit=15.0 2023-12-04 02:19:25,216 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=75800.0, ans=0.5 2023-12-04 02:19:27,302 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=75800.0, ans=0.125 2023-12-04 02:19:41,845 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=75866.66666666667, ans=0.125 2023-12-04 02:19:48,270 INFO [train.py:1087] (1/4) Epoch 13, batch 650, loss[loss=0.22, simple_loss=0.2988, pruned_loss=0.07059, over 24482.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.287, pruned_loss=0.05925, over 4609390.30 frames. ], batch size: 75, lr: 1.77e-02, grad_scale: 32.0 2023-12-04 02:19:53,681 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:19:53,810 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=75933.33333333333, ans=0.0 2023-12-04 02:19:55,715 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=75933.33333333333, ans=0.125 2023-12-04 02:19:57,901 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=76000.0, ans=0.2 2023-12-04 02:20:07,857 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=76000.0, ans=0.04949747468305833 2023-12-04 02:20:13,272 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=76066.66666666667, ans=0.0 2023-12-04 02:20:16,057 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.248e+02 1.571e+02 1.754e+02 1.973e+02 3.590e+02, threshold=3.508e+02, percent-clipped=1.0 2023-12-04 02:20:20,010 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=76066.66666666667, ans=0.07 2023-12-04 02:20:37,827 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=76200.0, ans=0.5 2023-12-04 02:20:41,381 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.63 vs. limit=6.0 2023-12-04 02:20:43,852 INFO [train.py:1087] (1/4) Epoch 13, batch 700, loss[loss=0.1942, simple_loss=0.2812, pruned_loss=0.05364, over 24717.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2864, pruned_loss=0.05889, over 4662765.23 frames. 
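The Whitening lines from scaling.py compare a per-module whiteness statistic against a limit (metric=7.66 vs. limit=15.0, metric=2.59 vs. limit=15.0, and so on, with num_groups splitting the channels for attention keys). When the metric approaches its limit, the module pushes its activations back toward a decorrelated, equal-variance state. One plausible form of such a statistic, which equals 1 for perfectly white features and grows toward the group dimension as the covariance collapses onto a few directions, is sketched below; the exact formula used in scaling.py may differ.

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        # x: (num_frames, num_channels). Hedged whiteness statistic per group:
        # ratio of the squared Frobenius norm of the covariance to its value
        # if the covariance were a multiple of the identity.
        n, c = x.shape
        d = c // num_groups
        x = x.reshape(n, num_groups, d).transpose(0, 1)        # (groups, frames, d)
        x = x - x.mean(dim=1, keepdim=True)
        cov = torch.matmul(x.transpose(1, 2), x) / n           # (groups, d, d)
        diag_mean = cov.diagonal(dim1=1, dim2=2).mean(dim=1)   # mean variance per group
        metric = (cov ** 2).sum(dim=(1, 2)) / (diag_mean ** 2 * d + 1e-20)
        return metric.mean().item()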
], batch size: 67, lr: 1.76e-02, grad_scale: 32.0 2023-12-04 02:21:04,971 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=76400.0, ans=0.125 2023-12-04 02:21:20,157 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=76466.66666666667, ans=0.0 2023-12-04 02:21:30,078 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=76533.33333333333, ans=0.0 2023-12-04 02:21:32,115 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=76533.33333333333, ans=0.125 2023-12-04 02:21:35,318 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=76533.33333333333, ans=0.1 2023-12-04 02:21:39,149 INFO [train.py:1087] (1/4) Epoch 13, batch 750, loss[loss=0.1953, simple_loss=0.284, pruned_loss=0.05334, over 24560.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2861, pruned_loss=0.05865, over 4703481.42 frames. ], batch size: 66, lr: 1.76e-02, grad_scale: 32.0 2023-12-04 02:21:48,733 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=76666.66666666667, ans=0.125 2023-12-04 02:21:48,821 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=76666.66666666667, ans=0.0 2023-12-04 02:22:05,603 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.679e+02 1.812e+02 2.035e+02 3.313e+02, threshold=3.625e+02, percent-clipped=0.0 2023-12-04 02:22:30,311 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=76866.66666666667, ans=0.07 2023-12-04 02:22:33,251 INFO [train.py:1087] (1/4) Epoch 13, batch 800, loss[loss=0.1978, simple_loss=0.2845, pruned_loss=0.05558, over 24611.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2855, pruned_loss=0.05841, over 4731112.81 frames. ], batch size: 68, lr: 1.76e-02, grad_scale: 32.0 2023-12-04 02:22:33,570 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=76933.33333333333, ans=0.125 2023-12-04 02:22:44,635 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=77000.0, ans=0.0 2023-12-04 02:22:45,743 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=77000.0, ans=0.1 2023-12-04 02:22:56,724 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=77066.66666666667, ans=0.0 2023-12-04 02:23:00,688 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=77066.66666666667, ans=0.07 2023-12-04 02:23:01,634 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=77066.66666666667, ans=0.125 2023-12-04 02:23:21,511 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=77200.0, ans=0.125 2023-12-04 02:23:24,366 INFO [train.py:1087] (1/4) Epoch 13, batch 850, loss[loss=0.1849, simple_loss=0.2696, pruned_loss=0.05015, over 24555.00 frames. 
], tot_loss[loss=0.2018, simple_loss=0.2858, pruned_loss=0.05888, over 4738526.30 frames. ], batch size: 66, lr: 1.76e-02, grad_scale: 32.0 2023-12-04 02:23:26,494 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=77266.66666666667, ans=0.125 2023-12-04 02:23:27,074 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.91 vs. limit=15.0 2023-12-04 02:23:39,236 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=77333.33333333333, ans=0.2 2023-12-04 02:23:50,109 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.257e+02 1.571e+02 1.695e+02 1.880e+02 3.580e+02, threshold=3.390e+02, percent-clipped=0.0 2023-12-04 02:23:50,926 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.31 vs. limit=10.0 2023-12-04 02:23:59,523 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=77466.66666666667, ans=0.2 2023-12-04 02:24:26,555 INFO [train.py:1087] (1/4) Epoch 14, batch 0, loss[loss=0.1776, simple_loss=0.2667, pruned_loss=0.04424, over 24704.00 frames. ], tot_loss[loss=0.1776, simple_loss=0.2667, pruned_loss=0.04424, over 24704.00 frames. ], batch size: 74, lr: 1.69e-02, grad_scale: 32.0 2023-12-04 02:24:26,556 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 02:24:38,623 INFO [train.py:1119] (1/4) Epoch 14, validation: loss=0.1708, simple_loss=0.273, pruned_loss=0.03427, over 944034.00 frames. 2023-12-04 02:24:38,624 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 02:24:48,387 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=77633.33333333333, ans=0.0 2023-12-04 02:24:53,789 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=77633.33333333333, ans=0.0 2023-12-04 02:25:08,506 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=77700.0, ans=0.125 2023-12-04 02:25:08,833 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.49 vs. limit=22.5 2023-12-04 02:25:15,829 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=77766.66666666667, ans=0.2 2023-12-04 02:25:29,275 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=77833.33333333333, ans=0.1 2023-12-04 02:25:33,300 INFO [train.py:1087] (1/4) Epoch 14, batch 50, loss[loss=0.2571, simple_loss=0.3228, pruned_loss=0.09567, over 16598.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.284, pruned_loss=0.05692, over 1084870.18 frames. 
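The tot_loss[...] figures are running averages weighted by the number of frames they cover: the frame count climbs quickly early in an epoch (about 1.08M frames by batch 50 above) and levels off near 4.7M frames by batch 750-850, which is the behaviour of an exponentially decayed running sum rather than a plain cumulative one. A sketch of that bookkeeping follows; the decay constant is chosen only for illustration, and the real tracker in train.py may be implemented differently.

    class RunningLoss:
        # Hedged sketch of a frame-weighted, exponentially decayed loss tracker.
        # With decay=0.995 the frame count plateaus near 200x the per-batch
        # frame count, broadly in line with the ~4.7M-frame plateau in the log.
        def __init__(self, decay: float = 0.995):
            self.decay = decay
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, batch_loss: float, batch_frames: float) -> None:
            self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
            self.frames = self.decay * self.frames + batch_frames

        @property
        def value(self) -> float:
            return self.loss_sum / max(self.frames, 1.0)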
], batch size: 177, lr: 1.69e-02, grad_scale: 32.0 2023-12-04 02:25:38,149 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=77900.0, ans=0.2 2023-12-04 02:25:51,022 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=77966.66666666667, ans=0.07 2023-12-04 02:26:02,230 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=78033.33333333333, ans=0.0 2023-12-04 02:26:02,298 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=78033.33333333333, ans=0.125 2023-12-04 02:26:06,277 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.290e+02 1.648e+02 1.784e+02 1.938e+02 4.407e+02, threshold=3.569e+02, percent-clipped=1.0 2023-12-04 02:26:11,540 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.70 vs. limit=15.0 2023-12-04 02:26:16,586 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=78166.66666666667, ans=0.125 2023-12-04 02:26:28,101 INFO [train.py:1087] (1/4) Epoch 14, batch 100, loss[loss=0.1956, simple_loss=0.2804, pruned_loss=0.05542, over 24762.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2847, pruned_loss=0.05696, over 1904291.28 frames. ], batch size: 70, lr: 1.68e-02, grad_scale: 32.0 2023-12-04 02:27:03,813 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=15.0 2023-12-04 02:27:23,497 INFO [train.py:1087] (1/4) Epoch 14, batch 150, loss[loss=0.2149, simple_loss=0.3017, pruned_loss=0.06407, over 23530.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2855, pruned_loss=0.05788, over 2522026.53 frames. ], batch size: 94, lr: 1.68e-02, grad_scale: 32.0 2023-12-04 02:27:25,877 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=78566.66666666667, ans=0.125 2023-12-04 02:27:26,843 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=78566.66666666667, ans=0.1 2023-12-04 02:27:40,513 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=78633.33333333333, ans=0.0 2023-12-04 02:27:44,094 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.20 vs. 
limit=15.0 2023-12-04 02:27:56,504 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.581e+02 1.793e+02 2.072e+02 3.114e+02, threshold=3.586e+02, percent-clipped=0.0 2023-12-04 02:27:59,864 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=78766.66666666667, ans=0.125 2023-12-04 02:28:05,938 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=78766.66666666667, ans=0.0 2023-12-04 02:28:10,213 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=78833.33333333333, ans=0.125 2023-12-04 02:28:11,451 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.53 vs. limit=15.0 2023-12-04 02:28:12,324 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=78833.33333333333, ans=0.125 2023-12-04 02:28:14,905 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.70 vs. limit=22.5 2023-12-04 02:28:18,742 INFO [train.py:1087] (1/4) Epoch 14, batch 200, loss[loss=0.1972, simple_loss=0.2816, pruned_loss=0.05637, over 24292.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2861, pruned_loss=0.05854, over 3015519.13 frames. ], batch size: 79, lr: 1.68e-02, grad_scale: 32.0 2023-12-04 02:29:14,402 INFO [train.py:1087] (1/4) Epoch 14, batch 250, loss[loss=0.1829, simple_loss=0.2752, pruned_loss=0.04534, over 24721.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2863, pruned_loss=0.05889, over 3395323.24 frames. ], batch size: 67, lr: 1.68e-02, grad_scale: 32.0 2023-12-04 02:29:17,822 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=79233.33333333333, ans=0.0 2023-12-04 02:29:21,255 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.59 vs. limit=15.0 2023-12-04 02:29:47,208 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.189e+02 1.553e+02 1.699e+02 1.898e+02 2.927e+02, threshold=3.398e+02, percent-clipped=0.0 2023-12-04 02:29:48,587 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=79433.33333333333, ans=0.125 2023-12-04 02:29:58,406 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=79500.0, ans=0.125 2023-12-04 02:29:58,667 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.88 vs. limit=15.0 2023-12-04 02:30:09,766 INFO [train.py:1087] (1/4) Epoch 14, batch 300, loss[loss=0.1807, simple_loss=0.2704, pruned_loss=0.04551, over 24767.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2857, pruned_loss=0.05808, over 3721087.87 frames. 
], batch size: 64, lr: 1.67e-02, grad_scale: 32.0 2023-12-04 02:30:11,026 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=79566.66666666667, ans=0.125 2023-12-04 02:30:11,056 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=79566.66666666667, ans=0.125 2023-12-04 02:30:16,318 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=79566.66666666667, ans=0.0 2023-12-04 02:30:18,430 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=79566.66666666667, ans=0.125 2023-12-04 02:30:54,627 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=79833.33333333333, ans=0.0 2023-12-04 02:30:59,176 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=79833.33333333333, ans=0.125 2023-12-04 02:31:02,452 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=79833.33333333333, ans=0.2 2023-12-04 02:31:04,244 INFO [train.py:1087] (1/4) Epoch 14, batch 350, loss[loss=0.1988, simple_loss=0.2857, pruned_loss=0.0559, over 24568.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2858, pruned_loss=0.05824, over 3950069.76 frames. ], batch size: 64, lr: 1.67e-02, grad_scale: 32.0 2023-12-04 02:31:17,720 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=79966.66666666667, ans=0.125 2023-12-04 02:31:19,888 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=79966.66666666667, ans=0.125 2023-12-04 02:31:31,330 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.08 vs. limit=15.0 2023-12-04 02:31:39,265 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.287e+02 1.517e+02 1.760e+02 2.062e+02 3.143e+02, threshold=3.521e+02, percent-clipped=0.0 2023-12-04 02:31:58,915 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=80166.66666666667, ans=0.05 2023-12-04 02:32:00,874 INFO [train.py:1087] (1/4) Epoch 14, batch 400, loss[loss=0.1941, simple_loss=0.2797, pruned_loss=0.05426, over 24782.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2854, pruned_loss=0.0578, over 4146916.22 frames. 
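The lr field decreases smoothly within an epoch and steps down again at each epoch boundary (2.02e-02 back in epoch 11, 1.67e-02 here in epoch 14). One family of schedules with exactly that shape is an inverse-power ("Eden-style") schedule in both batch count and epoch; the sketch below is illustrative only, its constants are not fitted to this run, and it is not claimed to reproduce the logged values.

    def inverse_power_lr(base_lr: float, batch: int, epoch: int,
                         lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
        # Hedged sketch: smooth decay with batch count plus an extra decay per
        # epoch. All constants here are placeholders for illustration.
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor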
], batch size: 71, lr: 1.67e-02, grad_scale: 32.0 2023-12-04 02:32:16,663 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=80300.0, ans=0.0 2023-12-04 02:32:27,233 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=80366.66666666667, ans=0.025 2023-12-04 02:32:36,257 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=80433.33333333333, ans=0.125 2023-12-04 02:32:37,358 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=80433.33333333333, ans=0.0 2023-12-04 02:32:39,517 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=80433.33333333333, ans=0.0 2023-12-04 02:32:50,193 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.82 vs. limit=15.0 2023-12-04 02:32:54,333 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.77 vs. limit=15.0 2023-12-04 02:32:55,647 INFO [train.py:1087] (1/4) Epoch 14, batch 450, loss[loss=0.2039, simple_loss=0.2886, pruned_loss=0.05963, over 24798.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2854, pruned_loss=0.05783, over 4285141.95 frames. ], batch size: 62, lr: 1.67e-02, grad_scale: 32.0 2023-12-04 02:33:04,358 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=80566.66666666667, ans=0.2 2023-12-04 02:33:14,960 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=80633.33333333333, ans=0.125 2023-12-04 02:33:18,819 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.89 vs. limit=22.5 2023-12-04 02:33:28,840 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.331e+02 1.570e+02 1.760e+02 1.959e+02 2.604e+02, threshold=3.520e+02, percent-clipped=0.0 2023-12-04 02:33:43,751 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.30 vs. limit=15.0 2023-12-04 02:33:51,207 INFO [train.py:1087] (1/4) Epoch 14, batch 500, loss[loss=0.1872, simple_loss=0.2718, pruned_loss=0.05133, over 24553.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2846, pruned_loss=0.05748, over 4410391.64 frames. ], batch size: 62, lr: 1.66e-02, grad_scale: 32.0 2023-12-04 02:33:58,810 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:34:13,651 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=81033.33333333333, ans=0.2 2023-12-04 02:34:20,172 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.24 vs. limit=10.0 2023-12-04 02:34:37,954 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=81166.66666666667, ans=0.125 2023-12-04 02:34:45,064 INFO [train.py:1087] (1/4) Epoch 14, batch 550, loss[loss=0.2067, simple_loss=0.293, pruned_loss=0.06018, over 24806.00 frames. 
], tot_loss[loss=0.1999, simple_loss=0.2845, pruned_loss=0.05761, over 4488617.96 frames. ], batch size: 62, lr: 1.66e-02, grad_scale: 32.0 2023-12-04 02:34:47,848 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81233.33333333333, ans=0.1 2023-12-04 02:34:51,456 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=81233.33333333333, ans=0.1 2023-12-04 02:34:52,515 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=81233.33333333333, ans=0.2 2023-12-04 02:34:55,651 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:35:13,893 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=81366.66666666667, ans=0.125 2023-12-04 02:35:14,869 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=81366.66666666667, ans=0.125 2023-12-04 02:35:17,826 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.319e+02 1.552e+02 1.720e+02 1.872e+02 2.900e+02, threshold=3.440e+02, percent-clipped=0.0 2023-12-04 02:35:24,001 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.74 vs. limit=15.0 2023-12-04 02:35:40,209 INFO [train.py:1087] (1/4) Epoch 14, batch 600, loss[loss=0.2036, simple_loss=0.2908, pruned_loss=0.05814, over 21359.00 frames. ], tot_loss[loss=0.199, simple_loss=0.284, pruned_loss=0.057, over 4559694.65 frames. ], batch size: 127, lr: 1.66e-02, grad_scale: 32.0 2023-12-04 02:35:58,584 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=81633.33333333333, ans=0.0 2023-12-04 02:36:04,848 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=81700.0, ans=0.0 2023-12-04 02:36:07,131 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=81700.0, ans=0.04949747468305833 2023-12-04 02:36:13,518 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=81766.66666666667, ans=0.125 2023-12-04 02:36:22,901 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.84 vs. limit=15.0 2023-12-04 02:36:35,521 INFO [train.py:1087] (1/4) Epoch 14, batch 650, loss[loss=0.1949, simple_loss=0.2823, pruned_loss=0.05376, over 24745.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2837, pruned_loss=0.05674, over 4626206.41 frames. ], batch size: 63, lr: 1.66e-02, grad_scale: 32.0 2023-12-04 02:37:00,776 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=82033.33333333333, ans=0.125 2023-12-04 02:37:07,927 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.536e+02 1.690e+02 1.852e+02 2.525e+02, threshold=3.381e+02, percent-clipped=0.0 2023-12-04 02:37:30,034 INFO [train.py:1087] (1/4) Epoch 14, batch 700, loss[loss=0.1853, simple_loss=0.2741, pruned_loss=0.04822, over 24707.00 frames. 
], tot_loss[loss=0.198, simple_loss=0.2833, pruned_loss=0.05635, over 4677267.01 frames. ], batch size: 69, lr: 1.65e-02, grad_scale: 32.0 2023-12-04 02:37:30,347 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=82233.33333333333, ans=0.125 2023-12-04 02:37:31,397 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:37:43,128 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=82300.0, ans=0.1 2023-12-04 02:37:44,260 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=82300.0, ans=0.1 2023-12-04 02:37:45,126 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=82300.0, ans=0.125 2023-12-04 02:37:48,512 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=82300.0, ans=0.125 2023-12-04 02:37:55,309 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=82366.66666666667, ans=0.0 2023-12-04 02:38:11,080 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=82433.33333333333, ans=0.125 2023-12-04 02:38:14,238 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=82500.0, ans=0.125 2023-12-04 02:38:16,356 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=82500.0, ans=0.2 2023-12-04 02:38:19,588 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=82500.0, ans=0.5 2023-12-04 02:38:24,632 INFO [train.py:1087] (1/4) Epoch 14, batch 750, loss[loss=0.2014, simple_loss=0.2863, pruned_loss=0.0583, over 24765.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2837, pruned_loss=0.05668, over 4699767.22 frames. ], batch size: 65, lr: 1.65e-02, grad_scale: 32.0 2023-12-04 02:38:28,963 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.53 vs. 
limit=15.0 2023-12-04 02:38:32,258 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=82566.66666666667, ans=0.125 2023-12-04 02:38:33,481 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:38:34,416 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=82566.66666666667, ans=0.125 2023-12-04 02:38:48,864 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=82700.0, ans=0.2 2023-12-04 02:38:58,922 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.291e+02 1.618e+02 1.933e+02 2.376e+02 4.506e+02, threshold=3.866e+02, percent-clipped=2.0 2023-12-04 02:38:59,166 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=82766.66666666667, ans=0.0 2023-12-04 02:39:03,230 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.14 vs. limit=15.0 2023-12-04 02:39:03,782 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=82766.66666666667, ans=0.125 2023-12-04 02:39:06,018 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=82766.66666666667, ans=0.125 2023-12-04 02:39:20,998 INFO [train.py:1087] (1/4) Epoch 14, batch 800, loss[loss=0.1964, simple_loss=0.2835, pruned_loss=0.05469, over 24415.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.2841, pruned_loss=0.05706, over 4714277.46 frames. ], batch size: 77, lr: 1.65e-02, grad_scale: 32.0 2023-12-04 02:39:37,210 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=82966.66666666667, ans=0.0 2023-12-04 02:39:37,439 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.75 vs. limit=15.0 2023-12-04 02:40:12,642 INFO [train.py:1087] (1/4) Epoch 14, batch 850, loss[loss=0.2023, simple_loss=0.2904, pruned_loss=0.05708, over 24160.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.2842, pruned_loss=0.05696, over 4731451.22 frames. ], batch size: 58, lr: 1.65e-02, grad_scale: 32.0 2023-12-04 02:40:19,969 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=83233.33333333333, ans=0.0 2023-12-04 02:40:28,068 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=83300.0, ans=0.125 2023-12-04 02:40:42,923 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.264e+02 1.548e+02 1.697e+02 1.848e+02 3.947e+02, threshold=3.394e+02, percent-clipped=1.0 2023-12-04 02:40:47,067 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=83433.33333333333, ans=0.0 2023-12-04 02:40:55,324 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.71 vs. limit=10.0 2023-12-04 02:41:14,495 INFO [train.py:1087] (1/4) Epoch 15, batch 0, loss[loss=0.1824, simple_loss=0.2709, pruned_loss=0.04692, over 24767.00 frames. 
], tot_loss[loss=0.1824, simple_loss=0.2709, pruned_loss=0.04692, over 24767.00 frames. ], batch size: 66, lr: 1.59e-02, grad_scale: 32.0 2023-12-04 02:41:14,496 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 02:41:26,628 INFO [train.py:1119] (1/4) Epoch 15, validation: loss=0.1681, simple_loss=0.2702, pruned_loss=0.03297, over 944034.00 frames. 2023-12-04 02:41:26,629 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 02:41:27,866 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=83533.33333333333, ans=0.0 2023-12-04 02:41:34,306 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=83533.33333333333, ans=0.125 2023-12-04 02:41:36,255 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=83600.0, ans=0.0 2023-12-04 02:41:48,877 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=83666.66666666667, ans=0.07 2023-12-04 02:41:49,156 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.86 vs. limit=15.0 2023-12-04 02:41:53,485 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=83666.66666666667, ans=0.0 2023-12-04 02:41:55,621 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=83666.66666666667, ans=0.1 2023-12-04 02:42:06,939 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=83733.33333333333, ans=0.125 2023-12-04 02:42:09,590 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.91 vs. limit=12.0 2023-12-04 02:42:12,255 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=83800.0, ans=0.0 2023-12-04 02:42:19,824 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=83800.0, ans=0.0 2023-12-04 02:42:21,688 INFO [train.py:1087] (1/4) Epoch 15, batch 50, loss[loss=0.1815, simple_loss=0.2638, pruned_loss=0.04956, over 24369.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2848, pruned_loss=0.05793, over 1090536.51 frames. ], batch size: 79, lr: 1.59e-02, grad_scale: 32.0 2023-12-04 02:43:00,525 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.341e+02 1.523e+02 1.656e+02 1.880e+02 3.765e+02, threshold=3.311e+02, percent-clipped=1.0 2023-12-04 02:43:07,260 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=84133.33333333333, ans=0.125 2023-12-04 02:43:16,502 INFO [train.py:1087] (1/4) Epoch 15, batch 100, loss[loss=0.184, simple_loss=0.2677, pruned_loss=0.05018, over 24608.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2842, pruned_loss=0.05707, over 1911000.74 frames. 
], batch size: 68, lr: 1.58e-02, grad_scale: 32.0 2023-12-04 02:43:35,467 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=84266.66666666667, ans=0.1 2023-12-04 02:43:35,561 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=84266.66666666667, ans=0.2 2023-12-04 02:43:42,941 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=84333.33333333333, ans=0.1 2023-12-04 02:44:11,284 INFO [train.py:1087] (1/4) Epoch 15, batch 150, loss[loss=0.2357, simple_loss=0.3184, pruned_loss=0.07652, over 21716.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.284, pruned_loss=0.05694, over 2548300.33 frames. ], batch size: 128, lr: 1.58e-02, grad_scale: 32.0 2023-12-04 02:44:12,832 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.36 vs. limit=15.0 2023-12-04 02:44:24,236 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=84600.0, ans=0.125 2023-12-04 02:44:48,891 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=84733.33333333333, ans=0.0 2023-12-04 02:44:49,726 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.330e+02 1.528e+02 1.638e+02 1.857e+02 3.053e+02, threshold=3.277e+02, percent-clipped=0.0 2023-12-04 02:44:52,960 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=84733.33333333333, ans=0.125 2023-12-04 02:45:06,739 INFO [train.py:1087] (1/4) Epoch 15, batch 200, loss[loss=0.1955, simple_loss=0.2872, pruned_loss=0.05186, over 24119.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2831, pruned_loss=0.05642, over 3044780.52 frames. ], batch size: 82, lr: 1.58e-02, grad_scale: 32.0 2023-12-04 02:45:12,465 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=84866.66666666667, ans=0.125 2023-12-04 02:45:33,886 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.02 vs. limit=22.5 2023-12-04 02:45:38,474 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.38 vs. limit=12.0 2023-12-04 02:46:02,251 INFO [train.py:1087] (1/4) Epoch 15, batch 250, loss[loss=0.2109, simple_loss=0.2933, pruned_loss=0.0643, over 24475.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2836, pruned_loss=0.05679, over 3424974.16 frames. 
], batch size: 77, lr: 1.58e-02, grad_scale: 32.0 2023-12-04 02:46:28,265 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=85333.33333333333, ans=0.0 2023-12-04 02:46:38,926 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=85400.0, ans=0.125 2023-12-04 02:46:40,745 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.213e+02 1.454e+02 1.592e+02 1.753e+02 2.965e+02, threshold=3.184e+02, percent-clipped=0.0 2023-12-04 02:46:43,260 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=85400.0, ans=0.2 2023-12-04 02:46:53,052 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=85466.66666666667, ans=0.07 2023-12-04 02:46:57,803 INFO [train.py:1087] (1/4) Epoch 15, batch 300, loss[loss=0.1912, simple_loss=0.2824, pruned_loss=0.05001, over 24587.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2831, pruned_loss=0.05643, over 3722972.48 frames. ], batch size: 65, lr: 1.57e-02, grad_scale: 32.0 2023-12-04 02:46:59,141 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=85533.33333333333, ans=0.125 2023-12-04 02:47:01,593 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=15.0 2023-12-04 02:47:06,577 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=85533.33333333333, ans=0.05 2023-12-04 02:47:16,160 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=85600.0, ans=0.125 2023-12-04 02:47:23,419 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.19 vs. limit=10.0 2023-12-04 02:47:28,870 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.83 vs. limit=10.0 2023-12-04 02:47:34,864 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.38 vs. limit=22.5 2023-12-04 02:47:52,371 INFO [train.py:1087] (1/4) Epoch 15, batch 350, loss[loss=0.1938, simple_loss=0.2767, pruned_loss=0.05549, over 24579.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.283, pruned_loss=0.05621, over 3959314.47 frames. 
], batch size: 65, lr: 1.57e-02, grad_scale: 32.0 2023-12-04 02:47:53,688 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=85866.66666666667, ans=0.125 2023-12-04 02:48:04,288 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=85933.33333333333, ans=0.125 2023-12-04 02:48:27,804 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=86066.66666666667, ans=0.125 2023-12-04 02:48:31,952 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.237e+02 1.522e+02 1.709e+02 1.914e+02 2.589e+02, threshold=3.419e+02, percent-clipped=0.0 2023-12-04 02:48:34,463 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=86066.66666666667, ans=0.125 2023-12-04 02:48:39,653 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=86133.33333333333, ans=0.125 2023-12-04 02:48:47,872 INFO [train.py:1087] (1/4) Epoch 15, batch 400, loss[loss=0.1846, simple_loss=0.2685, pruned_loss=0.05039, over 24572.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2826, pruned_loss=0.05583, over 4158031.06 frames. ], batch size: 65, lr: 1.57e-02, grad_scale: 32.0 2023-12-04 02:49:09,780 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=86333.33333333333, ans=0.02 2023-12-04 02:49:14,000 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=86333.33333333333, ans=0.0 2023-12-04 02:49:20,307 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=86400.0, ans=0.07 2023-12-04 02:49:40,692 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=86466.66666666667, ans=0.0 2023-12-04 02:49:43,579 INFO [train.py:1087] (1/4) Epoch 15, batch 450, loss[loss=0.2298, simple_loss=0.3051, pruned_loss=0.07723, over 21553.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2827, pruned_loss=0.0561, over 4288411.21 frames. ], batch size: 127, lr: 1.57e-02, grad_scale: 32.0 2023-12-04 02:49:50,030 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=86533.33333333333, ans=0.125 2023-12-04 02:49:56,768 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.23 vs. 
limit=12.0 2023-12-04 02:50:06,070 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=86666.66666666667, ans=0.125 2023-12-04 02:50:13,662 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=86666.66666666667, ans=0.0 2023-12-04 02:50:18,968 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=86733.33333333333, ans=0.0 2023-12-04 02:50:21,845 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.235e+02 1.620e+02 1.806e+02 1.993e+02 3.052e+02, threshold=3.612e+02, percent-clipped=0.0 2023-12-04 02:50:38,377 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=86866.66666666667, ans=0.09899494936611666 2023-12-04 02:50:39,162 INFO [train.py:1087] (1/4) Epoch 15, batch 500, loss[loss=0.1786, simple_loss=0.2658, pruned_loss=0.0457, over 24565.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2825, pruned_loss=0.05595, over 4393862.33 frames. ], batch size: 63, lr: 1.57e-02, grad_scale: 64.0 2023-12-04 02:51:10,216 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=87000.0, ans=0.125 2023-12-04 02:51:19,940 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=87066.66666666667, ans=0.125 2023-12-04 02:51:29,521 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=87133.33333333333, ans=0.0 2023-12-04 02:51:33,482 INFO [train.py:1087] (1/4) Epoch 15, batch 550, loss[loss=0.206, simple_loss=0.2894, pruned_loss=0.06129, over 24071.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2824, pruned_loss=0.05587, over 4480869.65 frames. ], batch size: 87, lr: 1.56e-02, grad_scale: 64.0 2023-12-04 02:51:50,883 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=87266.66666666667, ans=0.0 2023-12-04 02:51:57,331 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=87333.33333333333, ans=0.125 2023-12-04 02:52:07,365 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=87400.0, ans=0.0 2023-12-04 02:52:12,754 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.308e+02 1.596e+02 1.795e+02 1.987e+02 2.603e+02, threshold=3.589e+02, percent-clipped=0.0 2023-12-04 02:52:29,061 INFO [train.py:1087] (1/4) Epoch 15, batch 600, loss[loss=0.1869, simple_loss=0.2784, pruned_loss=0.0477, over 24789.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.2817, pruned_loss=0.05554, over 4564402.78 frames. ], batch size: 71, lr: 1.56e-02, grad_scale: 64.0 2023-12-04 02:52:30,461 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=87533.33333333333, ans=0.1 2023-12-04 02:52:36,605 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=87533.33333333333, ans=0.0 2023-12-04 02:52:41,986 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.05 vs. 
limit=15.0 2023-12-04 02:52:54,752 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=87666.66666666667, ans=0.09899494936611666 2023-12-04 02:53:18,208 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=87800.0, ans=0.0 2023-12-04 02:53:19,385 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-12-04 02:53:23,586 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=87866.66666666667, ans=0.2 2023-12-04 02:53:24,769 INFO [train.py:1087] (1/4) Epoch 15, batch 650, loss[loss=0.2139, simple_loss=0.2911, pruned_loss=0.06831, over 24196.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.2818, pruned_loss=0.05562, over 4604728.77 frames. ], batch size: 82, lr: 1.56e-02, grad_scale: 64.0 2023-12-04 02:53:27,212 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=87866.66666666667, ans=0.125 2023-12-04 02:53:44,042 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.05 vs. limit=6.0 2023-12-04 02:54:02,918 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.86 vs. limit=15.0 2023-12-04 02:54:03,473 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.576e+02 1.749e+02 1.959e+02 2.517e+02, threshold=3.499e+02, percent-clipped=0.0 2023-12-04 02:54:05,801 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=88066.66666666667, ans=0.125 2023-12-04 02:54:05,850 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=88066.66666666667, ans=0.2 2023-12-04 02:54:10,086 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=88133.33333333333, ans=0.0 2023-12-04 02:54:13,468 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.47 vs. limit=15.0 2023-12-04 02:54:20,523 INFO [train.py:1087] (1/4) Epoch 15, batch 700, loss[loss=0.1942, simple_loss=0.2776, pruned_loss=0.05539, over 24573.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.2811, pruned_loss=0.05514, over 4654321.76 frames. ], batch size: 65, lr: 1.56e-02, grad_scale: 64.0 2023-12-04 02:54:22,273 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.92 vs. 
limit=10.0 2023-12-04 02:54:24,898 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=88200.0, ans=0.125 2023-12-04 02:54:31,609 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=88266.66666666667, ans=0.2 2023-12-04 02:54:52,953 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=88400.0, ans=0.2 2023-12-04 02:55:09,484 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=88466.66666666667, ans=0.0 2023-12-04 02:55:10,377 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=88466.66666666667, ans=0.125 2023-12-04 02:55:16,303 INFO [train.py:1087] (1/4) Epoch 15, batch 750, loss[loss=0.1806, simple_loss=0.2676, pruned_loss=0.04679, over 24735.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.281, pruned_loss=0.05497, over 4693284.88 frames. ], batch size: 67, lr: 1.55e-02, grad_scale: 64.0 2023-12-04 02:55:16,574 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=88533.33333333333, ans=0.125 2023-12-04 02:55:17,559 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=88533.33333333333, ans=0.125 2023-12-04 02:55:22,948 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=88533.33333333333, ans=0.025 2023-12-04 02:55:27,161 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=88600.0, ans=0.0 2023-12-04 02:55:33,457 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=88600.0, ans=0.125 2023-12-04 02:55:35,188 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.35 vs. limit=22.5 2023-12-04 02:55:50,602 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=88733.33333333333, ans=0.09899494936611666 2023-12-04 02:55:54,580 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.298e+02 1.502e+02 1.659e+02 1.932e+02 2.669e+02, threshold=3.318e+02, percent-clipped=0.0 2023-12-04 02:55:54,923 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=88733.33333333333, ans=0.0 2023-12-04 02:56:03,334 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=88800.0, ans=0.1 2023-12-04 02:56:10,937 INFO [train.py:1087] (1/4) Epoch 15, batch 800, loss[loss=0.1895, simple_loss=0.2786, pruned_loss=0.05026, over 24807.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2805, pruned_loss=0.05467, over 4720911.91 frames. ], batch size: 73, lr: 1.55e-02, grad_scale: 64.0 2023-12-04 02:56:12,119 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=88866.66666666667, ans=0.125 2023-12-04 02:56:20,830 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.94 vs. 
limit=10.0 2023-12-04 02:57:02,422 INFO [train.py:1087] (1/4) Epoch 15, batch 850, loss[loss=0.1841, simple_loss=0.2745, pruned_loss=0.0468, over 24803.00 frames. ], tot_loss[loss=0.1946, simple_loss=0.2803, pruned_loss=0.05447, over 4750912.99 frames. ], batch size: 62, lr: 1.55e-02, grad_scale: 64.0 2023-12-04 02:57:09,592 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=89200.0, ans=0.125 2023-12-04 02:57:33,947 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=89400.0, ans=0.125 2023-12-04 02:57:38,068 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=10.44 vs. limit=15.0 2023-12-04 02:57:38,601 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.272e+02 1.620e+02 1.776e+02 2.057e+02 2.968e+02, threshold=3.552e+02, percent-clipped=0.0 2023-12-04 02:58:03,697 INFO [train.py:1087] (1/4) Epoch 16, batch 0, loss[loss=0.2322, simple_loss=0.3042, pruned_loss=0.08012, over 17086.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3042, pruned_loss=0.08012, over 17086.00 frames. ], batch size: 177, lr: 1.50e-02, grad_scale: 32.0 2023-12-04 02:58:03,697 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 02:58:15,871 INFO [train.py:1119] (1/4) Epoch 16, validation: loss=0.1672, simple_loss=0.2691, pruned_loss=0.03271, over 944034.00 frames. 2023-12-04 02:58:15,872 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 02:58:16,069 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=89500.0, ans=0.0 2023-12-04 02:58:28,076 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=89566.66666666667, ans=0.0 2023-12-04 02:58:38,403 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=89633.33333333333, ans=0.0 2023-12-04 02:59:00,524 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.88 vs. limit=22.5 2023-12-04 02:59:11,228 INFO [train.py:1087] (1/4) Epoch 16, batch 50, loss[loss=0.1792, simple_loss=0.2675, pruned_loss=0.04545, over 24613.00 frames. ], tot_loss[loss=0.1946, simple_loss=0.2798, pruned_loss=0.05467, over 1065591.89 frames. ], batch size: 68, lr: 1.50e-02, grad_scale: 32.0 2023-12-04 02:59:13,550 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=89833.33333333333, ans=0.0 2023-12-04 02:59:21,343 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 02:59:44,567 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=90033.33333333333, ans=0.125 2023-12-04 02:59:56,090 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.220e+02 1.481e+02 1.701e+02 1.981e+02 2.994e+02, threshold=3.402e+02, percent-clipped=0.0 2023-12-04 03:00:05,586 INFO [train.py:1087] (1/4) Epoch 16, batch 100, loss[loss=0.1931, simple_loss=0.2823, pruned_loss=0.05194, over 24715.00 frames. ], tot_loss[loss=0.193, simple_loss=0.2791, pruned_loss=0.05342, over 1908133.43 frames. 
], batch size: 74, lr: 1.49e-02, grad_scale: 32.0 2023-12-04 03:00:08,048 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.66 vs. limit=6.0 2023-12-04 03:00:11,251 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0 2023-12-04 03:00:13,514 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.85 vs. limit=15.0 2023-12-04 03:00:30,480 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=90300.0, ans=0.125 2023-12-04 03:00:32,465 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=90300.0, ans=0.0 2023-12-04 03:00:32,603 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=90300.0, ans=0.0 2023-12-04 03:00:41,359 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=90366.66666666667, ans=0.0 2023-12-04 03:00:46,803 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-12-04 03:01:00,646 INFO [train.py:1087] (1/4) Epoch 16, batch 150, loss[loss=0.2088, simple_loss=0.2887, pruned_loss=0.06447, over 24730.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.279, pruned_loss=0.0534, over 2558324.52 frames. ], batch size: 63, lr: 1.49e-02, grad_scale: 32.0 2023-12-04 03:01:18,681 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=90566.66666666667, ans=0.1 2023-12-04 03:01:19,886 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=90566.66666666667, ans=0.125 2023-12-04 03:01:21,247 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.21 vs. limit=15.0 2023-12-04 03:01:23,150 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=90633.33333333333, ans=0.0 2023-12-04 03:01:37,670 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=90700.0, ans=0.1 2023-12-04 03:01:46,641 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.290e+02 1.563e+02 1.755e+02 1.966e+02 2.864e+02, threshold=3.511e+02, percent-clipped=0.0 2023-12-04 03:01:56,346 INFO [train.py:1087] (1/4) Epoch 16, batch 200, loss[loss=0.1914, simple_loss=0.2792, pruned_loss=0.05184, over 24567.00 frames. ], tot_loss[loss=0.1947, simple_loss=0.2804, pruned_loss=0.05453, over 3028024.55 frames. ], batch size: 65, lr: 1.49e-02, grad_scale: 32.0 2023-12-04 03:02:08,462 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=90900.0, ans=0.125 2023-12-04 03:02:15,219 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=90900.0, ans=0.1 2023-12-04 03:02:17,748 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.29 vs. 
limit=22.5 2023-12-04 03:02:24,665 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 03:02:38,098 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.13 vs. limit=15.0 2023-12-04 03:02:52,145 INFO [train.py:1087] (1/4) Epoch 16, batch 250, loss[loss=0.2151, simple_loss=0.3003, pruned_loss=0.06492, over 22917.00 frames. ], tot_loss[loss=0.1939, simple_loss=0.2798, pruned_loss=0.05399, over 3429035.62 frames. ], batch size: 106, lr: 1.49e-02, grad_scale: 32.0 2023-12-04 03:02:59,615 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=91166.66666666667, ans=0.125 2023-12-04 03:03:07,056 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=91233.33333333333, ans=0.125 2023-12-04 03:03:11,703 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=91233.33333333333, ans=0.1 2023-12-04 03:03:19,210 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=91300.0, ans=0.04949747468305833 2023-12-04 03:03:33,236 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=91366.66666666667, ans=0.1 2023-12-04 03:03:37,190 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.310e+02 1.524e+02 1.658e+02 1.918e+02 2.871e+02, threshold=3.316e+02, percent-clipped=0.0 2023-12-04 03:03:42,175 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=91433.33333333333, ans=0.1 2023-12-04 03:03:44,732 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=91433.33333333333, ans=0.0 2023-12-04 03:03:48,443 INFO [train.py:1087] (1/4) Epoch 16, batch 300, loss[loss=0.1808, simple_loss=0.2708, pruned_loss=0.04534, over 24722.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.2794, pruned_loss=0.05378, over 3742745.38 frames. ], batch size: 67, lr: 1.49e-02, grad_scale: 32.0 2023-12-04 03:03:49,802 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=91500.0, ans=0.125 2023-12-04 03:03:54,203 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=91500.0, ans=0.2 2023-12-04 03:04:10,034 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=91633.33333333333, ans=0.0 2023-12-04 03:04:41,790 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=91833.33333333333, ans=0.0 2023-12-04 03:04:42,563 INFO [train.py:1087] (1/4) Epoch 16, batch 350, loss[loss=0.178, simple_loss=0.2616, pruned_loss=0.04726, over 24768.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2788, pruned_loss=0.05369, over 3975073.98 frames. 
], batch size: 64, lr: 1.48e-02, grad_scale: 32.0 2023-12-04 03:04:42,885 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=91833.33333333333, ans=0.125 2023-12-04 03:04:58,185 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=91900.0, ans=0.0 2023-12-04 03:05:21,042 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=19.53 vs. limit=15.0 2023-12-04 03:05:28,194 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.299e+02 1.477e+02 1.566e+02 1.706e+02 2.525e+02, threshold=3.132e+02, percent-clipped=0.0 2023-12-04 03:05:32,910 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.19 vs. limit=15.0 2023-12-04 03:05:36,919 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=92166.66666666667, ans=0.125 2023-12-04 03:05:37,752 INFO [train.py:1087] (1/4) Epoch 16, batch 400, loss[loss=0.2091, simple_loss=0.2879, pruned_loss=0.06513, over 24493.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.2788, pruned_loss=0.05356, over 4154218.66 frames. ], batch size: 77, lr: 1.48e-02, grad_scale: 32.0 2023-12-04 03:05:49,699 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=92233.33333333333, ans=0.125 2023-12-04 03:05:52,606 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=92233.33333333333, ans=0.5 2023-12-04 03:06:22,370 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=92433.33333333333, ans=0.1 2023-12-04 03:06:23,456 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=92433.33333333333, ans=0.125 2023-12-04 03:06:29,099 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=92433.33333333333, ans=0.125 2023-12-04 03:06:30,022 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=92433.33333333333, ans=0.125 2023-12-04 03:06:33,027 INFO [train.py:1087] (1/4) Epoch 16, batch 450, loss[loss=0.1975, simple_loss=0.2884, pruned_loss=0.05325, over 24846.00 frames. ], tot_loss[loss=0.1928, simple_loss=0.2786, pruned_loss=0.05349, over 4301587.38 frames. ], batch size: 68, lr: 1.48e-02, grad_scale: 32.0 2023-12-04 03:06:39,581 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=92500.0, ans=0.0 2023-12-04 03:06:44,819 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=92566.66666666667, ans=0.2 2023-12-04 03:06:49,115 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=92566.66666666667, ans=0.125 2023-12-04 03:07:06,568 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.22 vs. 
limit=15.0 2023-12-04 03:07:17,580 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.269e+02 1.537e+02 1.693e+02 1.843e+02 2.757e+02, threshold=3.387e+02, percent-clipped=0.0 2023-12-04 03:07:18,124 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.00 vs. limit=22.5 2023-12-04 03:07:18,965 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=92766.66666666667, ans=0.1 2023-12-04 03:07:28,238 INFO [train.py:1087] (1/4) Epoch 16, batch 500, loss[loss=0.1993, simple_loss=0.2851, pruned_loss=0.05679, over 24536.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2789, pruned_loss=0.05371, over 4414132.26 frames. ], batch size: 62, lr: 1.48e-02, grad_scale: 32.0 2023-12-04 03:07:35,184 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=92833.33333333333, ans=0.1 2023-12-04 03:07:54,318 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=92966.66666666667, ans=0.125 2023-12-04 03:08:08,173 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=93033.33333333333, ans=0.125 2023-12-04 03:08:22,773 INFO [train.py:1087] (1/4) Epoch 16, batch 550, loss[loss=0.2117, simple_loss=0.2991, pruned_loss=0.06218, over 24786.00 frames. ], tot_loss[loss=0.193, simple_loss=0.2789, pruned_loss=0.05359, over 4503642.68 frames. ], batch size: 62, lr: 1.48e-02, grad_scale: 32.0 2023-12-04 03:09:03,478 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=93366.66666666667, ans=0.125 2023-12-04 03:09:08,444 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.166e+02 1.494e+02 1.570e+02 1.787e+02 2.496e+02, threshold=3.140e+02, percent-clipped=0.0 2023-12-04 03:09:15,379 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=93433.33333333333, ans=0.125 2023-12-04 03:09:18,084 INFO [scaling.py:1022] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.40 vs. limit=5.0 2023-12-04 03:09:18,320 INFO [train.py:1087] (1/4) Epoch 16, batch 600, loss[loss=0.2087, simple_loss=0.293, pruned_loss=0.0622, over 21344.00 frames. ], tot_loss[loss=0.1936, simple_loss=0.2794, pruned_loss=0.05392, over 4550401.32 frames. 
], batch size: 128, lr: 1.47e-02, grad_scale: 32.0 2023-12-04 03:09:33,327 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=93566.66666666667, ans=0.125 2023-12-04 03:09:40,641 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=93633.33333333333, ans=0.2 2023-12-04 03:09:50,620 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=93700.0, ans=0.2 2023-12-04 03:10:02,900 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=93766.66666666667, ans=0.0 2023-12-04 03:10:06,064 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=93766.66666666667, ans=0.125 2023-12-04 03:10:13,596 INFO [train.py:1087] (1/4) Epoch 16, batch 650, loss[loss=0.1892, simple_loss=0.2749, pruned_loss=0.05175, over 24293.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2794, pruned_loss=0.05362, over 4614325.38 frames. ], batch size: 79, lr: 1.47e-02, grad_scale: 32.0 2023-12-04 03:10:18,186 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.34 vs. limit=10.0 2023-12-04 03:10:25,463 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=93900.0, ans=0.0 2023-12-04 03:10:26,565 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=93900.0, ans=0.125 2023-12-04 03:10:57,840 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.251e+02 1.496e+02 1.627e+02 1.957e+02 3.259e+02, threshold=3.254e+02, percent-clipped=1.0 2023-12-04 03:11:06,539 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.62 vs. limit=22.5 2023-12-04 03:11:08,068 INFO [train.py:1087] (1/4) Epoch 16, batch 700, loss[loss=0.2235, simple_loss=0.2996, pruned_loss=0.07363, over 22967.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2792, pruned_loss=0.05358, over 4659910.25 frames. ], batch size: 106, lr: 1.47e-02, grad_scale: 32.0 2023-12-04 03:11:41,366 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=94366.66666666667, ans=0.2 2023-12-04 03:11:41,484 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=94366.66666666667, ans=0.2 2023-12-04 03:11:52,894 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.43 vs. limit=15.0 2023-12-04 03:11:54,620 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=94433.33333333333, ans=0.0 2023-12-04 03:12:03,067 INFO [train.py:1087] (1/4) Epoch 16, batch 750, loss[loss=0.1788, simple_loss=0.2682, pruned_loss=0.04471, over 24699.00 frames. ], tot_loss[loss=0.193, simple_loss=0.2791, pruned_loss=0.0535, over 4690261.66 frames. 
], batch size: 74, lr: 1.47e-02, grad_scale: 32.0 2023-12-04 03:12:18,594 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=94566.66666666667, ans=0.2 2023-12-04 03:12:44,133 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=94700.0, ans=0.05 2023-12-04 03:12:48,111 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.254e+02 1.571e+02 1.796e+02 2.145e+02 3.958e+02, threshold=3.592e+02, percent-clipped=2.0 2023-12-04 03:12:57,868 INFO [train.py:1087] (1/4) Epoch 16, batch 800, loss[loss=0.1861, simple_loss=0.2755, pruned_loss=0.04834, over 24212.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.2791, pruned_loss=0.05332, over 4716050.95 frames. ], batch size: 82, lr: 1.46e-02, grad_scale: 32.0 2023-12-04 03:13:30,341 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=95033.33333333333, ans=0.1 2023-12-04 03:13:44,353 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=95100.0, ans=0.125 2023-12-04 03:13:49,137 INFO [train.py:1087] (1/4) Epoch 16, batch 850, loss[loss=0.1931, simple_loss=0.2806, pruned_loss=0.05279, over 24791.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.2788, pruned_loss=0.05301, over 4731492.16 frames. ], batch size: 73, lr: 1.46e-02, grad_scale: 32.0 2023-12-04 03:14:14,474 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=95300.0, ans=0.0 2023-12-04 03:14:14,496 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 03:14:30,476 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.285e+02 1.524e+02 1.746e+02 2.004e+02 3.223e+02, threshold=3.492e+02, percent-clipped=0.0 2023-12-04 03:14:49,885 INFO [train.py:1087] (1/4) Epoch 17, batch 0, loss[loss=0.1854, simple_loss=0.2797, pruned_loss=0.04555, over 24595.00 frames. ], tot_loss[loss=0.1854, simple_loss=0.2797, pruned_loss=0.04555, over 24595.00 frames. ], batch size: 68, lr: 1.42e-02, grad_scale: 32.0 2023-12-04 03:14:49,886 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 03:14:59,787 INFO [zipformer.py:1876] (1/4) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([3.1770, 4.3687, 3.9256, 4.8381], device='cuda:1') 2023-12-04 03:15:02,234 INFO [train.py:1119] (1/4) Epoch 17, validation: loss=0.165, simple_loss=0.267, pruned_loss=0.03149, over 944034.00 frames. 
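Note on the loss figures in these entries: each "loss[...]" value is for the current batch, while "tot_loss[... over N frames]" is a running aggregate over the frames seen so far in the epoch (it restarts at "batch 0", which is why tot_loss equals the batch loss there), and each "validation: loss=..." line averages over the same 944034-frame validation set. The sketch below is only an illustration of how such a frame-weighted running average can be maintained; the class and method names (LossTracker, update, average) are hypothetical and are not taken from train.py.

# Illustrative sketch only: assumes the "tot_loss[... over N frames]" figures above
# are frame-weighted running averages of per-batch losses; names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class LossTracker:
    """Accumulates frame-weighted sums of loss components across batches."""
    sums: dict = field(default_factory=dict)   # component name -> weighted sum
    frames: float = 0.0                        # total number of frames seen so far

    def update(self, batch_losses: dict, num_frames: float) -> None:
        # Each per-batch loss is assumed to be an average over that batch's frames,
        # so weighting by the frame count yields a consistent running average.
        for name, value in batch_losses.items():
            self.sums[name] = self.sums.get(name, 0.0) + value * num_frames
        self.frames += num_frames

    def average(self) -> dict:
        return {name: s / self.frames for name, s in self.sums.items()}

if __name__ == "__main__":
    tracker = LossTracker()
    # Two made-up batches in the spirit of the per-batch entries logged above.
    tracker.update({"loss": 0.1988, "simple_loss": 0.2857, "pruned_loss": 0.0559}, 24568)
    tracker.update({"loss": 0.1941, "simple_loss": 0.2797, "pruned_loss": 0.0543}, 24782)
    print(tracker.average())   # frame-weighted aggregate, analogous to "tot_loss[...]"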
2023-12-04 03:15:02,235 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 03:15:12,804 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=95533.33333333333, ans=0.0 2023-12-04 03:15:24,831 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=95600.0, ans=0.125 2023-12-04 03:15:24,965 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=95600.0, ans=15.0 2023-12-04 03:15:44,656 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=95733.33333333333, ans=0.0 2023-12-04 03:15:57,565 INFO [train.py:1087] (1/4) Epoch 17, batch 50, loss[loss=0.2016, simple_loss=0.2851, pruned_loss=0.05904, over 24703.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.2795, pruned_loss=0.05377, over 1085406.52 frames. ], batch size: 69, lr: 1.41e-02, grad_scale: 32.0 2023-12-04 03:16:11,645 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=95866.66666666667, ans=0.125 2023-12-04 03:16:28,548 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=95933.33333333333, ans=0.125 2023-12-04 03:16:28,566 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=95933.33333333333, ans=0.125 2023-12-04 03:16:38,893 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=96000.0, ans=0.125 2023-12-04 03:16:47,282 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=96066.66666666667, ans=0.1 2023-12-04 03:16:48,006 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.286e+02 1.540e+02 1.659e+02 1.795e+02 3.449e+02, threshold=3.318e+02, percent-clipped=0.0 2023-12-04 03:16:52,984 INFO [train.py:1087] (1/4) Epoch 17, batch 100, loss[loss=0.181, simple_loss=0.2703, pruned_loss=0.04579, over 24555.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2784, pruned_loss=0.05292, over 1904081.55 frames. ], batch size: 63, lr: 1.41e-02, grad_scale: 32.0 2023-12-04 03:17:07,680 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=96200.0, ans=0.09899494936611666 2023-12-04 03:17:07,711 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=96200.0, ans=0.125 2023-12-04 03:17:11,001 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.50 vs. limit=22.5 2023-12-04 03:17:41,066 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.89 vs. limit=22.5 2023-12-04 03:17:47,770 INFO [train.py:1087] (1/4) Epoch 17, batch 150, loss[loss=0.2, simple_loss=0.2814, pruned_loss=0.05927, over 22853.00 frames. ], tot_loss[loss=0.1912, simple_loss=0.2775, pruned_loss=0.05241, over 2538933.40 frames. 
], batch size: 106, lr: 1.41e-02, grad_scale: 32.0 2023-12-04 03:17:52,472 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.38 vs. limit=15.0 2023-12-04 03:17:54,830 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.07 vs. limit=15.0 2023-12-04 03:18:14,349 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.77 vs. limit=22.5 2023-12-04 03:18:16,399 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.71 vs. limit=12.0 2023-12-04 03:18:36,922 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=96733.33333333333, ans=0.2 2023-12-04 03:18:37,890 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=96733.33333333333, ans=0.1 2023-12-04 03:18:39,123 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.196e+02 1.417e+02 1.576e+02 1.746e+02 2.655e+02, threshold=3.153e+02, percent-clipped=0.0 2023-12-04 03:18:43,416 INFO [train.py:1087] (1/4) Epoch 17, batch 200, loss[loss=0.2046, simple_loss=0.2879, pruned_loss=0.06062, over 22990.00 frames. ], tot_loss[loss=0.1916, simple_loss=0.2778, pruned_loss=0.05266, over 3025896.84 frames. ], batch size: 106, lr: 1.41e-02, grad_scale: 32.0 2023-12-04 03:18:58,205 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=96866.66666666667, ans=0.0 2023-12-04 03:18:59,265 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=96866.66666666667, ans=0.125 2023-12-04 03:19:01,273 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=96866.66666666667, ans=0.0 2023-12-04 03:19:01,671 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=96866.66666666667, ans=22.5 2023-12-04 03:19:13,775 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.26 vs. limit=15.0 2023-12-04 03:19:15,662 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.29 vs. limit=15.0 2023-12-04 03:19:37,983 INFO [train.py:1087] (1/4) Epoch 17, batch 250, loss[loss=0.2445, simple_loss=0.3142, pruned_loss=0.08738, over 16941.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2778, pruned_loss=0.05293, over 3408955.93 frames. ], batch size: 177, lr: 1.41e-02, grad_scale: 32.0 2023-12-04 03:19:38,544 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.04 vs. 
limit=15.0 2023-12-04 03:19:40,395 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=97133.33333333333, ans=0.125 2023-12-04 03:19:44,587 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=97133.33333333333, ans=0.0 2023-12-04 03:20:02,223 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.58 vs. limit=12.0 2023-12-04 03:20:08,130 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=97266.66666666667, ans=0.2 2023-12-04 03:20:29,367 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.229e+02 1.456e+02 1.611e+02 1.755e+02 2.197e+02, threshold=3.223e+02, percent-clipped=0.0 2023-12-04 03:20:33,560 INFO [train.py:1087] (1/4) Epoch 17, batch 300, loss[loss=0.19, simple_loss=0.2792, pruned_loss=0.05039, over 24748.00 frames. ], tot_loss[loss=0.1925, simple_loss=0.2786, pruned_loss=0.0532, over 3710338.27 frames. ], batch size: 63, lr: 1.41e-02, grad_scale: 32.0 2023-12-04 03:20:42,290 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=97466.66666666667, ans=0.2 2023-12-04 03:20:55,465 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=97600.0, ans=0.0 2023-12-04 03:20:55,673 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.31 vs. limit=22.5 2023-12-04 03:21:00,949 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-12-04 03:21:28,766 INFO [train.py:1087] (1/4) Epoch 17, batch 350, loss[loss=0.1943, simple_loss=0.2817, pruned_loss=0.05344, over 24738.00 frames. ], tot_loss[loss=0.1926, simple_loss=0.2786, pruned_loss=0.05333, over 3941934.96 frames. ], batch size: 63, lr: 1.40e-02, grad_scale: 32.0 2023-12-04 03:21:31,559 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=97800.0, ans=0.1 2023-12-04 03:21:57,848 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.88 vs. limit=10.0 2023-12-04 03:22:19,379 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.204e+02 1.476e+02 1.651e+02 1.851e+02 2.677e+02, threshold=3.301e+02, percent-clipped=0.0 2023-12-04 03:22:23,652 INFO [train.py:1087] (1/4) Epoch 17, batch 400, loss[loss=0.1653, simple_loss=0.256, pruned_loss=0.03732, over 24746.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2782, pruned_loss=0.05295, over 4140122.27 frames. ], batch size: 63, lr: 1.40e-02, grad_scale: 32.0 2023-12-04 03:22:31,811 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=98133.33333333333, ans=0.0 2023-12-04 03:22:39,002 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.90 vs. 
limit=15.0 2023-12-04 03:22:40,799 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=98200.0, ans=0.0 2023-12-04 03:22:40,863 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=98200.0, ans=0.0 2023-12-04 03:22:42,269 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.00 vs. limit=15.0 2023-12-04 03:22:43,876 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 03:22:46,163 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=98266.66666666667, ans=0.0 2023-12-04 03:22:57,153 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.98 vs. limit=15.0 2023-12-04 03:22:58,319 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.37 vs. limit=15.0 2023-12-04 03:23:18,718 INFO [train.py:1087] (1/4) Epoch 17, batch 450, loss[loss=0.1849, simple_loss=0.275, pruned_loss=0.04745, over 24151.00 frames. ], tot_loss[loss=0.1907, simple_loss=0.2772, pruned_loss=0.05213, over 4302038.10 frames. ], batch size: 82, lr: 1.40e-02, grad_scale: 32.0 2023-12-04 03:23:21,003 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=98466.66666666667, ans=0.1 2023-12-04 03:23:31,688 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=98533.33333333333, ans=0.2 2023-12-04 03:23:34,225 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.97 vs. limit=15.0 2023-12-04 03:24:03,891 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.24 vs. limit=22.5 2023-12-04 03:24:09,763 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.258e+02 1.473e+02 1.580e+02 1.739e+02 2.730e+02, threshold=3.159e+02, percent-clipped=0.0 2023-12-04 03:24:14,102 INFO [train.py:1087] (1/4) Epoch 17, batch 500, loss[loss=0.1821, simple_loss=0.2677, pruned_loss=0.04825, over 24570.00 frames. ], tot_loss[loss=0.1902, simple_loss=0.2769, pruned_loss=0.05172, over 4417077.06 frames. ], batch size: 65, lr: 1.40e-02, grad_scale: 32.0 2023-12-04 03:24:24,766 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=98866.66666666667, ans=0.125 2023-12-04 03:24:28,956 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 03:24:31,132 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=98866.66666666667, ans=0.125 2023-12-04 03:24:52,425 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.13 vs. 
limit=6.0 2023-12-04 03:24:55,278 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=99000.0, ans=0.0 2023-12-04 03:25:04,789 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=99066.66666666667, ans=0.125 2023-12-04 03:25:08,078 INFO [train.py:1087] (1/4) Epoch 17, batch 550, loss[loss=0.1948, simple_loss=0.2803, pruned_loss=0.05467, over 24720.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2771, pruned_loss=0.05183, over 4501085.66 frames. ], batch size: 61, lr: 1.40e-02, grad_scale: 32.0 2023-12-04 03:25:29,409 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=99266.66666666667, ans=10.0 2023-12-04 03:25:39,534 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=99266.66666666667, ans=6.0 2023-12-04 03:25:45,959 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 03:25:46,002 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=99333.33333333333, ans=0.2 2023-12-04 03:25:58,813 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.285e+02 1.529e+02 1.713e+02 1.854e+02 2.755e+02, threshold=3.426e+02, percent-clipped=0.0 2023-12-04 03:26:03,149 INFO [train.py:1087] (1/4) Epoch 17, batch 600, loss[loss=0.1962, simple_loss=0.28, pruned_loss=0.05614, over 24553.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2763, pruned_loss=0.05141, over 4579138.09 frames. ], batch size: 62, lr: 1.39e-02, grad_scale: 32.0 2023-12-04 03:26:05,613 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 03:26:08,741 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=99466.66666666667, ans=0.125 2023-12-04 03:26:22,742 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=99533.33333333333, ans=0.125 2023-12-04 03:26:22,925 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=99533.33333333333, ans=0.2 2023-12-04 03:26:23,883 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=99533.33333333333, ans=0.0 2023-12-04 03:26:27,144 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 03:26:28,134 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=99600.0, ans=0.0 2023-12-04 03:26:59,243 INFO [train.py:1087] (1/4) Epoch 17, batch 650, loss[loss=0.1929, simple_loss=0.2836, pruned_loss=0.05112, over 24052.00 frames. ], tot_loss[loss=0.1901, simple_loss=0.2766, pruned_loss=0.05176, over 4633715.88 frames. 
], batch size: 87, lr: 1.39e-02, grad_scale: 32.0 2023-12-04 03:27:03,816 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=99800.0, ans=0.1 2023-12-04 03:27:11,204 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=99866.66666666667, ans=0.0 2023-12-04 03:27:38,831 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.57 vs. limit=15.0 2023-12-04 03:27:47,540 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 03:27:50,898 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.216e+02 1.486e+02 1.651e+02 1.836e+02 2.522e+02, threshold=3.302e+02, percent-clipped=0.0 2023-12-04 03:27:54,651 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.21 vs. limit=22.5 2023-12-04 03:27:55,160 INFO [train.py:1087] (1/4) Epoch 17, batch 700, loss[loss=0.1774, simple_loss=0.2658, pruned_loss=0.04444, over 24748.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.276, pruned_loss=0.05133, over 4684684.56 frames. ], batch size: 66, lr: 1.39e-02, grad_scale: 32.0 2023-12-04 03:28:00,029 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=100133.33333333333, ans=0.0 2023-12-04 03:28:14,975 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=100200.0, ans=0.0 2023-12-04 03:28:29,341 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=100333.33333333333, ans=0.0 2023-12-04 03:28:48,531 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=100400.0, ans=0.125 2023-12-04 03:28:50,318 INFO [train.py:1087] (1/4) Epoch 17, batch 750, loss[loss=0.1855, simple_loss=0.2724, pruned_loss=0.04929, over 24803.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.2759, pruned_loss=0.05124, over 4707927.39 frames. ], batch size: 72, lr: 1.39e-02, grad_scale: 32.0 2023-12-04 03:28:56,159 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=100466.66666666667, ans=0.0 2023-12-04 03:29:03,422 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.01 vs. limit=6.0 2023-12-04 03:29:18,899 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=100600.0, ans=0.125 2023-12-04 03:29:41,094 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.245e+02 1.557e+02 1.699e+02 1.928e+02 3.020e+02, threshold=3.398e+02, percent-clipped=0.0 2023-12-04 03:29:45,371 INFO [train.py:1087] (1/4) Epoch 17, batch 800, loss[loss=0.1789, simple_loss=0.2706, pruned_loss=0.04363, over 24720.00 frames. ], tot_loss[loss=0.1887, simple_loss=0.2756, pruned_loss=0.05095, over 4743334.05 frames. ], batch size: 69, lr: 1.39e-02, grad_scale: 32.0 2023-12-04 03:29:49,979 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.24 vs. 
limit=15.0 2023-12-04 03:29:53,052 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.32 vs. limit=6.0 2023-12-04 03:29:54,893 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.26 vs. limit=6.0 2023-12-04 03:30:04,766 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.28 vs. limit=12.0 2023-12-04 03:30:26,070 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.74 vs. limit=15.0 2023-12-04 03:30:30,540 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=101066.66666666667, ans=0.125 2023-12-04 03:30:36,416 INFO [train.py:1087] (1/4) Epoch 17, batch 850, loss[loss=0.179, simple_loss=0.2702, pruned_loss=0.04387, over 24793.00 frames. ], tot_loss[loss=0.1891, simple_loss=0.2759, pruned_loss=0.05119, over 4758540.32 frames. ], batch size: 62, lr: 1.38e-02, grad_scale: 32.0 2023-12-04 03:30:41,821 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=101133.33333333333, ans=0.125 2023-12-04 03:30:43,865 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=101133.33333333333, ans=0.125 2023-12-04 03:30:44,935 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=101133.33333333333, ans=0.0 2023-12-04 03:31:13,469 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=101333.33333333333, ans=0.125 2023-12-04 03:31:15,304 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=101333.33333333333, ans=0.2 2023-12-04 03:31:38,642 INFO [train.py:1087] (1/4) Epoch 18, batch 0, loss[loss=0.1888, simple_loss=0.2773, pruned_loss=0.05011, over 24758.00 frames. ], tot_loss[loss=0.1888, simple_loss=0.2773, pruned_loss=0.05011, over 24758.00 frames. ], batch size: 70, lr: 1.34e-02, grad_scale: 32.0 2023-12-04 03:31:38,643 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 03:31:50,769 INFO [train.py:1119] (1/4) Epoch 18, validation: loss=0.1646, simple_loss=0.2659, pruned_loss=0.03165, over 944034.00 frames. 2023-12-04 03:31:50,770 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 03:31:51,846 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.254e+02 1.509e+02 1.678e+02 1.874e+02 3.730e+02, threshold=3.357e+02, percent-clipped=2.0 2023-12-04 03:31:54,267 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=101433.33333333333, ans=0.1 2023-12-04 03:32:01,925 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.18 vs. 
limit=15.0 2023-12-04 03:32:13,374 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=101566.66666666667, ans=0.125 2023-12-04 03:32:13,736 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.34 vs. limit=15.0 2023-12-04 03:32:19,067 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.40 vs. limit=10.0 2023-12-04 03:32:46,120 INFO [train.py:1087] (1/4) Epoch 18, batch 50, loss[loss=0.1837, simple_loss=0.2719, pruned_loss=0.04772, over 24726.00 frames. ], tot_loss[loss=0.1891, simple_loss=0.2769, pruned_loss=0.05067, over 1086445.35 frames. ], batch size: 67, lr: 1.34e-02, grad_scale: 32.0 2023-12-04 03:32:46,295 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=101766.66666666667, ans=0.125 2023-12-04 03:33:04,076 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.90 vs. limit=15.0 2023-12-04 03:33:09,366 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=101900.0, ans=0.0 2023-12-04 03:33:17,100 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=101900.0, ans=0.125 2023-12-04 03:33:22,875 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=101966.66666666667, ans=0.125 2023-12-04 03:33:37,708 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=102033.33333333333, ans=0.2 2023-12-04 03:33:41,201 INFO [train.py:1087] (1/4) Epoch 18, batch 100, loss[loss=0.178, simple_loss=0.2683, pruned_loss=0.04387, over 24704.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.2759, pruned_loss=0.05092, over 1900008.06 frames. ], batch size: 74, lr: 1.34e-02, grad_scale: 32.0 2023-12-04 03:33:42,275 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.270e+02 1.448e+02 1.568e+02 1.739e+02 2.957e+02, threshold=3.137e+02, percent-clipped=0.0 2023-12-04 03:34:09,612 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=102233.33333333333, ans=0.125 2023-12-04 03:34:11,706 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=102233.33333333333, ans=0.1 2023-12-04 03:34:15,745 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=102300.0, ans=0.125 2023-12-04 03:34:35,702 INFO [train.py:1087] (1/4) Epoch 18, batch 150, loss[loss=0.1963, simple_loss=0.2783, pruned_loss=0.0572, over 24774.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.2759, pruned_loss=0.05097, over 2544724.81 frames. 
], batch size: 70, lr: 1.34e-02, grad_scale: 32.0 2023-12-04 03:34:52,355 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=102500.0, ans=0.0 2023-12-04 03:34:57,669 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=102566.66666666667, ans=0.125 2023-12-04 03:35:00,773 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=102566.66666666667, ans=0.125 2023-12-04 03:35:10,699 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=102633.33333333333, ans=0.2 2023-12-04 03:35:11,784 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=102633.33333333333, ans=0.0 2023-12-04 03:35:20,179 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=102700.0, ans=0.125 2023-12-04 03:35:25,089 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.36 vs. limit=6.0 2023-12-04 03:35:25,856 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.14 vs. limit=15.0 2023-12-04 03:35:29,787 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=102766.66666666667, ans=0.2 2023-12-04 03:35:30,492 INFO [train.py:1087] (1/4) Epoch 18, batch 200, loss[loss=0.1844, simple_loss=0.2681, pruned_loss=0.05035, over 24547.00 frames. ], tot_loss[loss=0.1893, simple_loss=0.2762, pruned_loss=0.05124, over 3044080.30 frames. ], batch size: 62, lr: 1.34e-02, grad_scale: 64.0 2023-12-04 03:35:31,511 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.294e+02 1.477e+02 1.619e+02 1.796e+02 3.001e+02, threshold=3.237e+02, percent-clipped=0.0 2023-12-04 03:35:48,108 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.75 vs. limit=12.0 2023-12-04 03:35:50,698 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=102833.33333333333, ans=0.035 2023-12-04 03:36:10,615 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.51 vs. limit=15.0 2023-12-04 03:36:13,271 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=103033.33333333333, ans=0.125 2023-12-04 03:36:20,026 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=103033.33333333333, ans=6.0 2023-12-04 03:36:25,786 INFO [train.py:1087] (1/4) Epoch 18, batch 250, loss[loss=0.1774, simple_loss=0.267, pruned_loss=0.04388, over 24558.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2762, pruned_loss=0.05149, over 3432776.70 frames. 
], batch size: 66, lr: 1.33e-02, grad_scale: 64.0 2023-12-04 03:36:48,584 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=103233.33333333333, ans=0.07 2023-12-04 03:36:53,260 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=103233.33333333333, ans=0.125 2023-12-04 03:36:54,770 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.47 vs. limit=15.0 2023-12-04 03:37:08,951 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=103366.66666666667, ans=0.025 2023-12-04 03:37:19,218 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=103366.66666666667, ans=0.0 2023-12-04 03:37:21,847 INFO [train.py:1087] (1/4) Epoch 18, batch 300, loss[loss=0.1904, simple_loss=0.2778, pruned_loss=0.05147, over 24205.00 frames. ], tot_loss[loss=0.19, simple_loss=0.2767, pruned_loss=0.05169, over 3718555.78 frames. ], batch size: 82, lr: 1.33e-02, grad_scale: 64.0 2023-12-04 03:37:22,885 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.255e+02 1.409e+02 1.517e+02 1.696e+02 2.292e+02, threshold=3.035e+02, percent-clipped=0.0 2023-12-04 03:37:30,668 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=103433.33333333333, ans=0.0 2023-12-04 03:37:42,633 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.59 vs. limit=15.0 2023-12-04 03:37:57,682 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 03:38:11,761 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.84 vs. limit=15.0 2023-12-04 03:38:14,709 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=103700.0, ans=0.125 2023-12-04 03:38:16,496 INFO [train.py:1087] (1/4) Epoch 18, batch 350, loss[loss=0.1826, simple_loss=0.2726, pruned_loss=0.04632, over 24794.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2764, pruned_loss=0.05148, over 3953336.52 frames. ], batch size: 71, lr: 1.33e-02, grad_scale: 64.0 2023-12-04 03:38:28,013 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.43 vs. limit=15.0 2023-12-04 03:38:30,254 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.38 vs. limit=15.0 2023-12-04 03:38:44,228 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.98 vs. limit=15.0 2023-12-04 03:39:09,469 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=104033.33333333333, ans=0.0 2023-12-04 03:39:12,304 INFO [train.py:1087] (1/4) Epoch 18, batch 400, loss[loss=0.1943, simple_loss=0.2782, pruned_loss=0.05519, over 24752.00 frames. ], tot_loss[loss=0.1891, simple_loss=0.2759, pruned_loss=0.05121, over 4150580.21 frames. 
], batch size: 63, lr: 1.33e-02, grad_scale: 64.0 2023-12-04 03:39:12,577 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=104100.0, ans=0.125 2023-12-04 03:39:13,354 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.229e+02 1.517e+02 1.661e+02 1.855e+02 2.860e+02, threshold=3.323e+02, percent-clipped=0.0 2023-12-04 03:39:21,494 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=104100.0, ans=0.125 2023-12-04 03:40:07,630 INFO [train.py:1087] (1/4) Epoch 18, batch 450, loss[loss=0.1807, simple_loss=0.2678, pruned_loss=0.04685, over 24323.00 frames. ], tot_loss[loss=0.1887, simple_loss=0.2755, pruned_loss=0.05098, over 4289838.61 frames. ], batch size: 79, lr: 1.33e-02, grad_scale: 64.0 2023-12-04 03:40:25,569 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=104500.0, ans=0.125 2023-12-04 03:41:02,578 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=104766.66666666667, ans=0.0 2023-12-04 03:41:03,345 INFO [train.py:1087] (1/4) Epoch 18, batch 500, loss[loss=0.1876, simple_loss=0.2732, pruned_loss=0.05098, over 24724.00 frames. ], tot_loss[loss=0.1881, simple_loss=0.2751, pruned_loss=0.05053, over 4407865.85 frames. ], batch size: 61, lr: 1.33e-02, grad_scale: 64.0 2023-12-04 03:41:04,774 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.191e+02 1.505e+02 1.621e+02 1.794e+02 2.576e+02, threshold=3.242e+02, percent-clipped=0.0 2023-12-04 03:41:06,113 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=104766.66666666667, ans=0.0 2023-12-04 03:41:40,392 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=104966.66666666667, ans=0.125 2023-12-04 03:41:44,700 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=104966.66666666667, ans=0.125 2023-12-04 03:41:59,016 INFO [train.py:1087] (1/4) Epoch 18, batch 550, loss[loss=0.1856, simple_loss=0.276, pruned_loss=0.0476, over 24755.00 frames. ], tot_loss[loss=0.1882, simple_loss=0.2751, pruned_loss=0.05061, over 4479761.57 frames. ], batch size: 66, lr: 1.32e-02, grad_scale: 32.0 2023-12-04 03:41:59,343 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=105100.0, ans=0.125 2023-12-04 03:42:19,036 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=105166.66666666667, ans=0.2 2023-12-04 03:42:31,611 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=105300.0, ans=0.05 2023-12-04 03:42:44,383 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=105366.66666666667, ans=0.125 2023-12-04 03:42:54,097 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=105433.33333333333, ans=0.2 2023-12-04 03:42:54,875 INFO [train.py:1087] (1/4) Epoch 18, batch 600, loss[loss=0.1814, simple_loss=0.2675, pruned_loss=0.04765, over 24755.00 frames. 
], tot_loss[loss=0.1883, simple_loss=0.2752, pruned_loss=0.05067, over 4552687.24 frames. ], batch size: 65, lr: 1.32e-02, grad_scale: 32.0 2023-12-04 03:42:57,010 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.209e+02 1.490e+02 1.631e+02 1.843e+02 3.006e+02, threshold=3.263e+02, percent-clipped=0.0 2023-12-04 03:43:12,410 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=105500.0, ans=0.2 2023-12-04 03:43:12,718 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.38 vs. limit=22.5 2023-12-04 03:43:18,067 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=105566.66666666667, ans=0.0 2023-12-04 03:43:24,439 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=105566.66666666667, ans=0.125 2023-12-04 03:43:46,794 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=15.0 2023-12-04 03:43:50,981 INFO [train.py:1087] (1/4) Epoch 18, batch 650, loss[loss=0.178, simple_loss=0.2687, pruned_loss=0.04367, over 24797.00 frames. ], tot_loss[loss=0.1886, simple_loss=0.2754, pruned_loss=0.05096, over 4593820.84 frames. ], batch size: 72, lr: 1.32e-02, grad_scale: 32.0 2023-12-04 03:43:56,392 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=105766.66666666667, ans=0.125 2023-12-04 03:44:25,504 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=105966.66666666667, ans=0.0 2023-12-04 03:44:34,586 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.74 vs. limit=15.0 2023-12-04 03:44:46,360 INFO [train.py:1087] (1/4) Epoch 18, batch 700, loss[loss=0.1826, simple_loss=0.2675, pruned_loss=0.04887, over 24611.00 frames. ], tot_loss[loss=0.1878, simple_loss=0.2747, pruned_loss=0.05045, over 4651767.17 frames. 
], batch size: 68, lr: 1.32e-02, grad_scale: 32.0 2023-12-04 03:44:48,450 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.200e+02 1.438e+02 1.544e+02 1.710e+02 2.632e+02, threshold=3.088e+02, percent-clipped=0.0 2023-12-04 03:44:55,406 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=106100.0, ans=0.0 2023-12-04 03:44:59,609 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=106166.66666666667, ans=0.2 2023-12-04 03:45:03,744 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=106166.66666666667, ans=0.04949747468305833 2023-12-04 03:45:20,811 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=106300.0, ans=0.0 2023-12-04 03:45:21,880 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=106300.0, ans=0.2 2023-12-04 03:45:25,150 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=106300.0, ans=0.2 2023-12-04 03:45:26,287 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=106300.0, ans=0.0 2023-12-04 03:45:42,182 INFO [train.py:1087] (1/4) Epoch 18, batch 750, loss[loss=0.1794, simple_loss=0.2695, pruned_loss=0.04468, over 24765.00 frames. ], tot_loss[loss=0.188, simple_loss=0.2749, pruned_loss=0.05058, over 4683992.24 frames. ], batch size: 65, lr: 1.32e-02, grad_scale: 32.0 2023-12-04 03:45:44,041 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.57 vs. limit=15.0 2023-12-04 03:45:47,009 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=106433.33333333333, ans=0.1 2023-12-04 03:45:52,259 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=106500.0, ans=0.1 2023-12-04 03:46:00,815 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=106500.0, ans=0.125 2023-12-04 03:46:13,588 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=106566.66666666667, ans=0.0 2023-12-04 03:46:30,070 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=106700.0, ans=0.125 2023-12-04 03:46:39,669 INFO [train.py:1087] (1/4) Epoch 18, batch 800, loss[loss=0.1998, simple_loss=0.2866, pruned_loss=0.05654, over 23398.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2746, pruned_loss=0.05033, over 4702016.36 frames. ], batch size: 94, lr: 1.32e-02, grad_scale: 32.0 2023-12-04 03:46:41,755 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.203e+02 1.420e+02 1.554e+02 1.731e+02 2.690e+02, threshold=3.108e+02, percent-clipped=0.0 2023-12-04 03:46:44,792 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.83 vs. 
limit=10.0 2023-12-04 03:46:51,965 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=106833.33333333333, ans=0.05 2023-12-04 03:46:57,301 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=106833.33333333333, ans=0.1 2023-12-04 03:47:07,656 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.50 vs. limit=6.0 2023-12-04 03:47:13,282 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=106966.66666666667, ans=0.125 2023-12-04 03:47:24,437 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=107033.33333333333, ans=0.0 2023-12-04 03:47:30,740 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.65 vs. limit=15.0 2023-12-04 03:47:31,260 INFO [train.py:1087] (1/4) Epoch 18, batch 850, loss[loss=0.176, simple_loss=0.2641, pruned_loss=0.04393, over 24558.00 frames. ], tot_loss[loss=0.1872, simple_loss=0.2742, pruned_loss=0.05004, over 4728926.07 frames. ], batch size: 66, lr: 1.31e-02, grad_scale: 32.0 2023-12-04 03:47:31,392 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=107100.0, ans=0.1 2023-12-04 03:47:32,750 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.69 vs. limit=22.5 2023-12-04 03:47:50,777 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=107233.33333333333, ans=0.2 2023-12-04 03:47:59,826 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=107233.33333333333, ans=0.0 2023-12-04 03:48:13,789 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=107366.66666666667, ans=0.125 2023-12-04 03:48:33,728 INFO [train.py:1087] (1/4) Epoch 19, batch 0, loss[loss=0.1834, simple_loss=0.2712, pruned_loss=0.04777, over 23554.00 frames. ], tot_loss[loss=0.1834, simple_loss=0.2712, pruned_loss=0.04777, over 23554.00 frames. ], batch size: 95, lr: 1.28e-02, grad_scale: 32.0 2023-12-04 03:48:33,729 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 03:48:45,997 INFO [train.py:1119] (1/4) Epoch 19, validation: loss=0.1614, simple_loss=0.2634, pruned_loss=0.02973, over 944034.00 frames. 2023-12-04 03:48:45,998 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 03:48:47,366 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=107400.0, ans=0.125 2023-12-04 03:48:47,616 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.01 vs. 
limit=10.0 2023-12-04 03:48:53,374 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.136e+02 1.474e+02 1.635e+02 1.770e+02 2.891e+02, threshold=3.271e+02, percent-clipped=0.0 2023-12-04 03:48:56,848 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=107466.66666666667, ans=0.125 2023-12-04 03:49:04,616 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=107466.66666666667, ans=0.125 2023-12-04 03:49:06,968 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=107533.33333333333, ans=0.0 2023-12-04 03:49:22,159 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 03:49:25,129 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=107600.0, ans=0.2 2023-12-04 03:49:34,479 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.82 vs. limit=15.0 2023-12-04 03:49:40,990 INFO [train.py:1087] (1/4) Epoch 19, batch 50, loss[loss=0.1938, simple_loss=0.2828, pruned_loss=0.05241, over 24258.00 frames. ], tot_loss[loss=0.1896, simple_loss=0.2766, pruned_loss=0.05127, over 1078372.47 frames. ], batch size: 79, lr: 1.28e-02, grad_scale: 32.0 2023-12-04 03:49:42,224 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=107733.33333333333, ans=0.125 2023-12-04 03:49:48,568 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=107733.33333333333, ans=0.125 2023-12-04 03:49:50,674 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=107800.0, ans=0.125 2023-12-04 03:49:50,682 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=107800.0, ans=0.2 2023-12-04 03:49:59,375 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=107800.0, ans=0.0 2023-12-04 03:50:00,445 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=107800.0, ans=0.1 2023-12-04 03:50:00,492 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=107800.0, ans=0.0 2023-12-04 03:50:14,257 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=107933.33333333333, ans=0.0 2023-12-04 03:50:17,288 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=107933.33333333333, ans=0.1 2023-12-04 03:50:34,518 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.00 vs. limit=15.0 2023-12-04 03:50:35,778 INFO [train.py:1087] (1/4) Epoch 19, batch 100, loss[loss=0.1757, simple_loss=0.2706, pruned_loss=0.04046, over 24251.00 frames. ], tot_loss[loss=0.1873, simple_loss=0.2748, pruned_loss=0.04987, over 1906712.50 frames. 
], batch size: 79, lr: 1.27e-02, grad_scale: 32.0 2023-12-04 03:50:44,314 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.178e+02 1.474e+02 1.630e+02 1.863e+02 2.336e+02, threshold=3.260e+02, percent-clipped=0.0 2023-12-04 03:50:44,518 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=108066.66666666667, ans=0.0 2023-12-04 03:50:49,840 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=108133.33333333333, ans=0.2 2023-12-04 03:50:51,923 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=108133.33333333333, ans=0.04949747468305833 2023-12-04 03:50:55,151 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=108133.33333333333, ans=0.0 2023-12-04 03:51:01,836 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=108200.0, ans=0.1 2023-12-04 03:51:02,823 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=108200.0, ans=0.2 2023-12-04 03:51:15,191 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=108266.66666666667, ans=0.0 2023-12-04 03:51:16,394 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=108266.66666666667, ans=0.2 2023-12-04 03:51:25,197 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=15.0 2023-12-04 03:51:25,308 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.18 vs. limit=15.0 2023-12-04 03:51:30,832 INFO [train.py:1087] (1/4) Epoch 19, batch 150, loss[loss=0.1833, simple_loss=0.2716, pruned_loss=0.04747, over 24755.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2738, pruned_loss=0.04937, over 2568955.13 frames. 
], batch size: 66, lr: 1.27e-02, grad_scale: 32.0 2023-12-04 03:51:45,944 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=108466.66666666667, ans=0.0 2023-12-04 03:51:46,984 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=108466.66666666667, ans=0.2 2023-12-04 03:52:06,410 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=108600.0, ans=0.1 2023-12-04 03:52:10,450 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=108600.0, ans=0.2 2023-12-04 03:52:11,977 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=108600.0, ans=0.0 2023-12-04 03:52:16,654 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=108666.66666666667, ans=0.0 2023-12-04 03:52:23,992 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=108666.66666666667, ans=0.1 2023-12-04 03:52:25,847 INFO [train.py:1087] (1/4) Epoch 19, batch 200, loss[loss=0.1828, simple_loss=0.2664, pruned_loss=0.04959, over 24463.00 frames. ], tot_loss[loss=0.1857, simple_loss=0.2733, pruned_loss=0.04912, over 3066890.19 frames. ], batch size: 75, lr: 1.27e-02, grad_scale: 32.0 2023-12-04 03:52:27,120 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=108733.33333333333, ans=0.125 2023-12-04 03:52:32,381 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=108733.33333333333, ans=0.125 2023-12-04 03:52:33,215 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.228e+02 1.475e+02 1.584e+02 1.733e+02 2.643e+02, threshold=3.168e+02, percent-clipped=0.0 2023-12-04 03:52:39,139 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=108800.0, ans=0.125 2023-12-04 03:52:40,168 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=108800.0, ans=0.0 2023-12-04 03:52:59,138 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=108933.33333333333, ans=0.1 2023-12-04 03:53:04,405 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=108933.33333333333, ans=0.025 2023-12-04 03:53:11,012 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=109000.0, ans=10.0 2023-12-04 03:53:21,530 INFO [train.py:1087] (1/4) Epoch 19, batch 250, loss[loss=0.1873, simple_loss=0.2741, pruned_loss=0.05018, over 24770.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2728, pruned_loss=0.04892, over 3460563.03 frames. 
], batch size: 70, lr: 1.27e-02, grad_scale: 32.0 2023-12-04 03:53:25,006 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=109066.66666666667, ans=0.2 2023-12-04 03:53:30,821 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.64 vs. limit=15.0 2023-12-04 03:53:34,573 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=109133.33333333333, ans=0.2 2023-12-04 03:53:47,623 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-12-04 03:54:02,763 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=109266.66666666667, ans=0.0 2023-12-04 03:54:10,441 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=109333.33333333333, ans=0.0 2023-12-04 03:54:13,992 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=109333.33333333333, ans=0.0 2023-12-04 03:54:15,325 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=109333.33333333333, ans=0.0 2023-12-04 03:54:17,667 INFO [train.py:1087] (1/4) Epoch 19, batch 300, loss[loss=0.18, simple_loss=0.2689, pruned_loss=0.04558, over 24579.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2738, pruned_loss=0.04926, over 3744127.97 frames. ], batch size: 64, lr: 1.27e-02, grad_scale: 32.0 2023-12-04 03:54:25,404 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.230e+02 1.469e+02 1.582e+02 1.881e+02 2.761e+02, threshold=3.164e+02, percent-clipped=0.0 2023-12-04 03:54:38,353 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=109533.33333333333, ans=0.09899494936611666 2023-12-04 03:54:58,597 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=109600.0, ans=0.0 2023-12-04 03:55:10,238 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=109666.66666666667, ans=0.125 2023-12-04 03:55:12,209 INFO [train.py:1087] (1/4) Epoch 19, batch 350, loss[loss=0.2019, simple_loss=0.2865, pruned_loss=0.05864, over 23931.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2738, pruned_loss=0.04935, over 3982085.88 frames. 
], batch size: 87, lr: 1.27e-02, grad_scale: 32.0 2023-12-04 03:55:26,614 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=109800.0, ans=0.2 2023-12-04 03:55:47,910 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=109933.33333333333, ans=0.0 2023-12-04 03:55:57,661 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=110000.0, ans=0.025 2023-12-04 03:56:07,113 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=110066.66666666667, ans=0.1 2023-12-04 03:56:07,447 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.23 vs. limit=15.0 2023-12-04 03:56:07,948 INFO [train.py:1087] (1/4) Epoch 19, batch 400, loss[loss=0.1738, simple_loss=0.2586, pruned_loss=0.04444, over 24794.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.2738, pruned_loss=0.04969, over 4146930.31 frames. ], batch size: 71, lr: 1.26e-02, grad_scale: 32.0 2023-12-04 03:56:13,517 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=110066.66666666667, ans=0.125 2023-12-04 03:56:15,316 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.278e+02 1.532e+02 1.716e+02 1.999e+02 2.876e+02, threshold=3.431e+02, percent-clipped=0.0 2023-12-04 03:56:23,791 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=110133.33333333333, ans=0.125 2023-12-04 03:56:49,093 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.78 vs. limit=15.0 2023-12-04 03:56:50,918 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=110333.33333333333, ans=0.125 2023-12-04 03:57:03,727 INFO [train.py:1087] (1/4) Epoch 19, batch 450, loss[loss=0.187, simple_loss=0.2689, pruned_loss=0.05252, over 24189.00 frames. ], tot_loss[loss=0.1866, simple_loss=0.2739, pruned_loss=0.04968, over 4291101.96 frames. ], batch size: 82, lr: 1.26e-02, grad_scale: 32.0 2023-12-04 03:57:09,403 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=110400.0, ans=0.04949747468305833 2023-12-04 03:57:13,141 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.03 vs. limit=10.0 2023-12-04 03:57:23,817 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.35 vs. limit=15.0 2023-12-04 03:57:31,695 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=110533.33333333333, ans=0.0 2023-12-04 03:57:32,101 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.50 vs. 
limit=15.0 2023-12-04 03:57:42,617 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=110600.0, ans=0.125 2023-12-04 03:57:42,708 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=110600.0, ans=0.125 2023-12-04 03:57:49,421 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=110666.66666666667, ans=0.125 2023-12-04 03:57:55,345 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.03 vs. limit=22.5 2023-12-04 03:57:59,709 INFO [train.py:1087] (1/4) Epoch 19, batch 500, loss[loss=0.1698, simple_loss=0.2583, pruned_loss=0.0406, over 24763.00 frames. ], tot_loss[loss=0.1864, simple_loss=0.2738, pruned_loss=0.04951, over 4409035.92 frames. ], batch size: 70, lr: 1.26e-02, grad_scale: 32.0 2023-12-04 03:58:07,435 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.246e+02 1.421e+02 1.568e+02 1.756e+02 2.852e+02, threshold=3.136e+02, percent-clipped=0.0 2023-12-04 03:58:11,848 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=110800.0, ans=0.1 2023-12-04 03:58:16,532 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.47 vs. limit=15.0 2023-12-04 03:58:17,242 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=110800.0, ans=0.125 2023-12-04 03:58:32,358 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.98 vs. limit=22.5 2023-12-04 03:58:43,294 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=111000.0, ans=0.2 2023-12-04 03:58:46,643 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=111000.0, ans=0.0 2023-12-04 03:58:50,896 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=111000.0, ans=0.2 2023-12-04 03:58:54,928 INFO [train.py:1087] (1/4) Epoch 19, batch 550, loss[loss=0.1851, simple_loss=0.2736, pruned_loss=0.04832, over 24746.00 frames. ], tot_loss[loss=0.1862, simple_loss=0.2736, pruned_loss=0.04941, over 4500370.71 frames. 
], batch size: 66, lr: 1.26e-02, grad_scale: 32.0 2023-12-04 03:58:56,389 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=111066.66666666667, ans=0.1 2023-12-04 03:59:01,392 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=111066.66666666667, ans=0.2 2023-12-04 03:59:05,838 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=111133.33333333333, ans=0.1 2023-12-04 03:59:09,043 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=111133.33333333333, ans=0.125 2023-12-04 03:59:12,488 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=111133.33333333333, ans=0.0 2023-12-04 03:59:18,921 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=111200.0, ans=0.1 2023-12-04 03:59:41,498 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=111333.33333333333, ans=0.2 2023-12-04 03:59:47,968 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=111333.33333333333, ans=0.125 2023-12-04 03:59:47,991 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=111333.33333333333, ans=0.04949747468305833 2023-12-04 03:59:50,941 INFO [train.py:1087] (1/4) Epoch 19, batch 600, loss[loss=0.1721, simple_loss=0.262, pruned_loss=0.04109, over 24589.00 frames. ], tot_loss[loss=0.1864, simple_loss=0.2738, pruned_loss=0.04949, over 4582112.03 frames. ], batch size: 68, lr: 1.26e-02, grad_scale: 32.0 2023-12-04 03:59:52,200 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=111400.0, ans=0.95 2023-12-04 03:59:58,417 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.173e+02 1.428e+02 1.598e+02 1.767e+02 2.995e+02, threshold=3.197e+02, percent-clipped=0.0 2023-12-04 04:00:02,556 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.69 vs. limit=22.5 2023-12-04 04:00:14,442 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=111533.33333333333, ans=0.05 2023-12-04 04:00:28,904 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.69 vs. limit=15.0 2023-12-04 04:00:38,314 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=111666.66666666667, ans=0.1 2023-12-04 04:00:38,485 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=111666.66666666667, ans=0.1 2023-12-04 04:00:46,498 INFO [train.py:1087] (1/4) Epoch 19, batch 650, loss[loss=0.1924, simple_loss=0.2881, pruned_loss=0.04837, over 24764.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2728, pruned_loss=0.04888, over 4648949.07 frames. 
], batch size: 64, lr: 1.26e-02, grad_scale: 32.0 2023-12-04 04:00:54,391 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=111733.33333333333, ans=0.125 2023-12-04 04:00:54,738 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.21 vs. limit=22.5 2023-12-04 04:01:02,830 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=111800.0, ans=0.0 2023-12-04 04:01:26,445 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=111933.33333333333, ans=0.125 2023-12-04 04:01:27,333 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=111933.33333333333, ans=0.125 2023-12-04 04:01:42,496 INFO [train.py:1087] (1/4) Epoch 19, batch 700, loss[loss=0.1767, simple_loss=0.2695, pruned_loss=0.04198, over 24810.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2723, pruned_loss=0.04863, over 4690686.61 frames. ], batch size: 62, lr: 1.25e-02, grad_scale: 32.0 2023-12-04 04:01:49,797 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.258e+02 1.467e+02 1.598e+02 1.818e+02 2.734e+02, threshold=3.197e+02, percent-clipped=0.0 2023-12-04 04:01:52,447 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.06 vs. limit=12.0 2023-12-04 04:01:53,697 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.22 vs. limit=15.0 2023-12-04 04:01:58,893 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=112133.33333333333, ans=0.125 2023-12-04 04:02:00,631 INFO [scaling.py:1022] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.90 vs. limit=5.0 2023-12-04 04:02:07,817 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=112200.0, ans=0.2 2023-12-04 04:02:09,299 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=112200.0, ans=22.5 2023-12-04 04:02:14,072 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.98 vs. limit=10.0 2023-12-04 04:02:16,684 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=112266.66666666667, ans=0.125 2023-12-04 04:02:26,136 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=112333.33333333333, ans=0.2 2023-12-04 04:02:30,868 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=112333.33333333333, ans=0.04949747468305833 2023-12-04 04:02:37,380 INFO [train.py:1087] (1/4) Epoch 19, batch 750, loss[loss=0.169, simple_loss=0.2587, pruned_loss=0.03964, over 24567.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2727, pruned_loss=0.049, over 4702782.77 frames. 
], batch size: 65, lr: 1.25e-02, grad_scale: 32.0 2023-12-04 04:02:39,474 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.81 vs. limit=22.5 2023-12-04 04:02:45,389 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=112400.0, ans=0.1 2023-12-04 04:02:50,659 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=112466.66666666667, ans=0.2 2023-12-04 04:02:56,941 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=112466.66666666667, ans=0.2 2023-12-04 04:03:07,523 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=112533.33333333333, ans=0.125 2023-12-04 04:03:19,460 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=112600.0, ans=0.0 2023-12-04 04:03:25,950 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=112666.66666666667, ans=0.1 2023-12-04 04:03:32,478 INFO [train.py:1087] (1/4) Epoch 19, batch 800, loss[loss=0.1884, simple_loss=0.2748, pruned_loss=0.05096, over 24727.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2727, pruned_loss=0.04898, over 4718218.25 frames. ], batch size: 67, lr: 1.25e-02, grad_scale: 32.0 2023-12-04 04:03:40,501 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.182e+02 1.431e+02 1.569e+02 1.763e+02 3.495e+02, threshold=3.139e+02, percent-clipped=1.0 2023-12-04 04:03:45,089 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=112800.0, ans=0.2 2023-12-04 04:03:50,313 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.94 vs. limit=15.0 2023-12-04 04:03:52,543 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=2.97 vs. limit=15.0 2023-12-04 04:04:14,341 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.78 vs. limit=10.0 2023-12-04 04:04:17,050 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=113000.0, ans=0.05 2023-12-04 04:04:19,142 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=113000.0, ans=0.2 2023-12-04 04:04:23,889 INFO [train.py:1087] (1/4) Epoch 19, batch 850, loss[loss=0.1754, simple_loss=0.2663, pruned_loss=0.04228, over 24547.00 frames. ], tot_loss[loss=0.1853, simple_loss=0.2727, pruned_loss=0.049, over 4726525.58 frames. ], batch size: 62, lr: 1.25e-02, grad_scale: 32.0 2023-12-04 04:04:44,126 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=113200.0, ans=0.1 2023-12-04 04:04:49,774 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.11 vs. 
limit=15.0 2023-12-04 04:05:04,843 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=113333.33333333333, ans=0.125 2023-12-04 04:05:26,255 INFO [train.py:1087] (1/4) Epoch 20, batch 0, loss[loss=0.182, simple_loss=0.2729, pruned_loss=0.04555, over 21430.00 frames. ], tot_loss[loss=0.182, simple_loss=0.2729, pruned_loss=0.04555, over 21430.00 frames. ], batch size: 128, lr: 1.22e-02, grad_scale: 32.0 2023-12-04 04:05:26,256 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 04:05:38,413 INFO [train.py:1119] (1/4) Epoch 20, validation: loss=0.1617, simple_loss=0.2631, pruned_loss=0.03021, over 944034.00 frames. 2023-12-04 04:05:38,414 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 04:05:50,350 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=113433.33333333333, ans=0.125 2023-12-04 04:05:51,050 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.230e+02 1.506e+02 1.674e+02 1.994e+02 3.219e+02, threshold=3.349e+02, percent-clipped=1.0 2023-12-04 04:05:52,370 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=113433.33333333333, ans=0.1 2023-12-04 04:06:11,500 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.93 vs. limit=15.0 2023-12-04 04:06:23,834 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=113633.33333333333, ans=0.125 2023-12-04 04:06:28,196 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=113633.33333333333, ans=0.0 2023-12-04 04:06:28,531 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.63 vs. limit=22.5 2023-12-04 04:06:30,176 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=113633.33333333333, ans=0.125 2023-12-04 04:06:33,121 INFO [train.py:1087] (1/4) Epoch 20, batch 50, loss[loss=0.1823, simple_loss=0.2699, pruned_loss=0.04734, over 24741.00 frames. ], tot_loss[loss=0.1842, simple_loss=0.2729, pruned_loss=0.04775, over 1088637.98 frames. ], batch size: 63, lr: 1.22e-02, grad_scale: 32.0 2023-12-04 04:07:19,196 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=113966.66666666667, ans=0.0 2023-12-04 04:07:27,377 INFO [train.py:1087] (1/4) Epoch 20, batch 100, loss[loss=0.2254, simple_loss=0.3035, pruned_loss=0.07365, over 16517.00 frames. ], tot_loss[loss=0.1857, simple_loss=0.2739, pruned_loss=0.04877, over 1902234.57 frames. ], batch size: 177, lr: 1.21e-02, grad_scale: 32.0 2023-12-04 04:07:30,926 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. 
limit=15.0 2023-12-04 04:07:41,261 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.273e+02 1.491e+02 1.624e+02 1.858e+02 3.040e+02, threshold=3.248e+02, percent-clipped=0.0 2023-12-04 04:07:50,086 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=114166.66666666667, ans=0.125 2023-12-04 04:07:57,463 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=114166.66666666667, ans=0.125 2023-12-04 04:08:04,442 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=114233.33333333333, ans=0.125 2023-12-04 04:08:13,227 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=114300.0, ans=0.125 2023-12-04 04:08:13,286 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=114300.0, ans=0.1 2023-12-04 04:08:15,326 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=114300.0, ans=0.125 2023-12-04 04:08:18,573 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=114300.0, ans=0.125 2023-12-04 04:08:22,653 INFO [train.py:1087] (1/4) Epoch 20, batch 150, loss[loss=0.1918, simple_loss=0.2743, pruned_loss=0.0547, over 24730.00 frames. ], tot_loss[loss=0.1849, simple_loss=0.2733, pruned_loss=0.0483, over 2555277.02 frames. ], batch size: 67, lr: 1.21e-02, grad_scale: 16.0 2023-12-04 04:08:30,269 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=114366.66666666667, ans=0.125 2023-12-04 04:08:31,309 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=114366.66666666667, ans=0.125 2023-12-04 04:08:51,839 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=114500.0, ans=0.09899494936611666 2023-12-04 04:09:00,608 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=114566.66666666667, ans=0.5 2023-12-04 04:09:14,197 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=114633.33333333333, ans=0.125 2023-12-04 04:09:14,710 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-12-04 04:09:18,242 INFO [train.py:1087] (1/4) Epoch 20, batch 200, loss[loss=0.1647, simple_loss=0.253, pruned_loss=0.03825, over 24721.00 frames. ], tot_loss[loss=0.1831, simple_loss=0.2716, pruned_loss=0.04735, over 3073198.18 frames. ], batch size: 67, lr: 1.21e-02, grad_scale: 16.0 2023-12-04 04:09:32,386 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.200e+02 1.461e+02 1.607e+02 1.753e+02 2.582e+02, threshold=3.215e+02, percent-clipped=0.0 2023-12-04 04:09:34,023 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.69 vs. 
limit=10.0 2023-12-04 04:09:45,279 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-12-04 04:10:00,959 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=114966.66666666667, ans=0.1 2023-12-04 04:10:12,625 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:10:13,391 INFO [train.py:1087] (1/4) Epoch 20, batch 250, loss[loss=0.1816, simple_loss=0.2688, pruned_loss=0.04718, over 24494.00 frames. ], tot_loss[loss=0.184, simple_loss=0.272, pruned_loss=0.04802, over 3445991.48 frames. ], batch size: 77, lr: 1.21e-02, grad_scale: 16.0 2023-12-04 04:10:49,901 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=115233.33333333333, ans=0.125 2023-12-04 04:10:52,091 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=115233.33333333333, ans=0.2 2023-12-04 04:10:54,155 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=115233.33333333333, ans=0.125 2023-12-04 04:10:58,414 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=115300.0, ans=10.0 2023-12-04 04:11:08,292 INFO [train.py:1087] (1/4) Epoch 20, batch 300, loss[loss=0.1862, simple_loss=0.2743, pruned_loss=0.04905, over 24553.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.2717, pruned_loss=0.04799, over 3753211.98 frames. ], batch size: 62, lr: 1.21e-02, grad_scale: 16.0 2023-12-04 04:11:17,730 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=115366.66666666667, ans=0.0 2023-12-04 04:11:22,635 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.290e+02 1.437e+02 1.530e+02 1.741e+02 2.975e+02, threshold=3.060e+02, percent-clipped=0.0 2023-12-04 04:11:22,862 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:11:52,398 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.73 vs. limit=15.0 2023-12-04 04:12:03,183 INFO [train.py:1087] (1/4) Epoch 20, batch 350, loss[loss=0.1742, simple_loss=0.2649, pruned_loss=0.04169, over 24757.00 frames. ], tot_loss[loss=0.1848, simple_loss=0.2722, pruned_loss=0.04875, over 3953226.77 frames. ], batch size: 64, lr: 1.21e-02, grad_scale: 16.0 2023-12-04 04:12:39,003 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=115900.0, ans=0.2 2023-12-04 04:12:58,840 INFO [train.py:1087] (1/4) Epoch 20, batch 400, loss[loss=0.1981, simple_loss=0.2823, pruned_loss=0.05693, over 22839.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.2718, pruned_loss=0.04839, over 4153125.69 frames. 
], batch size: 106, lr: 1.20e-02, grad_scale: 32.0 2023-12-04 04:13:13,100 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.156e+02 1.517e+02 1.725e+02 1.895e+02 2.824e+02, threshold=3.451e+02, percent-clipped=0.0 2023-12-04 04:13:27,765 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=116166.66666666667, ans=0.1 2023-12-04 04:13:28,892 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=116166.66666666667, ans=0.0 2023-12-04 04:13:49,596 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-12-04 04:13:53,311 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=116366.66666666667, ans=0.0 2023-12-04 04:13:54,173 INFO [train.py:1087] (1/4) Epoch 20, batch 450, loss[loss=0.221, simple_loss=0.2928, pruned_loss=0.07463, over 16778.00 frames. ], tot_loss[loss=0.1842, simple_loss=0.272, pruned_loss=0.04816, over 4284210.98 frames. ], batch size: 177, lr: 1.20e-02, grad_scale: 16.0 2023-12-04 04:14:11,461 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=116433.33333333333, ans=0.125 2023-12-04 04:14:13,836 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.31 vs. limit=22.5 2023-12-04 04:14:14,637 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=116500.0, ans=0.1 2023-12-04 04:14:26,497 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=116566.66666666667, ans=0.0 2023-12-04 04:14:27,606 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=116566.66666666667, ans=0.125 2023-12-04 04:14:31,669 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=116566.66666666667, ans=0.125 2023-12-04 04:14:48,756 INFO [train.py:1087] (1/4) Epoch 20, batch 500, loss[loss=0.1883, simple_loss=0.2734, pruned_loss=0.05163, over 24702.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.2721, pruned_loss=0.04825, over 4402276.82 frames. ], batch size: 69, lr: 1.20e-02, grad_scale: 16.0 2023-12-04 04:14:57,824 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=116700.0, ans=0.2 2023-12-04 04:15:03,789 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.217e+02 1.461e+02 1.591e+02 1.740e+02 2.413e+02, threshold=3.181e+02, percent-clipped=0.0 2023-12-04 04:15:30,959 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.65 vs. limit=12.0 2023-12-04 04:15:41,336 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.21 vs. limit=15.0 2023-12-04 04:15:43,008 INFO [train.py:1087] (1/4) Epoch 20, batch 550, loss[loss=0.1665, simple_loss=0.257, pruned_loss=0.03797, over 24565.00 frames. ], tot_loss[loss=0.1843, simple_loss=0.272, pruned_loss=0.04835, over 4479514.24 frames. 
], batch size: 62, lr: 1.20e-02, grad_scale: 8.0 2023-12-04 04:16:39,076 INFO [train.py:1087] (1/4) Epoch 20, batch 600, loss[loss=0.1863, simple_loss=0.2714, pruned_loss=0.05059, over 24493.00 frames. ], tot_loss[loss=0.1845, simple_loss=0.2721, pruned_loss=0.0485, over 4539622.42 frames. ], batch size: 77, lr: 1.20e-02, grad_scale: 8.0 2023-12-04 04:16:43,498 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=117366.66666666667, ans=0.125 2023-12-04 04:16:43,589 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=117366.66666666667, ans=0.0 2023-12-04 04:16:45,627 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=117366.66666666667, ans=0.2 2023-12-04 04:16:55,665 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.211e+02 1.445e+02 1.548e+02 1.707e+02 4.018e+02, threshold=3.095e+02, percent-clipped=1.0 2023-12-04 04:17:09,299 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=117500.0, ans=10.0 2023-12-04 04:17:35,043 INFO [train.py:1087] (1/4) Epoch 20, batch 650, loss[loss=0.1782, simple_loss=0.2699, pruned_loss=0.04321, over 24722.00 frames. ], tot_loss[loss=0.1839, simple_loss=0.2715, pruned_loss=0.04815, over 4596499.21 frames. ], batch size: 69, lr: 1.20e-02, grad_scale: 8.0 2023-12-04 04:17:37,551 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=117700.0, ans=0.1 2023-12-04 04:17:47,003 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=117766.66666666667, ans=0.125 2023-12-04 04:17:50,211 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=117766.66666666667, ans=0.0 2023-12-04 04:18:30,186 INFO [train.py:1087] (1/4) Epoch 20, batch 700, loss[loss=0.177, simple_loss=0.2678, pruned_loss=0.04312, over 24273.00 frames. ], tot_loss[loss=0.183, simple_loss=0.2707, pruned_loss=0.04764, over 4666514.04 frames. ], batch size: 79, lr: 1.20e-02, grad_scale: 8.0 2023-12-04 04:18:31,935 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=118033.33333333333, ans=0.125 2023-12-04 04:18:46,759 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.264e+02 1.613e+02 1.825e+02 2.244e+02 3.207e+02, threshold=3.650e+02, percent-clipped=2.0 2023-12-04 04:18:58,053 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=118166.66666666667, ans=0.0 2023-12-04 04:19:10,458 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=118233.33333333333, ans=0.1 2023-12-04 04:19:10,461 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=118233.33333333333, ans=0.0 2023-12-04 04:19:11,494 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=118233.33333333333, ans=0.125 2023-12-04 04:19:16,439 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.44 vs. 
limit=22.5 2023-12-04 04:19:25,348 INFO [train.py:1087] (1/4) Epoch 20, batch 750, loss[loss=0.1723, simple_loss=0.2614, pruned_loss=0.04163, over 24565.00 frames. ], tot_loss[loss=0.1834, simple_loss=0.2711, pruned_loss=0.04781, over 4675187.37 frames. ], batch size: 64, lr: 1.19e-02, grad_scale: 8.0 2023-12-04 04:19:35,750 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.46 vs. limit=15.0 2023-12-04 04:19:37,394 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=118433.33333333333, ans=0.0 2023-12-04 04:19:42,617 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=118433.33333333333, ans=0.125 2023-12-04 04:19:58,154 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=118566.66666666667, ans=0.0 2023-12-04 04:19:59,482 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=118566.66666666667, ans=0.125 2023-12-04 04:20:02,040 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=118566.66666666667, ans=0.0 2023-12-04 04:20:12,635 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=118633.33333333333, ans=0.5 2023-12-04 04:20:15,255 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.50 vs. limit=15.0 2023-12-04 04:20:17,733 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.82 vs. limit=15.0 2023-12-04 04:20:21,374 INFO [train.py:1087] (1/4) Epoch 20, batch 800, loss[loss=0.1911, simple_loss=0.2793, pruned_loss=0.05145, over 24713.00 frames. ], tot_loss[loss=0.1836, simple_loss=0.2712, pruned_loss=0.048, over 4675893.84 frames. ], batch size: 69, lr: 1.19e-02, grad_scale: 8.0 2023-12-04 04:20:23,650 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=118700.0, ans=0.0 2023-12-04 04:20:28,878 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=118700.0, ans=0.0 2023-12-04 04:20:38,596 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.154e+02 1.491e+02 1.643e+02 1.888e+02 3.122e+02, threshold=3.286e+02, percent-clipped=0.0 2023-12-04 04:20:42,141 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.95 vs. limit=15.0 2023-12-04 04:20:46,942 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=118833.33333333333, ans=0.1 2023-12-04 04:20:47,858 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=118833.33333333333, ans=0.0 2023-12-04 04:20:51,981 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=118900.0, ans=0.125 2023-12-04 04:21:12,872 INFO [train.py:1087] (1/4) Epoch 20, batch 850, loss[loss=0.1874, simple_loss=0.2744, pruned_loss=0.05024, over 24286.00 frames. 
], tot_loss[loss=0.1837, simple_loss=0.2711, pruned_loss=0.04813, over 4705984.84 frames. ], batch size: 82, lr: 1.19e-02, grad_scale: 8.0 2023-12-04 04:21:13,068 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=119033.33333333333, ans=0.0 2023-12-04 04:21:23,328 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=119100.0, ans=0.0 2023-12-04 04:21:26,295 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=119100.0, ans=0.0 2023-12-04 04:21:32,411 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=119166.66666666667, ans=0.0 2023-12-04 04:21:38,915 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.42 vs. limit=10.0 2023-12-04 04:21:46,372 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=119233.33333333333, ans=0.125 2023-12-04 04:21:52,439 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=119300.0, ans=0.2 2023-12-04 04:21:53,533 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=119300.0, ans=0.0 2023-12-04 04:21:54,477 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=119300.0, ans=0.125 2023-12-04 04:22:12,132 INFO [train.py:1087] (1/4) Epoch 21, batch 0, loss[loss=0.1759, simple_loss=0.2656, pruned_loss=0.04311, over 24771.00 frames. ], tot_loss[loss=0.1759, simple_loss=0.2656, pruned_loss=0.04311, over 24771.00 frames. ], batch size: 64, lr: 1.16e-02, grad_scale: 16.0 2023-12-04 04:22:12,133 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 04:22:24,244 INFO [train.py:1119] (1/4) Epoch 21, validation: loss=0.1615, simple_loss=0.2627, pruned_loss=0.03013, over 944034.00 frames. 2023-12-04 04:22:24,245 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 04:22:36,991 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=119400.0, ans=0.1 2023-12-04 04:22:47,218 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.228e+02 1.497e+02 1.650e+02 1.872e+02 2.842e+02, threshold=3.300e+02, percent-clipped=0.0 2023-12-04 04:22:58,444 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=119533.33333333333, ans=0.2 2023-12-04 04:23:02,460 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=119533.33333333333, ans=0.0 2023-12-04 04:23:13,979 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=119600.0, ans=0.0 2023-12-04 04:23:19,153 INFO [train.py:1087] (1/4) Epoch 21, batch 50, loss[loss=0.1994, simple_loss=0.2835, pruned_loss=0.05766, over 24723.00 frames. ], tot_loss[loss=0.1851, simple_loss=0.2726, pruned_loss=0.04883, over 1068307.72 frames. 
], batch size: 69, lr: 1.16e-02, grad_scale: 16.0 2023-12-04 04:23:24,127 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=119666.66666666667, ans=0.0 2023-12-04 04:23:29,574 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.35 vs. limit=15.0 2023-12-04 04:23:35,652 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=119733.33333333333, ans=0.1 2023-12-04 04:23:47,692 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=119800.0, ans=0.125 2023-12-04 04:23:52,351 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=119866.66666666667, ans=0.05 2023-12-04 04:23:53,487 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=119866.66666666667, ans=0.125 2023-12-04 04:24:04,112 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=119933.33333333333, ans=0.2 2023-12-04 04:24:13,795 INFO [train.py:1087] (1/4) Epoch 21, batch 100, loss[loss=0.1697, simple_loss=0.2592, pruned_loss=0.04013, over 24758.00 frames. ], tot_loss[loss=0.1825, simple_loss=0.2706, pruned_loss=0.0472, over 1909509.23 frames. ], batch size: 66, lr: 1.16e-02, grad_scale: 16.0 2023-12-04 04:24:18,462 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=12.0 2023-12-04 04:24:37,169 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.224e+02 1.430e+02 1.557e+02 1.757e+02 2.969e+02, threshold=3.114e+02, percent-clipped=0.0 2023-12-04 04:24:49,903 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=120200.0, ans=0.125 2023-12-04 04:24:52,031 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=120200.0, ans=0.1 2023-12-04 04:25:07,192 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=120266.66666666667, ans=0.07 2023-12-04 04:25:09,002 INFO [train.py:1087] (1/4) Epoch 21, batch 150, loss[loss=0.181, simple_loss=0.2671, pruned_loss=0.04748, over 23482.00 frames. ], tot_loss[loss=0.1813, simple_loss=0.2695, pruned_loss=0.04657, over 2559115.48 frames. 
], batch size: 94, lr: 1.16e-02, grad_scale: 8.0 2023-12-04 04:25:10,304 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=120333.33333333333, ans=0.125 2023-12-04 04:25:29,054 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=120400.0, ans=0.0 2023-12-04 04:25:47,090 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=120533.33333333333, ans=0.2 2023-12-04 04:25:49,750 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=120533.33333333333, ans=0.125 2023-12-04 04:25:52,874 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=120600.0, ans=0.125 2023-12-04 04:26:02,883 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=120600.0, ans=0.0 2023-12-04 04:26:04,730 INFO [train.py:1087] (1/4) Epoch 21, batch 200, loss[loss=0.1622, simple_loss=0.2507, pruned_loss=0.03687, over 24800.00 frames. ], tot_loss[loss=0.1817, simple_loss=0.2697, pruned_loss=0.04687, over 3053042.30 frames. ], batch size: 72, lr: 1.16e-02, grad_scale: 8.0 2023-12-04 04:26:07,589 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.36 vs. limit=15.0 2023-12-04 04:26:12,387 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=120666.66666666667, ans=0.125 2023-12-04 04:26:29,416 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.197e+02 1.469e+02 1.704e+02 1.964e+02 2.913e+02, threshold=3.408e+02, percent-clipped=0.0 2023-12-04 04:26:43,386 INFO [scaling.py:1022] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.34 vs. limit=5.0 2023-12-04 04:26:45,212 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.09 vs. limit=15.0 2023-12-04 04:26:54,981 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=120933.33333333333, ans=0.2 2023-12-04 04:27:00,144 INFO [train.py:1087] (1/4) Epoch 21, batch 250, loss[loss=0.1769, simple_loss=0.2686, pruned_loss=0.04265, over 24707.00 frames. ], tot_loss[loss=0.1817, simple_loss=0.2696, pruned_loss=0.04689, over 3431137.26 frames. 
], batch size: 74, lr: 1.15e-02, grad_scale: 8.0 2023-12-04 04:27:11,285 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=121066.66666666667, ans=0.025 2023-12-04 04:27:19,401 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=121066.66666666667, ans=0.125 2023-12-04 04:27:24,425 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=121133.33333333333, ans=0.125 2023-12-04 04:27:38,120 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=121200.0, ans=0.125 2023-12-04 04:27:56,407 INFO [train.py:1087] (1/4) Epoch 21, batch 300, loss[loss=0.1805, simple_loss=0.2682, pruned_loss=0.04637, over 24783.00 frames. ], tot_loss[loss=0.1815, simple_loss=0.2695, pruned_loss=0.04679, over 3736947.55 frames. ], batch size: 72, lr: 1.15e-02, grad_scale: 8.0 2023-12-04 04:28:20,269 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.254e+02 1.454e+02 1.562e+02 1.743e+02 3.143e+02, threshold=3.125e+02, percent-clipped=0.0 2023-12-04 04:28:22,059 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=121466.66666666667, ans=0.0 2023-12-04 04:28:22,281 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.75 vs. limit=15.0 2023-12-04 04:28:23,006 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=121466.66666666667, ans=0.0 2023-12-04 04:28:42,313 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0 2023-12-04 04:28:46,134 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=121600.0, ans=0.1 2023-12-04 04:28:51,491 INFO [train.py:1087] (1/4) Epoch 21, batch 350, loss[loss=0.1783, simple_loss=0.2629, pruned_loss=0.04678, over 24702.00 frames. ], tot_loss[loss=0.1812, simple_loss=0.2693, pruned_loss=0.04657, over 3976012.82 frames. ], batch size: 69, lr: 1.15e-02, grad_scale: 8.0 2023-12-04 04:28:54,289 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=121666.66666666667, ans=0.125 2023-12-04 04:29:16,691 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.60 vs. limit=15.0 2023-12-04 04:29:19,994 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=15.0 2023-12-04 04:29:34,944 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=121866.66666666667, ans=0.0 2023-12-04 04:29:43,790 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.32 vs. limit=15.0 2023-12-04 04:29:47,546 INFO [train.py:1087] (1/4) Epoch 21, batch 400, loss[loss=0.1731, simple_loss=0.2591, pruned_loss=0.04357, over 24308.00 frames. 
], tot_loss[loss=0.1812, simple_loss=0.2693, pruned_loss=0.04654, over 4156367.71 frames. ], batch size: 79, lr: 1.15e-02, grad_scale: 16.0 2023-12-04 04:29:59,950 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=122066.66666666667, ans=0.125 2023-12-04 04:30:10,559 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=122133.33333333333, ans=0.125 2023-12-04 04:30:12,439 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.189e+02 1.444e+02 1.634e+02 1.974e+02 2.926e+02, threshold=3.268e+02, percent-clipped=0.0 2023-12-04 04:30:40,475 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=122266.66666666667, ans=0.125 2023-12-04 04:30:43,590 INFO [train.py:1087] (1/4) Epoch 21, batch 450, loss[loss=0.1708, simple_loss=0.257, pruned_loss=0.04227, over 24723.00 frames. ], tot_loss[loss=0.1813, simple_loss=0.2695, pruned_loss=0.04654, over 4302918.17 frames. ], batch size: 67, lr: 1.15e-02, grad_scale: 16.0 2023-12-04 04:30:55,621 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=122400.0, ans=0.125 2023-12-04 04:31:01,496 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=122400.0, ans=0.125 2023-12-04 04:31:22,308 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=122533.33333333333, ans=0.0 2023-12-04 04:31:39,266 INFO [train.py:1087] (1/4) Epoch 21, batch 500, loss[loss=0.2249, simple_loss=0.2947, pruned_loss=0.07755, over 17181.00 frames. ], tot_loss[loss=0.1809, simple_loss=0.2691, pruned_loss=0.04636, over 4405829.82 frames. ], batch size: 177, lr: 1.15e-02, grad_scale: 8.0 2023-12-04 04:31:39,437 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=122666.66666666667, ans=0.125 2023-12-04 04:31:57,562 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=122733.33333333333, ans=0.2 2023-12-04 04:32:00,715 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=122800.0, ans=0.0 2023-12-04 04:32:03,939 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.232e+02 1.484e+02 1.598e+02 1.764e+02 4.123e+02, threshold=3.196e+02, percent-clipped=1.0 2023-12-04 04:32:17,125 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.66 vs. limit=10.0 2023-12-04 04:32:20,515 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=6.0 2023-12-04 04:32:21,450 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=9.60 vs. limit=15.0 2023-12-04 04:32:24,303 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=122933.33333333333, ans=0.125 2023-12-04 04:32:25,775 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.04 vs. 
limit=15.0 2023-12-04 04:32:34,082 INFO [train.py:1087] (1/4) Epoch 21, batch 550, loss[loss=0.1767, simple_loss=0.2723, pruned_loss=0.0406, over 24680.00 frames. ], tot_loss[loss=0.1805, simple_loss=0.2689, pruned_loss=0.04603, over 4508578.65 frames. ], batch size: 74, lr: 1.15e-02, grad_scale: 8.0 2023-12-04 04:32:36,818 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=123000.0, ans=0.125 2023-12-04 04:32:48,810 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=123066.66666666667, ans=0.125 2023-12-04 04:32:55,192 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=123133.33333333333, ans=0.0 2023-12-04 04:32:58,477 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=123133.33333333333, ans=0.07 2023-12-04 04:33:17,960 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=123266.66666666667, ans=0.125 2023-12-04 04:33:23,590 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=123266.66666666667, ans=0.125 2023-12-04 04:33:29,583 INFO [train.py:1087] (1/4) Epoch 21, batch 600, loss[loss=0.1979, simple_loss=0.2823, pruned_loss=0.05676, over 24514.00 frames. ], tot_loss[loss=0.1817, simple_loss=0.2697, pruned_loss=0.04687, over 4547461.26 frames. ], batch size: 75, lr: 1.15e-02, grad_scale: 8.0 2023-12-04 04:33:29,787 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=123333.33333333333, ans=0.1 2023-12-04 04:33:47,770 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=2.94 vs. limit=15.0 2023-12-04 04:33:47,773 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.80 vs. limit=22.5 2023-12-04 04:33:49,485 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=123400.0, ans=0.125 2023-12-04 04:33:50,422 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=123400.0, ans=0.1 2023-12-04 04:33:55,481 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.192e+02 1.454e+02 1.634e+02 1.764e+02 2.243e+02, threshold=3.267e+02, percent-clipped=0.0 2023-12-04 04:33:57,276 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=15.0 2023-12-04 04:34:03,041 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=123533.33333333333, ans=0.0 2023-12-04 04:34:25,790 INFO [train.py:1087] (1/4) Epoch 21, batch 650, loss[loss=0.182, simple_loss=0.269, pruned_loss=0.04746, over 24573.00 frames. ], tot_loss[loss=0.1812, simple_loss=0.2693, pruned_loss=0.04656, over 4617908.87 frames. 
], batch size: 65, lr: 1.14e-02, grad_scale: 8.0 2023-12-04 04:34:47,877 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=123800.0, ans=0.125 2023-12-04 04:34:48,908 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:35:06,962 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=123866.66666666667, ans=0.125 2023-12-04 04:35:13,722 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.11 vs. limit=22.5 2023-12-04 04:35:15,349 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=123933.33333333333, ans=0.125 2023-12-04 04:35:22,126 INFO [train.py:1087] (1/4) Epoch 21, batch 700, loss[loss=0.1772, simple_loss=0.2648, pruned_loss=0.04478, over 24767.00 frames. ], tot_loss[loss=0.1816, simple_loss=0.2696, pruned_loss=0.0468, over 4642374.54 frames. ], batch size: 64, lr: 1.14e-02, grad_scale: 8.0 2023-12-04 04:35:22,427 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=124000.0, ans=0.05 2023-12-04 04:35:26,670 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=124000.0, ans=0.09899494936611666 2023-12-04 04:35:36,300 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=124066.66666666667, ans=0.0 2023-12-04 04:35:47,785 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.184e+02 1.416e+02 1.577e+02 1.741e+02 2.814e+02, threshold=3.154e+02, percent-clipped=0.0 2023-12-04 04:36:18,178 INFO [train.py:1087] (1/4) Epoch 21, batch 750, loss[loss=0.1667, simple_loss=0.2582, pruned_loss=0.03755, over 24852.00 frames. ], tot_loss[loss=0.1819, simple_loss=0.2698, pruned_loss=0.04697, over 4680184.14 frames. ], batch size: 68, lr: 1.14e-02, grad_scale: 8.0 2023-12-04 04:36:20,509 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=124333.33333333333, ans=0.0 2023-12-04 04:36:27,246 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=124333.33333333333, ans=0.0 2023-12-04 04:36:29,482 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=124400.0, ans=0.2 2023-12-04 04:36:51,173 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.68 vs. limit=22.5 2023-12-04 04:36:53,918 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=124533.33333333333, ans=0.1 2023-12-04 04:37:13,039 INFO [train.py:1087] (1/4) Epoch 21, batch 800, loss[loss=0.1744, simple_loss=0.2618, pruned_loss=0.04349, over 24579.00 frames. ], tot_loss[loss=0.182, simple_loss=0.27, pruned_loss=0.04705, over 4693497.47 frames. 
], batch size: 65, lr: 1.14e-02, grad_scale: 16.0 2023-12-04 04:37:37,101 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.246e+02 1.466e+02 1.578e+02 1.737e+02 2.758e+02, threshold=3.156e+02, percent-clipped=0.0 2023-12-04 04:37:52,318 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=124866.66666666667, ans=10.0 2023-12-04 04:38:00,466 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=124933.33333333333, ans=0.125 2023-12-04 04:38:04,651 INFO [train.py:1087] (1/4) Epoch 21, batch 850, loss[loss=0.1714, simple_loss=0.2636, pruned_loss=0.03956, over 24863.00 frames. ], tot_loss[loss=0.1817, simple_loss=0.2697, pruned_loss=0.04684, over 4725698.22 frames. ], batch size: 68, lr: 1.14e-02, grad_scale: 16.0 2023-12-04 04:38:25,922 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.04 vs. limit=15.0 2023-12-04 04:38:37,709 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=125200.0, ans=0.0 2023-12-04 04:38:44,881 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=125266.66666666667, ans=0.07 2023-12-04 04:39:02,748 INFO [train.py:1087] (1/4) Epoch 22, batch 0, loss[loss=0.1687, simple_loss=0.2587, pruned_loss=0.0394, over 24776.00 frames. ], tot_loss[loss=0.1687, simple_loss=0.2587, pruned_loss=0.0394, over 24776.00 frames. ], batch size: 70, lr: 1.11e-02, grad_scale: 32.0 2023-12-04 04:39:02,748 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 04:39:15,180 INFO [train.py:1119] (1/4) Epoch 22, validation: loss=0.1596, simple_loss=0.2606, pruned_loss=0.0293, over 944034.00 frames. 2023-12-04 04:39:15,180 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 04:39:25,202 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=125366.66666666667, ans=0.1 2023-12-04 04:39:45,901 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.163e+02 1.445e+02 1.599e+02 1.752e+02 2.849e+02, threshold=3.197e+02, percent-clipped=0.0 2023-12-04 04:39:49,512 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=125500.0, ans=0.0 2023-12-04 04:39:50,587 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=125500.0, ans=0.125 2023-12-04 04:40:03,937 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=125566.66666666667, ans=0.0 2023-12-04 04:40:10,024 INFO [train.py:1087] (1/4) Epoch 22, batch 50, loss[loss=0.1894, simple_loss=0.2751, pruned_loss=0.05184, over 24050.00 frames. ], tot_loss[loss=0.1783, simple_loss=0.2672, pruned_loss=0.04474, over 1093803.69 frames. 
], batch size: 87, lr: 1.11e-02, grad_scale: 32.0 2023-12-04 04:40:38,127 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=125766.66666666667, ans=0.125 2023-12-04 04:40:49,039 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=125833.33333333333, ans=0.125 2023-12-04 04:41:04,965 INFO [train.py:1087] (1/4) Epoch 22, batch 100, loss[loss=0.1737, simple_loss=0.2625, pruned_loss=0.04246, over 24806.00 frames. ], tot_loss[loss=0.179, simple_loss=0.2676, pruned_loss=0.04525, over 1906417.57 frames. ], batch size: 62, lr: 1.11e-02, grad_scale: 32.0 2023-12-04 04:41:08,784 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=125966.66666666667, ans=0.125 2023-12-04 04:41:18,328 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=126033.33333333333, ans=0.1 2023-12-04 04:41:19,603 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=126033.33333333333, ans=0.0 2023-12-04 04:41:19,733 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=126033.33333333333, ans=0.125 2023-12-04 04:41:29,187 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=126100.0, ans=10.0 2023-12-04 04:41:35,558 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.220e+02 1.440e+02 1.549e+02 1.793e+02 2.554e+02, threshold=3.098e+02, percent-clipped=0.0 2023-12-04 04:41:43,099 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=126166.66666666667, ans=0.0 2023-12-04 04:41:53,072 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=126233.33333333333, ans=0.125 2023-12-04 04:41:54,045 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=126233.33333333333, ans=0.125 2023-12-04 04:41:59,323 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=126300.0, ans=0.125 2023-12-04 04:42:00,155 INFO [train.py:1087] (1/4) Epoch 22, batch 150, loss[loss=0.1879, simple_loss=0.2701, pruned_loss=0.05282, over 24484.00 frames. ], tot_loss[loss=0.1798, simple_loss=0.2682, pruned_loss=0.04575, over 2557074.72 frames. ], batch size: 77, lr: 1.11e-02, grad_scale: 32.0 2023-12-04 04:42:08,988 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=126300.0, ans=0.125 2023-12-04 04:42:29,910 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.59 vs. limit=22.5 2023-12-04 04:42:30,741 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.92 vs. 
limit=6.0 2023-12-04 04:42:33,776 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=126500.0, ans=0.1 2023-12-04 04:42:55,948 INFO [train.py:1087] (1/4) Epoch 22, batch 200, loss[loss=0.1701, simple_loss=0.2624, pruned_loss=0.03893, over 24548.00 frames. ], tot_loss[loss=0.18, simple_loss=0.2682, pruned_loss=0.04585, over 3047215.14 frames. ], batch size: 66, lr: 1.11e-02, grad_scale: 16.0 2023-12-04 04:42:56,271 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=126633.33333333333, ans=0.2 2023-12-04 04:43:28,335 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.158e+02 1.405e+02 1.560e+02 1.711e+02 3.260e+02, threshold=3.121e+02, percent-clipped=1.0 2023-12-04 04:43:32,203 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=126833.33333333333, ans=0.125 2023-12-04 04:43:44,702 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=126900.0, ans=0.125 2023-12-04 04:43:51,134 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=126966.66666666667, ans=0.125 2023-12-04 04:43:51,325 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.12 vs. limit=15.0 2023-12-04 04:43:51,866 INFO [train.py:1087] (1/4) Epoch 22, batch 250, loss[loss=0.1786, simple_loss=0.2659, pruned_loss=0.04561, over 24142.00 frames. ], tot_loss[loss=0.1805, simple_loss=0.2689, pruned_loss=0.04612, over 3432557.84 frames. ], batch size: 82, lr: 1.10e-02, grad_scale: 16.0 2023-12-04 04:44:07,472 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:44:08,488 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=127033.33333333333, ans=0.125 2023-12-04 04:44:09,575 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=127033.33333333333, ans=0.125 2023-12-04 04:44:14,516 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=127100.0, ans=0.0 2023-12-04 04:44:42,513 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=127233.33333333333, ans=0.2 2023-12-04 04:44:47,175 INFO [train.py:1087] (1/4) Epoch 22, batch 300, loss[loss=0.2065, simple_loss=0.2825, pruned_loss=0.06525, over 16948.00 frames. ], tot_loss[loss=0.1806, simple_loss=0.269, pruned_loss=0.04611, over 3738898.05 frames. 
], batch size: 177, lr: 1.10e-02, grad_scale: 16.0 2023-12-04 04:45:03,244 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=127366.66666666667, ans=0.2 2023-12-04 04:45:05,322 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=127366.66666666667, ans=0.125 2023-12-04 04:45:06,273 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=127366.66666666667, ans=0.2 2023-12-04 04:45:10,931 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=127433.33333333333, ans=0.125 2023-12-04 04:45:15,157 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=127433.33333333333, ans=0.125 2023-12-04 04:45:18,864 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.295e+02 1.432e+02 1.560e+02 1.813e+02 2.392e+02, threshold=3.120e+02, percent-clipped=0.0 2023-12-04 04:45:39,070 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=127566.66666666667, ans=10.0 2023-12-04 04:45:40,091 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=127633.33333333333, ans=0.125 2023-12-04 04:45:40,898 INFO [train.py:1087] (1/4) Epoch 22, batch 350, loss[loss=0.1707, simple_loss=0.2593, pruned_loss=0.04105, over 24690.00 frames. ], tot_loss[loss=0.1806, simple_loss=0.2692, pruned_loss=0.04601, over 3991814.50 frames. ], batch size: 74, lr: 1.10e-02, grad_scale: 16.0 2023-12-04 04:45:41,245 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=127633.33333333333, ans=0.2 2023-12-04 04:45:42,173 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:45:53,436 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=127700.0, ans=0.0 2023-12-04 04:45:55,574 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=127700.0, ans=0.2 2023-12-04 04:45:59,728 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=127700.0, ans=0.1 2023-12-04 04:46:05,190 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 04:46:18,107 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.09 vs. limit=15.0 2023-12-04 04:46:31,422 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=127900.0, ans=0.1 2023-12-04 04:46:37,415 INFO [train.py:1087] (1/4) Epoch 22, batch 400, loss[loss=0.1678, simple_loss=0.2637, pruned_loss=0.03598, over 24805.00 frames. ], tot_loss[loss=0.1805, simple_loss=0.269, pruned_loss=0.04598, over 4167470.84 frames. 
], batch size: 71, lr: 1.10e-02, grad_scale: 32.0 2023-12-04 04:46:38,714 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=127966.66666666667, ans=0.0 2023-12-04 04:46:51,761 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=128033.33333333333, ans=0.125 2023-12-04 04:47:05,931 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=128100.0, ans=0.0 2023-12-04 04:47:09,770 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.178e+02 1.423e+02 1.544e+02 1.672e+02 2.386e+02, threshold=3.087e+02, percent-clipped=0.0 2023-12-04 04:47:11,157 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=128166.66666666667, ans=0.0 2023-12-04 04:47:17,535 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=128166.66666666667, ans=0.2 2023-12-04 04:47:28,105 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=128233.33333333333, ans=0.035 2023-12-04 04:47:30,252 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=128233.33333333333, ans=0.0 2023-12-04 04:47:33,186 INFO [train.py:1087] (1/4) Epoch 22, batch 450, loss[loss=0.1752, simple_loss=0.2652, pruned_loss=0.04262, over 24433.00 frames. ], tot_loss[loss=0.18, simple_loss=0.2686, pruned_loss=0.04573, over 4315650.44 frames. ], batch size: 77, lr: 1.10e-02, grad_scale: 16.0 2023-12-04 04:47:34,450 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=128300.0, ans=0.0 2023-12-04 04:47:44,478 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=12.0 2023-12-04 04:47:50,830 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=128366.66666666667, ans=0.95 2023-12-04 04:48:06,640 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=128500.0, ans=0.1 2023-12-04 04:48:20,055 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.33 vs. limit=15.0 2023-12-04 04:48:26,096 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=128566.66666666667, ans=0.125 2023-12-04 04:48:29,321 INFO [train.py:1087] (1/4) Epoch 22, batch 500, loss[loss=0.1826, simple_loss=0.2761, pruned_loss=0.04453, over 24754.00 frames. ], tot_loss[loss=0.18, simple_loss=0.2685, pruned_loss=0.04571, over 4422224.95 frames. ], batch size: 70, lr: 1.10e-02, grad_scale: 16.0 2023-12-04 04:48:36,990 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=128633.33333333333, ans=0.125 2023-12-04 04:48:52,166 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.76 vs. 
limit=6.0 2023-12-04 04:49:00,468 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=128766.66666666667, ans=0.125 2023-12-04 04:49:00,529 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=128766.66666666667, ans=0.0 2023-12-04 04:49:02,609 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.157e+02 1.487e+02 1.631e+02 1.827e+02 2.916e+02, threshold=3.262e+02, percent-clipped=0.0 2023-12-04 04:49:15,391 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=128900.0, ans=0.0 2023-12-04 04:49:19,072 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.27 vs. limit=15.0 2023-12-04 04:49:24,078 INFO [train.py:1087] (1/4) Epoch 22, batch 550, loss[loss=0.171, simple_loss=0.2609, pruned_loss=0.04048, over 24552.00 frames. ], tot_loss[loss=0.1797, simple_loss=0.2685, pruned_loss=0.04542, over 4506092.62 frames. ], batch size: 66, lr: 1.10e-02, grad_scale: 16.0 2023-12-04 04:49:33,844 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.12 vs. limit=15.0 2023-12-04 04:49:37,048 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=129033.33333333333, ans=0.125 2023-12-04 04:49:43,498 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=129033.33333333333, ans=0.1 2023-12-04 04:49:48,005 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.10 vs. limit=15.0 2023-12-04 04:50:00,847 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.41 vs. limit=15.0 2023-12-04 04:50:17,902 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=129233.33333333333, ans=0.125 2023-12-04 04:50:19,881 INFO [train.py:1087] (1/4) Epoch 22, batch 600, loss[loss=0.183, simple_loss=0.2715, pruned_loss=0.04719, over 24688.00 frames. ], tot_loss[loss=0.1801, simple_loss=0.2689, pruned_loss=0.04568, over 4558930.79 frames. ], batch size: 74, lr: 1.10e-02, grad_scale: 16.0 2023-12-04 04:50:45,496 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.49 vs. limit=15.0 2023-12-04 04:50:53,448 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.181e+02 1.429e+02 1.565e+02 1.721e+02 2.657e+02, threshold=3.131e+02, percent-clipped=0.0 2023-12-04 04:51:08,485 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=129566.66666666667, ans=0.5 2023-12-04 04:51:10,563 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=129566.66666666667, ans=0.125 2023-12-04 04:51:15,930 INFO [train.py:1087] (1/4) Epoch 22, batch 650, loss[loss=0.1715, simple_loss=0.2623, pruned_loss=0.04037, over 24756.00 frames. ], tot_loss[loss=0.1797, simple_loss=0.2685, pruned_loss=0.04548, over 4622385.77 frames. 
], batch size: 66, lr: 1.09e-02, grad_scale: 16.0 2023-12-04 04:51:16,527 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.26 vs. limit=15.0 2023-12-04 04:51:18,370 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=129633.33333333333, ans=0.125 2023-12-04 04:51:24,701 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=129633.33333333333, ans=0.125 2023-12-04 04:51:45,945 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=129766.66666666667, ans=0.125 2023-12-04 04:51:50,363 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=129833.33333333333, ans=0.125 2023-12-04 04:52:09,968 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=12.0 2023-12-04 04:52:11,362 INFO [train.py:1087] (1/4) Epoch 22, batch 700, loss[loss=0.2062, simple_loss=0.296, pruned_loss=0.05824, over 21231.00 frames. ], tot_loss[loss=0.1793, simple_loss=0.2681, pruned_loss=0.04523, over 4678210.82 frames. ], batch size: 127, lr: 1.09e-02, grad_scale: 16.0 2023-12-04 04:52:13,763 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=129966.66666666667, ans=0.125 2023-12-04 04:52:24,672 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=130033.33333333333, ans=0.0 2023-12-04 04:52:32,071 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=130100.0, ans=0.125 2023-12-04 04:52:37,304 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=130100.0, ans=0.125 2023-12-04 04:52:44,310 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.157e+02 1.399e+02 1.549e+02 1.743e+02 2.457e+02, threshold=3.098e+02, percent-clipped=0.0 2023-12-04 04:52:50,355 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0 2023-12-04 04:52:57,509 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=130233.33333333333, ans=0.1 2023-12-04 04:53:06,048 INFO [train.py:1087] (1/4) Epoch 22, batch 750, loss[loss=0.1767, simple_loss=0.2675, pruned_loss=0.04292, over 24597.00 frames. ], tot_loss[loss=0.1791, simple_loss=0.2679, pruned_loss=0.04518, over 4695215.68 frames. ], batch size: 68, lr: 1.09e-02, grad_scale: 16.0 2023-12-04 04:53:06,232 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=130300.0, ans=0.1 2023-12-04 04:53:18,485 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.05 vs. 
limit=15.0 2023-12-04 04:53:37,155 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=130433.33333333333, ans=0.0 2023-12-04 04:53:38,598 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.59 vs. limit=6.0 2023-12-04 04:54:00,096 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=130633.33333333333, ans=0.125 2023-12-04 04:54:00,951 INFO [train.py:1087] (1/4) Epoch 22, batch 800, loss[loss=0.1802, simple_loss=0.2666, pruned_loss=0.04687, over 24210.00 frames. ], tot_loss[loss=0.1794, simple_loss=0.2681, pruned_loss=0.04537, over 4718226.56 frames. ], batch size: 82, lr: 1.09e-02, grad_scale: 32.0 2023-12-04 04:54:05,838 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=130633.33333333333, ans=0.0 2023-12-04 04:54:13,816 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=130700.0, ans=0.2 2023-12-04 04:54:22,314 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=130766.66666666667, ans=0.125 2023-12-04 04:54:28,387 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=130766.66666666667, ans=0.2 2023-12-04 04:54:33,125 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.151e+02 1.431e+02 1.591e+02 1.794e+02 2.662e+02, threshold=3.181e+02, percent-clipped=0.0 2023-12-04 04:54:35,342 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=130833.33333333333, ans=0.0 2023-12-04 04:54:53,221 INFO [train.py:1087] (1/4) Epoch 22, batch 850, loss[loss=0.1767, simple_loss=0.2622, pruned_loss=0.04558, over 24519.00 frames. ], tot_loss[loss=0.179, simple_loss=0.2677, pruned_loss=0.04511, over 4745398.32 frames. ], batch size: 75, lr: 1.09e-02, grad_scale: 32.0 2023-12-04 04:55:06,702 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.05 vs. limit=15.0 2023-12-04 04:55:12,637 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=131100.0, ans=0.1 2023-12-04 04:55:45,432 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.55 vs. limit=6.0 2023-12-04 04:55:55,154 INFO [train.py:1087] (1/4) Epoch 23, batch 0, loss[loss=0.1889, simple_loss=0.278, pruned_loss=0.04985, over 24094.00 frames. ], tot_loss[loss=0.1889, simple_loss=0.278, pruned_loss=0.04985, over 24094.00 frames. ], batch size: 87, lr: 1.06e-02, grad_scale: 32.0 2023-12-04 04:55:55,155 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 04:56:07,330 INFO [train.py:1119] (1/4) Epoch 23, validation: loss=0.1586, simple_loss=0.2601, pruned_loss=0.02859, over 944034.00 frames. 
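The ScheduledFloat lines that dominate this log report hyperparameters (balancer probabilities, skip rates, dropout probabilities, bypass scale minima) whose values are scheduled as a function of the global batch_count; the ans= field is the value in effect when the line was emitted. Below is a minimal illustrative sketch of such a piecewise-linear, batch-count-keyed schedule. It is an assumption for exposition only, with made-up breakpoints, and is not the actual ScheduledFloat implementation in icefall's scaling.py.

```python
# Illustrative sketch only: a float hyperparameter that follows a
# piecewise-linear schedule keyed on the global batch count, similar in
# spirit to the ScheduledFloat values reported in the log ("ans=...").
# The breakpoints below are hypothetical, chosen for demonstration.
import bisect

class PiecewiseLinearFloat:
    def __init__(self, points):
        # points: list of (batch_count, value) pairs, sorted by batch_count
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def value(self, batch_count):
        # Clamp outside the schedule, interpolate linearly inside it.
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        t = (batch_count - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)

# Hypothetical example: a balancer probability that decays from 0.3 to
# 0.125 over the first 20k batches and then stays flat, which would
# yield the constant ans=0.125 readings seen at batch_count ~125k+.
prob = PiecewiseLinearFloat([(0, 0.3), (20000, 0.125)])
print(prob.value(125766))  # -> 0.125
```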
2023-12-04 04:56:07,331 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 04:56:25,610 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=131333.33333333334, ans=0.07 2023-12-04 04:56:43,993 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=131466.66666666666, ans=0.09899494936611666 2023-12-04 04:56:44,429 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-12-04 04:56:45,823 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.211e+02 1.445e+02 1.626e+02 1.881e+02 3.357e+02, threshold=3.253e+02, percent-clipped=2.0 2023-12-04 04:57:02,812 INFO [train.py:1087] (1/4) Epoch 23, batch 50, loss[loss=0.1714, simple_loss=0.2566, pruned_loss=0.04314, over 23981.00 frames. ], tot_loss[loss=0.1798, simple_loss=0.2682, pruned_loss=0.04567, over 1078293.65 frames. ], batch size: 87, lr: 1.06e-02, grad_scale: 32.0 2023-12-04 04:57:05,294 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=131600.0, ans=0.1 2023-12-04 04:57:11,664 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=131600.0, ans=0.1 2023-12-04 04:57:14,099 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=131666.66666666666, ans=0.125 2023-12-04 04:57:19,414 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=131666.66666666666, ans=0.04949747468305833 2023-12-04 04:57:26,260 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=131733.33333333334, ans=0.1 2023-12-04 04:57:27,707 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=131733.33333333334, ans=0.2 2023-12-04 04:57:42,785 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.04 vs. limit=10.0 2023-12-04 04:57:48,157 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=131866.66666666666, ans=0.025 2023-12-04 04:57:54,962 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.76 vs. limit=15.0 2023-12-04 04:57:57,558 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.31 vs. limit=15.0 2023-12-04 04:57:57,958 INFO [train.py:1087] (1/4) Epoch 23, batch 100, loss[loss=0.2283, simple_loss=0.3054, pruned_loss=0.07565, over 17341.00 frames. ], tot_loss[loss=0.1798, simple_loss=0.2684, pruned_loss=0.04558, over 1888095.23 frames. 
], batch size: 177, lr: 1.06e-02, grad_scale: 32.0 2023-12-04 04:58:00,808 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=131933.33333333334, ans=0.05 2023-12-04 04:58:09,579 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=132000.0, ans=0.0 2023-12-04 04:58:33,564 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.07 vs. limit=15.0 2023-12-04 04:58:37,412 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.142e+02 1.409e+02 1.496e+02 1.675e+02 3.177e+02, threshold=2.991e+02, percent-clipped=0.0 2023-12-04 04:58:51,267 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=132200.0, ans=0.0 2023-12-04 04:58:53,615 INFO [train.py:1087] (1/4) Epoch 23, batch 150, loss[loss=0.1728, simple_loss=0.2617, pruned_loss=0.04198, over 24751.00 frames. ], tot_loss[loss=0.1785, simple_loss=0.2674, pruned_loss=0.04484, over 2550728.60 frames. ], batch size: 70, lr: 1.06e-02, grad_scale: 32.0 2023-12-04 04:59:03,104 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=132266.66666666666, ans=0.5 2023-12-04 04:59:06,832 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.38 vs. limit=12.0 2023-12-04 04:59:07,674 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=132333.33333333334, ans=0.125 2023-12-04 04:59:34,368 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=132466.66666666666, ans=0.125 2023-12-04 04:59:35,540 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=22.5 2023-12-04 04:59:49,245 INFO [train.py:1087] (1/4) Epoch 23, batch 200, loss[loss=0.1659, simple_loss=0.2545, pruned_loss=0.03863, over 24545.00 frames. ], tot_loss[loss=0.1788, simple_loss=0.2677, pruned_loss=0.04492, over 3048955.33 frames. ], batch size: 62, lr: 1.06e-02, grad_scale: 32.0 2023-12-04 05:00:10,270 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.41 vs. limit=15.0 2023-12-04 05:00:27,846 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.207e+02 1.403e+02 1.524e+02 1.738e+02 3.583e+02, threshold=3.048e+02, percent-clipped=1.0 2023-12-04 05:00:40,177 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=132866.66666666666, ans=0.0 2023-12-04 05:00:42,422 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=132866.66666666666, ans=0.125 2023-12-04 05:00:42,793 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.02 vs. limit=15.0 2023-12-04 05:00:45,489 INFO [train.py:1087] (1/4) Epoch 23, batch 250, loss[loss=0.1932, simple_loss=0.2808, pruned_loss=0.05283, over 23140.00 frames. ], tot_loss[loss=0.1786, simple_loss=0.2676, pruned_loss=0.04477, over 3439045.07 frames. 
], batch size: 106, lr: 1.06e-02, grad_scale: 32.0 2023-12-04 05:00:51,204 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=132933.33333333334, ans=0.0 2023-12-04 05:00:54,398 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=132933.33333333334, ans=0.125 2023-12-04 05:01:18,733 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=133133.33333333334, ans=0.05 2023-12-04 05:01:20,119 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.67 vs. limit=15.0 2023-12-04 05:01:25,006 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=133133.33333333334, ans=0.1 2023-12-04 05:01:27,402 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=133133.33333333334, ans=0.0 2023-12-04 05:01:30,619 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=133200.0, ans=0.125 2023-12-04 05:01:33,803 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=133200.0, ans=0.1 2023-12-04 05:01:34,951 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 05:01:40,332 INFO [train.py:1087] (1/4) Epoch 23, batch 300, loss[loss=0.1618, simple_loss=0.2459, pruned_loss=0.03884, over 24729.00 frames. ], tot_loss[loss=0.1777, simple_loss=0.2668, pruned_loss=0.04427, over 3762249.30 frames. ], batch size: 67, lr: 1.06e-02, grad_scale: 32.0 2023-12-04 05:01:46,914 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.21 vs. limit=15.0 2023-12-04 05:01:50,818 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=133333.33333333334, ans=0.125 2023-12-04 05:02:05,883 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=133400.0, ans=0.125 2023-12-04 05:02:16,750 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=133466.66666666666, ans=0.1 2023-12-04 05:02:21,686 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.188e+02 1.411e+02 1.532e+02 1.698e+02 2.321e+02, threshold=3.065e+02, percent-clipped=0.0 2023-12-04 05:02:29,531 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=133533.33333333334, ans=0.125 2023-12-04 05:02:30,543 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=133533.33333333334, ans=10.0 2023-12-04 05:02:37,644 INFO [train.py:1087] (1/4) Epoch 23, batch 350, loss[loss=0.1806, simple_loss=0.2762, pruned_loss=0.04254, over 22989.00 frames. ], tot_loss[loss=0.1779, simple_loss=0.2669, pruned_loss=0.04442, over 3992600.64 frames. 
], batch size: 106, lr: 1.06e-02, grad_scale: 32.0 2023-12-04 05:02:37,859 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=133600.0, ans=0.125 2023-12-04 05:02:40,014 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=133600.0, ans=0.1 2023-12-04 05:02:51,581 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=133666.66666666666, ans=0.0 2023-12-04 05:02:57,237 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=12.0 2023-12-04 05:02:57,915 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=133666.66666666666, ans=0.2 2023-12-04 05:03:03,159 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=133733.33333333334, ans=0.125 2023-12-04 05:03:14,217 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=133800.0, ans=0.1 2023-12-04 05:03:19,074 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=133800.0, ans=0.1 2023-12-04 05:03:32,709 INFO [train.py:1087] (1/4) Epoch 23, batch 400, loss[loss=0.1755, simple_loss=0.263, pruned_loss=0.04404, over 24176.00 frames. ], tot_loss[loss=0.1783, simple_loss=0.2672, pruned_loss=0.0447, over 4166636.20 frames. ], batch size: 82, lr: 1.05e-02, grad_scale: 32.0 2023-12-04 05:03:36,354 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.10 vs. limit=15.0 2023-12-04 05:04:12,107 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.207e+02 1.404e+02 1.564e+02 1.778e+02 2.505e+02, threshold=3.128e+02, percent-clipped=0.0 2023-12-04 05:04:17,504 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=134200.0, ans=0.1 2023-12-04 05:04:24,133 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.27 vs. limit=6.0 2023-12-04 05:04:28,732 INFO [train.py:1087] (1/4) Epoch 23, batch 450, loss[loss=0.186, simple_loss=0.2738, pruned_loss=0.04914, over 24766.00 frames. ], tot_loss[loss=0.1782, simple_loss=0.2671, pruned_loss=0.04469, over 4306115.89 frames. ], batch size: 70, lr: 1.05e-02, grad_scale: 32.0 2023-12-04 05:04:50,040 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.16 vs. limit=15.0 2023-12-04 05:05:04,233 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=134466.66666666666, ans=0.0 2023-12-04 05:05:11,647 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=134533.33333333334, ans=0.0 2023-12-04 05:05:24,236 INFO [train.py:1087] (1/4) Epoch 23, batch 500, loss[loss=0.1833, simple_loss=0.2732, pruned_loss=0.04671, over 23536.00 frames. ], tot_loss[loss=0.1784, simple_loss=0.2671, pruned_loss=0.04483, over 4400130.84 frames. 
], batch size: 94, lr: 1.05e-02, grad_scale: 16.0 2023-12-04 05:05:28,457 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0 2023-12-04 05:05:42,083 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=134666.66666666666, ans=0.125 2023-12-04 05:05:50,412 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.79 vs. limit=15.0 2023-12-04 05:06:05,592 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.182e+02 1.442e+02 1.601e+02 1.815e+02 2.698e+02, threshold=3.202e+02, percent-clipped=0.0 2023-12-04 05:06:07,265 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.44 vs. limit=15.0 2023-12-04 05:06:14,448 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=134866.66666666666, ans=0.125 2023-12-04 05:06:19,790 INFO [train.py:1087] (1/4) Epoch 23, batch 550, loss[loss=0.1927, simple_loss=0.2812, pruned_loss=0.05206, over 24126.00 frames. ], tot_loss[loss=0.1781, simple_loss=0.2669, pruned_loss=0.04464, over 4506408.90 frames. ], batch size: 58, lr: 1.05e-02, grad_scale: 16.0 2023-12-04 05:06:24,122 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.72 vs. limit=12.0 2023-12-04 05:06:41,209 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=135066.66666666666, ans=0.125 2023-12-04 05:06:48,677 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=135066.66666666666, ans=0.125 2023-12-04 05:06:49,656 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=135066.66666666666, ans=0.125 2023-12-04 05:06:57,180 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=135133.33333333334, ans=0.125 2023-12-04 05:06:59,305 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=135133.33333333334, ans=0.2 2023-12-04 05:07:00,679 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=15.0 2023-12-04 05:07:09,983 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=135200.0, ans=0.0 2023-12-04 05:07:15,072 INFO [train.py:1087] (1/4) Epoch 23, batch 600, loss[loss=0.1758, simple_loss=0.2608, pruned_loss=0.0454, over 23974.00 frames. ], tot_loss[loss=0.1778, simple_loss=0.2666, pruned_loss=0.04451, over 4584679.34 frames. 
], batch size: 87, lr: 1.05e-02, grad_scale: 8.0 2023-12-04 05:07:19,478 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=135266.66666666666, ans=0.1 2023-12-04 05:07:33,135 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=3.952e-03 2023-12-04 05:07:33,178 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=135333.33333333334, ans=0.95 2023-12-04 05:07:57,304 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.178e+02 1.383e+02 1.459e+02 1.628e+02 2.950e+02, threshold=2.918e+02, percent-clipped=0.0 2023-12-04 05:08:00,105 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=135533.33333333334, ans=0.0 2023-12-04 05:08:02,264 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 05:08:10,607 INFO [train.py:1087] (1/4) Epoch 23, batch 650, loss[loss=0.1835, simple_loss=0.2742, pruned_loss=0.04646, over 24785.00 frames. ], tot_loss[loss=0.1782, simple_loss=0.2669, pruned_loss=0.04477, over 4635117.47 frames. ], batch size: 73, lr: 1.05e-02, grad_scale: 8.0 2023-12-04 05:08:17,629 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=135600.0, ans=0.1 2023-12-04 05:08:32,816 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.35 vs. limit=22.5 2023-12-04 05:08:50,588 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 05:08:51,676 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=135800.0, ans=0.0 2023-12-04 05:09:01,263 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=135866.66666666666, ans=0.125 2023-12-04 05:09:03,392 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=135866.66666666666, ans=0.125 2023-12-04 05:09:06,363 INFO [train.py:1087] (1/4) Epoch 23, batch 700, loss[loss=0.1762, simple_loss=0.2619, pruned_loss=0.04525, over 24554.00 frames. ], tot_loss[loss=0.1774, simple_loss=0.2661, pruned_loss=0.04436, over 4692907.37 frames. ], batch size: 63, lr: 1.05e-02, grad_scale: 8.0 2023-12-04 05:09:13,021 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=135933.33333333334, ans=0.0 2023-12-04 05:09:14,117 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=135933.33333333334, ans=0.125 2023-12-04 05:09:25,708 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.01 vs. 
limit=12.0 2023-12-04 05:09:31,182 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=136066.66666666666, ans=0.1 2023-12-04 05:09:42,621 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=136133.33333333334, ans=0.2 2023-12-04 05:09:43,564 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=136133.33333333334, ans=0.0 2023-12-04 05:09:48,658 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.220e+02 1.387e+02 1.558e+02 1.692e+02 2.997e+02, threshold=3.117e+02, percent-clipped=1.0 2023-12-04 05:10:02,092 INFO [train.py:1087] (1/4) Epoch 23, batch 750, loss[loss=0.1662, simple_loss=0.2561, pruned_loss=0.03817, over 24599.00 frames. ], tot_loss[loss=0.1776, simple_loss=0.2663, pruned_loss=0.04443, over 4726123.39 frames. ], batch size: 68, lr: 1.05e-02, grad_scale: 8.0 2023-12-04 05:10:02,355 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=136266.66666666666, ans=0.0 2023-12-04 05:10:27,781 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=136400.0, ans=0.0 2023-12-04 05:10:37,205 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.24 vs. limit=15.0 2023-12-04 05:10:51,809 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=136533.33333333334, ans=0.0 2023-12-04 05:10:56,919 INFO [train.py:1087] (1/4) Epoch 23, batch 800, loss[loss=0.1828, simple_loss=0.2714, pruned_loss=0.04707, over 24737.00 frames. ], tot_loss[loss=0.1773, simple_loss=0.2661, pruned_loss=0.04422, over 4755417.64 frames. ], batch size: 61, lr: 1.05e-02, grad_scale: 16.0 2023-12-04 05:11:07,345 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=136666.66666666666, ans=0.0 2023-12-04 05:11:08,310 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=136666.66666666666, ans=0.0 2023-12-04 05:11:37,275 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.190e+02 1.432e+02 1.525e+02 1.721e+02 2.600e+02, threshold=3.050e+02, percent-clipped=0.0 2023-12-04 05:11:43,438 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=136866.66666666666, ans=0.95 2023-12-04 05:11:45,347 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=136866.66666666666, ans=0.125 2023-12-04 05:11:49,174 INFO [train.py:1087] (1/4) Epoch 23, batch 850, loss[loss=0.1846, simple_loss=0.2728, pruned_loss=0.04824, over 24790.00 frames. ], tot_loss[loss=0.1774, simple_loss=0.2662, pruned_loss=0.04426, over 4776170.85 frames. 
], batch size: 62, lr: 1.04e-02, grad_scale: 16.0 2023-12-04 05:11:51,308 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=136933.33333333334, ans=0.125 2023-12-04 05:11:56,265 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=136933.33333333334, ans=0.0 2023-12-04 05:12:51,367 INFO [train.py:1087] (1/4) Epoch 24, batch 0, loss[loss=0.1683, simple_loss=0.2605, pruned_loss=0.03807, over 24536.00 frames. ], tot_loss[loss=0.1683, simple_loss=0.2605, pruned_loss=0.03807, over 24536.00 frames. ], batch size: 63, lr: 1.02e-02, grad_scale: 32.0 2023-12-04 05:12:51,368 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 05:13:03,461 INFO [train.py:1119] (1/4) Epoch 24, validation: loss=0.1585, simple_loss=0.2596, pruned_loss=0.02867, over 944034.00 frames. 2023-12-04 05:13:03,462 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 05:13:08,862 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=137233.33333333334, ans=0.2 2023-12-04 05:13:22,586 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=137300.0, ans=0.125 2023-12-04 05:13:22,645 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 05:13:43,983 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=137433.33333333334, ans=0.125 2023-12-04 05:13:46,211 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=137500.0, ans=0.125 2023-12-04 05:13:50,537 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.171e+02 1.490e+02 1.634e+02 1.835e+02 3.241e+02, threshold=3.268e+02, percent-clipped=2.0 2023-12-04 05:13:58,651 INFO [train.py:1087] (1/4) Epoch 24, batch 50, loss[loss=0.1712, simple_loss=0.2606, pruned_loss=0.04092, over 24542.00 frames. ], tot_loss[loss=0.1783, simple_loss=0.2671, pruned_loss=0.04472, over 1088112.88 frames. ], batch size: 66, lr: 1.02e-02, grad_scale: 32.0 2023-12-04 05:14:13,595 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=137633.33333333334, ans=0.0 2023-12-04 05:14:17,914 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=15.0 2023-12-04 05:14:35,026 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=137766.66666666666, ans=0.2 2023-12-04 05:14:39,321 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=137766.66666666666, ans=0.05 2023-12-04 05:14:53,181 INFO [train.py:1087] (1/4) Epoch 24, batch 100, loss[loss=0.1593, simple_loss=0.2466, pruned_loss=0.03599, over 24754.00 frames. ], tot_loss[loss=0.1776, simple_loss=0.2663, pruned_loss=0.04439, over 1921122.84 frames. 
], batch size: 66, lr: 1.02e-02, grad_scale: 16.0 2023-12-04 05:14:59,224 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=137900.0, ans=0.125 2023-12-04 05:15:06,099 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=137966.66666666666, ans=0.125 2023-12-04 05:15:21,927 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=138033.33333333334, ans=0.125 2023-12-04 05:15:27,969 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=138100.0, ans=0.2 2023-12-04 05:15:39,109 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.44 vs. limit=15.0 2023-12-04 05:15:41,777 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.250e+02 1.417e+02 1.577e+02 1.700e+02 2.458e+02, threshold=3.154e+02, percent-clipped=0.0 2023-12-04 05:15:44,480 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.30 vs. limit=15.0 2023-12-04 05:15:48,294 INFO [train.py:1087] (1/4) Epoch 24, batch 150, loss[loss=0.1733, simple_loss=0.2653, pruned_loss=0.04071, over 24755.00 frames. ], tot_loss[loss=0.1775, simple_loss=0.2666, pruned_loss=0.04421, over 2554862.54 frames. ], batch size: 70, lr: 1.02e-02, grad_scale: 16.0 2023-12-04 05:15:48,970 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-12-04 05:15:59,519 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.67 vs. limit=22.5 2023-12-04 05:16:16,607 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=138366.66666666666, ans=0.1 2023-12-04 05:16:22,846 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=138433.33333333334, ans=0.1 2023-12-04 05:16:23,788 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=138433.33333333334, ans=0.125 2023-12-04 05:16:32,330 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.69 vs. limit=15.0 2023-12-04 05:16:36,219 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=138500.0, ans=0.0 2023-12-04 05:16:43,689 INFO [train.py:1087] (1/4) Epoch 24, batch 200, loss[loss=0.1792, simple_loss=0.2678, pruned_loss=0.04533, over 24576.00 frames. ], tot_loss[loss=0.1774, simple_loss=0.2665, pruned_loss=0.04414, over 3067397.98 frames. 
], batch size: 65, lr: 1.02e-02, grad_scale: 16.0 2023-12-04 05:16:44,923 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=138566.66666666666, ans=0.0 2023-12-04 05:16:48,127 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=138566.66666666666, ans=0.0 2023-12-04 05:16:48,534 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.93 vs. limit=15.0 2023-12-04 05:17:04,390 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.78 vs. limit=15.0 2023-12-04 05:17:10,883 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.26 vs. limit=15.0 2023-12-04 05:17:20,457 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=138766.66666666666, ans=0.125 2023-12-04 05:17:21,923 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.27 vs. limit=6.0 2023-12-04 05:17:33,089 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.176e+02 1.351e+02 1.474e+02 1.677e+02 2.689e+02, threshold=2.947e+02, percent-clipped=0.0 2023-12-04 05:17:39,399 INFO [train.py:1087] (1/4) Epoch 24, batch 250, loss[loss=0.1685, simple_loss=0.2572, pruned_loss=0.03991, over 24612.00 frames. ], tot_loss[loss=0.1768, simple_loss=0.2661, pruned_loss=0.04373, over 3457255.31 frames. ], batch size: 68, lr: 1.02e-02, grad_scale: 16.0 2023-12-04 05:17:42,860 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=138900.0, ans=0.125 2023-12-04 05:17:53,928 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=138966.66666666666, ans=0.0 2023-12-04 05:17:54,853 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=138966.66666666666, ans=0.2 2023-12-04 05:18:04,110 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=139033.33333333334, ans=0.0 2023-12-04 05:18:07,810 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=15.0 2023-12-04 05:18:35,187 INFO [train.py:1087] (1/4) Epoch 24, batch 300, loss[loss=0.1835, simple_loss=0.2698, pruned_loss=0.04862, over 23875.00 frames. ], tot_loss[loss=0.1768, simple_loss=0.2661, pruned_loss=0.04374, over 3761993.23 frames. 
], batch size: 87, lr: 1.01e-02, grad_scale: 16.0 2023-12-04 05:18:42,954 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 05:18:51,448 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=139300.0, ans=0.0 2023-12-04 05:18:54,522 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=139300.0, ans=0.125 2023-12-04 05:18:57,401 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.56 vs. limit=15.0 2023-12-04 05:18:59,223 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=139366.66666666666, ans=0.125 2023-12-04 05:19:00,307 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=139366.66666666666, ans=0.1 2023-12-04 05:19:12,772 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=139433.33333333334, ans=0.1 2023-12-04 05:19:17,915 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=139500.0, ans=0.125 2023-12-04 05:19:23,149 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.186e+02 1.392e+02 1.510e+02 1.643e+02 2.800e+02, threshold=3.020e+02, percent-clipped=0.0 2023-12-04 05:19:26,742 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=139500.0, ans=0.2 2023-12-04 05:19:30,016 INFO [train.py:1087] (1/4) Epoch 24, batch 350, loss[loss=0.1917, simple_loss=0.2828, pruned_loss=0.05033, over 24609.00 frames. ], tot_loss[loss=0.1777, simple_loss=0.267, pruned_loss=0.04424, over 3978417.05 frames. ], batch size: 68, lr: 1.01e-02, grad_scale: 16.0 2023-12-04 05:19:36,602 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.29 vs. limit=15.0 2023-12-04 05:19:59,485 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=139700.0, ans=0.2 2023-12-04 05:20:24,712 INFO [train.py:1087] (1/4) Epoch 24, batch 400, loss[loss=0.1886, simple_loss=0.2726, pruned_loss=0.05223, over 24478.00 frames. ], tot_loss[loss=0.1777, simple_loss=0.2669, pruned_loss=0.04424, over 4169993.68 frames. ], batch size: 75, lr: 1.01e-02, grad_scale: 32.0 2023-12-04 05:20:28,238 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.37 vs. limit=22.5 2023-12-04 05:20:28,492 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-12-04 05:20:53,903 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=140033.33333333334, ans=0.2 2023-12-04 05:20:56,342 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.32 vs. 
limit=15.0 2023-12-04 05:20:58,322 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=140100.0, ans=0.125 2023-12-04 05:21:14,176 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.193e+02 1.386e+02 1.500e+02 1.610e+02 2.229e+02, threshold=2.999e+02, percent-clipped=0.0 2023-12-04 05:21:20,586 INFO [train.py:1087] (1/4) Epoch 24, batch 450, loss[loss=0.1744, simple_loss=0.2641, pruned_loss=0.04234, over 24608.00 frames. ], tot_loss[loss=0.1772, simple_loss=0.2662, pruned_loss=0.04409, over 4320688.69 frames. ], batch size: 68, lr: 1.01e-02, grad_scale: 32.0 2023-12-04 05:21:52,028 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=140366.66666666666, ans=0.1 2023-12-04 05:22:00,969 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.26 vs. limit=15.0 2023-12-04 05:22:16,263 INFO [train.py:1087] (1/4) Epoch 24, batch 500, loss[loss=0.1749, simple_loss=0.2613, pruned_loss=0.0443, over 24551.00 frames. ], tot_loss[loss=0.1775, simple_loss=0.2664, pruned_loss=0.0443, over 4426154.21 frames. ], batch size: 62, lr: 1.01e-02, grad_scale: 32.0 2023-12-04 05:22:17,560 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=140566.66666666666, ans=0.125 2023-12-04 05:22:28,412 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=140633.33333333334, ans=0.125 2023-12-04 05:22:33,669 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=140633.33333333334, ans=0.125 2023-12-04 05:22:41,924 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=140700.0, ans=0.125 2023-12-04 05:22:45,144 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=140700.0, ans=0.125 2023-12-04 05:23:00,069 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=140833.33333333334, ans=0.2 2023-12-04 05:23:04,030 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.391e+02 1.542e+02 1.716e+02 2.195e+02, threshold=3.083e+02, percent-clipped=0.0 2023-12-04 05:23:05,337 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=140833.33333333334, ans=0.125 2023-12-04 05:23:11,416 INFO [train.py:1087] (1/4) Epoch 24, batch 550, loss[loss=0.1776, simple_loss=0.2664, pruned_loss=0.04442, over 24782.00 frames. ], tot_loss[loss=0.1771, simple_loss=0.266, pruned_loss=0.04406, over 4505006.29 frames. ], batch size: 73, lr: 1.01e-02, grad_scale: 32.0 2023-12-04 05:23:25,159 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.27 vs. limit=15.0 2023-12-04 05:24:04,081 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=141166.66666666666, ans=0.1 2023-12-04 05:24:06,076 INFO [train.py:1087] (1/4) Epoch 24, batch 600, loss[loss=0.1891, simple_loss=0.2774, pruned_loss=0.05043, over 23989.00 frames. 
], tot_loss[loss=0.177, simple_loss=0.2661, pruned_loss=0.04396, over 4579242.75 frames. ], batch size: 87, lr: 1.01e-02, grad_scale: 32.0 2023-12-04 05:24:06,426 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=141233.33333333334, ans=0.2 2023-12-04 05:24:14,557 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=141233.33333333334, ans=0.2 2023-12-04 05:24:32,138 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=141366.66666666666, ans=0.125 2023-12-04 05:24:38,517 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=141433.33333333334, ans=0.125 2023-12-04 05:24:55,384 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.225e+02 1.415e+02 1.530e+02 1.698e+02 2.648e+02, threshold=3.060e+02, percent-clipped=0.0 2023-12-04 05:25:01,191 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.42 vs. limit=15.0 2023-12-04 05:25:01,795 INFO [train.py:1087] (1/4) Epoch 24, batch 650, loss[loss=0.1624, simple_loss=0.2504, pruned_loss=0.03714, over 24581.00 frames. ], tot_loss[loss=0.1773, simple_loss=0.2662, pruned_loss=0.04422, over 4613779.40 frames. ], batch size: 64, lr: 1.01e-02, grad_scale: 32.0 2023-12-04 05:25:07,745 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-12-04 05:25:09,550 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=141566.66666666666, ans=0.0 2023-12-04 05:25:57,577 INFO [train.py:1087] (1/4) Epoch 24, batch 700, loss[loss=0.1716, simple_loss=0.2631, pruned_loss=0.04, over 24573.00 frames. ], tot_loss[loss=0.1769, simple_loss=0.266, pruned_loss=0.04392, over 4668784.42 frames. ], batch size: 64, lr: 1.01e-02, grad_scale: 32.0 2023-12-04 05:26:05,165 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=141900.0, ans=0.1 2023-12-04 05:26:09,733 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.42 vs. limit=15.0 2023-12-04 05:26:26,445 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=142033.33333333334, ans=0.125 2023-12-04 05:26:38,682 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=142100.0, ans=0.0 2023-12-04 05:26:44,410 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.49 vs. limit=15.0 2023-12-04 05:26:47,018 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.211e+02 1.394e+02 1.531e+02 1.673e+02 2.876e+02, threshold=3.062e+02, percent-clipped=0.0 2023-12-04 05:26:53,145 INFO [train.py:1087] (1/4) Epoch 24, batch 750, loss[loss=0.1764, simple_loss=0.2704, pruned_loss=0.04122, over 24562.00 frames. ], tot_loss[loss=0.1763, simple_loss=0.2655, pruned_loss=0.04359, over 4694735.14 frames. 
], batch size: 62, lr: 1.01e-02, grad_scale: 16.0 2023-12-04 05:27:14,011 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=142366.66666666666, ans=0.2 2023-12-04 05:27:23,319 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=142366.66666666666, ans=0.0 2023-12-04 05:27:27,558 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=142433.33333333334, ans=0.1 2023-12-04 05:27:28,102 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.49 vs. limit=12.0 2023-12-04 05:27:28,654 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 05:27:34,480 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=142433.33333333334, ans=0.05 2023-12-04 05:27:47,857 INFO [train.py:1087] (1/4) Epoch 24, batch 800, loss[loss=0.1758, simple_loss=0.2673, pruned_loss=0.04218, over 24567.00 frames. ], tot_loss[loss=0.1765, simple_loss=0.2656, pruned_loss=0.04374, over 4720862.92 frames. ], batch size: 65, lr: 1.00e-02, grad_scale: 32.0 2023-12-04 05:27:48,065 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=142566.66666666666, ans=0.2 2023-12-04 05:28:13,527 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=142700.0, ans=0.2 2023-12-04 05:28:30,477 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=142833.33333333334, ans=0.0 2023-12-04 05:28:31,595 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=142833.33333333334, ans=0.2 2023-12-04 05:28:32,506 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=142833.33333333334, ans=0.125 2023-12-04 05:28:32,525 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=142833.33333333334, ans=0.125 2023-12-04 05:28:34,641 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.200e+02 1.402e+02 1.503e+02 1.645e+02 2.328e+02, threshold=3.007e+02, percent-clipped=0.0 2023-12-04 05:28:38,138 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=142833.33333333334, ans=22.5 2023-12-04 05:28:39,618 INFO [train.py:1087] (1/4) Epoch 24, batch 850, loss[loss=0.169, simple_loss=0.2589, pruned_loss=0.03954, over 24738.00 frames. ], tot_loss[loss=0.1771, simple_loss=0.2658, pruned_loss=0.04413, over 4720676.92 frames. ], batch size: 63, lr: 1.00e-02, grad_scale: 32.0 2023-12-04 05:28:42,788 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=142900.0, ans=0.0 2023-12-04 05:28:42,951 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.41 vs. 
limit=15.0 2023-12-04 05:28:43,796 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=142900.0, ans=0.125 2023-12-04 05:28:47,636 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=142900.0, ans=0.1 2023-12-04 05:28:53,611 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=142966.66666666666, ans=0.125 2023-12-04 05:29:08,386 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=143100.0, ans=0.125 2023-12-04 05:29:08,674 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.23 vs. limit=15.0 2023-12-04 05:29:21,566 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=143166.66666666666, ans=0.125 2023-12-04 05:29:39,884 INFO [train.py:1087] (1/4) Epoch 25, batch 0, loss[loss=0.1638, simple_loss=0.2583, pruned_loss=0.03464, over 24784.00 frames. ], tot_loss[loss=0.1638, simple_loss=0.2583, pruned_loss=0.03464, over 24784.00 frames. ], batch size: 71, lr: 9.81e-03, grad_scale: 32.0 2023-12-04 05:29:39,885 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 05:29:52,067 INFO [train.py:1119] (1/4) Epoch 25, validation: loss=0.1569, simple_loss=0.258, pruned_loss=0.02794, over 944034.00 frames. 2023-12-04 05:29:52,068 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 05:30:11,195 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.40 vs. limit=15.0 2023-12-04 05:30:26,595 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.23 vs. limit=15.0 2023-12-04 05:30:30,664 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=15.0 2023-12-04 05:30:35,788 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=143466.66666666666, ans=0.125 2023-12-04 05:30:40,432 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=143466.66666666666, ans=0.0 2023-12-04 05:30:47,249 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.087e+02 1.371e+02 1.534e+02 1.745e+02 2.652e+02, threshold=3.068e+02, percent-clipped=0.0 2023-12-04 05:30:47,276 INFO [train.py:1087] (1/4) Epoch 25, batch 50, loss[loss=0.1677, simple_loss=0.2596, pruned_loss=0.03788, over 24117.00 frames. ], tot_loss[loss=0.1741, simple_loss=0.2636, pruned_loss=0.0423, over 1077976.30 frames. 
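Both the running tot_loss and the validation loss above are reported "over N frames", and at batch 0 of an epoch tot_loss coincides with that single batch's loss, which is exactly what a frame-weighted average would give. A minimal sketch of that kind of accumulator is shown below; it is illustrative only and not the MetricsTracker code in train.py.

```python
class FrameWeightedLoss:
    """Accumulate a loss reported as an average over frames, like loss[... over N frames]."""

    def __init__(self) -> None:
        self.loss_sum = 0.0  # sum over batches of (per-frame loss * frames)
        self.frames = 0.0

    def update(self, batch_loss: float, batch_frames: float) -> None:
        self.loss_sum += batch_loss * batch_frames
        self.frames += batch_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

tot = FrameWeightedLoss()
tot.update(0.1638, 24784.0)   # the "Epoch 25, batch 0" record above
print(tot.value, tot.frames)  # 0.1638 over 24784 frames, matching tot_loss at batch 0
```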
], batch size: 58, lr: 9.80e-03, grad_scale: 32.0 2023-12-04 05:30:53,695 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=143533.33333333334, ans=0.125 2023-12-04 05:31:05,357 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=143600.0, ans=0.2 2023-12-04 05:31:19,635 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.47 vs. limit=10.0 2023-12-04 05:31:29,102 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=143733.33333333334, ans=0.0 2023-12-04 05:31:41,488 INFO [train.py:1087] (1/4) Epoch 25, batch 100, loss[loss=0.1684, simple_loss=0.2578, pruned_loss=0.03957, over 24713.00 frames. ], tot_loss[loss=0.1775, simple_loss=0.2662, pruned_loss=0.04438, over 1885012.98 frames. ], batch size: 67, lr: 9.79e-03, grad_scale: 32.0 2023-12-04 05:31:50,413 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=143866.66666666666, ans=0.0 2023-12-04 05:31:51,677 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.97 vs. limit=15.0 2023-12-04 05:32:13,510 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.02 vs. limit=15.0 2023-12-04 05:32:15,939 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_na.min_abs, batch_count=144066.66666666666, ans=0.02 2023-12-04 05:32:34,308 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.50 vs. limit=22.5 2023-12-04 05:32:37,714 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.162e+02 1.373e+02 1.509e+02 1.642e+02 2.631e+02, threshold=3.018e+02, percent-clipped=0.0 2023-12-04 05:32:37,740 INFO [train.py:1087] (1/4) Epoch 25, batch 150, loss[loss=0.1602, simple_loss=0.2522, pruned_loss=0.03407, over 24760.00 frames. ], tot_loss[loss=0.1755, simple_loss=0.2645, pruned_loss=0.04329, over 2543734.74 frames. ], batch size: 66, lr: 9.78e-03, grad_scale: 32.0 2023-12-04 05:33:00,056 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=144333.33333333334, ans=0.125 2023-12-04 05:33:04,500 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=144333.33333333334, ans=0.2 2023-12-04 05:33:11,081 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.27 vs. limit=15.0 2023-12-04 05:33:33,286 INFO [train.py:1087] (1/4) Epoch 25, batch 200, loss[loss=0.1791, simple_loss=0.2644, pruned_loss=0.04688, over 23968.00 frames. ], tot_loss[loss=0.1763, simple_loss=0.265, pruned_loss=0.04379, over 3042192.32 frames. ], batch size: 87, lr: 9.77e-03, grad_scale: 16.0 2023-12-04 05:33:58,165 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=16.40 vs. 
limit=15.0 2023-12-04 05:33:58,315 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.99 vs. limit=15.0 2023-12-04 05:33:59,978 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=144666.66666666666, ans=0.1 2023-12-04 05:34:08,579 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=144733.33333333334, ans=0.1 2023-12-04 05:34:15,066 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=144733.33333333334, ans=0.125 2023-12-04 05:34:17,211 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=144800.0, ans=0.0 2023-12-04 05:34:28,531 INFO [train.py:1087] (1/4) Epoch 25, batch 250, loss[loss=0.1838, simple_loss=0.2655, pruned_loss=0.05101, over 24470.00 frames. ], tot_loss[loss=0.1767, simple_loss=0.2655, pruned_loss=0.04399, over 3440665.50 frames. ], batch size: 75, lr: 9.76e-03, grad_scale: 16.0 2023-12-04 05:34:29,517 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.125e+02 1.419e+02 1.558e+02 1.718e+02 2.856e+02, threshold=3.117e+02, percent-clipped=0.0 2023-12-04 05:34:43,752 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=144933.33333333334, ans=0.0 2023-12-04 05:34:43,831 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=144933.33333333334, ans=0.0 2023-12-04 05:34:50,692 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.44 vs. limit=15.0 2023-12-04 05:34:52,658 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 05:34:52,685 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=145000.0, ans=0.125 2023-12-04 05:34:58,384 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=145000.0, ans=0.125 2023-12-04 05:35:05,062 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.11 vs. limit=22.5 2023-12-04 05:35:18,933 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=145133.33333333334, ans=0.0 2023-12-04 05:35:19,781 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=145133.33333333334, ans=0.1 2023-12-04 05:35:23,216 INFO [train.py:1087] (1/4) Epoch 25, batch 300, loss[loss=0.1795, simple_loss=0.2667, pruned_loss=0.0461, over 24451.00 frames. ], tot_loss[loss=0.1759, simple_loss=0.2649, pruned_loss=0.0434, over 3748821.63 frames. ], batch size: 77, lr: 9.75e-03, grad_scale: 16.0 2023-12-04 05:35:37,565 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=145266.66666666666, ans=0.2 2023-12-04 05:35:41,092 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.12 vs. 
limit=22.5 2023-12-04 05:35:48,880 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=145333.33333333334, ans=0.0 2023-12-04 05:36:07,726 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=145466.66666666666, ans=0.0 2023-12-04 05:36:09,740 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=145466.66666666666, ans=0.125 2023-12-04 05:36:17,812 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=145533.33333333334, ans=0.0 2023-12-04 05:36:18,584 INFO [train.py:1087] (1/4) Epoch 25, batch 350, loss[loss=0.1892, simple_loss=0.2774, pruned_loss=0.05046, over 24486.00 frames. ], tot_loss[loss=0.1759, simple_loss=0.265, pruned_loss=0.0434, over 3982349.98 frames. ], batch size: 77, lr: 9.74e-03, grad_scale: 16.0 2023-12-04 05:36:19,593 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.169e+02 1.342e+02 1.479e+02 1.657e+02 2.661e+02, threshold=2.958e+02, percent-clipped=0.0 2023-12-04 05:36:31,627 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=145600.0, ans=0.1 2023-12-04 05:36:32,721 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=145600.0, ans=0.125 2023-12-04 05:36:46,394 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=145666.66666666666, ans=0.04949747468305833 2023-12-04 05:36:58,740 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=145733.33333333334, ans=0.07 2023-12-04 05:37:09,440 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=145800.0, ans=10.0 2023-12-04 05:37:11,752 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.29 vs. limit=15.0 2023-12-04 05:37:14,130 INFO [train.py:1087] (1/4) Epoch 25, batch 400, loss[loss=0.1631, simple_loss=0.2579, pruned_loss=0.03418, over 24712.00 frames. ], tot_loss[loss=0.1754, simple_loss=0.2648, pruned_loss=0.04303, over 4169520.73 frames. ], batch size: 69, lr: 9.73e-03, grad_scale: 32.0 2023-12-04 05:37:18,598 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=145866.66666666666, ans=0.1 2023-12-04 05:37:44,171 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=146000.0, ans=0.1 2023-12-04 05:37:48,620 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.95 vs. 
limit=22.5 2023-12-04 05:37:49,415 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=146066.66666666666, ans=0.0 2023-12-04 05:37:52,576 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.max_abs, batch_count=146066.66666666666, ans=10.0 2023-12-04 05:37:53,578 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=146066.66666666666, ans=0.125 2023-12-04 05:37:57,430 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=146133.33333333334, ans=0.0 2023-12-04 05:38:02,540 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=146133.33333333334, ans=0.0 2023-12-04 05:38:05,572 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=146133.33333333334, ans=0.0 2023-12-04 05:38:09,562 INFO [train.py:1087] (1/4) Epoch 25, batch 450, loss[loss=0.1754, simple_loss=0.2623, pruned_loss=0.04428, over 24796.00 frames. ], tot_loss[loss=0.1755, simple_loss=0.2647, pruned_loss=0.04312, over 4314545.46 frames. ], batch size: 73, lr: 9.72e-03, grad_scale: 32.0 2023-12-04 05:38:10,565 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.152e+02 1.420e+02 1.535e+02 1.721e+02 2.483e+02, threshold=3.070e+02, percent-clipped=0.0 2023-12-04 05:38:16,526 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.34 vs. limit=15.0 2023-12-04 05:38:21,373 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=146266.66666666666, ans=0.0 2023-12-04 05:38:28,252 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=146266.66666666666, ans=0.0 2023-12-04 05:38:39,410 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=15.0 2023-12-04 05:38:57,842 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=146466.66666666666, ans=0.2 2023-12-04 05:39:02,749 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=146466.66666666666, ans=0.0 2023-12-04 05:39:05,049 INFO [train.py:1087] (1/4) Epoch 25, batch 500, loss[loss=0.1713, simple_loss=0.2629, pruned_loss=0.03986, over 24610.00 frames. ], tot_loss[loss=0.1755, simple_loss=0.2648, pruned_loss=0.0431, over 4429490.95 frames. ], batch size: 68, lr: 9.71e-03, grad_scale: 32.0 2023-12-04 05:39:43,620 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=146733.33333333334, ans=0.0 2023-12-04 05:39:47,743 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=146800.0, ans=0.015 2023-12-04 05:39:59,312 INFO [train.py:1087] (1/4) Epoch 25, batch 550, loss[loss=0.1706, simple_loss=0.2632, pruned_loss=0.03902, over 24751.00 frames. ], tot_loss[loss=0.1756, simple_loss=0.2649, pruned_loss=0.04317, over 4504722.09 frames. 
], batch size: 61, lr: 9.70e-03, grad_scale: 16.0 2023-12-04 05:40:01,393 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.158e+02 1.385e+02 1.522e+02 1.751e+02 2.434e+02, threshold=3.044e+02, percent-clipped=0.0 2023-12-04 05:40:09,513 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.19 vs. limit=15.0 2023-12-04 05:40:24,814 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=147000.0, ans=0.125 2023-12-04 05:40:40,490 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=147066.66666666666, ans=0.125 2023-12-04 05:40:53,786 INFO [train.py:1087] (1/4) Epoch 25, batch 600, loss[loss=0.1606, simple_loss=0.2498, pruned_loss=0.03568, over 24571.00 frames. ], tot_loss[loss=0.1753, simple_loss=0.2646, pruned_loss=0.04295, over 4574393.25 frames. ], batch size: 64, lr: 9.69e-03, grad_scale: 16.0 2023-12-04 05:41:00,441 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=147200.0, ans=0.0 2023-12-04 05:41:23,589 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=147333.33333333334, ans=0.0 2023-12-04 05:41:31,167 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=147400.0, ans=0.125 2023-12-04 05:41:34,368 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=147400.0, ans=0.125 2023-12-04 05:41:41,817 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=147466.66666666666, ans=0.0 2023-12-04 05:41:44,290 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.47 vs. limit=15.0 2023-12-04 05:41:49,105 INFO [train.py:1087] (1/4) Epoch 25, batch 650, loss[loss=0.1742, simple_loss=0.2623, pruned_loss=0.04306, over 24760.00 frames. ], tot_loss[loss=0.1749, simple_loss=0.2644, pruned_loss=0.04276, over 4632065.86 frames. 
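The Whitening records report a per-module scalar "metric" compared against a "limit". One standard way to score how far a feature covariance C departs from a scaled identity is d * tr(C^2) / tr(C)^2, which equals 1 when all eigenvalues are equal and grows as they spread. The sketch below computes that quantity as a stand-in for the logged metric; the exact formula used in scaling.py may differ.

```python
import torch

def whitening_metric(x: torch.Tensor) -> float:
    """x: (num_frames, num_channels). Returns d * tr(C^2) / tr(C)^2 for the
    (uncentered) covariance C = x^T x / n; this is 1.0 iff C is a multiple of
    the identity, i.e. the features are already 'white'."""
    n, d = x.shape
    c = (x.t() @ x) / n
    return float(d * (c @ c).trace() / c.trace() ** 2)

torch.manual_seed(0)
white = torch.randn(10000, 512)                   # roughly isotropic features -> metric near 1
skewed = white * torch.linspace(0.1, 3.0, 512)    # spread-out channel scales -> larger metric
print(whitening_metric(white), whitening_metric(skewed))
```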
], batch size: 65, lr: 9.68e-03, grad_scale: 16.0 2023-12-04 05:41:51,296 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.150e+02 1.379e+02 1.470e+02 1.642e+02 3.747e+02, threshold=2.939e+02, percent-clipped=1.0 2023-12-04 05:42:13,924 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=147666.66666666666, ans=0.0 2023-12-04 05:42:19,303 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=147666.66666666666, ans=0.125 2023-12-04 05:42:24,650 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=147733.33333333334, ans=0.0 2023-12-04 05:42:26,782 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=147733.33333333334, ans=0.0 2023-12-04 05:42:32,871 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=147800.0, ans=0.125 2023-12-04 05:42:32,890 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=147800.0, ans=0.0 2023-12-04 05:42:35,217 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=147800.0, ans=0.125 2023-12-04 05:42:44,038 INFO [train.py:1087] (1/4) Epoch 25, batch 700, loss[loss=0.166, simple_loss=0.2568, pruned_loss=0.03764, over 24783.00 frames. ], tot_loss[loss=0.1749, simple_loss=0.2643, pruned_loss=0.04274, over 4668700.37 frames. ], batch size: 71, lr: 9.67e-03, grad_scale: 16.0 2023-12-04 05:43:03,555 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=147933.33333333334, ans=0.0 2023-12-04 05:43:05,992 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=148000.0, ans=0.1 2023-12-04 05:43:07,090 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 05:43:25,788 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.92 vs. limit=15.0 2023-12-04 05:43:32,908 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.56 vs. limit=15.0 2023-12-04 05:43:39,041 INFO [train.py:1087] (1/4) Epoch 25, batch 750, loss[loss=0.1708, simple_loss=0.2603, pruned_loss=0.04062, over 24687.00 frames. ], tot_loss[loss=0.1749, simple_loss=0.2642, pruned_loss=0.04281, over 4692555.22 frames. 
], batch size: 74, lr: 9.67e-03, grad_scale: 16.0 2023-12-04 05:43:41,170 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.201e+02 1.392e+02 1.520e+02 1.675e+02 2.210e+02, threshold=3.040e+02, percent-clipped=0.0 2023-12-04 05:43:41,372 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=148200.0, ans=0.125 2023-12-04 05:43:50,649 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=148266.66666666666, ans=0.1 2023-12-04 05:43:50,779 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=148266.66666666666, ans=0.2 2023-12-04 05:43:51,640 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=148266.66666666666, ans=0.0 2023-12-04 05:44:14,447 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=148400.0, ans=0.2 2023-12-04 05:44:33,502 INFO [train.py:1087] (1/4) Epoch 25, batch 800, loss[loss=0.1782, simple_loss=0.2668, pruned_loss=0.04479, over 24546.00 frames. ], tot_loss[loss=0.174, simple_loss=0.2636, pruned_loss=0.04224, over 4735095.76 frames. ], batch size: 62, lr: 9.66e-03, grad_scale: 32.0 2023-12-04 05:44:37,027 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=148533.33333333334, ans=0.125 2023-12-04 05:44:43,501 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=148600.0, ans=0.2 2023-12-04 05:44:48,766 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=148600.0, ans=0.125 2023-12-04 05:45:25,060 INFO [train.py:1087] (1/4) Epoch 25, batch 850, loss[loss=0.1767, simple_loss=0.2649, pruned_loss=0.04423, over 24289.00 frames. ], tot_loss[loss=0.1743, simple_loss=0.2639, pruned_loss=0.04237, over 4755188.55 frames. ], batch size: 79, lr: 9.65e-03, grad_scale: 32.0 2023-12-04 05:45:25,623 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.83 vs. limit=15.0 2023-12-04 05:45:26,991 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.224e+02 1.434e+02 1.522e+02 1.667e+02 2.579e+02, threshold=3.044e+02, percent-clipped=0.0 2023-12-04 05:45:28,172 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=148866.66666666666, ans=0.125 2023-12-04 05:45:30,240 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 05:45:42,132 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=148933.33333333334, ans=0.125 2023-12-04 05:45:43,029 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=148933.33333333334, ans=0.1 2023-12-04 05:45:47,561 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.04 vs. 
limit=15.0 2023-12-04 05:45:53,025 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=149000.0, ans=0.0 2023-12-04 05:46:25,826 INFO [train.py:1087] (1/4) Epoch 26, batch 0, loss[loss=0.1734, simple_loss=0.2643, pruned_loss=0.04127, over 24549.00 frames. ], tot_loss[loss=0.1734, simple_loss=0.2643, pruned_loss=0.04127, over 24549.00 frames. ], batch size: 63, lr: 9.45e-03, grad_scale: 32.0 2023-12-04 05:46:25,827 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 05:46:37,859 INFO [train.py:1119] (1/4) Epoch 26, validation: loss=0.1564, simple_loss=0.2574, pruned_loss=0.02768, over 944034.00 frames. 2023-12-04 05:46:37,860 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 05:46:56,540 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=149233.33333333334, ans=0.0 2023-12-04 05:47:01,141 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=149300.0, ans=0.1 2023-12-04 05:47:11,222 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.06 vs. limit=10.0 2023-12-04 05:47:34,148 INFO [train.py:1087] (1/4) Epoch 26, batch 50, loss[loss=0.173, simple_loss=0.2616, pruned_loss=0.04221, over 24559.00 frames. ], tot_loss[loss=0.174, simple_loss=0.2641, pruned_loss=0.04198, over 1093853.82 frames. ], batch size: 64, lr: 9.44e-03, grad_scale: 32.0 2023-12-04 05:47:37,567 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=149500.0, ans=0.0 2023-12-04 05:47:41,534 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.171e+02 1.353e+02 1.478e+02 1.657e+02 3.231e+02, threshold=2.956e+02, percent-clipped=1.0 2023-12-04 05:47:42,832 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=149500.0, ans=0.125 2023-12-04 05:48:08,961 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=149700.0, ans=0.2 2023-12-04 05:48:12,208 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=149700.0, ans=0.2 2023-12-04 05:48:24,061 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=149766.66666666666, ans=0.125 2023-12-04 05:48:28,393 INFO [train.py:1087] (1/4) Epoch 26, batch 100, loss[loss=0.1611, simple_loss=0.2529, pruned_loss=0.03461, over 24551.00 frames. ], tot_loss[loss=0.1743, simple_loss=0.2645, pruned_loss=0.04208, over 1914767.27 frames. ], batch size: 62, lr: 9.43e-03, grad_scale: 32.0 2023-12-04 05:48:28,750 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=149833.33333333334, ans=0.0 2023-12-04 05:48:31,233 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=149833.33333333334, ans=0.0 2023-12-04 05:48:42,244 INFO [scaling.py:1022] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.13 vs. 
limit=8.0 2023-12-04 05:48:50,093 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=149966.66666666666, ans=0.2 2023-12-04 05:49:23,249 INFO [train.py:1087] (1/4) Epoch 26, batch 150, loss[loss=0.177, simple_loss=0.2682, pruned_loss=0.04294, over 24326.00 frames. ], tot_loss[loss=0.1742, simple_loss=0.2642, pruned_loss=0.04212, over 2549976.51 frames. ], batch size: 79, lr: 9.42e-03, grad_scale: 32.0 2023-12-04 05:49:23,673 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.71 vs. limit=22.5 2023-12-04 05:49:31,109 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.129e+02 1.354e+02 1.451e+02 1.669e+02 2.320e+02, threshold=2.903e+02, percent-clipped=0.0 2023-12-04 05:49:36,166 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=150233.33333333334, ans=0.125 2023-12-04 05:49:59,869 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=150366.66666666666, ans=0.1 2023-12-04 05:50:03,866 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=150366.66666666666, ans=0.125 2023-12-04 05:50:18,526 INFO [train.py:1087] (1/4) Epoch 26, batch 200, loss[loss=0.1732, simple_loss=0.2653, pruned_loss=0.04051, over 24731.00 frames. ], tot_loss[loss=0.1745, simple_loss=0.2644, pruned_loss=0.04234, over 3048735.02 frames. ], batch size: 61, lr: 9.41e-03, grad_scale: 32.0 2023-12-04 05:50:42,187 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=150633.33333333334, ans=0.0 2023-12-04 05:51:14,176 INFO [train.py:1087] (1/4) Epoch 26, batch 250, loss[loss=0.1892, simple_loss=0.2761, pruned_loss=0.05117, over 23896.00 frames. ], tot_loss[loss=0.1742, simple_loss=0.2637, pruned_loss=0.04229, over 3444094.94 frames. ], batch size: 87, lr: 9.40e-03, grad_scale: 32.0 2023-12-04 05:51:14,713 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.51 vs. limit=22.5 2023-12-04 05:51:21,781 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.178e+02 1.341e+02 1.458e+02 1.607e+02 2.850e+02, threshold=2.916e+02, percent-clipped=0.0 2023-12-04 05:51:22,118 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=150833.33333333334, ans=0.09899494936611666 2023-12-04 05:51:52,629 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=151033.33333333334, ans=0.125 2023-12-04 05:51:59,846 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=151100.0, ans=0.1 2023-12-04 05:52:09,149 INFO [train.py:1087] (1/4) Epoch 26, batch 300, loss[loss=0.1653, simple_loss=0.2618, pruned_loss=0.03435, over 24697.00 frames. ], tot_loss[loss=0.1734, simple_loss=0.263, pruned_loss=0.04192, over 3751435.59 frames. 
], batch size: 74, lr: 9.39e-03, grad_scale: 32.0 2023-12-04 05:52:10,508 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=151166.66666666666, ans=0.125 2023-12-04 05:52:12,498 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=151166.66666666666, ans=0.125 2023-12-04 05:52:18,015 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.12 vs. limit=15.0 2023-12-04 05:52:55,221 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 05:52:58,477 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=151433.33333333334, ans=0.125 2023-12-04 05:53:00,612 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=151433.33333333334, ans=0.125 2023-12-04 05:53:03,531 INFO [train.py:1087] (1/4) Epoch 26, batch 350, loss[loss=0.1786, simple_loss=0.2706, pruned_loss=0.04335, over 24705.00 frames. ], tot_loss[loss=0.1746, simple_loss=0.264, pruned_loss=0.04258, over 3965149.64 frames. ], batch size: 69, lr: 9.38e-03, grad_scale: 32.0 2023-12-04 05:53:03,707 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=151500.0, ans=0.035 2023-12-04 05:53:10,614 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=151500.0, ans=0.125 2023-12-04 05:53:11,695 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.145e+02 1.373e+02 1.473e+02 1.593e+02 2.083e+02, threshold=2.946e+02, percent-clipped=0.0 2023-12-04 05:53:11,920 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=151500.0, ans=0.125 2023-12-04 05:53:17,169 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=151566.66666666666, ans=0.0 2023-12-04 05:53:29,429 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.23 vs. limit=15.0 2023-12-04 05:53:39,189 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=151700.0, ans=0.2 2023-12-04 05:53:54,727 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.05 vs. limit=22.5 2023-12-04 05:53:58,350 INFO [train.py:1087] (1/4) Epoch 26, batch 400, loss[loss=0.2332, simple_loss=0.3056, pruned_loss=0.08039, over 16750.00 frames. ], tot_loss[loss=0.1747, simple_loss=0.2641, pruned_loss=0.0426, over 4144603.87 frames. 
], batch size: 177, lr: 9.37e-03, grad_scale: 32.0 2023-12-04 05:54:02,232 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=151833.33333333334, ans=0.0 2023-12-04 05:54:07,413 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=151833.33333333334, ans=0.2 2023-12-04 05:54:10,591 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.31 vs. limit=15.0 2023-12-04 05:54:11,072 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=151900.0, ans=0.125 2023-12-04 05:54:27,595 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=151966.66666666666, ans=0.125 2023-12-04 05:54:30,679 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=152033.33333333334, ans=0.0 2023-12-04 05:54:42,062 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=152100.0, ans=0.125 2023-12-04 05:54:42,236 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.10 vs. limit=22.5 2023-12-04 05:54:49,926 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=152100.0, ans=0.1 2023-12-04 05:54:53,828 INFO [train.py:1087] (1/4) Epoch 26, batch 450, loss[loss=0.1778, simple_loss=0.265, pruned_loss=0.04534, over 23482.00 frames. ], tot_loss[loss=0.1744, simple_loss=0.2638, pruned_loss=0.04252, over 4292137.44 frames. ], batch size: 94, lr: 9.36e-03, grad_scale: 32.0 2023-12-04 05:54:56,325 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=152166.66666666666, ans=0.125 2023-12-04 05:55:01,287 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.219e+02 1.411e+02 1.506e+02 1.676e+02 2.196e+02, threshold=3.012e+02, percent-clipped=0.0 2023-12-04 05:55:49,574 INFO [train.py:1087] (1/4) Epoch 26, batch 500, loss[loss=0.1737, simple_loss=0.2611, pruned_loss=0.04315, over 24724.00 frames. ], tot_loss[loss=0.1744, simple_loss=0.2638, pruned_loss=0.04252, over 4391548.22 frames. ], batch size: 61, lr: 9.35e-03, grad_scale: 32.0 2023-12-04 05:55:50,871 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=152500.0, ans=0.0 2023-12-04 05:55:54,547 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.34 vs. limit=12.0 2023-12-04 05:56:19,275 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.63 vs. limit=15.0 2023-12-04 05:56:25,278 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=152700.0, ans=0.2 2023-12-04 05:56:45,238 INFO [train.py:1087] (1/4) Epoch 26, batch 550, loss[loss=0.168, simple_loss=0.2604, pruned_loss=0.03779, over 24721.00 frames. ], tot_loss[loss=0.1742, simple_loss=0.2636, pruned_loss=0.04235, over 4484265.61 frames. 
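The ScheduledFloat records above print a value ("ans") together with the current batch_count, i.e. quantities such as dropout probabilities and skip rates are scheduled as a function of the global batch count. A piecewise-linear schedule keyed on batch count is one simple way to realize this; the sketch below is written under that assumption and is not the ScheduledFloat implementation from scaling.py.

```python
import bisect

class PiecewiseSchedule:
    """A float that changes with the global batch count, given as (batch, value)
    breakpoints with linear interpolation in between and clamping outside the range."""

    def __init__(self, *points):
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# e.g. a skip-rate-like value that decays early in training and then stays flat
skip_rate = PiecewiseSchedule((0.0, 0.5), (4000.0, 0.05), (16000.0, 0.0))
print(skip_rate(152000.0))  # 0.0: far past the last breakpoint, as in the records above
```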
], batch size: 67, lr: 9.34e-03, grad_scale: 32.0 2023-12-04 05:56:52,941 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.180e+02 1.362e+02 1.478e+02 1.633e+02 2.091e+02, threshold=2.955e+02, percent-clipped=0.0 2023-12-04 05:56:54,604 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-12-04 05:57:07,286 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.23 vs. limit=15.0 2023-12-04 05:57:14,265 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.32 vs. limit=22.5 2023-12-04 05:57:39,769 INFO [train.py:1087] (1/4) Epoch 26, batch 600, loss[loss=0.1782, simple_loss=0.2625, pruned_loss=0.04698, over 24484.00 frames. ], tot_loss[loss=0.174, simple_loss=0.2635, pruned_loss=0.0423, over 4546793.93 frames. ], batch size: 77, lr: 9.33e-03, grad_scale: 32.0 2023-12-04 05:57:50,611 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=153233.33333333334, ans=0.0 2023-12-04 05:58:08,718 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=153300.0, ans=0.125 2023-12-04 05:58:09,012 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.51 vs. limit=15.0 2023-12-04 05:58:29,161 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=153433.33333333334, ans=0.1 2023-12-04 05:58:35,284 INFO [train.py:1087] (1/4) Epoch 26, batch 650, loss[loss=0.1672, simple_loss=0.258, pruned_loss=0.03817, over 24701.00 frames. ], tot_loss[loss=0.1732, simple_loss=0.2628, pruned_loss=0.04184, over 4607464.88 frames. ], batch size: 69, lr: 9.32e-03, grad_scale: 32.0 2023-12-04 05:58:42,603 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.125e+02 1.332e+02 1.436e+02 1.583e+02 2.553e+02, threshold=2.873e+02, percent-clipped=0.0 2023-12-04 05:58:58,889 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=153633.33333333334, ans=0.025 2023-12-04 05:59:00,059 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=153633.33333333334, ans=0.2 2023-12-04 05:59:09,619 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-04 05:59:14,290 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.39 vs. limit=15.0 2023-12-04 05:59:18,398 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=153766.66666666666, ans=0.0 2023-12-04 05:59:30,804 INFO [train.py:1087] (1/4) Epoch 26, batch 700, loss[loss=0.1687, simple_loss=0.2575, pruned_loss=0.04001, over 24565.00 frames. ], tot_loss[loss=0.1736, simple_loss=0.263, pruned_loss=0.04207, over 4652594.68 frames. ], batch size: 62, lr: 9.32e-03, grad_scale: 32.0 2023-12-04 05:59:38,020 INFO [scaling.py:1022] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.57 vs. 
limit=5.0 2023-12-04 05:59:41,796 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=153900.0, ans=0.2 2023-12-04 05:59:45,881 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=153900.0, ans=0.0 2023-12-04 05:59:51,579 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=153966.66666666666, ans=0.0 2023-12-04 06:00:22,635 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.74 vs. limit=15.0 2023-12-04 06:00:25,924 INFO [train.py:1087] (1/4) Epoch 26, batch 750, loss[loss=0.171, simple_loss=0.2655, pruned_loss=0.03823, over 24553.00 frames. ], tot_loss[loss=0.1733, simple_loss=0.2628, pruned_loss=0.04185, over 4699029.43 frames. ], batch size: 63, lr: 9.31e-03, grad_scale: 32.0 2023-12-04 06:00:33,759 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.063e+02 1.380e+02 1.538e+02 1.731e+02 2.192e+02, threshold=3.076e+02, percent-clipped=0.0 2023-12-04 06:00:35,470 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.91 vs. limit=6.0 2023-12-04 06:00:36,189 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=154233.33333333334, ans=0.0 2023-12-04 06:00:37,237 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=154233.33333333334, ans=0.125 2023-12-04 06:00:49,290 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-12-04 06:00:50,883 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=154300.0, ans=0.0 2023-12-04 06:00:59,380 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=154366.66666666666, ans=0.125 2023-12-04 06:01:06,409 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-12-04 06:01:12,597 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=154433.33333333334, ans=0.125 2023-12-04 06:01:20,924 INFO [train.py:1087] (1/4) Epoch 26, batch 800, loss[loss=0.172, simple_loss=0.2621, pruned_loss=0.04096, over 24615.00 frames. ], tot_loss[loss=0.1736, simple_loss=0.2631, pruned_loss=0.04205, over 4712017.42 frames. 
], batch size: 68, lr: 9.30e-03, grad_scale: 32.0 2023-12-04 06:01:23,586 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=154500.0, ans=0.125 2023-12-04 06:01:23,631 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=154500.0, ans=0.0 2023-12-04 06:01:25,775 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=154500.0, ans=0.125 2023-12-04 06:01:40,385 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=154566.66666666666, ans=0.125 2023-12-04 06:01:59,977 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=154700.0, ans=0.125 2023-12-04 06:02:12,823 INFO [train.py:1087] (1/4) Epoch 26, batch 850, loss[loss=0.1695, simple_loss=0.2625, pruned_loss=0.03827, over 24749.00 frames. ], tot_loss[loss=0.1729, simple_loss=0.2625, pruned_loss=0.04162, over 4748356.07 frames. ], batch size: 70, lr: 9.29e-03, grad_scale: 32.0 2023-12-04 06:02:19,852 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.214e+02 1.388e+02 1.511e+02 1.708e+02 2.505e+02, threshold=3.022e+02, percent-clipped=0.0 2023-12-04 06:02:22,338 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.23 vs. limit=15.0 2023-12-04 06:02:27,032 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=154900.0, ans=0.125 2023-12-04 06:02:36,155 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=154966.66666666666, ans=0.0 2023-12-04 06:02:42,593 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.67 vs. limit=15.0 2023-12-04 06:02:48,416 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=8.648e-03 2023-12-04 06:02:49,430 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=155033.33333333334, ans=0.0 2023-12-04 06:03:14,473 INFO [train.py:1087] (1/4) Epoch 27, batch 0, loss[loss=0.1779, simple_loss=0.2619, pruned_loss=0.04702, over 22206.00 frames. ], tot_loss[loss=0.1779, simple_loss=0.2619, pruned_loss=0.04702, over 22206.00 frames. ], batch size: 53, lr: 9.10e-03, grad_scale: 32.0 2023-12-04 06:03:14,474 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 06:03:26,615 INFO [train.py:1119] (1/4) Epoch 27, validation: loss=0.1567, simple_loss=0.2572, pruned_loss=0.02815, over 944034.00 frames. 2023-12-04 06:03:26,615 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 06:03:29,008 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=155133.33333333334, ans=0.125 2023-12-04 06:03:32,280 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.51 vs. 
limit=22.5 2023-12-04 06:03:57,047 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=155266.66666666666, ans=0.0 2023-12-04 06:04:02,529 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=155333.33333333334, ans=0.125 2023-12-04 06:04:05,724 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=155333.33333333334, ans=0.125 2023-12-04 06:04:08,170 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=155333.33333333334, ans=0.125 2023-12-04 06:04:11,330 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=155400.0, ans=0.2 2023-12-04 06:04:21,333 INFO [train.py:1087] (1/4) Epoch 27, batch 50, loss[loss=0.1694, simple_loss=0.2632, pruned_loss=0.03787, over 24708.00 frames. ], tot_loss[loss=0.1729, simple_loss=0.2626, pruned_loss=0.04159, over 1082753.21 frames. ], batch size: 67, lr: 9.09e-03, grad_scale: 32.0 2023-12-04 06:04:25,708 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=155466.66666666666, ans=0.1 2023-12-04 06:04:28,999 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=155466.66666666666, ans=0.125 2023-12-04 06:04:30,076 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=155466.66666666666, ans=0.125 2023-12-04 06:04:31,029 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=155533.33333333334, ans=0.2 2023-12-04 06:04:34,948 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.198e+02 1.419e+02 1.535e+02 1.765e+02 3.473e+02, threshold=3.070e+02, percent-clipped=1.0 2023-12-04 06:04:36,512 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.04 vs. limit=10.0 2023-12-04 06:04:41,980 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=155600.0, ans=0.0 2023-12-04 06:04:54,196 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=155666.66666666666, ans=0.0 2023-12-04 06:05:05,964 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 06:05:15,879 INFO [train.py:1087] (1/4) Epoch 27, batch 100, loss[loss=0.1781, simple_loss=0.2654, pruned_loss=0.04546, over 24512.00 frames. ], tot_loss[loss=0.1729, simple_loss=0.2628, pruned_loss=0.04151, over 1913058.45 frames. 
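The "Maximum memory allocated so far is 16610MB" line printed after each validation pass corresponds to the per-device high-water mark of the GPU allocator. A minimal way to reproduce such a figure, assuming it is read from PyTorch's CUDA caching allocator as the MB value suggests, is sketched below; the helper name is illustrative.

```python
import torch

def max_allocated_mb(device: torch.device) -> int:
    """High-water mark of memory handed out by the CUDA caching allocator, in MB."""
    return int(torch.cuda.max_memory_allocated(device) // (1024 * 1024))

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    x = torch.randn(1024, 1024, device=device)  # allocate something so the counter is non-zero
    print(f"Maximum memory allocated so far is {max_allocated_mb(device)}MB")
```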
], batch size: 75, lr: 9.09e-03, grad_scale: 32.0 2023-12-04 06:05:25,989 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=155866.66666666666, ans=0.125 2023-12-04 06:05:34,891 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=155866.66666666666, ans=15.0 2023-12-04 06:05:38,898 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=155933.33333333334, ans=0.125 2023-12-04 06:05:39,969 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=155933.33333333334, ans=0.1 2023-12-04 06:05:51,109 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=15.0 2023-12-04 06:06:10,395 INFO [train.py:1087] (1/4) Epoch 27, batch 150, loss[loss=0.1555, simple_loss=0.2481, pruned_loss=0.03148, over 24556.00 frames. ], tot_loss[loss=0.1721, simple_loss=0.262, pruned_loss=0.04109, over 2570126.09 frames. ], batch size: 66, lr: 9.08e-03, grad_scale: 32.0 2023-12-04 06:06:12,842 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=156133.33333333334, ans=0.125 2023-12-04 06:06:25,102 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.219e+02 1.353e+02 1.457e+02 1.592e+02 2.359e+02, threshold=2.913e+02, percent-clipped=0.0 2023-12-04 06:06:38,255 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.23 vs. limit=15.0 2023-12-04 06:06:39,046 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=156266.66666666666, ans=0.125 2023-12-04 06:06:44,529 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=156333.33333333334, ans=0.04949747468305833 2023-12-04 06:07:00,196 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=156400.0, ans=10.0 2023-12-04 06:07:05,161 INFO [train.py:1087] (1/4) Epoch 27, batch 200, loss[loss=0.1688, simple_loss=0.2617, pruned_loss=0.0379, over 24751.00 frames. ], tot_loss[loss=0.1734, simple_loss=0.2627, pruned_loss=0.04202, over 3050939.60 frames. ], batch size: 61, lr: 9.07e-03, grad_scale: 32.0 2023-12-04 06:07:11,821 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=156466.66666666666, ans=0.125 2023-12-04 06:07:35,280 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.23 vs. limit=12.0 2023-12-04 06:08:00,029 INFO [train.py:1087] (1/4) Epoch 27, batch 250, loss[loss=0.1669, simple_loss=0.2565, pruned_loss=0.03864, over 24547.00 frames. ], tot_loss[loss=0.1726, simple_loss=0.2622, pruned_loss=0.04147, over 3453855.28 frames. 
], batch size: 62, lr: 9.06e-03, grad_scale: 32.0 2023-12-04 06:08:03,388 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=156800.0, ans=0.035 2023-12-04 06:08:13,748 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.187e+02 1.395e+02 1.466e+02 1.676e+02 3.271e+02, threshold=2.931e+02, percent-clipped=1.0 2023-12-04 06:08:18,255 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=156866.66666666666, ans=0.1 2023-12-04 06:08:24,283 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=156933.33333333334, ans=0.2 2023-12-04 06:08:40,374 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=157000.0, ans=0.1 2023-12-04 06:08:44,575 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=157066.66666666666, ans=0.125 2023-12-04 06:08:52,067 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=157066.66666666666, ans=0.1 2023-12-04 06:08:54,217 INFO [train.py:1087] (1/4) Epoch 27, batch 300, loss[loss=0.16, simple_loss=0.2519, pruned_loss=0.03408, over 23966.00 frames. ], tot_loss[loss=0.1726, simple_loss=0.2622, pruned_loss=0.04154, over 3746931.79 frames. ], batch size: 87, lr: 9.05e-03, grad_scale: 32.0 2023-12-04 06:08:55,919 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=157133.33333333334, ans=0.125 2023-12-04 06:09:15,703 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.10 vs. limit=22.5 2023-12-04 06:09:24,303 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-12-04 06:09:38,947 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.64 vs. limit=15.0 2023-12-04 06:09:49,178 INFO [train.py:1087] (1/4) Epoch 27, batch 350, loss[loss=0.156, simple_loss=0.2443, pruned_loss=0.03387, over 24749.00 frames. ], tot_loss[loss=0.1723, simple_loss=0.2618, pruned_loss=0.04142, over 3994131.43 frames. ], batch size: 66, lr: 9.04e-03, grad_scale: 32.0 2023-12-04 06:09:49,335 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=157466.66666666666, ans=0.125 2023-12-04 06:09:50,423 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=157466.66666666666, ans=0.125 2023-12-04 06:09:51,489 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=157466.66666666666, ans=0.05 2023-12-04 06:09:55,197 INFO [scaling.py:1022] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.08 vs. 
limit=8.0 2023-12-04 06:10:02,800 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=157533.33333333334, ans=0.125 2023-12-04 06:10:03,549 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.157e+02 1.341e+02 1.468e+02 1.612e+02 2.620e+02, threshold=2.936e+02, percent-clipped=0.0 2023-12-04 06:10:14,801 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=157600.0, ans=0.125 2023-12-04 06:10:23,172 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=157666.66666666666, ans=0.125 2023-12-04 06:10:24,168 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=157666.66666666666, ans=0.125 2023-12-04 06:10:30,346 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=157666.66666666666, ans=0.05 2023-12-04 06:10:43,526 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=157800.0, ans=0.0 2023-12-04 06:10:44,317 INFO [train.py:1087] (1/4) Epoch 27, batch 400, loss[loss=0.1755, simple_loss=0.2635, pruned_loss=0.04377, over 24565.00 frames. ], tot_loss[loss=0.1727, simple_loss=0.2622, pruned_loss=0.0416, over 4176630.00 frames. ], batch size: 63, lr: 9.03e-03, grad_scale: 32.0 2023-12-04 06:10:50,647 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=157800.0, ans=10.0 2023-12-04 06:10:54,176 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.10 vs. limit=15.0 2023-12-04 06:10:57,503 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=157866.66666666666, ans=0.5 2023-12-04 06:11:16,770 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.06 vs. limit=10.0 2023-12-04 06:11:18,060 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.20 vs. limit=15.0 2023-12-04 06:11:39,242 INFO [train.py:1087] (1/4) Epoch 27, batch 450, loss[loss=0.1668, simple_loss=0.2547, pruned_loss=0.03949, over 24732.00 frames. ], tot_loss[loss=0.1726, simple_loss=0.2621, pruned_loss=0.0415, over 4327387.90 frames. ], batch size: 61, lr: 9.02e-03, grad_scale: 32.0 2023-12-04 06:11:53,338 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.105e+02 1.360e+02 1.480e+02 1.620e+02 2.583e+02, threshold=2.959e+02, percent-clipped=0.0 2023-12-04 06:11:57,716 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=158200.0, ans=0.125 2023-12-04 06:12:01,402 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.26 vs. limit=22.5 2023-12-04 06:12:04,200 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.61 vs. 
limit=12.0 2023-12-04 06:12:04,731 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=158266.66666666666, ans=0.125 2023-12-04 06:12:15,442 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=158333.33333333334, ans=0.0 2023-12-04 06:12:23,705 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.79 vs. limit=22.5 2023-12-04 06:12:27,971 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.29 vs. limit=15.0 2023-12-04 06:12:31,664 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=158400.0, ans=0.125 2023-12-04 06:12:33,470 INFO [train.py:1087] (1/4) Epoch 27, batch 500, loss[loss=0.1693, simple_loss=0.2605, pruned_loss=0.03904, over 24716.00 frames. ], tot_loss[loss=0.1725, simple_loss=0.2621, pruned_loss=0.04146, over 4445650.38 frames. ], batch size: 69, lr: 9.02e-03, grad_scale: 32.0 2023-12-04 06:12:35,964 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.93 vs. limit=15.0 2023-12-04 06:12:39,803 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=158466.66666666666, ans=0.1 2023-12-04 06:12:45,920 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=158533.33333333334, ans=0.1 2023-12-04 06:12:53,423 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=158533.33333333334, ans=0.125 2023-12-04 06:13:06,242 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=158666.66666666666, ans=0.125 2023-12-04 06:13:12,528 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=158666.66666666666, ans=0.1 2023-12-04 06:13:12,581 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=158666.66666666666, ans=0.125 2023-12-04 06:13:16,870 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=158733.33333333334, ans=0.05 2023-12-04 06:13:22,083 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=158733.33333333334, ans=0.09899494936611666 2023-12-04 06:13:28,572 INFO [train.py:1087] (1/4) Epoch 27, batch 550, loss[loss=0.174, simple_loss=0.2637, pruned_loss=0.04219, over 24708.00 frames. ], tot_loss[loss=0.1729, simple_loss=0.2624, pruned_loss=0.04171, over 4509361.38 frames. 
], batch size: 69, lr: 9.01e-03, grad_scale: 32.0 2023-12-04 06:13:33,233 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=158800.0, ans=0.125 2023-12-04 06:13:42,756 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.198e+02 1.388e+02 1.537e+02 1.705e+02 2.928e+02, threshold=3.074e+02, percent-clipped=0.0 2023-12-04 06:13:49,564 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=158933.33333333334, ans=0.2 2023-12-04 06:13:57,556 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.95 vs. limit=15.0 2023-12-04 06:14:10,214 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=159000.0, ans=0.125 2023-12-04 06:14:10,426 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=159000.0, ans=0.1 2023-12-04 06:14:22,641 INFO [train.py:1087] (1/4) Epoch 27, batch 600, loss[loss=0.1673, simple_loss=0.258, pruned_loss=0.03827, over 24761.00 frames. ], tot_loss[loss=0.1725, simple_loss=0.2622, pruned_loss=0.04139, over 4574681.52 frames. ], batch size: 65, lr: 9.00e-03, grad_scale: 32.0 2023-12-04 06:14:23,969 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=159133.33333333334, ans=0.0 2023-12-04 06:14:35,306 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=159200.0, ans=0.0 2023-12-04 06:14:44,740 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=159266.66666666666, ans=0.0 2023-12-04 06:14:50,281 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.01 vs. limit=15.0 2023-12-04 06:15:01,169 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=159333.33333333334, ans=0.125 2023-12-04 06:15:03,241 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=159333.33333333334, ans=0.1 2023-12-04 06:15:17,601 INFO [train.py:1087] (1/4) Epoch 27, batch 650, loss[loss=0.1669, simple_loss=0.2569, pruned_loss=0.03849, over 24573.00 frames. ], tot_loss[loss=0.1727, simple_loss=0.2624, pruned_loss=0.04146, over 4616063.13 frames. ], batch size: 64, lr: 8.99e-03, grad_scale: 16.0 2023-12-04 06:15:18,082 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.50 vs. limit=15.0 2023-12-04 06:15:32,776 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.165e+02 1.395e+02 1.557e+02 1.777e+02 2.410e+02, threshold=3.114e+02, percent-clipped=0.0 2023-12-04 06:15:36,581 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.28 vs. 
limit=15.0 2023-12-04 06:15:40,870 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=159600.0, ans=0.125 2023-12-04 06:15:59,882 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=159733.33333333334, ans=0.125 2023-12-04 06:16:04,362 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 06:16:09,702 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=159733.33333333334, ans=0.125 2023-12-04 06:16:12,610 INFO [train.py:1087] (1/4) Epoch 27, batch 700, loss[loss=0.1758, simple_loss=0.2628, pruned_loss=0.04437, over 24579.00 frames. ], tot_loss[loss=0.1726, simple_loss=0.2624, pruned_loss=0.04144, over 4647796.28 frames. ], batch size: 64, lr: 8.98e-03, grad_scale: 16.0 2023-12-04 06:16:23,964 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0 2023-12-04 06:16:32,972 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=159933.33333333334, ans=0.0 2023-12-04 06:17:09,797 INFO [train.py:1087] (1/4) Epoch 27, batch 750, loss[loss=0.1841, simple_loss=0.2731, pruned_loss=0.0476, over 23469.00 frames. ], tot_loss[loss=0.173, simple_loss=0.2626, pruned_loss=0.04169, over 4660077.25 frames. ], batch size: 94, lr: 8.97e-03, grad_scale: 16.0 2023-12-04 06:17:25,234 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.212e+02 1.358e+02 1.459e+02 1.607e+02 2.648e+02, threshold=2.917e+02, percent-clipped=0.0 2023-12-04 06:17:58,460 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=160400.0, ans=0.0 2023-12-04 06:18:04,563 INFO [train.py:1087] (1/4) Epoch 27, batch 800, loss[loss=0.1626, simple_loss=0.2523, pruned_loss=0.03648, over 24593.00 frames. ], tot_loss[loss=0.1731, simple_loss=0.2625, pruned_loss=0.04185, over 4653695.47 frames. ], batch size: 68, lr: 8.96e-03, grad_scale: 32.0 2023-12-04 06:18:12,895 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.76 vs. limit=10.0 2023-12-04 06:18:23,489 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.86 vs. limit=15.0 2023-12-04 06:18:32,126 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=160600.0, ans=0.1 2023-12-04 06:18:49,718 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.27 vs. limit=15.0 2023-12-04 06:18:56,080 INFO [train.py:1087] (1/4) Epoch 27, batch 850, loss[loss=0.1658, simple_loss=0.2537, pruned_loss=0.039, over 24757.00 frames. ], tot_loss[loss=0.1729, simple_loss=0.2623, pruned_loss=0.04173, over 4684765.72 frames. 
], batch size: 65, lr: 8.96e-03, grad_scale: 32.0 2023-12-04 06:18:56,204 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=160800.0, ans=10.0 2023-12-04 06:19:08,175 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=160866.66666666666, ans=0.125 2023-12-04 06:19:09,988 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.165e+02 1.345e+02 1.467e+02 1.580e+02 2.866e+02, threshold=2.933e+02, percent-clipped=0.0 2023-12-04 06:19:18,186 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=160933.33333333334, ans=0.035 2023-12-04 06:19:54,851 INFO [train.py:1087] (1/4) Epoch 28, batch 0, loss[loss=0.1596, simple_loss=0.2548, pruned_loss=0.03215, over 24602.00 frames. ], tot_loss[loss=0.1596, simple_loss=0.2548, pruned_loss=0.03215, over 24602.00 frames. ], batch size: 68, lr: 8.78e-03, grad_scale: 32.0 2023-12-04 06:19:54,851 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 06:20:07,427 INFO [train.py:1119] (1/4) Epoch 28, validation: loss=0.1564, simple_loss=0.2567, pruned_loss=0.02802, over 944034.00 frames. 2023-12-04 06:20:07,428 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 06:20:34,108 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=161233.33333333334, ans=0.125 2023-12-04 06:20:37,659 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.52 vs. limit=15.0 2023-12-04 06:21:02,993 INFO [train.py:1087] (1/4) Epoch 28, batch 50, loss[loss=0.1733, simple_loss=0.2666, pruned_loss=0.03998, over 24291.00 frames. ], tot_loss[loss=0.1728, simple_loss=0.263, pruned_loss=0.04127, over 1079224.73 frames. ], batch size: 79, lr: 8.78e-03, grad_scale: 32.0 2023-12-04 06:21:06,515 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=161433.33333333334, ans=0.0 2023-12-04 06:21:10,564 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=161433.33333333334, ans=0.125 2023-12-04 06:21:23,521 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.198e+02 1.413e+02 1.493e+02 1.807e+02 2.971e+02, threshold=2.986e+02, percent-clipped=1.0 2023-12-04 06:21:26,704 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.35 vs. limit=15.0 2023-12-04 06:21:57,352 INFO [train.py:1087] (1/4) Epoch 28, batch 100, loss[loss=0.1591, simple_loss=0.2517, pruned_loss=0.03327, over 24763.00 frames. ], tot_loss[loss=0.1711, simple_loss=0.2616, pruned_loss=0.04031, over 1921188.94 frames. 
], batch size: 64, lr: 8.77e-03, grad_scale: 32.0 2023-12-04 06:21:59,062 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=161766.66666666666, ans=0.1 2023-12-04 06:22:00,390 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=161766.66666666666, ans=0.125 2023-12-04 06:22:00,519 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=161766.66666666666, ans=0.125 2023-12-04 06:22:01,917 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.32 vs. limit=15.0 2023-12-04 06:22:16,747 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-12-04 06:22:38,683 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=161966.66666666666, ans=0.2 2023-12-04 06:22:43,068 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=162033.33333333334, ans=0.125 2023-12-04 06:22:52,756 INFO [train.py:1087] (1/4) Epoch 28, batch 150, loss[loss=0.1571, simple_loss=0.2534, pruned_loss=0.03037, over 24763.00 frames. ], tot_loss[loss=0.1727, simple_loss=0.2629, pruned_loss=0.04129, over 2546027.22 frames. ], batch size: 65, lr: 8.76e-03, grad_scale: 32.0 2023-12-04 06:22:53,550 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.81 vs. limit=12.0 2023-12-04 06:22:55,127 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=162100.0, ans=0.125 2023-12-04 06:23:13,264 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=162166.66666666666, ans=0.0 2023-12-04 06:23:14,104 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.314e+02 1.419e+02 1.618e+02 2.443e+02, threshold=2.839e+02, percent-clipped=0.0 2023-12-04 06:23:23,324 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.59 vs. limit=22.5 2023-12-04 06:23:31,229 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2023-12-04 06:23:35,980 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=162300.0, ans=0.125 2023-12-04 06:23:48,642 INFO [train.py:1087] (1/4) Epoch 28, batch 200, loss[loss=0.1701, simple_loss=0.2585, pruned_loss=0.04087, over 24343.00 frames. ], tot_loss[loss=0.1717, simple_loss=0.2619, pruned_loss=0.0408, over 3052245.34 frames. 
], batch size: 79, lr: 8.75e-03, grad_scale: 32.0 2023-12-04 06:23:50,063 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=162433.33333333334, ans=0.125 2023-12-04 06:23:52,160 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=162433.33333333334, ans=0.0 2023-12-04 06:24:05,871 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=162500.0, ans=0.125 2023-12-04 06:24:28,045 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=162633.33333333334, ans=0.1 2023-12-04 06:24:28,454 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.34 vs. limit=22.5 2023-12-04 06:24:45,043 INFO [train.py:1087] (1/4) Epoch 28, batch 250, loss[loss=0.1692, simple_loss=0.26, pruned_loss=0.03921, over 24730.00 frames. ], tot_loss[loss=0.1711, simple_loss=0.2614, pruned_loss=0.04039, over 3455282.15 frames. ], batch size: 67, lr: 8.74e-03, grad_scale: 32.0 2023-12-04 06:24:56,823 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=162833.33333333334, ans=0.125 2023-12-04 06:24:58,899 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=162833.33333333334, ans=0.0 2023-12-04 06:25:05,487 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.092e+02 1.354e+02 1.477e+02 1.650e+02 2.112e+02, threshold=2.954e+02, percent-clipped=0.0 2023-12-04 06:25:15,318 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=162900.0, ans=0.0 2023-12-04 06:25:34,471 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=163033.33333333334, ans=0.0 2023-12-04 06:25:35,430 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=163033.33333333334, ans=0.125 2023-12-04 06:25:40,325 INFO [train.py:1087] (1/4) Epoch 28, batch 300, loss[loss=0.1802, simple_loss=0.2603, pruned_loss=0.05007, over 24469.00 frames. ], tot_loss[loss=0.1719, simple_loss=0.2618, pruned_loss=0.04101, over 3739357.36 frames. ], batch size: 75, lr: 8.73e-03, grad_scale: 32.0 2023-12-04 06:25:44,746 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.83 vs. limit=22.5 2023-12-04 06:25:47,648 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=163100.0, ans=0.1 2023-12-04 06:25:57,250 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=163166.66666666666, ans=0.125 2023-12-04 06:26:14,155 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=163300.0, ans=0.0 2023-12-04 06:26:35,161 INFO [train.py:1087] (1/4) Epoch 28, batch 350, loss[loss=0.1663, simple_loss=0.2546, pruned_loss=0.03895, over 24725.00 frames. ], tot_loss[loss=0.1717, simple_loss=0.2617, pruned_loss=0.04081, over 3984983.62 frames. 
], batch size: 69, lr: 8.73e-03, grad_scale: 32.0 2023-12-04 06:26:38,671 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=163433.33333333334, ans=0.0 2023-12-04 06:26:42,459 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.78 vs. limit=15.0 2023-12-04 06:26:50,380 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=163500.0, ans=0.125 2023-12-04 06:26:56,398 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.203e+02 1.381e+02 1.505e+02 1.634e+02 2.897e+02, threshold=3.011e+02, percent-clipped=0.0 2023-12-04 06:27:07,485 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.23 vs. limit=12.0 2023-12-04 06:27:09,213 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=163633.33333333334, ans=0.125 2023-12-04 06:27:09,534 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.95 vs. limit=15.0 2023-12-04 06:27:16,957 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.41 vs. limit=15.0 2023-12-04 06:27:30,268 INFO [train.py:1087] (1/4) Epoch 28, batch 400, loss[loss=0.1648, simple_loss=0.257, pruned_loss=0.03628, over 24762.00 frames. ], tot_loss[loss=0.1718, simple_loss=0.2618, pruned_loss=0.04095, over 4161004.53 frames. ], batch size: 64, lr: 8.72e-03, grad_scale: 32.0 2023-12-04 06:27:35,893 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=163766.66666666666, ans=0.2 2023-12-04 06:27:50,993 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=163833.33333333334, ans=0.2 2023-12-04 06:27:52,171 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=163900.0, ans=0.125 2023-12-04 06:28:01,661 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 06:28:02,684 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=163966.66666666666, ans=0.2 2023-12-04 06:28:04,752 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=163966.66666666666, ans=0.125 2023-12-04 06:28:04,812 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=163966.66666666666, ans=0.125 2023-12-04 06:28:11,592 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.78 vs. 
limit=15.0 2023-12-04 06:28:16,118 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=164033.33333333334, ans=0.0 2023-12-04 06:28:16,150 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=164033.33333333334, ans=0.0 2023-12-04 06:28:26,127 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.18 vs. limit=22.5 2023-12-04 06:28:26,466 INFO [train.py:1087] (1/4) Epoch 28, batch 450, loss[loss=0.1656, simple_loss=0.2556, pruned_loss=0.03787, over 24770.00 frames. ], tot_loss[loss=0.1721, simple_loss=0.262, pruned_loss=0.04112, over 4275209.31 frames. ], batch size: 64, lr: 8.71e-03, grad_scale: 32.0 2023-12-04 06:28:36,275 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=164166.66666666666, ans=0.125 2023-12-04 06:28:40,609 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=164166.66666666666, ans=0.1 2023-12-04 06:28:46,907 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.121e+02 1.427e+02 1.550e+02 1.706e+02 2.380e+02, threshold=3.099e+02, percent-clipped=0.0 2023-12-04 06:28:56,718 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=164233.33333333334, ans=0.2 2023-12-04 06:28:57,808 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=164233.33333333334, ans=0.0 2023-12-04 06:29:07,165 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=164300.0, ans=0.0 2023-12-04 06:29:11,574 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=164366.66666666666, ans=0.1 2023-12-04 06:29:18,193 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=164366.66666666666, ans=0.2 2023-12-04 06:29:21,528 INFO [train.py:1087] (1/4) Epoch 28, batch 500, loss[loss=0.1707, simple_loss=0.2637, pruned_loss=0.03885, over 24727.00 frames. ], tot_loss[loss=0.1721, simple_loss=0.262, pruned_loss=0.04112, over 4378351.09 frames. ], batch size: 67, lr: 8.70e-03, grad_scale: 32.0 2023-12-04 06:29:32,670 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-12-04 06:29:54,514 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=164633.33333333334, ans=0.1 2023-12-04 06:29:56,518 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=164633.33333333334, ans=0.0 2023-12-04 06:30:16,786 INFO [train.py:1087] (1/4) Epoch 28, batch 550, loss[loss=0.1815, simple_loss=0.2713, pruned_loss=0.04585, over 24749.00 frames. ], tot_loss[loss=0.1717, simple_loss=0.2618, pruned_loss=0.04079, over 4483977.73 frames. ], batch size: 63, lr: 8.69e-03, grad_scale: 32.0 2023-12-04 06:30:28,999 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.60 vs. 
limit=12.0 2023-12-04 06:30:34,206 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=164833.33333333334, ans=0.2 2023-12-04 06:30:38,253 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.161e+02 1.410e+02 1.492e+02 1.603e+02 2.301e+02, threshold=2.985e+02, percent-clipped=0.0 2023-12-04 06:30:57,503 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=164966.66666666666, ans=0.125 2023-12-04 06:31:12,405 INFO [train.py:1087] (1/4) Epoch 28, batch 600, loss[loss=0.1654, simple_loss=0.2532, pruned_loss=0.03876, over 24799.00 frames. ], tot_loss[loss=0.1714, simple_loss=0.2615, pruned_loss=0.04064, over 4560541.46 frames. ], batch size: 72, lr: 8.69e-03, grad_scale: 32.0 2023-12-04 06:31:26,420 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=165166.66666666666, ans=15.0 2023-12-04 06:31:34,956 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=165233.33333333334, ans=0.0 2023-12-04 06:31:55,636 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=165366.66666666666, ans=0.0 2023-12-04 06:32:07,895 INFO [train.py:1087] (1/4) Epoch 28, batch 650, loss[loss=0.1758, simple_loss=0.2665, pruned_loss=0.04257, over 24858.00 frames. ], tot_loss[loss=0.1711, simple_loss=0.2612, pruned_loss=0.04051, over 4635540.71 frames. ], batch size: 68, lr: 8.68e-03, grad_scale: 32.0 2023-12-04 06:32:11,561 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=165433.33333333334, ans=0.1 2023-12-04 06:32:28,977 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.188e+02 1.357e+02 1.538e+02 1.703e+02 2.135e+02, threshold=3.076e+02, percent-clipped=0.0 2023-12-04 06:33:02,561 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=165766.66666666666, ans=0.2 2023-12-04 06:33:03,422 INFO [train.py:1087] (1/4) Epoch 28, batch 700, loss[loss=0.1809, simple_loss=0.2702, pruned_loss=0.04578, over 24289.00 frames. ], tot_loss[loss=0.1716, simple_loss=0.2613, pruned_loss=0.04094, over 4659185.91 frames. ], batch size: 79, lr: 8.67e-03, grad_scale: 32.0 2023-12-04 06:33:03,743 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=165766.66666666666, ans=0.125 2023-12-04 06:33:22,143 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=165833.33333333334, ans=0.125 2023-12-04 06:33:39,293 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=165966.66666666666, ans=0.125 2023-12-04 06:33:55,491 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=166033.33333333334, ans=0.1 2023-12-04 06:33:58,747 INFO [train.py:1087] (1/4) Epoch 28, batch 750, loss[loss=0.1684, simple_loss=0.2543, pruned_loss=0.04126, over 24454.00 frames. ], tot_loss[loss=0.1711, simple_loss=0.2611, pruned_loss=0.04061, over 4687781.65 frames. 
], batch size: 75, lr: 8.66e-03, grad_scale: 32.0 2023-12-04 06:34:05,718 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=166100.0, ans=0.1 2023-12-04 06:34:12,034 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=166166.66666666666, ans=0.5 2023-12-04 06:34:14,219 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=166166.66666666666, ans=0.04949747468305833 2023-12-04 06:34:20,247 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.145e+02 1.371e+02 1.508e+02 1.622e+02 2.380e+02, threshold=3.017e+02, percent-clipped=0.0 2023-12-04 06:34:37,746 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=166300.0, ans=0.125 2023-12-04 06:34:48,452 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=166366.66666666666, ans=0.125 2023-12-04 06:34:49,940 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-12-04 06:34:50,628 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=166366.66666666666, ans=0.125 2023-12-04 06:34:53,467 INFO [train.py:1087] (1/4) Epoch 28, batch 800, loss[loss=0.1667, simple_loss=0.2613, pruned_loss=0.03609, over 24784.00 frames. ], tot_loss[loss=0.1706, simple_loss=0.2606, pruned_loss=0.04029, over 4721322.21 frames. ], batch size: 71, lr: 8.65e-03, grad_scale: 32.0 2023-12-04 06:35:03,891 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=166500.0, ans=0.5 2023-12-04 06:35:26,829 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.70 vs. limit=15.0 2023-12-04 06:35:32,876 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=166633.33333333334, ans=0.0 2023-12-04 06:35:33,757 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=166633.33333333334, ans=0.125 2023-12-04 06:35:33,920 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=166633.33333333334, ans=0.5 2023-12-04 06:35:45,560 INFO [train.py:1087] (1/4) Epoch 28, batch 850, loss[loss=0.1675, simple_loss=0.2515, pruned_loss=0.04181, over 24569.00 frames. ], tot_loss[loss=0.1705, simple_loss=0.2605, pruned_loss=0.04029, over 4742782.18 frames. ], batch size: 64, lr: 8.65e-03, grad_scale: 32.0 2023-12-04 06:36:05,284 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.159e+02 1.336e+02 1.417e+02 1.578e+02 2.215e+02, threshold=2.834e+02, percent-clipped=0.0 2023-12-04 06:36:24,848 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=167033.33333333334, ans=0.125 2023-12-04 06:36:45,598 INFO [train.py:1087] (1/4) Epoch 29, batch 0, loss[loss=0.1722, simple_loss=0.2643, pruned_loss=0.04005, over 24764.00 frames. ], tot_loss[loss=0.1722, simple_loss=0.2643, pruned_loss=0.04005, over 24764.00 frames. 
], batch size: 64, lr: 8.49e-03, grad_scale: 32.0 2023-12-04 06:36:45,599 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 06:36:56,922 INFO [zipformer.py:1876] (1/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.2383, 3.1590, 2.8748, 4.7406], device='cuda:1') 2023-12-04 06:36:57,634 INFO [train.py:1119] (1/4) Epoch 29, validation: loss=0.1551, simple_loss=0.2558, pruned_loss=0.02721, over 944034.00 frames. 2023-12-04 06:36:57,635 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 06:36:57,972 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=167066.66666666666, ans=0.0 2023-12-04 06:37:09,408 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=167133.33333333334, ans=0.1 2023-12-04 06:37:19,600 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-12-04 06:37:28,548 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=167200.0, ans=0.2 2023-12-04 06:37:35,284 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.46 vs. limit=10.0 2023-12-04 06:37:51,881 INFO [train.py:1087] (1/4) Epoch 29, batch 50, loss[loss=0.1618, simple_loss=0.252, pruned_loss=0.03579, over 24786.00 frames. ], tot_loss[loss=0.1716, simple_loss=0.2616, pruned_loss=0.04082, over 1093821.72 frames. ], batch size: 73, lr: 8.48e-03, grad_scale: 32.0 2023-12-04 06:37:58,307 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=167400.0, ans=0.1 2023-12-04 06:38:06,164 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2.whitening_limit, batch_count=167466.66666666666, ans=15.0 2023-12-04 06:38:07,921 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=167466.66666666666, ans=0.125 2023-12-04 06:38:12,252 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 06:38:19,740 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.229e+02 1.470e+02 1.626e+02 1.821e+02 2.720e+02, threshold=3.251e+02, percent-clipped=0.0 2023-12-04 06:38:19,993 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=167533.33333333334, ans=0.0 2023-12-04 06:38:20,948 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=167533.33333333334, ans=0.125 2023-12-04 06:38:20,974 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=167533.33333333334, ans=0.125 2023-12-04 06:38:33,549 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=167600.0, ans=0.125 2023-12-04 06:38:35,738 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=167666.66666666666, ans=15.0 2023-12-04 06:38:39,774 INFO [scaling.py:213] (1/4) ScheduledFloat: 
name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=167666.66666666666, ans=0.125 2023-12-04 06:38:39,833 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=167666.66666666666, ans=0.125 2023-12-04 06:38:43,976 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=167666.66666666666, ans=0.125 2023-12-04 06:38:46,866 INFO [train.py:1087] (1/4) Epoch 29, batch 100, loss[loss=0.1733, simple_loss=0.2626, pruned_loss=0.04196, over 24003.00 frames. ], tot_loss[loss=0.1705, simple_loss=0.2612, pruned_loss=0.03987, over 1922352.82 frames. ], batch size: 87, lr: 8.47e-03, grad_scale: 32.0 2023-12-04 06:38:49,236 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=167733.33333333334, ans=0.125 2023-12-04 06:38:49,719 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.65 vs. limit=22.5 2023-12-04 06:39:08,782 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=167866.66666666666, ans=0.0 2023-12-04 06:39:09,115 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.78 vs. limit=22.5 2023-12-04 06:39:20,653 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=167933.33333333334, ans=0.1 2023-12-04 06:39:32,865 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=168000.0, ans=0.125 2023-12-04 06:39:40,536 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=168000.0, ans=0.125 2023-12-04 06:39:42,348 INFO [train.py:1087] (1/4) Epoch 29, batch 150, loss[loss=0.1725, simple_loss=0.2618, pruned_loss=0.04162, over 24563.00 frames. ], tot_loss[loss=0.1711, simple_loss=0.2616, pruned_loss=0.04035, over 2539617.15 frames. ], batch size: 63, lr: 8.46e-03, grad_scale: 32.0 2023-12-04 06:39:45,707 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=168066.66666666666, ans=0.1 2023-12-04 06:40:03,611 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=168200.0, ans=0.125 2023-12-04 06:40:10,045 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.150e+02 1.375e+02 1.505e+02 1.698e+02 3.083e+02, threshold=3.010e+02, percent-clipped=0.0 2023-12-04 06:40:37,334 INFO [train.py:1087] (1/4) Epoch 29, batch 200, loss[loss=0.1644, simple_loss=0.258, pruned_loss=0.03543, over 24796.00 frames. ], tot_loss[loss=0.1712, simple_loss=0.2615, pruned_loss=0.04042, over 3031229.20 frames. 
], batch size: 73, lr: 8.46e-03, grad_scale: 32.0 2023-12-04 06:40:46,365 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=168400.0, ans=0.0 2023-12-04 06:41:07,390 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=168533.33333333334, ans=0.125 2023-12-04 06:41:18,482 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=168600.0, ans=0.125 2023-12-04 06:41:20,583 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=168666.66666666666, ans=0.125 2023-12-04 06:41:22,850 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.51 vs. limit=15.0 2023-12-04 06:41:23,636 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=168666.66666666666, ans=0.125 2023-12-04 06:41:32,669 INFO [train.py:1087] (1/4) Epoch 29, batch 250, loss[loss=0.1761, simple_loss=0.267, pruned_loss=0.0426, over 20891.00 frames. ], tot_loss[loss=0.1705, simple_loss=0.2609, pruned_loss=0.04008, over 3442333.94 frames. ], batch size: 50, lr: 8.45e-03, grad_scale: 32.0 2023-12-04 06:41:40,559 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.85 vs. limit=15.0 2023-12-04 06:41:42,471 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=168800.0, ans=0.125 2023-12-04 06:41:52,177 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=168800.0, ans=0.0 2023-12-04 06:41:55,638 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=168866.66666666666, ans=0.2 2023-12-04 06:41:59,577 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.135e+02 1.376e+02 1.495e+02 1.723e+02 2.429e+02, threshold=2.991e+02, percent-clipped=0.0 2023-12-04 06:41:59,929 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=168866.66666666666, ans=0.125 2023-12-04 06:42:04,513 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=168933.33333333334, ans=0.125 2023-12-04 06:42:13,875 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=168933.33333333334, ans=0.125 2023-12-04 06:42:26,622 INFO [train.py:1087] (1/4) Epoch 29, batch 300, loss[loss=0.1762, simple_loss=0.2632, pruned_loss=0.04463, over 24187.00 frames. ], tot_loss[loss=0.1694, simple_loss=0.2602, pruned_loss=0.03932, over 3762643.33 frames. ], batch size: 82, lr: 8.44e-03, grad_scale: 32.0 2023-12-04 06:42:31,403 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=169066.66666666666, ans=0.125 2023-12-04 06:42:37,577 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.89 vs. 
limit=15.0 2023-12-04 06:42:44,568 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=169133.33333333334, ans=0.125 2023-12-04 06:42:44,597 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=169133.33333333334, ans=0.1 2023-12-04 06:42:47,791 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=169200.0, ans=0.1 2023-12-04 06:42:52,143 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.31 vs. limit=22.5 2023-12-04 06:43:01,373 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=169266.66666666666, ans=0.125 2023-12-04 06:43:22,038 INFO [train.py:1087] (1/4) Epoch 29, batch 350, loss[loss=0.1551, simple_loss=0.2478, pruned_loss=0.03124, over 24571.00 frames. ], tot_loss[loss=0.1692, simple_loss=0.2599, pruned_loss=0.03929, over 4005851.94 frames. ], batch size: 62, lr: 8.43e-03, grad_scale: 32.0 2023-12-04 06:43:48,697 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=169533.33333333334, ans=0.125 2023-12-04 06:43:49,500 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.187e+02 1.359e+02 1.460e+02 1.582e+02 2.441e+02, threshold=2.921e+02, percent-clipped=0.0 2023-12-04 06:43:50,277 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.11 vs. limit=15.0 2023-12-04 06:44:17,605 INFO [train.py:1087] (1/4) Epoch 29, batch 400, loss[loss=0.1643, simple_loss=0.2531, pruned_loss=0.03778, over 24866.00 frames. ], tot_loss[loss=0.17, simple_loss=0.2605, pruned_loss=0.03981, over 4174362.98 frames. ], batch size: 68, lr: 8.42e-03, grad_scale: 32.0 2023-12-04 06:44:23,049 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=169733.33333333334, ans=0.1 2023-12-04 06:44:24,166 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=169733.33333333334, ans=0.125 2023-12-04 06:44:28,648 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-12-04 06:44:31,927 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=169800.0, ans=0.125 2023-12-04 06:44:33,971 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=169800.0, ans=0.0 2023-12-04 06:44:35,468 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=169800.0, ans=0.0 2023-12-04 06:45:12,742 INFO [train.py:1087] (1/4) Epoch 29, batch 450, loss[loss=0.1573, simple_loss=0.2465, pruned_loss=0.03401, over 24761.00 frames. ], tot_loss[loss=0.1698, simple_loss=0.2601, pruned_loss=0.03973, over 4310986.10 frames. 
], batch size: 63, lr: 8.42e-03, grad_scale: 32.0 2023-12-04 06:45:20,783 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=170066.66666666666, ans=0.125 2023-12-04 06:45:27,327 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-12-04 06:45:32,425 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=170133.33333333334, ans=0.125 2023-12-04 06:45:40,585 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.168e+02 1.390e+02 1.484e+02 1.747e+02 2.437e+02, threshold=2.967e+02, percent-clipped=0.0 2023-12-04 06:45:45,911 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.36 vs. limit=15.0 2023-12-04 06:46:07,941 INFO [train.py:1087] (1/4) Epoch 29, batch 500, loss[loss=0.1622, simple_loss=0.2542, pruned_loss=0.03508, over 24734.00 frames. ], tot_loss[loss=0.1694, simple_loss=0.2597, pruned_loss=0.03955, over 4420424.89 frames. ], batch size: 67, lr: 8.41e-03, grad_scale: 32.0 2023-12-04 06:46:26,305 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=170466.66666666666, ans=0.125 2023-12-04 06:46:36,105 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.30 vs. limit=15.0 2023-12-04 06:46:38,737 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 06:46:59,593 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=170666.66666666666, ans=0.125 2023-12-04 06:47:03,539 INFO [train.py:1087] (1/4) Epoch 29, batch 550, loss[loss=0.1728, simple_loss=0.2604, pruned_loss=0.04258, over 24736.00 frames. ], tot_loss[loss=0.1704, simple_loss=0.2605, pruned_loss=0.04014, over 4503593.27 frames. ], batch size: 63, lr: 8.40e-03, grad_scale: 32.0 2023-12-04 06:47:15,336 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=170800.0, ans=0.2 2023-12-04 06:47:22,140 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=170800.0, ans=0.125 2023-12-04 06:47:25,447 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=170866.66666666666, ans=22.5 2023-12-04 06:47:31,297 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.156e+02 1.371e+02 1.473e+02 1.639e+02 2.975e+02, threshold=2.945e+02, percent-clipped=1.0 2023-12-04 06:47:38,125 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=170933.33333333334, ans=0.1 2023-12-04 06:47:43,735 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=170933.33333333334, ans=0.0 2023-12-04 06:47:58,655 INFO [train.py:1087] (1/4) Epoch 29, batch 600, loss[loss=0.1765, simple_loss=0.2665, pruned_loss=0.04319, over 24556.00 frames. ], tot_loss[loss=0.1703, simple_loss=0.2603, pruned_loss=0.04009, over 4560364.19 frames. 
], batch size: 66, lr: 8.39e-03, grad_scale: 32.0 2023-12-04 06:48:11,441 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=171133.33333333334, ans=0.0 2023-12-04 06:48:54,933 INFO [train.py:1087] (1/4) Epoch 29, batch 650, loss[loss=0.172, simple_loss=0.2647, pruned_loss=0.03968, over 24697.00 frames. ], tot_loss[loss=0.1699, simple_loss=0.2601, pruned_loss=0.03986, over 4630001.78 frames. ], batch size: 69, lr: 8.39e-03, grad_scale: 32.0 2023-12-04 06:48:58,287 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=171400.0, ans=0.125 2023-12-04 06:49:04,727 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=171466.66666666666, ans=10.0 2023-12-04 06:49:04,746 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-04 06:49:15,096 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=171466.66666666666, ans=0.125 2023-12-04 06:49:22,502 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.140e+02 1.349e+02 1.422e+02 1.549e+02 2.709e+02, threshold=2.844e+02, percent-clipped=0.0 2023-12-04 06:49:33,663 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=171600.0, ans=0.125 2023-12-04 06:49:50,068 INFO [train.py:1087] (1/4) Epoch 29, batch 700, loss[loss=0.1683, simple_loss=0.2543, pruned_loss=0.04112, over 24734.00 frames. ], tot_loss[loss=0.1703, simple_loss=0.2603, pruned_loss=0.04013, over 4658533.47 frames. ], batch size: 61, lr: 8.38e-03, grad_scale: 32.0 2023-12-04 06:49:55,010 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=171733.33333333334, ans=0.0 2023-12-04 06:50:19,357 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=171866.66666666666, ans=0.05 2023-12-04 06:50:46,337 INFO [train.py:1087] (1/4) Epoch 29, batch 750, loss[loss=0.1659, simple_loss=0.259, pruned_loss=0.03638, over 24763.00 frames. ], tot_loss[loss=0.1705, simple_loss=0.2605, pruned_loss=0.04023, over 4684955.05 frames. ], batch size: 64, lr: 8.37e-03, grad_scale: 32.0 2023-12-04 06:50:54,720 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=172066.66666666666, ans=0.125 2023-12-04 06:50:54,822 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=172066.66666666666, ans=0.1 2023-12-04 06:50:59,107 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=172133.33333333334, ans=0.125 2023-12-04 06:51:10,409 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-12-04 06:51:13,866 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.143e+02 1.338e+02 1.442e+02 1.731e+02 2.430e+02, threshold=2.884e+02, percent-clipped=0.0 2023-12-04 06:51:15,716 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.57 vs. 
limit=22.5 2023-12-04 06:51:16,241 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=172200.0, ans=0.125 2023-12-04 06:51:17,865 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-12-04 06:51:25,414 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=172266.66666666666, ans=0.125 2023-12-04 06:51:28,866 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=172266.66666666666, ans=0.2 2023-12-04 06:51:39,947 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=172333.33333333334, ans=0.2 2023-12-04 06:51:41,780 INFO [train.py:1087] (1/4) Epoch 29, batch 800, loss[loss=0.1662, simple_loss=0.2582, pruned_loss=0.03712, over 24800.00 frames. ], tot_loss[loss=0.1696, simple_loss=0.2598, pruned_loss=0.03968, over 4721414.20 frames. ], batch size: 73, lr: 8.36e-03, grad_scale: 32.0 2023-12-04 06:51:56,040 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=172466.66666666666, ans=0.125 2023-12-04 06:52:29,507 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=172666.66666666666, ans=0.0 2023-12-04 06:52:33,318 INFO [train.py:1087] (1/4) Epoch 29, batch 850, loss[loss=0.1928, simple_loss=0.2781, pruned_loss=0.05372, over 24138.00 frames. ], tot_loss[loss=0.1698, simple_loss=0.2599, pruned_loss=0.03986, over 4737804.50 frames. ], batch size: 82, lr: 8.36e-03, grad_scale: 32.0 2023-12-04 06:52:58,525 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.185e+02 1.381e+02 1.487e+02 1.682e+02 2.314e+02, threshold=2.974e+02, percent-clipped=0.0 2023-12-04 06:53:09,480 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=172933.33333333334, ans=0.1 2023-12-04 06:53:12,597 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=173000.0, ans=0.2 2023-12-04 06:53:15,633 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=173000.0, ans=0.0 2023-12-04 06:53:16,535 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=173000.0, ans=0.0 2023-12-04 06:53:34,419 INFO [train.py:1087] (1/4) Epoch 30, batch 0, loss[loss=0.173, simple_loss=0.2689, pruned_loss=0.03859, over 22834.00 frames. ], tot_loss[loss=0.173, simple_loss=0.2689, pruned_loss=0.03859, over 22834.00 frames. ], batch size: 106, lr: 8.21e-03, grad_scale: 32.0 2023-12-04 06:53:34,420 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 06:53:41,976 INFO [zipformer.py:1876] (1/4) name=encoder.encoders.2.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([2.1061, 4.3555, 4.2979, 4.6213], device='cuda:1') 2023-12-04 06:53:46,442 INFO [train.py:1119] (1/4) Epoch 30, validation: loss=0.155, simple_loss=0.2554, pruned_loss=0.02733, over 944034.00 frames. 
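The per-batch [train.py:1087] records above follow a fixed format: each reports the current batch's loss, a running tot_loss over the frames seen so far, the batch size, learning rate and grad_scale. What follows is a minimal sketch, not part of the log, for pulling the running tot_loss out of a saved copy of this log for plotting; the file name "train-log.txt" and the regular expression are assumptions based only on the record format visible here.

import re

# Matches records of the form
#   "Epoch 30, batch 500, loss[...], tot_loss[loss=0.1697, ...]"
# re.S lets the pattern span hard line wraps inside a record.
TOT_LOSS = re.compile(
    r"Epoch (?P<epoch>\d+), batch (?P<batch>\d+), .*?"
    r"tot_loss\[loss=(?P<tot>[\d.]+)",
    re.S,
)

def running_tot_loss(path="train-log.txt"):
    # Yield (epoch, batch, running tot_loss) for every per-batch training record.
    with open(path) as f:
        text = f.read()
    for m in TOT_LOSS.finditer(text):
        yield int(m["epoch"]), int(m["batch"]), float(m["tot"])

if __name__ == "__main__":
    for epoch, batch, loss in running_tot_loss():
        print(f"epoch {epoch:3d}  batch {batch:5d}  tot_loss {loss:.4f}")

The extracted triples can be fed directly to a plotting library to visualise how the running loss drifts within and across epochs, complementing the TensorBoard curves referenced in the configuration.
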
2023-12-04 06:53:46,442 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 06:54:06,328 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=173100.0, ans=0.125 2023-12-04 06:54:07,336 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=173166.66666666666, ans=0.125 2023-12-04 06:54:09,511 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=173166.66666666666, ans=0.0 2023-12-04 06:54:24,846 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=173233.33333333334, ans=0.0 2023-12-04 06:54:40,663 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=173366.66666666666, ans=0.125 2023-12-04 06:54:41,391 INFO [train.py:1087] (1/4) Epoch 30, batch 50, loss[loss=0.1801, simple_loss=0.2668, pruned_loss=0.04668, over 24737.00 frames. ], tot_loss[loss=0.1697, simple_loss=0.2599, pruned_loss=0.03975, over 1087242.44 frames. ], batch size: 69, lr: 8.20e-03, grad_scale: 32.0 2023-12-04 06:54:51,951 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=173433.33333333334, ans=0.0 2023-12-04 06:54:57,396 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 06:55:11,731 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=173500.0, ans=0.1 2023-12-04 06:55:14,574 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.164e+02 1.323e+02 1.434e+02 1.672e+02 2.619e+02, threshold=2.868e+02, percent-clipped=0.0 2023-12-04 06:55:19,356 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=173566.66666666666, ans=0.125 2023-12-04 06:55:20,602 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.83 vs. limit=12.0 2023-12-04 06:55:25,114 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=173633.33333333334, ans=0.125 2023-12-04 06:55:36,609 INFO [train.py:1087] (1/4) Epoch 30, batch 100, loss[loss=0.216, simple_loss=0.295, pruned_loss=0.06856, over 16573.00 frames. ], tot_loss[loss=0.1689, simple_loss=0.2594, pruned_loss=0.03915, over 1908282.25 frames. ], batch size: 177, lr: 8.19e-03, grad_scale: 32.0 2023-12-04 06:55:48,715 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=173766.66666666666, ans=0.035 2023-12-04 06:55:53,557 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.84 vs. limit=15.0 2023-12-04 06:56:12,242 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.51 vs. 
limit=15.0 2023-12-04 06:56:19,355 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=173966.66666666666, ans=0.125 2023-12-04 06:56:31,351 INFO [train.py:1087] (1/4) Epoch 30, batch 150, loss[loss=0.1639, simple_loss=0.254, pruned_loss=0.03688, over 24571.00 frames. ], tot_loss[loss=0.1687, simple_loss=0.2595, pruned_loss=0.03895, over 2560381.06 frames. ], batch size: 65, lr: 8.19e-03, grad_scale: 32.0 2023-12-04 06:56:37,384 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.38 vs. limit=15.0 2023-12-04 06:56:56,786 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=174166.66666666666, ans=0.125 2023-12-04 06:57:05,082 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.152e+02 1.321e+02 1.434e+02 1.605e+02 3.805e+02, threshold=2.868e+02, percent-clipped=1.0 2023-12-04 06:57:14,531 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=174300.0, ans=0.125 2023-12-04 06:57:20,836 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=174300.0, ans=0.0 2023-12-04 06:57:25,389 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=174366.66666666666, ans=0.1 2023-12-04 06:57:26,224 INFO [train.py:1087] (1/4) Epoch 30, batch 200, loss[loss=0.1751, simple_loss=0.26, pruned_loss=0.0451, over 24298.00 frames. ], tot_loss[loss=0.1698, simple_loss=0.2603, pruned_loss=0.03969, over 3039712.31 frames. ], batch size: 79, lr: 8.18e-03, grad_scale: 16.0 2023-12-04 06:57:32,151 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=174366.66666666666, ans=0.0 2023-12-04 06:57:32,219 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=174366.66666666666, ans=0.125 2023-12-04 06:57:41,463 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=174433.33333333334, ans=0.125 2023-12-04 06:58:04,905 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.99 vs. limit=15.0 2023-12-04 06:58:18,784 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=174633.33333333334, ans=0.0 2023-12-04 06:58:20,855 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=174700.0, ans=0.125 2023-12-04 06:58:21,725 INFO [train.py:1087] (1/4) Epoch 30, batch 250, loss[loss=0.1605, simple_loss=0.2554, pruned_loss=0.03275, over 24785.00 frames. ], tot_loss[loss=0.1699, simple_loss=0.2601, pruned_loss=0.03987, over 3422522.95 frames. ], batch size: 73, lr: 8.17e-03, grad_scale: 16.0 2023-12-04 06:58:39,913 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.68 vs. 
limit=15.0 2023-12-04 06:58:44,712 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=174833.33333333334, ans=0.125 2023-12-04 06:58:46,877 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=174833.33333333334, ans=0.2 2023-12-04 06:58:56,043 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.146e+02 1.388e+02 1.491e+02 1.678e+02 2.390e+02, threshold=2.982e+02, percent-clipped=0.0 2023-12-04 06:59:08,831 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=174966.66666666666, ans=0.0 2023-12-04 06:59:17,732 INFO [train.py:1087] (1/4) Epoch 30, batch 300, loss[loss=0.1564, simple_loss=0.2496, pruned_loss=0.03159, over 24785.00 frames. ], tot_loss[loss=0.1694, simple_loss=0.2596, pruned_loss=0.03954, over 3737856.36 frames. ], batch size: 72, lr: 8.16e-03, grad_scale: 16.0 2023-12-04 06:59:27,537 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=175100.0, ans=0.0 2023-12-04 06:59:27,651 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=175100.0, ans=0.125 2023-12-04 06:59:31,113 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=175100.0, ans=0.1 2023-12-04 06:59:31,765 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.87 vs. limit=15.0 2023-12-04 06:59:33,230 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=175100.0, ans=0.1 2023-12-04 06:59:57,739 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=175233.33333333334, ans=0.2 2023-12-04 07:00:02,399 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=175300.0, ans=0.125 2023-12-04 07:00:06,481 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=175300.0, ans=0.125 2023-12-04 07:00:08,829 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=175300.0, ans=0.2 2023-12-04 07:00:13,213 INFO [train.py:1087] (1/4) Epoch 30, batch 350, loss[loss=0.1718, simple_loss=0.2652, pruned_loss=0.03916, over 24613.00 frames. ], tot_loss[loss=0.1698, simple_loss=0.26, pruned_loss=0.03975, over 3960426.72 frames. ], batch size: 68, lr: 8.16e-03, grad_scale: 16.0 2023-12-04 07:00:33,743 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=175433.33333333334, ans=0.0 2023-12-04 07:00:48,228 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.146e+02 1.393e+02 1.497e+02 1.658e+02 2.053e+02, threshold=2.994e+02, percent-clipped=0.0 2023-12-04 07:00:48,498 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=175566.66666666666, ans=0.0 2023-12-04 07:00:50,131 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.53 vs. 
limit=15.0 2023-12-04 07:00:52,084 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=175566.66666666666, ans=0.1 2023-12-04 07:01:07,378 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=175633.33333333334, ans=0.2 2023-12-04 07:01:09,284 INFO [train.py:1087] (1/4) Epoch 30, batch 400, loss[loss=0.1756, simple_loss=0.2666, pruned_loss=0.04229, over 24079.00 frames. ], tot_loss[loss=0.1697, simple_loss=0.2598, pruned_loss=0.03976, over 4143253.09 frames. ], batch size: 87, lr: 8.15e-03, grad_scale: 32.0 2023-12-04 07:01:32,615 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=175833.33333333334, ans=0.0 2023-12-04 07:01:42,366 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=175900.0, ans=0.0 2023-12-04 07:01:47,531 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=175900.0, ans=0.2 2023-12-04 07:01:54,872 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=175966.66666666666, ans=0.125 2023-12-04 07:02:02,615 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-12-04 07:02:04,273 INFO [train.py:1087] (1/4) Epoch 30, batch 450, loss[loss=0.1701, simple_loss=0.2603, pruned_loss=0.0399, over 24221.00 frames. ], tot_loss[loss=0.1698, simple_loss=0.2601, pruned_loss=0.03982, over 4284070.97 frames. ], batch size: 82, lr: 8.14e-03, grad_scale: 32.0 2023-12-04 07:02:16,683 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=176100.0, ans=0.0 2023-12-04 07:02:33,835 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.81 vs. limit=15.0 2023-12-04 07:02:38,483 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.195e+02 1.360e+02 1.451e+02 1.639e+02 2.185e+02, threshold=2.903e+02, percent-clipped=0.0 2023-12-04 07:02:46,552 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=176233.33333333334, ans=0.0 2023-12-04 07:02:47,586 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=176300.0, ans=0.0 2023-12-04 07:03:00,277 INFO [train.py:1087] (1/4) Epoch 30, batch 500, loss[loss=0.1664, simple_loss=0.2555, pruned_loss=0.03865, over 23531.00 frames. ], tot_loss[loss=0.1697, simple_loss=0.2601, pruned_loss=0.03966, over 4396872.81 frames. 
], batch size: 94, lr: 8.13e-03, grad_scale: 32.0 2023-12-04 07:03:17,896 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=176433.33333333334, ans=0.125 2023-12-04 07:03:17,926 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=176433.33333333334, ans=0.0 2023-12-04 07:03:21,438 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=176500.0, ans=0.0 2023-12-04 07:03:39,638 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.68 vs. limit=15.0 2023-12-04 07:03:55,702 INFO [train.py:1087] (1/4) Epoch 30, batch 550, loss[loss=0.1601, simple_loss=0.2506, pruned_loss=0.03482, over 24764.00 frames. ], tot_loss[loss=0.169, simple_loss=0.2595, pruned_loss=0.03928, over 4503480.16 frames. ], batch size: 70, lr: 8.13e-03, grad_scale: 32.0 2023-12-04 07:04:18,568 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 07:04:18,602 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=176833.33333333334, ans=0.125 2023-12-04 07:04:30,799 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.234e+02 1.385e+02 1.470e+02 1.609e+02 2.556e+02, threshold=2.939e+02, percent-clipped=0.0 2023-12-04 07:04:35,444 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=176900.0, ans=0.2 2023-12-04 07:04:45,167 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=176966.66666666666, ans=0.09899494936611666 2023-12-04 07:04:45,248 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=176966.66666666666, ans=0.0 2023-12-04 07:04:51,866 INFO [train.py:1087] (1/4) Epoch 30, batch 600, loss[loss=0.1951, simple_loss=0.2818, pruned_loss=0.05424, over 23533.00 frames. ], tot_loss[loss=0.1693, simple_loss=0.2598, pruned_loss=0.03943, over 4570916.99 frames. ], batch size: 94, lr: 8.12e-03, grad_scale: 32.0 2023-12-04 07:05:31,304 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-12-04 07:05:31,883 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=177233.33333333334, ans=0.0 2023-12-04 07:05:47,489 INFO [train.py:1087] (1/4) Epoch 30, batch 650, loss[loss=0.2142, simple_loss=0.2848, pruned_loss=0.07179, over 17653.00 frames. ], tot_loss[loss=0.1693, simple_loss=0.2598, pruned_loss=0.03937, over 4627855.84 frames. 
], batch size: 177, lr: 8.11e-03, grad_scale: 32.0 2023-12-04 07:06:20,880 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=177566.66666666666, ans=0.125 2023-12-04 07:06:21,614 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.163e+02 1.367e+02 1.457e+02 1.593e+02 2.870e+02, threshold=2.913e+02, percent-clipped=0.0 2023-12-04 07:06:29,836 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=177566.66666666666, ans=0.125 2023-12-04 07:06:43,216 INFO [train.py:1087] (1/4) Epoch 30, batch 700, loss[loss=0.1679, simple_loss=0.2568, pruned_loss=0.03951, over 21195.00 frames. ], tot_loss[loss=0.1691, simple_loss=0.2596, pruned_loss=0.03931, over 4677673.75 frames. ], batch size: 127, lr: 8.11e-03, grad_scale: 32.0 2023-12-04 07:06:45,689 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=177700.0, ans=0.125 2023-12-04 07:07:04,594 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=177833.33333333334, ans=0.125 2023-12-04 07:07:15,637 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=177900.0, ans=0.035 2023-12-04 07:07:21,102 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=177900.0, ans=0.125 2023-12-04 07:07:34,474 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.88 vs. limit=22.5 2023-12-04 07:07:34,635 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.17 vs. limit=10.0 2023-12-04 07:07:37,854 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=178033.33333333334, ans=0.0 2023-12-04 07:07:38,659 INFO [train.py:1087] (1/4) Epoch 30, batch 750, loss[loss=0.183, simple_loss=0.2674, pruned_loss=0.04931, over 24465.00 frames. ], tot_loss[loss=0.1689, simple_loss=0.2594, pruned_loss=0.03921, over 4705694.25 frames. ], batch size: 75, lr: 8.10e-03, grad_scale: 32.0 2023-12-04 07:07:42,831 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=178033.33333333334, ans=22.5 2023-12-04 07:07:48,847 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=178100.0, ans=0.0 2023-12-04 07:07:52,193 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=178100.0, ans=10.0 2023-12-04 07:07:57,804 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=178100.0, ans=0.125 2023-12-04 07:08:13,527 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.101e+02 1.341e+02 1.453e+02 1.706e+02 2.595e+02, threshold=2.907e+02, percent-clipped=0.0 2023-12-04 07:08:28,158 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.80 vs. 
limit=22.5 2023-12-04 07:08:34,617 INFO [train.py:1087] (1/4) Epoch 30, batch 800, loss[loss=0.1738, simple_loss=0.2577, pruned_loss=0.04496, over 24243.00 frames. ], tot_loss[loss=0.1685, simple_loss=0.2591, pruned_loss=0.03895, over 4742223.30 frames. ], batch size: 82, lr: 8.09e-03, grad_scale: 32.0 2023-12-04 07:08:39,063 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 07:08:55,955 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 07:09:00,285 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.00 vs. limit=12.0 2023-12-04 07:09:06,901 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=178566.66666666666, ans=0.125 2023-12-04 07:09:07,909 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=178566.66666666666, ans=0.035 2023-12-04 07:09:26,661 INFO [train.py:1087] (1/4) Epoch 30, batch 850, loss[loss=0.1668, simple_loss=0.2547, pruned_loss=0.03942, over 24533.00 frames. ], tot_loss[loss=0.1687, simple_loss=0.2591, pruned_loss=0.03914, over 4754946.80 frames. ], batch size: 63, lr: 8.09e-03, grad_scale: 16.0 2023-12-04 07:09:28,004 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.70 vs. limit=22.5 2023-12-04 07:09:47,934 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=178833.33333333334, ans=0.0 2023-12-04 07:09:51,285 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.94 vs. limit=15.0 2023-12-04 07:09:55,856 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=178900.0, ans=0.0 2023-12-04 07:09:58,683 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.143e+02 1.343e+02 1.447e+02 1.578e+02 2.521e+02, threshold=2.894e+02, percent-clipped=0.0 2023-12-04 07:10:25,794 INFO [train.py:1087] (1/4) Epoch 31, batch 0, loss[loss=0.1602, simple_loss=0.2486, pruned_loss=0.03588, over 24740.00 frames. ], tot_loss[loss=0.1602, simple_loss=0.2486, pruned_loss=0.03588, over 24740.00 frames. ], batch size: 63, lr: 7.95e-03, grad_scale: 32.0 2023-12-04 07:10:25,795 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 07:10:36,939 INFO [zipformer.py:1876] (1/4) name=encoder.encoders.3.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([1.7124, 2.3467, 2.2844, 2.5071, 2.2429, 2.3446, 2.4261, 2.4339], device='cuda:1') 2023-12-04 07:10:37,277 INFO [zipformer.py:1876] (1/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.5222, 3.9013, 3.3305, 4.3221], device='cuda:1') 2023-12-04 07:10:38,036 INFO [train.py:1119] (1/4) Epoch 31, validation: loss=0.1549, simple_loss=0.2551, pruned_loss=0.02731, over 944034.00 frames. 
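The [optim.py:468] records interleaved throughout report, once per clipping interval, the Clipping_scale, five grad-norm quantile values, the clipping threshold and the percentage of updates that were clipped. Below is a minimal sketch, not part of the log, for tracking the middle quantile and the clipped fraction over training; as above, the file name and the regular expression are assumptions inferred from the record format shown in this log.

import re

# Matches records of the form
#   "grad-norm quartiles 1.143e+02 ... 2.521e+02, threshold=2.894e+02, percent-clipped=0.0"
GRAD_NORM = re.compile(
    r"grad-norm quartiles ([0-9.e+]+) ([0-9.e+]+) ([0-9.e+]+) ([0-9.e+]+) ([0-9.e+]+), "
    r"threshold=([0-9.e+]+), percent-clipped=([0-9.]+)"
)

def grad_norm_stats(path="train-log.txt"):
    # Yield (middle quantile, threshold, percent clipped) for each clipping record.
    with open(path) as f:
        text = f.read()
    for m in GRAD_NORM.finditer(text):
        quantiles = [float(x) for x in m.groups()[:5]]
        yield quantiles[2], float(m.group(6)), float(m.group(7))

if __name__ == "__main__":
    for mid_q, threshold, pct_clipped in grad_norm_stats():
        print(f"mid quantile {mid_q:7.1f}  threshold {threshold:7.1f}  clipped {pct_clipped:.1f}%")

A rising clipped percentage or a middle quantile approaching the threshold would flag unstable gradients; in the records shown here the quantiles stay well below the ~2.8e+02 to 3.0e+02 thresholds and percent-clipped is almost always 0.0.
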
2023-12-04 07:10:38,036 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 07:10:41,399 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=179000.0, ans=0.125 2023-12-04 07:10:43,804 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.99 vs. limit=15.0 2023-12-04 07:10:45,583 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=179000.0, ans=0.125 2023-12-04 07:11:13,214 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=179200.0, ans=0.1 2023-12-04 07:11:16,564 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=179200.0, ans=0.07 2023-12-04 07:11:20,685 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=179266.66666666666, ans=0.125 2023-12-04 07:11:26,732 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=179266.66666666666, ans=0.125 2023-12-04 07:11:30,341 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=179266.66666666666, ans=0.125 2023-12-04 07:11:33,326 INFO [train.py:1087] (1/4) Epoch 31, batch 50, loss[loss=0.1598, simple_loss=0.2525, pruned_loss=0.03358, over 24772.00 frames. ], tot_loss[loss=0.1668, simple_loss=0.2576, pruned_loss=0.03797, over 1079020.95 frames. ], batch size: 71, lr: 7.94e-03, grad_scale: 32.0 2023-12-04 07:11:33,649 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=179333.33333333334, ans=0.2 2023-12-04 07:11:42,407 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 07:11:53,224 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.11 vs. limit=6.0 2023-12-04 07:12:03,385 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=179466.66666666666, ans=0.1 2023-12-04 07:12:13,929 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.146e+02 1.356e+02 1.502e+02 1.739e+02 2.837e+02, threshold=3.004e+02, percent-clipped=0.0 2023-12-04 07:12:19,759 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=179600.0, ans=0.1 2023-12-04 07:12:22,991 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=179600.0, ans=0.125 2023-12-04 07:12:28,789 INFO [train.py:1087] (1/4) Epoch 31, batch 100, loss[loss=0.1615, simple_loss=0.2539, pruned_loss=0.0346, over 24851.00 frames. ], tot_loss[loss=0.168, simple_loss=0.2586, pruned_loss=0.03869, over 1897507.36 frames. 
], batch size: 68, lr: 7.93e-03, grad_scale: 32.0 2023-12-04 07:12:46,313 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=179733.33333333334, ans=0.0 2023-12-04 07:12:52,384 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.10 vs. limit=15.0 2023-12-04 07:13:24,547 INFO [train.py:1087] (1/4) Epoch 31, batch 150, loss[loss=0.1718, simple_loss=0.2654, pruned_loss=0.03911, over 24615.00 frames. ], tot_loss[loss=0.1684, simple_loss=0.2591, pruned_loss=0.03887, over 2554270.47 frames. ], batch size: 68, lr: 7.92e-03, grad_scale: 32.0 2023-12-04 07:13:25,191 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.03 vs. limit=22.5 2023-12-04 07:13:39,799 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=180066.66666666666, ans=0.0 2023-12-04 07:13:58,135 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=180200.0, ans=0.0 2023-12-04 07:14:04,147 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=180200.0, ans=0.1 2023-12-04 07:14:06,421 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.186e+02 1.368e+02 1.478e+02 1.660e+02 2.413e+02, threshold=2.957e+02, percent-clipped=0.0 2023-12-04 07:14:11,900 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 07:14:20,133 INFO [train.py:1087] (1/4) Epoch 31, batch 200, loss[loss=0.1769, simple_loss=0.2669, pruned_loss=0.0435, over 24043.00 frames. ], tot_loss[loss=0.1688, simple_loss=0.2593, pruned_loss=0.03916, over 3049425.39 frames. ], batch size: 87, lr: 7.92e-03, grad_scale: 32.0 2023-12-04 07:14:21,513 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=180333.33333333334, ans=0.0 2023-12-04 07:14:29,233 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=180333.33333333334, ans=0.125 2023-12-04 07:14:34,760 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.94 vs. limit=15.0 2023-12-04 07:14:41,353 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.89 vs. limit=10.0 2023-12-04 07:15:15,746 INFO [train.py:1087] (1/4) Epoch 31, batch 250, loss[loss=0.1778, simple_loss=0.268, pruned_loss=0.04379, over 24174.00 frames. ], tot_loss[loss=0.1686, simple_loss=0.259, pruned_loss=0.03909, over 3424627.97 frames. ], batch size: 82, lr: 7.91e-03, grad_scale: 32.0 2023-12-04 07:15:23,802 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.66 vs. 
limit=12.0 2023-12-04 07:15:24,470 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=180666.66666666666, ans=0.0 2023-12-04 07:15:39,319 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=180800.0, ans=0.0 2023-12-04 07:15:55,899 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.359e+02 1.512e+02 1.653e+02 2.162e+02, threshold=3.025e+02, percent-clipped=0.0 2023-12-04 07:16:00,740 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.54 vs. limit=15.0 2023-12-04 07:16:02,661 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=180933.33333333334, ans=0.1 2023-12-04 07:16:10,754 INFO [train.py:1087] (1/4) Epoch 31, batch 300, loss[loss=0.2089, simple_loss=0.2873, pruned_loss=0.06526, over 17362.00 frames. ], tot_loss[loss=0.1685, simple_loss=0.2591, pruned_loss=0.03897, over 3731721.22 frames. ], batch size: 176, lr: 7.90e-03, grad_scale: 32.0 2023-12-04 07:16:41,956 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=181133.33333333334, ans=0.1 2023-12-04 07:16:45,187 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=181200.0, ans=0.0 2023-12-04 07:16:47,223 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=181200.0, ans=0.125 2023-12-04 07:17:05,148 INFO [train.py:1087] (1/4) Epoch 31, batch 350, loss[loss=0.1699, simple_loss=0.254, pruned_loss=0.0429, over 24499.00 frames. ], tot_loss[loss=0.1685, simple_loss=0.259, pruned_loss=0.03905, over 3976340.89 frames. ], batch size: 75, lr: 7.90e-03, grad_scale: 32.0 2023-12-04 07:17:05,713 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.62 vs. limit=6.0 2023-12-04 07:17:15,545 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=181333.33333333334, ans=0.125 2023-12-04 07:17:23,790 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=181400.0, ans=0.0 2023-12-04 07:17:39,112 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=181533.33333333334, ans=0.1 2023-12-04 07:17:46,922 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=181533.33333333334, ans=22.5 2023-12-04 07:17:47,357 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.102e+02 1.342e+02 1.424e+02 1.549e+02 2.896e+02, threshold=2.849e+02, percent-clipped=0.0 2023-12-04 07:17:48,590 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=181533.33333333334, ans=0.0 2023-12-04 07:18:01,138 INFO [train.py:1087] (1/4) Epoch 31, batch 400, loss[loss=0.1789, simple_loss=0.2705, pruned_loss=0.04362, over 24728.00 frames. ], tot_loss[loss=0.1688, simple_loss=0.2593, pruned_loss=0.03918, over 4158896.02 frames. 
], batch size: 61, lr: 7.89e-03, grad_scale: 32.0 2023-12-04 07:18:26,012 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=181800.0, ans=0.5 2023-12-04 07:18:33,355 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=181866.66666666666, ans=0.0 2023-12-04 07:18:56,716 INFO [train.py:1087] (1/4) Epoch 31, batch 450, loss[loss=0.1572, simple_loss=0.245, pruned_loss=0.03472, over 24679.00 frames. ], tot_loss[loss=0.1686, simple_loss=0.259, pruned_loss=0.03909, over 4299347.32 frames. ], batch size: 74, lr: 7.88e-03, grad_scale: 32.0 2023-12-04 07:18:57,919 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=182000.0, ans=0.0 2023-12-04 07:19:17,117 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.90 vs. limit=15.0 2023-12-04 07:19:28,112 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=182133.33333333334, ans=0.1 2023-12-04 07:19:29,275 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 07:19:36,541 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=182200.0, ans=0.125 2023-12-04 07:19:36,584 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=182200.0, ans=0.1 2023-12-04 07:19:37,346 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.132e+02 1.319e+02 1.420e+02 1.584e+02 2.231e+02, threshold=2.839e+02, percent-clipped=0.0 2023-12-04 07:19:37,634 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=182200.0, ans=10.0 2023-12-04 07:19:45,282 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.18 vs. limit=15.0 2023-12-04 07:19:52,201 INFO [train.py:1087] (1/4) Epoch 31, batch 500, loss[loss=0.1577, simple_loss=0.2522, pruned_loss=0.03155, over 24697.00 frames. ], tot_loss[loss=0.1686, simple_loss=0.2588, pruned_loss=0.03914, over 4410673.37 frames. ], batch size: 74, lr: 7.88e-03, grad_scale: 32.0 2023-12-04 07:20:35,061 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=182600.0, ans=0.125 2023-12-04 07:20:40,496 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=182600.0, ans=0.07 2023-12-04 07:20:46,817 INFO [train.py:1087] (1/4) Epoch 31, batch 550, loss[loss=0.1733, simple_loss=0.2614, pruned_loss=0.04261, over 23506.00 frames. ], tot_loss[loss=0.1683, simple_loss=0.2587, pruned_loss=0.03896, over 4500198.21 frames. 
], batch size: 94, lr: 7.87e-03, grad_scale: 32.0 2023-12-04 07:20:59,996 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=182733.33333333334, ans=0.0 2023-12-04 07:21:05,678 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=182733.33333333334, ans=0.125 2023-12-04 07:21:08,070 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.89 vs. limit=15.0 2023-12-04 07:21:09,951 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=182800.0, ans=0.0 2023-12-04 07:21:19,813 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=182866.66666666666, ans=0.0 2023-12-04 07:21:28,707 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.198e+02 1.420e+02 1.529e+02 1.688e+02 2.421e+02, threshold=3.059e+02, percent-clipped=0.0 2023-12-04 07:21:40,915 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=182933.33333333334, ans=0.125 2023-12-04 07:21:41,997 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=183000.0, ans=0.125 2023-12-04 07:21:42,761 INFO [train.py:1087] (1/4) Epoch 31, batch 600, loss[loss=0.1607, simple_loss=0.2552, pruned_loss=0.03312, over 24614.00 frames. ], tot_loss[loss=0.1679, simple_loss=0.2584, pruned_loss=0.03872, over 4579191.48 frames. ], batch size: 68, lr: 7.86e-03, grad_scale: 32.0 2023-12-04 07:21:47,615 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-12-04 07:21:48,530 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=183000.0, ans=0.125 2023-12-04 07:22:20,006 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-12-04 07:22:28,784 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=183266.66666666666, ans=0.125 2023-12-04 07:22:30,880 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=183266.66666666666, ans=0.035 2023-12-04 07:22:35,096 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=183266.66666666666, ans=0.1 2023-12-04 07:22:35,327 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-12-04 07:22:38,099 INFO [train.py:1087] (1/4) Epoch 31, batch 650, loss[loss=0.1705, simple_loss=0.2611, pruned_loss=0.03995, over 24211.00 frames. ], tot_loss[loss=0.1676, simple_loss=0.2581, pruned_loss=0.03854, over 4630869.65 frames. 
], batch size: 82, lr: 7.86e-03, grad_scale: 32.0 2023-12-04 07:22:38,404 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=183333.33333333334, ans=0.125 2023-12-04 07:22:44,065 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=183333.33333333334, ans=0.125 2023-12-04 07:22:47,157 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=183333.33333333334, ans=0.125 2023-12-04 07:23:15,434 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.82 vs. limit=15.0 2023-12-04 07:23:20,167 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.144e+02 1.330e+02 1.393e+02 1.586e+02 2.143e+02, threshold=2.786e+02, percent-clipped=0.0 2023-12-04 07:23:20,951 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.60 vs. limit=22.5 2023-12-04 07:23:24,796 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.37 vs. limit=15.0 2023-12-04 07:23:28,115 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=183600.0, ans=0.0 2023-12-04 07:23:31,383 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=183600.0, ans=0.0 2023-12-04 07:23:33,502 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=183666.66666666666, ans=0.125 2023-12-04 07:23:34,292 INFO [train.py:1087] (1/4) Epoch 31, batch 700, loss[loss=0.1613, simple_loss=0.2496, pruned_loss=0.03654, over 24565.00 frames. ], tot_loss[loss=0.168, simple_loss=0.2584, pruned_loss=0.0388, over 4661679.64 frames. ], batch size: 63, lr: 7.85e-03, grad_scale: 16.0 2023-12-04 07:23:48,629 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=183733.33333333334, ans=0.125 2023-12-04 07:23:50,733 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=183733.33333333334, ans=0.0 2023-12-04 07:24:09,933 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=183866.66666666666, ans=0.0 2023-12-04 07:24:13,653 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.77 vs. limit=22.5 2023-12-04 07:24:21,054 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=183933.33333333334, ans=0.1 2023-12-04 07:24:30,230 INFO [train.py:1087] (1/4) Epoch 31, batch 750, loss[loss=0.1671, simple_loss=0.2593, pruned_loss=0.03744, over 24560.00 frames. ], tot_loss[loss=0.1681, simple_loss=0.2585, pruned_loss=0.03889, over 4703184.50 frames. 
], batch size: 63, lr: 7.84e-03, grad_scale: 16.0 2023-12-04 07:24:34,002 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=184000.0, ans=0.2 2023-12-04 07:24:36,651 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.93 vs. limit=15.0 2023-12-04 07:24:40,392 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=184066.66666666666, ans=0.0 2023-12-04 07:24:44,735 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=184066.66666666666, ans=0.125 2023-12-04 07:25:05,253 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=184200.0, ans=0.0 2023-12-04 07:25:12,337 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.134e+02 1.325e+02 1.411e+02 1.526e+02 2.166e+02, threshold=2.822e+02, percent-clipped=0.0 2023-12-04 07:25:13,650 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=184266.66666666666, ans=0.0 2023-12-04 07:25:25,307 INFO [train.py:1087] (1/4) Epoch 31, batch 800, loss[loss=0.1619, simple_loss=0.2512, pruned_loss=0.03625, over 24753.00 frames. ], tot_loss[loss=0.1678, simple_loss=0.2582, pruned_loss=0.03867, over 4729960.03 frames. ], batch size: 66, lr: 7.84e-03, grad_scale: 32.0 2023-12-04 07:25:37,842 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=184400.0, ans=0.125 2023-12-04 07:25:40,097 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.92 vs. limit=15.0 2023-12-04 07:25:44,127 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=184400.0, ans=15.0 2023-12-04 07:25:46,843 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=184466.66666666666, ans=0.125 2023-12-04 07:25:47,769 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=184466.66666666666, ans=0.125 2023-12-04 07:25:49,116 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.10 vs. limit=22.5 2023-12-04 07:25:52,874 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=184466.66666666666, ans=0.1 2023-12-04 07:26:01,793 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=184533.33333333334, ans=0.07 2023-12-04 07:26:01,902 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=184533.33333333334, ans=0.125 2023-12-04 07:26:04,814 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=184533.33333333334, ans=0.125 2023-12-04 07:26:16,534 INFO [train.py:1087] (1/4) Epoch 31, batch 850, loss[loss=0.1628, simple_loss=0.2558, pruned_loss=0.03492, over 24862.00 frames. 
], tot_loss[loss=0.1685, simple_loss=0.2588, pruned_loss=0.03911, over 4721553.59 frames. ], batch size: 68, lr: 7.83e-03, grad_scale: 32.0 2023-12-04 07:26:23,671 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=184666.66666666666, ans=0.0 2023-12-04 07:26:52,121 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=15.0 2023-12-04 07:26:54,694 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.218e+02 1.395e+02 1.510e+02 1.646e+02 2.236e+02, threshold=3.020e+02, percent-clipped=0.0 2023-12-04 07:27:00,178 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=14.33 vs. limit=15.0 2023-12-04 07:27:16,718 INFO [train.py:1087] (1/4) Epoch 32, batch 0, loss[loss=0.1549, simple_loss=0.2482, pruned_loss=0.03078, over 24600.00 frames. ], tot_loss[loss=0.1549, simple_loss=0.2482, pruned_loss=0.03078, over 24600.00 frames. ], batch size: 68, lr: 7.70e-03, grad_scale: 32.0 2023-12-04 07:27:16,719 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 07:27:28,738 INFO [train.py:1119] (1/4) Epoch 32, validation: loss=0.154, simple_loss=0.2543, pruned_loss=0.02682, over 944034.00 frames. 2023-12-04 07:27:28,739 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 07:27:39,767 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=185033.33333333334, ans=0.125 2023-12-04 07:27:42,216 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.22 vs. limit=15.0 2023-12-04 07:27:57,332 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=185100.0, ans=0.125 2023-12-04 07:27:57,419 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=185100.0, ans=0.125 2023-12-04 07:28:01,554 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=185166.66666666666, ans=0.125 2023-12-04 07:28:01,598 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=185166.66666666666, ans=0.125 2023-12-04 07:28:03,232 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.59 vs. 
limit=10.0 2023-12-04 07:28:03,833 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=185166.66666666666, ans=0.5 2023-12-04 07:28:07,892 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=185166.66666666666, ans=0.0 2023-12-04 07:28:18,750 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=185233.33333333334, ans=0.125 2023-12-04 07:28:18,831 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=185233.33333333334, ans=0.95 2023-12-04 07:28:20,939 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 07:28:22,743 INFO [train.py:1087] (1/4) Epoch 32, batch 50, loss[loss=0.1701, simple_loss=0.2623, pruned_loss=0.03895, over 24312.00 frames. ], tot_loss[loss=0.1672, simple_loss=0.2579, pruned_loss=0.03824, over 1090363.55 frames. ], batch size: 79, lr: 7.69e-03, grad_scale: 32.0 2023-12-04 07:28:24,418 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.62 vs. limit=15.0 2023-12-04 07:28:33,319 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=185366.66666666666, ans=0.125 2023-12-04 07:28:34,982 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.01 vs. limit=22.5 2023-12-04 07:28:53,330 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=185433.33333333334, ans=0.025 2023-12-04 07:28:54,465 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=185433.33333333334, ans=0.0 2023-12-04 07:28:55,739 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.29 vs. limit=12.0 2023-12-04 07:28:58,558 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=185500.0, ans=0.125 2023-12-04 07:29:10,206 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.164e+02 1.353e+02 1.432e+02 1.598e+02 2.705e+02, threshold=2.864e+02, percent-clipped=0.0 2023-12-04 07:29:12,923 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.64 vs. limit=15.0 2023-12-04 07:29:17,657 INFO [train.py:1087] (1/4) Epoch 32, batch 100, loss[loss=0.1722, simple_loss=0.2609, pruned_loss=0.04173, over 24241.00 frames. ], tot_loss[loss=0.1671, simple_loss=0.2578, pruned_loss=0.03823, over 1918602.73 frames. ], batch size: 82, lr: 7.69e-03, grad_scale: 32.0 2023-12-04 07:29:26,654 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.22 vs. limit=15.0 2023-12-04 07:29:29,754 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.03 vs. 
limit=15.0 2023-12-04 07:29:41,662 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=185766.66666666666, ans=0.125 2023-12-04 07:29:42,692 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=185766.66666666666, ans=0.1 2023-12-04 07:29:53,773 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=185833.33333333334, ans=0.125 2023-12-04 07:29:55,159 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.24 vs. limit=10.0 2023-12-04 07:29:59,515 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=185833.33333333334, ans=0.125 2023-12-04 07:30:12,546 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=185966.66666666666, ans=0.125 2023-12-04 07:30:12,603 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=185966.66666666666, ans=0.1 2023-12-04 07:30:13,317 INFO [train.py:1087] (1/4) Epoch 32, batch 150, loss[loss=0.1802, simple_loss=0.2652, pruned_loss=0.04758, over 24172.00 frames. ], tot_loss[loss=0.1666, simple_loss=0.2574, pruned_loss=0.03789, over 2581293.97 frames. ], batch size: 82, lr: 7.68e-03, grad_scale: 32.0 2023-12-04 07:30:32,591 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 07:30:35,086 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.16 vs. limit=12.0 2023-12-04 07:30:43,681 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=186100.0, ans=0.0 2023-12-04 07:31:01,349 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.124e+02 1.314e+02 1.382e+02 1.497e+02 2.003e+02, threshold=2.764e+02, percent-clipped=0.0 2023-12-04 07:31:03,962 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.88 vs. limit=22.5 2023-12-04 07:31:08,725 INFO [train.py:1087] (1/4) Epoch 32, batch 200, loss[loss=0.1747, simple_loss=0.2707, pruned_loss=0.03934, over 21310.00 frames. ], tot_loss[loss=0.1665, simple_loss=0.2573, pruned_loss=0.03788, over 3092435.77 frames. 
], batch size: 127, lr: 7.67e-03, grad_scale: 32.0 2023-12-04 07:31:10,034 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=186300.0, ans=0.1 2023-12-04 07:31:24,968 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=186366.66666666666, ans=0.125 2023-12-04 07:31:25,170 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=186366.66666666666, ans=0.0 2023-12-04 07:31:29,222 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=186366.66666666666, ans=0.125 2023-12-04 07:31:42,415 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=186500.0, ans=0.125 2023-12-04 07:31:48,152 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=186500.0, ans=0.125 2023-12-04 07:31:49,093 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=186500.0, ans=0.125 2023-12-04 07:31:49,142 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=186500.0, ans=0.125 2023-12-04 07:31:52,324 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=186566.66666666666, ans=0.2 2023-12-04 07:32:04,823 INFO [train.py:1087] (1/4) Epoch 32, batch 250, loss[loss=0.2094, simple_loss=0.2906, pruned_loss=0.0641, over 16973.00 frames. ], tot_loss[loss=0.1668, simple_loss=0.2574, pruned_loss=0.03809, over 3462889.64 frames. ], batch size: 177, lr: 7.67e-03, grad_scale: 32.0 2023-12-04 07:32:06,064 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=186633.33333333334, ans=0.0 2023-12-04 07:32:49,564 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=12.0 2023-12-04 07:32:52,302 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=186900.0, ans=0.125 2023-12-04 07:32:54,437 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.104e+02 1.314e+02 1.423e+02 1.624e+02 2.429e+02, threshold=2.846e+02, percent-clipped=0.0 2023-12-04 07:33:01,845 INFO [train.py:1087] (1/4) Epoch 32, batch 300, loss[loss=0.1596, simple_loss=0.25, pruned_loss=0.03458, over 24752.00 frames. ], tot_loss[loss=0.1668, simple_loss=0.2574, pruned_loss=0.0381, over 3760243.49 frames. 
], batch size: 63, lr: 7.66e-03, grad_scale: 32.0 2023-12-04 07:33:02,145 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=186966.66666666666, ans=0.1 2023-12-04 07:33:34,460 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 07:33:37,414 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=187166.66666666666, ans=0.125 2023-12-04 07:33:41,964 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=187166.66666666666, ans=0.125 2023-12-04 07:33:51,407 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=187233.33333333334, ans=0.125 2023-12-04 07:33:57,892 INFO [train.py:1087] (1/4) Epoch 32, batch 350, loss[loss=0.164, simple_loss=0.2557, pruned_loss=0.03617, over 24737.00 frames. ], tot_loss[loss=0.1673, simple_loss=0.2577, pruned_loss=0.03844, over 3979784.83 frames. ], batch size: 70, lr: 7.65e-03, grad_scale: 16.0 2023-12-04 07:34:46,074 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=187566.66666666666, ans=0.02 2023-12-04 07:34:46,780 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.198e+02 1.371e+02 1.458e+02 1.584e+02 2.134e+02, threshold=2.917e+02, percent-clipped=0.0 2023-12-04 07:34:48,147 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=187566.66666666666, ans=0.125 2023-12-04 07:34:51,186 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=187566.66666666666, ans=0.125 2023-12-04 07:34:53,075 INFO [train.py:1087] (1/4) Epoch 32, batch 400, loss[loss=0.2053, simple_loss=0.2874, pruned_loss=0.0616, over 16922.00 frames. ], tot_loss[loss=0.1688, simple_loss=0.2589, pruned_loss=0.03933, over 4108003.48 frames. ], batch size: 179, lr: 7.65e-03, grad_scale: 32.0 2023-12-04 07:35:06,940 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-12-04 07:35:13,878 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.68 vs. limit=15.0 2023-12-04 07:35:24,933 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=187833.33333333334, ans=0.0 2023-12-04 07:35:32,401 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=187833.33333333334, ans=0.125 2023-12-04 07:35:36,760 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=187900.0, ans=0.0 2023-12-04 07:35:48,527 INFO [train.py:1087] (1/4) Epoch 32, batch 450, loss[loss=0.1655, simple_loss=0.259, pruned_loss=0.03599, over 24696.00 frames. ], tot_loss[loss=0.1683, simple_loss=0.2586, pruned_loss=0.039, over 4256141.81 frames. 
], batch size: 74, lr: 7.64e-03, grad_scale: 32.0 2023-12-04 07:35:53,114 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=187966.66666666666, ans=0.125 2023-12-04 07:36:23,729 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=188166.66666666666, ans=0.125 2023-12-04 07:36:29,168 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.61 vs. limit=15.0 2023-12-04 07:36:37,069 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.180e+02 1.326e+02 1.426e+02 1.575e+02 2.216e+02, threshold=2.852e+02, percent-clipped=0.0 2023-12-04 07:36:43,756 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=188300.0, ans=0.125 2023-12-04 07:36:44,498 INFO [train.py:1087] (1/4) Epoch 32, batch 500, loss[loss=0.1602, simple_loss=0.2501, pruned_loss=0.03515, over 24564.00 frames. ], tot_loss[loss=0.1677, simple_loss=0.2579, pruned_loss=0.03875, over 4383834.74 frames. ], batch size: 66, lr: 7.64e-03, grad_scale: 32.0 2023-12-04 07:36:52,228 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=188300.0, ans=0.04949747468305833 2023-12-04 07:37:00,653 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=188366.66666666666, ans=0.125 2023-12-04 07:37:19,763 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=188500.0, ans=0.125 2023-12-04 07:37:38,690 INFO [train.py:1087] (1/4) Epoch 32, batch 550, loss[loss=0.1611, simple_loss=0.2518, pruned_loss=0.03516, over 24762.00 frames. ], tot_loss[loss=0.1681, simple_loss=0.2584, pruned_loss=0.03891, over 4467142.98 frames. ], batch size: 65, lr: 7.63e-03, grad_scale: 32.0 2023-12-04 07:38:02,308 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=188766.66666666666, ans=0.0 2023-12-04 07:38:13,053 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=188833.33333333334, ans=0.125 2023-12-04 07:38:27,414 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.197e+02 1.355e+02 1.536e+02 1.656e+02 2.166e+02, threshold=3.073e+02, percent-clipped=0.0 2023-12-04 07:38:33,698 INFO [train.py:1087] (1/4) Epoch 32, batch 600, loss[loss=0.166, simple_loss=0.2592, pruned_loss=0.03639, over 24723.00 frames. ], tot_loss[loss=0.1677, simple_loss=0.2582, pruned_loss=0.03866, over 4550929.31 frames. ], batch size: 63, lr: 7.62e-03, grad_scale: 32.0 2023-12-04 07:38:37,403 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.91 vs. 
limit=22.5 2023-12-04 07:38:40,386 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=188966.66666666666, ans=0.1 2023-12-04 07:38:51,148 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=189033.33333333334, ans=0.1 2023-12-04 07:38:55,847 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.80 vs. limit=15.0 2023-12-04 07:39:12,316 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=189166.66666666666, ans=0.125 2023-12-04 07:39:13,814 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 07:39:25,209 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=189233.33333333334, ans=0.1 2023-12-04 07:39:29,171 INFO [train.py:1087] (1/4) Epoch 32, batch 650, loss[loss=0.2235, simple_loss=0.3004, pruned_loss=0.07327, over 17216.00 frames. ], tot_loss[loss=0.1677, simple_loss=0.2582, pruned_loss=0.03859, over 4600727.12 frames. ], batch size: 177, lr: 7.62e-03, grad_scale: 32.0 2023-12-04 07:39:30,496 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=189300.0, ans=0.0 2023-12-04 07:39:31,856 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.69 vs. limit=15.0 2023-12-04 07:39:57,190 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=15.0 2023-12-04 07:39:57,863 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=189433.33333333334, ans=0.0 2023-12-04 07:40:08,285 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=189500.0, ans=0.0 2023-12-04 07:40:14,742 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=189566.66666666666, ans=0.125 2023-12-04 07:40:14,931 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.06 vs. limit=22.5 2023-12-04 07:40:16,289 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=189566.66666666666, ans=0.2 2023-12-04 07:40:18,428 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.158e+02 1.335e+02 1.423e+02 1.569e+02 2.985e+02, threshold=2.847e+02, percent-clipped=0.0 2023-12-04 07:40:25,274 INFO [train.py:1087] (1/4) Epoch 32, batch 700, loss[loss=0.1586, simple_loss=0.2506, pruned_loss=0.03329, over 24754.00 frames. ], tot_loss[loss=0.1669, simple_loss=0.2575, pruned_loss=0.03817, over 4642182.85 frames. 
], batch size: 65, lr: 7.61e-03, grad_scale: 32.0 2023-12-04 07:40:43,984 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=189700.0, ans=0.1 2023-12-04 07:40:47,893 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.83 vs. limit=22.5 2023-12-04 07:40:53,457 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=189766.66666666666, ans=0.0 2023-12-04 07:41:06,596 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=189833.33333333334, ans=0.125 2023-12-04 07:41:10,057 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.49 vs. limit=6.0 2023-12-04 07:41:14,834 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=189900.0, ans=0.125 2023-12-04 07:41:16,970 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=189900.0, ans=0.125 2023-12-04 07:41:20,230 INFO [train.py:1087] (1/4) Epoch 32, batch 750, loss[loss=0.1654, simple_loss=0.2574, pruned_loss=0.03671, over 24709.00 frames. ], tot_loss[loss=0.1669, simple_loss=0.2575, pruned_loss=0.03816, over 4681781.67 frames. ], batch size: 74, lr: 7.60e-03, grad_scale: 32.0 2023-12-04 07:41:32,184 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.37 vs. limit=15.0 2023-12-04 07:41:58,538 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=190166.66666666666, ans=0.07 2023-12-04 07:42:05,921 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=190233.33333333334, ans=0.1 2023-12-04 07:42:09,087 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.040e+02 1.317e+02 1.435e+02 1.596e+02 2.082e+02, threshold=2.871e+02, percent-clipped=0.0 2023-12-04 07:42:15,438 INFO [train.py:1087] (1/4) Epoch 32, batch 800, loss[loss=0.1759, simple_loss=0.2677, pruned_loss=0.04207, over 24320.00 frames. ], tot_loss[loss=0.1668, simple_loss=0.2573, pruned_loss=0.03811, over 4719310.81 frames. ], batch size: 79, lr: 7.60e-03, grad_scale: 32.0 2023-12-04 07:42:17,698 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=190300.0, ans=0.1 2023-12-04 07:42:19,955 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=190300.0, ans=0.2 2023-12-04 07:42:29,834 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=190366.66666666666, ans=0.1 2023-12-04 07:42:29,867 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=190366.66666666666, ans=0.2 2023-12-04 07:42:31,115 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.39 vs. 
limit=22.5 2023-12-04 07:42:31,793 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=190366.66666666666, ans=0.1 2023-12-04 07:42:37,908 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.55 vs. limit=22.5 2023-12-04 07:42:48,490 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=190500.0, ans=0.125 2023-12-04 07:42:52,937 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.24 vs. limit=15.0 2023-12-04 07:42:54,576 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=190500.0, ans=0.125 2023-12-04 07:42:55,499 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=190566.66666666666, ans=0.0 2023-12-04 07:42:58,634 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-12-04 07:43:06,178 INFO [train.py:1087] (1/4) Epoch 32, batch 850, loss[loss=0.1648, simple_loss=0.2571, pruned_loss=0.03631, over 24856.00 frames. ], tot_loss[loss=0.1668, simple_loss=0.2573, pruned_loss=0.03814, over 4737843.34 frames. ], batch size: 68, lr: 7.59e-03, grad_scale: 32.0 2023-12-04 07:43:09,817 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=12.0 2023-12-04 07:43:13,765 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=190633.33333333334, ans=0.125 2023-12-04 07:43:28,758 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=190766.66666666666, ans=0.2 2023-12-04 07:43:30,750 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=190766.66666666666, ans=0.0 2023-12-04 07:43:33,618 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 07:43:47,993 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.22 vs. limit=22.5 2023-12-04 07:43:57,614 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=190933.33333333334, ans=0.1 2023-12-04 07:44:04,808 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.162e+02 1.355e+02 1.461e+02 1.606e+02 3.064e+02, threshold=2.922e+02, percent-clipped=2.0 2023-12-04 07:44:04,836 INFO [train.py:1087] (1/4) Epoch 33, batch 0, loss[loss=0.1697, simple_loss=0.2602, pruned_loss=0.03958, over 24076.00 frames. ], tot_loss[loss=0.1697, simple_loss=0.2602, pruned_loss=0.03958, over 24076.00 frames. ], batch size: 58, lr: 7.47e-03, grad_scale: 32.0 2023-12-04 07:44:04,836 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 07:44:16,802 INFO [train.py:1119] (1/4) Epoch 33, validation: loss=0.154, simple_loss=0.2541, pruned_loss=0.02696, over 944034.00 frames. 
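The tot_loss fields in the entries above read as frame-weighted running averages over the current epoch: at Epoch 33, batch 0 the tot_loss values coincide with the batch's own losses over the same 24076 frames, and the frame total then grows batch by batch. Below is a minimal sketch of such an accumulator, assuming that interpretation; the RunningLossTracker class and its method names are illustrative only and are not the train.py implementation.

class RunningLossTracker:
    """Accumulates frame-weighted sums so that tot_loss = sum(loss_i * frames_i) / sum(frames_i)."""

    def __init__(self) -> None:
        self.frames = 0.0
        self.sums = {}  # e.g. {"loss": ..., "simple_loss": ..., "pruned_loss": ...}

    def update(self, losses: dict, num_frames: float) -> None:
        # Weight each reported loss by the number of frames in the batch.
        self.frames += num_frames
        for name, value in losses.items():
            self.sums[name] = self.sums.get(name, 0.0) + value * num_frames

    def averages(self) -> dict:
        # Frame-weighted averages, as printed in the tot_loss[...] fields.
        return {name: s / self.frames for name, s in self.sums.items()}


# Mirrors the Epoch 33, batch 0 entry: a single batch with loss=0.1697 over 24076 frames.
tracker = RunningLossTracker()
tracker.update({"loss": 0.1697, "simple_loss": 0.2602, "pruned_loss": 0.03958}, 24076.0)
print(tracker.averages())  # equals the batch losses, since it is the first batch of the epoch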
2023-12-04 07:44:16,802 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 07:44:42,548 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=191066.66666666666, ans=0.2 2023-12-04 07:44:49,682 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=191133.33333333334, ans=0.125 2023-12-04 07:44:50,238 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.05 vs. limit=22.5 2023-12-04 07:44:54,994 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=191133.33333333334, ans=0.2 2023-12-04 07:45:00,360 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=191200.0, ans=10.0 2023-12-04 07:45:12,127 INFO [train.py:1087] (1/4) Epoch 33, batch 50, loss[loss=0.1802, simple_loss=0.2669, pruned_loss=0.04673, over 24554.00 frames. ], tot_loss[loss=0.168, simple_loss=0.2584, pruned_loss=0.0388, over 1079911.02 frames. ], batch size: 63, lr: 7.46e-03, grad_scale: 32.0 2023-12-04 07:45:12,329 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=191266.66666666666, ans=0.1 2023-12-04 07:45:17,715 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=191266.66666666666, ans=0.125 2023-12-04 07:45:27,318 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=191333.33333333334, ans=0.0 2023-12-04 07:45:27,721 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-12-04 07:45:31,692 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=191333.33333333334, ans=0.125 2023-12-04 07:45:35,432 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=191400.0, ans=0.125 2023-12-04 07:45:35,711 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.18 vs. limit=10.0 2023-12-04 07:45:39,935 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=191400.0, ans=0.2 2023-12-04 07:46:07,477 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.212e+02 1.378e+02 1.539e+02 1.731e+02 2.265e+02, threshold=3.079e+02, percent-clipped=0.0 2023-12-04 07:46:07,503 INFO [train.py:1087] (1/4) Epoch 33, batch 100, loss[loss=0.1696, simple_loss=0.2662, pruned_loss=0.0365, over 21819.00 frames. ], tot_loss[loss=0.1686, simple_loss=0.2591, pruned_loss=0.03901, over 1902456.45 frames. 
], batch size: 127, lr: 7.46e-03, grad_scale: 32.0 2023-12-04 07:46:14,138 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=191600.0, ans=0.2 2023-12-04 07:46:15,137 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=191600.0, ans=0.125 2023-12-04 07:46:16,612 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=191600.0, ans=0.125 2023-12-04 07:46:19,614 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=191666.66666666666, ans=0.125 2023-12-04 07:46:20,646 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=191666.66666666666, ans=0.125 2023-12-04 07:46:20,727 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=191666.66666666666, ans=0.125 2023-12-04 07:46:26,328 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.80 vs. limit=15.0 2023-12-04 07:46:31,291 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=191733.33333333334, ans=0.07 2023-12-04 07:46:43,963 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=191800.0, ans=0.0 2023-12-04 07:47:01,780 INFO [train.py:1087] (1/4) Epoch 33, batch 150, loss[loss=0.1713, simple_loss=0.2637, pruned_loss=0.03947, over 24734.00 frames. ], tot_loss[loss=0.1668, simple_loss=0.2576, pruned_loss=0.03801, over 2550424.71 frames. ], batch size: 63, lr: 7.45e-03, grad_scale: 32.0 2023-12-04 07:47:06,841 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=191933.33333333334, ans=0.0 2023-12-04 07:47:11,481 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=191933.33333333334, ans=0.125 2023-12-04 07:47:12,990 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=8.80 vs. limit=15.0 2023-12-04 07:47:29,815 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.67 vs. limit=15.0 2023-12-04 07:47:31,484 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=192066.66666666666, ans=0.5 2023-12-04 07:47:40,000 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.87 vs. limit=22.5 2023-12-04 07:47:43,127 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=192133.33333333334, ans=0.125 2023-12-04 07:47:44,746 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.85 vs. 
limit=12.0 2023-12-04 07:47:52,641 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=192200.0, ans=0.05 2023-12-04 07:47:58,020 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.177e+02 1.283e+02 1.371e+02 1.480e+02 2.275e+02, threshold=2.743e+02, percent-clipped=0.0 2023-12-04 07:47:58,047 INFO [train.py:1087] (1/4) Epoch 33, batch 200, loss[loss=0.1745, simple_loss=0.2607, pruned_loss=0.0442, over 23789.00 frames. ], tot_loss[loss=0.1673, simple_loss=0.2576, pruned_loss=0.03846, over 3031400.51 frames. ], batch size: 57, lr: 7.44e-03, grad_scale: 32.0 2023-12-04 07:48:01,472 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=192266.66666666666, ans=0.1 2023-12-04 07:48:30,354 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=192466.66666666666, ans=0.125 2023-12-04 07:48:34,620 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=192466.66666666666, ans=0.0 2023-12-04 07:48:46,328 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=192533.33333333334, ans=0.125 2023-12-04 07:48:53,314 INFO [train.py:1087] (1/4) Epoch 33, batch 250, loss[loss=0.1615, simple_loss=0.252, pruned_loss=0.03556, over 24789.00 frames. ], tot_loss[loss=0.1663, simple_loss=0.2571, pruned_loss=0.0378, over 3436391.59 frames. ], batch size: 62, lr: 7.44e-03, grad_scale: 32.0 2023-12-04 07:49:05,726 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=192666.66666666666, ans=0.125 2023-12-04 07:49:05,748 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=192666.66666666666, ans=0.125 2023-12-04 07:49:13,790 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=192666.66666666666, ans=0.1 2023-12-04 07:49:28,939 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=192800.0, ans=0.125 2023-12-04 07:49:40,819 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.46 vs. limit=15.0 2023-12-04 07:49:42,722 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.41 vs. limit=12.0 2023-12-04 07:49:44,878 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=192866.66666666666, ans=0.0 2023-12-04 07:49:49,260 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.155e+02 1.292e+02 1.428e+02 1.566e+02 2.178e+02, threshold=2.856e+02, percent-clipped=0.0 2023-12-04 07:49:49,286 INFO [train.py:1087] (1/4) Epoch 33, batch 300, loss[loss=0.1643, simple_loss=0.2538, pruned_loss=0.03737, over 24760.00 frames. ], tot_loss[loss=0.1661, simple_loss=0.2569, pruned_loss=0.0377, over 3742094.52 frames. ], batch size: 64, lr: 7.43e-03, grad_scale: 32.0 2023-12-04 07:50:00,381 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.36 vs. 
limit=15.0 2023-12-04 07:50:06,320 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=193000.0, ans=0.1 2023-12-04 07:50:13,578 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=193066.66666666666, ans=0.125 2023-12-04 07:50:44,310 INFO [train.py:1087] (1/4) Epoch 33, batch 350, loss[loss=0.1623, simple_loss=0.2533, pruned_loss=0.03564, over 24801.00 frames. ], tot_loss[loss=0.1667, simple_loss=0.2571, pruned_loss=0.03816, over 3978856.00 frames. ], batch size: 62, lr: 7.43e-03, grad_scale: 32.0 2023-12-04 07:50:51,703 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=193266.66666666666, ans=0.1 2023-12-04 07:51:07,789 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=193400.0, ans=0.125 2023-12-04 07:51:29,520 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=193533.33333333334, ans=0.125 2023-12-04 07:51:39,779 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.124e+02 1.316e+02 1.411e+02 1.589e+02 2.112e+02, threshold=2.823e+02, percent-clipped=0.0 2023-12-04 07:51:39,805 INFO [train.py:1087] (1/4) Epoch 33, batch 400, loss[loss=0.1739, simple_loss=0.2586, pruned_loss=0.04463, over 24191.00 frames. ], tot_loss[loss=0.1669, simple_loss=0.2573, pruned_loss=0.03819, over 4150747.93 frames. ], batch size: 82, lr: 7.42e-03, grad_scale: 32.0 2023-12-04 07:51:40,097 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=193600.0, ans=0.0 2023-12-04 07:51:49,440 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=193600.0, ans=0.1 2023-12-04 07:51:58,443 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=193666.66666666666, ans=0.0 2023-12-04 07:51:59,532 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=193666.66666666666, ans=0.2 2023-12-04 07:52:14,355 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=6.09 vs. limit=12.0 2023-12-04 07:52:16,105 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=193800.0, ans=0.035 2023-12-04 07:52:16,225 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=193800.0, ans=0.1 2023-12-04 07:52:16,534 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.60 vs. limit=15.0 2023-12-04 07:52:30,477 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2023-12-04 07:52:34,510 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 07:52:35,361 INFO [train.py:1087] (1/4) Epoch 33, batch 450, loss[loss=0.1793, simple_loss=0.2705, pruned_loss=0.04408, over 23504.00 frames. 
], tot_loss[loss=0.1668, simple_loss=0.2575, pruned_loss=0.03806, over 4312302.13 frames. ], batch size: 94, lr: 7.41e-03, grad_scale: 32.0 2023-12-04 07:52:40,224 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.62 vs. limit=10.0 2023-12-04 07:53:10,214 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=194133.33333333334, ans=0.125 2023-12-04 07:53:14,570 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=194133.33333333334, ans=0.125 2023-12-04 07:53:18,704 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=194200.0, ans=0.125 2023-12-04 07:53:23,712 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=194200.0, ans=0.2 2023-12-04 07:53:28,897 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.70 vs. limit=10.0 2023-12-04 07:53:30,272 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.218e+02 1.385e+02 1.524e+02 1.698e+02 2.358e+02, threshold=3.049e+02, percent-clipped=0.0 2023-12-04 07:53:30,299 INFO [train.py:1087] (1/4) Epoch 33, batch 500, loss[loss=0.1639, simple_loss=0.2575, pruned_loss=0.03517, over 24706.00 frames. ], tot_loss[loss=0.1667, simple_loss=0.2573, pruned_loss=0.03808, over 4427018.73 frames. ], batch size: 69, lr: 7.41e-03, grad_scale: 32.0 2023-12-04 07:53:34,088 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.85 vs. limit=15.0 2023-12-04 07:53:54,965 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-12-04 07:54:04,425 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=194466.66666666666, ans=0.2 2023-12-04 07:54:20,318 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=194533.33333333334, ans=0.125 2023-12-04 07:54:24,730 INFO [train.py:1087] (1/4) Epoch 33, batch 550, loss[loss=0.1633, simple_loss=0.2584, pruned_loss=0.03411, over 24551.00 frames. ], tot_loss[loss=0.1664, simple_loss=0.2571, pruned_loss=0.03788, over 4509308.89 frames. ], batch size: 66, lr: 7.40e-03, grad_scale: 32.0 2023-12-04 07:54:37,915 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=194666.66666666666, ans=0.125 2023-12-04 07:54:53,895 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=194733.33333333334, ans=0.2 2023-12-04 07:55:18,107 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.12 vs. 
limit=6.0 2023-12-04 07:55:19,838 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=194933.33333333334, ans=0.0 2023-12-04 07:55:20,564 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.073e+02 1.284e+02 1.392e+02 1.505e+02 1.929e+02, threshold=2.784e+02, percent-clipped=0.0 2023-12-04 07:55:20,590 INFO [train.py:1087] (1/4) Epoch 33, batch 600, loss[loss=0.1592, simple_loss=0.2494, pruned_loss=0.03451, over 24707.00 frames. ], tot_loss[loss=0.1666, simple_loss=0.2573, pruned_loss=0.03799, over 4563058.07 frames. ], batch size: 69, lr: 7.40e-03, grad_scale: 32.0 2023-12-04 07:55:41,370 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=195000.0, ans=0.1 2023-12-04 07:56:01,768 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.15 vs. limit=22.5 2023-12-04 07:56:07,887 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=195200.0, ans=0.125 2023-12-04 07:56:14,608 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=195200.0, ans=0.2 2023-12-04 07:56:16,546 INFO [train.py:1087] (1/4) Epoch 33, batch 650, loss[loss=0.1663, simple_loss=0.257, pruned_loss=0.03778, over 23754.00 frames. ], tot_loss[loss=0.166, simple_loss=0.2568, pruned_loss=0.03759, over 4626283.71 frames. ], batch size: 57, lr: 7.39e-03, grad_scale: 32.0 2023-12-04 07:56:23,069 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=195266.66666666666, ans=0.0 2023-12-04 07:56:42,597 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=195400.0, ans=0.0 2023-12-04 07:56:56,744 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.76 vs. limit=15.0 2023-12-04 07:56:58,551 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=195466.66666666666, ans=0.0 2023-12-04 07:57:09,085 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=195533.33333333334, ans=0.05 2023-12-04 07:57:10,236 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=195533.33333333334, ans=0.125 2023-12-04 07:57:12,034 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.091e+02 1.333e+02 1.459e+02 1.652e+02 2.138e+02, threshold=2.917e+02, percent-clipped=0.0 2023-12-04 07:57:12,061 INFO [train.py:1087] (1/4) Epoch 33, batch 700, loss[loss=0.1734, simple_loss=0.2636, pruned_loss=0.0416, over 23746.00 frames. ], tot_loss[loss=0.1667, simple_loss=0.2574, pruned_loss=0.03801, over 4633878.70 frames. 
], batch size: 95, lr: 7.38e-03, grad_scale: 32.0 2023-12-04 07:57:27,667 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=195666.66666666666, ans=0.1 2023-12-04 07:57:45,682 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=195800.0, ans=0.04949747468305833 2023-12-04 07:57:45,683 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=195800.0, ans=0.0 2023-12-04 07:58:06,542 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=195933.33333333334, ans=0.125 2023-12-04 07:58:06,595 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=195933.33333333334, ans=0.1 2023-12-04 07:58:07,767 INFO [train.py:1087] (1/4) Epoch 33, batch 750, loss[loss=0.1579, simple_loss=0.2524, pruned_loss=0.03177, over 24470.00 frames. ], tot_loss[loss=0.1661, simple_loss=0.2568, pruned_loss=0.0377, over 4676594.27 frames. ], batch size: 77, lr: 7.38e-03, grad_scale: 32.0 2023-12-04 07:58:15,161 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=195933.33333333334, ans=15.0 2023-12-04 07:58:21,427 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=196000.0, ans=0.125 2023-12-04 07:58:27,050 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=196000.0, ans=0.125 2023-12-04 07:58:27,981 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=196000.0, ans=0.035 2023-12-04 07:58:28,566 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.60 vs. limit=12.0 2023-12-04 07:58:47,157 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=196133.33333333334, ans=0.125 2023-12-04 07:58:59,500 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.07 vs. limit=22.5 2023-12-04 07:59:03,166 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.171e+02 1.329e+02 1.417e+02 1.532e+02 2.129e+02, threshold=2.834e+02, percent-clipped=0.0 2023-12-04 07:59:03,195 INFO [train.py:1087] (1/4) Epoch 33, batch 800, loss[loss=0.1593, simple_loss=0.2507, pruned_loss=0.03396, over 24755.00 frames. ], tot_loss[loss=0.1662, simple_loss=0.257, pruned_loss=0.03775, over 4699390.87 frames. ], batch size: 66, lr: 7.37e-03, grad_scale: 32.0 2023-12-04 07:59:25,985 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.89 vs. limit=15.0 2023-12-04 07:59:29,072 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.40 vs. 
limit=10.0 2023-12-04 07:59:30,693 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=196400.0, ans=0.2 2023-12-04 07:59:39,783 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=196466.66666666666, ans=0.125 2023-12-04 07:59:47,747 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=196533.33333333334, ans=0.125 2023-12-04 07:59:54,535 INFO [train.py:1087] (1/4) Epoch 33, batch 850, loss[loss=0.1589, simple_loss=0.2511, pruned_loss=0.03334, over 24758.00 frames. ], tot_loss[loss=0.1659, simple_loss=0.2567, pruned_loss=0.03758, over 4740342.52 frames. ], batch size: 70, lr: 7.37e-03, grad_scale: 32.0 2023-12-04 08:00:26,449 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=196800.0, ans=15.0 2023-12-04 08:00:52,437 INFO [train.py:1087] (1/4) Epoch 34, batch 0, loss[loss=0.1504, simple_loss=0.2447, pruned_loss=0.028, over 24211.00 frames. ], tot_loss[loss=0.1504, simple_loss=0.2447, pruned_loss=0.028, over 24211.00 frames. ], batch size: 82, lr: 7.25e-03, grad_scale: 32.0 2023-12-04 08:00:52,438 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 08:01:04,659 INFO [train.py:1119] (1/4) Epoch 34, validation: loss=0.1541, simple_loss=0.2542, pruned_loss=0.02698, over 944034.00 frames. 2023-12-04 08:01:04,660 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 08:01:09,873 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.159e+02 1.334e+02 1.484e+02 1.658e+02 2.498e+02, threshold=2.968e+02, percent-clipped=0.0 2023-12-04 08:01:21,132 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=196966.66666666666, ans=0.125 2023-12-04 08:01:23,614 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.02 vs. limit=22.5 2023-12-04 08:01:24,632 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=196966.66666666666, ans=0.0 2023-12-04 08:01:54,842 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.16 vs. limit=12.0 2023-12-04 08:01:59,553 INFO [train.py:1087] (1/4) Epoch 34, batch 50, loss[loss=0.1581, simple_loss=0.2474, pruned_loss=0.0344, over 24760.00 frames. ], tot_loss[loss=0.1667, simple_loss=0.258, pruned_loss=0.03769, over 1082250.54 frames. ], batch size: 66, lr: 7.24e-03, grad_scale: 32.0 2023-12-04 08:02:12,084 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.46 vs. limit=15.0 2023-12-04 08:02:27,113 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=197366.66666666666, ans=0.125 2023-12-04 08:02:29,306 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=197366.66666666666, ans=0.1 2023-12-04 08:02:31,747 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.19 vs. 
limit=12.0 2023-12-04 08:02:39,635 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.55 vs. limit=22.5 2023-12-04 08:02:55,517 INFO [train.py:1087] (1/4) Epoch 34, batch 100, loss[loss=0.1603, simple_loss=0.2552, pruned_loss=0.03267, over 24580.00 frames. ], tot_loss[loss=0.1674, simple_loss=0.2581, pruned_loss=0.03832, over 1893457.30 frames. ], batch size: 65, lr: 7.24e-03, grad_scale: 32.0 2023-12-04 08:03:00,825 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.108e+02 1.348e+02 1.470e+02 1.630e+02 2.768e+02, threshold=2.940e+02, percent-clipped=0.0 2023-12-04 08:03:05,389 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=197633.33333333334, ans=0.125 2023-12-04 08:03:13,262 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:03:14,720 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.93 vs. limit=15.0 2023-12-04 08:03:43,023 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=197833.33333333334, ans=0.125 2023-12-04 08:03:50,199 INFO [train.py:1087] (1/4) Epoch 34, batch 150, loss[loss=0.1481, simple_loss=0.2379, pruned_loss=0.02909, over 24564.00 frames. ], tot_loss[loss=0.166, simple_loss=0.2568, pruned_loss=0.03761, over 2536553.91 frames. ], batch size: 64, lr: 7.23e-03, grad_scale: 32.0 2023-12-04 08:04:03,261 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=197966.66666666666, ans=0.125 2023-12-04 08:04:10,119 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.57 vs. limit=15.0 2023-12-04 08:04:16,416 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:04:29,745 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:04:36,175 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=198166.66666666666, ans=0.07 2023-12-04 08:04:38,339 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=198166.66666666666, ans=0.125 2023-12-04 08:04:45,518 INFO [train.py:1087] (1/4) Epoch 34, batch 200, loss[loss=0.1668, simple_loss=0.2572, pruned_loss=0.03821, over 24799.00 frames. ], tot_loss[loss=0.1657, simple_loss=0.2567, pruned_loss=0.0374, over 3056399.97 frames. ], batch size: 72, lr: 7.23e-03, grad_scale: 32.0 2023-12-04 08:04:51,204 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.099e+02 1.341e+02 1.461e+02 1.618e+02 2.204e+02, threshold=2.921e+02, percent-clipped=0.0 2023-12-04 08:04:52,528 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=198233.33333333334, ans=0.1 2023-12-04 08:04:56,148 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.85 vs. 
limit=6.0 2023-12-04 08:04:58,202 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=198300.0, ans=0.0 2023-12-04 08:05:08,130 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=198366.66666666666, ans=0.2 2023-12-04 08:05:14,360 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=198366.66666666666, ans=0.0 2023-12-04 08:05:22,217 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=198433.33333333334, ans=0.125 2023-12-04 08:05:39,216 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=198500.0, ans=15.0 2023-12-04 08:05:39,984 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=198566.66666666666, ans=0.125 2023-12-04 08:05:40,864 INFO [train.py:1087] (1/4) Epoch 34, batch 250, loss[loss=0.1535, simple_loss=0.2466, pruned_loss=0.03021, over 24594.00 frames. ], tot_loss[loss=0.166, simple_loss=0.2569, pruned_loss=0.03755, over 3435538.76 frames. ], batch size: 68, lr: 7.22e-03, grad_scale: 32.0 2023-12-04 08:05:49,625 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=198566.66666666666, ans=0.125 2023-12-04 08:05:55,384 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=198633.33333333334, ans=0.125 2023-12-04 08:05:55,498 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=198633.33333333334, ans=0.1 2023-12-04 08:06:20,291 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=198766.66666666666, ans=0.2 2023-12-04 08:06:27,969 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=198833.33333333334, ans=0.125 2023-12-04 08:06:30,576 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=198833.33333333334, ans=0.125 2023-12-04 08:06:37,412 INFO [train.py:1087] (1/4) Epoch 34, batch 300, loss[loss=0.162, simple_loss=0.2566, pruned_loss=0.0337, over 24756.00 frames. ], tot_loss[loss=0.1658, simple_loss=0.2567, pruned_loss=0.03748, over 3737493.38 frames. ], batch size: 66, lr: 7.22e-03, grad_scale: 32.0 2023-12-04 08:06:42,638 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.180e+02 1.316e+02 1.407e+02 1.530e+02 2.162e+02, threshold=2.814e+02, percent-clipped=0.0 2023-12-04 08:07:14,764 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=199100.0, ans=0.125 2023-12-04 08:07:33,256 INFO [train.py:1087] (1/4) Epoch 34, batch 350, loss[loss=0.1603, simple_loss=0.2493, pruned_loss=0.0357, over 24783.00 frames. ], tot_loss[loss=0.1656, simple_loss=0.2565, pruned_loss=0.03739, over 3980288.13 frames. 
], batch size: 71, lr: 7.21e-03, grad_scale: 32.0 2023-12-04 08:07:41,883 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:07:58,888 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=199366.66666666666, ans=0.0 2023-12-04 08:07:59,331 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.11 vs. limit=22.5 2023-12-04 08:08:04,630 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=199366.66666666666, ans=0.125 2023-12-04 08:08:29,130 INFO [train.py:1087] (1/4) Epoch 34, batch 400, loss[loss=0.1663, simple_loss=0.2594, pruned_loss=0.03665, over 24768.00 frames. ], tot_loss[loss=0.1655, simple_loss=0.2564, pruned_loss=0.03734, over 4154355.81 frames. ], batch size: 64, lr: 7.20e-03, grad_scale: 32.0 2023-12-04 08:08:29,573 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.32 vs. limit=22.5 2023-12-04 08:08:32,743 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=199566.66666666666, ans=0.125 2023-12-04 08:08:34,928 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.097e+02 1.322e+02 1.440e+02 1.552e+02 2.302e+02, threshold=2.880e+02, percent-clipped=0.0 2023-12-04 08:08:36,713 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=199566.66666666666, ans=0.125 2023-12-04 08:08:43,585 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=199633.33333333334, ans=0.0 2023-12-04 08:08:53,003 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=199700.0, ans=0.125 2023-12-04 08:09:08,575 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=199766.66666666666, ans=0.1 2023-12-04 08:09:22,029 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=199833.33333333334, ans=0.0 2023-12-04 08:09:23,083 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=199833.33333333334, ans=0.2 2023-12-04 08:09:25,007 INFO [train.py:1087] (1/4) Epoch 34, batch 450, loss[loss=0.2028, simple_loss=0.286, pruned_loss=0.0598, over 17174.00 frames. ], tot_loss[loss=0.1659, simple_loss=0.2567, pruned_loss=0.0375, over 4282365.33 frames. ], batch size: 177, lr: 7.20e-03, grad_scale: 32.0 2023-12-04 08:09:26,555 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. 
limit=6.0 2023-12-04 08:09:42,097 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=199966.66666666666, ans=0.125 2023-12-04 08:09:43,231 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=199966.66666666666, ans=0.125 2023-12-04 08:10:01,016 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-12-04 08:10:06,846 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=200100.0, ans=0.125 2023-12-04 08:10:20,533 INFO [train.py:1087] (1/4) Epoch 34, batch 500, loss[loss=0.1717, simple_loss=0.2598, pruned_loss=0.04182, over 23945.00 frames. ], tot_loss[loss=0.1661, simple_loss=0.2569, pruned_loss=0.03764, over 4404355.25 frames. ], batch size: 87, lr: 7.19e-03, grad_scale: 32.0 2023-12-04 08:10:25,726 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.157e+02 1.353e+02 1.445e+02 1.601e+02 2.190e+02, threshold=2.891e+02, percent-clipped=0.0 2023-12-04 08:10:29,125 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=200233.33333333334, ans=0.0 2023-12-04 08:10:45,015 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.69 vs. limit=15.0 2023-12-04 08:11:15,040 INFO [train.py:1087] (1/4) Epoch 34, batch 550, loss[loss=0.1739, simple_loss=0.2651, pruned_loss=0.04138, over 24175.00 frames. ], tot_loss[loss=0.1665, simple_loss=0.2572, pruned_loss=0.03789, over 4474740.30 frames. ], batch size: 82, lr: 7.19e-03, grad_scale: 32.0 2023-12-04 08:11:29,930 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=200633.33333333334, ans=0.0 2023-12-04 08:11:48,923 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=200766.66666666666, ans=0.09899494936611666 2023-12-04 08:11:51,681 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=200766.66666666666, ans=0.125 2023-12-04 08:11:58,151 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=200766.66666666666, ans=0.125 2023-12-04 08:12:03,320 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=200833.33333333334, ans=0.0 2023-12-04 08:12:10,644 INFO [train.py:1087] (1/4) Epoch 34, batch 600, loss[loss=0.1615, simple_loss=0.2545, pruned_loss=0.03427, over 24545.00 frames. ], tot_loss[loss=0.1659, simple_loss=0.2568, pruned_loss=0.03756, over 4552151.58 frames. ], batch size: 62, lr: 7.18e-03, grad_scale: 32.0 2023-12-04 08:12:16,340 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.191e+02 1.333e+02 1.438e+02 1.560e+02 2.058e+02, threshold=2.876e+02, percent-clipped=0.0 2023-12-04 08:12:18,231 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.27 vs. 
limit=6.0 2023-12-04 08:12:26,109 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=200966.66666666666, ans=0.0 2023-12-04 08:12:26,132 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=200966.66666666666, ans=0.0 2023-12-04 08:12:44,246 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=201100.0, ans=0.125 2023-12-04 08:13:03,661 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=201166.66666666666, ans=0.02 2023-12-04 08:13:06,580 INFO [train.py:1087] (1/4) Epoch 34, batch 650, loss[loss=0.1558, simple_loss=0.2506, pruned_loss=0.03047, over 24789.00 frames. ], tot_loss[loss=0.1659, simple_loss=0.2567, pruned_loss=0.03752, over 4618218.98 frames. ], batch size: 70, lr: 7.18e-03, grad_scale: 32.0 2023-12-04 08:13:48,727 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=201433.33333333334, ans=0.95 2023-12-04 08:13:48,840 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=201433.33333333334, ans=0.125 2023-12-04 08:14:02,196 INFO [train.py:1087] (1/4) Epoch 34, batch 700, loss[loss=0.1689, simple_loss=0.2608, pruned_loss=0.03845, over 23419.00 frames. ], tot_loss[loss=0.1659, simple_loss=0.2567, pruned_loss=0.03761, over 4646669.52 frames. ], batch size: 94, lr: 7.17e-03, grad_scale: 32.0 2023-12-04 08:14:07,805 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.314e+02 1.390e+02 1.555e+02 2.195e+02, threshold=2.780e+02, percent-clipped=0.0 2023-12-04 08:14:18,740 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=201633.33333333334, ans=0.05 2023-12-04 08:14:19,789 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=201633.33333333334, ans=0.0 2023-12-04 08:14:26,933 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=201700.0, ans=0.125 2023-12-04 08:14:36,646 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=201766.66666666666, ans=0.025 2023-12-04 08:14:42,446 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=201766.66666666666, ans=0.035 2023-12-04 08:14:57,573 INFO [train.py:1087] (1/4) Epoch 34, batch 750, loss[loss=0.1596, simple_loss=0.2503, pruned_loss=0.03449, over 24294.00 frames. ], tot_loss[loss=0.1657, simple_loss=0.2565, pruned_loss=0.03746, over 4684844.43 frames. 
], batch size: 79, lr: 7.17e-03, grad_scale: 32.0 2023-12-04 08:15:00,697 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=201900.0, ans=0.125 2023-12-04 08:15:41,682 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=202166.66666666666, ans=0.125 2023-12-04 08:15:44,308 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=202166.66666666666, ans=0.0 2023-12-04 08:15:53,547 INFO [train.py:1087] (1/4) Epoch 34, batch 800, loss[loss=0.1545, simple_loss=0.2484, pruned_loss=0.0303, over 24795.00 frames. ], tot_loss[loss=0.1652, simple_loss=0.256, pruned_loss=0.03714, over 4711889.19 frames. ], batch size: 73, lr: 7.16e-03, grad_scale: 32.0 2023-12-04 08:15:59,241 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.175e+02 1.341e+02 1.452e+02 1.557e+02 2.379e+02, threshold=2.904e+02, percent-clipped=0.0 2023-12-04 08:16:05,120 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=202300.0, ans=0.1 2023-12-04 08:16:19,124 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=202366.66666666666, ans=0.125 2023-12-04 08:16:19,201 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=202366.66666666666, ans=0.125 2023-12-04 08:16:24,570 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.82 vs. limit=22.5 2023-12-04 08:16:32,460 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=12.0 2023-12-04 08:16:45,294 INFO [train.py:1087] (1/4) Epoch 34, batch 850, loss[loss=0.1566, simple_loss=0.2511, pruned_loss=0.03104, over 24761.00 frames. ], tot_loss[loss=0.1648, simple_loss=0.2558, pruned_loss=0.03695, over 4742590.75 frames. ], batch size: 70, lr: 7.15e-03, grad_scale: 32.0 2023-12-04 08:16:47,813 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=202566.66666666666, ans=0.0 2023-12-04 08:16:52,307 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.60 vs. limit=10.0 2023-12-04 08:17:04,159 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=202633.33333333334, ans=0.125 2023-12-04 08:17:20,458 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=202766.66666666666, ans=0.2 2023-12-04 08:17:45,356 INFO [train.py:1087] (1/4) Epoch 35, batch 0, loss[loss=0.1543, simple_loss=0.2466, pruned_loss=0.03096, over 24801.00 frames. ], tot_loss[loss=0.1543, simple_loss=0.2466, pruned_loss=0.03096, over 24801.00 frames. ], batch size: 72, lr: 7.04e-03, grad_scale: 32.0 2023-12-04 08:17:45,356 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 08:17:57,371 INFO [train.py:1119] (1/4) Epoch 35, validation: loss=0.1534, simple_loss=0.2532, pruned_loss=0.02686, over 944034.00 frames. 
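[editor's note, not part of the original log] Two recurring kinds of entries above may be easier to read with a small illustrative sketch. First, the optim.py lines of the form "Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=...": in every such entry the reported threshold equals the clipping scale times the median gradient norm (e.g. 2.0 * 1.440e+02 = 2.880e+02). The Python below is a minimal, hypothetical reconstruction of how such a report could be produced; it is not the actual icefall optim.py code, and the function name and the buffer of norms are assumptions.

import torch

def summarize_grad_norms(grad_norms: torch.Tensor, clipping_scale: float = 2.0) -> float:
    """Report quartiles of recent gradient norms and the implied clipping threshold.

    grad_norms: 1-D tensor of total gradient norms collected over recent batches
    (a hypothetical history buffer; the real trainer keeps its own bookkeeping).
    """
    quartiles = [torch.quantile(grad_norms, q).item() for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
    median = quartiles[2]
    threshold = clipping_scale * median  # matches the logged "threshold=" value
    percent_clipped = 100.0 * (grad_norms > threshold).float().mean().item()
    print(
        "Clipping_scale=%.1f, grad-norm quartiles %s, threshold=%.3e, percent-clipped=%.1f"
        % (clipping_scale, " ".join("%.3e" % q for q in quartiles), threshold, percent_clipped)
    )
    return threshold

# Example with made-up norms in the same range as the log entries:
norms = torch.tensor([109.7, 132.2, 144.0, 155.2, 230.2])
summarize_grad_norms(norms)  # prints a line in the same format as the optim.py entries

Second, the frequent scaling.py lines "ScheduledFloat: name=..., batch_count=..., ans=..." report hyperparameters (skip rates, dropout probabilities, balancer targets, bypass scales, ...) whose value is scheduled as a function of the global batch count. The sketch below shows one way a piecewise-linear schedule keyed on batch_count could yield such values; the breakpoints are hypothetical and this is not the actual scaling.py implementation.

import bisect

class PiecewiseLinearSchedule:
    """Illustrative piecewise-linear schedule over batch_count (hypothetical)."""

    def __init__(self, points):
        # points: (batch_count, value) pairs sorted by batch_count
        self.xs = [x for x, _ in points]
        self.ys = [y for _, y in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# Hypothetical skip-rate schedule that has decayed to 0.0 by this point in training:
attention_skip_rate = PiecewiseLinearSchedule([(0, 0.5), (4000, 0.05), (16000, 0.0)])
print(attention_skip_rate(199366.67))  # -> 0.0, consistent with the "ans=0.0" entries above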
2023-12-04 08:17:57,372 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 08:18:02,023 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=17.73 vs. limit=15.0 2023-12-04 08:18:05,173 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.14 vs. limit=12.0 2023-12-04 08:18:05,252 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.01 vs. limit=12.0 2023-12-04 08:18:07,931 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.168e+02 1.315e+02 1.438e+02 1.571e+02 2.308e+02, threshold=2.875e+02, percent-clipped=0.0 2023-12-04 08:18:10,348 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=202933.33333333334, ans=0.0 2023-12-04 08:18:29,983 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:18:40,908 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.45 vs. limit=15.0 2023-12-04 08:18:53,067 INFO [train.py:1087] (1/4) Epoch 35, batch 50, loss[loss=0.1587, simple_loss=0.2506, pruned_loss=0.03337, over 24765.00 frames. ], tot_loss[loss=0.1647, simple_loss=0.256, pruned_loss=0.03672, over 1102654.50 frames. ], batch size: 64, lr: 7.04e-03, grad_scale: 32.0 2023-12-04 08:18:54,568 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.32 vs. limit=15.0 2023-12-04 08:19:07,139 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=203266.66666666666, ans=0.125 2023-12-04 08:19:26,270 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=203400.0, ans=0.125 2023-12-04 08:19:36,062 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.23 vs. limit=15.0 2023-12-04 08:19:46,873 INFO [train.py:1087] (1/4) Epoch 35, batch 100, loss[loss=0.1589, simple_loss=0.2545, pruned_loss=0.03169, over 24566.00 frames. ], tot_loss[loss=0.1639, simple_loss=0.2551, pruned_loss=0.03637, over 1942883.70 frames. ], batch size: 63, lr: 7.03e-03, grad_scale: 32.0 2023-12-04 08:19:58,778 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.180e+02 1.339e+02 1.417e+02 1.550e+02 3.139e+02, threshold=2.834e+02, percent-clipped=1.0 2023-12-04 08:19:59,235 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.19 vs. limit=12.0 2023-12-04 08:20:23,346 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=203733.33333333334, ans=0.2 2023-12-04 08:20:42,195 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.62 vs. limit=22.5 2023-12-04 08:20:42,458 INFO [train.py:1087] (1/4) Epoch 35, batch 150, loss[loss=0.1489, simple_loss=0.2455, pruned_loss=0.02619, over 24763.00 frames. 
], tot_loss[loss=0.1642, simple_loss=0.2552, pruned_loss=0.03656, over 2583962.28 frames. ], batch size: 66, lr: 7.03e-03, grad_scale: 32.0 2023-12-04 08:20:48,018 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=203866.66666666666, ans=0.125 2023-12-04 08:20:58,621 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=203933.33333333334, ans=0.1 2023-12-04 08:20:58,709 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=203933.33333333334, ans=0.125 2023-12-04 08:20:59,742 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=203933.33333333334, ans=0.2 2023-12-04 08:21:08,868 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=204000.0, ans=0.1 2023-12-04 08:21:10,976 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=204000.0, ans=0.125 2023-12-04 08:21:38,507 INFO [train.py:1087] (1/4) Epoch 35, batch 200, loss[loss=0.2065, simple_loss=0.2878, pruned_loss=0.06264, over 16466.00 frames. ], tot_loss[loss=0.1641, simple_loss=0.2552, pruned_loss=0.03646, over 3070939.57 frames. ], batch size: 178, lr: 7.02e-03, grad_scale: 32.0 2023-12-04 08:21:39,840 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=204200.0, ans=0.125 2023-12-04 08:21:45,133 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=204200.0, ans=0.125 2023-12-04 08:21:49,182 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.112e+02 1.298e+02 1.418e+02 1.532e+02 2.225e+02, threshold=2.837e+02, percent-clipped=0.0 2023-12-04 08:21:52,503 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=204266.66666666666, ans=0.1 2023-12-04 08:21:57,855 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=204266.66666666666, ans=0.1 2023-12-04 08:21:59,397 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=15.0 2023-12-04 08:22:08,778 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.83 vs. limit=6.0 2023-12-04 08:22:26,372 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=204466.66666666666, ans=0.125 2023-12-04 08:22:34,153 INFO [train.py:1087] (1/4) Epoch 35, batch 250, loss[loss=0.1645, simple_loss=0.2576, pruned_loss=0.03565, over 24703.00 frames. ], tot_loss[loss=0.1639, simple_loss=0.255, pruned_loss=0.03643, over 3463360.99 frames. 
], batch size: 69, lr: 7.02e-03, grad_scale: 64.0 2023-12-04 08:22:41,752 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=204533.33333333334, ans=0.125 2023-12-04 08:23:04,745 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.20 vs. limit=12.0 2023-12-04 08:23:07,679 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=204733.33333333334, ans=0.1 2023-12-04 08:23:10,756 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=204733.33333333334, ans=0.125 2023-12-04 08:23:16,367 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=204733.33333333334, ans=0.1 2023-12-04 08:23:29,181 INFO [train.py:1087] (1/4) Epoch 35, batch 300, loss[loss=0.1703, simple_loss=0.2609, pruned_loss=0.03985, over 24215.00 frames. ], tot_loss[loss=0.1637, simple_loss=0.2549, pruned_loss=0.03623, over 3765690.04 frames. ], batch size: 82, lr: 7.01e-03, grad_scale: 32.0 2023-12-04 08:23:38,219 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.33 vs. limit=15.0 2023-12-04 08:23:41,635 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.318e+02 1.389e+02 1.563e+02 2.065e+02, threshold=2.778e+02, percent-clipped=0.0 2023-12-04 08:24:07,794 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.73 vs. limit=22.5 2023-12-04 08:24:13,755 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=205133.33333333334, ans=0.1 2023-12-04 08:24:20,345 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.93 vs. limit=22.5 2023-12-04 08:24:24,355 INFO [train.py:1087] (1/4) Epoch 35, batch 350, loss[loss=0.1505, simple_loss=0.2471, pruned_loss=0.02691, over 24790.00 frames. ], tot_loss[loss=0.1639, simple_loss=0.255, pruned_loss=0.03644, over 3997611.96 frames. ], batch size: 72, lr: 7.01e-03, grad_scale: 32.0 2023-12-04 08:24:27,796 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=205200.0, ans=0.0 2023-12-04 08:24:45,555 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:24:57,719 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=205400.0, ans=0.2 2023-12-04 08:25:18,318 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.51 vs. limit=22.5 2023-12-04 08:25:18,779 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=205533.33333333334, ans=0.04949747468305833 2023-12-04 08:25:19,669 INFO [train.py:1087] (1/4) Epoch 35, batch 400, loss[loss=0.1625, simple_loss=0.2556, pruned_loss=0.03471, over 24705.00 frames. ], tot_loss[loss=0.1639, simple_loss=0.255, pruned_loss=0.03634, over 4185006.06 frames. 
], batch size: 69, lr: 7.00e-03, grad_scale: 32.0 2023-12-04 08:25:24,144 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=205533.33333333334, ans=0.125 2023-12-04 08:25:25,520 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.02 vs. limit=15.0 2023-12-04 08:25:29,875 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=205600.0, ans=0.0 2023-12-04 08:25:31,779 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.126e+02 1.342e+02 1.464e+02 1.631e+02 2.167e+02, threshold=2.928e+02, percent-clipped=0.0 2023-12-04 08:25:39,398 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=205600.0, ans=0.125 2023-12-04 08:25:39,727 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.15 vs. limit=15.0 2023-12-04 08:26:05,268 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=205800.0, ans=0.125 2023-12-04 08:26:15,233 INFO [train.py:1087] (1/4) Epoch 35, batch 450, loss[loss=0.1625, simple_loss=0.253, pruned_loss=0.03603, over 24611.00 frames. ], tot_loss[loss=0.1637, simple_loss=0.2549, pruned_loss=0.03625, over 4325739.01 frames. ], batch size: 68, lr: 7.00e-03, grad_scale: 32.0 2023-12-04 08:26:16,978 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.28 vs. limit=22.5 2023-12-04 08:26:22,872 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=205866.66666666666, ans=0.1 2023-12-04 08:27:02,450 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.70 vs. limit=10.0 2023-12-04 08:27:03,346 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=206133.33333333334, ans=0.09899494936611666 2023-12-04 08:27:03,367 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=206133.33333333334, ans=0.125 2023-12-04 08:27:10,534 INFO [train.py:1087] (1/4) Epoch 35, batch 500, loss[loss=0.1591, simple_loss=0.2482, pruned_loss=0.03504, over 24738.00 frames. ], tot_loss[loss=0.164, simple_loss=0.2552, pruned_loss=0.03644, over 4420516.12 frames. 
], batch size: 63, lr: 6.99e-03, grad_scale: 32.0 2023-12-04 08:27:22,124 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.152e+02 1.328e+02 1.465e+02 1.669e+02 2.352e+02, threshold=2.929e+02, percent-clipped=0.0 2023-12-04 08:27:22,346 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=206266.66666666666, ans=0.125 2023-12-04 08:27:26,687 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:27:27,681 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=206266.66666666666, ans=0.125 2023-12-04 08:27:41,162 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=206333.33333333334, ans=0.0 2023-12-04 08:28:04,620 INFO [train.py:1087] (1/4) Epoch 35, batch 550, loss[loss=0.1538, simple_loss=0.2435, pruned_loss=0.03204, over 24767.00 frames. ], tot_loss[loss=0.1641, simple_loss=0.2552, pruned_loss=0.03653, over 4502817.04 frames. ], batch size: 65, lr: 6.98e-03, grad_scale: 32.0 2023-12-04 08:28:06,013 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=206533.33333333334, ans=0.0 2023-12-04 08:28:06,158 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.24 vs. limit=15.0 2023-12-04 08:28:20,538 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:28:21,659 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=206600.0, ans=0.0 2023-12-04 08:28:31,495 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=206666.66666666666, ans=0.2 2023-12-04 08:28:33,974 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.24 vs. limit=10.0 2023-12-04 08:28:45,333 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.32 vs. limit=15.0 2023-12-04 08:28:50,646 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=206800.0, ans=0.1 2023-12-04 08:28:59,957 INFO [train.py:1087] (1/4) Epoch 35, batch 600, loss[loss=0.1542, simple_loss=0.2453, pruned_loss=0.03151, over 24724.00 frames. ], tot_loss[loss=0.1639, simple_loss=0.255, pruned_loss=0.03635, over 4575793.48 frames. ], batch size: 69, lr: 6.98e-03, grad_scale: 32.0 2023-12-04 08:29:12,021 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.106e+02 1.323e+02 1.412e+02 1.529e+02 2.400e+02, threshold=2.825e+02, percent-clipped=0.0 2023-12-04 08:29:15,986 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.45 vs. 
limit=22.5 2023-12-04 08:29:39,568 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=207066.66666666666, ans=0.95 2023-12-04 08:29:43,532 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=207133.33333333334, ans=0.1 2023-12-04 08:29:47,911 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=207133.33333333334, ans=0.125 2023-12-04 08:29:55,541 INFO [train.py:1087] (1/4) Epoch 35, batch 650, loss[loss=0.1588, simple_loss=0.2524, pruned_loss=0.03261, over 24786.00 frames. ], tot_loss[loss=0.1635, simple_loss=0.2547, pruned_loss=0.03614, over 4639517.11 frames. ], batch size: 73, lr: 6.97e-03, grad_scale: 32.0 2023-12-04 08:29:55,711 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=207200.0, ans=0.2 2023-12-04 08:30:01,952 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=207200.0, ans=0.1 2023-12-04 08:30:08,343 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=207266.66666666666, ans=0.1 2023-12-04 08:30:15,605 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=207266.66666666666, ans=0.125 2023-12-04 08:30:44,915 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=207466.66666666666, ans=0.1 2023-12-04 08:30:46,536 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-12-04 08:30:47,372 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=207466.66666666666, ans=0.125 2023-12-04 08:30:47,410 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=207466.66666666666, ans=0.2 2023-12-04 08:30:48,484 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=207466.66666666666, ans=0.125 2023-12-04 08:30:50,706 INFO [train.py:1087] (1/4) Epoch 35, batch 700, loss[loss=0.1826, simple_loss=0.2706, pruned_loss=0.04731, over 23466.00 frames. ], tot_loss[loss=0.1641, simple_loss=0.2553, pruned_loss=0.03648, over 4669803.68 frames. ], batch size: 94, lr: 6.97e-03, grad_scale: 32.0 2023-12-04 08:30:50,884 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=207533.33333333334, ans=0.125 2023-12-04 08:31:02,811 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.168e+02 1.354e+02 1.423e+02 1.583e+02 2.167e+02, threshold=2.846e+02, percent-clipped=0.0 2023-12-04 08:31:12,656 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=207666.66666666666, ans=0.0 2023-12-04 08:31:31,010 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=207733.33333333334, ans=0.07 2023-12-04 08:31:37,749 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.74 vs. 
limit=22.5 2023-12-04 08:31:39,597 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=207800.0, ans=0.0 2023-12-04 08:31:40,649 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=207800.0, ans=0.0 2023-12-04 08:31:45,652 INFO [train.py:1087] (1/4) Epoch 35, batch 750, loss[loss=0.1538, simple_loss=0.2463, pruned_loss=0.03058, over 24710.00 frames. ], tot_loss[loss=0.1647, simple_loss=0.2558, pruned_loss=0.03685, over 4675199.27 frames. ], batch size: 69, lr: 6.96e-03, grad_scale: 32.0 2023-12-04 08:31:56,500 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=207933.33333333334, ans=0.1 2023-12-04 08:32:05,388 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=207933.33333333334, ans=0.2 2023-12-04 08:32:15,103 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=208000.0, ans=0.125 2023-12-04 08:32:19,615 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=208066.66666666666, ans=0.125 2023-12-04 08:32:21,946 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=208066.66666666666, ans=0.125 2023-12-04 08:32:41,129 INFO [train.py:1087] (1/4) Epoch 35, batch 800, loss[loss=0.1776, simple_loss=0.2682, pruned_loss=0.04348, over 23484.00 frames. ], tot_loss[loss=0.1649, simple_loss=0.2559, pruned_loss=0.03692, over 4705272.40 frames. ], batch size: 94, lr: 6.96e-03, grad_scale: 32.0 2023-12-04 08:32:52,905 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.157e+02 1.339e+02 1.480e+02 1.653e+02 2.128e+02, threshold=2.959e+02, percent-clipped=0.0 2023-12-04 08:32:54,205 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=208266.66666666666, ans=0.125 2023-12-04 08:32:57,830 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.28 vs. limit=10.0 2023-12-04 08:33:14,831 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.56 vs. limit=15.0 2023-12-04 08:33:32,762 INFO [train.py:1087] (1/4) Epoch 35, batch 850, loss[loss=0.1673, simple_loss=0.2602, pruned_loss=0.03725, over 24200.00 frames. ], tot_loss[loss=0.1649, simple_loss=0.2559, pruned_loss=0.03699, over 4738340.75 frames. ], batch size: 82, lr: 6.95e-03, grad_scale: 32.0 2023-12-04 08:33:51,190 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.85 vs. limit=15.0 2023-12-04 08:33:51,949 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=208666.66666666666, ans=0.1 2023-12-04 08:34:33,681 INFO [train.py:1087] (1/4) Epoch 36, batch 0, loss[loss=0.1532, simple_loss=0.2485, pruned_loss=0.02896, over 24803.00 frames. ], tot_loss[loss=0.1532, simple_loss=0.2485, pruned_loss=0.02896, over 24803.00 frames. 
], batch size: 73, lr: 6.85e-03, grad_scale: 32.0 2023-12-04 08:34:33,681 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 08:34:45,808 INFO [train.py:1119] (1/4) Epoch 36, validation: loss=0.1524, simple_loss=0.2526, pruned_loss=0.0261, over 944034.00 frames. 2023-12-04 08:34:45,808 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 08:34:50,212 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=208833.33333333334, ans=0.125 2023-12-04 08:34:50,455 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=208833.33333333334, ans=22.5 2023-12-04 08:34:51,814 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.11 vs. limit=6.0 2023-12-04 08:35:00,755 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:35:00,831 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=208900.0, ans=0.07 2023-12-04 08:35:02,243 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=208900.0, ans=0.1 2023-12-04 08:35:03,387 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.121e+02 1.345e+02 1.425e+02 1.631e+02 2.922e+02, threshold=2.850e+02, percent-clipped=0.0 2023-12-04 08:35:08,799 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=208966.66666666666, ans=0.125 2023-12-04 08:35:16,486 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.41 vs. limit=15.0 2023-12-04 08:35:41,315 INFO [train.py:1087] (1/4) Epoch 36, batch 50, loss[loss=0.1848, simple_loss=0.2671, pruned_loss=0.05126, over 17018.00 frames. ], tot_loss[loss=0.1644, simple_loss=0.2555, pruned_loss=0.03663, over 1085371.47 frames. ], batch size: 177, lr: 6.84e-03, grad_scale: 32.0 2023-12-04 08:35:49,093 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=209166.66666666666, ans=0.0 2023-12-04 08:35:50,473 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.09 vs. limit=22.5 2023-12-04 08:36:02,019 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=209300.0, ans=0.125 2023-12-04 08:36:02,970 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=209300.0, ans=0.05 2023-12-04 08:36:05,950 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=209300.0, ans=0.125 2023-12-04 08:36:18,490 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.00 vs. limit=15.0 2023-12-04 08:36:35,711 INFO [train.py:1087] (1/4) Epoch 36, batch 100, loss[loss=0.1535, simple_loss=0.2471, pruned_loss=0.02998, over 24810.00 frames. ], tot_loss[loss=0.1636, simple_loss=0.2555, pruned_loss=0.03589, over 1916801.94 frames. 
], batch size: 72, lr: 6.84e-03, grad_scale: 32.0 2023-12-04 08:36:51,980 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=209566.66666666666, ans=0.125 2023-12-04 08:36:52,950 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=209566.66666666666, ans=0.07 2023-12-04 08:36:53,708 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.161e+02 1.355e+02 1.450e+02 1.613e+02 2.179e+02, threshold=2.899e+02, percent-clipped=0.0 2023-12-04 08:37:02,968 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=209633.33333333334, ans=10.0 2023-12-04 08:37:05,659 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=209633.33333333334, ans=0.125 2023-12-04 08:37:17,097 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=209700.0, ans=0.1 2023-12-04 08:37:24,886 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=209766.66666666666, ans=0.1 2023-12-04 08:37:30,865 INFO [train.py:1087] (1/4) Epoch 36, batch 150, loss[loss=0.1545, simple_loss=0.2476, pruned_loss=0.03071, over 24732.00 frames. ], tot_loss[loss=0.1638, simple_loss=0.2553, pruned_loss=0.0362, over 2564439.88 frames. ], batch size: 67, lr: 6.83e-03, grad_scale: 32.0 2023-12-04 08:37:31,123 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:37:39,931 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=209833.33333333334, ans=0.125 2023-12-04 08:37:40,291 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.22 vs. limit=10.0 2023-12-04 08:38:26,340 INFO [train.py:1087] (1/4) Epoch 36, batch 200, loss[loss=0.1745, simple_loss=0.2632, pruned_loss=0.04294, over 21342.00 frames. ], tot_loss[loss=0.164, simple_loss=0.2552, pruned_loss=0.0364, over 3053335.01 frames. ], batch size: 127, lr: 6.83e-03, grad_scale: 32.0 2023-12-04 08:38:28,748 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=210166.66666666666, ans=0.125 2023-12-04 08:38:39,490 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=210233.33333333334, ans=0.0 2023-12-04 08:38:44,192 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.131e+02 1.333e+02 1.452e+02 1.584e+02 2.539e+02, threshold=2.905e+02, percent-clipped=0.0 2023-12-04 08:39:01,897 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.82 vs. limit=15.0 2023-12-04 08:39:21,282 INFO [train.py:1087] (1/4) Epoch 36, batch 250, loss[loss=0.1599, simple_loss=0.253, pruned_loss=0.03336, over 24552.00 frames. ], tot_loss[loss=0.1646, simple_loss=0.2555, pruned_loss=0.03684, over 3423247.32 frames. 
], batch size: 63, lr: 6.82e-03, grad_scale: 16.0 2023-12-04 08:39:21,434 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=210500.0, ans=0.125 2023-12-04 08:39:38,899 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=210566.66666666666, ans=0.1 2023-12-04 08:40:16,966 INFO [train.py:1087] (1/4) Epoch 36, batch 300, loss[loss=0.1686, simple_loss=0.2621, pruned_loss=0.03753, over 24566.00 frames. ], tot_loss[loss=0.164, simple_loss=0.2551, pruned_loss=0.0365, over 3729016.55 frames. ], batch size: 62, lr: 6.82e-03, grad_scale: 16.0 2023-12-04 08:40:19,250 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=210833.33333333334, ans=0.1 2023-12-04 08:40:31,094 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=210900.0, ans=0.125 2023-12-04 08:40:35,393 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.122e+02 1.298e+02 1.397e+02 1.525e+02 2.396e+02, threshold=2.794e+02, percent-clipped=0.0 2023-12-04 08:40:43,430 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=210966.66666666666, ans=0.125 2023-12-04 08:40:50,150 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=211033.33333333334, ans=0.2 2023-12-04 08:40:51,166 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=211033.33333333334, ans=0.2 2023-12-04 08:40:53,250 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=211033.33333333334, ans=0.125 2023-12-04 08:41:03,237 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.71 vs. limit=15.0 2023-12-04 08:41:11,509 INFO [train.py:1087] (1/4) Epoch 36, batch 350, loss[loss=0.1492, simple_loss=0.2452, pruned_loss=0.02666, over 24762.00 frames. ], tot_loss[loss=0.1639, simple_loss=0.2551, pruned_loss=0.03637, over 3973027.64 frames. ], batch size: 65, lr: 6.81e-03, grad_scale: 16.0 2023-12-04 08:41:32,705 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.84 vs. limit=10.0 2023-12-04 08:41:36,367 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=211300.0, ans=0.1 2023-12-04 08:41:55,785 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=211433.33333333334, ans=0.125 2023-12-04 08:41:58,859 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:42:07,147 INFO [train.py:1087] (1/4) Epoch 36, batch 400, loss[loss=0.1504, simple_loss=0.2433, pruned_loss=0.02873, over 24768.00 frames. ], tot_loss[loss=0.1638, simple_loss=0.2549, pruned_loss=0.03636, over 4155837.86 frames. 
], batch size: 70, lr: 6.81e-03, grad_scale: 32.0 2023-12-04 08:42:21,174 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=211566.66666666666, ans=0.125 2023-12-04 08:42:26,290 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.148e+02 1.324e+02 1.435e+02 1.583e+02 2.493e+02, threshold=2.870e+02, percent-clipped=0.0 2023-12-04 08:42:26,824 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.83 vs. limit=15.0 2023-12-04 08:42:33,855 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=211633.33333333334, ans=0.125 2023-12-04 08:43:02,175 INFO [train.py:1087] (1/4) Epoch 36, batch 450, loss[loss=0.1717, simple_loss=0.2619, pruned_loss=0.0408, over 24332.00 frames. ], tot_loss[loss=0.1639, simple_loss=0.255, pruned_loss=0.03637, over 4287396.89 frames. ], batch size: 79, lr: 6.80e-03, grad_scale: 32.0 2023-12-04 08:43:29,893 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=211966.66666666666, ans=0.2 2023-12-04 08:43:37,140 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:43:40,442 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=212033.33333333334, ans=0.0 2023-12-04 08:43:40,598 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.97 vs. limit=22.5 2023-12-04 08:43:50,423 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=212100.0, ans=0.1 2023-12-04 08:43:57,323 INFO [train.py:1087] (1/4) Epoch 36, batch 500, loss[loss=0.149, simple_loss=0.2445, pruned_loss=0.02669, over 24597.00 frames. ], tot_loss[loss=0.1637, simple_loss=0.2549, pruned_loss=0.03621, over 4417159.62 frames. ], batch size: 68, lr: 6.80e-03, grad_scale: 32.0 2023-12-04 08:44:05,897 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=212166.66666666666, ans=0.125 2023-12-04 08:44:10,375 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=212233.33333333334, ans=0.0 2023-12-04 08:44:15,373 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.137e+02 1.299e+02 1.402e+02 1.525e+02 1.849e+02, threshold=2.804e+02, percent-clipped=0.0 2023-12-04 08:44:21,403 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:44:26,632 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.47 vs. limit=15.0 2023-12-04 08:44:30,595 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=212366.66666666666, ans=0.0 2023-12-04 08:44:49,278 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=212433.33333333334, ans=0.0 2023-12-04 08:44:51,611 INFO [train.py:1087] (1/4) Epoch 36, batch 550, loss[loss=0.1658, simple_loss=0.2545, pruned_loss=0.03856, over 24561.00 frames. 
], tot_loss[loss=0.1634, simple_loss=0.2547, pruned_loss=0.03604, over 4516490.86 frames. ], batch size: 63, lr: 6.79e-03, grad_scale: 32.0 2023-12-04 08:44:59,461 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=212500.0, ans=0.0 2023-12-04 08:45:02,483 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=212566.66666666666, ans=0.2 2023-12-04 08:45:02,616 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=212566.66666666666, ans=0.1 2023-12-04 08:45:08,931 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=212566.66666666666, ans=0.0 2023-12-04 08:45:11,266 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0 2023-12-04 08:45:30,386 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.48 vs. limit=15.0 2023-12-04 08:45:40,067 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=212766.66666666666, ans=0.05 2023-12-04 08:45:45,297 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=212766.66666666666, ans=0.05 2023-12-04 08:45:47,206 INFO [train.py:1087] (1/4) Epoch 36, batch 600, loss[loss=0.145, simple_loss=0.2327, pruned_loss=0.02867, over 24732.00 frames. ], tot_loss[loss=0.1633, simple_loss=0.2546, pruned_loss=0.03599, over 4575036.35 frames. ], batch size: 69, lr: 6.79e-03, grad_scale: 32.0 2023-12-04 08:45:50,659 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=212833.33333333334, ans=0.0 2023-12-04 08:46:05,985 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=212900.0, ans=0.125 2023-12-04 08:46:06,704 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.142e+02 1.299e+02 1.430e+02 1.549e+02 1.919e+02, threshold=2.861e+02, percent-clipped=0.0 2023-12-04 08:46:32,434 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.68 vs. limit=15.0 2023-12-04 08:46:35,555 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=213100.0, ans=12.0 2023-12-04 08:46:39,158 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.37 vs. limit=22.5 2023-12-04 08:46:42,912 INFO [train.py:1087] (1/4) Epoch 36, batch 650, loss[loss=0.1685, simple_loss=0.2576, pruned_loss=0.03968, over 24456.00 frames. ], tot_loss[loss=0.1637, simple_loss=0.2548, pruned_loss=0.03624, over 4623913.85 frames. 
], batch size: 77, lr: 6.78e-03, grad_scale: 32.0 2023-12-04 08:46:47,501 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=213166.66666666666, ans=0.1 2023-12-04 08:46:59,758 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:47:05,463 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=213300.0, ans=0.125 2023-12-04 08:47:22,909 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=213366.66666666666, ans=0.125 2023-12-04 08:47:26,242 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-12-04 08:47:28,183 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=213433.33333333334, ans=0.2 2023-12-04 08:47:40,335 INFO [train.py:1087] (1/4) Epoch 36, batch 700, loss[loss=0.1639, simple_loss=0.2545, pruned_loss=0.03663, over 24713.00 frames. ], tot_loss[loss=0.1639, simple_loss=0.2551, pruned_loss=0.03641, over 4640904.21 frames. ], batch size: 67, lr: 6.78e-03, grad_scale: 32.0 2023-12-04 08:47:47,242 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=213500.0, ans=0.1 2023-12-04 08:47:58,633 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.184e+02 1.337e+02 1.455e+02 1.588e+02 2.425e+02, threshold=2.911e+02, percent-clipped=0.0 2023-12-04 08:48:18,502 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=213700.0, ans=0.05 2023-12-04 08:48:22,820 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=213700.0, ans=0.0 2023-12-04 08:48:34,026 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=213766.66666666666, ans=0.125 2023-12-04 08:48:35,889 INFO [train.py:1087] (1/4) Epoch 36, batch 750, loss[loss=0.1832, simple_loss=0.2766, pruned_loss=0.0449, over 23670.00 frames. ], tot_loss[loss=0.1638, simple_loss=0.255, pruned_loss=0.03628, over 4673870.84 frames. 
], batch size: 94, lr: 6.77e-03, grad_scale: 32.0 2023-12-04 08:48:46,106 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=213900.0, ans=0.2 2023-12-04 08:48:47,388 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=213900.0, ans=15.0 2023-12-04 08:48:53,717 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=213900.0, ans=0.125 2023-12-04 08:48:57,900 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=213966.66666666666, ans=0.035 2023-12-04 08:49:07,109 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=213966.66666666666, ans=0.0 2023-12-04 08:49:15,069 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=214033.33333333334, ans=0.125 2023-12-04 08:49:28,204 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=214100.0, ans=0.07 2023-12-04 08:49:28,294 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=214100.0, ans=0.0 2023-12-04 08:49:31,189 INFO [train.py:1087] (1/4) Epoch 36, batch 800, loss[loss=0.1657, simple_loss=0.2538, pruned_loss=0.03877, over 24768.00 frames. ], tot_loss[loss=0.1641, simple_loss=0.2551, pruned_loss=0.03649, over 4708310.44 frames. ], batch size: 64, lr: 6.77e-03, grad_scale: 32.0 2023-12-04 08:49:31,470 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=214166.66666666666, ans=0.0 2023-12-04 08:49:42,624 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=214233.33333333334, ans=0.125 2023-12-04 08:49:49,602 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.102e+02 1.303e+02 1.430e+02 1.543e+02 2.060e+02, threshold=2.860e+02, percent-clipped=0.0 2023-12-04 08:49:55,809 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=214300.0, ans=0.0 2023-12-04 08:50:06,634 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:50:16,632 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=214433.33333333334, ans=0.2 2023-12-04 08:50:17,626 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=214433.33333333334, ans=0.2 2023-12-04 08:50:22,399 INFO [train.py:1087] (1/4) Epoch 36, batch 850, loss[loss=0.1647, simple_loss=0.2611, pruned_loss=0.03414, over 24784.00 frames. ], tot_loss[loss=0.1643, simple_loss=0.2551, pruned_loss=0.03672, over 4720019.10 frames. 
], batch size: 73, lr: 6.76e-03, grad_scale: 32.0 2023-12-04 08:50:28,001 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=214500.0, ans=0.1 2023-12-04 08:50:29,466 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=214500.0, ans=15.0 2023-12-04 08:50:34,879 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=214566.66666666666, ans=0.125 2023-12-04 08:50:40,106 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.39 vs. limit=15.0 2023-12-04 08:50:45,847 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=214633.33333333334, ans=0.0 2023-12-04 08:51:23,543 INFO [train.py:1087] (1/4) Epoch 37, batch 0, loss[loss=0.1541, simple_loss=0.2488, pruned_loss=0.02974, over 21963.00 frames. ], tot_loss[loss=0.1541, simple_loss=0.2488, pruned_loss=0.02974, over 21963.00 frames. ], batch size: 128, lr: 6.67e-03, grad_scale: 32.0 2023-12-04 08:51:23,544 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 08:51:35,844 INFO [train.py:1119] (1/4) Epoch 37, validation: loss=0.1531, simple_loss=0.2525, pruned_loss=0.02684, over 944034.00 frames. 2023-12-04 08:51:35,845 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 08:51:38,270 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=214800.0, ans=0.125 2023-12-04 08:51:38,316 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=214800.0, ans=0.2 2023-12-04 08:51:42,885 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-12-04 08:51:46,134 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.33 vs. limit=15.0 2023-12-04 08:51:47,739 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=214866.66666666666, ans=0.125 2023-12-04 08:51:52,126 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=214866.66666666666, ans=0.0 2023-12-04 08:52:00,135 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.176e+02 1.321e+02 1.427e+02 1.586e+02 2.182e+02, threshold=2.854e+02, percent-clipped=0.0 2023-12-04 08:52:08,237 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=215000.0, ans=0.125 2023-12-04 08:52:08,367 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.51 vs. limit=22.5 2023-12-04 08:52:25,534 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=215066.66666666666, ans=0.0 2023-12-04 08:52:27,937 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.43 vs. 
limit=22.5 2023-12-04 08:52:28,662 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=215066.66666666666, ans=0.035 2023-12-04 08:52:30,590 INFO [train.py:1087] (1/4) Epoch 37, batch 50, loss[loss=0.1623, simple_loss=0.2534, pruned_loss=0.03554, over 24767.00 frames. ], tot_loss[loss=0.1634, simple_loss=0.2554, pruned_loss=0.0357, over 1090768.65 frames. ], batch size: 64, lr: 6.66e-03, grad_scale: 32.0 2023-12-04 08:52:33,282 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.86 vs. limit=15.0 2023-12-04 08:52:39,809 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=215133.33333333334, ans=0.125 2023-12-04 08:52:44,025 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=215200.0, ans=0.1 2023-12-04 08:53:21,021 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=215400.0, ans=0.0 2023-12-04 08:53:24,526 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=215400.0, ans=0.0 2023-12-04 08:53:26,350 INFO [train.py:1087] (1/4) Epoch 37, batch 100, loss[loss=0.1516, simple_loss=0.2448, pruned_loss=0.02919, over 24774.00 frames. ], tot_loss[loss=0.1634, simple_loss=0.2551, pruned_loss=0.03585, over 1919693.48 frames. ], batch size: 71, lr: 6.66e-03, grad_scale: 32.0 2023-12-04 08:53:30,316 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=215466.66666666666, ans=0.2 2023-12-04 08:53:41,935 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:53:46,659 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=215533.33333333334, ans=0.07 2023-12-04 08:53:50,476 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.65 vs. limit=6.0 2023-12-04 08:53:50,663 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.117e+02 1.319e+02 1.413e+02 1.518e+02 2.064e+02, threshold=2.826e+02, percent-clipped=0.0 2023-12-04 08:54:01,582 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-12-04 08:54:13,140 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.63 vs. limit=22.5 2023-12-04 08:54:21,509 INFO [train.py:1087] (1/4) Epoch 37, batch 150, loss[loss=0.1587, simple_loss=0.2504, pruned_loss=0.03353, over 24851.00 frames. ], tot_loss[loss=0.1631, simple_loss=0.2548, pruned_loss=0.03572, over 2556634.81 frames. ], batch size: 68, lr: 6.65e-03, grad_scale: 16.0 2023-12-04 08:54:31,217 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.29 vs. 
limit=15.0 2023-12-04 08:54:51,420 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=215933.33333333334, ans=0.2 2023-12-04 08:55:03,308 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=216000.0, ans=0.09899494936611666 2023-12-04 08:55:07,527 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=216066.66666666666, ans=0.125 2023-12-04 08:55:07,534 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=216066.66666666666, ans=0.125 2023-12-04 08:55:16,884 INFO [train.py:1087] (1/4) Epoch 37, batch 200, loss[loss=0.1607, simple_loss=0.2519, pruned_loss=0.03474, over 24775.00 frames. ], tot_loss[loss=0.163, simple_loss=0.2544, pruned_loss=0.03573, over 3068971.97 frames. ], batch size: 62, lr: 6.65e-03, grad_scale: 16.0 2023-12-04 08:55:20,278 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=216133.33333333334, ans=0.125 2023-12-04 08:55:23,888 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=216133.33333333334, ans=0.0 2023-12-04 08:55:32,888 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=216200.0, ans=0.1 2023-12-04 08:55:42,489 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.133e+02 1.342e+02 1.437e+02 1.645e+02 2.318e+02, threshold=2.874e+02, percent-clipped=0.0 2023-12-04 08:55:47,318 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.53 vs. limit=15.0 2023-12-04 08:55:53,342 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=216333.33333333334, ans=0.0 2023-12-04 08:56:01,618 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=216400.0, ans=0.0 2023-12-04 08:56:12,552 INFO [train.py:1087] (1/4) Epoch 37, batch 250, loss[loss=0.1831, simple_loss=0.2747, pruned_loss=0.04582, over 22865.00 frames. ], tot_loss[loss=0.1628, simple_loss=0.2542, pruned_loss=0.03575, over 3462002.39 frames. ], batch size: 106, lr: 6.64e-03, grad_scale: 16.0 2023-12-04 08:56:19,152 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=216466.66666666666, ans=0.2 2023-12-04 08:56:52,587 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=216666.66666666666, ans=0.0 2023-12-04 08:56:58,790 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=216733.33333333334, ans=0.125 2023-12-04 08:57:08,621 INFO [train.py:1087] (1/4) Epoch 37, batch 300, loss[loss=0.1481, simple_loss=0.2411, pruned_loss=0.02754, over 24580.00 frames. ], tot_loss[loss=0.1626, simple_loss=0.2538, pruned_loss=0.03567, over 3756011.46 frames. 
], batch size: 65, lr: 6.64e-03, grad_scale: 16.0 2023-12-04 08:57:12,064 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 08:57:24,334 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0 2023-12-04 08:57:33,742 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.199e+02 1.407e+02 1.504e+02 1.639e+02 2.110e+02, threshold=3.009e+02, percent-clipped=0.0 2023-12-04 08:57:51,995 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=217066.66666666666, ans=0.2 2023-12-04 08:58:03,933 INFO [train.py:1087] (1/4) Epoch 37, batch 350, loss[loss=0.1753, simple_loss=0.2635, pruned_loss=0.04356, over 24284.00 frames. ], tot_loss[loss=0.1632, simple_loss=0.2543, pruned_loss=0.03608, over 3979101.48 frames. ], batch size: 79, lr: 6.63e-03, grad_scale: 16.0 2023-12-04 08:58:05,340 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=217133.33333333334, ans=0.125 2023-12-04 08:58:11,693 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.92 vs. limit=15.0 2023-12-04 08:58:19,935 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=217200.0, ans=0.2 2023-12-04 08:58:25,240 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=217266.66666666666, ans=0.125 2023-12-04 08:58:38,165 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.82 vs. limit=22.5 2023-12-04 08:58:39,307 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.24 vs. limit=12.0 2023-12-04 08:58:44,473 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=217333.33333333334, ans=0.1 2023-12-04 08:58:51,925 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=217400.0, ans=0.2 2023-12-04 08:58:59,068 INFO [train.py:1087] (1/4) Epoch 37, batch 400, loss[loss=0.1688, simple_loss=0.2597, pruned_loss=0.03892, over 24791.00 frames. ], tot_loss[loss=0.163, simple_loss=0.2541, pruned_loss=0.03592, over 4146296.32 frames. ], batch size: 62, lr: 6.63e-03, grad_scale: 32.0 2023-12-04 08:59:05,294 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.03 vs. 
limit=15.0 2023-12-04 08:59:24,984 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.110e+02 1.319e+02 1.432e+02 1.652e+02 2.173e+02, threshold=2.865e+02, percent-clipped=0.0 2023-12-04 08:59:34,919 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=217666.66666666666, ans=0.125 2023-12-04 08:59:36,352 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=217666.66666666666, ans=0.1 2023-12-04 08:59:55,242 INFO [train.py:1087] (1/4) Epoch 37, batch 450, loss[loss=0.1719, simple_loss=0.262, pruned_loss=0.04092, over 21264.00 frames. ], tot_loss[loss=0.163, simple_loss=0.2541, pruned_loss=0.03589, over 4299118.92 frames. ], batch size: 127, lr: 6.62e-03, grad_scale: 32.0 2023-12-04 08:59:55,525 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=217800.0, ans=0.125 2023-12-04 09:00:02,880 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=217800.0, ans=0.2 2023-12-04 09:00:15,676 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=217866.66666666666, ans=0.0 2023-12-04 09:00:25,587 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=217933.33333333334, ans=0.1 2023-12-04 09:00:27,088 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=15.0 2023-12-04 09:00:38,531 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=218066.66666666666, ans=0.0 2023-12-04 09:00:51,480 INFO [train.py:1087] (1/4) Epoch 37, batch 500, loss[loss=0.1567, simple_loss=0.2486, pruned_loss=0.03245, over 24598.00 frames. ], tot_loss[loss=0.1635, simple_loss=0.2545, pruned_loss=0.03621, over 4423024.61 frames. ], batch size: 68, lr: 6.62e-03, grad_scale: 16.0 2023-12-04 09:00:58,081 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=218133.33333333334, ans=0.125 2023-12-04 09:01:12,625 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=218266.66666666666, ans=0.125 2023-12-04 09:01:18,058 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.127e+02 1.330e+02 1.427e+02 1.554e+02 2.935e+02, threshold=2.854e+02, percent-clipped=1.0 2023-12-04 09:01:36,667 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=218400.0, ans=0.2 2023-12-04 09:01:46,705 INFO [train.py:1087] (1/4) Epoch 37, batch 550, loss[loss=0.1582, simple_loss=0.249, pruned_loss=0.03377, over 24575.00 frames. ], tot_loss[loss=0.1634, simple_loss=0.2546, pruned_loss=0.03608, over 4509000.33 frames. ], batch size: 65, lr: 6.61e-03, grad_scale: 16.0 2023-12-04 09:01:52,329 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=218466.66666666666, ans=0.0 2023-12-04 09:01:54,667 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.53 vs. 
limit=15.0 2023-12-04 09:01:55,831 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.73 vs. limit=15.0 2023-12-04 09:02:03,403 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=218533.33333333334, ans=0.125 2023-12-04 09:02:20,541 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:02:21,672 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=218666.66666666666, ans=0.0 2023-12-04 09:02:41,952 INFO [train.py:1087] (1/4) Epoch 37, batch 600, loss[loss=0.1689, simple_loss=0.2562, pruned_loss=0.04085, over 24353.00 frames. ], tot_loss[loss=0.1635, simple_loss=0.2547, pruned_loss=0.03614, over 4574844.43 frames. ], batch size: 79, lr: 6.61e-03, grad_scale: 16.0 2023-12-04 09:02:45,304 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=218800.0, ans=0.125 2023-12-04 09:02:56,957 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=218866.66666666666, ans=10.0 2023-12-04 09:03:08,805 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.139e+02 1.340e+02 1.431e+02 1.602e+02 2.000e+02, threshold=2.862e+02, percent-clipped=0.0 2023-12-04 09:03:33,772 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=219066.66666666666, ans=0.2 2023-12-04 09:03:36,839 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=219133.33333333334, ans=0.125 2023-12-04 09:03:38,139 INFO [train.py:1087] (1/4) Epoch 37, batch 650, loss[loss=0.1542, simple_loss=0.2479, pruned_loss=0.0302, over 24769.00 frames. ], tot_loss[loss=0.1631, simple_loss=0.2544, pruned_loss=0.03592, over 4635795.03 frames. ], batch size: 64, lr: 6.60e-03, grad_scale: 16.0 2023-12-04 09:04:04,640 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=219266.66666666666, ans=0.125 2023-12-04 09:04:06,125 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.31 vs. limit=10.0 2023-12-04 09:04:26,916 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=219400.0, ans=0.0 2023-12-04 09:04:33,168 INFO [train.py:1087] (1/4) Epoch 37, batch 700, loss[loss=0.1645, simple_loss=0.253, pruned_loss=0.03803, over 24017.00 frames. ], tot_loss[loss=0.1633, simple_loss=0.2544, pruned_loss=0.03611, over 4657922.06 frames. 
], batch size: 87, lr: 6.60e-03, grad_scale: 16.0 2023-12-04 09:04:49,825 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=219533.33333333334, ans=0.125 2023-12-04 09:04:59,000 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=219600.0, ans=0.125 2023-12-04 09:04:59,779 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.126e+02 1.349e+02 1.440e+02 1.623e+02 2.299e+02, threshold=2.881e+02, percent-clipped=0.0 2023-12-04 09:05:05,293 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=219666.66666666666, ans=0.0 2023-12-04 09:05:10,550 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=219666.66666666666, ans=0.125 2023-12-04 09:05:18,795 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.08 vs. limit=22.5 2023-12-04 09:05:28,560 INFO [train.py:1087] (1/4) Epoch 37, batch 750, loss[loss=0.155, simple_loss=0.2485, pruned_loss=0.03075, over 24706.00 frames. ], tot_loss[loss=0.1629, simple_loss=0.2541, pruned_loss=0.0358, over 4690865.03 frames. ], batch size: 69, lr: 6.59e-03, grad_scale: 16.0 2023-12-04 09:05:30,936 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=219800.0, ans=0.0 2023-12-04 09:05:57,330 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=219933.33333333334, ans=0.0 2023-12-04 09:06:18,707 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=220066.66666666666, ans=0.1 2023-12-04 09:06:24,252 INFO [train.py:1087] (1/4) Epoch 37, batch 800, loss[loss=0.1632, simple_loss=0.2558, pruned_loss=0.03526, over 24699.00 frames. ], tot_loss[loss=0.1622, simple_loss=0.2535, pruned_loss=0.03543, over 4718828.75 frames. ], batch size: 69, lr: 6.59e-03, grad_scale: 32.0 2023-12-04 09:06:28,276 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.11 vs. limit=15.0 2023-12-04 09:06:39,267 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.38 vs. 
limit=6.0 2023-12-04 09:06:41,705 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=220200.0, ans=0.125 2023-12-04 09:06:49,584 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.141e+02 1.291e+02 1.369e+02 1.487e+02 2.326e+02, threshold=2.738e+02, percent-clipped=0.0 2023-12-04 09:06:57,779 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=220333.33333333334, ans=0.5 2023-12-04 09:07:07,702 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=220400.0, ans=0.2 2023-12-04 09:07:11,798 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=220400.0, ans=0.1 2023-12-04 09:07:14,834 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=220466.66666666666, ans=0.125 2023-12-04 09:07:14,867 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=220466.66666666666, ans=0.2 2023-12-04 09:07:15,677 INFO [train.py:1087] (1/4) Epoch 37, batch 850, loss[loss=0.1668, simple_loss=0.2614, pruned_loss=0.03606, over 24221.00 frames. ], tot_loss[loss=0.1624, simple_loss=0.2535, pruned_loss=0.03562, over 4734484.70 frames. ], batch size: 58, lr: 6.58e-03, grad_scale: 32.0 2023-12-04 09:07:26,027 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=220533.33333333334, ans=0.0 2023-12-04 09:07:39,868 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=220600.0, ans=0.09899494936611666 2023-12-04 09:08:14,610 INFO [train.py:1087] (1/4) Epoch 38, batch 0, loss[loss=0.1544, simple_loss=0.2444, pruned_loss=0.03217, over 24746.00 frames. ], tot_loss[loss=0.1544, simple_loss=0.2444, pruned_loss=0.03217, over 24746.00 frames. ], batch size: 66, lr: 6.49e-03, grad_scale: 32.0 2023-12-04 09:08:14,611 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 09:08:26,960 INFO [train.py:1119] (1/4) Epoch 38, validation: loss=0.1535, simple_loss=0.2525, pruned_loss=0.02723, over 944034.00 frames. 2023-12-04 09:08:26,960 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 09:08:33,503 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=220766.66666666666, ans=0.125 2023-12-04 09:08:41,350 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.01 vs. 
limit=22.5 2023-12-04 09:08:43,491 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=220833.33333333334, ans=0.125 2023-12-04 09:08:51,647 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=220900.0, ans=0.1 2023-12-04 09:08:58,603 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.075e+02 1.382e+02 1.554e+02 1.707e+02 2.505e+02, threshold=3.109e+02, percent-clipped=0.0 2023-12-04 09:09:11,736 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=221033.33333333334, ans=0.0 2023-12-04 09:09:22,249 INFO [train.py:1087] (1/4) Epoch 38, batch 50, loss[loss=0.1611, simple_loss=0.2521, pruned_loss=0.03506, over 24759.00 frames. ], tot_loss[loss=0.1648, simple_loss=0.2563, pruned_loss=0.03671, over 1077757.13 frames. ], batch size: 65, lr: 6.49e-03, grad_scale: 32.0 2023-12-04 09:09:33,596 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.02 vs. limit=15.0 2023-12-04 09:09:34,770 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.74 vs. limit=22.5 2023-12-04 09:09:35,482 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:09:36,408 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=221166.66666666666, ans=0.0 2023-12-04 09:09:36,506 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=221166.66666666666, ans=0.2 2023-12-04 09:09:56,708 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=221300.0, ans=0.0 2023-12-04 09:09:57,095 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.08 vs. limit=15.0 2023-12-04 09:10:02,748 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.71 vs. limit=10.0 2023-12-04 09:10:07,862 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.05 vs. limit=12.0 2023-12-04 09:10:10,782 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:10:16,399 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=221433.33333333334, ans=0.0 2023-12-04 09:10:17,215 INFO [train.py:1087] (1/4) Epoch 38, batch 100, loss[loss=0.1616, simple_loss=0.2543, pruned_loss=0.03447, over 24767.00 frames. ], tot_loss[loss=0.1646, simple_loss=0.256, pruned_loss=0.03663, over 1893312.90 frames. 
], batch size: 64, lr: 6.48e-03, grad_scale: 32.0 2023-12-04 09:10:18,886 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=221433.33333333334, ans=0.125 2023-12-04 09:10:27,724 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=221500.0, ans=0.2 2023-12-04 09:10:27,845 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=221500.0, ans=0.2 2023-12-04 09:10:35,586 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=221500.0, ans=0.125 2023-12-04 09:10:49,264 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.127e+02 1.289e+02 1.400e+02 1.532e+02 1.887e+02, threshold=2.800e+02, percent-clipped=0.0 2023-12-04 09:10:55,222 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=221633.33333333334, ans=0.0 2023-12-04 09:10:57,875 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.70 vs. limit=10.0 2023-12-04 09:10:59,441 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=221633.33333333334, ans=0.125 2023-12-04 09:11:12,388 INFO [train.py:1087] (1/4) Epoch 38, batch 150, loss[loss=0.1702, simple_loss=0.2586, pruned_loss=0.04087, over 21730.00 frames. ], tot_loss[loss=0.1633, simple_loss=0.255, pruned_loss=0.03583, over 2542860.81 frames. ], batch size: 127, lr: 6.48e-03, grad_scale: 32.0 2023-12-04 09:11:18,976 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=221766.66666666666, ans=0.1 2023-12-04 09:11:27,740 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=6.86 vs. limit=12.0 2023-12-04 09:11:40,409 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=221900.0, ans=0.2 2023-12-04 09:11:41,705 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.25 vs. limit=22.5 2023-12-04 09:11:43,629 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=221900.0, ans=0.125 2023-12-04 09:11:48,906 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=221966.66666666666, ans=0.2 2023-12-04 09:12:03,595 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=222033.33333333334, ans=22.5 2023-12-04 09:12:05,900 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.45 vs. limit=15.0 2023-12-04 09:12:07,458 INFO [train.py:1087] (1/4) Epoch 38, batch 200, loss[loss=0.1772, simple_loss=0.2656, pruned_loss=0.04444, over 23506.00 frames. ], tot_loss[loss=0.1628, simple_loss=0.2543, pruned_loss=0.03567, over 3047056.74 frames. 
], batch size: 94, lr: 6.47e-03, grad_scale: 32.0 2023-12-04 09:12:39,652 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.109e+02 1.338e+02 1.405e+02 1.498e+02 2.257e+02, threshold=2.810e+02, percent-clipped=0.0 2023-12-04 09:12:44,530 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=222300.0, ans=0.09899494936611666 2023-12-04 09:12:49,089 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.13 vs. limit=10.0 2023-12-04 09:13:03,357 INFO [train.py:1087] (1/4) Epoch 38, batch 250, loss[loss=0.169, simple_loss=0.2582, pruned_loss=0.03986, over 23497.00 frames. ], tot_loss[loss=0.1636, simple_loss=0.2548, pruned_loss=0.03617, over 3417900.15 frames. ], batch size: 94, lr: 6.47e-03, grad_scale: 32.0 2023-12-04 09:13:21,271 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=12.0 2023-12-04 09:13:28,597 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=222566.66666666666, ans=0.05 2023-12-04 09:13:40,583 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=222633.33333333334, ans=0.1 2023-12-04 09:13:58,205 INFO [train.py:1087] (1/4) Epoch 38, batch 300, loss[loss=0.1563, simple_loss=0.2482, pruned_loss=0.03223, over 24720.00 frames. ], tot_loss[loss=0.163, simple_loss=0.2545, pruned_loss=0.03573, over 3734270.10 frames. ], batch size: 67, lr: 6.46e-03, grad_scale: 32.0 2023-12-04 09:14:10,467 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=222833.33333333334, ans=0.125 2023-12-04 09:14:30,428 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.095e+02 1.310e+02 1.414e+02 1.549e+02 2.107e+02, threshold=2.828e+02, percent-clipped=0.0 2023-12-04 09:14:44,762 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=223033.33333333334, ans=0.125 2023-12-04 09:14:51,226 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=223033.33333333334, ans=0.1 2023-12-04 09:14:53,392 INFO [train.py:1087] (1/4) Epoch 38, batch 350, loss[loss=0.1705, simple_loss=0.261, pruned_loss=0.03998, over 23957.00 frames. ], tot_loss[loss=0.163, simple_loss=0.2543, pruned_loss=0.03587, over 3970455.41 frames. ], batch size: 87, lr: 6.46e-03, grad_scale: 32.0 2023-12-04 09:14:53,611 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=223100.0, ans=10.0 2023-12-04 09:15:06,009 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=223166.66666666666, ans=0.125 2023-12-04 09:15:48,428 INFO [train.py:1087] (1/4) Epoch 38, batch 400, loss[loss=0.1492, simple_loss=0.2399, pruned_loss=0.02925, over 24553.00 frames. ], tot_loss[loss=0.1634, simple_loss=0.2545, pruned_loss=0.03609, over 4140817.82 frames. 
], batch size: 66, lr: 6.45e-03, grad_scale: 32.0 2023-12-04 09:15:48,674 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=223433.33333333334, ans=0.0 2023-12-04 09:16:06,724 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=223500.0, ans=0.1 2023-12-04 09:16:09,897 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=223566.66666666666, ans=0.125 2023-12-04 09:16:15,127 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=223566.66666666666, ans=0.0 2023-12-04 09:16:20,067 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.114e+02 1.304e+02 1.431e+02 1.583e+02 2.238e+02, threshold=2.862e+02, percent-clipped=0.0 2023-12-04 09:16:32,888 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.45 vs. limit=15.0 2023-12-04 09:16:35,549 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=223700.0, ans=0.1 2023-12-04 09:16:39,602 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=223700.0, ans=0.035 2023-12-04 09:16:43,665 INFO [train.py:1087] (1/4) Epoch 38, batch 450, loss[loss=0.174, simple_loss=0.2668, pruned_loss=0.04061, over 24560.00 frames. ], tot_loss[loss=0.1634, simple_loss=0.2546, pruned_loss=0.03603, over 4272887.28 frames. ], batch size: 62, lr: 6.45e-03, grad_scale: 32.0 2023-12-04 09:16:49,352 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=223766.66666666666, ans=0.125 2023-12-04 09:17:11,473 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=223900.0, ans=0.125 2023-12-04 09:17:17,043 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2023-12-04 09:17:25,415 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=223966.66666666666, ans=0.125 2023-12-04 09:17:25,784 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.29 vs. limit=15.0 2023-12-04 09:17:29,747 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=224033.33333333334, ans=0.0 2023-12-04 09:17:34,791 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=224033.33333333334, ans=0.2 2023-12-04 09:17:39,387 INFO [train.py:1087] (1/4) Epoch 38, batch 500, loss[loss=0.159, simple_loss=0.2502, pruned_loss=0.0339, over 24763.00 frames. ], tot_loss[loss=0.1636, simple_loss=0.2548, pruned_loss=0.03621, over 4390287.74 frames. 
], batch size: 70, lr: 6.44e-03, grad_scale: 16.0 2023-12-04 09:17:40,651 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=224100.0, ans=0.2 2023-12-04 09:18:12,356 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.035e+02 1.312e+02 1.439e+02 1.556e+02 2.116e+02, threshold=2.878e+02, percent-clipped=0.0 2023-12-04 09:18:14,819 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_na.min_abs, batch_count=224300.0, ans=0.02 2023-12-04 09:18:19,204 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=224300.0, ans=0.125 2023-12-04 09:18:24,468 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=224366.66666666666, ans=0.0 2023-12-04 09:18:31,041 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.46 vs. limit=15.0 2023-12-04 09:18:31,840 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=224366.66666666666, ans=0.0 2023-12-04 09:18:33,760 INFO [train.py:1087] (1/4) Epoch 38, batch 550, loss[loss=0.1672, simple_loss=0.2601, pruned_loss=0.03715, over 24745.00 frames. ], tot_loss[loss=0.1629, simple_loss=0.2542, pruned_loss=0.03586, over 4500123.16 frames. ], batch size: 63, lr: 6.44e-03, grad_scale: 16.0 2023-12-04 09:18:34,392 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.47 vs. limit=15.0 2023-12-04 09:18:58,457 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=224566.66666666666, ans=0.0 2023-12-04 09:19:01,641 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=224566.66666666666, ans=0.035 2023-12-04 09:19:04,891 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=224566.66666666666, ans=0.125 2023-12-04 09:19:18,834 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=224700.0, ans=0.125 2023-12-04 09:19:28,930 INFO [train.py:1087] (1/4) Epoch 38, batch 600, loss[loss=0.1539, simple_loss=0.2491, pruned_loss=0.02937, over 24691.00 frames. ], tot_loss[loss=0.1625, simple_loss=0.2539, pruned_loss=0.03561, over 4564692.13 frames. ], batch size: 74, lr: 6.43e-03, grad_scale: 16.0 2023-12-04 09:19:34,528 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=224766.66666666666, ans=0.1 2023-12-04 09:19:56,359 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.23 vs. 
limit=15.0 2023-12-04 09:19:59,261 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=224900.0, ans=0.125 2023-12-04 09:20:02,202 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.187e+02 1.317e+02 1.414e+02 1.597e+02 2.110e+02, threshold=2.828e+02, percent-clipped=0.0 2023-12-04 09:20:04,625 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=224966.66666666666, ans=0.0 2023-12-04 09:20:12,755 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:20:15,094 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=225033.33333333334, ans=15.0 2023-12-04 09:20:24,620 INFO [train.py:1087] (1/4) Epoch 38, batch 650, loss[loss=0.1623, simple_loss=0.2592, pruned_loss=0.03273, over 24754.00 frames. ], tot_loss[loss=0.1628, simple_loss=0.254, pruned_loss=0.03578, over 4619868.15 frames. ], batch size: 66, lr: 6.43e-03, grad_scale: 16.0 2023-12-04 09:20:41,347 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.93 vs. limit=15.0 2023-12-04 09:20:48,249 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=225233.33333333334, ans=0.125 2023-12-04 09:21:06,691 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=225300.0, ans=0.1 2023-12-04 09:21:09,807 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=225366.66666666666, ans=0.0 2023-12-04 09:21:20,127 INFO [train.py:1087] (1/4) Epoch 38, batch 700, loss[loss=0.1581, simple_loss=0.2489, pruned_loss=0.03361, over 24711.00 frames. ], tot_loss[loss=0.1626, simple_loss=0.2539, pruned_loss=0.0356, over 4654524.55 frames. ], batch size: 69, lr: 6.43e-03, grad_scale: 16.0 2023-12-04 09:21:42,777 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=225566.66666666666, ans=0.2 2023-12-04 09:21:52,373 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=225633.33333333334, ans=0.125 2023-12-04 09:21:53,076 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.143e+02 1.338e+02 1.467e+02 1.594e+02 2.367e+02, threshold=2.933e+02, percent-clipped=0.0 2023-12-04 09:21:57,579 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=225633.33333333334, ans=0.125 2023-12-04 09:22:15,034 INFO [train.py:1087] (1/4) Epoch 38, batch 750, loss[loss=0.1438, simple_loss=0.2383, pruned_loss=0.02465, over 24570.00 frames. ], tot_loss[loss=0.1629, simple_loss=0.2541, pruned_loss=0.03581, over 4678682.95 frames. ], batch size: 64, lr: 6.42e-03, grad_scale: 16.0 2023-12-04 09:22:16,848 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.60 vs. 
limit=15.0 2023-12-04 09:22:26,094 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=225833.33333333334, ans=0.1 2023-12-04 09:22:48,866 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=225966.66666666666, ans=0.125 2023-12-04 09:23:01,105 INFO [scaling.py:1022] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.99 vs. limit=5.0 2023-12-04 09:23:10,309 INFO [train.py:1087] (1/4) Epoch 38, batch 800, loss[loss=0.1535, simple_loss=0.2419, pruned_loss=0.0326, over 24742.00 frames. ], tot_loss[loss=0.1622, simple_loss=0.2535, pruned_loss=0.03547, over 4715988.33 frames. ], batch size: 63, lr: 6.42e-03, grad_scale: 32.0 2023-12-04 09:23:18,132 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=226100.0, ans=0.0 2023-12-04 09:23:28,632 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=226166.66666666666, ans=0.125 2023-12-04 09:23:31,602 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=226233.33333333334, ans=0.125 2023-12-04 09:23:40,475 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=226300.0, ans=0.0 2023-12-04 09:23:41,318 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.109e+02 1.328e+02 1.439e+02 1.587e+02 2.613e+02, threshold=2.879e+02, percent-clipped=0.0 2023-12-04 09:23:41,829 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.28 vs. limit=15.0 2023-12-04 09:23:49,483 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:24:01,258 INFO [train.py:1087] (1/4) Epoch 38, batch 850, loss[loss=0.1707, simple_loss=0.2613, pruned_loss=0.04, over 24184.00 frames. ], tot_loss[loss=0.1625, simple_loss=0.2536, pruned_loss=0.03568, over 4741973.30 frames. ], batch size: 82, lr: 6.41e-03, grad_scale: 16.0 2023-12-04 09:24:21,640 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=226566.66666666666, ans=0.0 2023-12-04 09:24:26,518 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=226566.66666666666, ans=0.1 2023-12-04 09:24:33,381 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=226633.33333333334, ans=0.125 2023-12-04 09:24:57,875 INFO [train.py:1087] (1/4) Epoch 39, batch 0, loss[loss=0.1546, simple_loss=0.2477, pruned_loss=0.03068, over 24775.00 frames. ], tot_loss[loss=0.1546, simple_loss=0.2477, pruned_loss=0.03068, over 24775.00 frames. ], batch size: 70, lr: 6.32e-03, grad_scale: 32.0 2023-12-04 09:24:57,875 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 09:25:09,956 INFO [train.py:1119] (1/4) Epoch 39, validation: loss=0.1525, simple_loss=0.252, pruned_loss=0.02647, over 944034.00 frames. 
2023-12-04 09:25:09,956 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 09:25:11,433 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.62 vs. limit=22.5 2023-12-04 09:25:23,919 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=226800.0, ans=0.0 2023-12-04 09:25:32,559 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:25:43,592 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=226933.33333333334, ans=0.125 2023-12-04 09:25:45,565 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=226933.33333333334, ans=0.125 2023-12-04 09:25:48,267 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.98 vs. limit=10.0 2023-12-04 09:25:49,547 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.120e+02 1.344e+02 1.473e+02 1.644e+02 2.722e+02, threshold=2.946e+02, percent-clipped=0.0 2023-12-04 09:26:05,137 INFO [train.py:1087] (1/4) Epoch 39, batch 50, loss[loss=0.176, simple_loss=0.266, pruned_loss=0.04294, over 24229.00 frames. ], tot_loss[loss=0.1627, simple_loss=0.2543, pruned_loss=0.03552, over 1076498.83 frames. ], batch size: 82, lr: 6.32e-03, grad_scale: 32.0 2023-12-04 09:26:33,027 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=227200.0, ans=0.1 2023-12-04 09:26:59,243 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=227400.0, ans=0.0 2023-12-04 09:26:59,972 INFO [train.py:1087] (1/4) Epoch 39, batch 100, loss[loss=0.1664, simple_loss=0.2595, pruned_loss=0.03668, over 24507.00 frames. ], tot_loss[loss=0.1628, simple_loss=0.2546, pruned_loss=0.03548, over 1898392.13 frames. ], batch size: 75, lr: 6.32e-03, grad_scale: 32.0 2023-12-04 09:27:18,306 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=15.0 2023-12-04 09:27:18,971 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=227466.66666666666, ans=0.0 2023-12-04 09:27:31,646 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.24 vs. limit=15.0 2023-12-04 09:27:41,491 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.111e+02 1.334e+02 1.448e+02 1.608e+02 2.260e+02, threshold=2.895e+02, percent-clipped=0.0 2023-12-04 09:27:44,250 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.99 vs. limit=15.0 2023-12-04 09:27:51,072 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.92 vs. limit=15.0 2023-12-04 09:27:54,540 INFO [train.py:1087] (1/4) Epoch 39, batch 150, loss[loss=0.171, simple_loss=0.2633, pruned_loss=0.03934, over 24318.00 frames. 
], tot_loss[loss=0.163, simple_loss=0.2547, pruned_loss=0.03569, over 2537693.07 frames. ], batch size: 79, lr: 6.31e-03, grad_scale: 8.0 2023-12-04 09:27:54,796 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=227733.33333333334, ans=0.1 2023-12-04 09:28:01,838 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=227733.33333333334, ans=0.0 2023-12-04 09:28:04,873 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=227800.0, ans=0.125 2023-12-04 09:28:23,229 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=227866.66666666666, ans=0.125 2023-12-04 09:28:36,683 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=227933.33333333334, ans=0.125 2023-12-04 09:28:49,447 INFO [train.py:1087] (1/4) Epoch 39, batch 200, loss[loss=0.1677, simple_loss=0.2557, pruned_loss=0.03984, over 24478.00 frames. ], tot_loss[loss=0.1627, simple_loss=0.2541, pruned_loss=0.03562, over 3033992.16 frames. ], batch size: 77, lr: 6.31e-03, grad_scale: 8.0 2023-12-04 09:29:31,859 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.106e+02 1.311e+02 1.418e+02 1.591e+02 2.509e+02, threshold=2.835e+02, percent-clipped=0.0 2023-12-04 09:29:44,857 INFO [train.py:1087] (1/4) Epoch 39, batch 250, loss[loss=0.168, simple_loss=0.2608, pruned_loss=0.03756, over 23618.00 frames. ], tot_loss[loss=0.1623, simple_loss=0.2539, pruned_loss=0.03532, over 3430118.83 frames. ], batch size: 94, lr: 6.30e-03, grad_scale: 8.0 2023-12-04 09:29:59,197 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:30:20,694 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=228600.0, ans=0.0 2023-12-04 09:30:22,885 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=228600.0, ans=0.5 2023-12-04 09:30:27,072 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=228666.66666666666, ans=0.125 2023-12-04 09:30:29,705 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-12-04 09:30:29,823 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.00 vs. limit=15.0 2023-12-04 09:30:34,260 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.31 vs. limit=15.0 2023-12-04 09:30:37,305 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=228666.66666666666, ans=0.125 2023-12-04 09:30:39,543 INFO [train.py:1087] (1/4) Epoch 39, batch 300, loss[loss=0.1637, simple_loss=0.2552, pruned_loss=0.03607, over 24710.00 frames. ], tot_loss[loss=0.1619, simple_loss=0.2534, pruned_loss=0.03522, over 3757304.17 frames. 
], batch size: 74, lr: 6.30e-03, grad_scale: 8.0 2023-12-04 09:30:45,179 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=228733.33333333334, ans=0.025 2023-12-04 09:30:52,420 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=228800.0, ans=0.125 2023-12-04 09:31:08,704 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=228866.66666666666, ans=0.125 2023-12-04 09:31:21,184 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.172e+02 1.313e+02 1.429e+02 1.539e+02 2.008e+02, threshold=2.858e+02, percent-clipped=0.0 2023-12-04 09:31:24,704 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=229000.0, ans=0.0 2023-12-04 09:31:25,805 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=229000.0, ans=0.0 2023-12-04 09:31:33,041 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=229066.66666666666, ans=0.0 2023-12-04 09:31:34,280 INFO [train.py:1087] (1/4) Epoch 39, batch 350, loss[loss=0.159, simple_loss=0.2469, pruned_loss=0.03556, over 24791.00 frames. ], tot_loss[loss=0.1615, simple_loss=0.2531, pruned_loss=0.03498, over 3985642.33 frames. ], batch size: 73, lr: 6.29e-03, grad_scale: 8.0 2023-12-04 09:31:44,687 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=229133.33333333334, ans=0.125 2023-12-04 09:32:08,658 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=229266.66666666666, ans=0.1 2023-12-04 09:32:15,408 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=229266.66666666666, ans=0.125 2023-12-04 09:32:28,985 INFO [train.py:1087] (1/4) Epoch 39, batch 400, loss[loss=0.15, simple_loss=0.2432, pruned_loss=0.02845, over 24839.00 frames. ], tot_loss[loss=0.1612, simple_loss=0.2527, pruned_loss=0.03487, over 4163649.02 frames. ], batch size: 68, lr: 6.29e-03, grad_scale: 16.0 2023-12-04 09:32:33,583 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=229400.0, ans=0.125 2023-12-04 09:32:44,338 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.40 vs. limit=15.0 2023-12-04 09:32:57,700 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=229533.33333333334, ans=0.125 2023-12-04 09:33:05,223 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=229600.0, ans=0.125 2023-12-04 09:33:10,762 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.154e+02 1.303e+02 1.447e+02 1.594e+02 2.149e+02, threshold=2.894e+02, percent-clipped=0.0 2023-12-04 09:33:11,586 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.55 vs. limit=22.5 2023-12-04 09:33:24,246 INFO [train.py:1087] (1/4) Epoch 39, batch 450, loss[loss=0.1503, simple_loss=0.2421, pruned_loss=0.02922, over 24715.00 frames. 
], tot_loss[loss=0.1611, simple_loss=0.2525, pruned_loss=0.0348, over 4305472.48 frames. ], batch size: 69, lr: 6.28e-03, grad_scale: 16.0 2023-12-04 09:33:35,251 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=229800.0, ans=0.1 2023-12-04 09:33:38,563 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=229800.0, ans=0.1 2023-12-04 09:33:49,342 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=229866.66666666666, ans=0.95 2023-12-04 09:34:08,487 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=230000.0, ans=0.1 2023-12-04 09:34:10,518 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=230000.0, ans=0.1 2023-12-04 09:34:19,959 INFO [train.py:1087] (1/4) Epoch 39, batch 500, loss[loss=0.163, simple_loss=0.2568, pruned_loss=0.03462, over 24799.00 frames. ], tot_loss[loss=0.1614, simple_loss=0.2528, pruned_loss=0.03506, over 4413620.16 frames. ], batch size: 71, lr: 6.28e-03, grad_scale: 16.0 2023-12-04 09:34:23,308 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=230066.66666666666, ans=0.125 2023-12-04 09:34:23,361 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=230066.66666666666, ans=0.125 2023-12-04 09:34:51,324 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=230200.0, ans=0.125 2023-12-04 09:35:01,528 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.182e+02 1.360e+02 1.542e+02 1.706e+02 2.514e+02, threshold=3.084e+02, percent-clipped=0.0 2023-12-04 09:35:12,881 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=12.0 2023-12-04 09:35:14,377 INFO [train.py:1087] (1/4) Epoch 39, batch 550, loss[loss=0.1748, simple_loss=0.2631, pruned_loss=0.0433, over 24848.00 frames. ], tot_loss[loss=0.1614, simple_loss=0.2529, pruned_loss=0.03491, over 4514743.29 frames. 
], batch size: 68, lr: 6.28e-03, grad_scale: 16.0 2023-12-04 09:35:22,139 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=230400.0, ans=0.125 2023-12-04 09:35:23,278 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=230400.0, ans=0.1 2023-12-04 09:35:33,335 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=230466.66666666666, ans=0.2 2023-12-04 09:35:37,592 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=230533.33333333334, ans=0.0 2023-12-04 09:35:41,761 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=230533.33333333334, ans=0.125 2023-12-04 09:35:42,897 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=230533.33333333334, ans=0.125 2023-12-04 09:35:52,964 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.05 vs. limit=15.0 2023-12-04 09:36:11,138 INFO [train.py:1087] (1/4) Epoch 39, batch 600, loss[loss=0.1682, simple_loss=0.2551, pruned_loss=0.04063, over 24428.00 frames. ], tot_loss[loss=0.1612, simple_loss=0.2528, pruned_loss=0.03483, over 4595969.84 frames. ], batch size: 77, lr: 6.27e-03, grad_scale: 16.0 2023-12-04 09:36:11,829 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-12-04 09:36:38,468 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=230866.66666666666, ans=0.125 2023-12-04 09:36:51,564 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=230933.33333333334, ans=0.125 2023-12-04 09:36:53,450 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.071e+02 1.292e+02 1.379e+02 1.519e+02 2.695e+02, threshold=2.757e+02, percent-clipped=0.0 2023-12-04 09:37:07,290 INFO [train.py:1087] (1/4) Epoch 39, batch 650, loss[loss=0.1506, simple_loss=0.2437, pruned_loss=0.02875, over 24542.00 frames. ], tot_loss[loss=0.1609, simple_loss=0.2524, pruned_loss=0.03471, over 4638243.87 frames. ], batch size: 63, lr: 6.27e-03, grad_scale: 16.0 2023-12-04 09:37:29,794 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.15 vs. limit=15.0 2023-12-04 09:37:39,440 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=231266.66666666666, ans=0.125 2023-12-04 09:37:44,688 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=231266.66666666666, ans=0.0 2023-12-04 09:37:54,024 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.69 vs. 
limit=22.5 2023-12-04 09:37:59,644 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=231333.33333333334, ans=0.0 2023-12-04 09:38:02,650 INFO [train.py:1087] (1/4) Epoch 39, batch 700, loss[loss=0.1669, simple_loss=0.2578, pruned_loss=0.03805, over 24028.00 frames. ], tot_loss[loss=0.1611, simple_loss=0.2526, pruned_loss=0.03479, over 4677393.36 frames. ], batch size: 87, lr: 6.26e-03, grad_scale: 16.0 2023-12-04 09:38:11,806 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=231400.0, ans=0.1 2023-12-04 09:38:16,487 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.01 vs. limit=15.0 2023-12-04 09:38:45,052 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.144e+02 1.352e+02 1.512e+02 1.676e+02 2.495e+02, threshold=3.025e+02, percent-clipped=0.0 2023-12-04 09:38:45,671 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=231600.0, ans=10.0 2023-12-04 09:38:51,864 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=231666.66666666666, ans=0.125 2023-12-04 09:38:58,348 INFO [train.py:1087] (1/4) Epoch 39, batch 750, loss[loss=0.1446, simple_loss=0.2419, pruned_loss=0.02365, over 24785.00 frames. ], tot_loss[loss=0.1614, simple_loss=0.2528, pruned_loss=0.03499, over 4704115.53 frames. ], batch size: 73, lr: 6.26e-03, grad_scale: 16.0 2023-12-04 09:39:00,016 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=231733.33333333334, ans=0.1 2023-12-04 09:39:06,515 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=231733.33333333334, ans=0.0 2023-12-04 09:39:19,913 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=231866.66666666666, ans=22.5 2023-12-04 09:39:22,122 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.98 vs. limit=15.0 2023-12-04 09:39:28,821 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=231866.66666666666, ans=0.0 2023-12-04 09:39:39,367 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=231933.33333333334, ans=0.0 2023-12-04 09:39:46,004 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=232000.0, ans=0.0 2023-12-04 09:39:53,562 INFO [train.py:1087] (1/4) Epoch 39, batch 800, loss[loss=0.1461, simple_loss=0.2409, pruned_loss=0.02566, over 24752.00 frames. ], tot_loss[loss=0.1617, simple_loss=0.2531, pruned_loss=0.03516, over 4723766.72 frames. 
], batch size: 70, lr: 6.25e-03, grad_scale: 32.0 2023-12-04 09:40:03,483 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=232133.33333333334, ans=0.07 2023-12-04 09:40:22,614 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=232200.0, ans=0.035 2023-12-04 09:40:25,847 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.45 vs. limit=15.0 2023-12-04 09:40:32,427 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.187e+02 1.325e+02 1.411e+02 1.635e+02 2.142e+02, threshold=2.822e+02, percent-clipped=0.0 2023-12-04 09:40:43,763 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.91 vs. limit=15.0 2023-12-04 09:40:44,333 INFO [train.py:1087] (1/4) Epoch 39, batch 850, loss[loss=0.1649, simple_loss=0.2578, pruned_loss=0.03606, over 24072.00 frames. ], tot_loss[loss=0.1613, simple_loss=0.2529, pruned_loss=0.03487, over 4751819.68 frames. ], batch size: 87, lr: 6.25e-03, grad_scale: 32.0 2023-12-04 09:40:46,571 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=232400.0, ans=0.07 2023-12-04 09:40:56,704 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=232466.66666666666, ans=0.125 2023-12-04 09:41:00,711 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=232466.66666666666, ans=0.1 2023-12-04 09:41:03,613 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=232533.33333333334, ans=0.125 2023-12-04 09:41:09,776 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=232533.33333333334, ans=0.125 2023-12-04 09:41:12,803 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=232533.33333333334, ans=0.1 2023-12-04 09:41:42,989 INFO [train.py:1087] (1/4) Epoch 40, batch 0, loss[loss=0.1594, simple_loss=0.2488, pruned_loss=0.035, over 24512.00 frames. ], tot_loss[loss=0.1594, simple_loss=0.2488, pruned_loss=0.035, over 24512.00 frames. ], batch size: 75, lr: 6.17e-03, grad_scale: 32.0 2023-12-04 09:41:42,990 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 09:41:55,144 INFO [train.py:1119] (1/4) Epoch 40, validation: loss=0.1533, simple_loss=0.2521, pruned_loss=0.02723, over 944034.00 frames. 2023-12-04 09:41:55,145 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 09:41:59,974 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.34 vs. 
limit=15.0 2023-12-04 09:42:29,298 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=232900.0, ans=0.125 2023-12-04 09:42:39,841 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=232966.66666666666, ans=0.0 2023-12-04 09:42:42,084 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.101e+02 1.284e+02 1.404e+02 1.572e+02 2.699e+02, threshold=2.808e+02, percent-clipped=0.0 2023-12-04 09:42:50,007 INFO [train.py:1087] (1/4) Epoch 40, batch 50, loss[loss=0.161, simple_loss=0.2566, pruned_loss=0.03265, over 24004.00 frames. ], tot_loss[loss=0.1626, simple_loss=0.254, pruned_loss=0.03564, over 1079549.35 frames. ], batch size: 87, lr: 6.16e-03, grad_scale: 32.0 2023-12-04 09:42:52,604 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.72 vs. limit=15.0 2023-12-04 09:43:22,148 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff3.min_abs, batch_count=233233.33333333334, ans=0.2 2023-12-04 09:43:23,213 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=233233.33333333334, ans=0.0 2023-12-04 09:43:24,612 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.60 vs. limit=15.0 2023-12-04 09:43:45,069 INFO [train.py:1087] (1/4) Epoch 40, batch 100, loss[loss=0.1584, simple_loss=0.2513, pruned_loss=0.0328, over 24721.00 frames. ], tot_loss[loss=0.1618, simple_loss=0.2533, pruned_loss=0.03512, over 1899941.06 frames. ], batch size: 67, lr: 6.16e-03, grad_scale: 16.0 2023-12-04 09:43:56,266 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=233433.33333333334, ans=0.2 2023-12-04 09:44:04,996 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=233433.33333333334, ans=0.125 2023-12-04 09:44:30,874 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=233633.33333333334, ans=0.0 2023-12-04 09:44:32,786 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.103e+02 1.305e+02 1.412e+02 1.578e+02 2.389e+02, threshold=2.824e+02, percent-clipped=0.0 2023-12-04 09:44:33,048 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=233633.33333333334, ans=0.125 2023-12-04 09:44:36,638 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=233633.33333333334, ans=0.125 2023-12-04 09:44:39,542 INFO [train.py:1087] (1/4) Epoch 40, batch 150, loss[loss=0.1453, simple_loss=0.2376, pruned_loss=0.02657, over 24543.00 frames. ], tot_loss[loss=0.1615, simple_loss=0.253, pruned_loss=0.035, over 2553791.13 frames. 
], batch size: 66, lr: 6.15e-03, grad_scale: 16.0 2023-12-04 09:44:40,872 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=233700.0, ans=0.02 2023-12-04 09:44:43,039 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=233700.0, ans=0.1 2023-12-04 09:45:22,499 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=233900.0, ans=0.0 2023-12-04 09:45:34,897 INFO [train.py:1087] (1/4) Epoch 40, batch 200, loss[loss=0.1571, simple_loss=0.2525, pruned_loss=0.03085, over 24717.00 frames. ], tot_loss[loss=0.1611, simple_loss=0.2526, pruned_loss=0.03481, over 3068668.50 frames. ], batch size: 67, lr: 6.15e-03, grad_scale: 16.0 2023-12-04 09:45:45,030 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=234100.0, ans=0.0 2023-12-04 09:45:48,751 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=234100.0, ans=0.1 2023-12-04 09:45:48,868 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-12-04 09:45:58,133 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.39 vs. limit=15.0 2023-12-04 09:45:59,183 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.74 vs. limit=15.0 2023-12-04 09:46:24,458 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.063e+02 1.299e+02 1.417e+02 1.559e+02 2.264e+02, threshold=2.834e+02, percent-clipped=0.0 2023-12-04 09:46:30,821 INFO [train.py:1087] (1/4) Epoch 40, batch 250, loss[loss=0.2116, simple_loss=0.2913, pruned_loss=0.06597, over 16733.00 frames. ], tot_loss[loss=0.1613, simple_loss=0.253, pruned_loss=0.03479, over 3445103.74 frames. ], batch size: 177, lr: 6.15e-03, grad_scale: 16.0 2023-12-04 09:46:32,052 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=234366.66666666666, ans=0.125 2023-12-04 09:46:41,765 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=234433.33333333334, ans=0.1 2023-12-04 09:46:41,793 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=234433.33333333334, ans=0.2 2023-12-04 09:46:55,847 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=234500.0, ans=0.1 2023-12-04 09:46:56,983 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=234500.0, ans=0.0 2023-12-04 09:46:59,381 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.07 vs. 
limit=22.5 2023-12-04 09:47:09,814 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=234566.66666666666, ans=0.025 2023-12-04 09:47:10,883 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:47:16,267 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=234633.33333333334, ans=0.04949747468305833 2023-12-04 09:47:27,385 INFO [train.py:1087] (1/4) Epoch 40, batch 300, loss[loss=0.1549, simple_loss=0.2438, pruned_loss=0.03298, over 24765.00 frames. ], tot_loss[loss=0.1611, simple_loss=0.2527, pruned_loss=0.0347, over 3756855.86 frames. ], batch size: 66, lr: 6.14e-03, grad_scale: 16.0 2023-12-04 09:47:42,513 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=234766.66666666666, ans=0.125 2023-12-04 09:47:43,579 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=234766.66666666666, ans=0.2 2023-12-04 09:47:52,442 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=234833.33333333334, ans=0.0 2023-12-04 09:48:09,396 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:48:15,706 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.151e+02 1.298e+02 1.389e+02 1.491e+02 2.155e+02, threshold=2.778e+02, percent-clipped=0.0 2023-12-04 09:48:22,231 INFO [train.py:1087] (1/4) Epoch 40, batch 350, loss[loss=0.1475, simple_loss=0.2394, pruned_loss=0.02774, over 24743.00 frames. ], tot_loss[loss=0.1608, simple_loss=0.2523, pruned_loss=0.03463, over 3992093.63 frames. ], batch size: 63, lr: 6.14e-03, grad_scale: 16.0 2023-12-04 09:48:28,411 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=235033.33333333334, ans=0.0 2023-12-04 09:48:40,831 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:48:46,293 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:49:02,338 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=235233.33333333334, ans=0.0 2023-12-04 09:49:03,347 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=235233.33333333334, ans=0.0 2023-12-04 09:49:17,930 INFO [train.py:1087] (1/4) Epoch 40, batch 400, loss[loss=0.1699, simple_loss=0.2632, pruned_loss=0.03829, over 24555.00 frames. ], tot_loss[loss=0.1616, simple_loss=0.2531, pruned_loss=0.03506, over 4166673.33 frames. 
], batch size: 62, lr: 6.13e-03, grad_scale: 32.0 2023-12-04 09:49:23,598 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=235366.66666666666, ans=0.0 2023-12-04 09:49:48,474 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=235500.0, ans=0.1 2023-12-04 09:50:07,959 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.102e+02 1.305e+02 1.392e+02 1.533e+02 2.243e+02, threshold=2.783e+02, percent-clipped=0.0 2023-12-04 09:50:14,523 INFO [train.py:1087] (1/4) Epoch 40, batch 450, loss[loss=0.1504, simple_loss=0.2378, pruned_loss=0.03154, over 24746.00 frames. ], tot_loss[loss=0.1612, simple_loss=0.2527, pruned_loss=0.03486, over 4322639.13 frames. ], batch size: 61, lr: 6.13e-03, grad_scale: 32.0 2023-12-04 09:50:26,386 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=235766.66666666666, ans=0.2 2023-12-04 09:50:36,614 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.87 vs. limit=15.0 2023-12-04 09:50:49,558 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=235900.0, ans=0.125 2023-12-04 09:50:53,716 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=235900.0, ans=0.2 2023-12-04 09:51:10,712 INFO [train.py:1087] (1/4) Epoch 40, batch 500, loss[loss=0.1576, simple_loss=0.2459, pruned_loss=0.03458, over 24587.00 frames. ], tot_loss[loss=0.161, simple_loss=0.2527, pruned_loss=0.0347, over 4444655.33 frames. ], batch size: 64, lr: 6.12e-03, grad_scale: 32.0 2023-12-04 09:51:14,170 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=236033.33333333334, ans=0.125 2023-12-04 09:51:17,677 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=236033.33333333334, ans=0.2 2023-12-04 09:51:19,994 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=236033.33333333334, ans=0.1 2023-12-04 09:51:29,478 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=236100.0, ans=0.2 2023-12-04 09:51:38,828 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=236166.66666666666, ans=0.025 2023-12-04 09:51:50,316 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=236233.33333333334, ans=0.125 2023-12-04 09:51:57,959 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=236300.0, ans=0.0 2023-12-04 09:51:59,775 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.125e+02 1.340e+02 1.474e+02 1.657e+02 2.312e+02, threshold=2.948e+02, percent-clipped=0.0 2023-12-04 09:52:06,733 INFO [train.py:1087] (1/4) Epoch 40, batch 550, loss[loss=0.1578, simple_loss=0.2498, pruned_loss=0.03289, over 24572.00 frames. ], tot_loss[loss=0.1613, simple_loss=0.2531, pruned_loss=0.03479, over 4524638.02 frames. 
], batch size: 65, lr: 6.12e-03, grad_scale: 32.0 2023-12-04 09:52:18,531 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=236433.33333333334, ans=0.125 2023-12-04 09:52:30,571 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=236500.0, ans=0.125 2023-12-04 09:52:43,345 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 09:52:48,550 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=236566.66666666666, ans=0.1 2023-12-04 09:52:51,732 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=236633.33333333334, ans=0.0 2023-12-04 09:52:58,475 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=236633.33333333334, ans=0.125 2023-12-04 09:52:58,547 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=236633.33333333334, ans=0.0 2023-12-04 09:52:58,695 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.59 vs. limit=15.0 2023-12-04 09:53:02,505 INFO [train.py:1087] (1/4) Epoch 40, batch 600, loss[loss=0.1512, simple_loss=0.2454, pruned_loss=0.0285, over 24721.00 frames. ], tot_loss[loss=0.1612, simple_loss=0.2531, pruned_loss=0.03471, over 4584668.22 frames. ], batch size: 67, lr: 6.12e-03, grad_scale: 32.0 2023-12-04 09:53:13,146 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=236766.66666666666, ans=0.1 2023-12-04 09:53:21,120 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=236766.66666666666, ans=0.0 2023-12-04 09:53:49,092 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=236966.66666666666, ans=0.125 2023-12-04 09:53:52,195 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.146e+02 1.328e+02 1.464e+02 1.572e+02 2.479e+02, threshold=2.928e+02, percent-clipped=0.0 2023-12-04 09:53:59,054 INFO [train.py:1087] (1/4) Epoch 40, batch 650, loss[loss=0.1708, simple_loss=0.2609, pruned_loss=0.04038, over 24489.00 frames. ], tot_loss[loss=0.1613, simple_loss=0.2531, pruned_loss=0.03478, over 4627433.35 frames. 
], batch size: 77, lr: 6.11e-03, grad_scale: 32.0 2023-12-04 09:54:04,723 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=237033.33333333334, ans=0.1 2023-12-04 09:54:10,518 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=237100.0, ans=0.5 2023-12-04 09:54:16,597 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=237100.0, ans=0.125 2023-12-04 09:54:25,244 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=237166.66666666666, ans=0.07 2023-12-04 09:54:37,248 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0 2023-12-04 09:54:55,141 INFO [train.py:1087] (1/4) Epoch 40, batch 700, loss[loss=0.1794, simple_loss=0.2669, pruned_loss=0.04596, over 24499.00 frames. ], tot_loss[loss=0.1614, simple_loss=0.253, pruned_loss=0.03484, over 4672326.72 frames. ], batch size: 75, lr: 6.11e-03, grad_scale: 32.0 2023-12-04 09:54:55,370 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=5.035e-03 2023-12-04 09:55:10,060 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.32 vs. limit=15.0 2023-12-04 09:55:40,534 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.66 vs. limit=15.0 2023-12-04 09:55:44,494 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.112e+02 1.347e+02 1.457e+02 1.642e+02 2.295e+02, threshold=2.914e+02, percent-clipped=0.0 2023-12-04 09:55:51,366 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.26 vs. limit=6.0 2023-12-04 09:55:51,578 INFO [train.py:1087] (1/4) Epoch 40, batch 750, loss[loss=0.1557, simple_loss=0.2486, pruned_loss=0.03133, over 24713.00 frames. ], tot_loss[loss=0.1612, simple_loss=0.2529, pruned_loss=0.03473, over 4696805.60 frames. ], batch size: 74, lr: 6.10e-03, grad_scale: 32.0 2023-12-04 09:56:14,553 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=237833.33333333334, ans=0.2 2023-12-04 09:56:24,799 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=237900.0, ans=0.125 2023-12-04 09:56:35,752 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.48 vs. limit=22.5 2023-12-04 09:56:36,516 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=237966.66666666666, ans=0.125 2023-12-04 09:56:46,403 INFO [train.py:1087] (1/4) Epoch 40, batch 800, loss[loss=0.1737, simple_loss=0.2616, pruned_loss=0.04288, over 21291.00 frames. ], tot_loss[loss=0.161, simple_loss=0.2527, pruned_loss=0.03464, over 4728733.29 frames. 
], batch size: 127, lr: 6.10e-03, grad_scale: 32.0 2023-12-04 09:56:46,603 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=238033.33333333334, ans=0.0 2023-12-04 09:56:48,006 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=238033.33333333334, ans=0.0 2023-12-04 09:56:51,046 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=238033.33333333334, ans=0.125 2023-12-04 09:56:55,068 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=238033.33333333334, ans=0.125 2023-12-04 09:57:09,463 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=238166.66666666666, ans=10.0 2023-12-04 09:57:11,826 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.30 vs. limit=15.0 2023-12-04 09:57:29,636 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=238300.0, ans=0.2 2023-12-04 09:57:31,419 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.127e+02 1.267e+02 1.373e+02 1.459e+02 2.137e+02, threshold=2.746e+02, percent-clipped=0.0 2023-12-04 09:57:37,470 INFO [train.py:1087] (1/4) Epoch 40, batch 850, loss[loss=0.147, simple_loss=0.2372, pruned_loss=0.02842, over 24762.00 frames. ], tot_loss[loss=0.1607, simple_loss=0.2523, pruned_loss=0.03454, over 4765847.22 frames. ], batch size: 64, lr: 6.10e-03, grad_scale: 32.0 2023-12-04 09:57:49,519 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=238433.33333333334, ans=0.0 2023-12-04 09:57:58,748 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=238500.0, ans=0.125 2023-12-04 09:57:59,862 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=238500.0, ans=0.0 2023-12-04 09:58:11,990 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=238566.66666666666, ans=0.125 2023-12-04 09:58:36,662 INFO [train.py:1087] (1/4) Epoch 41, batch 0, loss[loss=0.1564, simple_loss=0.2514, pruned_loss=0.03073, over 24554.00 frames. ], tot_loss[loss=0.1564, simple_loss=0.2514, pruned_loss=0.03073, over 24554.00 frames. ], batch size: 66, lr: 6.02e-03, grad_scale: 32.0 2023-12-04 09:58:36,663 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 09:58:48,638 INFO [train.py:1119] (1/4) Epoch 41, validation: loss=0.1523, simple_loss=0.2513, pruned_loss=0.02667, over 944034.00 frames. 
2023-12-04 09:58:48,639 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 09:59:01,521 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=238733.33333333334, ans=0.04949747468305833 2023-12-04 09:59:18,095 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=238800.0, ans=0.035 2023-12-04 09:59:43,913 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.101e+02 1.329e+02 1.429e+02 1.632e+02 2.414e+02, threshold=2.858e+02, percent-clipped=0.0 2023-12-04 09:59:43,939 INFO [train.py:1087] (1/4) Epoch 41, batch 50, loss[loss=0.1694, simple_loss=0.2614, pruned_loss=0.03871, over 24287.00 frames. ], tot_loss[loss=0.1605, simple_loss=0.2524, pruned_loss=0.03424, over 1103148.55 frames. ], batch size: 79, lr: 6.01e-03, grad_scale: 32.0 2023-12-04 09:59:51,778 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.23 vs. limit=15.0 2023-12-04 10:00:13,409 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=239133.33333333334, ans=0.0 2023-12-04 10:00:27,715 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=239266.66666666666, ans=0.025 2023-12-04 10:00:28,903 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=239266.66666666666, ans=0.1 2023-12-04 10:00:32,993 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=239266.66666666666, ans=0.125 2023-12-04 10:00:35,135 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=239266.66666666666, ans=0.0 2023-12-04 10:00:39,565 INFO [train.py:1087] (1/4) Epoch 41, batch 100, loss[loss=0.1509, simple_loss=0.2438, pruned_loss=0.029, over 24816.00 frames. ], tot_loss[loss=0.161, simple_loss=0.2527, pruned_loss=0.03469, over 1915322.94 frames. ], batch size: 72, lr: 6.01e-03, grad_scale: 16.0 2023-12-04 10:00:43,573 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-12-04 10:00:52,145 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.08 vs. limit=15.0 2023-12-04 10:01:00,229 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=239400.0, ans=0.0 2023-12-04 10:01:01,418 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=239466.66666666666, ans=0.07 2023-12-04 10:01:16,890 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=239533.33333333334, ans=0.0 2023-12-04 10:01:32,609 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=239600.0, ans=0.0 2023-12-04 10:01:33,124 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.84 vs. 
limit=15.0 2023-12-04 10:01:35,547 INFO [train.py:1087] (1/4) Epoch 41, batch 150, loss[loss=0.1603, simple_loss=0.2493, pruned_loss=0.0357, over 24804.00 frames. ], tot_loss[loss=0.1603, simple_loss=0.252, pruned_loss=0.03429, over 2560466.82 frames. ], batch size: 62, lr: 6.00e-03, grad_scale: 16.0 2023-12-04 10:01:36,539 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.192e+02 1.310e+02 1.399e+02 1.503e+02 2.273e+02, threshold=2.797e+02, percent-clipped=0.0 2023-12-04 10:02:01,639 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=239800.0, ans=0.04949747468305833 2023-12-04 10:02:03,155 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.61 vs. limit=15.0 2023-12-04 10:02:33,840 INFO [train.py:1087] (1/4) Epoch 41, batch 200, loss[loss=0.1516, simple_loss=0.242, pruned_loss=0.03056, over 24789.00 frames. ], tot_loss[loss=0.1604, simple_loss=0.2521, pruned_loss=0.03433, over 3066739.24 frames. ], batch size: 71, lr: 6.00e-03, grad_scale: 16.0 2023-12-04 10:02:36,616 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2023-12-04 10:02:59,996 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.95 vs. limit=15.0 2023-12-04 10:03:00,573 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=240133.33333333334, ans=0.0 2023-12-04 10:03:01,706 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=240133.33333333334, ans=0.2 2023-12-04 10:03:29,198 INFO [train.py:1087] (1/4) Epoch 41, batch 250, loss[loss=0.1657, simple_loss=0.261, pruned_loss=0.0352, over 24546.00 frames. ], tot_loss[loss=0.16, simple_loss=0.2519, pruned_loss=0.03405, over 3462250.72 frames. ], batch size: 63, lr: 6.00e-03, grad_scale: 8.0 2023-12-04 10:03:31,681 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.119e+02 1.301e+02 1.390e+02 1.563e+02 2.001e+02, threshold=2.780e+02, percent-clipped=0.0 2023-12-04 10:03:52,920 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=240466.66666666666, ans=0.125 2023-12-04 10:04:20,953 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=240600.0, ans=0.0 2023-12-04 10:04:25,025 INFO [train.py:1087] (1/4) Epoch 41, batch 300, loss[loss=0.1482, simple_loss=0.2423, pruned_loss=0.02708, over 24805.00 frames. ], tot_loss[loss=0.16, simple_loss=0.252, pruned_loss=0.03404, over 3758791.01 frames. 
], batch size: 72, lr: 5.99e-03, grad_scale: 8.0 2023-12-04 10:04:28,911 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=240666.66666666666, ans=0.1 2023-12-04 10:04:39,790 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=240733.33333333334, ans=0.125 2023-12-04 10:05:02,763 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=240866.66666666666, ans=0.125 2023-12-04 10:05:13,870 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=240933.33333333334, ans=0.125 2023-12-04 10:05:17,693 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.97 vs. limit=15.0 2023-12-04 10:05:20,351 INFO [train.py:1087] (1/4) Epoch 41, batch 350, loss[loss=0.1562, simple_loss=0.2463, pruned_loss=0.03304, over 24472.00 frames. ], tot_loss[loss=0.1611, simple_loss=0.2525, pruned_loss=0.03482, over 3960267.20 frames. ], batch size: 77, lr: 5.99e-03, grad_scale: 8.0 2023-12-04 10:05:20,618 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=241000.0, ans=0.125 2023-12-04 10:05:22,891 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.090e+02 1.319e+02 1.439e+02 1.596e+02 2.414e+02, threshold=2.878e+02, percent-clipped=0.0 2023-12-04 10:05:23,152 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=241000.0, ans=0.0 2023-12-04 10:05:25,738 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=241000.0, ans=0.2 2023-12-04 10:05:44,189 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=241133.33333333334, ans=0.125 2023-12-04 10:05:51,988 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=241133.33333333334, ans=0.09899494936611666 2023-12-04 10:05:56,513 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=241200.0, ans=0.1 2023-12-04 10:05:56,693 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=241200.0, ans=0.125 2023-12-04 10:06:09,079 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=241266.66666666666, ans=0.125 2023-12-04 10:06:13,396 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=241266.66666666666, ans=0.125 2023-12-04 10:06:14,827 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.84 vs. limit=22.5 2023-12-04 10:06:15,426 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=241333.33333333334, ans=0.0 2023-12-04 10:06:16,600 INFO [train.py:1087] (1/4) Epoch 41, batch 400, loss[loss=0.1527, simple_loss=0.246, pruned_loss=0.02969, over 24564.00 frames. ], tot_loss[loss=0.1611, simple_loss=0.2525, pruned_loss=0.03481, over 4145614.13 frames. 
], batch size: 65, lr: 5.98e-03, grad_scale: 16.0 2023-12-04 10:06:22,134 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=241333.33333333334, ans=0.125 2023-12-04 10:06:37,029 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=241400.0, ans=0.125 2023-12-04 10:06:50,152 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.86 vs. limit=15.0 2023-12-04 10:07:01,310 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=241600.0, ans=0.0 2023-12-04 10:07:12,054 INFO [train.py:1087] (1/4) Epoch 41, batch 450, loss[loss=0.1549, simple_loss=0.2471, pruned_loss=0.03139, over 24773.00 frames. ], tot_loss[loss=0.1604, simple_loss=0.252, pruned_loss=0.03437, over 4301752.39 frames. ], batch size: 64, lr: 5.98e-03, grad_scale: 16.0 2023-12-04 10:07:14,201 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.126e+02 1.276e+02 1.409e+02 1.560e+02 1.984e+02, threshold=2.818e+02, percent-clipped=0.0 2023-12-04 10:07:16,572 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=241666.66666666666, ans=0.125 2023-12-04 10:07:27,731 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=241733.33333333334, ans=0.0 2023-12-04 10:07:52,356 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.44 vs. limit=15.0 2023-12-04 10:08:07,667 INFO [train.py:1087] (1/4) Epoch 41, batch 500, loss[loss=0.1681, simple_loss=0.2554, pruned_loss=0.04041, over 24539.00 frames. ], tot_loss[loss=0.1604, simple_loss=0.252, pruned_loss=0.03442, over 4416701.15 frames. ], batch size: 63, lr: 5.98e-03, grad_scale: 16.0 2023-12-04 10:08:30,806 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=242133.33333333334, ans=0.125 2023-12-04 10:08:57,723 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=242266.66666666666, ans=0.125 2023-12-04 10:09:02,998 INFO [train.py:1087] (1/4) Epoch 41, batch 550, loss[loss=0.1691, simple_loss=0.2581, pruned_loss=0.04003, over 24568.00 frames. ], tot_loss[loss=0.1599, simple_loss=0.2514, pruned_loss=0.03425, over 4510149.31 frames. ], batch size: 66, lr: 5.97e-03, grad_scale: 16.0 2023-12-04 10:09:05,084 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.114e+02 1.265e+02 1.332e+02 1.489e+02 1.835e+02, threshold=2.663e+02, percent-clipped=0.0 2023-12-04 10:09:08,016 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.49 vs. limit=15.0 2023-12-04 10:09:30,113 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.09 vs. 
limit=22.5 2023-12-04 10:09:33,340 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=242466.66666666666, ans=0.125 2023-12-04 10:09:35,477 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=242533.33333333334, ans=0.0 2023-12-04 10:09:55,762 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=242600.0, ans=0.0 2023-12-04 10:09:58,807 INFO [train.py:1087] (1/4) Epoch 41, batch 600, loss[loss=0.1696, simple_loss=0.2574, pruned_loss=0.04093, over 22958.00 frames. ], tot_loss[loss=0.16, simple_loss=0.2515, pruned_loss=0.03427, over 4579421.69 frames. ], batch size: 55, lr: 5.97e-03, grad_scale: 16.0 2023-12-04 10:10:10,029 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=242733.33333333334, ans=0.1 2023-12-04 10:10:28,382 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.40 vs. limit=15.0 2023-12-04 10:10:31,673 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=22.5 2023-12-04 10:10:42,610 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=242933.33333333334, ans=0.125 2023-12-04 10:10:42,624 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=242933.33333333334, ans=0.125 2023-12-04 10:10:47,523 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.56 vs. limit=12.0 2023-12-04 10:10:53,791 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=243000.0, ans=0.125 2023-12-04 10:10:54,635 INFO [train.py:1087] (1/4) Epoch 41, batch 650, loss[loss=0.1521, simple_loss=0.2445, pruned_loss=0.02984, over 23307.00 frames. ], tot_loss[loss=0.1602, simple_loss=0.2517, pruned_loss=0.03438, over 4611865.07 frames. ], batch size: 56, lr: 5.96e-03, grad_scale: 16.0 2023-12-04 10:10:56,721 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.134e+02 1.277e+02 1.430e+02 1.557e+02 2.173e+02, threshold=2.860e+02, percent-clipped=0.0 2023-12-04 10:10:58,169 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=243000.0, ans=0.2 2023-12-04 10:11:00,664 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.84 vs. limit=15.0 2023-12-04 10:11:31,651 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=243200.0, ans=0.125 2023-12-04 10:11:37,968 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 10:11:40,180 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.40 vs. limit=15.0 2023-12-04 10:11:50,474 INFO [train.py:1087] (1/4) Epoch 41, batch 700, loss[loss=0.1659, simple_loss=0.2564, pruned_loss=0.0377, over 24478.00 frames. 
], tot_loss[loss=0.1597, simple_loss=0.2513, pruned_loss=0.03399, over 4663996.09 frames. ], batch size: 75, lr: 5.96e-03, grad_scale: 16.0 2023-12-04 10:12:34,532 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.63 vs. limit=22.5 2023-12-04 10:12:45,929 INFO [train.py:1087] (1/4) Epoch 41, batch 750, loss[loss=0.1541, simple_loss=0.2506, pruned_loss=0.02877, over 24613.00 frames. ], tot_loss[loss=0.1593, simple_loss=0.2512, pruned_loss=0.03374, over 4714026.02 frames. ], batch size: 68, lr: 5.96e-03, grad_scale: 16.0 2023-12-04 10:12:47,830 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.89 vs. limit=22.5 2023-12-04 10:12:48,085 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.112e+02 1.288e+02 1.387e+02 1.474e+02 2.010e+02, threshold=2.774e+02, percent-clipped=0.0 2023-12-04 10:12:51,105 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.36 vs. limit=15.0 2023-12-04 10:12:59,726 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 10:13:01,763 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=243733.33333333334, ans=0.125 2023-12-04 10:13:03,213 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 10:13:09,186 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.53 vs. limit=15.0 2023-12-04 10:13:41,947 INFO [train.py:1087] (1/4) Epoch 41, batch 800, loss[loss=0.1508, simple_loss=0.2416, pruned_loss=0.03003, over 24861.00 frames. ], tot_loss[loss=0.1594, simple_loss=0.2512, pruned_loss=0.03386, over 4736845.24 frames. ], batch size: 68, lr: 5.95e-03, grad_scale: 32.0 2023-12-04 10:13:44,420 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=244000.0, ans=0.1 2023-12-04 10:13:58,077 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=244066.66666666666, ans=0.0 2023-12-04 10:14:31,678 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=244266.66666666666, ans=0.125 2023-12-04 10:14:33,541 INFO [train.py:1087] (1/4) Epoch 41, batch 850, loss[loss=0.1635, simple_loss=0.2544, pruned_loss=0.03632, over 24268.00 frames. ], tot_loss[loss=0.1602, simple_loss=0.2518, pruned_loss=0.03426, over 4730803.10 frames. 
], batch size: 79, lr: 5.95e-03, grad_scale: 32.0 2023-12-04 10:14:35,533 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.088e+02 1.320e+02 1.423e+02 1.595e+02 2.103e+02, threshold=2.847e+02, percent-clipped=0.0 2023-12-04 10:15:08,305 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=244533.33333333334, ans=0.2 2023-12-04 10:15:11,534 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=244533.33333333334, ans=10.0 2023-12-04 10:15:14,275 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=244600.0, ans=0.125 2023-12-04 10:15:34,189 INFO [train.py:1087] (1/4) Epoch 42, batch 0, loss[loss=0.1502, simple_loss=0.2447, pruned_loss=0.02786, over 24780.00 frames. ], tot_loss[loss=0.1502, simple_loss=0.2447, pruned_loss=0.02786, over 24780.00 frames. ], batch size: 73, lr: 5.87e-03, grad_scale: 32.0 2023-12-04 10:15:34,190 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 10:15:46,435 INFO [train.py:1119] (1/4) Epoch 42, validation: loss=0.1528, simple_loss=0.2516, pruned_loss=0.02702, over 944034.00 frames. 2023-12-04 10:15:46,435 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 10:15:48,792 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=244633.33333333334, ans=0.1 2023-12-04 10:15:51,357 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.60 vs. limit=22.5 2023-12-04 10:16:01,989 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=244700.0, ans=10.0 2023-12-04 10:16:02,337 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=244700.0, ans=6.0 2023-12-04 10:16:21,554 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.67 vs. limit=6.0 2023-12-04 10:16:40,568 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=244966.66666666666, ans=0.015 2023-12-04 10:16:41,514 INFO [train.py:1087] (1/4) Epoch 42, batch 50, loss[loss=0.1521, simple_loss=0.2431, pruned_loss=0.03051, over 24551.00 frames. ], tot_loss[loss=0.1582, simple_loss=0.25, pruned_loss=0.03321, over 1087998.14 frames. ], batch size: 66, lr: 5.87e-03, grad_scale: 32.0 2023-12-04 10:16:50,071 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.028e+02 1.299e+02 1.391e+02 1.562e+02 2.248e+02, threshold=2.782e+02, percent-clipped=0.0 2023-12-04 10:17:23,974 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=245166.66666666666, ans=0.125 2023-12-04 10:17:37,908 INFO [train.py:1087] (1/4) Epoch 42, batch 100, loss[loss=0.1654, simple_loss=0.2636, pruned_loss=0.03363, over 24737.00 frames. ], tot_loss[loss=0.1593, simple_loss=0.2514, pruned_loss=0.03357, over 1907494.96 frames. 
], batch size: 63, lr: 5.87e-03, grad_scale: 32.0 2023-12-04 10:18:10,746 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=245500.0, ans=0.1 2023-12-04 10:18:23,273 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=245566.66666666666, ans=0.125 2023-12-04 10:18:32,110 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-04 10:18:33,309 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=245633.33333333334, ans=0.1 2023-12-04 10:18:34,005 INFO [train.py:1087] (1/4) Epoch 42, batch 150, loss[loss=0.1486, simple_loss=0.2389, pruned_loss=0.02913, over 24761.00 frames. ], tot_loss[loss=0.1603, simple_loss=0.2524, pruned_loss=0.03412, over 2548062.46 frames. ], batch size: 70, lr: 5.86e-03, grad_scale: 32.0 2023-12-04 10:18:41,709 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.066e+02 1.301e+02 1.400e+02 1.505e+02 2.131e+02, threshold=2.800e+02, percent-clipped=0.0 2023-12-04 10:18:49,514 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=245700.0, ans=0.1 2023-12-04 10:18:51,869 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=245700.0, ans=0.2 2023-12-04 10:18:53,088 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=245700.0, ans=0.125 2023-12-04 10:19:05,304 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=245766.66666666666, ans=0.125 2023-12-04 10:19:10,024 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=245833.33333333334, ans=0.125 2023-12-04 10:19:19,493 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=245900.0, ans=0.125 2023-12-04 10:19:29,193 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.29 vs. limit=15.0 2023-12-04 10:19:29,590 INFO [train.py:1087] (1/4) Epoch 42, batch 200, loss[loss=0.1636, simple_loss=0.255, pruned_loss=0.03608, over 23998.00 frames. ], tot_loss[loss=0.1597, simple_loss=0.2518, pruned_loss=0.03385, over 3050119.25 frames. ], batch size: 87, lr: 5.86e-03, grad_scale: 16.0 2023-12-04 10:19:41,193 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=246033.33333333334, ans=0.125 2023-12-04 10:19:59,058 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=246100.0, ans=0.0 2023-12-04 10:20:13,454 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=246233.33333333334, ans=0.125 2023-12-04 10:20:25,032 INFO [train.py:1087] (1/4) Epoch 42, batch 250, loss[loss=0.1896, simple_loss=0.2725, pruned_loss=0.05338, over 16752.00 frames. ], tot_loss[loss=0.1597, simple_loss=0.2517, pruned_loss=0.03386, over 3429068.80 frames. 
], batch size: 176, lr: 5.85e-03, grad_scale: 16.0 2023-12-04 10:20:34,328 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.152e+02 1.303e+02 1.399e+02 1.549e+02 3.272e+02, threshold=2.797e+02, percent-clipped=1.0 2023-12-04 10:20:41,399 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=246366.66666666666, ans=0.1 2023-12-04 10:20:41,557 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-12-04 10:20:44,876 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=246366.66666666666, ans=0.0 2023-12-04 10:20:44,922 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=246366.66666666666, ans=0.0 2023-12-04 10:20:50,261 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=246433.33333333334, ans=0.125 2023-12-04 10:21:02,760 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=246500.0, ans=0.1 2023-12-04 10:21:09,012 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=246566.66666666666, ans=0.05 2023-12-04 10:21:11,192 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 10:21:19,605 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.93 vs. limit=15.0 2023-12-04 10:21:21,141 INFO [train.py:1087] (1/4) Epoch 42, batch 300, loss[loss=0.1675, simple_loss=0.2567, pruned_loss=0.03918, over 24323.00 frames. ], tot_loss[loss=0.1596, simple_loss=0.2514, pruned_loss=0.03391, over 3732182.64 frames. ], batch size: 79, lr: 5.85e-03, grad_scale: 16.0 2023-12-04 10:21:21,425 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=246633.33333333334, ans=0.0 2023-12-04 10:21:36,268 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=246700.0, ans=0.5 2023-12-04 10:21:41,596 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=246700.0, ans=0.1 2023-12-04 10:22:17,170 INFO [train.py:1087] (1/4) Epoch 42, batch 350, loss[loss=0.1577, simple_loss=0.2499, pruned_loss=0.03276, over 24553.00 frames. ], tot_loss[loss=0.1592, simple_loss=0.2511, pruned_loss=0.03367, over 3976523.85 frames. ], batch size: 66, lr: 5.85e-03, grad_scale: 16.0 2023-12-04 10:22:26,401 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.129e+02 1.309e+02 1.424e+02 1.589e+02 2.700e+02, threshold=2.847e+02, percent-clipped=0.0 2023-12-04 10:23:13,408 INFO [train.py:1087] (1/4) Epoch 42, batch 400, loss[loss=0.1608, simple_loss=0.2549, pruned_loss=0.03334, over 23405.00 frames. ], tot_loss[loss=0.1593, simple_loss=0.2514, pruned_loss=0.03365, over 4173146.61 frames. 
], batch size: 94, lr: 5.84e-03, grad_scale: 32.0 2023-12-04 10:23:22,084 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=247300.0, ans=0.125 2023-12-04 10:23:27,679 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=247366.66666666666, ans=0.125 2023-12-04 10:24:08,929 INFO [train.py:1087] (1/4) Epoch 42, batch 450, loss[loss=0.1476, simple_loss=0.2443, pruned_loss=0.02548, over 24550.00 frames. ], tot_loss[loss=0.159, simple_loss=0.2511, pruned_loss=0.0334, over 4313926.36 frames. ], batch size: 62, lr: 5.84e-03, grad_scale: 32.0 2023-12-04 10:24:12,797 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=247633.33333333334, ans=0.0 2023-12-04 10:24:17,052 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=247633.33333333334, ans=0.0 2023-12-04 10:24:17,793 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.077e+02 1.260e+02 1.360e+02 1.525e+02 2.031e+02, threshold=2.721e+02, percent-clipped=0.0 2023-12-04 10:24:25,726 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=247700.0, ans=0.125 2023-12-04 10:24:26,031 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.55 vs. limit=10.0 2023-12-04 10:24:27,159 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=247700.0, ans=0.0 2023-12-04 10:24:41,050 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.74 vs. limit=15.0 2023-12-04 10:24:47,467 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=247833.33333333334, ans=0.1 2023-12-04 10:25:03,933 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=247966.66666666666, ans=0.0 2023-12-04 10:25:04,716 INFO [train.py:1087] (1/4) Epoch 42, batch 500, loss[loss=0.1467, simple_loss=0.2407, pruned_loss=0.02639, over 24793.00 frames. ], tot_loss[loss=0.1596, simple_loss=0.2517, pruned_loss=0.03377, over 4421678.33 frames. ], batch size: 62, lr: 5.83e-03, grad_scale: 32.0 2023-12-04 10:25:09,471 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=247966.66666666666, ans=0.125 2023-12-04 10:25:20,328 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=248033.33333333334, ans=0.125 2023-12-04 10:25:39,702 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=248166.66666666666, ans=0.125 2023-12-04 10:25:46,188 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=248166.66666666666, ans=0.0 2023-12-04 10:25:54,001 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.10 vs. 
limit=15.0 2023-12-04 10:26:01,397 INFO [train.py:1087] (1/4) Epoch 42, batch 550, loss[loss=0.1555, simple_loss=0.2504, pruned_loss=0.0303, over 24751.00 frames. ], tot_loss[loss=0.1597, simple_loss=0.2518, pruned_loss=0.03383, over 4499450.57 frames. ], batch size: 65, lr: 5.83e-03, grad_scale: 32.0 2023-12-04 10:26:09,898 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.144e+02 1.331e+02 1.425e+02 1.593e+02 2.159e+02, threshold=2.850e+02, percent-clipped=0.0 2023-12-04 10:26:22,037 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=248366.66666666666, ans=0.125 2023-12-04 10:26:57,148 INFO [train.py:1087] (1/4) Epoch 42, batch 600, loss[loss=0.1467, simple_loss=0.239, pruned_loss=0.02722, over 24816.00 frames. ], tot_loss[loss=0.1597, simple_loss=0.2516, pruned_loss=0.03394, over 4571316.36 frames. ], batch size: 73, lr: 5.83e-03, grad_scale: 32.0 2023-12-04 10:26:59,461 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=248633.33333333334, ans=0.125 2023-12-04 10:27:29,652 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=248833.33333333334, ans=0.125 2023-12-04 10:27:30,554 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=248833.33333333334, ans=0.1 2023-12-04 10:27:31,659 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=248833.33333333334, ans=0.2 2023-12-04 10:27:34,197 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=248833.33333333334, ans=0.2 2023-12-04 10:27:37,621 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.04 vs. limit=15.0 2023-12-04 10:27:39,549 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=248833.33333333334, ans=0.07 2023-12-04 10:27:46,849 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=248900.0, ans=0.0 2023-12-04 10:27:52,676 INFO [train.py:1087] (1/4) Epoch 42, batch 650, loss[loss=0.158, simple_loss=0.2495, pruned_loss=0.03325, over 24859.00 frames. ], tot_loss[loss=0.1598, simple_loss=0.2516, pruned_loss=0.03399, over 4643877.06 frames. 
], batch size: 68, lr: 5.82e-03, grad_scale: 32.0 2023-12-04 10:28:01,587 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.069e+02 1.365e+02 1.477e+02 1.651e+02 2.256e+02, threshold=2.955e+02, percent-clipped=0.0 2023-12-04 10:28:01,811 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=248966.66666666666, ans=0.125 2023-12-04 10:28:15,095 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=249100.0, ans=0.2 2023-12-04 10:28:26,888 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=249166.66666666666, ans=0.125 2023-12-04 10:28:30,147 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=249166.66666666666, ans=0.1 2023-12-04 10:28:38,988 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=249233.33333333334, ans=0.125 2023-12-04 10:28:46,691 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=249233.33333333334, ans=0.125 2023-12-04 10:28:48,489 INFO [train.py:1087] (1/4) Epoch 42, batch 700, loss[loss=0.1477, simple_loss=0.2393, pruned_loss=0.0281, over 24693.00 frames. ], tot_loss[loss=0.1598, simple_loss=0.2516, pruned_loss=0.03394, over 4668911.36 frames. ], batch size: 74, lr: 5.82e-03, grad_scale: 32.0 2023-12-04 10:28:53,378 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=249300.0, ans=0.125 2023-12-04 10:29:02,800 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=249366.66666666666, ans=0.125 2023-12-04 10:29:13,910 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=249433.33333333334, ans=0.1 2023-12-04 10:29:36,269 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=249566.66666666666, ans=0.0 2023-12-04 10:29:43,183 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff2.min_abs, batch_count=249566.66666666666, ans=0.1 2023-12-04 10:29:44,778 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.48 vs. limit=22.5 2023-12-04 10:29:45,063 INFO [train.py:1087] (1/4) Epoch 42, batch 750, loss[loss=0.201, simple_loss=0.2778, pruned_loss=0.06214, over 17493.00 frames. ], tot_loss[loss=0.1599, simple_loss=0.2517, pruned_loss=0.03406, over 4685395.33 frames. ], batch size: 177, lr: 5.82e-03, grad_scale: 32.0 2023-12-04 10:29:48,721 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.87 vs. 
limit=15.0 2023-12-04 10:29:49,598 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=249633.33333333334, ans=0.125 2023-12-04 10:29:51,713 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=249633.33333333334, ans=0.09899494936611666 2023-12-04 10:29:53,628 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.164e+02 1.308e+02 1.406e+02 1.557e+02 2.045e+02, threshold=2.813e+02, percent-clipped=0.0 2023-12-04 10:29:53,875 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=249633.33333333334, ans=0.125 2023-12-04 10:30:02,332 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=249700.0, ans=0.0 2023-12-04 10:30:10,761 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=249766.66666666666, ans=0.125 2023-12-04 10:30:29,825 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=249900.0, ans=0.125 2023-12-04 10:30:33,019 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=249900.0, ans=0.1 2023-12-04 10:30:33,654 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.35 vs. limit=12.0 2023-12-04 10:30:39,628 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=249900.0, ans=0.07 2023-12-04 10:30:42,239 INFO [train.py:1087] (1/4) Epoch 42, batch 800, loss[loss=0.1644, simple_loss=0.2588, pruned_loss=0.035, over 21550.00 frames. ], tot_loss[loss=0.1595, simple_loss=0.2512, pruned_loss=0.0339, over 4721998.95 frames. ], batch size: 127, lr: 5.81e-03, grad_scale: 32.0 2023-12-04 10:30:45,762 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=249966.66666666666, ans=0.125 2023-12-04 10:30:53,015 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=250033.33333333334, ans=0.125 2023-12-04 10:31:14,428 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=250166.66666666666, ans=0.1 2023-12-04 10:31:15,391 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff2.min_abs, batch_count=250166.66666666666, ans=0.1 2023-12-04 10:31:26,109 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=250233.33333333334, ans=0.1 2023-12-04 10:31:33,893 INFO [train.py:1087] (1/4) Epoch 42, batch 850, loss[loss=0.1532, simple_loss=0.2468, pruned_loss=0.02983, over 24773.00 frames. ], tot_loss[loss=0.1597, simple_loss=0.2513, pruned_loss=0.03411, over 4751724.74 frames. 
], batch size: 64, lr: 5.81e-03, grad_scale: 32.0 2023-12-04 10:31:41,982 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.120e+02 1.298e+02 1.360e+02 1.535e+02 1.939e+02, threshold=2.720e+02, percent-clipped=0.0 2023-12-04 10:31:49,306 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=250366.66666666666, ans=0.0 2023-12-04 10:32:32,950 INFO [train.py:1087] (1/4) Epoch 43, batch 0, loss[loss=0.1647, simple_loss=0.2587, pruned_loss=0.03532, over 23597.00 frames. ], tot_loss[loss=0.1647, simple_loss=0.2587, pruned_loss=0.03532, over 23597.00 frames. ], batch size: 94, lr: 5.74e-03, grad_scale: 32.0 2023-12-04 10:32:32,951 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 10:32:41,876 INFO [zipformer.py:1876] (1/4) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([3.1440, 4.6658, 4.2102, 4.8924], device='cuda:1') 2023-12-04 10:32:45,209 INFO [train.py:1119] (1/4) Epoch 43, validation: loss=0.1523, simple_loss=0.2509, pruned_loss=0.02682, over 944034.00 frames. 2023-12-04 10:32:45,210 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 10:32:51,504 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=250600.0, ans=0.125 2023-12-04 10:32:55,546 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=250666.66666666666, ans=0.125 2023-12-04 10:32:57,394 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.whiten.whitening_limit, batch_count=250666.66666666666, ans=12.0 2023-12-04 10:33:18,045 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.68 vs. limit=15.0 2023-12-04 10:33:27,084 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=250800.0, ans=0.125 2023-12-04 10:33:40,670 INFO [train.py:1087] (1/4) Epoch 43, batch 50, loss[loss=0.1646, simple_loss=0.2582, pruned_loss=0.03549, over 24553.00 frames. ], tot_loss[loss=0.1603, simple_loss=0.2519, pruned_loss=0.03434, over 1098319.10 frames. ], batch size: 63, lr: 5.73e-03, grad_scale: 32.0 2023-12-04 10:33:56,847 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.150e+02 1.362e+02 1.452e+02 1.646e+02 2.202e+02, threshold=2.904e+02, percent-clipped=0.0 2023-12-04 10:34:08,680 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=251066.66666666666, ans=0.125 2023-12-04 10:34:12,894 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=251133.33333333334, ans=0.125 2023-12-04 10:34:15,114 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=251133.33333333334, ans=0.0 2023-12-04 10:34:36,623 INFO [train.py:1087] (1/4) Epoch 43, batch 100, loss[loss=0.1523, simple_loss=0.2457, pruned_loss=0.02947, over 24780.00 frames. ], tot_loss[loss=0.1594, simple_loss=0.251, pruned_loss=0.03393, over 1923643.49 frames. 
], batch size: 62, lr: 5.73e-03, grad_scale: 32.0 2023-12-04 10:34:59,388 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=251400.0, ans=0.0 2023-12-04 10:35:09,461 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=251466.66666666666, ans=0.125 2023-12-04 10:35:18,311 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.78 vs. limit=12.0 2023-12-04 10:35:27,237 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=251533.33333333334, ans=0.0 2023-12-04 10:35:33,175 INFO [train.py:1087] (1/4) Epoch 43, batch 150, loss[loss=0.1528, simple_loss=0.2479, pruned_loss=0.02882, over 24758.00 frames. ], tot_loss[loss=0.1589, simple_loss=0.2509, pruned_loss=0.03343, over 2571325.61 frames. ], batch size: 65, lr: 5.73e-03, grad_scale: 32.0 2023-12-04 10:35:37,755 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=251600.0, ans=0.125 2023-12-04 10:35:38,818 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=251600.0, ans=0.125 2023-12-04 10:35:48,107 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.117e+02 1.291e+02 1.391e+02 1.529e+02 2.044e+02, threshold=2.781e+02, percent-clipped=0.0 2023-12-04 10:36:05,602 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=251800.0, ans=0.1 2023-12-04 10:36:28,321 INFO [train.py:1087] (1/4) Epoch 43, batch 200, loss[loss=0.1599, simple_loss=0.2509, pruned_loss=0.03443, over 24696.00 frames. ], tot_loss[loss=0.1594, simple_loss=0.2512, pruned_loss=0.03382, over 3067705.66 frames. ], batch size: 74, lr: 5.72e-03, grad_scale: 32.0 2023-12-04 10:36:29,630 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=251933.33333333334, ans=0.2 2023-12-04 10:36:35,639 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=251933.33333333334, ans=0.0 2023-12-04 10:36:36,647 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=251933.33333333334, ans=0.0 2023-12-04 10:36:42,624 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=252000.0, ans=0.0 2023-12-04 10:36:56,621 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=252066.66666666666, ans=0.0 2023-12-04 10:37:03,719 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=252133.33333333334, ans=0.0 2023-12-04 10:37:14,936 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=252200.0, ans=0.0 2023-12-04 10:37:24,309 INFO [train.py:1087] (1/4) Epoch 43, batch 250, loss[loss=0.1573, simple_loss=0.2518, pruned_loss=0.03145, over 24712.00 frames. ], tot_loss[loss=0.1593, simple_loss=0.251, pruned_loss=0.03374, over 3458212.87 frames. 
], batch size: 69, lr: 5.72e-03, grad_scale: 16.0 2023-12-04 10:37:27,899 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-12-04 10:37:41,479 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.129e+02 1.276e+02 1.375e+02 1.506e+02 1.790e+02, threshold=2.750e+02, percent-clipped=0.0 2023-12-04 10:38:06,765 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=252466.66666666666, ans=0.125 2023-12-04 10:38:09,353 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=252533.33333333334, ans=0.125 2023-12-04 10:38:20,494 INFO [train.py:1087] (1/4) Epoch 43, batch 300, loss[loss=0.1663, simple_loss=0.263, pruned_loss=0.03478, over 22849.00 frames. ], tot_loss[loss=0.1596, simple_loss=0.2514, pruned_loss=0.0339, over 3752215.86 frames. ], batch size: 106, lr: 5.71e-03, grad_scale: 16.0 2023-12-04 10:38:20,732 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=252600.0, ans=0.125 2023-12-04 10:38:22,911 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=252600.0, ans=0.125 2023-12-04 10:38:24,495 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.66 vs. limit=10.0 2023-12-04 10:38:58,863 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-12-04 10:39:01,030 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=252800.0, ans=15.0 2023-12-04 10:39:08,613 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=252866.66666666666, ans=0.125 2023-12-04 10:39:16,593 INFO [train.py:1087] (1/4) Epoch 43, batch 350, loss[loss=0.1738, simple_loss=0.2653, pruned_loss=0.04113, over 24281.00 frames. ], tot_loss[loss=0.1596, simple_loss=0.2513, pruned_loss=0.03391, over 3986568.31 frames. ], batch size: 79, lr: 5.71e-03, grad_scale: 16.0 2023-12-04 10:39:33,092 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.120e+02 1.311e+02 1.434e+02 1.594e+02 2.368e+02, threshold=2.867e+02, percent-clipped=0.0 2023-12-04 10:39:43,976 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=253066.66666666666, ans=0.125 2023-12-04 10:40:12,086 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-12-04 10:40:12,457 INFO [train.py:1087] (1/4) Epoch 43, batch 400, loss[loss=0.1589, simple_loss=0.2516, pruned_loss=0.03309, over 24135.00 frames. ], tot_loss[loss=0.1601, simple_loss=0.2519, pruned_loss=0.03418, over 4168779.78 frames. 
], batch size: 82, lr: 5.71e-03, grad_scale: 32.0 2023-12-04 10:40:20,781 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=253266.66666666666, ans=0.0 2023-12-04 10:40:48,851 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=253466.66666666666, ans=0.02 2023-12-04 10:40:52,298 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=253466.66666666666, ans=0.1 2023-12-04 10:40:52,635 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.82 vs. limit=15.0 2023-12-04 10:40:52,765 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.30 vs. limit=12.0 2023-12-04 10:41:02,457 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.10 vs. limit=15.0 2023-12-04 10:41:08,419 INFO [train.py:1087] (1/4) Epoch 43, batch 450, loss[loss=0.1534, simple_loss=0.2441, pruned_loss=0.03134, over 24613.00 frames. ], tot_loss[loss=0.1597, simple_loss=0.2514, pruned_loss=0.03402, over 4314686.53 frames. ], batch size: 68, lr: 5.70e-03, grad_scale: 32.0 2023-12-04 10:41:25,332 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.083e+02 1.283e+02 1.403e+02 1.492e+02 2.872e+02, threshold=2.807e+02, percent-clipped=1.0 2023-12-04 10:41:36,243 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=253733.33333333334, ans=0.125 2023-12-04 10:41:46,507 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=253800.0, ans=0.0 2023-12-04 10:41:50,027 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=253800.0, ans=0.07 2023-12-04 10:41:59,920 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=253866.66666666666, ans=0.2 2023-12-04 10:42:03,951 INFO [train.py:1087] (1/4) Epoch 43, batch 500, loss[loss=0.1665, simple_loss=0.2569, pruned_loss=0.03807, over 24475.00 frames. ], tot_loss[loss=0.1594, simple_loss=0.2509, pruned_loss=0.03392, over 4431850.04 frames. ], batch size: 77, lr: 5.70e-03, grad_scale: 32.0 2023-12-04 10:42:06,756 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.41 vs. 
limit=15.0 2023-12-04 10:42:10,425 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=253933.33333333334, ans=0.1 2023-12-04 10:42:18,506 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=254000.0, ans=0.125 2023-12-04 10:42:23,228 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=254000.0, ans=0.1 2023-12-04 10:42:26,808 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=254066.66666666666, ans=0.0 2023-12-04 10:42:51,339 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=254200.0, ans=0.125 2023-12-04 10:42:54,065 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=254200.0, ans=0.125 2023-12-04 10:42:56,387 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=254200.0, ans=0.1 2023-12-04 10:43:00,412 INFO [train.py:1087] (1/4) Epoch 43, batch 550, loss[loss=0.15, simple_loss=0.2423, pruned_loss=0.02882, over 24612.00 frames. ], tot_loss[loss=0.1593, simple_loss=0.2509, pruned_loss=0.03391, over 4523830.89 frames. ], batch size: 68, lr: 5.70e-03, grad_scale: 32.0 2023-12-04 10:43:06,083 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 10:43:06,117 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=254266.66666666666, ans=0.2 2023-12-04 10:43:16,751 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.141e+02 1.284e+02 1.383e+02 1.508e+02 2.379e+02, threshold=2.767e+02, percent-clipped=0.0 2023-12-04 10:43:21,899 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=254400.0, ans=0.0 2023-12-04 10:43:35,474 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=254466.66666666666, ans=0.0 2023-12-04 10:43:35,522 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=254466.66666666666, ans=0.2 2023-12-04 10:43:39,768 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=254466.66666666666, ans=0.0 2023-12-04 10:43:56,355 INFO [train.py:1087] (1/4) Epoch 43, batch 600, loss[loss=0.1501, simple_loss=0.244, pruned_loss=0.02814, over 24681.00 frames. ], tot_loss[loss=0.1591, simple_loss=0.2506, pruned_loss=0.03383, over 4575956.53 frames. 
], batch size: 74, lr: 5.69e-03, grad_scale: 16.0 2023-12-04 10:43:58,146 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=254600.0, ans=0.1 2023-12-04 10:44:11,431 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=254666.66666666666, ans=0.015 2023-12-04 10:44:14,771 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=254666.66666666666, ans=0.1 2023-12-04 10:44:24,995 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=254733.33333333334, ans=0.125 2023-12-04 10:44:34,505 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=254800.0, ans=0.0 2023-12-04 10:44:43,893 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=254866.66666666666, ans=0.0 2023-12-04 10:44:52,169 INFO [train.py:1087] (1/4) Epoch 43, batch 650, loss[loss=0.1593, simple_loss=0.2527, pruned_loss=0.0329, over 24773.00 frames. ], tot_loss[loss=0.1589, simple_loss=0.2505, pruned_loss=0.03362, over 4643998.47 frames. ], batch size: 64, lr: 5.69e-03, grad_scale: 16.0 2023-12-04 10:44:59,163 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=254933.33333333334, ans=0.1 2023-12-04 10:45:03,175 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.66 vs. limit=6.0 2023-12-04 10:45:06,980 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=255000.0, ans=0.1 2023-12-04 10:45:10,801 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.137e+02 1.306e+02 1.396e+02 1.562e+02 2.104e+02, threshold=2.792e+02, percent-clipped=0.0 2023-12-04 10:45:17,502 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=255066.66666666666, ans=0.125 2023-12-04 10:45:19,611 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=255066.66666666666, ans=0.2 2023-12-04 10:45:19,660 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=255066.66666666666, ans=0.125 2023-12-04 10:45:40,180 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=255200.0, ans=0.2 2023-12-04 10:45:42,317 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=255200.0, ans=0.125 2023-12-04 10:45:48,311 INFO [train.py:1087] (1/4) Epoch 43, batch 700, loss[loss=0.1568, simple_loss=0.2497, pruned_loss=0.032, over 24762.00 frames. ], tot_loss[loss=0.1591, simple_loss=0.2507, pruned_loss=0.03372, over 4679154.65 frames. 
], batch size: 64, lr: 5.69e-03, grad_scale: 16.0 2023-12-04 10:46:15,540 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=255400.0, ans=0.2 2023-12-04 10:46:31,589 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=255533.33333333334, ans=0.1 2023-12-04 10:46:44,375 INFO [train.py:1087] (1/4) Epoch 43, batch 750, loss[loss=0.1637, simple_loss=0.2538, pruned_loss=0.03679, over 24146.00 frames. ], tot_loss[loss=0.1595, simple_loss=0.251, pruned_loss=0.03398, over 4705658.98 frames. ], batch size: 82, lr: 5.68e-03, grad_scale: 16.0 2023-12-04 10:47:01,890 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.310e+02 1.439e+02 1.594e+02 2.322e+02, threshold=2.878e+02, percent-clipped=0.0 2023-12-04 10:47:10,411 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.70 vs. limit=15.0 2023-12-04 10:47:32,179 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=255866.66666666666, ans=0.125 2023-12-04 10:47:39,260 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=255933.33333333334, ans=0.125 2023-12-04 10:47:40,074 INFO [train.py:1087] (1/4) Epoch 43, batch 800, loss[loss=0.1664, simple_loss=0.2552, pruned_loss=0.03878, over 24496.00 frames. ], tot_loss[loss=0.16, simple_loss=0.2514, pruned_loss=0.0343, over 4714702.25 frames. ], batch size: 75, lr: 5.68e-03, grad_scale: 32.0 2023-12-04 10:47:43,615 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=255933.33333333334, ans=0.125 2023-12-04 10:47:57,606 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.53 vs. limit=22.5 2023-12-04 10:48:03,374 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=256066.66666666666, ans=10.0 2023-12-04 10:48:10,358 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=256066.66666666666, ans=0.125 2023-12-04 10:48:14,581 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=256133.33333333334, ans=0.125 2023-12-04 10:48:22,270 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=256200.0, ans=0.015 2023-12-04 10:48:28,488 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=256200.0, ans=0.125 2023-12-04 10:48:32,505 INFO [train.py:1087] (1/4) Epoch 43, batch 850, loss[loss=0.1642, simple_loss=0.2594, pruned_loss=0.0345, over 21369.00 frames. ], tot_loss[loss=0.1599, simple_loss=0.2514, pruned_loss=0.03424, over 4737596.51 frames. ], batch size: 127, lr: 5.67e-03, grad_scale: 32.0 2023-12-04 10:48:38,255 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.79 vs. 
limit=15.0 2023-12-04 10:48:38,956 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=256266.66666666666, ans=0.1 2023-12-04 10:48:48,708 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.110e+02 1.268e+02 1.363e+02 1.524e+02 1.914e+02, threshold=2.726e+02, percent-clipped=0.0 2023-12-04 10:49:06,435 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=256466.66666666666, ans=0.125 2023-12-04 10:49:16,166 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=256533.33333333334, ans=0.125 2023-12-04 10:49:31,984 INFO [train.py:1087] (1/4) Epoch 44, batch 0, loss[loss=0.1489, simple_loss=0.247, pruned_loss=0.02541, over 24716.00 frames. ], tot_loss[loss=0.1489, simple_loss=0.247, pruned_loss=0.02541, over 24716.00 frames. ], batch size: 74, lr: 5.61e-03, grad_scale: 32.0 2023-12-04 10:49:31,984 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 10:49:44,325 INFO [train.py:1119] (1/4) Epoch 44, validation: loss=0.1512, simple_loss=0.2503, pruned_loss=0.02602, over 944034.00 frames. 2023-12-04 10:49:44,326 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 10:50:38,889 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-04 10:50:39,600 INFO [train.py:1087] (1/4) Epoch 44, batch 50, loss[loss=0.1716, simple_loss=0.2632, pruned_loss=0.03998, over 22784.00 frames. ], tot_loss[loss=0.1578, simple_loss=0.2499, pruned_loss=0.03289, over 1100835.32 frames. ], batch size: 106, lr: 5.60e-03, grad_scale: 32.0 2023-12-04 10:50:43,408 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=256900.0, ans=0.1 2023-12-04 10:50:44,808 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=256900.0, ans=15.0 2023-12-04 10:50:49,046 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=256900.0, ans=0.0 2023-12-04 10:50:55,512 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=256966.66666666666, ans=0.125 2023-12-04 10:51:04,412 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.095e+02 1.346e+02 1.454e+02 1.617e+02 2.444e+02, threshold=2.907e+02, percent-clipped=0.0 2023-12-04 10:51:06,832 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=257033.33333333334, ans=0.1 2023-12-04 10:51:22,415 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=257100.0, ans=0.2 2023-12-04 10:51:24,459 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=257166.66666666666, ans=0.2 2023-12-04 10:51:35,076 INFO [train.py:1087] (1/4) Epoch 44, batch 100, loss[loss=0.1607, simple_loss=0.2553, pruned_loss=0.03305, over 24715.00 frames. ], tot_loss[loss=0.1582, simple_loss=0.2504, pruned_loss=0.03304, over 1931148.27 frames. 
], batch size: 69, lr: 5.60e-03, grad_scale: 32.0 2023-12-04 10:51:37,778 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=257233.33333333334, ans=0.125 2023-12-04 10:51:46,326 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=257300.0, ans=0.125 2023-12-04 10:51:54,732 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=257300.0, ans=0.125 2023-12-04 10:52:01,941 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=257366.66666666666, ans=0.125 2023-12-04 10:52:06,468 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.66 vs. limit=15.0 2023-12-04 10:52:26,076 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=257500.0, ans=0.0 2023-12-04 10:52:31,059 INFO [train.py:1087] (1/4) Epoch 44, batch 150, loss[loss=0.1571, simple_loss=0.2481, pruned_loss=0.03302, over 24480.00 frames. ], tot_loss[loss=0.1585, simple_loss=0.2507, pruned_loss=0.03318, over 2558660.21 frames. ], batch size: 77, lr: 5.60e-03, grad_scale: 32.0 2023-12-04 10:52:31,379 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=257566.66666666666, ans=0.0 2023-12-04 10:52:46,703 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=257633.33333333334, ans=0.1 2023-12-04 10:52:48,763 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=257633.33333333334, ans=0.125 2023-12-04 10:52:55,813 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.149e+02 1.310e+02 1.411e+02 1.524e+02 1.806e+02, threshold=2.823e+02, percent-clipped=0.0 2023-12-04 10:53:01,429 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=257700.0, ans=0.1 2023-12-04 10:53:01,441 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=257700.0, ans=0.125 2023-12-04 10:53:01,564 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=257700.0, ans=0.125 2023-12-04 10:53:06,096 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.77 vs. limit=15.0 2023-12-04 10:53:11,964 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0 2023-12-04 10:53:26,991 INFO [train.py:1087] (1/4) Epoch 44, batch 200, loss[loss=0.1621, simple_loss=0.2491, pruned_loss=0.03753, over 24794.00 frames. ], tot_loss[loss=0.159, simple_loss=0.2509, pruned_loss=0.03355, over 3064062.77 frames. 
], batch size: 62, lr: 5.59e-03, grad_scale: 32.0 2023-12-04 10:53:33,683 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=257900.0, ans=0.125 2023-12-04 10:54:04,121 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=258100.0, ans=0.0 2023-12-04 10:54:04,431 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.83 vs. limit=15.0 2023-12-04 10:54:04,493 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.83 vs. limit=6.0 2023-12-04 10:54:15,472 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=258166.66666666666, ans=0.2 2023-12-04 10:54:22,602 INFO [train.py:1087] (1/4) Epoch 44, batch 250, loss[loss=0.1703, simple_loss=0.2575, pruned_loss=0.04154, over 24504.00 frames. ], tot_loss[loss=0.1583, simple_loss=0.2501, pruned_loss=0.03326, over 3454746.09 frames. ], batch size: 75, lr: 5.59e-03, grad_scale: 32.0 2023-12-04 10:54:33,181 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=258300.0, ans=0.125 2023-12-04 10:54:47,511 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.099e+02 1.280e+02 1.390e+02 1.521e+02 2.279e+02, threshold=2.780e+02, percent-clipped=0.0 2023-12-04 10:54:59,186 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.13 vs. limit=10.0 2023-12-04 10:55:05,393 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=258433.33333333334, ans=0.125 2023-12-04 10:55:18,620 INFO [train.py:1087] (1/4) Epoch 44, batch 300, loss[loss=0.1577, simple_loss=0.2497, pruned_loss=0.03283, over 24759.00 frames. ], tot_loss[loss=0.1587, simple_loss=0.2507, pruned_loss=0.03337, over 3751778.59 frames. ], batch size: 70, lr: 5.58e-03, grad_scale: 16.0 2023-12-04 10:55:33,131 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=258633.33333333334, ans=0.125 2023-12-04 10:55:38,079 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=258633.33333333334, ans=0.1 2023-12-04 10:55:44,134 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=258700.0, ans=0.1 2023-12-04 10:56:15,406 INFO [train.py:1087] (1/4) Epoch 44, batch 350, loss[loss=0.1405, simple_loss=0.2344, pruned_loss=0.0233, over 24722.00 frames. ], tot_loss[loss=0.1591, simple_loss=0.2511, pruned_loss=0.03359, over 3969434.21 frames. ], batch size: 67, lr: 5.58e-03, grad_scale: 16.0 2023-12-04 10:56:18,084 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.03 vs. 
limit=10.0 2023-12-04 10:56:24,049 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=258900.0, ans=0.1 2023-12-04 10:56:40,381 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.084e+02 1.276e+02 1.381e+02 1.490e+02 2.030e+02, threshold=2.761e+02, percent-clipped=0.0 2023-12-04 10:56:42,788 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=259033.33333333334, ans=0.2 2023-12-04 10:56:51,913 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=259100.0, ans=0.125 2023-12-04 10:56:57,244 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=259100.0, ans=0.0 2023-12-04 10:57:10,198 INFO [train.py:1087] (1/4) Epoch 44, batch 400, loss[loss=0.1437, simple_loss=0.2389, pruned_loss=0.02424, over 24798.00 frames. ], tot_loss[loss=0.1586, simple_loss=0.2505, pruned_loss=0.0333, over 4170882.09 frames. ], batch size: 73, lr: 5.58e-03, grad_scale: 32.0 2023-12-04 10:57:19,011 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=259233.33333333334, ans=0.0 2023-12-04 10:57:23,240 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=259300.0, ans=0.1 2023-12-04 10:57:26,757 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.09 vs. limit=22.5 2023-12-04 10:57:32,330 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=259366.66666666666, ans=0.125 2023-12-04 10:58:06,561 INFO [train.py:1087] (1/4) Epoch 44, batch 450, loss[loss=0.1456, simple_loss=0.2357, pruned_loss=0.0277, over 24754.00 frames. ], tot_loss[loss=0.1582, simple_loss=0.2503, pruned_loss=0.03306, over 4318896.18 frames. ], batch size: 65, lr: 5.57e-03, grad_scale: 32.0 2023-12-04 10:58:22,645 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=259633.33333333334, ans=0.125 2023-12-04 10:58:23,680 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=259633.33333333334, ans=0.125 2023-12-04 10:58:31,262 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=259700.0, ans=0.0 2023-12-04 10:58:32,038 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.111e+02 1.255e+02 1.336e+02 1.462e+02 2.228e+02, threshold=2.673e+02, percent-clipped=0.0 2023-12-04 10:59:02,279 INFO [train.py:1087] (1/4) Epoch 44, batch 500, loss[loss=0.1579, simple_loss=0.2462, pruned_loss=0.03481, over 24486.00 frames. ], tot_loss[loss=0.1581, simple_loss=0.25, pruned_loss=0.03308, over 4441805.07 frames. 
], batch size: 75, lr: 5.57e-03, grad_scale: 32.0 2023-12-04 10:59:02,596 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=259900.0, ans=0.2 2023-12-04 10:59:26,975 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=260033.33333333334, ans=0.1 2023-12-04 10:59:30,539 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=260033.33333333334, ans=0.125 2023-12-04 10:59:42,809 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.01 vs. limit=10.0 2023-12-04 10:59:44,645 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=260100.0, ans=0.1 2023-12-04 10:59:57,569 INFO [train.py:1087] (1/4) Epoch 44, batch 550, loss[loss=0.159, simple_loss=0.2493, pruned_loss=0.03432, over 24774.00 frames. ], tot_loss[loss=0.1588, simple_loss=0.2505, pruned_loss=0.03349, over 4513527.90 frames. ], batch size: 70, lr: 5.57e-03, grad_scale: 32.0 2023-12-04 10:59:57,775 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=260233.33333333334, ans=0.125 2023-12-04 11:00:00,973 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=260233.33333333334, ans=0.2 2023-12-04 11:00:03,429 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=260233.33333333334, ans=0.125 2023-12-04 11:00:08,091 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.01 vs. limit=15.0 2023-12-04 11:00:22,738 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.162e+02 1.301e+02 1.379e+02 1.493e+02 2.705e+02, threshold=2.759e+02, percent-clipped=1.0 2023-12-04 11:00:23,058 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=260366.66666666666, ans=0.0 2023-12-04 11:00:53,047 INFO [train.py:1087] (1/4) Epoch 44, batch 600, loss[loss=0.149, simple_loss=0.246, pruned_loss=0.02602, over 24766.00 frames. ], tot_loss[loss=0.1586, simple_loss=0.2505, pruned_loss=0.03339, over 4583933.92 frames. ], batch size: 64, lr: 5.56e-03, grad_scale: 32.0 2023-12-04 11:00:57,079 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.91 vs. limit=6.0 2023-12-04 11:01:24,800 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=260700.0, ans=0.1 2023-12-04 11:01:35,955 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=260766.66666666666, ans=0.125 2023-12-04 11:01:48,801 INFO [train.py:1087] (1/4) Epoch 44, batch 650, loss[loss=0.1724, simple_loss=0.2638, pruned_loss=0.04049, over 24763.00 frames. ], tot_loss[loss=0.1587, simple_loss=0.2505, pruned_loss=0.03342, over 4630280.04 frames. 
], batch size: 70, lr: 5.56e-03, grad_scale: 32.0 2023-12-04 11:01:52,678 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=260900.0, ans=0.125 2023-12-04 11:02:13,734 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=261033.33333333334, ans=0.2 2023-12-04 11:02:14,553 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.100e+02 1.257e+02 1.345e+02 1.436e+02 1.946e+02, threshold=2.691e+02, percent-clipped=0.0 2023-12-04 11:02:14,733 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=261033.33333333334, ans=0.125 2023-12-04 11:02:21,471 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=261100.0, ans=0.0 2023-12-04 11:02:25,964 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=261100.0, ans=0.1 2023-12-04 11:02:31,129 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=261100.0, ans=0.125 2023-12-04 11:02:34,914 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 11:02:38,442 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=261166.66666666666, ans=0.125 2023-12-04 11:02:44,567 INFO [train.py:1087] (1/4) Epoch 44, batch 700, loss[loss=0.1663, simple_loss=0.2571, pruned_loss=0.03776, over 23390.00 frames. ], tot_loss[loss=0.1584, simple_loss=0.2504, pruned_loss=0.03314, over 4669358.95 frames. ], batch size: 94, lr: 5.56e-03, grad_scale: 32.0 2023-12-04 11:02:48,597 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=261233.33333333334, ans=15.0 2023-12-04 11:02:58,494 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 11:03:04,059 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=261300.0, ans=0.09899494936611666 2023-12-04 11:03:40,858 INFO [train.py:1087] (1/4) Epoch 44, batch 750, loss[loss=0.1698, simple_loss=0.2562, pruned_loss=0.04169, over 24232.00 frames. ], tot_loss[loss=0.1583, simple_loss=0.2504, pruned_loss=0.03308, over 4714583.03 frames. 
], batch size: 82, lr: 5.55e-03, grad_scale: 32.0 2023-12-04 11:03:45,715 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-04 11:03:46,750 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=261566.66666666666, ans=0.2 2023-12-04 11:04:06,503 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.131e+02 1.296e+02 1.395e+02 1.494e+02 1.834e+02, threshold=2.790e+02, percent-clipped=0.0 2023-12-04 11:04:13,747 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=261766.66666666666, ans=0.0 2023-12-04 11:04:28,192 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=261833.33333333334, ans=0.125 2023-12-04 11:04:36,804 INFO [train.py:1087] (1/4) Epoch 44, batch 800, loss[loss=0.1454, simple_loss=0.2451, pruned_loss=0.02286, over 24546.00 frames. ], tot_loss[loss=0.1577, simple_loss=0.2501, pruned_loss=0.03269, over 4749632.95 frames. ], batch size: 66, lr: 5.55e-03, grad_scale: 32.0 2023-12-04 11:04:37,068 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=261900.0, ans=0.0 2023-12-04 11:05:07,487 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=262100.0, ans=0.05 2023-12-04 11:05:13,785 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.40 vs. limit=15.0 2023-12-04 11:05:18,645 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=262166.6666666667, ans=0.0 2023-12-04 11:05:19,562 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=262166.6666666667, ans=0.0 2023-12-04 11:05:26,856 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=262166.6666666667, ans=0.05 2023-12-04 11:05:28,968 INFO [train.py:1087] (1/4) Epoch 44, batch 850, loss[loss=0.1638, simple_loss=0.2562, pruned_loss=0.03565, over 22750.00 frames. ], tot_loss[loss=0.1578, simple_loss=0.2501, pruned_loss=0.03281, over 4754875.10 frames. ], batch size: 106, lr: 5.55e-03, grad_scale: 32.0 2023-12-04 11:05:52,164 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.290e+02 1.372e+02 1.489e+02 2.246e+02, threshold=2.744e+02, percent-clipped=0.0 2023-12-04 11:05:57,382 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=262366.6666666667, ans=0.125 2023-12-04 11:06:11,320 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=262500.0, ans=0.1 2023-12-04 11:06:30,264 INFO [train.py:1087] (1/4) Epoch 45, batch 0, loss[loss=0.1472, simple_loss=0.2404, pruned_loss=0.02701, over 24868.00 frames. ], tot_loss[loss=0.1472, simple_loss=0.2404, pruned_loss=0.02701, over 24868.00 frames. 
], batch size: 68, lr: 5.48e-03, grad_scale: 32.0 2023-12-04 11:06:30,265 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 11:06:42,726 INFO [train.py:1119] (1/4) Epoch 45, validation: loss=0.1525, simple_loss=0.2511, pruned_loss=0.02696, over 944034.00 frames. 2023-12-04 11:06:42,727 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 11:07:17,691 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=262733.3333333333, ans=0.95 2023-12-04 11:07:19,962 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=262733.3333333333, ans=0.1 2023-12-04 11:07:26,381 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=262800.0, ans=0.2 2023-12-04 11:07:26,718 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.15 vs. limit=15.0 2023-12-04 11:07:38,493 INFO [train.py:1087] (1/4) Epoch 45, batch 50, loss[loss=0.1616, simple_loss=0.253, pruned_loss=0.03511, over 24479.00 frames. ], tot_loss[loss=0.1569, simple_loss=0.2486, pruned_loss=0.03257, over 1073227.10 frames. ], batch size: 77, lr: 5.48e-03, grad_scale: 16.0 2023-12-04 11:07:43,523 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=262866.6666666667, ans=0.2 2023-12-04 11:07:45,053 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.29 vs. limit=15.0 2023-12-04 11:07:58,350 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=262933.3333333333, ans=0.1 2023-12-04 11:08:05,298 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=263000.0, ans=0.1 2023-12-04 11:08:09,649 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=263000.0, ans=0.09899494936611666 2023-12-04 11:08:10,395 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.054e+02 1.301e+02 1.423e+02 1.672e+02 2.279e+02, threshold=2.846e+02, percent-clipped=0.0 2023-12-04 11:08:18,770 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=263066.6666666667, ans=0.125 2023-12-04 11:08:22,124 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-12-04 11:08:33,269 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=263200.0, ans=0.1 2023-12-04 11:08:34,112 INFO [train.py:1087] (1/4) Epoch 45, batch 100, loss[loss=0.1484, simple_loss=0.2433, pruned_loss=0.02678, over 21398.00 frames. ], tot_loss[loss=0.1575, simple_loss=0.2497, pruned_loss=0.0326, over 1894107.70 frames. 
], batch size: 127, lr: 5.47e-03, grad_scale: 16.0 2023-12-04 11:08:59,458 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=263333.3333333333, ans=0.125 2023-12-04 11:09:00,940 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.59 vs. limit=15.0 2023-12-04 11:09:06,609 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=263400.0, ans=0.0 2023-12-04 11:09:08,809 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=263400.0, ans=0.1 2023-12-04 11:09:08,810 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=263400.0, ans=0.2 2023-12-04 11:09:10,903 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 11:09:29,390 INFO [train.py:1087] (1/4) Epoch 45, batch 150, loss[loss=0.1576, simple_loss=0.2487, pruned_loss=0.03327, over 24569.00 frames. ], tot_loss[loss=0.1571, simple_loss=0.2495, pruned_loss=0.03236, over 2551135.83 frames. ], batch size: 65, lr: 5.47e-03, grad_scale: 16.0 2023-12-04 11:09:46,060 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-12-04 11:10:02,910 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.140e+02 1.302e+02 1.364e+02 1.498e+02 1.785e+02, threshold=2.728e+02, percent-clipped=0.0 2023-12-04 11:10:15,523 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=263800.0, ans=0.1 2023-12-04 11:10:26,278 INFO [train.py:1087] (1/4) Epoch 45, batch 200, loss[loss=0.1564, simple_loss=0.2547, pruned_loss=0.02902, over 24598.00 frames. ], tot_loss[loss=0.1574, simple_loss=0.2496, pruned_loss=0.03262, over 3047933.88 frames. ], batch size: 68, lr: 5.47e-03, grad_scale: 16.0 2023-12-04 11:10:32,888 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=263866.6666666667, ans=0.2 2023-12-04 11:11:16,250 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=264133.3333333333, ans=0.0 2023-12-04 11:11:21,612 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=264200.0, ans=0.0 2023-12-04 11:11:22,359 INFO [train.py:1087] (1/4) Epoch 45, batch 250, loss[loss=0.1512, simple_loss=0.2444, pruned_loss=0.02902, over 24769.00 frames. ], tot_loss[loss=0.1574, simple_loss=0.2494, pruned_loss=0.03263, over 3451969.40 frames. ], batch size: 70, lr: 5.46e-03, grad_scale: 16.0 2023-12-04 11:11:33,702 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=264266.6666666667, ans=0.1 2023-12-04 11:11:35,715 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=264266.6666666667, ans=0.0 2023-12-04 11:11:42,234 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.61 vs. 
limit=12.0 2023-12-04 11:11:45,538 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=264333.3333333333, ans=0.0 2023-12-04 11:11:47,833 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.28 vs. limit=15.0 2023-12-04 11:11:54,845 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.106e+02 1.261e+02 1.380e+02 1.511e+02 1.936e+02, threshold=2.761e+02, percent-clipped=0.0 2023-12-04 11:11:56,205 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=264400.0, ans=0.0 2023-12-04 11:12:19,001 INFO [train.py:1087] (1/4) Epoch 45, batch 300, loss[loss=0.1404, simple_loss=0.2356, pruned_loss=0.02261, over 24857.00 frames. ], tot_loss[loss=0.1571, simple_loss=0.2493, pruned_loss=0.03249, over 3764819.03 frames. ], batch size: 68, lr: 5.46e-03, grad_scale: 16.0 2023-12-04 11:12:27,730 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=264533.3333333333, ans=0.0 2023-12-04 11:12:32,401 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.29 vs. limit=22.5 2023-12-04 11:12:37,299 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=264600.0, ans=0.125 2023-12-04 11:13:14,907 INFO [train.py:1087] (1/4) Epoch 45, batch 350, loss[loss=0.1615, simple_loss=0.2574, pruned_loss=0.03277, over 22791.00 frames. ], tot_loss[loss=0.1576, simple_loss=0.2497, pruned_loss=0.0328, over 3997549.58 frames. ], batch size: 106, lr: 5.46e-03, grad_scale: 16.0 2023-12-04 11:13:15,229 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=264866.6666666667, ans=0.125 2023-12-04 11:13:24,012 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=264866.6666666667, ans=0.125 2023-12-04 11:13:26,308 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=264933.3333333333, ans=0.125 2023-12-04 11:13:46,504 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=265000.0, ans=0.125 2023-12-04 11:13:47,431 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.125e+02 1.292e+02 1.391e+02 1.519e+02 2.133e+02, threshold=2.783e+02, percent-clipped=0.0 2023-12-04 11:13:58,764 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=265133.3333333333, ans=0.1 2023-12-04 11:13:59,878 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=265133.3333333333, ans=0.125 2023-12-04 11:14:02,256 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.45 vs. limit=15.0 2023-12-04 11:14:10,356 INFO [train.py:1087] (1/4) Epoch 45, batch 400, loss[loss=0.1543, simple_loss=0.2411, pruned_loss=0.03371, over 24751.00 frames. ], tot_loss[loss=0.1576, simple_loss=0.2496, pruned_loss=0.03283, over 4172742.22 frames. 
], batch size: 63, lr: 5.45e-03, grad_scale: 32.0 2023-12-04 11:14:15,559 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.68 vs. limit=15.0 2023-12-04 11:14:34,230 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.70 vs. limit=12.0 2023-12-04 11:14:38,579 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.90 vs. limit=12.0 2023-12-04 11:14:54,319 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=265400.0, ans=0.125 2023-12-04 11:14:59,753 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=265466.6666666667, ans=0.0 2023-12-04 11:15:06,951 INFO [train.py:1087] (1/4) Epoch 45, batch 450, loss[loss=0.1548, simple_loss=0.2483, pruned_loss=0.03061, over 24774.00 frames. ], tot_loss[loss=0.1576, simple_loss=0.2496, pruned_loss=0.03277, over 4307939.84 frames. ], batch size: 64, lr: 5.45e-03, grad_scale: 32.0 2023-12-04 11:15:28,043 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=265600.0, ans=0.125 2023-12-04 11:15:39,535 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.125e+02 1.288e+02 1.390e+02 1.502e+02 2.726e+02, threshold=2.781e+02, percent-clipped=0.0 2023-12-04 11:15:48,343 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=265733.3333333333, ans=0.125 2023-12-04 11:16:03,391 INFO [train.py:1087] (1/4) Epoch 45, batch 500, loss[loss=0.1532, simple_loss=0.2464, pruned_loss=0.03, over 24526.00 frames. ], tot_loss[loss=0.1575, simple_loss=0.2497, pruned_loss=0.03263, over 4420663.24 frames. ], batch size: 75, lr: 5.45e-03, grad_scale: 32.0 2023-12-04 11:16:29,329 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=266000.0, ans=0.125 2023-12-04 11:16:33,263 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=266000.0, ans=0.0 2023-12-04 11:16:47,061 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=266133.3333333333, ans=0.2 2023-12-04 11:16:59,716 INFO [train.py:1087] (1/4) Epoch 45, batch 550, loss[loss=0.151, simple_loss=0.2452, pruned_loss=0.02835, over 24735.00 frames. ], tot_loss[loss=0.1574, simple_loss=0.2497, pruned_loss=0.0326, over 4516640.51 frames. 
], batch size: 63, lr: 5.44e-03, grad_scale: 32.0 2023-12-04 11:17:10,019 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=266266.6666666667, ans=0.95 2023-12-04 11:17:17,614 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=266266.6666666667, ans=0.125 2023-12-04 11:17:20,652 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=266333.3333333333, ans=0.125 2023-12-04 11:17:32,313 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.111e+02 1.283e+02 1.402e+02 1.532e+02 2.255e+02, threshold=2.804e+02, percent-clipped=0.0 2023-12-04 11:17:39,484 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.14 vs. limit=15.0 2023-12-04 11:17:45,788 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=266466.6666666667, ans=0.125 2023-12-04 11:17:46,784 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=266466.6666666667, ans=0.125 2023-12-04 11:17:55,090 INFO [train.py:1087] (1/4) Epoch 45, batch 600, loss[loss=0.1535, simple_loss=0.2476, pruned_loss=0.02971, over 24705.00 frames. ], tot_loss[loss=0.1577, simple_loss=0.2498, pruned_loss=0.0328, over 4578629.28 frames. ], batch size: 69, lr: 5.44e-03, grad_scale: 16.0 2023-12-04 11:18:00,442 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=266533.3333333333, ans=0.125 2023-12-04 11:18:02,910 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=266533.3333333333, ans=0.0 2023-12-04 11:18:10,455 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=266600.0, ans=0.05 2023-12-04 11:18:48,395 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=266800.0, ans=0.125 2023-12-04 11:18:54,523 INFO [train.py:1087] (1/4) Epoch 45, batch 650, loss[loss=0.1584, simple_loss=0.2537, pruned_loss=0.03153, over 24809.00 frames. ], tot_loss[loss=0.1581, simple_loss=0.2501, pruned_loss=0.0331, over 4608561.09 frames. 
], batch size: 62, lr: 5.44e-03, grad_scale: 16.0 2023-12-04 11:18:54,683 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=266866.6666666667, ans=0.1 2023-12-04 11:19:06,402 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=266933.3333333333, ans=0.125 2023-12-04 11:19:16,585 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=267000.0, ans=0.125 2023-12-04 11:19:24,617 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 11:19:28,563 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.158e+02 1.316e+02 1.419e+02 1.572e+02 2.122e+02, threshold=2.839e+02, percent-clipped=0.0 2023-12-04 11:19:35,935 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.18 vs. limit=15.0 2023-12-04 11:19:51,012 INFO [train.py:1087] (1/4) Epoch 45, batch 700, loss[loss=0.1684, simple_loss=0.2639, pruned_loss=0.03644, over 24728.00 frames. ], tot_loss[loss=0.1581, simple_loss=0.2501, pruned_loss=0.03307, over 4644523.63 frames. ], batch size: 61, lr: 5.43e-03, grad_scale: 16.0 2023-12-04 11:20:13,200 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.22 vs. limit=15.0 2023-12-04 11:20:24,680 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=267400.0, ans=0.0 2023-12-04 11:20:28,398 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=267400.0, ans=0.0 2023-12-04 11:20:39,357 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=267466.6666666667, ans=0.1 2023-12-04 11:20:47,169 INFO [train.py:1087] (1/4) Epoch 45, batch 750, loss[loss=0.1557, simple_loss=0.2477, pruned_loss=0.03185, over 24744.00 frames. ], tot_loss[loss=0.1579, simple_loss=0.2498, pruned_loss=0.03296, over 4693709.28 frames. ], batch size: 66, lr: 5.43e-03, grad_scale: 16.0 2023-12-04 11:20:55,758 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=267533.3333333333, ans=0.125 2023-12-04 11:20:59,008 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=267600.0, ans=0.07 2023-12-04 11:21:20,415 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.132e+02 1.281e+02 1.385e+02 1.471e+02 1.798e+02, threshold=2.769e+02, percent-clipped=0.0 2023-12-04 11:21:31,743 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.11 vs. limit=22.5 2023-12-04 11:21:42,629 INFO [train.py:1087] (1/4) Epoch 45, batch 800, loss[loss=0.1907, simple_loss=0.2688, pruned_loss=0.05628, over 16794.00 frames. ], tot_loss[loss=0.1579, simple_loss=0.2501, pruned_loss=0.03291, over 4709982.22 frames. 
], batch size: 177, lr: 5.43e-03, grad_scale: 32.0 2023-12-04 11:21:53,708 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=267933.3333333333, ans=0.1 2023-12-04 11:22:20,308 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.83 vs. limit=15.0 2023-12-04 11:22:20,726 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=268066.6666666667, ans=0.2 2023-12-04 11:22:21,945 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.59 vs. limit=15.0 2023-12-04 11:22:33,605 INFO [train.py:1087] (1/4) Epoch 45, batch 850, loss[loss=0.1725, simple_loss=0.2561, pruned_loss=0.04446, over 24506.00 frames. ], tot_loss[loss=0.1581, simple_loss=0.2501, pruned_loss=0.03302, over 4743383.57 frames. ], batch size: 75, lr: 5.42e-03, grad_scale: 32.0 2023-12-04 11:22:54,472 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=268333.3333333333, ans=0.125 2023-12-04 11:22:54,735 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.97 vs. limit=15.0 2023-12-04 11:22:59,669 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=268333.3333333333, ans=0.125 2023-12-04 11:23:05,462 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.116e+02 1.317e+02 1.452e+02 1.588e+02 2.065e+02, threshold=2.904e+02, percent-clipped=0.0 2023-12-04 11:23:06,552 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=268400.0, ans=0.125 2023-12-04 11:23:07,889 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-12-04 11:23:08,700 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=268400.0, ans=0.125 2023-12-04 11:23:10,405 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.67 vs. limit=15.0 2023-12-04 11:23:11,811 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=268400.0, ans=0.2 2023-12-04 11:23:16,687 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=268466.6666666667, ans=0.125 2023-12-04 11:23:27,868 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=268500.0, ans=0.0 2023-12-04 11:23:28,075 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.88 vs. limit=15.0 2023-12-04 11:23:33,064 INFO [train.py:1087] (1/4) Epoch 46, batch 0, loss[loss=0.1567, simple_loss=0.2497, pruned_loss=0.03179, over 24782.00 frames. ], tot_loss[loss=0.1567, simple_loss=0.2497, pruned_loss=0.03179, over 24782.00 frames. 
], batch size: 70, lr: 5.36e-03, grad_scale: 32.0 2023-12-04 11:23:33,065 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 11:23:45,364 INFO [train.py:1119] (1/4) Epoch 46, validation: loss=0.1518, simple_loss=0.2501, pruned_loss=0.02668, over 944034.00 frames. 2023-12-04 11:23:45,365 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 11:23:48,845 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=268500.0, ans=0.0 2023-12-04 11:23:59,251 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=268566.6666666667, ans=0.2 2023-12-04 11:24:16,299 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-12-04 11:24:16,871 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=268633.3333333333, ans=0.125 2023-12-04 11:24:21,261 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=268700.0, ans=0.0 2023-12-04 11:24:40,866 INFO [train.py:1087] (1/4) Epoch 46, batch 50, loss[loss=0.165, simple_loss=0.2597, pruned_loss=0.0351, over 24270.00 frames. ], tot_loss[loss=0.1598, simple_loss=0.2525, pruned_loss=0.03356, over 1095308.87 frames. ], batch size: 79, lr: 5.36e-03, grad_scale: 32.0 2023-12-04 11:24:57,358 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.29 vs. limit=15.0 2023-12-04 11:25:02,130 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.70 vs. limit=22.5 2023-12-04 11:25:20,366 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.149e+02 1.282e+02 1.420e+02 1.605e+02 2.832e+02, threshold=2.839e+02, percent-clipped=0.0 2023-12-04 11:25:29,011 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=269100.0, ans=0.0 2023-12-04 11:25:35,881 INFO [train.py:1087] (1/4) Epoch 46, batch 100, loss[loss=0.1508, simple_loss=0.2453, pruned_loss=0.02814, over 24722.00 frames. ], tot_loss[loss=0.1578, simple_loss=0.2502, pruned_loss=0.03274, over 1935143.55 frames. ], batch size: 74, lr: 5.36e-03, grad_scale: 32.0 2023-12-04 11:25:46,039 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 11:25:53,082 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.94 vs. limit=15.0 2023-12-04 11:26:31,389 INFO [train.py:1087] (1/4) Epoch 46, batch 150, loss[loss=0.1592, simple_loss=0.2465, pruned_loss=0.0359, over 24480.00 frames. ], tot_loss[loss=0.1582, simple_loss=0.2505, pruned_loss=0.03298, over 2584783.08 frames. 
], batch size: 77, lr: 5.35e-03, grad_scale: 32.0 2023-12-04 11:26:37,239 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=269500.0, ans=0.125 2023-12-04 11:26:43,328 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=269566.6666666667, ans=0.95 2023-12-04 11:26:44,372 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=269566.6666666667, ans=0.0 2023-12-04 11:26:59,356 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=269633.3333333333, ans=0.0 2023-12-04 11:27:04,712 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=269700.0, ans=0.0 2023-12-04 11:27:12,676 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.137e+02 1.277e+02 1.340e+02 1.477e+02 1.879e+02, threshold=2.679e+02, percent-clipped=0.0 2023-12-04 11:27:26,817 INFO [train.py:1087] (1/4) Epoch 46, batch 200, loss[loss=0.1917, simple_loss=0.2727, pruned_loss=0.05537, over 16715.00 frames. ], tot_loss[loss=0.1582, simple_loss=0.2503, pruned_loss=0.03311, over 3065734.98 frames. ], batch size: 177, lr: 5.35e-03, grad_scale: 16.0 2023-12-04 11:27:27,535 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-12-04 11:27:28,173 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=269833.3333333333, ans=0.125 2023-12-04 11:27:28,396 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.43 vs. limit=15.0 2023-12-04 11:27:31,614 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=269833.3333333333, ans=0.0 2023-12-04 11:27:37,032 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=269900.0, ans=0.125 2023-12-04 11:27:41,434 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=269900.0, ans=0.2 2023-12-04 11:27:53,985 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=269966.6666666667, ans=0.0 2023-12-04 11:28:20,458 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=270100.0, ans=0.2 2023-12-04 11:28:22,403 INFO [train.py:1087] (1/4) Epoch 46, batch 250, loss[loss=0.1583, simple_loss=0.2492, pruned_loss=0.03368, over 24465.00 frames. ], tot_loss[loss=0.1591, simple_loss=0.2511, pruned_loss=0.03354, over 3431422.80 frames. ], batch size: 77, lr: 5.35e-03, grad_scale: 16.0 2023-12-04 11:28:36,912 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=270233.3333333333, ans=0.2 2023-12-04 11:29:03,031 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.136e+02 1.293e+02 1.389e+02 1.495e+02 1.860e+02, threshold=2.779e+02, percent-clipped=0.0 2023-12-04 11:29:16,885 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.48 vs. 
limit=22.5 2023-12-04 11:29:17,459 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=270500.0, ans=0.0 2023-12-04 11:29:18,224 INFO [train.py:1087] (1/4) Epoch 46, batch 300, loss[loss=0.1543, simple_loss=0.2468, pruned_loss=0.0309, over 24737.00 frames. ], tot_loss[loss=0.1588, simple_loss=0.2508, pruned_loss=0.03337, over 3732195.76 frames. ], batch size: 63, lr: 5.34e-03, grad_scale: 16.0 2023-12-04 11:29:44,529 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=270633.3333333333, ans=0.125 2023-12-04 11:29:49,099 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.89 vs. limit=6.0 2023-12-04 11:29:51,196 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=270700.0, ans=0.125 2023-12-04 11:30:02,908 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=270766.6666666667, ans=0.125 2023-12-04 11:30:03,791 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=270766.6666666667, ans=0.0 2023-12-04 11:30:09,600 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.86 vs. limit=15.0 2023-12-04 11:30:13,355 INFO [train.py:1087] (1/4) Epoch 46, batch 350, loss[loss=0.1463, simple_loss=0.2405, pruned_loss=0.02608, over 24575.00 frames. ], tot_loss[loss=0.1583, simple_loss=0.2503, pruned_loss=0.03312, over 3980022.61 frames. ], batch size: 65, lr: 5.34e-03, grad_scale: 16.0 2023-12-04 11:30:13,676 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=270833.3333333333, ans=0.0 2023-12-04 11:30:35,697 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.33 vs. limit=12.0 2023-12-04 11:30:43,706 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=270966.6666666667, ans=0.125 2023-12-04 11:30:43,782 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=270966.6666666667, ans=10.0 2023-12-04 11:30:50,558 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=271033.3333333333, ans=0.04949747468305833 2023-12-04 11:30:54,998 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.070e+02 1.318e+02 1.456e+02 1.567e+02 1.885e+02, threshold=2.913e+02, percent-clipped=0.0 2023-12-04 11:31:08,712 INFO [train.py:1087] (1/4) Epoch 46, batch 400, loss[loss=0.1676, simple_loss=0.2564, pruned_loss=0.03942, over 20913.00 frames. ], tot_loss[loss=0.1581, simple_loss=0.25, pruned_loss=0.0331, over 4156386.80 frames. 
], batch size: 50, lr: 5.34e-03, grad_scale: 32.0 2023-12-04 11:31:18,974 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=271233.3333333333, ans=0.0 2023-12-04 11:31:36,761 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=271300.0, ans=0.5 2023-12-04 11:31:38,888 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=271300.0, ans=0.125 2023-12-04 11:31:45,506 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=271366.6666666667, ans=0.1 2023-12-04 11:31:56,031 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=271433.3333333333, ans=0.05 2023-12-04 11:32:04,243 INFO [train.py:1087] (1/4) Epoch 46, batch 450, loss[loss=0.1571, simple_loss=0.2535, pruned_loss=0.03041, over 24719.00 frames. ], tot_loss[loss=0.1584, simple_loss=0.2502, pruned_loss=0.03323, over 4283100.88 frames. ], batch size: 67, lr: 5.33e-03, grad_scale: 32.0 2023-12-04 11:32:07,767 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=271500.0, ans=0.125 2023-12-04 11:32:08,733 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=271500.0, ans=0.0 2023-12-04 11:32:30,851 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=271633.3333333333, ans=15.0 2023-12-04 11:32:45,216 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.132e+02 1.332e+02 1.442e+02 1.626e+02 2.367e+02, threshold=2.885e+02, percent-clipped=0.0 2023-12-04 11:32:45,507 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=271700.0, ans=0.1 2023-12-04 11:33:00,606 INFO [train.py:1087] (1/4) Epoch 46, batch 500, loss[loss=0.1543, simple_loss=0.2466, pruned_loss=0.03103, over 24804.00 frames. ], tot_loss[loss=0.1584, simple_loss=0.2504, pruned_loss=0.03323, over 4388811.18 frames. ], batch size: 62, lr: 5.33e-03, grad_scale: 32.0 2023-12-04 11:33:27,375 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=271966.6666666667, ans=0.125 2023-12-04 11:33:30,516 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.23 vs. limit=15.0 2023-12-04 11:33:47,927 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.85 vs. limit=22.5 2023-12-04 11:33:51,293 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.93 vs. limit=15.0 2023-12-04 11:33:56,372 INFO [train.py:1087] (1/4) Epoch 46, batch 550, loss[loss=0.1679, simple_loss=0.2555, pruned_loss=0.0401, over 21455.00 frames. ], tot_loss[loss=0.1579, simple_loss=0.2501, pruned_loss=0.03289, over 4484309.75 frames. 
], batch size: 127, lr: 5.33e-03, grad_scale: 32.0 2023-12-04 11:33:57,095 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=272166.6666666667, ans=0.2 2023-12-04 11:34:09,337 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=272233.3333333333, ans=0.125 2023-12-04 11:34:13,630 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=272233.3333333333, ans=0.125 2023-12-04 11:34:14,813 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=272233.3333333333, ans=0.125 2023-12-04 11:34:19,127 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=272300.0, ans=0.125 2023-12-04 11:34:26,961 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=272300.0, ans=0.0 2023-12-04 11:34:29,576 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=272366.6666666667, ans=0.1 2023-12-04 11:34:32,727 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=272366.6666666667, ans=0.2 2023-12-04 11:34:38,247 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.085e+02 1.294e+02 1.384e+02 1.497e+02 2.316e+02, threshold=2.768e+02, percent-clipped=0.0 2023-12-04 11:34:50,681 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=272433.3333333333, ans=0.125 2023-12-04 11:34:52,510 INFO [train.py:1087] (1/4) Epoch 46, batch 600, loss[loss=0.1474, simple_loss=0.2424, pruned_loss=0.02619, over 24757.00 frames. ], tot_loss[loss=0.1577, simple_loss=0.2498, pruned_loss=0.03279, over 4555937.06 frames. ], batch size: 66, lr: 5.32e-03, grad_scale: 32.0 2023-12-04 11:34:57,351 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=272500.0, ans=0.125 2023-12-04 11:35:06,620 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=272566.6666666667, ans=0.1 2023-12-04 11:35:08,814 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=272566.6666666667, ans=0.1 2023-12-04 11:35:12,942 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=272566.6666666667, ans=0.2 2023-12-04 11:35:26,550 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=272700.0, ans=0.0 2023-12-04 11:35:27,480 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=272700.0, ans=0.125 2023-12-04 11:35:29,727 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=272700.0, ans=0.125 2023-12-04 11:35:36,341 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=272766.6666666667, ans=0.125 2023-12-04 11:35:48,568 INFO [train.py:1087] (1/4) Epoch 46, batch 650, loss[loss=0.1527, simple_loss=0.2487, pruned_loss=0.02836, over 24800.00 frames. 
], tot_loss[loss=0.1579, simple_loss=0.25, pruned_loss=0.03285, over 4612194.75 frames. ], batch size: 72, lr: 5.32e-03, grad_scale: 32.0 2023-12-04 11:36:09,399 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.83 vs. limit=6.0 2023-12-04 11:36:16,464 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=272966.6666666667, ans=0.0 2023-12-04 11:36:29,605 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.130e+02 1.280e+02 1.369e+02 1.496e+02 2.731e+02, threshold=2.737e+02, percent-clipped=0.0 2023-12-04 11:36:44,175 INFO [train.py:1087] (1/4) Epoch 46, batch 700, loss[loss=0.1589, simple_loss=0.2514, pruned_loss=0.03326, over 24335.00 frames. ], tot_loss[loss=0.1577, simple_loss=0.2499, pruned_loss=0.03277, over 4653888.44 frames. ], batch size: 79, lr: 5.32e-03, grad_scale: 32.0 2023-12-04 11:36:49,092 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=13.09 vs. limit=22.5 2023-12-04 11:36:53,430 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=273166.6666666667, ans=0.125 2023-12-04 11:37:05,723 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.07 vs. limit=12.0 2023-12-04 11:37:05,824 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.61 vs. limit=15.0 2023-12-04 11:37:07,889 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=273300.0, ans=0.1 2023-12-04 11:37:08,455 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.28 vs. limit=15.0 2023-12-04 11:37:12,611 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=273300.0, ans=0.2 2023-12-04 11:37:13,802 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=273300.0, ans=0.125 2023-12-04 11:37:24,353 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=273366.6666666667, ans=0.125 2023-12-04 11:37:36,933 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=273433.3333333333, ans=0.2 2023-12-04 11:37:39,460 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.80 vs. limit=22.5 2023-12-04 11:37:39,939 INFO [train.py:1087] (1/4) Epoch 46, batch 750, loss[loss=0.1651, simple_loss=0.2525, pruned_loss=0.03883, over 24319.00 frames. ], tot_loss[loss=0.1575, simple_loss=0.2498, pruned_loss=0.03258, over 4698931.42 frames. 
], batch size: 79, lr: 5.31e-03, grad_scale: 32.0 2023-12-04 11:37:43,789 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=273500.0, ans=0.125 2023-12-04 11:37:58,030 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=273566.6666666667, ans=0.1 2023-12-04 11:38:12,942 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=273700.0, ans=0.125 2023-12-04 11:38:15,181 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.26 vs. limit=15.0 2023-12-04 11:38:21,512 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.168e+02 1.302e+02 1.405e+02 1.601e+02 2.273e+02, threshold=2.810e+02, percent-clipped=0.0 2023-12-04 11:38:27,063 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=273766.6666666667, ans=0.125 2023-12-04 11:38:32,723 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=273766.6666666667, ans=0.2 2023-12-04 11:38:36,058 INFO [train.py:1087] (1/4) Epoch 46, batch 800, loss[loss=0.1515, simple_loss=0.2415, pruned_loss=0.03073, over 24180.00 frames. ], tot_loss[loss=0.1572, simple_loss=0.2493, pruned_loss=0.03256, over 4720501.80 frames. ], batch size: 82, lr: 5.31e-03, grad_scale: 32.0 2023-12-04 11:38:41,594 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=273833.3333333333, ans=0.2 2023-12-04 11:38:42,577 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=273833.3333333333, ans=0.0 2023-12-04 11:38:45,711 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=273900.0, ans=0.07 2023-12-04 11:38:59,360 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=273966.6666666667, ans=0.0 2023-12-04 11:38:59,863 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.51 vs. limit=12.0 2023-12-04 11:39:19,422 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 11:39:26,568 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=274166.6666666667, ans=0.1 2023-12-04 11:39:27,321 INFO [train.py:1087] (1/4) Epoch 46, batch 850, loss[loss=0.1536, simple_loss=0.2452, pruned_loss=0.03097, over 24522.00 frames. ], tot_loss[loss=0.1571, simple_loss=0.2492, pruned_loss=0.03247, over 4741303.97 frames. 
], batch size: 75, lr: 5.31e-03, grad_scale: 32.0 2023-12-04 11:39:27,485 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=274166.6666666667, ans=0.0 2023-12-04 11:39:32,628 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=274166.6666666667, ans=0.0 2023-12-04 11:39:34,518 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=274166.6666666667, ans=0.05 2023-12-04 11:39:37,563 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=274233.3333333333, ans=0.125 2023-12-04 11:39:53,070 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=274300.0, ans=0.0 2023-12-04 11:39:59,060 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=274366.6666666667, ans=0.125 2023-12-04 11:40:04,861 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.086e+02 1.284e+02 1.352e+02 1.445e+02 2.031e+02, threshold=2.705e+02, percent-clipped=0.0 2023-12-04 11:40:26,882 INFO [train.py:1087] (1/4) Epoch 47, batch 0, loss[loss=0.1657, simple_loss=0.2643, pruned_loss=0.03357, over 22943.00 frames. ], tot_loss[loss=0.1657, simple_loss=0.2643, pruned_loss=0.03357, over 22943.00 frames. ], batch size: 106, lr: 5.25e-03, grad_scale: 32.0 2023-12-04 11:40:26,882 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 11:40:38,083 INFO [zipformer.py:1876] (1/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([3.7901, 3.4289, 3.4982, 3.3669], device='cuda:1') 2023-12-04 11:40:39,382 INFO [train.py:1119] (1/4) Epoch 47, validation: loss=0.152, simple_loss=0.2504, pruned_loss=0.0268, over 944034.00 frames. 2023-12-04 11:40:39,382 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 11:40:41,873 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.60 vs. limit=15.0 2023-12-04 11:40:50,996 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.66 vs. limit=15.0 2023-12-04 11:40:53,675 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=274533.3333333333, ans=0.0 2023-12-04 11:41:00,747 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=274600.0, ans=0.1 2023-12-04 11:41:35,062 INFO [train.py:1087] (1/4) Epoch 47, batch 50, loss[loss=0.1608, simple_loss=0.2549, pruned_loss=0.03339, over 24704.00 frames. ], tot_loss[loss=0.1572, simple_loss=0.2496, pruned_loss=0.03238, over 1098059.54 frames. 
], batch size: 74, lr: 5.24e-03, grad_scale: 32.0 2023-12-04 11:41:43,718 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=274800.0, ans=0.125 2023-12-04 11:41:49,427 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=274866.6666666667, ans=0.0 2023-12-04 11:41:54,014 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=274866.6666666667, ans=0.125 2023-12-04 11:41:54,982 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=274866.6666666667, ans=0.1 2023-12-04 11:42:00,779 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=274933.3333333333, ans=0.125 2023-12-04 11:42:01,076 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.91 vs. limit=15.0 2023-12-04 11:42:01,332 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.49 vs. limit=12.0 2023-12-04 11:42:07,522 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=275000.0, ans=0.125 2023-12-04 11:42:10,823 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=275000.0, ans=0.025 2023-12-04 11:42:10,871 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=275000.0, ans=0.125 2023-12-04 11:42:21,521 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.272e+02 1.416e+02 1.610e+02 2.646e+02, threshold=2.833e+02, percent-clipped=0.0 2023-12-04 11:42:21,789 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=275066.6666666667, ans=0.125 2023-12-04 11:42:30,855 INFO [train.py:1087] (1/4) Epoch 47, batch 100, loss[loss=0.1618, simple_loss=0.252, pruned_loss=0.03575, over 24753.00 frames. ], tot_loss[loss=0.1577, simple_loss=0.2503, pruned_loss=0.03259, over 1909056.63 frames. ], batch size: 70, lr: 5.24e-03, grad_scale: 32.0 2023-12-04 11:42:44,779 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=5.77 vs. limit=12.0 2023-12-04 11:43:01,787 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=275266.6666666667, ans=0.2 2023-12-04 11:43:01,849 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=275266.6666666667, ans=0.125 2023-12-04 11:43:13,100 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=275333.3333333333, ans=0.0 2023-12-04 11:43:18,977 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.64 vs. 
limit=22.5 2023-12-04 11:43:22,773 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=275400.0, ans=0.125 2023-12-04 11:43:25,670 INFO [train.py:1087] (1/4) Epoch 47, batch 150, loss[loss=0.1632, simple_loss=0.2588, pruned_loss=0.03382, over 24098.00 frames. ], tot_loss[loss=0.1568, simple_loss=0.2495, pruned_loss=0.03205, over 2565569.52 frames. ], batch size: 87, lr: 5.24e-03, grad_scale: 32.0 2023-12-04 11:43:25,907 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=275466.6666666667, ans=0.2 2023-12-04 11:43:31,120 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.80 vs. limit=22.5 2023-12-04 11:43:48,885 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=275600.0, ans=0.125 2023-12-04 11:44:03,044 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=275666.6666666667, ans=0.125 2023-12-04 11:44:13,361 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.064e+02 1.279e+02 1.366e+02 1.469e+02 1.889e+02, threshold=2.732e+02, percent-clipped=0.0 2023-12-04 11:44:21,887 INFO [train.py:1087] (1/4) Epoch 47, batch 200, loss[loss=0.1561, simple_loss=0.2512, pruned_loss=0.03055, over 21148.00 frames. ], tot_loss[loss=0.1568, simple_loss=0.2495, pruned_loss=0.03206, over 3064976.41 frames. ], batch size: 127, lr: 5.23e-03, grad_scale: 16.0 2023-12-04 11:44:41,587 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 11:44:43,637 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=275933.3333333333, ans=0.1 2023-12-04 11:45:18,579 INFO [train.py:1087] (1/4) Epoch 47, batch 250, loss[loss=0.1588, simple_loss=0.253, pruned_loss=0.03226, over 24740.00 frames. ], tot_loss[loss=0.1568, simple_loss=0.2492, pruned_loss=0.03215, over 3455421.85 frames. ], batch size: 63, lr: 5.23e-03, grad_scale: 16.0 2023-12-04 11:45:22,067 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=276133.3333333333, ans=0.125 2023-12-04 11:45:28,531 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=276200.0, ans=0.125 2023-12-04 11:45:44,116 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=276266.6666666667, ans=0.125 2023-12-04 11:45:58,589 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=276333.3333333333, ans=0.1 2023-12-04 11:46:00,781 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=276333.3333333333, ans=0.2 2023-12-04 11:46:05,728 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.084e+02 1.247e+02 1.326e+02 1.449e+02 2.219e+02, threshold=2.651e+02, percent-clipped=0.0 2023-12-04 11:46:14,078 INFO [train.py:1087] (1/4) Epoch 47, batch 300, loss[loss=0.155, simple_loss=0.2481, pruned_loss=0.03091, over 24499.00 frames. ], tot_loss[loss=0.157, simple_loss=0.2494, pruned_loss=0.03231, over 3760220.88 frames. 
], batch size: 77, lr: 5.23e-03, grad_scale: 16.0 2023-12-04 11:46:18,249 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=276466.6666666667, ans=0.125 2023-12-04 11:46:23,515 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=276466.6666666667, ans=0.125 2023-12-04 11:46:24,714 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=276533.3333333333, ans=0.125 2023-12-04 11:46:35,221 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=276600.0, ans=0.125 2023-12-04 11:46:46,457 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=276666.6666666667, ans=0.2 2023-12-04 11:46:56,456 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=276666.6666666667, ans=0.125 2023-12-04 11:47:01,681 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=276733.3333333333, ans=0.125 2023-12-04 11:47:08,898 INFO [train.py:1087] (1/4) Epoch 47, batch 350, loss[loss=0.1553, simple_loss=0.246, pruned_loss=0.03232, over 24735.00 frames. ], tot_loss[loss=0.1567, simple_loss=0.2491, pruned_loss=0.03221, over 4003098.81 frames. ], batch size: 63, lr: 5.23e-03, grad_scale: 16.0 2023-12-04 11:47:09,077 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=276800.0, ans=0.0 2023-12-04 11:47:57,755 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.133e+02 1.280e+02 1.373e+02 1.456e+02 2.082e+02, threshold=2.745e+02, percent-clipped=0.0 2023-12-04 11:48:02,442 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=277066.6666666667, ans=0.125 2023-12-04 11:48:05,379 INFO [train.py:1087] (1/4) Epoch 47, batch 400, loss[loss=0.1572, simple_loss=0.2491, pruned_loss=0.03264, over 24761.00 frames. ], tot_loss[loss=0.1569, simple_loss=0.2493, pruned_loss=0.03224, over 4191158.27 frames. ], batch size: 65, lr: 5.22e-03, grad_scale: 32.0 2023-12-04 11:48:09,825 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=277133.3333333333, ans=0.1 2023-12-04 11:48:20,193 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=277200.0, ans=0.0 2023-12-04 11:48:35,146 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.92 vs. limit=15.0 2023-12-04 11:48:41,550 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=277333.3333333333, ans=0.125 2023-12-04 11:48:47,252 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=277333.3333333333, ans=0.0 2023-12-04 11:48:48,674 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.81 vs. 
limit=6.0 2023-12-04 11:48:53,020 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=277400.0, ans=0.125 2023-12-04 11:48:59,764 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=277400.0, ans=0.125 2023-12-04 11:49:01,949 INFO [train.py:1087] (1/4) Epoch 47, batch 450, loss[loss=0.1692, simple_loss=0.26, pruned_loss=0.03921, over 24775.00 frames. ], tot_loss[loss=0.1574, simple_loss=0.2497, pruned_loss=0.03258, over 4307816.46 frames. ], batch size: 64, lr: 5.22e-03, grad_scale: 32.0 2023-12-04 11:49:50,810 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.153e+02 1.318e+02 1.415e+02 1.584e+02 2.056e+02, threshold=2.829e+02, percent-clipped=0.0 2023-12-04 11:49:56,213 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.23 vs. limit=15.0 2023-12-04 11:49:57,491 INFO [train.py:1087] (1/4) Epoch 47, batch 500, loss[loss=0.1687, simple_loss=0.2572, pruned_loss=0.04009, over 22749.00 frames. ], tot_loss[loss=0.1574, simple_loss=0.2497, pruned_loss=0.03262, over 4402441.20 frames. ], batch size: 106, lr: 5.22e-03, grad_scale: 16.0 2023-12-04 11:50:06,507 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=277800.0, ans=0.125 2023-12-04 11:50:06,819 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.93 vs. limit=15.0 2023-12-04 11:50:10,941 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.68 vs. limit=15.0 2023-12-04 11:50:13,964 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=277866.6666666667, ans=0.1 2023-12-04 11:50:20,755 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=277933.3333333333, ans=0.0 2023-12-04 11:50:31,950 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=278000.0, ans=0.0 2023-12-04 11:50:43,383 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=278066.6666666667, ans=0.1 2023-12-04 11:50:53,376 INFO [train.py:1087] (1/4) Epoch 47, batch 550, loss[loss=0.1745, simple_loss=0.2611, pruned_loss=0.04395, over 16628.00 frames. ], tot_loss[loss=0.1577, simple_loss=0.25, pruned_loss=0.03276, over 4495968.46 frames. ], batch size: 177, lr: 5.21e-03, grad_scale: 16.0 2023-12-04 11:51:01,898 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.99 vs. limit=10.0 2023-12-04 11:51:08,095 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.46 vs. limit=15.0 2023-12-04 11:51:21,880 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.00 vs. 
limit=15.0 2023-12-04 11:51:31,729 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=278333.3333333333, ans=0.2 2023-12-04 11:51:33,349 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.22 vs. limit=15.0 2023-12-04 11:51:42,224 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.155e+02 1.317e+02 1.437e+02 1.661e+02 2.158e+02, threshold=2.873e+02, percent-clipped=0.0 2023-12-04 11:51:49,075 INFO [train.py:1087] (1/4) Epoch 47, batch 600, loss[loss=0.2077, simple_loss=0.2851, pruned_loss=0.06521, over 17209.00 frames. ], tot_loss[loss=0.1575, simple_loss=0.2497, pruned_loss=0.03266, over 4561316.63 frames. ], batch size: 176, lr: 5.21e-03, grad_scale: 16.0 2023-12-04 11:51:51,802 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=278466.6666666667, ans=0.1 2023-12-04 11:51:57,524 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=278466.6666666667, ans=0.0 2023-12-04 11:51:58,434 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 11:52:06,441 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=278533.3333333333, ans=0.07 2023-12-04 11:52:22,409 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.82 vs. limit=22.5 2023-12-04 11:52:32,940 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=278666.6666666667, ans=0.125 2023-12-04 11:52:40,574 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.80 vs. limit=22.5 2023-12-04 11:52:45,489 INFO [train.py:1087] (1/4) Epoch 47, batch 650, loss[loss=0.1644, simple_loss=0.2563, pruned_loss=0.03628, over 24147.00 frames. ], tot_loss[loss=0.1573, simple_loss=0.2495, pruned_loss=0.03254, over 4625863.41 frames. 
], batch size: 82, lr: 5.21e-03, grad_scale: 16.0 2023-12-04 11:52:46,712 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=278800.0, ans=0.0 2023-12-04 11:52:55,038 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=278800.0, ans=0.125 2023-12-04 11:53:14,137 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=278933.3333333333, ans=0.2 2023-12-04 11:53:19,373 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=279000.0, ans=0.125 2023-12-04 11:53:24,230 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=279000.0, ans=0.0 2023-12-04 11:53:32,399 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=279066.6666666667, ans=0.1 2023-12-04 11:53:35,822 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.150e+02 1.314e+02 1.430e+02 1.582e+02 2.935e+02, threshold=2.859e+02, percent-clipped=1.0 2023-12-04 11:53:42,222 INFO [train.py:1087] (1/4) Epoch 47, batch 700, loss[loss=0.1609, simple_loss=0.2519, pruned_loss=0.03493, over 24568.00 frames. ], tot_loss[loss=0.1576, simple_loss=0.2498, pruned_loss=0.03267, over 4653020.02 frames. ], batch size: 65, lr: 5.20e-03, grad_scale: 16.0 2023-12-04 11:53:47,023 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.20 vs. limit=10.0 2023-12-04 11:53:47,741 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=279133.3333333333, ans=0.95 2023-12-04 11:53:59,287 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=279200.0, ans=0.125 2023-12-04 11:54:00,291 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=279200.0, ans=0.125 2023-12-04 11:54:00,333 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=279200.0, ans=0.125 2023-12-04 11:54:15,747 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=279333.3333333333, ans=0.0 2023-12-04 11:54:38,263 INFO [train.py:1087] (1/4) Epoch 47, batch 750, loss[loss=0.1517, simple_loss=0.2457, pruned_loss=0.02887, over 24560.00 frames. ], tot_loss[loss=0.1574, simple_loss=0.2497, pruned_loss=0.03259, over 4689511.42 frames. ], batch size: 63, lr: 5.20e-03, grad_scale: 16.0 2023-12-04 11:55:06,484 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=279600.0, ans=0.0 2023-12-04 11:55:26,818 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.155e+02 1.316e+02 1.459e+02 1.626e+02 2.190e+02, threshold=2.917e+02, percent-clipped=0.0 2023-12-04 11:55:33,704 INFO [train.py:1087] (1/4) Epoch 47, batch 800, loss[loss=0.1498, simple_loss=0.2376, pruned_loss=0.03099, over 24794.00 frames. ], tot_loss[loss=0.157, simple_loss=0.2493, pruned_loss=0.03229, over 4726470.51 frames. 
], batch size: 62, lr: 5.20e-03, grad_scale: 32.0 2023-12-04 11:55:50,267 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=279866.6666666667, ans=0.0 2023-12-04 11:56:19,894 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.67 vs. limit=12.0 2023-12-04 11:56:25,382 INFO [train.py:1087] (1/4) Epoch 47, batch 850, loss[loss=0.1545, simple_loss=0.2493, pruned_loss=0.02987, over 24844.00 frames. ], tot_loss[loss=0.1572, simple_loss=0.2493, pruned_loss=0.03251, over 4732600.97 frames. ], batch size: 68, lr: 5.20e-03, grad_scale: 16.0 2023-12-04 11:56:32,471 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=280133.3333333333, ans=0.0 2023-12-04 11:56:42,866 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.71 vs. limit=15.0 2023-12-04 11:56:47,185 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.57 vs. limit=15.0 2023-12-04 11:56:51,628 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=280266.6666666667, ans=0.0 2023-12-04 11:56:55,177 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.82 vs. limit=10.0 2023-12-04 11:57:03,809 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=280333.3333333333, ans=0.0 2023-12-04 11:57:16,983 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=280433.3333333333, ans=0.125 2023-12-04 11:57:19,343 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.062e+02 1.297e+02 1.418e+02 1.577e+02 2.456e+02, threshold=2.835e+02, percent-clipped=0.0 2023-12-04 11:57:19,370 INFO [train.py:1087] (1/4) Epoch 48, batch 0, loss[loss=0.1609, simple_loss=0.2592, pruned_loss=0.03129, over 22890.00 frames. ], tot_loss[loss=0.1609, simple_loss=0.2592, pruned_loss=0.03129, over 22890.00 frames. ], batch size: 106, lr: 5.14e-03, grad_scale: 32.0 2023-12-04 11:57:19,370 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 11:57:31,862 INFO [train.py:1119] (1/4) Epoch 48, validation: loss=0.152, simple_loss=0.2501, pruned_loss=0.02702, over 944034.00 frames. 
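The train.py entries above report, for each logged batch, a per-batch loss over that batch's frames together with a running tot_loss over all frames seen so far in the epoch (the validation entry is averaged the same way over 944034 frames). A minimal sketch of such a frame-weighted running average is shown below; the class and method names are illustrative assumptions for readers of this log, not the actual icefall train.py / MetricsTracker implementation.

class FrameWeightedLoss:
    """Illustrative frame-weighted running average (assumption, not icefall code)."""

    def __init__(self):
        self.frames = 0.0  # cumulative frame count ("over N frames" in the log)
        self.sums = {"loss": 0.0, "simple_loss": 0.0, "pruned_loss": 0.0}

    def update(self, loss, simple_loss, pruned_loss, num_frames):
        # Each per-batch value is assumed to already be averaged over that
        # batch's frames, so it is re-weighted by the batch's frame count.
        self.frames += num_frames
        self.sums["loss"] += loss * num_frames
        self.sums["simple_loss"] += simple_loss * num_frames
        self.sums["pruned_loss"] += pruned_loss * num_frames

    def totals(self):
        # Frame-weighted averages, analogous to the tot_loss[...] fields
        # printed at each logging step.
        return {k: v / self.frames for k, v in self.sums.items()}, self.frames

Under this sketch, combining two logged batches weights each by its frame count, so the running value tracks a frame-weighted average rather than a simple mean of the per-batch losses.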
2023-12-04 11:57:31,863 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 11:57:42,712 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=280500.0, ans=0.2 2023-12-04 11:58:01,045 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=280566.6666666667, ans=0.0 2023-12-04 11:58:03,172 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=280566.6666666667, ans=0.125 2023-12-04 11:58:19,242 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=280700.0, ans=0.125 2023-12-04 11:58:24,178 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=280700.0, ans=0.5 2023-12-04 11:58:27,422 INFO [train.py:1087] (1/4) Epoch 48, batch 50, loss[loss=0.1574, simple_loss=0.2497, pruned_loss=0.03254, over 24730.00 frames. ], tot_loss[loss=0.1589, simple_loss=0.2508, pruned_loss=0.03354, over 1075698.76 frames. ], batch size: 63, lr: 5.13e-03, grad_scale: 32.0 2023-12-04 11:58:30,994 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.48 vs. limit=15.0 2023-12-04 11:58:31,845 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=280766.6666666667, ans=0.125 2023-12-04 11:58:46,006 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.96 vs. limit=22.5 2023-12-04 11:58:48,998 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=280900.0, ans=0.125 2023-12-04 11:58:50,177 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=280900.0, ans=0.1 2023-12-04 11:58:50,454 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.76 vs. limit=15.0 2023-12-04 11:58:57,809 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.17 vs. limit=15.0 2023-12-04 11:59:01,599 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=280966.6666666667, ans=0.125 2023-12-04 11:59:12,222 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=281033.3333333333, ans=0.125 2023-12-04 11:59:21,765 INFO [train.py:1087] (1/4) Epoch 48, batch 100, loss[loss=0.1432, simple_loss=0.2348, pruned_loss=0.02581, over 24753.00 frames. ], tot_loss[loss=0.1582, simple_loss=0.2501, pruned_loss=0.03315, over 1892667.49 frames. 
], batch size: 66, lr: 5.13e-03, grad_scale: 16.0 2023-12-04 11:59:23,159 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.084e+02 1.273e+02 1.375e+02 1.487e+02 2.008e+02, threshold=2.750e+02, percent-clipped=0.0 2023-12-04 11:59:29,943 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=281100.0, ans=0.0 2023-12-04 11:59:36,505 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=281166.6666666667, ans=0.2 2023-12-04 11:59:40,848 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=281166.6666666667, ans=0.5 2023-12-04 12:00:11,312 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.28 vs. limit=15.0 2023-12-04 12:00:12,983 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=281366.6666666667, ans=0.125 2023-12-04 12:00:17,006 INFO [train.py:1087] (1/4) Epoch 48, batch 150, loss[loss=0.1536, simple_loss=0.2466, pruned_loss=0.03032, over 24548.00 frames. ], tot_loss[loss=0.1575, simple_loss=0.2498, pruned_loss=0.03264, over 2553517.53 frames. ], batch size: 63, lr: 5.13e-03, grad_scale: 16.0 2023-12-04 12:00:21,530 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=281433.3333333333, ans=0.0 2023-12-04 12:00:24,108 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=281433.3333333333, ans=0.1 2023-12-04 12:01:13,189 INFO [train.py:1087] (1/4) Epoch 48, batch 200, loss[loss=0.1476, simple_loss=0.243, pruned_loss=0.02608, over 24573.00 frames. ], tot_loss[loss=0.1571, simple_loss=0.2493, pruned_loss=0.03247, over 3052094.65 frames. ], batch size: 65, lr: 5.13e-03, grad_scale: 16.0 2023-12-04 12:01:14,235 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.122e+02 1.241e+02 1.316e+02 1.436e+02 2.359e+02, threshold=2.632e+02, percent-clipped=0.0 2023-12-04 12:01:16,792 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=281766.6666666667, ans=15.0 2023-12-04 12:01:28,992 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=281833.3333333333, ans=0.125 2023-12-04 12:01:47,293 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.35 vs. limit=15.0 2023-12-04 12:01:55,770 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.66 vs. limit=15.0 2023-12-04 12:02:08,701 INFO [train.py:1087] (1/4) Epoch 48, batch 250, loss[loss=0.1634, simple_loss=0.2545, pruned_loss=0.03618, over 23596.00 frames. ], tot_loss[loss=0.1571, simple_loss=0.2493, pruned_loss=0.03245, over 3457666.61 frames. 
], batch size: 94, lr: 5.12e-03, grad_scale: 16.0 2023-12-04 12:02:18,372 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=282166.6666666667, ans=0.1 2023-12-04 12:02:18,397 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=282166.6666666667, ans=0.0 2023-12-04 12:02:18,432 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=282166.6666666667, ans=0.125 2023-12-04 12:02:28,483 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=282166.6666666667, ans=0.2 2023-12-04 12:02:58,333 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=282366.6666666667, ans=0.1 2023-12-04 12:02:59,427 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=282366.6666666667, ans=0.1 2023-12-04 12:03:03,801 INFO [train.py:1087] (1/4) Epoch 48, batch 300, loss[loss=0.1659, simple_loss=0.2596, pruned_loss=0.03614, over 23521.00 frames. ], tot_loss[loss=0.1567, simple_loss=0.249, pruned_loss=0.03217, over 3764388.20 frames. ], batch size: 94, lr: 5.12e-03, grad_scale: 16.0 2023-12-04 12:03:05,187 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.119e+02 1.347e+02 1.460e+02 1.610e+02 2.169e+02, threshold=2.920e+02, percent-clipped=0.0 2023-12-04 12:03:08,912 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=282433.3333333333, ans=0.0 2023-12-04 12:03:08,980 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=282433.3333333333, ans=0.0 2023-12-04 12:03:17,480 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=282500.0, ans=0.1 2023-12-04 12:03:18,941 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.74 vs. limit=15.0 2023-12-04 12:03:34,246 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.76 vs. limit=15.0 2023-12-04 12:03:44,816 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.90 vs. limit=15.0 2023-12-04 12:03:59,230 INFO [train.py:1087] (1/4) Epoch 48, batch 350, loss[loss=0.1572, simple_loss=0.251, pruned_loss=0.03176, over 24289.00 frames. ], tot_loss[loss=0.1566, simple_loss=0.2489, pruned_loss=0.03216, over 3989004.74 frames. ], batch size: 79, lr: 5.12e-03, grad_scale: 16.0 2023-12-04 12:04:02,642 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=282766.6666666667, ans=0.2 2023-12-04 12:04:17,880 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.14 vs. limit=15.0 2023-12-04 12:04:50,613 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. 
limit=6.0 2023-12-04 12:04:51,741 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.47 vs. limit=15.0 2023-12-04 12:04:54,352 INFO [train.py:1087] (1/4) Epoch 48, batch 400, loss[loss=0.1576, simple_loss=0.2472, pruned_loss=0.03396, over 24735.00 frames. ], tot_loss[loss=0.1568, simple_loss=0.249, pruned_loss=0.0323, over 4165918.61 frames. ], batch size: 63, lr: 5.11e-03, grad_scale: 32.0 2023-12-04 12:04:55,400 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.124e+02 1.259e+02 1.354e+02 1.462e+02 2.607e+02, threshold=2.709e+02, percent-clipped=0.0 2023-12-04 12:05:03,171 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=283100.0, ans=0.125 2023-12-04 12:05:04,646 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=283166.6666666667, ans=0.0 2023-12-04 12:05:16,088 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=283233.3333333333, ans=0.2 2023-12-04 12:05:32,055 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=283300.0, ans=0.125 2023-12-04 12:05:50,080 INFO [train.py:1087] (1/4) Epoch 48, batch 450, loss[loss=0.2, simple_loss=0.2746, pruned_loss=0.06271, over 16674.00 frames. ], tot_loss[loss=0.1568, simple_loss=0.249, pruned_loss=0.03229, over 4301226.69 frames. ], batch size: 178, lr: 5.11e-03, grad_scale: 16.0 2023-12-04 12:05:56,667 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=283433.3333333333, ans=0.125 2023-12-04 12:06:09,689 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=283500.0, ans=0.125 2023-12-04 12:06:10,247 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.22 vs. limit=15.0 2023-12-04 12:06:34,221 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.61 vs. limit=15.0 2023-12-04 12:06:35,908 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=283700.0, ans=0.125 2023-12-04 12:06:45,305 INFO [train.py:1087] (1/4) Epoch 48, batch 500, loss[loss=0.1475, simple_loss=0.2388, pruned_loss=0.02806, over 24852.00 frames. ], tot_loss[loss=0.157, simple_loss=0.2492, pruned_loss=0.03236, over 4405325.38 frames. 
], batch size: 68, lr: 5.11e-03, grad_scale: 16.0 2023-12-04 12:06:47,389 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.120e+02 1.281e+02 1.349e+02 1.438e+02 2.094e+02, threshold=2.698e+02, percent-clipped=0.0 2023-12-04 12:06:49,756 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=283766.6666666667, ans=0.0 2023-12-04 12:06:58,051 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=283833.3333333333, ans=0.0 2023-12-04 12:07:05,568 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=283900.0, ans=0.09899494936611666 2023-12-04 12:07:05,882 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.92 vs. limit=15.0 2023-12-04 12:07:11,142 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=283900.0, ans=0.0 2023-12-04 12:07:15,637 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-12-04 12:07:17,325 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=283966.6666666667, ans=0.125 2023-12-04 12:07:18,411 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=283966.6666666667, ans=0.1 2023-12-04 12:07:34,627 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=284033.3333333333, ans=0.125 2023-12-04 12:07:39,634 INFO [train.py:1087] (1/4) Epoch 48, batch 550, loss[loss=0.1496, simple_loss=0.2436, pruned_loss=0.02782, over 24614.00 frames. ], tot_loss[loss=0.1569, simple_loss=0.2493, pruned_loss=0.03222, over 4501285.12 frames. ], batch size: 68, lr: 5.11e-03, grad_scale: 16.0 2023-12-04 12:07:39,968 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=284100.0, ans=0.125 2023-12-04 12:08:03,553 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=284233.3333333333, ans=0.125 2023-12-04 12:08:31,802 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=284366.6666666667, ans=0.125 2023-12-04 12:08:35,657 INFO [train.py:1087] (1/4) Epoch 48, batch 600, loss[loss=0.158, simple_loss=0.2526, pruned_loss=0.03164, over 24604.00 frames. ], tot_loss[loss=0.1568, simple_loss=0.2493, pruned_loss=0.0322, over 4584439.23 frames. 
], batch size: 68, lr: 5.10e-03, grad_scale: 16.0 2023-12-04 12:08:37,859 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.118e+02 1.324e+02 1.469e+02 1.698e+02 2.383e+02, threshold=2.939e+02, percent-clipped=0.0 2023-12-04 12:08:44,611 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=284433.3333333333, ans=0.0 2023-12-04 12:08:54,190 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=284500.0, ans=0.1 2023-12-04 12:08:59,867 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=284566.6666666667, ans=0.2 2023-12-04 12:09:01,929 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=284566.6666666667, ans=0.0 2023-12-04 12:09:02,964 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=284566.6666666667, ans=0.125 2023-12-04 12:09:25,191 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=284700.0, ans=0.0 2023-12-04 12:09:28,730 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.50 vs. limit=15.0 2023-12-04 12:09:31,639 INFO [train.py:1087] (1/4) Epoch 48, batch 650, loss[loss=0.1629, simple_loss=0.2584, pruned_loss=0.03371, over 24795.00 frames. ], tot_loss[loss=0.1568, simple_loss=0.2491, pruned_loss=0.0323, over 4618496.81 frames. ], batch size: 72, lr: 5.10e-03, grad_scale: 16.0 2023-12-04 12:09:37,382 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.49 vs. limit=15.0 2023-12-04 12:09:51,265 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.42 vs. limit=15.0 2023-12-04 12:10:05,551 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.71 vs. limit=10.0 2023-12-04 12:10:07,512 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.64 vs. limit=12.0 2023-12-04 12:10:26,847 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.41 vs. limit=15.0 2023-12-04 12:10:27,167 INFO [train.py:1087] (1/4) Epoch 48, batch 700, loss[loss=0.1585, simple_loss=0.2523, pruned_loss=0.03234, over 24719.00 frames. ], tot_loss[loss=0.1565, simple_loss=0.2488, pruned_loss=0.03211, over 4664565.69 frames. ], batch size: 67, lr: 5.10e-03, grad_scale: 16.0 2023-12-04 12:10:27,493 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=285100.0, ans=0.125 2023-12-04 12:10:29,339 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.149e+02 1.347e+02 1.589e+02 1.708e+02 2.161e+02, threshold=3.178e+02, percent-clipped=0.0 2023-12-04 12:10:37,451 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.92 vs. 
limit=15.0 2023-12-04 12:10:43,758 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.48 vs. limit=15.0 2023-12-04 12:10:57,348 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=285233.3333333333, ans=0.125 2023-12-04 12:11:07,399 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=285300.0, ans=0.125 2023-12-04 12:11:09,602 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=285300.0, ans=0.1 2023-12-04 12:11:19,217 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=285366.6666666667, ans=0.0 2023-12-04 12:11:22,114 INFO [train.py:1087] (1/4) Epoch 48, batch 750, loss[loss=0.1612, simple_loss=0.2485, pruned_loss=0.03696, over 24504.00 frames. ], tot_loss[loss=0.1566, simple_loss=0.2489, pruned_loss=0.03217, over 4700730.39 frames. ], batch size: 75, lr: 5.09e-03, grad_scale: 16.0 2023-12-04 12:11:23,871 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=285433.3333333333, ans=0.0 2023-12-04 12:11:41,300 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=285500.0, ans=0.0 2023-12-04 12:11:45,891 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.40 vs. limit=10.0 2023-12-04 12:12:04,795 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.30 vs. limit=15.0 2023-12-04 12:12:15,611 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=285700.0, ans=0.0 2023-12-04 12:12:17,406 INFO [train.py:1087] (1/4) Epoch 48, batch 800, loss[loss=0.1556, simple_loss=0.2505, pruned_loss=0.03039, over 24144.00 frames. ], tot_loss[loss=0.1567, simple_loss=0.2489, pruned_loss=0.03225, over 4715335.09 frames. 
], batch size: 58, lr: 5.09e-03, grad_scale: 32.0 2023-12-04 12:12:19,543 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.142e+02 1.291e+02 1.368e+02 1.474e+02 1.788e+02, threshold=2.736e+02, percent-clipped=0.0 2023-12-04 12:12:22,913 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=285766.6666666667, ans=0.125 2023-12-04 12:12:28,891 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=285833.3333333333, ans=0.0 2023-12-04 12:12:31,760 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=285833.3333333333, ans=0.2 2023-12-04 12:12:36,034 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=285833.3333333333, ans=0.2 2023-12-04 12:12:36,127 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=285833.3333333333, ans=0.125 2023-12-04 12:12:40,118 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=285900.0, ans=0.09899494936611666 2023-12-04 12:13:02,272 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=286033.3333333333, ans=0.2 2023-12-04 12:13:09,261 INFO [train.py:1087] (1/4) Epoch 48, batch 850, loss[loss=0.151, simple_loss=0.2456, pruned_loss=0.02818, over 24556.00 frames. ], tot_loss[loss=0.1565, simple_loss=0.2487, pruned_loss=0.03217, over 4734251.18 frames. ], batch size: 63, lr: 5.09e-03, grad_scale: 16.0 2023-12-04 12:13:15,707 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=286100.0, ans=0.0 2023-12-04 12:13:21,893 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.72 vs. limit=12.0 2023-12-04 12:13:38,685 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=286300.0, ans=0.0 2023-12-04 12:14:06,589 INFO [train.py:1087] (1/4) Epoch 49, batch 0, loss[loss=0.1405, simple_loss=0.2389, pruned_loss=0.02104, over 24783.00 frames. ], tot_loss[loss=0.1405, simple_loss=0.2389, pruned_loss=0.02104, over 24783.00 frames. ], batch size: 73, lr: 5.03e-03, grad_scale: 32.0 2023-12-04 12:14:06,590 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 12:14:18,644 INFO [train.py:1119] (1/4) Epoch 49, validation: loss=0.1515, simple_loss=0.2498, pruned_loss=0.02665, over 944034.00 frames. 2023-12-04 12:14:18,645 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 12:14:26,382 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=286400.0, ans=0.2 2023-12-04 12:14:27,056 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.090e+02 1.277e+02 1.377e+02 1.551e+02 2.556e+02, threshold=2.753e+02, percent-clipped=0.0 2023-12-04 12:14:31,624 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=286466.6666666667, ans=0.2 2023-12-04 12:14:40,277 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.00 vs. 
limit=12.0 2023-12-04 12:15:10,297 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=286666.6666666667, ans=0.125 2023-12-04 12:15:14,292 INFO [train.py:1087] (1/4) Epoch 49, batch 50, loss[loss=0.1531, simple_loss=0.2421, pruned_loss=0.03203, over 24562.00 frames. ], tot_loss[loss=0.1583, simple_loss=0.2502, pruned_loss=0.0332, over 1076787.31 frames. ], batch size: 63, lr: 5.03e-03, grad_scale: 32.0 2023-12-04 12:15:34,031 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=286800.0, ans=0.0 2023-12-04 12:15:46,254 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.99 vs. limit=15.0 2023-12-04 12:16:06,481 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=287000.0, ans=0.07 2023-12-04 12:16:09,396 INFO [train.py:1087] (1/4) Epoch 49, batch 100, loss[loss=0.1452, simple_loss=0.2398, pruned_loss=0.0253, over 24726.00 frames. ], tot_loss[loss=0.1561, simple_loss=0.2488, pruned_loss=0.03169, over 1918722.36 frames. ], batch size: 74, lr: 5.03e-03, grad_scale: 32.0 2023-12-04 12:16:18,633 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.082e+02 1.265e+02 1.369e+02 1.471e+02 2.016e+02, threshold=2.739e+02, percent-clipped=0.0 2023-12-04 12:16:39,643 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=287200.0, ans=0.125 2023-12-04 12:16:41,380 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.43 vs. limit=15.0 2023-12-04 12:16:46,019 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.27 vs. limit=15.0 2023-12-04 12:16:49,914 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=287266.6666666667, ans=0.125 2023-12-04 12:17:00,907 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=287333.3333333333, ans=0.125 2023-12-04 12:17:03,407 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.82 vs. limit=15.0 2023-12-04 12:17:04,959 INFO [train.py:1087] (1/4) Epoch 49, batch 150, loss[loss=0.1546, simple_loss=0.2431, pruned_loss=0.03303, over 24548.00 frames. ], tot_loss[loss=0.1563, simple_loss=0.2489, pruned_loss=0.03185, over 2554223.22 frames. ], batch size: 62, lr: 5.02e-03, grad_scale: 16.0 2023-12-04 12:17:18,183 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=287466.6666666667, ans=0.0 2023-12-04 12:18:00,411 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=287733.3333333333, ans=0.0 2023-12-04 12:18:01,296 INFO [train.py:1087] (1/4) Epoch 49, batch 200, loss[loss=0.1393, simple_loss=0.2355, pruned_loss=0.02159, over 24604.00 frames. ], tot_loss[loss=0.1556, simple_loss=0.2483, pruned_loss=0.03144, over 3054924.13 frames. 
], batch size: 68, lr: 5.02e-03, grad_scale: 16.0 2023-12-04 12:18:11,201 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.073e+02 1.291e+02 1.371e+02 1.476e+02 1.985e+02, threshold=2.742e+02, percent-clipped=0.0 2023-12-04 12:18:45,256 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-12-04 12:18:46,584 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=288000.0, ans=0.125 2023-12-04 12:18:57,955 INFO [train.py:1087] (1/4) Epoch 49, batch 250, loss[loss=0.1547, simple_loss=0.2456, pruned_loss=0.03191, over 24769.00 frames. ], tot_loss[loss=0.1562, simple_loss=0.2487, pruned_loss=0.03182, over 3455828.49 frames. ], batch size: 70, lr: 5.02e-03, grad_scale: 16.0 2023-12-04 12:19:06,798 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=288066.6666666667, ans=0.04949747468305833 2023-12-04 12:19:14,674 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=288133.3333333333, ans=0.125 2023-12-04 12:19:53,310 INFO [train.py:1087] (1/4) Epoch 49, batch 300, loss[loss=0.1625, simple_loss=0.252, pruned_loss=0.03653, over 24487.00 frames. ], tot_loss[loss=0.1564, simple_loss=0.2489, pruned_loss=0.03195, over 3745843.71 frames. ], batch size: 77, lr: 5.02e-03, grad_scale: 16.0 2023-12-04 12:20:03,360 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.110e+02 1.319e+02 1.446e+02 1.575e+02 3.932e+02, threshold=2.892e+02, percent-clipped=1.0 2023-12-04 12:20:04,034 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.21 vs. limit=22.5 2023-12-04 12:20:07,993 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=288466.6666666667, ans=0.125 2023-12-04 12:20:39,612 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=288666.6666666667, ans=0.0 2023-12-04 12:20:40,771 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=288666.6666666667, ans=0.0 2023-12-04 12:20:48,946 INFO [train.py:1087] (1/4) Epoch 49, batch 350, loss[loss=0.1569, simple_loss=0.2503, pruned_loss=0.03174, over 24762.00 frames. ], tot_loss[loss=0.1561, simple_loss=0.2485, pruned_loss=0.03182, over 3996180.84 frames. ], batch size: 64, lr: 5.01e-03, grad_scale: 16.0 2023-12-04 12:20:53,788 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=288733.3333333333, ans=0.5 2023-12-04 12:21:12,075 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.80 vs. limit=10.0 2023-12-04 12:21:22,624 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=288933.3333333333, ans=0.125 2023-12-04 12:21:45,364 INFO [train.py:1087] (1/4) Epoch 49, batch 400, loss[loss=0.1596, simple_loss=0.2545, pruned_loss=0.03233, over 23726.00 frames. ], tot_loss[loss=0.1562, simple_loss=0.2486, pruned_loss=0.03183, over 4167110.81 frames. 
], batch size: 57, lr: 5.01e-03, grad_scale: 16.0 2023-12-04 12:21:49,142 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.91 vs. limit=12.0 2023-12-04 12:21:56,374 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.071e+02 1.290e+02 1.383e+02 1.499e+02 2.166e+02, threshold=2.767e+02, percent-clipped=0.0 2023-12-04 12:22:20,175 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.05 vs. limit=10.0 2023-12-04 12:22:21,866 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=289266.6666666667, ans=0.0 2023-12-04 12:22:35,533 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.10 vs. limit=15.0 2023-12-04 12:22:41,511 INFO [train.py:1087] (1/4) Epoch 49, batch 450, loss[loss=0.1562, simple_loss=0.2513, pruned_loss=0.03058, over 21599.00 frames. ], tot_loss[loss=0.1559, simple_loss=0.2483, pruned_loss=0.03176, over 4303862.30 frames. ], batch size: 128, lr: 5.01e-03, grad_scale: 16.0 2023-12-04 12:22:54,630 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=289466.6666666667, ans=0.125 2023-12-04 12:22:58,917 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=289466.6666666667, ans=0.0 2023-12-04 12:23:01,471 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=289466.6666666667, ans=0.05 2023-12-04 12:23:21,370 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=289600.0, ans=0.125 2023-12-04 12:23:23,519 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=289600.0, ans=0.1 2023-12-04 12:23:26,853 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=289666.6666666667, ans=0.09899494936611666 2023-12-04 12:23:37,221 INFO [train.py:1087] (1/4) Epoch 49, batch 500, loss[loss=0.167, simple_loss=0.2602, pruned_loss=0.03688, over 22816.00 frames. ], tot_loss[loss=0.156, simple_loss=0.2484, pruned_loss=0.03175, over 4409958.52 frames. ], batch size: 106, lr: 5.00e-03, grad_scale: 16.0 2023-12-04 12:23:44,258 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.97 vs. limit=22.5 2023-12-04 12:23:44,430 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.89 vs. 
limit=6.0 2023-12-04 12:23:47,862 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.118e+02 1.299e+02 1.421e+02 1.527e+02 2.302e+02, threshold=2.842e+02, percent-clipped=0.0 2023-12-04 12:23:55,673 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=289800.0, ans=0.2 2023-12-04 12:23:56,731 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=289800.0, ans=0.035 2023-12-04 12:24:00,041 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=289866.6666666667, ans=0.125 2023-12-04 12:24:13,283 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=289933.3333333333, ans=0.125 2023-12-04 12:24:15,422 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=289933.3333333333, ans=0.125 2023-12-04 12:24:30,393 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=290000.0, ans=0.2 2023-12-04 12:24:30,416 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=290000.0, ans=0.125 2023-12-04 12:24:32,210 INFO [train.py:1087] (1/4) Epoch 49, batch 550, loss[loss=0.1502, simple_loss=0.2411, pruned_loss=0.02969, over 24725.00 frames. ], tot_loss[loss=0.1559, simple_loss=0.2484, pruned_loss=0.03171, over 4498732.21 frames. ], batch size: 61, lr: 5.00e-03, grad_scale: 16.0 2023-12-04 12:24:54,158 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=290200.0, ans=0.1 2023-12-04 12:24:58,458 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=290200.0, ans=0.05 2023-12-04 12:25:06,371 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=290266.6666666667, ans=0.1 2023-12-04 12:25:28,634 INFO [train.py:1087] (1/4) Epoch 49, batch 600, loss[loss=0.1969, simple_loss=0.2783, pruned_loss=0.05773, over 16869.00 frames. ], tot_loss[loss=0.1563, simple_loss=0.2486, pruned_loss=0.03199, over 4553376.29 frames. ], batch size: 177, lr: 5.00e-03, grad_scale: 16.0 2023-12-04 12:25:40,496 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.142e+02 1.365e+02 1.448e+02 1.602e+02 1.913e+02, threshold=2.897e+02, percent-clipped=0.0 2023-12-04 12:25:47,654 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.33 vs. 
limit=15.0 2023-12-04 12:25:50,616 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=290533.3333333333, ans=0.125 2023-12-04 12:26:00,339 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=290533.3333333333, ans=0.125 2023-12-04 12:26:05,631 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=290600.0, ans=0.07 2023-12-04 12:26:05,640 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=290600.0, ans=0.125 2023-12-04 12:26:06,570 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=290600.0, ans=0.125 2023-12-04 12:26:13,108 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=290666.6666666667, ans=0.125 2023-12-04 12:26:25,388 INFO [train.py:1087] (1/4) Epoch 49, batch 650, loss[loss=0.1508, simple_loss=0.2478, pruned_loss=0.02689, over 24706.00 frames. ], tot_loss[loss=0.156, simple_loss=0.2485, pruned_loss=0.03172, over 4626355.88 frames. ], batch size: 69, lr: 5.00e-03, grad_scale: 16.0 2023-12-04 12:26:44,457 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=290800.0, ans=0.0 2023-12-04 12:26:48,280 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=290866.6666666667, ans=0.125 2023-12-04 12:27:04,976 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=290933.3333333333, ans=0.125 2023-12-04 12:27:08,122 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=290933.3333333333, ans=0.125 2023-12-04 12:27:08,284 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=290933.3333333333, ans=0.125 2023-12-04 12:27:22,238 INFO [train.py:1087] (1/4) Epoch 49, batch 700, loss[loss=0.1597, simple_loss=0.2552, pruned_loss=0.03209, over 24752.00 frames. ], tot_loss[loss=0.1567, simple_loss=0.2489, pruned_loss=0.0322, over 4654238.16 frames. ], batch size: 70, lr: 4.99e-03, grad_scale: 16.0 2023-12-04 12:27:22,564 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=291066.6666666667, ans=0.0 2023-12-04 12:27:33,332 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.085e+02 1.253e+02 1.366e+02 1.468e+02 1.908e+02, threshold=2.731e+02, percent-clipped=0.0 2023-12-04 12:27:48,043 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=291200.0, ans=0.125 2023-12-04 12:27:48,136 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 12:28:04,879 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.71 vs. 
limit=22.5 2023-12-04 12:28:14,734 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=291333.3333333333, ans=0.0 2023-12-04 12:28:17,992 INFO [train.py:1087] (1/4) Epoch 49, batch 750, loss[loss=0.1697, simple_loss=0.2631, pruned_loss=0.03816, over 23459.00 frames. ], tot_loss[loss=0.1569, simple_loss=0.2491, pruned_loss=0.03231, over 4678002.28 frames. ], batch size: 94, lr: 4.99e-03, grad_scale: 16.0 2023-12-04 12:28:23,946 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=291400.0, ans=0.125 2023-12-04 12:28:28,254 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=291466.6666666667, ans=0.125 2023-12-04 12:28:34,218 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.68 vs. limit=15.0 2023-12-04 12:29:11,176 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.64 vs. limit=10.0 2023-12-04 12:29:13,775 INFO [train.py:1087] (1/4) Epoch 49, batch 800, loss[loss=0.1468, simple_loss=0.2434, pruned_loss=0.02507, over 24798.00 frames. ], tot_loss[loss=0.1562, simple_loss=0.2486, pruned_loss=0.03191, over 4715728.29 frames. ], batch size: 71, lr: 4.99e-03, grad_scale: 32.0 2023-12-04 12:29:20,909 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=291733.3333333333, ans=0.0 2023-12-04 12:29:24,827 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.099e+02 1.268e+02 1.351e+02 1.470e+02 2.062e+02, threshold=2.703e+02, percent-clipped=0.0 2023-12-04 12:29:25,315 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.97 vs. limit=15.0 2023-12-04 12:29:58,999 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=292000.0, ans=0.125 2023-12-04 12:30:05,892 INFO [train.py:1087] (1/4) Epoch 49, batch 850, loss[loss=0.1413, simple_loss=0.2347, pruned_loss=0.02396, over 24765.00 frames. ], tot_loss[loss=0.1566, simple_loss=0.2489, pruned_loss=0.03211, over 4729900.42 frames. ], batch size: 64, lr: 4.98e-03, grad_scale: 32.0 2023-12-04 12:30:07,148 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=292066.6666666667, ans=0.2 2023-12-04 12:30:16,661 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=12.0 2023-12-04 12:30:41,774 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=292266.6666666667, ans=0.2 2023-12-04 12:30:48,851 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=292333.3333333333, ans=0.125 2023-12-04 12:31:04,582 INFO [train.py:1087] (1/4) Epoch 50, batch 0, loss[loss=0.148, simple_loss=0.2428, pruned_loss=0.0266, over 24802.00 frames. ], tot_loss[loss=0.148, simple_loss=0.2428, pruned_loss=0.0266, over 24802.00 frames. 
], batch size: 72, lr: 4.93e-03, grad_scale: 32.0 2023-12-04 12:31:04,582 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 12:31:15,478 INFO [zipformer.py:1876] (1/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([3.3866, 2.3994, 2.8746, 3.0930], device='cuda:1') 2023-12-04 12:31:16,957 INFO [train.py:1119] (1/4) Epoch 50, validation: loss=0.1516, simple_loss=0.2496, pruned_loss=0.02681, over 944034.00 frames. 2023-12-04 12:31:16,958 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 12:31:18,672 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.84 vs. limit=15.0 2023-12-04 12:31:32,287 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.25 vs. limit=6.0 2023-12-04 12:31:32,634 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.104e+02 1.261e+02 1.358e+02 1.517e+02 2.252e+02, threshold=2.716e+02, percent-clipped=0.0 2023-12-04 12:31:36,826 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.83 vs. limit=22.5 2023-12-04 12:31:44,991 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=292500.0, ans=0.125 2023-12-04 12:31:51,984 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.87 vs. limit=12.0 2023-12-04 12:32:00,240 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=292633.3333333333, ans=0.2 2023-12-04 12:32:01,340 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=292633.3333333333, ans=0.0 2023-12-04 12:32:11,088 INFO [train.py:1087] (1/4) Epoch 50, batch 50, loss[loss=0.1577, simple_loss=0.2518, pruned_loss=0.03184, over 24471.00 frames. ], tot_loss[loss=0.1561, simple_loss=0.2487, pruned_loss=0.03172, over 1097337.14 frames. ], batch size: 75, lr: 4.93e-03, grad_scale: 32.0 2023-12-04 12:32:11,251 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=292700.0, ans=0.0 2023-12-04 12:32:46,896 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=292900.0, ans=0.125 2023-12-04 12:33:06,466 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=293033.3333333333, ans=15.0 2023-12-04 12:33:07,078 INFO [train.py:1087] (1/4) Epoch 50, batch 100, loss[loss=0.1474, simple_loss=0.2426, pruned_loss=0.02613, over 24568.00 frames. ], tot_loss[loss=0.157, simple_loss=0.2495, pruned_loss=0.03227, over 1889586.46 frames. 
], batch size: 64, lr: 4.93e-03, grad_scale: 32.0 2023-12-04 12:33:12,018 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=293033.3333333333, ans=0.125 2023-12-04 12:33:21,437 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=293100.0, ans=0.05 2023-12-04 12:33:23,421 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.081e+02 1.296e+02 1.400e+02 1.614e+02 2.724e+02, threshold=2.800e+02, percent-clipped=1.0 2023-12-04 12:33:33,540 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=293166.6666666667, ans=0.125 2023-12-04 12:33:33,644 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=293166.6666666667, ans=0.125 2023-12-04 12:33:46,372 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=293233.3333333333, ans=0.125 2023-12-04 12:33:50,787 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.14 vs. limit=15.0 2023-12-04 12:33:52,564 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=293300.0, ans=0.0 2023-12-04 12:33:59,536 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=293300.0, ans=0.2 2023-12-04 12:34:04,668 INFO [train.py:1087] (1/4) Epoch 50, batch 150, loss[loss=0.1502, simple_loss=0.2499, pruned_loss=0.02528, over 24793.00 frames. ], tot_loss[loss=0.1555, simple_loss=0.2483, pruned_loss=0.0313, over 2558106.22 frames. ], batch size: 73, lr: 4.92e-03, grad_scale: 32.0 2023-12-04 12:34:11,356 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=293366.6666666667, ans=0.125 2023-12-04 12:34:16,650 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=293433.3333333333, ans=0.125 2023-12-04 12:34:40,253 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=293566.6666666667, ans=0.0 2023-12-04 12:34:40,487 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.41 vs. limit=15.0 2023-12-04 12:34:59,660 INFO [train.py:1087] (1/4) Epoch 50, batch 200, loss[loss=0.1517, simple_loss=0.2381, pruned_loss=0.03261, over 24759.00 frames. ], tot_loss[loss=0.1557, simple_loss=0.2483, pruned_loss=0.03153, over 3062641.60 frames. ], batch size: 64, lr: 4.92e-03, grad_scale: 32.0 2023-12-04 12:35:06,898 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.15 vs. 
limit=15.0 2023-12-04 12:35:13,486 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=293766.6666666667, ans=0.2 2023-12-04 12:35:18,236 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.295e+02 1.378e+02 1.503e+02 2.109e+02, threshold=2.756e+02, percent-clipped=0.0 2023-12-04 12:35:18,768 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=293766.6666666667, ans=0.0 2023-12-04 12:35:30,853 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=293833.3333333333, ans=0.125 2023-12-04 12:35:53,043 INFO [scaling.py:1022] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.47 vs. limit=8.0 2023-12-04 12:36:06,786 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=293966.6666666667, ans=0.0 2023-12-04 12:36:09,827 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=293966.6666666667, ans=0.1 2023-12-04 12:36:12,404 INFO [train.py:1087] (1/4) Epoch 50, batch 250, loss[loss=0.1547, simple_loss=0.2484, pruned_loss=0.03044, over 24727.00 frames. ], tot_loss[loss=0.1553, simple_loss=0.2479, pruned_loss=0.03135, over 3460395.13 frames. ], batch size: 67, lr: 4.92e-03, grad_scale: 32.0 2023-12-04 12:36:32,366 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=294100.0, ans=0.0 2023-12-04 12:36:44,588 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=294166.6666666667, ans=0.125 2023-12-04 12:36:56,860 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=294233.3333333333, ans=0.125 2023-12-04 12:36:56,997 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=294233.3333333333, ans=0.125 2023-12-04 12:36:58,455 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=294233.3333333333, ans=0.2 2023-12-04 12:37:11,919 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.22 vs. limit=15.0 2023-12-04 12:37:12,754 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 12:37:12,977 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=294300.0, ans=0.125 2023-12-04 12:37:17,311 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=294300.0, ans=0.125 2023-12-04 12:37:28,453 INFO [train.py:1087] (1/4) Epoch 50, batch 300, loss[loss=0.1633, simple_loss=0.2525, pruned_loss=0.03711, over 24307.00 frames. ], tot_loss[loss=0.1558, simple_loss=0.2483, pruned_loss=0.03167, over 3766269.58 frames. 
], batch size: 79, lr: 4.92e-03, grad_scale: 16.0 2023-12-04 12:37:44,351 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=294433.3333333333, ans=0.1 2023-12-04 12:37:52,382 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.144e+02 1.315e+02 1.430e+02 1.594e+02 2.206e+02, threshold=2.860e+02, percent-clipped=0.0 2023-12-04 12:37:54,146 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=294433.3333333333, ans=0.1 2023-12-04 12:38:04,310 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=294500.0, ans=0.125 2023-12-04 12:38:25,215 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=294566.6666666667, ans=0.0 2023-12-04 12:38:26,774 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=294566.6666666667, ans=0.125 2023-12-04 12:38:44,318 INFO [train.py:1087] (1/4) Epoch 50, batch 350, loss[loss=0.1375, simple_loss=0.2307, pruned_loss=0.02219, over 24746.00 frames. ], tot_loss[loss=0.1562, simple_loss=0.2485, pruned_loss=0.03194, over 3988085.46 frames. ], batch size: 66, lr: 4.91e-03, grad_scale: 16.0 2023-12-04 12:39:29,723 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=294900.0, ans=0.0 2023-12-04 12:39:40,694 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=294900.0, ans=0.125 2023-12-04 12:40:01,231 INFO [train.py:1087] (1/4) Epoch 50, batch 400, loss[loss=0.1632, simple_loss=0.2595, pruned_loss=0.03344, over 21731.00 frames. ], tot_loss[loss=0.156, simple_loss=0.2483, pruned_loss=0.03181, over 4178820.36 frames. ], batch size: 52, lr: 4.91e-03, grad_scale: 32.0 2023-12-04 12:40:07,792 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.92 vs. limit=10.0 2023-12-04 12:40:13,482 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=295033.3333333333, ans=0.2 2023-12-04 12:40:18,392 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.70 vs. limit=12.0 2023-12-04 12:40:18,638 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-12-04 12:40:25,338 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=295100.0, ans=0.125 2023-12-04 12:40:26,392 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.119e+02 1.284e+02 1.371e+02 1.478e+02 1.721e+02, threshold=2.742e+02, percent-clipped=0.0 2023-12-04 12:41:16,232 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=295366.6666666667, ans=0.125 2023-12-04 12:41:17,320 INFO [train.py:1087] (1/4) Epoch 50, batch 450, loss[loss=0.163, simple_loss=0.2585, pruned_loss=0.03372, over 22927.00 frames. ], tot_loss[loss=0.1559, simple_loss=0.2484, pruned_loss=0.03173, over 4327560.58 frames. 
], batch size: 106, lr: 4.91e-03, grad_scale: 32.0 2023-12-04 12:41:37,647 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=295433.3333333333, ans=0.0 2023-12-04 12:41:40,438 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=295433.3333333333, ans=0.04949747468305833 2023-12-04 12:42:19,659 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.57 vs. limit=10.0 2023-12-04 12:42:33,094 INFO [train.py:1087] (1/4) Epoch 50, batch 500, loss[loss=0.1799, simple_loss=0.2629, pruned_loss=0.04847, over 16657.00 frames. ], tot_loss[loss=0.1558, simple_loss=0.2481, pruned_loss=0.03178, over 4420806.61 frames. ], batch size: 177, lr: 4.90e-03, grad_scale: 16.0 2023-12-04 12:42:33,957 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.75 vs. limit=15.0 2023-12-04 12:42:53,291 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=295766.6666666667, ans=0.125 2023-12-04 12:42:54,975 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 12:42:57,585 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=295766.6666666667, ans=0.1 2023-12-04 12:42:58,529 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.093e+02 1.256e+02 1.344e+02 1.428e+02 2.057e+02, threshold=2.688e+02, percent-clipped=0.0 2023-12-04 12:43:49,680 INFO [train.py:1087] (1/4) Epoch 50, batch 550, loss[loss=0.1511, simple_loss=0.2416, pruned_loss=0.03032, over 24553.00 frames. ], tot_loss[loss=0.1558, simple_loss=0.2481, pruned_loss=0.03178, over 4506625.77 frames. ], batch size: 62, lr: 4.90e-03, grad_scale: 16.0 2023-12-04 12:43:51,549 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=296033.3333333333, ans=0.0 2023-12-04 12:44:03,345 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-12-04 12:44:15,571 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=296100.0, ans=0.0 2023-12-04 12:44:25,296 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.66 vs. limit=12.0 2023-12-04 12:44:53,596 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.95 vs. limit=12.0 2023-12-04 12:45:06,826 INFO [train.py:1087] (1/4) Epoch 50, batch 600, loss[loss=0.1427, simple_loss=0.2352, pruned_loss=0.02506, over 24772.00 frames. ], tot_loss[loss=0.1558, simple_loss=0.2481, pruned_loss=0.03177, over 4590883.58 frames. 
], batch size: 71, lr: 4.90e-03, grad_scale: 16.0 2023-12-04 12:45:33,643 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.147e+02 1.304e+02 1.379e+02 1.513e+02 2.002e+02, threshold=2.759e+02, percent-clipped=0.0 2023-12-04 12:46:14,678 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=296633.3333333333, ans=0.125 2023-12-04 12:46:24,155 INFO [train.py:1087] (1/4) Epoch 50, batch 650, loss[loss=0.1506, simple_loss=0.2398, pruned_loss=0.03069, over 24732.00 frames. ], tot_loss[loss=0.1553, simple_loss=0.2477, pruned_loss=0.03149, over 4656801.30 frames. ], batch size: 67, lr: 4.90e-03, grad_scale: 16.0 2023-12-04 12:46:40,011 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=296766.6666666667, ans=0.125 2023-12-04 12:46:49,698 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=296766.6666666667, ans=0.125 2023-12-04 12:46:52,557 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=296766.6666666667, ans=0.0 2023-12-04 12:46:58,416 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=296833.3333333333, ans=0.125 2023-12-04 12:47:01,124 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=296833.3333333333, ans=0.0 2023-12-04 12:47:41,723 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=297033.3333333333, ans=0.0 2023-12-04 12:47:42,708 INFO [train.py:1087] (1/4) Epoch 50, batch 700, loss[loss=0.1448, simple_loss=0.2342, pruned_loss=0.02763, over 24545.00 frames. ], tot_loss[loss=0.1559, simple_loss=0.2481, pruned_loss=0.03181, over 4672069.51 frames. ], batch size: 62, lr: 4.89e-03, grad_scale: 16.0 2023-12-04 12:48:06,631 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=297100.0, ans=0.0 2023-12-04 12:48:09,140 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.064e+02 1.270e+02 1.347e+02 1.446e+02 1.862e+02, threshold=2.693e+02, percent-clipped=0.0 2023-12-04 12:48:41,164 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.05 vs. limit=12.0 2023-12-04 12:49:01,116 INFO [train.py:1087] (1/4) Epoch 50, batch 750, loss[loss=0.1778, simple_loss=0.2614, pruned_loss=0.04711, over 16813.00 frames. ], tot_loss[loss=0.1559, simple_loss=0.2483, pruned_loss=0.03178, over 4709836.37 frames. ], batch size: 177, lr: 4.89e-03, grad_scale: 16.0 2023-12-04 12:49:05,323 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=297366.6666666667, ans=0.04949747468305833 2023-12-04 12:49:42,226 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=297500.0, ans=0.0 2023-12-04 12:50:13,662 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=297633.3333333333, ans=0.2 2023-12-04 12:50:17,580 INFO [train.py:1087] (1/4) Epoch 50, batch 800, loss[loss=0.1584, simple_loss=0.2542, pruned_loss=0.03132, over 24605.00 frames. 
], tot_loss[loss=0.1556, simple_loss=0.2481, pruned_loss=0.03157, over 4736836.71 frames. ], batch size: 68, lr: 4.89e-03, grad_scale: 32.0 2023-12-04 12:50:36,214 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=297766.6666666667, ans=0.1 2023-12-04 12:50:38,749 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=297766.6666666667, ans=0.2 2023-12-04 12:50:42,508 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.135e+02 1.327e+02 1.412e+02 1.549e+02 1.936e+02, threshold=2.824e+02, percent-clipped=0.0 2023-12-04 12:50:43,262 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.44 vs. limit=15.0 2023-12-04 12:50:55,906 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.60 vs. limit=22.5 2023-12-04 12:51:00,928 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=297900.0, ans=0.1 2023-12-04 12:51:26,886 INFO [train.py:1087] (1/4) Epoch 50, batch 850, loss[loss=0.1627, simple_loss=0.2501, pruned_loss=0.03764, over 24259.00 frames. ], tot_loss[loss=0.156, simple_loss=0.2484, pruned_loss=0.0318, over 4739743.92 frames. ], batch size: 82, lr: 4.89e-03, grad_scale: 32.0 2023-12-04 12:51:33,775 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=298033.3333333333, ans=0.125 2023-12-04 12:51:34,062 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-12-04 12:52:11,509 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.23 vs. limit=15.0 2023-12-04 12:52:22,823 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.34 vs. limit=15.0 2023-12-04 12:52:45,043 INFO [train.py:1087] (1/4) Epoch 51, batch 0, loss[loss=0.1484, simple_loss=0.2423, pruned_loss=0.02721, over 24804.00 frames. ], tot_loss[loss=0.1484, simple_loss=0.2423, pruned_loss=0.02721, over 24804.00 frames. ], batch size: 62, lr: 4.83e-03, grad_scale: 32.0 2023-12-04 12:52:45,045 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 12:53:01,898 INFO [train.py:1119] (1/4) Epoch 51, validation: loss=0.1517, simple_loss=0.2496, pruned_loss=0.02685, over 944034.00 frames. 2023-12-04 12:53:01,900 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 12:53:11,255 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.42 vs. limit=15.0 2023-12-04 12:53:31,220 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 12:53:35,309 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.039e+02 1.298e+02 1.401e+02 1.598e+02 2.046e+02, threshold=2.803e+02, percent-clipped=0.0 2023-12-04 12:53:48,758 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.81 vs. 
limit=15.0 2023-12-04 12:53:48,775 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=4.97 vs. limit=15.0 2023-12-04 12:54:19,899 INFO [train.py:1087] (1/4) Epoch 51, batch 50, loss[loss=0.1519, simple_loss=0.2477, pruned_loss=0.02808, over 24758.00 frames. ], tot_loss[loss=0.1565, simple_loss=0.2494, pruned_loss=0.03184, over 1087528.59 frames. ], batch size: 66, lr: 4.83e-03, grad_scale: 32.0 2023-12-04 12:55:11,936 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.92 vs. limit=15.0 2023-12-04 12:55:18,751 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=298866.6666666667, ans=0.0 2023-12-04 12:55:36,395 INFO [train.py:1087] (1/4) Epoch 51, batch 100, loss[loss=0.1449, simple_loss=0.2434, pruned_loss=0.02317, over 24772.00 frames. ], tot_loss[loss=0.1556, simple_loss=0.2487, pruned_loss=0.0312, over 1915487.48 frames. ], batch size: 71, lr: 4.83e-03, grad_scale: 32.0 2023-12-04 12:55:56,529 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=299066.6666666667, ans=0.2 2023-12-04 12:56:10,770 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.079e+02 1.247e+02 1.341e+02 1.436e+02 1.925e+02, threshold=2.681e+02, percent-clipped=0.0 2023-12-04 12:56:32,936 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=299200.0, ans=0.125 2023-12-04 12:56:51,034 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=299266.6666666667, ans=0.125 2023-12-04 12:56:53,535 INFO [train.py:1087] (1/4) Epoch 51, batch 150, loss[loss=0.1638, simple_loss=0.2609, pruned_loss=0.03332, over 24158.00 frames. ], tot_loss[loss=0.1555, simple_loss=0.2484, pruned_loss=0.03131, over 2559064.62 frames. ], batch size: 58, lr: 4.83e-03, grad_scale: 32.0 2023-12-04 12:57:02,765 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=14.53 vs. limit=22.5 2023-12-04 12:57:16,470 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.71 vs. limit=15.0 2023-12-04 12:57:54,256 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=299600.0, ans=0.125 2023-12-04 12:57:59,260 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.31 vs. limit=15.0 2023-12-04 12:58:08,688 INFO [train.py:1087] (1/4) Epoch 51, batch 200, loss[loss=0.1504, simple_loss=0.2432, pruned_loss=0.02884, over 24759.00 frames. ], tot_loss[loss=0.1548, simple_loss=0.2475, pruned_loss=0.031, over 3063819.66 frames. 
], batch size: 66, lr: 4.82e-03, grad_scale: 32.0 2023-12-04 12:58:14,084 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=299666.6666666667, ans=0.125 2023-12-04 12:58:25,580 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=299733.3333333333, ans=0.2 2023-12-04 12:58:35,743 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.02 vs. limit=10.0 2023-12-04 12:58:42,722 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=299800.0, ans=0.5 2023-12-04 12:58:43,677 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.103e+02 1.285e+02 1.368e+02 1.485e+02 1.874e+02, threshold=2.735e+02, percent-clipped=0.0 2023-12-04 12:59:22,658 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=300000.0, ans=0.0 2023-12-04 12:59:24,423 INFO [train.py:1087] (1/4) Epoch 51, batch 250, loss[loss=0.1601, simple_loss=0.249, pruned_loss=0.03561, over 24538.00 frames. ], tot_loss[loss=0.1553, simple_loss=0.248, pruned_loss=0.03126, over 3446456.57 frames. ], batch size: 63, lr: 4.82e-03, grad_scale: 16.0 2023-12-04 12:59:26,128 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=300000.0, ans=0.0 2023-12-04 13:00:17,221 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=300200.0, ans=0.5 2023-12-04 13:00:18,722 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=300200.0, ans=0.1 2023-12-04 13:00:20,271 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=300200.0, ans=0.1 2023-12-04 13:00:41,659 INFO [train.py:1087] (1/4) Epoch 51, batch 300, loss[loss=0.1773, simple_loss=0.2669, pruned_loss=0.04385, over 23406.00 frames. ], tot_loss[loss=0.1553, simple_loss=0.2478, pruned_loss=0.03135, over 3754586.91 frames. ], batch size: 94, lr: 4.82e-03, grad_scale: 16.0 2023-12-04 13:00:47,294 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=300333.3333333333, ans=0.1 2023-12-04 13:00:50,687 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.07 vs. limit=15.0 2023-12-04 13:01:16,433 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.141e+02 1.302e+02 1.405e+02 1.500e+02 2.125e+02, threshold=2.811e+02, percent-clipped=0.0 2023-12-04 13:01:18,882 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.14 vs. limit=15.0 2023-12-04 13:01:24,443 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=300466.6666666667, ans=0.07 2023-12-04 13:01:57,785 INFO [train.py:1087] (1/4) Epoch 51, batch 350, loss[loss=0.1565, simple_loss=0.2494, pruned_loss=0.03181, over 24549.00 frames. ], tot_loss[loss=0.1556, simple_loss=0.2479, pruned_loss=0.03163, over 3982368.87 frames. 
], batch size: 62, lr: 4.82e-03, grad_scale: 16.0 2023-12-04 13:02:34,843 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.25 vs. limit=22.5 2023-12-04 13:02:43,596 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=300866.6666666667, ans=0.2 2023-12-04 13:02:48,461 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=300866.6666666667, ans=0.0 2023-12-04 13:02:49,831 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=300866.6666666667, ans=0.0 2023-12-04 13:02:51,196 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=300866.6666666667, ans=0.09899494936611666 2023-12-04 13:02:55,091 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 13:03:05,653 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=300933.3333333333, ans=0.2 2023-12-04 13:03:11,345 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=300933.3333333333, ans=0.1 2023-12-04 13:03:14,074 INFO [train.py:1087] (1/4) Epoch 51, batch 400, loss[loss=0.1501, simple_loss=0.2445, pruned_loss=0.02782, over 24716.00 frames. ], tot_loss[loss=0.1553, simple_loss=0.2479, pruned_loss=0.0314, over 4176604.96 frames. ], batch size: 67, lr: 4.81e-03, grad_scale: 32.0 2023-12-04 13:03:44,040 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=301133.3333333333, ans=0.125 2023-12-04 13:03:48,298 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 13:03:49,709 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.103e+02 1.264e+02 1.361e+02 1.495e+02 2.080e+02, threshold=2.722e+02, percent-clipped=0.0 2023-12-04 13:04:08,791 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=301200.0, ans=10.0 2023-12-04 13:04:30,748 INFO [train.py:1087] (1/4) Epoch 51, batch 450, loss[loss=0.1495, simple_loss=0.2441, pruned_loss=0.02742, over 24753.00 frames. ], tot_loss[loss=0.1554, simple_loss=0.2478, pruned_loss=0.03149, over 4314751.71 frames. ], batch size: 65, lr: 4.81e-03, grad_scale: 32.0 2023-12-04 13:04:32,975 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.91 vs. 
limit=15.0 2023-12-04 13:04:34,133 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=301333.3333333333, ans=0.125 2023-12-04 13:04:45,651 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=301400.0, ans=0.0 2023-12-04 13:04:45,677 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=301400.0, ans=0.125 2023-12-04 13:04:46,846 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=301400.0, ans=0.1 2023-12-04 13:05:18,714 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=301533.3333333333, ans=0.125 2023-12-04 13:05:34,230 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=301600.0, ans=0.2 2023-12-04 13:05:43,921 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.95 vs. limit=15.0 2023-12-04 13:05:46,811 INFO [train.py:1087] (1/4) Epoch 51, batch 500, loss[loss=0.1505, simple_loss=0.2489, pruned_loss=0.02599, over 24565.00 frames. ], tot_loss[loss=0.1553, simple_loss=0.2477, pruned_loss=0.03149, over 4434565.24 frames. ], batch size: 66, lr: 4.81e-03, grad_scale: 32.0 2023-12-04 13:05:55,491 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=301666.6666666667, ans=0.125 2023-12-04 13:06:12,990 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=301733.3333333333, ans=0.125 2023-12-04 13:06:21,503 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.103e+02 1.256e+02 1.336e+02 1.463e+02 1.740e+02, threshold=2.671e+02, percent-clipped=0.0 2023-12-04 13:06:32,133 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=301866.6666666667, ans=0.2 2023-12-04 13:06:49,547 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=301933.3333333333, ans=0.125 2023-12-04 13:06:52,252 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=301933.3333333333, ans=0.0 2023-12-04 13:06:58,276 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 13:07:03,739 INFO [train.py:1087] (1/4) Epoch 51, batch 550, loss[loss=0.1387, simple_loss=0.2289, pruned_loss=0.02426, over 24732.00 frames. ], tot_loss[loss=0.1551, simple_loss=0.2475, pruned_loss=0.03131, over 4503568.41 frames. ], batch size: 67, lr: 4.81e-03, grad_scale: 32.0 2023-12-04 13:07:19,348 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=302066.6666666667, ans=0.05 2023-12-04 13:07:25,762 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=302066.6666666667, ans=0.125 2023-12-04 13:07:36,204 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.63 vs. 
limit=15.0 2023-12-04 13:07:56,913 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=302200.0, ans=0.125 2023-12-04 13:07:56,948 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=302200.0, ans=0.0 2023-12-04 13:08:11,854 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=302266.6666666667, ans=0.0 2023-12-04 13:08:21,969 INFO [train.py:1087] (1/4) Epoch 51, batch 600, loss[loss=0.1438, simple_loss=0.2378, pruned_loss=0.02488, over 24709.00 frames. ], tot_loss[loss=0.1553, simple_loss=0.2476, pruned_loss=0.03152, over 4558769.44 frames. ], batch size: 74, lr: 4.80e-03, grad_scale: 32.0 2023-12-04 13:08:22,311 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=302333.3333333333, ans=0.2 2023-12-04 13:08:45,844 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=302400.0, ans=0.125 2023-12-04 13:08:58,256 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.169e+02 1.269e+02 1.388e+02 1.458e+02 2.113e+02, threshold=2.776e+02, percent-clipped=0.0 2023-12-04 13:09:00,046 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=302466.6666666667, ans=0.125 2023-12-04 13:09:01,286 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=302466.6666666667, ans=0.125 2023-12-04 13:09:27,303 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=302600.0, ans=0.2 2023-12-04 13:09:40,048 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=302666.6666666667, ans=0.1 2023-12-04 13:09:41,625 INFO [train.py:1087] (1/4) Epoch 51, batch 650, loss[loss=0.1406, simple_loss=0.2353, pruned_loss=0.02296, over 24754.00 frames. ], tot_loss[loss=0.1549, simple_loss=0.2472, pruned_loss=0.03133, over 4601426.51 frames. ], batch size: 64, lr: 4.80e-03, grad_scale: 32.0 2023-12-04 13:10:01,209 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=302733.3333333333, ans=0.125 2023-12-04 13:10:07,762 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=302733.3333333333, ans=0.0 2023-12-04 13:10:09,198 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=302733.3333333333, ans=0.1 2023-12-04 13:10:22,208 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=302800.0, ans=0.0 2023-12-04 13:10:26,628 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=302866.6666666667, ans=0.0 2023-12-04 13:10:58,633 INFO [train.py:1087] (1/4) Epoch 51, batch 700, loss[loss=0.1495, simple_loss=0.2393, pruned_loss=0.02987, over 24764.00 frames. ], tot_loss[loss=0.1549, simple_loss=0.2473, pruned_loss=0.03125, over 4651107.03 frames. 
], batch size: 64, lr: 4.80e-03, grad_scale: 32.0 2023-12-04 13:10:59,196 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=303000.0, ans=0.125 2023-12-04 13:11:33,732 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.156e+02 1.280e+02 1.383e+02 1.505e+02 1.927e+02, threshold=2.765e+02, percent-clipped=0.0 2023-12-04 13:12:10,420 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=303266.6666666667, ans=0.0 2023-12-04 13:12:11,912 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=303266.6666666667, ans=0.1 2023-12-04 13:12:16,997 INFO [train.py:1087] (1/4) Epoch 51, batch 750, loss[loss=0.1588, simple_loss=0.2504, pruned_loss=0.03364, over 24746.00 frames. ], tot_loss[loss=0.1549, simple_loss=0.2474, pruned_loss=0.03116, over 4678513.67 frames. ], batch size: 66, lr: 4.80e-03, grad_scale: 32.0 2023-12-04 13:12:23,537 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=303333.3333333333, ans=0.1 2023-12-04 13:12:50,223 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=15.0 2023-12-04 13:13:13,271 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=303533.3333333333, ans=10.0 2023-12-04 13:13:35,050 INFO [train.py:1087] (1/4) Epoch 51, batch 800, loss[loss=0.1579, simple_loss=0.2522, pruned_loss=0.03179, over 24265.00 frames. ], tot_loss[loss=0.1555, simple_loss=0.248, pruned_loss=0.03148, over 4673160.80 frames. ], batch size: 79, lr: 4.79e-03, grad_scale: 32.0 2023-12-04 13:14:09,423 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.091e+02 1.291e+02 1.399e+02 1.523e+02 2.071e+02, threshold=2.798e+02, percent-clipped=0.0 2023-12-04 13:14:25,200 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=303866.6666666667, ans=0.2 2023-12-04 13:14:46,095 INFO [train.py:1087] (1/4) Epoch 51, batch 850, loss[loss=0.1517, simple_loss=0.2441, pruned_loss=0.02963, over 24793.00 frames. ], tot_loss[loss=0.1552, simple_loss=0.2477, pruned_loss=0.0313, over 4712131.04 frames. ], batch size: 72, lr: 4.79e-03, grad_scale: 32.0 2023-12-04 13:14:58,958 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=304066.6666666667, ans=0.5 2023-12-04 13:15:19,952 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=304133.3333333333, ans=0.2 2023-12-04 13:16:09,177 INFO [train.py:1087] (1/4) Epoch 52, batch 0, loss[loss=0.1511, simple_loss=0.2453, pruned_loss=0.02844, over 24542.00 frames. ], tot_loss[loss=0.1511, simple_loss=0.2453, pruned_loss=0.02844, over 24542.00 frames. ], batch size: 62, lr: 4.74e-03, grad_scale: 32.0 2023-12-04 13:16:09,180 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 13:16:25,865 INFO [train.py:1119] (1/4) Epoch 52, validation: loss=0.1515, simple_loss=0.2494, pruned_loss=0.02683, over 944034.00 frames. 
2023-12-04 13:16:25,866 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 13:16:26,156 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 13:16:55,436 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=304433.3333333333, ans=0.2 2023-12-04 13:17:01,272 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=304433.3333333333, ans=0.1 2023-12-04 13:17:01,332 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=304433.3333333333, ans=0.2 2023-12-04 13:17:11,988 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.128e+02 1.264e+02 1.362e+02 1.496e+02 2.534e+02, threshold=2.723e+02, percent-clipped=0.0 2023-12-04 13:17:12,394 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=304500.0, ans=0.0 2023-12-04 13:17:43,851 INFO [train.py:1087] (1/4) Epoch 52, batch 50, loss[loss=0.1573, simple_loss=0.2489, pruned_loss=0.03288, over 24563.00 frames. ], tot_loss[loss=0.1552, simple_loss=0.2477, pruned_loss=0.03136, over 1092298.08 frames. ], batch size: 63, lr: 4.74e-03, grad_scale: 16.0 2023-12-04 13:18:22,239 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=304766.6666666667, ans=0.0 2023-12-04 13:19:03,479 INFO [train.py:1087] (1/4) Epoch 52, batch 100, loss[loss=0.1482, simple_loss=0.2426, pruned_loss=0.02689, over 24797.00 frames. ], tot_loss[loss=0.155, simple_loss=0.2475, pruned_loss=0.03123, over 1916010.83 frames. ], batch size: 62, lr: 4.74e-03, grad_scale: 16.0 2023-12-04 13:19:05,549 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=304966.6666666667, ans=0.125 2023-12-04 13:19:12,497 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.51 vs. limit=22.5 2023-12-04 13:19:16,240 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=304966.6666666667, ans=0.0 2023-12-04 13:19:34,100 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=305100.0, ans=0.1 2023-12-04 13:19:40,292 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=305100.0, ans=0.125 2023-12-04 13:19:50,784 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.119e+02 1.256e+02 1.354e+02 1.466e+02 2.724e+02, threshold=2.709e+02, percent-clipped=1.0 2023-12-04 13:19:51,575 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.99 vs. limit=15.0 2023-12-04 13:20:22,233 INFO [train.py:1087] (1/4) Epoch 52, batch 150, loss[loss=0.1592, simple_loss=0.2561, pruned_loss=0.03117, over 24767.00 frames. ], tot_loss[loss=0.155, simple_loss=0.2476, pruned_loss=0.03118, over 2569260.97 frames. 
], batch size: 65, lr: 4.73e-03, grad_scale: 16.0 2023-12-04 13:20:29,528 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=305300.0, ans=0.0 2023-12-04 13:20:32,514 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=305300.0, ans=0.125 2023-12-04 13:20:51,470 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.87 vs. limit=6.0 2023-12-04 13:21:41,571 INFO [train.py:1087] (1/4) Epoch 52, batch 200, loss[loss=0.1417, simple_loss=0.2362, pruned_loss=0.02365, over 24759.00 frames. ], tot_loss[loss=0.1548, simple_loss=0.2474, pruned_loss=0.0311, over 3060947.12 frames. ], batch size: 65, lr: 4.73e-03, grad_scale: 16.0 2023-12-04 13:21:48,285 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.10 vs. limit=15.0 2023-12-04 13:22:28,493 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.138e+02 1.310e+02 1.423e+02 1.525e+02 1.849e+02, threshold=2.846e+02, percent-clipped=0.0 2023-12-04 13:22:34,863 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=305833.3333333333, ans=0.1 2023-12-04 13:23:00,026 INFO [train.py:1087] (1/4) Epoch 52, batch 250, loss[loss=0.1568, simple_loss=0.2478, pruned_loss=0.03289, over 24806.00 frames. ], tot_loss[loss=0.155, simple_loss=0.2475, pruned_loss=0.03119, over 3444970.49 frames. ], batch size: 62, lr: 4.73e-03, grad_scale: 16.0 2023-12-04 13:23:02,431 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.31 vs. limit=15.0 2023-12-04 13:23:03,616 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=305966.6666666667, ans=0.125 2023-12-04 13:23:08,257 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.15 vs. limit=15.0 2023-12-04 13:23:09,555 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 13:23:12,228 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 13:23:18,504 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=306033.3333333333, ans=0.0 2023-12-04 13:23:20,137 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=306033.3333333333, ans=0.125 2023-12-04 13:23:20,727 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.71 vs. 
limit=10.0 2023-12-04 13:23:23,705 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=306033.3333333333, ans=0.0 2023-12-04 13:23:25,224 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=306033.3333333333, ans=0.125 2023-12-04 13:23:25,301 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 13:23:26,677 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=306033.3333333333, ans=0.04949747468305833 2023-12-04 13:24:06,328 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.18 vs. limit=15.0 2023-12-04 13:24:18,665 INFO [train.py:1087] (1/4) Epoch 52, batch 300, loss[loss=0.1491, simple_loss=0.2416, pruned_loss=0.0283, over 24852.00 frames. ], tot_loss[loss=0.1553, simple_loss=0.2478, pruned_loss=0.03142, over 3739102.73 frames. ], batch size: 68, lr: 4.73e-03, grad_scale: 16.0 2023-12-04 13:24:29,535 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=306300.0, ans=0.04949747468305833 2023-12-04 13:24:33,671 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=306366.6666666667, ans=0.125 2023-12-04 13:24:44,362 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-12-04 13:25:04,142 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=306500.0, ans=0.125 2023-12-04 13:25:04,177 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=306500.0, ans=0.0 2023-12-04 13:25:05,176 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.095e+02 1.271e+02 1.354e+02 1.459e+02 1.877e+02, threshold=2.709e+02, percent-clipped=0.0 2023-12-04 13:25:23,670 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.10 vs. limit=22.5 2023-12-04 13:25:36,095 INFO [train.py:1087] (1/4) Epoch 52, batch 350, loss[loss=0.1549, simple_loss=0.251, pruned_loss=0.02942, over 24780.00 frames. ], tot_loss[loss=0.1552, simple_loss=0.2476, pruned_loss=0.03146, over 3970791.43 frames. ], batch size: 70, lr: 4.72e-03, grad_scale: 16.0 2023-12-04 13:26:20,870 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=306766.6666666667, ans=0.125 2023-12-04 13:26:22,348 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=306833.3333333333, ans=0.125 2023-12-04 13:26:38,983 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.56 vs. limit=22.5 2023-12-04 13:26:55,395 INFO [train.py:1087] (1/4) Epoch 52, batch 400, loss[loss=0.1504, simple_loss=0.2406, pruned_loss=0.03011, over 24576.00 frames. ], tot_loss[loss=0.1551, simple_loss=0.2474, pruned_loss=0.03144, over 4149816.14 frames. 
], batch size: 65, lr: 4.72e-03, grad_scale: 32.0 2023-12-04 13:27:17,230 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=307033.3333333333, ans=0.0 2023-12-04 13:27:34,680 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=307100.0, ans=0.125 2023-12-04 13:27:43,129 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.137e+02 1.294e+02 1.381e+02 1.531e+02 2.056e+02, threshold=2.762e+02, percent-clipped=0.0 2023-12-04 13:28:05,602 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=307233.3333333333, ans=0.0 2023-12-04 13:28:07,230 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=307233.3333333333, ans=0.125 2023-12-04 13:28:14,808 INFO [train.py:1087] (1/4) Epoch 52, batch 450, loss[loss=0.1485, simple_loss=0.2389, pruned_loss=0.02898, over 24793.00 frames. ], tot_loss[loss=0.1557, simple_loss=0.2478, pruned_loss=0.03182, over 4266009.17 frames. ], batch size: 73, lr: 4.72e-03, grad_scale: 32.0 2023-12-04 13:28:18,542 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.85 vs. limit=22.5 2023-12-04 13:28:46,878 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=307433.3333333333, ans=0.0 2023-12-04 13:28:52,953 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=307433.3333333333, ans=0.125 2023-12-04 13:29:31,007 INFO [train.py:1087] (1/4) Epoch 52, batch 500, loss[loss=0.1488, simple_loss=0.2404, pruned_loss=0.02862, over 24547.00 frames. ], tot_loss[loss=0.1554, simple_loss=0.2475, pruned_loss=0.03163, over 4388332.48 frames. ], batch size: 63, lr: 4.72e-03, grad_scale: 32.0 2023-12-04 13:29:32,897 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=307633.3333333333, ans=0.2 2023-12-04 13:29:44,344 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=307633.3333333333, ans=0.125 2023-12-04 13:30:10,366 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=307766.6666666667, ans=0.125 2023-12-04 13:30:17,940 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.139e+02 1.320e+02 1.466e+02 1.634e+02 2.572e+02, threshold=2.932e+02, percent-clipped=0.0 2023-12-04 13:30:20,383 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.45 vs. 
limit=15.0 2023-12-04 13:30:26,550 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=307833.3333333333, ans=0.125 2023-12-04 13:30:28,328 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=307833.3333333333, ans=0.04949747468305833 2023-12-04 13:30:29,871 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=307833.3333333333, ans=0.125 2023-12-04 13:30:40,879 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=307900.0, ans=0.0 2023-12-04 13:30:50,169 INFO [train.py:1087] (1/4) Epoch 52, batch 550, loss[loss=0.1432, simple_loss=0.2376, pruned_loss=0.02438, over 24572.00 frames. ], tot_loss[loss=0.1554, simple_loss=0.2476, pruned_loss=0.0316, over 4483052.90 frames. ], batch size: 64, lr: 4.71e-03, grad_scale: 32.0 2023-12-04 13:30:57,128 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.28 vs. limit=10.0 2023-12-04 13:31:25,635 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=308100.0, ans=0.0 2023-12-04 13:31:27,048 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=308100.0, ans=0.04949747468305833 2023-12-04 13:31:36,310 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=308166.6666666667, ans=0.0 2023-12-04 13:31:42,198 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=308166.6666666667, ans=0.2 2023-12-04 13:31:46,495 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=308166.6666666667, ans=0.1 2023-12-04 13:31:49,872 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=308166.6666666667, ans=0.125 2023-12-04 13:32:07,349 INFO [train.py:1087] (1/4) Epoch 52, batch 600, loss[loss=0.1369, simple_loss=0.2291, pruned_loss=0.02233, over 24572.00 frames. ], tot_loss[loss=0.1548, simple_loss=0.2472, pruned_loss=0.03116, over 4565370.13 frames. ], batch size: 64, lr: 4.71e-03, grad_scale: 32.0 2023-12-04 13:32:35,792 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.90 vs. limit=22.5 2023-12-04 13:32:54,892 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.121e+02 1.254e+02 1.324e+02 1.412e+02 1.879e+02, threshold=2.648e+02, percent-clipped=0.0 2023-12-04 13:33:00,306 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=308500.0, ans=0.2 2023-12-04 13:33:22,601 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=308566.6666666667, ans=0.1 2023-12-04 13:33:25,188 INFO [train.py:1087] (1/4) Epoch 52, batch 650, loss[loss=0.1494, simple_loss=0.2392, pruned_loss=0.02977, over 24570.00 frames. ], tot_loss[loss=0.1543, simple_loss=0.2469, pruned_loss=0.03091, over 4632643.75 frames. 
], batch size: 64, lr: 4.71e-03, grad_scale: 16.0 2023-12-04 13:33:25,684 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=308633.3333333333, ans=0.0 2023-12-04 13:33:30,270 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=308633.3333333333, ans=0.125 2023-12-04 13:33:43,259 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=308700.0, ans=0.1 2023-12-04 13:34:31,091 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=308900.0, ans=0.125 2023-12-04 13:34:43,627 INFO [train.py:1087] (1/4) Epoch 52, batch 700, loss[loss=0.149, simple_loss=0.2423, pruned_loss=0.02787, over 24566.00 frames. ], tot_loss[loss=0.1544, simple_loss=0.2469, pruned_loss=0.031, over 4658076.46 frames. ], batch size: 64, lr: 4.71e-03, grad_scale: 16.0 2023-12-04 13:34:50,319 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=308966.6666666667, ans=0.0 2023-12-04 13:35:25,155 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=309100.0, ans=0.1 2023-12-04 13:35:26,849 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=309100.0, ans=0.125 2023-12-04 13:35:32,667 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.080e+02 1.284e+02 1.379e+02 1.528e+02 1.943e+02, threshold=2.757e+02, percent-clipped=0.0 2023-12-04 13:35:40,676 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=309166.6666666667, ans=0.125 2023-12-04 13:36:03,294 INFO [train.py:1087] (1/4) Epoch 52, batch 750, loss[loss=0.1532, simple_loss=0.2432, pruned_loss=0.03167, over 24559.00 frames. ], tot_loss[loss=0.1546, simple_loss=0.2471, pruned_loss=0.03107, over 4675854.33 frames. ], batch size: 65, lr: 4.70e-03, grad_scale: 16.0 2023-12-04 13:36:03,829 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=309300.0, ans=0.5 2023-12-04 13:36:29,641 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.45 vs. limit=6.0 2023-12-04 13:36:42,645 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=309433.3333333333, ans=0.2 2023-12-04 13:36:45,910 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=309433.3333333333, ans=0.0 2023-12-04 13:37:10,076 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=309566.6666666667, ans=0.0 2023-12-04 13:37:14,685 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=309566.6666666667, ans=0.1 2023-12-04 13:37:21,769 INFO [train.py:1087] (1/4) Epoch 52, batch 800, loss[loss=0.1529, simple_loss=0.2457, pruned_loss=0.03007, over 24795.00 frames. ], tot_loss[loss=0.1547, simple_loss=0.2473, pruned_loss=0.03108, over 4699074.09 frames. 
], batch size: 72, lr: 4.70e-03, grad_scale: 32.0 2023-12-04 13:37:36,816 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=309700.0, ans=0.0 2023-12-04 13:37:42,228 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=309700.0, ans=0.125 2023-12-04 13:38:01,646 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=309766.6666666667, ans=0.125 2023-12-04 13:38:06,953 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.096e+02 1.283e+02 1.363e+02 1.464e+02 2.233e+02, threshold=2.727e+02, percent-clipped=0.0 2023-12-04 13:38:21,172 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=309900.0, ans=0.0 2023-12-04 13:38:32,403 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.78 vs. limit=15.0 2023-12-04 13:38:33,074 INFO [train.py:1087] (1/4) Epoch 52, batch 850, loss[loss=0.1488, simple_loss=0.2435, pruned_loss=0.02702, over 24757.00 frames. ], tot_loss[loss=0.1551, simple_loss=0.2477, pruned_loss=0.03127, over 4719294.92 frames. ], batch size: 70, lr: 4.70e-03, grad_scale: 32.0 2023-12-04 13:38:59,950 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-12-04 13:39:08,969 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=310100.0, ans=0.125 2023-12-04 13:39:11,930 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=310100.0, ans=0.05 2023-12-04 13:39:15,859 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=310166.6666666667, ans=0.0 2023-12-04 13:39:25,811 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.43 vs. limit=15.0 2023-12-04 13:39:30,777 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=310233.3333333333, ans=0.125 2023-12-04 13:39:56,194 INFO [train.py:1087] (1/4) Epoch 53, batch 0, loss[loss=0.153, simple_loss=0.251, pruned_loss=0.0275, over 21418.00 frames. ], tot_loss[loss=0.153, simple_loss=0.251, pruned_loss=0.0275, over 21418.00 frames. ], batch size: 128, lr: 4.65e-03, grad_scale: 32.0 2023-12-04 13:39:56,196 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 13:40:15,188 INFO [train.py:1119] (1/4) Epoch 53, validation: loss=0.1513, simple_loss=0.249, pruned_loss=0.02677, over 944034.00 frames. 2023-12-04 13:40:15,190 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 13:40:28,945 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.23 vs. 
limit=15.0 2023-12-04 13:40:47,119 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=310400.0, ans=15.0 2023-12-04 13:40:53,512 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=310400.0, ans=0.0 2023-12-04 13:40:58,376 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=310400.0, ans=0.125 2023-12-04 13:41:12,821 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.081e+02 1.286e+02 1.385e+02 1.598e+02 2.350e+02, threshold=2.770e+02, percent-clipped=0.0 2023-12-04 13:41:14,705 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=310466.6666666667, ans=0.1 2023-12-04 13:41:24,723 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=310533.3333333333, ans=0.125 2023-12-04 13:41:33,321 INFO [train.py:1087] (1/4) Epoch 53, batch 50, loss[loss=0.1502, simple_loss=0.244, pruned_loss=0.02816, over 24621.00 frames. ], tot_loss[loss=0.1558, simple_loss=0.2489, pruned_loss=0.03137, over 1074109.18 frames. ], batch size: 68, lr: 4.65e-03, grad_scale: 32.0 2023-12-04 13:41:33,681 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=310600.0, ans=0.125 2023-12-04 13:41:48,131 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=310666.6666666667, ans=0.125 2023-12-04 13:41:53,470 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=310666.6666666667, ans=0.1 2023-12-04 13:42:38,059 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.46 vs. limit=15.0 2023-12-04 13:42:52,849 INFO [train.py:1087] (1/4) Epoch 53, batch 100, loss[loss=0.1639, simple_loss=0.2579, pruned_loss=0.03491, over 23423.00 frames. ], tot_loss[loss=0.1547, simple_loss=0.2477, pruned_loss=0.03079, over 1909519.30 frames. ], batch size: 94, lr: 4.65e-03, grad_scale: 32.0 2023-12-04 13:43:04,274 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.81 vs. limit=15.0 2023-12-04 13:43:53,142 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.159e+02 1.276e+02 1.377e+02 1.537e+02 2.483e+02, threshold=2.754e+02, percent-clipped=0.0 2023-12-04 13:43:57,156 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=311200.0, ans=0.2 2023-12-04 13:43:58,984 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.73 vs. limit=22.5 2023-12-04 13:44:09,771 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=311266.6666666667, ans=12.0 2023-12-04 13:44:10,721 INFO [train.py:1087] (1/4) Epoch 53, batch 150, loss[loss=0.153, simple_loss=0.2453, pruned_loss=0.03035, over 24579.00 frames. ], tot_loss[loss=0.1543, simple_loss=0.2472, pruned_loss=0.03071, over 2557300.52 frames. 
], batch size: 64, lr: 4.64e-03, grad_scale: 8.0 2023-12-04 13:44:12,657 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=311266.6666666667, ans=0.2 2023-12-04 13:44:18,292 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=311266.6666666667, ans=0.125 2023-12-04 13:44:27,446 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.02 vs. limit=22.5 2023-12-04 13:44:45,010 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.23 vs. limit=15.0 2023-12-04 13:44:53,939 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=311400.0, ans=0.2 2023-12-04 13:44:56,865 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=311466.6666666667, ans=0.0 2023-12-04 13:45:04,510 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=311466.6666666667, ans=0.125 2023-12-04 13:45:08,126 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=311466.6666666667, ans=0.09899494936611666 2023-12-04 13:45:25,390 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=311533.3333333333, ans=0.0 2023-12-04 13:45:27,407 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.18 vs. limit=15.0 2023-12-04 13:45:29,591 INFO [train.py:1087] (1/4) Epoch 53, batch 200, loss[loss=0.1546, simple_loss=0.2487, pruned_loss=0.03028, over 21503.00 frames. ], tot_loss[loss=0.1545, simple_loss=0.2476, pruned_loss=0.03076, over 3056009.58 frames. ], batch size: 128, lr: 4.64e-03, grad_scale: 8.0 2023-12-04 13:46:02,453 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=311733.3333333333, ans=0.0 2023-12-04 13:46:15,919 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.69 vs. limit=15.0 2023-12-04 13:46:33,518 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.106e+02 1.293e+02 1.379e+02 1.475e+02 1.829e+02, threshold=2.758e+02, percent-clipped=0.0 2023-12-04 13:46:46,861 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=311866.6666666667, ans=0.125 2023-12-04 13:46:50,941 INFO [train.py:1087] (1/4) Epoch 53, batch 250, loss[loss=0.1452, simple_loss=0.2389, pruned_loss=0.02578, over 24760.00 frames. ], tot_loss[loss=0.1542, simple_loss=0.2469, pruned_loss=0.03079, over 3457502.64 frames. ], batch size: 70, lr: 4.64e-03, grad_scale: 8.0 2023-12-04 13:48:13,331 INFO [train.py:1087] (1/4) Epoch 53, batch 300, loss[loss=0.1543, simple_loss=0.249, pruned_loss=0.02975, over 24717.00 frames. ], tot_loss[loss=0.1542, simple_loss=0.2469, pruned_loss=0.03079, over 3759975.38 frames. 
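
In the train.py:1087 lines, the first loss[...] block describes the current batch while tot_loss[...] aggregates over everything seen so far in the epoch, weighted by the number of frames (the "over N frames" counts). A simplified frame-weighted tracker is sketched below; the actual tracker in icefall may also apply periodic decay or resetting of the statistics, which is omitted here:

    from collections import defaultdict

    class FrameWeightedTracker:
        """Accumulate per-frame losses so tot_loss is a frame-weighted mean."""

        def __init__(self):
            self.sums = defaultdict(float)   # name -> sum(loss * frames)
            self.frames = 0.0

        def update(self, losses: dict, num_frames: float) -> None:
            for name, value in losses.items():
                self.sums[name] += value * num_frames
            self.frames += num_frames

        def averages(self) -> dict:
            return {k: v / max(self.frames, 1.0) for k, v in self.sums.items()}

    tracker = FrameWeightedTracker()
    tracker.update({"loss": 0.1543, "simple_loss": 0.2490}, num_frames=24717.0)
    tracker.update({"loss": 0.1502, "simple_loss": 0.2440}, num_frames=24621.0)
    print(tracker.averages(), f"over {tracker.frames} frames")
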
], batch size: 67, lr: 4.64e-03, grad_scale: 8.0 2023-12-04 13:48:16,855 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=312266.6666666667, ans=0.0 2023-12-04 13:48:36,666 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=312333.3333333333, ans=0.0 2023-12-04 13:48:50,075 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=312400.0, ans=0.125 2023-12-04 13:48:53,751 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.27 vs. limit=15.0 2023-12-04 13:49:02,693 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.24 vs. limit=15.0 2023-12-04 13:49:16,786 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=312466.6666666667, ans=0.07 2023-12-04 13:49:17,870 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.128e+02 1.305e+02 1.404e+02 1.545e+02 2.475e+02, threshold=2.807e+02, percent-clipped=0.0 2023-12-04 13:49:29,466 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=312533.3333333333, ans=0.125 2023-12-04 13:49:35,567 INFO [train.py:1087] (1/4) Epoch 53, batch 350, loss[loss=0.1571, simple_loss=0.2508, pruned_loss=0.03169, over 24565.00 frames. ], tot_loss[loss=0.1541, simple_loss=0.2467, pruned_loss=0.03073, over 4004173.81 frames. ], batch size: 65, lr: 4.63e-03, grad_scale: 8.0 2023-12-04 13:49:44,877 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=312600.0, ans=0.125 2023-12-04 13:49:52,957 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=312666.6666666667, ans=0.125 2023-12-04 13:50:20,613 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 13:50:20,914 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=312733.3333333333, ans=0.0 2023-12-04 13:50:31,714 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=312800.0, ans=0.125 2023-12-04 13:50:42,945 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=312866.6666666667, ans=0.0 2023-12-04 13:50:58,411 INFO [train.py:1087] (1/4) Epoch 53, batch 400, loss[loss=0.1481, simple_loss=0.2416, pruned_loss=0.02728, over 24583.00 frames. ], tot_loss[loss=0.1539, simple_loss=0.2467, pruned_loss=0.03054, over 4188555.37 frames. 
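
The optim.py:468 lines print quartiles (min, 25%, 50%, 75%, max) of recently observed gradient norms together with a clipping threshold and the fraction of clipped steps; note that the threshold (2.807e+02 just above) is Clipping_scale=2.0 times the reported median (1.404e+02). A rough sketch of median-relative clipping over a sliding window; the real optimizer integrates this into its update step and differs in detail:

    from collections import deque
    import torch

    class MedianGradClipper:
        """Clip gradients whose global norm exceeds clipping_scale * running median."""

        def __init__(self, clipping_scale: float = 2.0, window: int = 500):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=window)
            self.clipped = 0
            self.steps = 0

        def __call__(self, parameters) -> float:
            params = [p for p in parameters if p.grad is not None]
            norm = torch.norm(
                torch.stack([p.grad.detach().norm() for p in params])
            ).item()
            self.norms.append(norm)
            median = sorted(self.norms)[len(self.norms) // 2]
            threshold = self.clipping_scale * median
            self.steps += 1
            if norm > threshold:
                self.clipped += 1
                for p in params:
                    p.grad.mul_(threshold / norm)
            return norm

    # Usage inside the loop: grad_norm = clipper(model.parameters()); optimizer.step()
    # percent-clipped, as printed in the log, would be 100 * clipper.clipped / clipper.steps.
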
], batch size: 65, lr: 4.63e-03, grad_scale: 16.0 2023-12-04 13:51:26,881 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=313000.0, ans=0.05 2023-12-04 13:51:33,075 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=313066.6666666667, ans=0.1 2023-12-04 13:52:00,438 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.138e+02 1.298e+02 1.363e+02 1.477e+02 1.902e+02, threshold=2.727e+02, percent-clipped=0.0 2023-12-04 13:52:19,374 INFO [train.py:1087] (1/4) Epoch 53, batch 450, loss[loss=0.1415, simple_loss=0.2366, pruned_loss=0.02319, over 24755.00 frames. ], tot_loss[loss=0.1542, simple_loss=0.2468, pruned_loss=0.03077, over 4329253.60 frames. ], batch size: 70, lr: 4.63e-03, grad_scale: 16.0 2023-12-04 13:53:03,917 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=313400.0, ans=0.1 2023-12-04 13:53:40,449 INFO [train.py:1087] (1/4) Epoch 53, batch 500, loss[loss=0.1528, simple_loss=0.2455, pruned_loss=0.03008, over 24753.00 frames. ], tot_loss[loss=0.1544, simple_loss=0.2472, pruned_loss=0.03085, over 4427808.72 frames. ], batch size: 70, lr: 4.63e-03, grad_scale: 16.0 2023-12-04 13:53:42,262 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=313600.0, ans=0.1 2023-12-04 13:53:47,579 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=313600.0, ans=0.09899494936611666 2023-12-04 13:54:09,069 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 13:54:09,623 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.76 vs. limit=12.0 2023-12-04 13:54:14,284 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=313733.3333333333, ans=0.125 2023-12-04 13:54:43,779 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.149e+02 1.289e+02 1.379e+02 1.489e+02 2.033e+02, threshold=2.759e+02, percent-clipped=0.0 2023-12-04 13:54:48,590 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=313866.6666666667, ans=0.125 2023-12-04 13:55:00,574 INFO [train.py:1087] (1/4) Epoch 53, batch 550, loss[loss=0.1449, simple_loss=0.2393, pruned_loss=0.02527, over 24770.00 frames. ], tot_loss[loss=0.1548, simple_loss=0.2476, pruned_loss=0.03098, over 4512445.58 frames. ], batch size: 65, lr: 4.62e-03, grad_scale: 16.0 2023-12-04 13:55:19,912 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.43 vs. 
limit=6.0 2023-12-04 13:55:23,760 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=314000.0, ans=0.1 2023-12-04 13:55:31,368 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=314066.6666666667, ans=0.125 2023-12-04 13:55:34,202 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=314066.6666666667, ans=0.125 2023-12-04 13:55:48,148 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=314133.3333333333, ans=0.0 2023-12-04 13:56:04,834 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=314200.0, ans=0.125 2023-12-04 13:56:09,344 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 13:56:10,901 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=314200.0, ans=0.125 2023-12-04 13:56:20,867 INFO [train.py:1087] (1/4) Epoch 53, batch 600, loss[loss=0.1536, simple_loss=0.2445, pruned_loss=0.03135, over 24722.00 frames. ], tot_loss[loss=0.1548, simple_loss=0.2474, pruned_loss=0.03109, over 4579121.96 frames. ], batch size: 67, lr: 4.62e-03, grad_scale: 16.0 2023-12-04 13:56:27,379 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=314266.6666666667, ans=0.2 2023-12-04 13:56:38,765 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-12-04 13:56:49,287 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=314333.3333333333, ans=0.1 2023-12-04 13:56:56,228 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=314400.0, ans=0.0 2023-12-04 13:57:21,010 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.145e+02 1.322e+02 1.398e+02 1.473e+02 2.127e+02, threshold=2.796e+02, percent-clipped=0.0 2023-12-04 13:57:33,827 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=314533.3333333333, ans=0.0 2023-12-04 13:57:38,343 INFO [train.py:1087] (1/4) Epoch 53, batch 650, loss[loss=0.1633, simple_loss=0.2556, pruned_loss=0.03548, over 24480.00 frames. ], tot_loss[loss=0.1544, simple_loss=0.2472, pruned_loss=0.03078, over 4649478.70 frames. 
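
The scaling.py:1022 Whitening lines compare a per-module metric against a limit: the metric measures how far the activation covariance is from being white (proportional to the identity), and a corrective gradient only kicks in once the limit is exceeded. One natural metric with the logged behaviour, equal to 1 for a perfectly white covariance and growing toward the number of channels as energy concentrates in a few directions, is sketched below; the exact expression in scaling.py may differ:

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        """How far the covariance of x is from a multiple of the identity.

        x: (num_frames, num_channels). Returns a value in [1, channels_per_group]:
        1 when all eigenvalues are equal, larger when energy concentrates in a
        few directions. Illustrative; not copied from icefall's scaling.py.
        """
        num_frames, num_channels = x.shape
        d = num_channels // num_groups
        x = x.reshape(num_frames, num_groups, d).transpose(0, 1)   # (groups, frames, d)
        x = x - x.mean(dim=1, keepdim=True)
        cov = torch.matmul(x.transpose(1, 2), x) / num_frames      # (groups, d, d)
        # d * sum(eig^2) / (sum(eig))^2, computed via traces to avoid an eigendecomposition.
        num = d * torch.diagonal(torch.matmul(cov, cov), dim1=1, dim2=2).sum(dim=1)
        den = torch.diagonal(cov, dim1=1, dim2=2).sum(dim=1) ** 2
        return (num / den).mean().item()

    x = torch.randn(1000, 512)
    print(whitening_metric(x))     # close to 1 for white noise
    x[:, 0] *= 30.0                # one dominant channel
    print(whitening_metric(x))     # much larger than 1; the log compares such a metric against limit=15.0
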
], batch size: 75, lr: 4.62e-03, grad_scale: 16.0 2023-12-04 13:58:39,655 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=314866.6666666667, ans=0.0 2023-12-04 13:58:39,660 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=314866.6666666667, ans=0.07 2023-12-04 13:58:41,155 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=314866.6666666667, ans=0.125 2023-12-04 13:58:41,180 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=314866.6666666667, ans=0.1 2023-12-04 13:58:52,223 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.67 vs. limit=22.5 2023-12-04 13:58:54,197 INFO [train.py:1087] (1/4) Epoch 53, batch 700, loss[loss=0.1423, simple_loss=0.2372, pruned_loss=0.02373, over 24870.00 frames. ], tot_loss[loss=0.154, simple_loss=0.2469, pruned_loss=0.03058, over 4687988.36 frames. ], batch size: 68, lr: 4.62e-03, grad_scale: 16.0 2023-12-04 13:59:03,011 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=314933.3333333333, ans=0.125 2023-12-04 13:59:53,629 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.088e+02 1.311e+02 1.391e+02 1.533e+02 2.138e+02, threshold=2.782e+02, percent-clipped=0.0 2023-12-04 14:00:01,373 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=315200.0, ans=0.1 2023-12-04 14:00:09,989 INFO [train.py:1087] (1/4) Epoch 53, batch 750, loss[loss=0.1563, simple_loss=0.2483, pruned_loss=0.03211, over 24547.00 frames. ], tot_loss[loss=0.1544, simple_loss=0.247, pruned_loss=0.03086, over 4713084.48 frames. ], batch size: 63, lr: 4.62e-03, grad_scale: 16.0 2023-12-04 14:00:15,455 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=315266.6666666667, ans=0.125 2023-12-04 14:00:24,861 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=315333.3333333333, ans=0.0 2023-12-04 14:00:31,184 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.18 vs. limit=15.0 2023-12-04 14:00:31,731 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=315333.3333333333, ans=0.0 2023-12-04 14:00:36,102 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=315333.3333333333, ans=0.125 2023-12-04 14:01:12,029 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=315533.3333333333, ans=0.125 2023-12-04 14:01:13,313 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=315533.3333333333, ans=0.125 2023-12-04 14:01:25,402 INFO [train.py:1087] (1/4) Epoch 53, batch 800, loss[loss=0.1549, simple_loss=0.2481, pruned_loss=0.03087, over 24571.00 frames. ], tot_loss[loss=0.1546, simple_loss=0.2473, pruned_loss=0.031, over 4732909.50 frames. 
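
The grad_scale value in each per-batch line is the dynamic loss-scaling factor used for fp16 training: it is cut when a step overflows (32.0 falling to 8.0 earlier in this epoch) and grown back after a stretch of stable steps (8.0 to 16.0 to 32.0 by batch 800). A standard torch.cuda.amp pattern that produces this behaviour, using stand-in model, optimizer and data rather than the recipe's actual training loop:

    import torch
    from torch.cuda.amp import GradScaler, autocast

    # Hypothetical stand-ins for the real model, optimizer and dataloader.
    model = torch.nn.Linear(80, 500).cuda()
    optimizer = torch.optim.AdamW(model.parameters(), lr=4.6e-3)
    scaler = GradScaler(enabled=True)

    for batch_idx in range(100):
        features = torch.randn(16, 80, device="cuda")
        targets = torch.randint(0, 500, (16,), device="cuda")
        optimizer.zero_grad()
        with autocast(dtype=torch.float16):
            loss = torch.nn.functional.cross_entropy(model(features), targets)
        scaler.scale(loss).backward()
        scaler.step(optimizer)   # skipped internally if the scaled gradients overflowed
        scaler.update()          # shrinks or grows the scale, i.e. the logged grad_scale
        if batch_idx % 50 == 0:
            print(f"batch {batch_idx}, grad_scale: {scaler.get_scale()}")
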
], batch size: 64, lr: 4.61e-03, grad_scale: 32.0 2023-12-04 14:01:31,965 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.52 vs. limit=15.0 2023-12-04 14:01:54,135 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 14:02:19,652 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.282e+02 1.351e+02 1.470e+02 2.116e+02, threshold=2.703e+02, percent-clipped=0.0 2023-12-04 14:02:21,218 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 14:02:34,139 INFO [train.py:1087] (1/4) Epoch 53, batch 850, loss[loss=0.1604, simple_loss=0.2553, pruned_loss=0.03273, over 24558.00 frames. ], tot_loss[loss=0.1546, simple_loss=0.2472, pruned_loss=0.03101, over 4762394.43 frames. ], batch size: 62, lr: 4.61e-03, grad_scale: 16.0 2023-12-04 14:02:34,537 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=315933.3333333333, ans=0.0 2023-12-04 14:02:39,863 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.67 vs. limit=15.0 2023-12-04 14:03:50,894 INFO [train.py:1087] (1/4) Epoch 54, batch 0, loss[loss=0.1463, simple_loss=0.2393, pruned_loss=0.02665, over 24779.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.2393, pruned_loss=0.02665, over 24779.00 frames. ], batch size: 62, lr: 4.56e-03, grad_scale: 32.0 2023-12-04 14:03:50,896 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 14:04:07,112 INFO [train.py:1119] (1/4) Epoch 54, validation: loss=0.1516, simple_loss=0.249, pruned_loss=0.02707, over 944034.00 frames. 2023-12-04 14:04:07,114 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 14:04:19,666 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.96 vs. limit=10.0 2023-12-04 14:04:22,672 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.95 vs. limit=15.0 2023-12-04 14:04:32,850 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=316300.0, ans=0.0 2023-12-04 14:04:56,449 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=316433.3333333333, ans=0.0 2023-12-04 14:04:59,459 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=316433.3333333333, ans=0.125 2023-12-04 14:05:02,307 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=316433.3333333333, ans=0.0 2023-12-04 14:05:06,977 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=316500.0, ans=0.0 2023-12-04 14:05:15,780 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.105e+02 1.288e+02 1.377e+02 1.512e+02 2.121e+02, threshold=2.755e+02, percent-clipped=0.0 2023-12-04 14:05:23,553 INFO [train.py:1087] (1/4) Epoch 54, batch 50, loss[loss=0.1619, simple_loss=0.2578, pruned_loss=0.03298, over 24773.00 frames. 
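
The train.py:1110/1119/1120 lines show that at batch 0 of every epoch the loop pauses to compute a validation loss over the same fixed held-out set (944034.00 frames each time) and then reports the peak GPU memory, presumably via torch.cuda.max_memory_allocated (a steady 16610MB here). A schematic version of that validation pass; compute_loss and valid_loader are hypothetical names, not the recipe's:

    import torch

    @torch.no_grad()
    def compute_validation_loss(model, compute_loss, valid_loader, device):
        """Frame-weighted validation loss over a fixed dev set (schematic)."""
        model.eval()
        total_loss, total_frames = 0.0, 0.0
        for batch in valid_loader:
            loss, num_frames = compute_loss(model, batch, device)  # hypothetical helper
            total_loss += loss.item() * num_frames
            total_frames += num_frames
        model.train()
        return total_loss / total_frames, total_frames
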
], tot_loss[loss=0.1551, simple_loss=0.2486, pruned_loss=0.03079, over 1089578.22 frames. ], batch size: 64, lr: 4.56e-03, grad_scale: 32.0 2023-12-04 14:06:01,442 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=316700.0, ans=0.2 2023-12-04 14:06:20,735 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=316766.6666666667, ans=0.0 2023-12-04 14:06:26,985 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=316833.3333333333, ans=0.125 2023-12-04 14:06:32,441 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=316833.3333333333, ans=0.2 2023-12-04 14:06:40,760 INFO [train.py:1087] (1/4) Epoch 54, batch 100, loss[loss=0.1467, simple_loss=0.2421, pruned_loss=0.02569, over 24758.00 frames. ], tot_loss[loss=0.1547, simple_loss=0.2477, pruned_loss=0.03081, over 1917260.59 frames. ], batch size: 70, lr: 4.56e-03, grad_scale: 32.0 2023-12-04 14:06:50,729 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=316900.0, ans=0.125 2023-12-04 14:07:05,911 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=316966.6666666667, ans=0.1 2023-12-04 14:07:18,562 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=317033.3333333333, ans=0.1 2023-12-04 14:07:41,882 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=317166.6666666667, ans=0.125 2023-12-04 14:07:46,594 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=317166.6666666667, ans=0.2 2023-12-04 14:07:51,182 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.075e+02 1.240e+02 1.325e+02 1.426e+02 1.923e+02, threshold=2.651e+02, percent-clipped=0.0 2023-12-04 14:07:58,956 INFO [train.py:1087] (1/4) Epoch 54, batch 150, loss[loss=0.1558, simple_loss=0.2497, pruned_loss=0.03099, over 24573.00 frames. ], tot_loss[loss=0.1545, simple_loss=0.2475, pruned_loss=0.03075, over 2573323.71 frames. ], batch size: 64, lr: 4.56e-03, grad_scale: 32.0 2023-12-04 14:08:05,025 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.45 vs. limit=15.0 2023-12-04 14:08:51,892 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.54 vs. limit=22.5 2023-12-04 14:08:52,774 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=317433.3333333333, ans=0.125 2023-12-04 14:08:56,479 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.96 vs. limit=10.0 2023-12-04 14:09:04,152 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=317500.0, ans=0.125 2023-12-04 14:09:17,742 INFO [train.py:1087] (1/4) Epoch 54, batch 200, loss[loss=0.1509, simple_loss=0.2443, pruned_loss=0.02872, over 24743.00 frames. 
], tot_loss[loss=0.1547, simple_loss=0.2475, pruned_loss=0.03097, over 3073142.44 frames. ], batch size: 63, lr: 4.56e-03, grad_scale: 32.0 2023-12-04 14:09:30,457 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=317566.6666666667, ans=0.0 2023-12-04 14:09:47,438 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=317633.3333333333, ans=0.125 2023-12-04 14:09:54,813 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=317700.0, ans=0.2 2023-12-04 14:09:55,075 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=317700.0, ans=0.04949747468305833 2023-12-04 14:10:24,581 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.27 vs. limit=15.0 2023-12-04 14:10:28,795 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.142e+02 1.325e+02 1.399e+02 1.551e+02 2.197e+02, threshold=2.799e+02, percent-clipped=0.0 2023-12-04 14:10:36,648 INFO [train.py:1087] (1/4) Epoch 54, batch 250, loss[loss=0.1635, simple_loss=0.2518, pruned_loss=0.03762, over 24011.00 frames. ], tot_loss[loss=0.1549, simple_loss=0.2477, pruned_loss=0.03104, over 3455260.32 frames. ], batch size: 87, lr: 4.55e-03, grad_scale: 32.0 2023-12-04 14:10:46,817 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.18 vs. limit=15.0 2023-12-04 14:10:52,685 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-12-04 14:10:58,321 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=317966.6666666667, ans=0.95 2023-12-04 14:11:05,921 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.57 vs. limit=15.0 2023-12-04 14:11:15,798 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.44 vs. limit=15.0 2023-12-04 14:11:31,613 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=318100.0, ans=0.09899494936611666 2023-12-04 14:11:43,162 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=15.0 2023-12-04 14:11:54,080 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=318166.6666666667, ans=0.125 2023-12-04 14:11:54,264 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=318166.6666666667, ans=0.0 2023-12-04 14:11:58,184 INFO [train.py:1087] (1/4) Epoch 54, batch 300, loss[loss=0.1651, simple_loss=0.264, pruned_loss=0.03311, over 22680.00 frames. ], tot_loss[loss=0.1551, simple_loss=0.248, pruned_loss=0.03116, over 3758582.48 frames. ], batch size: 106, lr: 4.55e-03, grad_scale: 16.0 2023-12-04 14:12:09,361 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.27 vs. 
limit=15.0 2023-12-04 14:12:34,970 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=318366.6666666667, ans=0.125 2023-12-04 14:12:51,013 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=318433.3333333333, ans=0.125 2023-12-04 14:12:58,987 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=318433.3333333333, ans=0.125 2023-12-04 14:13:15,581 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.067e+02 1.258e+02 1.370e+02 1.479e+02 2.899e+02, threshold=2.739e+02, percent-clipped=1.0 2023-12-04 14:13:22,246 INFO [train.py:1087] (1/4) Epoch 54, batch 350, loss[loss=0.1374, simple_loss=0.2365, pruned_loss=0.01918, over 24781.00 frames. ], tot_loss[loss=0.1549, simple_loss=0.2478, pruned_loss=0.031, over 3996780.69 frames. ], batch size: 73, lr: 4.55e-03, grad_scale: 16.0 2023-12-04 14:13:27,729 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.87 vs. limit=15.0 2023-12-04 14:13:30,356 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=318566.6666666667, ans=0.125 2023-12-04 14:13:34,048 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=318566.6666666667, ans=0.0 2023-12-04 14:13:49,487 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=318633.3333333333, ans=0.0 2023-12-04 14:14:11,667 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=318766.6666666667, ans=0.125 2023-12-04 14:14:16,597 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.43 vs. limit=22.5 2023-12-04 14:14:20,099 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=12.0 2023-12-04 14:14:21,238 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=318766.6666666667, ans=0.2 2023-12-04 14:14:43,937 INFO [train.py:1087] (1/4) Epoch 54, batch 400, loss[loss=0.1593, simple_loss=0.2529, pruned_loss=0.03286, over 24095.00 frames. ], tot_loss[loss=0.1542, simple_loss=0.2472, pruned_loss=0.03059, over 4187154.75 frames. ], batch size: 58, lr: 4.55e-03, grad_scale: 32.0 2023-12-04 14:14:45,897 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=318900.0, ans=0.125 2023-12-04 14:14:58,927 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=318966.6666666667, ans=0.125 2023-12-04 14:15:15,274 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.82 vs. limit=15.0 2023-12-04 14:15:27,870 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.74 vs. 
limit=22.5 2023-12-04 14:15:59,058 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.117e+02 1.256e+02 1.328e+02 1.463e+02 1.976e+02, threshold=2.656e+02, percent-clipped=0.0 2023-12-04 14:16:05,806 INFO [train.py:1087] (1/4) Epoch 54, batch 450, loss[loss=0.1429, simple_loss=0.2366, pruned_loss=0.0246, over 24790.00 frames. ], tot_loss[loss=0.1538, simple_loss=0.2468, pruned_loss=0.03043, over 4342897.40 frames. ], batch size: 73, lr: 4.54e-03, grad_scale: 32.0 2023-12-04 14:16:19,834 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=319233.3333333333, ans=0.125 2023-12-04 14:16:27,925 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=319300.0, ans=0.1 2023-12-04 14:16:33,525 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-12-04 14:17:27,360 INFO [train.py:1087] (1/4) Epoch 54, batch 500, loss[loss=0.1584, simple_loss=0.2499, pruned_loss=0.03351, over 23998.00 frames. ], tot_loss[loss=0.1536, simple_loss=0.2465, pruned_loss=0.03033, over 4464323.54 frames. ], batch size: 87, lr: 4.54e-03, grad_scale: 32.0 2023-12-04 14:17:40,993 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=319633.3333333333, ans=0.025 2023-12-04 14:17:41,252 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=319633.3333333333, ans=0.0 2023-12-04 14:18:06,141 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=319700.0, ans=0.125 2023-12-04 14:18:28,561 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=319833.3333333333, ans=0.2 2023-12-04 14:18:32,678 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=319833.3333333333, ans=0.0 2023-12-04 14:18:38,537 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=319833.3333333333, ans=0.125 2023-12-04 14:18:39,575 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.118e+02 1.327e+02 1.410e+02 1.630e+02 2.218e+02, threshold=2.821e+02, percent-clipped=0.0 2023-12-04 14:18:45,424 INFO [train.py:1087] (1/4) Epoch 54, batch 550, loss[loss=0.1479, simple_loss=0.2387, pruned_loss=0.0285, over 24736.00 frames. ], tot_loss[loss=0.1539, simple_loss=0.2466, pruned_loss=0.03055, over 4543400.61 frames. 
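
Many of the scheduled names in these lines end in attention_skip_rate, conv_skip_rate, ff2_skip_rate/ff3_skip_rate or bypass.skip_rate: probabilities of dropping a sub-module's contribution for a given batch, typically annealed toward the 0.0 values seen here as training matures. A minimal illustration of applying such a skip probability during training; this wrapper is illustrative, not the zipformer code:

    import torch
    import torch.nn as nn

    class RandomlySkipped(nn.Module):
        """Wrap a sub-module so its output is dropped with probability skip_rate."""

        def __init__(self, module: nn.Module):
            super().__init__()
            self.module = module
            self.skip_rate = 0.0   # in the recipe this would be a scheduled value

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            out = self.module(x)
            if self.training and torch.rand(()) < self.skip_rate:
                return torch.zeros_like(out)   # branch contributes nothing this batch
            return out

    x = torch.randn(8, 256)
    layer_ff = RandomlySkipped(nn.Linear(256, 256))
    layer_ff.skip_rate = 0.05
    layer_ff.train()
    y = x + layer_ff(x)   # the residual path still carries x when the branch is skipped
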
], batch size: 63, lr: 4.54e-03, grad_scale: 32.0 2023-12-04 14:18:57,581 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=319900.0, ans=0.125 2023-12-04 14:18:57,644 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=319900.0, ans=0.1 2023-12-04 14:19:04,151 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=319966.6666666667, ans=0.125 2023-12-04 14:19:13,566 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=319966.6666666667, ans=0.1 2023-12-04 14:19:13,577 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=319966.6666666667, ans=0.0 2023-12-04 14:19:14,803 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=319966.6666666667, ans=0.1 2023-12-04 14:19:38,229 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=320100.0, ans=0.125 2023-12-04 14:19:45,802 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=320100.0, ans=0.2 2023-12-04 14:19:56,574 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=320166.6666666667, ans=0.0 2023-12-04 14:19:58,478 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.87 vs. limit=10.0 2023-12-04 14:20:07,497 INFO [train.py:1087] (1/4) Epoch 54, batch 600, loss[loss=0.1398, simple_loss=0.2344, pruned_loss=0.02263, over 24703.00 frames. ], tot_loss[loss=0.1539, simple_loss=0.2467, pruned_loss=0.03058, over 4584188.70 frames. ], batch size: 74, lr: 4.54e-03, grad_scale: 32.0 2023-12-04 14:20:15,247 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=320233.3333333333, ans=0.05 2023-12-04 14:20:21,248 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=320300.0, ans=0.125 2023-12-04 14:20:23,094 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=320300.0, ans=0.125 2023-12-04 14:20:23,576 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=22.5 2023-12-04 14:21:17,910 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.127e+02 1.296e+02 1.398e+02 1.549e+02 2.118e+02, threshold=2.797e+02, percent-clipped=0.0 2023-12-04 14:21:21,077 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=320500.0, ans=0.125 2023-12-04 14:21:23,721 INFO [train.py:1087] (1/4) Epoch 54, batch 650, loss[loss=0.1492, simple_loss=0.2412, pruned_loss=0.02862, over 24570.00 frames. ], tot_loss[loss=0.1537, simple_loss=0.2465, pruned_loss=0.03047, over 4643993.33 frames. 
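
The balancer entries (balancer1.prob, balancer2.min_positive, max_positive=0.95, min_abs and so on) refer to activation constraints: with some probability per batch, channels whose statistics drift outside a target range (fraction of positive values, mean absolute value) receive a small corrective signal. The real module applies this by adjusting gradients in the backward pass; the sketch below states the same constraint as an explicit penalty, purely to illustrate what is being constrained:

    import torch

    def balance_penalty(
        x: torch.Tensor,
        min_positive: float = 0.05,
        max_positive: float = 0.95,
        min_abs: float = 0.2,
    ) -> torch.Tensor:
        """Penalty that grows when per-channel statistics leave their target range.

        x: (num_frames, num_channels). Illustrative stand-in for the gradient-based
        balancer in scaling.py, not its actual implementation.
        """
        # Smooth, differentiable proxy for the per-channel fraction of positive values.
        frac_positive = torch.sigmoid(20.0 * x).mean(dim=0)
        mean_abs = x.abs().mean(dim=0)
        penalty = torch.relu(min_positive - frac_positive).sum()
        penalty = penalty + torch.relu(frac_positive - max_positive).sum()
        penalty = penalty + torch.relu(min_abs - mean_abs).sum()
        return penalty

    x = torch.randn(1000, 256)
    print(balance_penalty(x))   # near zero for roughly balanced activations
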
], batch size: 65, lr: 4.53e-03, grad_scale: 32.0 2023-12-04 14:21:49,932 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=320633.3333333333, ans=0.125 2023-12-04 14:21:59,503 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=320700.0, ans=0.125 2023-12-04 14:22:11,563 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=320766.6666666667, ans=0.0 2023-12-04 14:22:29,513 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=320833.3333333333, ans=0.125 2023-12-04 14:22:29,947 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.01 vs. limit=10.0 2023-12-04 14:22:39,009 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=320900.0, ans=0.2 2023-12-04 14:22:39,832 INFO [train.py:1087] (1/4) Epoch 54, batch 700, loss[loss=0.1576, simple_loss=0.2518, pruned_loss=0.0317, over 24726.00 frames. ], tot_loss[loss=0.154, simple_loss=0.2468, pruned_loss=0.03063, over 4686115.34 frames. ], batch size: 67, lr: 4.53e-03, grad_scale: 32.0 2023-12-04 14:22:58,870 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=320966.6666666667, ans=0.035 2023-12-04 14:23:33,371 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=321100.0, ans=0.125 2023-12-04 14:23:38,306 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=321166.6666666667, ans=0.2 2023-12-04 14:23:45,682 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-12-04 14:23:49,048 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.069e+02 1.277e+02 1.367e+02 1.520e+02 2.046e+02, threshold=2.734e+02, percent-clipped=0.0 2023-12-04 14:23:55,366 INFO [train.py:1087] (1/4) Epoch 54, batch 750, loss[loss=0.1806, simple_loss=0.2651, pruned_loss=0.04806, over 16682.00 frames. ], tot_loss[loss=0.1539, simple_loss=0.2466, pruned_loss=0.0306, over 4703044.38 frames. ], batch size: 178, lr: 4.53e-03, grad_scale: 32.0 2023-12-04 14:24:20,515 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=321300.0, ans=0.125 2023-12-04 14:25:12,339 INFO [train.py:1087] (1/4) Epoch 54, batch 800, loss[loss=0.1491, simple_loss=0.2412, pruned_loss=0.02852, over 24756.00 frames. ], tot_loss[loss=0.1539, simple_loss=0.2466, pruned_loss=0.03062, over 4721419.97 frames. ], batch size: 66, lr: 4.53e-03, grad_scale: 32.0 2023-12-04 14:25:59,000 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=321766.6666666667, ans=0.015 2023-12-04 14:26:02,482 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.98 vs. 
limit=15.0 2023-12-04 14:26:07,493 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=321833.3333333333, ans=0.0 2023-12-04 14:26:14,361 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=321833.3333333333, ans=0.2 2023-12-04 14:26:16,613 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.129e+02 1.254e+02 1.363e+02 1.450e+02 1.749e+02, threshold=2.726e+02, percent-clipped=0.0 2023-12-04 14:26:21,929 INFO [train.py:1087] (1/4) Epoch 54, batch 850, loss[loss=0.1508, simple_loss=0.2442, pruned_loss=0.02868, over 24607.00 frames. ], tot_loss[loss=0.1542, simple_loss=0.2469, pruned_loss=0.03079, over 4729539.42 frames. ], batch size: 68, lr: 4.53e-03, grad_scale: 32.0 2023-12-04 14:26:36,043 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.32 vs. limit=15.0 2023-12-04 14:26:46,136 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=321966.6666666667, ans=0.0 2023-12-04 14:26:56,592 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=322033.3333333333, ans=0.0 2023-12-04 14:27:17,592 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=322166.6666666667, ans=0.1 2023-12-04 14:27:45,145 INFO [train.py:1087] (1/4) Epoch 55, batch 0, loss[loss=0.1612, simple_loss=0.2523, pruned_loss=0.035, over 23898.00 frames. ], tot_loss[loss=0.1612, simple_loss=0.2523, pruned_loss=0.035, over 23898.00 frames. ], batch size: 87, lr: 4.48e-03, grad_scale: 32.0 2023-12-04 14:27:45,146 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 14:28:01,828 INFO [train.py:1119] (1/4) Epoch 55, validation: loss=0.1514, simple_loss=0.2492, pruned_loss=0.02683, over 944034.00 frames. 2023-12-04 14:28:01,829 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 14:28:09,575 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 14:28:27,249 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=322266.6666666667, ans=0.125 2023-12-04 14:28:39,267 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=322333.3333333333, ans=0.1 2023-12-04 14:28:52,778 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.12 vs. limit=15.0 2023-12-04 14:29:02,873 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-12-04 14:29:05,313 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=322466.6666666667, ans=0.1 2023-12-04 14:29:05,806 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.92 vs. limit=15.0 2023-12-04 14:29:16,110 INFO [train.py:1087] (1/4) Epoch 55, batch 50, loss[loss=0.1405, simple_loss=0.2351, pruned_loss=0.02295, over 24716.00 frames. 
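
Across this section the learning rate decays smoothly within an epoch (4.65e-03 down to 4.61e-03 over epoch 53) and steps down at epoch boundaries (4.56e-03 in epoch 54, 4.48e-03 here at the start of epoch 55). That is the shape of an Eden-style schedule as used in icefall, which scales the base learning rate by a power law in both batch count and epoch count. The sketch below uses illustrative base_lr/lr_batches/lr_epochs settings and omits warmup and any duration-based correction the recipe may apply, so only the relative decay is meaningful:

    def eden_lr(base_lr: float, batch: int, epoch: float,
                lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
        """Eden-style decay: smooth power law in batch and epoch count (schematic)."""
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor

    lr_53 = eden_lr(0.045, batch=310_000, epoch=53)
    lr_54 = eden_lr(0.045, batch=316_200, epoch=54)
    print(lr_54 / lr_53)   # ~0.98, the gentle epoch-to-epoch decay visible in the log
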
], tot_loss[loss=0.1552, simple_loss=0.2473, pruned_loss=0.03154, over 1077166.72 frames. ], batch size: 69, lr: 4.48e-03, grad_scale: 32.0 2023-12-04 14:29:19,670 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.077e+02 1.275e+02 1.387e+02 1.532e+02 2.766e+02, threshold=2.774e+02, percent-clipped=1.0 2023-12-04 14:29:24,446 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=322533.3333333333, ans=0.0 2023-12-04 14:30:05,753 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.58 vs. limit=22.5 2023-12-04 14:30:11,115 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=322733.3333333333, ans=15.0 2023-12-04 14:30:15,387 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=322800.0, ans=0.125 2023-12-04 14:30:27,188 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=322800.0, ans=0.125 2023-12-04 14:30:31,013 INFO [train.py:1087] (1/4) Epoch 55, batch 100, loss[loss=0.1651, simple_loss=0.2564, pruned_loss=0.03688, over 23979.00 frames. ], tot_loss[loss=0.154, simple_loss=0.2465, pruned_loss=0.0307, over 1910812.99 frames. ], batch size: 87, lr: 4.48e-03, grad_scale: 32.0 2023-12-04 14:30:44,058 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=322866.6666666667, ans=0.02 2023-12-04 14:30:56,102 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=322933.3333333333, ans=0.0 2023-12-04 14:30:56,465 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.61 vs. limit=6.0 2023-12-04 14:30:57,274 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=322933.3333333333, ans=0.125 2023-12-04 14:31:42,277 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=323133.3333333333, ans=0.0 2023-12-04 14:31:46,350 INFO [train.py:1087] (1/4) Epoch 55, batch 150, loss[loss=0.164, simple_loss=0.2553, pruned_loss=0.03635, over 24753.00 frames. ], tot_loss[loss=0.1551, simple_loss=0.2475, pruned_loss=0.0314, over 2542046.04 frames. ], batch size: 64, lr: 4.47e-03, grad_scale: 16.0 2023-12-04 14:31:50,754 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.117e+02 1.281e+02 1.343e+02 1.455e+02 3.000e+02, threshold=2.685e+02, percent-clipped=1.0 2023-12-04 14:32:08,281 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=323266.6666666667, ans=0.125 2023-12-04 14:32:22,653 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.15 vs. limit=15.0 2023-12-04 14:32:24,104 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.75 vs. 
limit=22.5 2023-12-04 14:32:32,814 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=323400.0, ans=0.125 2023-12-04 14:32:34,160 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=323400.0, ans=0.0 2023-12-04 14:33:02,218 INFO [train.py:1087] (1/4) Epoch 55, batch 200, loss[loss=0.1534, simple_loss=0.2424, pruned_loss=0.03223, over 24545.00 frames. ], tot_loss[loss=0.1543, simple_loss=0.2467, pruned_loss=0.03097, over 3047298.69 frames. ], batch size: 63, lr: 4.47e-03, grad_scale: 16.0 2023-12-04 14:33:02,701 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=323533.3333333333, ans=0.0 2023-12-04 14:33:11,943 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=323533.3333333333, ans=0.125 2023-12-04 14:33:45,540 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.34 vs. limit=10.0 2023-12-04 14:33:57,350 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=323733.3333333333, ans=0.2 2023-12-04 14:34:18,211 INFO [train.py:1087] (1/4) Epoch 55, batch 250, loss[loss=0.1597, simple_loss=0.2485, pruned_loss=0.03542, over 24567.00 frames. ], tot_loss[loss=0.1547, simple_loss=0.2468, pruned_loss=0.03127, over 3425080.12 frames. ], batch size: 64, lr: 4.47e-03, grad_scale: 16.0 2023-12-04 14:34:22,540 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.126e+02 1.264e+02 1.356e+02 1.493e+02 1.911e+02, threshold=2.712e+02, percent-clipped=0.0 2023-12-04 14:34:25,949 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 14:34:28,900 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=323866.6666666667, ans=0.0 2023-12-04 14:35:35,504 INFO [train.py:1087] (1/4) Epoch 55, batch 300, loss[loss=0.1603, simple_loss=0.2464, pruned_loss=0.03708, over 24175.00 frames. ], tot_loss[loss=0.1552, simple_loss=0.2471, pruned_loss=0.03165, over 3719646.59 frames. ], batch size: 58, lr: 4.47e-03, grad_scale: 16.0 2023-12-04 14:35:42,900 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=324200.0, ans=0.125 2023-12-04 14:35:43,007 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=324200.0, ans=0.1 2023-12-04 14:36:22,521 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=324400.0, ans=0.2 2023-12-04 14:36:25,505 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=324400.0, ans=0.125 2023-12-04 14:36:51,494 INFO [train.py:1087] (1/4) Epoch 55, batch 350, loss[loss=0.1511, simple_loss=0.2419, pruned_loss=0.03017, over 24768.00 frames. ], tot_loss[loss=0.1546, simple_loss=0.2469, pruned_loss=0.03113, over 3965423.15 frames. 
], batch size: 70, lr: 4.47e-03, grad_scale: 16.0 2023-12-04 14:36:55,914 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.238e+02 1.351e+02 1.486e+02 2.118e+02, threshold=2.702e+02, percent-clipped=0.0 2023-12-04 14:36:56,963 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=324533.3333333333, ans=0.1 2023-12-04 14:37:02,462 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.01 vs. limit=15.0 2023-12-04 14:37:19,658 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=324600.0, ans=0.0 2023-12-04 14:38:09,167 INFO [train.py:1087] (1/4) Epoch 55, batch 400, loss[loss=0.1503, simple_loss=0.242, pruned_loss=0.02932, over 24756.00 frames. ], tot_loss[loss=0.1544, simple_loss=0.2467, pruned_loss=0.03106, over 4137633.97 frames. ], batch size: 65, lr: 4.46e-03, grad_scale: 32.0 2023-12-04 14:38:10,933 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=324866.6666666667, ans=0.0 2023-12-04 14:38:27,703 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=324933.3333333333, ans=0.0 2023-12-04 14:38:54,778 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=325066.6666666667, ans=0.125 2023-12-04 14:38:58,419 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.34 vs. limit=15.0 2023-12-04 14:39:02,323 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=325066.6666666667, ans=0.0 2023-12-04 14:39:14,407 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=325133.3333333333, ans=0.2 2023-12-04 14:39:25,766 INFO [train.py:1087] (1/4) Epoch 55, batch 450, loss[loss=0.1742, simple_loss=0.2669, pruned_loss=0.04073, over 24192.00 frames. ], tot_loss[loss=0.154, simple_loss=0.2464, pruned_loss=0.03078, over 4288876.79 frames. ], batch size: 82, lr: 4.46e-03, grad_scale: 32.0 2023-12-04 14:39:30,012 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.072e+02 1.294e+02 1.407e+02 1.497e+02 1.971e+02, threshold=2.814e+02, percent-clipped=0.0 2023-12-04 14:39:30,326 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=325200.0, ans=0.125 2023-12-04 14:39:30,351 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=325200.0, ans=0.0 2023-12-04 14:39:52,357 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=325266.6666666667, ans=0.125 2023-12-04 14:39:56,335 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=325333.3333333333, ans=0.125 2023-12-04 14:39:58,281 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.96 vs. 
limit=15.0 2023-12-04 14:40:04,813 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=325333.3333333333, ans=0.1 2023-12-04 14:40:17,078 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=325400.0, ans=15.0 2023-12-04 14:40:26,541 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=325466.6666666667, ans=0.0 2023-12-04 14:40:26,562 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 14:40:26,571 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=325466.6666666667, ans=0.1 2023-12-04 14:40:36,629 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 14:40:41,103 INFO [train.py:1087] (1/4) Epoch 55, batch 500, loss[loss=0.1582, simple_loss=0.2529, pruned_loss=0.03179, over 24204.00 frames. ], tot_loss[loss=0.1545, simple_loss=0.2469, pruned_loss=0.03099, over 4400871.11 frames. ], batch size: 82, lr: 4.46e-03, grad_scale: 32.0 2023-12-04 14:40:41,383 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=325533.3333333333, ans=0.125 2023-12-04 14:40:47,495 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=325533.3333333333, ans=0.125 2023-12-04 14:41:17,337 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 14:41:41,075 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=325800.0, ans=0.1 2023-12-04 14:41:49,977 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=325800.0, ans=0.1 2023-12-04 14:41:55,200 INFO [train.py:1087] (1/4) Epoch 55, batch 550, loss[loss=0.1454, simple_loss=0.2386, pruned_loss=0.0261, over 24866.00 frames. ], tot_loss[loss=0.1542, simple_loss=0.2467, pruned_loss=0.03088, over 4508930.55 frames. 
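
The scaling.py:1118 WithLoss lines report the accumulated value of an auxiliary penalty attached to the self-attention weights; loss-sum=0.000e+00 means the weights stayed inside the allowed range over that interval, so the penalty contributed nothing. A generic sketch of such a bounded-value penalty; the limit and scale below are made-up illustrative numbers, not the values used in scaling.py:

    import torch

    def abs_limit_penalty(attn_weights: torch.Tensor,
                          limit: float = 25.0,
                          scale: float = 1.0e-04) -> torch.Tensor:
        """Small auxiliary loss that is zero while |attn_weights| stays below limit."""
        excess = torch.relu(attn_weights.abs() - limit)
        return scale * excess.sum()

    w = torch.randn(4, 8, 100, 100)   # (batch, heads, query, key) pre-softmax scores
    print(abs_limit_penalty(w))       # 0.0 for well-behaved weights, like loss-sum=0.000e+00
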
], batch size: 68, lr: 4.46e-03, grad_scale: 32.0 2023-12-04 14:41:55,742 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 14:42:00,033 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.114e+02 1.269e+02 1.374e+02 1.501e+02 2.077e+02, threshold=2.748e+02, percent-clipped=0.0 2023-12-04 14:42:14,133 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=325933.3333333333, ans=0.1 2023-12-04 14:42:20,438 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=325933.3333333333, ans=0.125 2023-12-04 14:42:33,987 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=326000.0, ans=0.125 2023-12-04 14:42:46,585 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=326066.6666666667, ans=0.125 2023-12-04 14:42:49,737 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.45 vs. limit=12.0 2023-12-04 14:43:12,541 INFO [train.py:1087] (1/4) Epoch 55, batch 600, loss[loss=0.1517, simple_loss=0.2424, pruned_loss=0.0305, over 24729.00 frames. ], tot_loss[loss=0.1544, simple_loss=0.247, pruned_loss=0.03088, over 4581813.07 frames. ], batch size: 61, lr: 4.45e-03, grad_scale: 32.0 2023-12-04 14:43:22,015 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=326200.0, ans=0.125 2023-12-04 14:43:28,076 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=326266.6666666667, ans=0.125 2023-12-04 14:43:51,567 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=326333.3333333333, ans=0.125 2023-12-04 14:44:19,389 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.32 vs. limit=6.0 2023-12-04 14:44:29,702 INFO [train.py:1087] (1/4) Epoch 55, batch 650, loss[loss=0.1601, simple_loss=0.2541, pruned_loss=0.03301, over 24801.00 frames. ], tot_loss[loss=0.1543, simple_loss=0.2469, pruned_loss=0.03087, over 4626100.41 frames. ], batch size: 71, lr: 4.45e-03, grad_scale: 32.0 2023-12-04 14:44:34,254 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.100e+02 1.282e+02 1.381e+02 1.479e+02 1.836e+02, threshold=2.762e+02, percent-clipped=0.0 2023-12-04 14:44:45,011 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.97 vs. 
limit=15.0 2023-12-04 14:44:58,565 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=326666.6666666667, ans=0.125 2023-12-04 14:45:00,457 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=326666.6666666667, ans=0.1 2023-12-04 14:45:17,857 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=326733.3333333333, ans=0.125 2023-12-04 14:45:25,825 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=326733.3333333333, ans=0.125 2023-12-04 14:45:30,540 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.38 vs. limit=6.0 2023-12-04 14:45:46,461 INFO [train.py:1087] (1/4) Epoch 55, batch 700, loss[loss=0.1586, simple_loss=0.2531, pruned_loss=0.032, over 24863.00 frames. ], tot_loss[loss=0.1541, simple_loss=0.2468, pruned_loss=0.03071, over 4679103.38 frames. ], batch size: 68, lr: 4.45e-03, grad_scale: 32.0 2023-12-04 14:46:33,753 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.08 vs. limit=15.0 2023-12-04 14:46:37,790 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=327066.6666666667, ans=0.1 2023-12-04 14:46:40,803 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=327066.6666666667, ans=0.125 2023-12-04 14:47:01,625 INFO [train.py:1087] (1/4) Epoch 55, batch 750, loss[loss=0.1555, simple_loss=0.2452, pruned_loss=0.03289, over 23718.00 frames. ], tot_loss[loss=0.1537, simple_loss=0.2463, pruned_loss=0.03051, over 4722810.78 frames. ], batch size: 95, lr: 4.45e-03, grad_scale: 32.0 2023-12-04 14:47:06,276 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.109e+02 1.251e+02 1.375e+02 1.484e+02 2.107e+02, threshold=2.750e+02, percent-clipped=0.0 2023-12-04 14:47:08,983 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.67 vs. limit=22.5 2023-12-04 14:47:17,583 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=327266.6666666667, ans=0.1 2023-12-04 14:48:09,231 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 14:48:17,088 INFO [train.py:1087] (1/4) Epoch 55, batch 800, loss[loss=0.1453, simple_loss=0.2367, pruned_loss=0.02689, over 24676.00 frames. ], tot_loss[loss=0.1536, simple_loss=0.2463, pruned_loss=0.03051, over 4723182.30 frames. 
], batch size: 74, lr: 4.45e-03, grad_scale: 32.0 2023-12-04 14:48:27,195 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=327533.3333333333, ans=0.1 2023-12-04 14:48:41,859 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=327600.0, ans=0.07 2023-12-04 14:49:04,501 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=327733.3333333333, ans=0.2 2023-12-04 14:49:08,515 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=327733.3333333333, ans=0.125 2023-12-04 14:49:11,279 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=327800.0, ans=0.125 2023-12-04 14:49:25,761 INFO [train.py:1087] (1/4) Epoch 55, batch 850, loss[loss=0.1519, simple_loss=0.2452, pruned_loss=0.02932, over 24730.00 frames. ], tot_loss[loss=0.1536, simple_loss=0.2463, pruned_loss=0.03052, over 4748518.77 frames. ], batch size: 67, lr: 4.44e-03, grad_scale: 32.0 2023-12-04 14:49:30,093 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.086e+02 1.287e+02 1.437e+02 1.562e+02 2.084e+02, threshold=2.874e+02, percent-clipped=0.0 2023-12-04 14:49:34,154 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=327866.6666666667, ans=0.04949747468305833 2023-12-04 14:49:35,813 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.41 vs. limit=10.0 2023-12-04 14:49:46,502 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=327933.3333333333, ans=0.0 2023-12-04 14:50:05,591 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=328066.6666666667, ans=0.1 2023-12-04 14:50:14,691 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.47 vs. limit=15.0 2023-12-04 14:50:47,363 INFO [train.py:1087] (1/4) Epoch 56, batch 0, loss[loss=0.1481, simple_loss=0.241, pruned_loss=0.02758, over 24320.00 frames. ], tot_loss[loss=0.1481, simple_loss=0.241, pruned_loss=0.02758, over 24320.00 frames. ], batch size: 79, lr: 4.40e-03, grad_scale: 32.0 2023-12-04 14:50:47,365 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 14:51:03,637 INFO [train.py:1119] (1/4) Epoch 56, validation: loss=0.1512, simple_loss=0.2487, pruned_loss=0.0268, over 944034.00 frames. 2023-12-04 14:51:03,638 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 14:51:51,622 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=328366.6666666667, ans=0.125 2023-12-04 14:51:56,270 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=328366.6666666667, ans=0.2 2023-12-04 14:52:20,363 INFO [train.py:1087] (1/4) Epoch 56, batch 50, loss[loss=0.15, simple_loss=0.2401, pruned_loss=0.03, over 24574.00 frames. ], tot_loss[loss=0.152, simple_loss=0.2456, pruned_loss=0.02919, over 1099824.07 frames. 
], batch size: 64, lr: 4.40e-03, grad_scale: 16.0 2023-12-04 14:52:29,375 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=328500.0, ans=0.125 2023-12-04 14:52:35,604 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.093e+02 1.319e+02 1.474e+02 1.645e+02 2.645e+02, threshold=2.949e+02, percent-clipped=0.0 2023-12-04 14:52:46,231 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=328566.6666666667, ans=0.0 2023-12-04 14:53:17,173 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=328700.0, ans=0.2 2023-12-04 14:53:37,575 INFO [train.py:1087] (1/4) Epoch 56, batch 100, loss[loss=0.1485, simple_loss=0.2449, pruned_loss=0.0261, over 24768.00 frames. ], tot_loss[loss=0.1518, simple_loss=0.2453, pruned_loss=0.02912, over 1923447.42 frames. ], batch size: 70, lr: 4.40e-03, grad_scale: 16.0 2023-12-04 14:53:42,978 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=328833.3333333333, ans=0.0 2023-12-04 14:53:45,998 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=328833.3333333333, ans=0.125 2023-12-04 14:54:22,173 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=329033.3333333333, ans=0.125 2023-12-04 14:54:25,391 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.74 vs. limit=15.0 2023-12-04 14:54:52,616 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=329166.6666666667, ans=0.1 2023-12-04 14:54:53,395 INFO [train.py:1087] (1/4) Epoch 56, batch 150, loss[loss=0.149, simple_loss=0.2454, pruned_loss=0.02633, over 24747.00 frames. ], tot_loss[loss=0.1523, simple_loss=0.2457, pruned_loss=0.02941, over 2575532.18 frames. ], batch size: 66, lr: 4.39e-03, grad_scale: 16.0 2023-12-04 14:55:02,070 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=329166.6666666667, ans=0.125 2023-12-04 14:55:09,763 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.250e+02 1.328e+02 1.469e+02 1.847e+02, threshold=2.656e+02, percent-clipped=0.0 2023-12-04 14:55:11,381 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=329233.3333333333, ans=0.0 2023-12-04 14:55:43,341 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=329366.6666666667, ans=10.0 2023-12-04 14:55:43,880 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.21 vs. limit=15.0 2023-12-04 14:56:02,004 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.50 vs. limit=15.0 2023-12-04 14:56:09,821 INFO [train.py:1087] (1/4) Epoch 56, batch 200, loss[loss=0.1548, simple_loss=0.246, pruned_loss=0.03179, over 23722.00 frames. ], tot_loss[loss=0.1527, simple_loss=0.2458, pruned_loss=0.02976, over 3069079.33 frames. 
], batch size: 57, lr: 4.39e-03, grad_scale: 16.0 2023-12-04 14:56:21,605 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=329500.0, ans=0.2 2023-12-04 14:57:17,571 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.50 vs. limit=22.5 2023-12-04 14:57:23,036 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-12-04 14:57:25,427 INFO [train.py:1087] (1/4) Epoch 56, batch 250, loss[loss=0.1592, simple_loss=0.2465, pruned_loss=0.03597, over 24060.00 frames. ], tot_loss[loss=0.1526, simple_loss=0.2458, pruned_loss=0.02975, over 3464894.98 frames. ], batch size: 87, lr: 4.39e-03, grad_scale: 16.0 2023-12-04 14:57:25,839 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=329833.3333333333, ans=0.125 2023-12-04 14:57:35,951 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=329833.3333333333, ans=0.125 2023-12-04 14:57:39,920 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.149e+02 1.298e+02 1.398e+02 1.522e+02 2.172e+02, threshold=2.796e+02, percent-clipped=0.0 2023-12-04 14:58:02,755 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=329966.6666666667, ans=0.0 2023-12-04 14:58:27,601 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=330100.0, ans=0.0 2023-12-04 14:58:41,388 INFO [train.py:1087] (1/4) Epoch 56, batch 300, loss[loss=0.1628, simple_loss=0.2477, pruned_loss=0.03893, over 24483.00 frames. ], tot_loss[loss=0.1528, simple_loss=0.2457, pruned_loss=0.02998, over 3761961.05 frames. ], batch size: 75, lr: 4.39e-03, grad_scale: 16.0 2023-12-04 14:58:48,028 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=330166.6666666667, ans=0.0 2023-12-04 14:59:03,837 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=330233.3333333333, ans=0.125 2023-12-04 14:59:08,277 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.08 vs. limit=15.0 2023-12-04 14:59:13,849 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=330300.0, ans=0.125 2023-12-04 14:59:19,682 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=330300.0, ans=0.0 2023-12-04 14:59:41,080 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=330433.3333333333, ans=0.0 2023-12-04 14:59:55,184 INFO [train.py:1087] (1/4) Epoch 56, batch 350, loss[loss=0.1608, simple_loss=0.252, pruned_loss=0.03482, over 20912.00 frames. ], tot_loss[loss=0.153, simple_loss=0.2459, pruned_loss=0.03008, over 4003988.54 frames. 
], batch size: 50, lr: 4.39e-03, grad_scale: 16.0 2023-12-04 15:00:11,333 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.168e+02 1.289e+02 1.367e+02 1.457e+02 1.920e+02, threshold=2.735e+02, percent-clipped=0.0 2023-12-04 15:00:15,238 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0 2023-12-04 15:00:46,688 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 15:00:53,378 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=330700.0, ans=0.125 2023-12-04 15:01:11,849 INFO [train.py:1087] (1/4) Epoch 56, batch 400, loss[loss=0.154, simple_loss=0.2489, pruned_loss=0.02958, over 24692.00 frames. ], tot_loss[loss=0.1532, simple_loss=0.2461, pruned_loss=0.03015, over 4172582.24 frames. ], batch size: 69, lr: 4.38e-03, grad_scale: 32.0 2023-12-04 15:01:22,148 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=330833.3333333333, ans=0.0 2023-12-04 15:02:28,084 INFO [train.py:1087] (1/4) Epoch 56, batch 450, loss[loss=0.1569, simple_loss=0.2528, pruned_loss=0.03045, over 24802.00 frames. ], tot_loss[loss=0.1529, simple_loss=0.2458, pruned_loss=0.03004, over 4305860.64 frames. ], batch size: 73, lr: 4.38e-03, grad_scale: 32.0 2023-12-04 15:02:32,725 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=331166.6666666667, ans=0.125 2023-12-04 15:02:42,985 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.090e+02 1.275e+02 1.353e+02 1.494e+02 1.982e+02, threshold=2.705e+02, percent-clipped=0.0 2023-12-04 15:02:43,859 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.20 vs. limit=15.0 2023-12-04 15:03:02,827 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=331300.0, ans=0.125 2023-12-04 15:03:07,460 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=331300.0, ans=0.0 2023-12-04 15:03:32,334 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-12-04 15:03:32,508 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.56 vs. limit=15.0 2023-12-04 15:03:33,608 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=331433.3333333333, ans=0.1 2023-12-04 15:03:43,293 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=331433.3333333333, ans=0.0 2023-12-04 15:03:46,048 INFO [train.py:1087] (1/4) Epoch 56, batch 500, loss[loss=0.1589, simple_loss=0.2534, pruned_loss=0.0322, over 22809.00 frames. ], tot_loss[loss=0.1532, simple_loss=0.246, pruned_loss=0.03018, over 4400646.79 frames. 
], batch size: 106, lr: 4.38e-03, grad_scale: 16.0 2023-12-04 15:03:55,398 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=331500.0, ans=0.2 2023-12-04 15:04:03,826 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=331566.6666666667, ans=0.125 2023-12-04 15:04:05,168 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=331566.6666666667, ans=0.125 2023-12-04 15:04:16,680 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=331633.3333333333, ans=0.1 2023-12-04 15:04:38,808 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=331700.0, ans=10.0 2023-12-04 15:04:43,801 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=331700.0, ans=0.2 2023-12-04 15:04:52,327 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=331766.6666666667, ans=0.0 2023-12-04 15:05:01,803 INFO [train.py:1087] (1/4) Epoch 56, batch 550, loss[loss=0.151, simple_loss=0.2368, pruned_loss=0.03257, over 24803.00 frames. ], tot_loss[loss=0.1531, simple_loss=0.2457, pruned_loss=0.03019, over 4503526.20 frames. ], batch size: 71, lr: 4.38e-03, grad_scale: 16.0 2023-12-04 15:05:18,655 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.086e+02 1.304e+02 1.422e+02 1.549e+02 2.533e+02, threshold=2.844e+02, percent-clipped=0.0 2023-12-04 15:05:22,315 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=331900.0, ans=0.0 2023-12-04 15:05:50,437 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=332033.3333333333, ans=0.0 2023-12-04 15:06:02,009 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=332033.3333333333, ans=0.2 2023-12-04 15:06:12,032 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 15:06:12,390 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.79 vs. limit=15.0 2023-12-04 15:06:14,916 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=332100.0, ans=0.0 2023-12-04 15:06:19,667 INFO [train.py:1087] (1/4) Epoch 56, batch 600, loss[loss=0.1439, simple_loss=0.2367, pruned_loss=0.02559, over 24759.00 frames. ], tot_loss[loss=0.153, simple_loss=0.2457, pruned_loss=0.03016, over 4573344.88 frames. 
], batch size: 64, lr: 4.38e-03, grad_scale: 16.0 2023-12-04 15:06:57,806 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=332300.0, ans=0.125 2023-12-04 15:07:00,096 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=332300.0, ans=0.2 2023-12-04 15:07:04,424 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=332366.6666666667, ans=0.2 2023-12-04 15:07:36,858 INFO [train.py:1087] (1/4) Epoch 56, batch 650, loss[loss=0.155, simple_loss=0.2445, pruned_loss=0.0327, over 24754.00 frames. ], tot_loss[loss=0.153, simple_loss=0.2458, pruned_loss=0.03012, over 4631463.01 frames. ], batch size: 64, lr: 4.37e-03, grad_scale: 16.0 2023-12-04 15:07:45,338 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=332500.0, ans=0.125 2023-12-04 15:07:45,489 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0 2023-12-04 15:07:46,735 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=332500.0, ans=0.0 2023-12-04 15:07:48,684 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=12.0 2023-12-04 15:07:53,880 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.153e+02 1.324e+02 1.446e+02 1.648e+02 2.050e+02, threshold=2.891e+02, percent-clipped=0.0 2023-12-04 15:07:56,373 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.97 vs. limit=15.0 2023-12-04 15:08:02,036 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=332566.6666666667, ans=0.125 2023-12-04 15:08:39,793 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.76 vs. limit=15.0 2023-12-04 15:08:47,695 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=332766.6666666667, ans=0.05 2023-12-04 15:08:53,221 INFO [train.py:1087] (1/4) Epoch 56, batch 700, loss[loss=0.1594, simple_loss=0.2571, pruned_loss=0.03085, over 24793.00 frames. ], tot_loss[loss=0.1535, simple_loss=0.2461, pruned_loss=0.03042, over 4674865.47 frames. 
], batch size: 73, lr: 4.37e-03, grad_scale: 16.0 2023-12-04 15:08:58,435 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=332833.3333333333, ans=0.0 2023-12-04 15:09:19,599 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=332900.0, ans=0.125 2023-12-04 15:09:19,825 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=332900.0, ans=0.5 2023-12-04 15:09:21,164 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=332900.0, ans=0.125 2023-12-04 15:09:27,239 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=332966.6666666667, ans=0.125 2023-12-04 15:10:10,754 INFO [train.py:1087] (1/4) Epoch 56, batch 750, loss[loss=0.151, simple_loss=0.2456, pruned_loss=0.02824, over 24768.00 frames. ], tot_loss[loss=0.1534, simple_loss=0.2462, pruned_loss=0.0303, over 4708840.69 frames. ], batch size: 70, lr: 4.37e-03, grad_scale: 16.0 2023-12-04 15:10:28,022 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.050e+02 1.279e+02 1.376e+02 1.521e+02 2.298e+02, threshold=2.753e+02, percent-clipped=0.0 2023-12-04 15:10:34,007 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=333233.3333333333, ans=0.015 2023-12-04 15:10:43,045 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=333300.0, ans=0.0 2023-12-04 15:10:48,386 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.73 vs. limit=10.0 2023-12-04 15:11:18,466 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=333433.3333333333, ans=0.125 2023-12-04 15:11:26,384 INFO [train.py:1087] (1/4) Epoch 56, batch 800, loss[loss=0.1437, simple_loss=0.2374, pruned_loss=0.02502, over 24722.00 frames. ], tot_loss[loss=0.1532, simple_loss=0.246, pruned_loss=0.03024, over 4718213.38 frames. ], batch size: 69, lr: 4.37e-03, grad_scale: 32.0 2023-12-04 15:11:46,744 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=333566.6666666667, ans=10.0 2023-12-04 15:12:05,860 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=333633.3333333333, ans=0.125 2023-12-04 15:12:24,642 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=333766.6666666667, ans=0.0 2023-12-04 15:12:36,225 INFO [train.py:1087] (1/4) Epoch 56, batch 850, loss[loss=0.1469, simple_loss=0.2407, pruned_loss=0.02659, over 24765.00 frames. ], tot_loss[loss=0.1536, simple_loss=0.2463, pruned_loss=0.0304, over 4736429.55 frames. 
], batch size: 64, lr: 4.36e-03, grad_scale: 32.0 2023-12-04 15:12:50,926 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.136e+02 1.290e+02 1.407e+02 1.531e+02 2.076e+02, threshold=2.814e+02, percent-clipped=0.0 2023-12-04 15:13:01,605 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=333966.6666666667, ans=0.125 2023-12-04 15:13:08,739 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=333966.6666666667, ans=0.0 2023-12-04 15:13:12,746 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=333966.6666666667, ans=0.125 2023-12-04 15:13:19,806 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-12-04 15:13:26,467 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.17 vs. limit=15.0 2023-12-04 15:13:59,689 INFO [train.py:1087] (1/4) Epoch 57, batch 0, loss[loss=0.1485, simple_loss=0.2446, pruned_loss=0.02624, over 24727.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2446, pruned_loss=0.02624, over 24727.00 frames. ], batch size: 67, lr: 4.32e-03, grad_scale: 32.0 2023-12-04 15:13:59,690 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 15:14:16,496 INFO [train.py:1119] (1/4) Epoch 57, validation: loss=0.1509, simple_loss=0.2484, pruned_loss=0.02671, over 944034.00 frames. 2023-12-04 15:14:16,497 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 15:15:05,945 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=334333.3333333333, ans=0.125 2023-12-04 15:15:06,363 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.48 vs. limit=22.5 2023-12-04 15:15:14,551 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=334333.3333333333, ans=0.0 2023-12-04 15:15:28,261 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.57 vs. limit=6.0 2023-12-04 15:15:29,572 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=334400.0, ans=0.04949747468305833 2023-12-04 15:15:34,482 INFO [train.py:1087] (1/4) Epoch 57, batch 50, loss[loss=0.1661, simple_loss=0.2573, pruned_loss=0.03751, over 24329.00 frames. ], tot_loss[loss=0.1531, simple_loss=0.2458, pruned_loss=0.03016, over 1090787.18 frames. 
], batch size: 79, lr: 4.32e-03, grad_scale: 32.0 2023-12-04 15:15:57,894 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.141e+02 1.283e+02 1.393e+02 1.536e+02 2.601e+02, threshold=2.787e+02, percent-clipped=0.0 2023-12-04 15:15:58,374 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=334533.3333333333, ans=0.125 2023-12-04 15:16:02,735 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=334600.0, ans=0.2 2023-12-04 15:16:25,421 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=334666.6666666667, ans=0.125 2023-12-04 15:16:30,889 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=334666.6666666667, ans=0.0 2023-12-04 15:16:35,879 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.77 vs. limit=15.0 2023-12-04 15:16:50,068 INFO [train.py:1087] (1/4) Epoch 57, batch 100, loss[loss=0.1503, simple_loss=0.2462, pruned_loss=0.02717, over 24687.00 frames. ], tot_loss[loss=0.1527, simple_loss=0.2454, pruned_loss=0.02997, over 1923948.25 frames. ], batch size: 74, lr: 4.32e-03, grad_scale: 32.0 2023-12-04 15:17:27,607 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=334933.3333333333, ans=0.125 2023-12-04 15:18:08,214 INFO [train.py:1087] (1/4) Epoch 57, batch 150, loss[loss=0.1474, simple_loss=0.2431, pruned_loss=0.02586, over 24731.00 frames. ], tot_loss[loss=0.1535, simple_loss=0.2461, pruned_loss=0.03047, over 2544180.09 frames. 
], batch size: 67, lr: 4.32e-03, grad_scale: 32.0 2023-12-04 15:18:08,554 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=335133.3333333333, ans=0.125 2023-12-04 15:18:18,683 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=335133.3333333333, ans=0.125 2023-12-04 15:18:29,264 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 15:18:33,150 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.084e+02 1.254e+02 1.319e+02 1.488e+02 2.110e+02, threshold=2.637e+02, percent-clipped=0.0 2023-12-04 15:18:44,647 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=335266.6666666667, ans=0.0 2023-12-04 15:18:47,804 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=335266.6666666667, ans=0.125 2023-12-04 15:18:53,402 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=335333.3333333333, ans=0.0 2023-12-04 15:18:57,996 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=335333.3333333333, ans=0.125 2023-12-04 15:19:01,174 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=335333.3333333333, ans=0.1 2023-12-04 15:19:22,775 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.75 vs. limit=15.0 2023-12-04 15:19:25,683 INFO [train.py:1087] (1/4) Epoch 57, batch 200, loss[loss=0.1454, simple_loss=0.2407, pruned_loss=0.02504, over 24737.00 frames. ], tot_loss[loss=0.1528, simple_loss=0.2454, pruned_loss=0.03007, over 3054938.47 frames. ], batch size: 63, lr: 4.32e-03, grad_scale: 32.0 2023-12-04 15:20:01,260 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.13 vs. limit=15.0 2023-12-04 15:20:10,512 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.18 vs. limit=10.0 2023-12-04 15:20:32,454 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.64 vs. limit=22.5 2023-12-04 15:20:42,500 INFO [train.py:1087] (1/4) Epoch 57, batch 250, loss[loss=0.164, simple_loss=0.2555, pruned_loss=0.03622, over 23477.00 frames. ], tot_loss[loss=0.1532, simple_loss=0.2458, pruned_loss=0.03028, over 3442331.57 frames. ], batch size: 94, lr: 4.31e-03, grad_scale: 32.0 2023-12-04 15:21:08,084 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.140e+02 1.273e+02 1.365e+02 1.487e+02 1.960e+02, threshold=2.731e+02, percent-clipped=0.0 2023-12-04 15:21:31,024 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=336000.0, ans=0.125 2023-12-04 15:21:53,643 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.10 vs. 
limit=15.0 2023-12-04 15:22:01,362 INFO [train.py:1087] (1/4) Epoch 57, batch 300, loss[loss=0.1684, simple_loss=0.258, pruned_loss=0.03945, over 24178.00 frames. ], tot_loss[loss=0.1534, simple_loss=0.246, pruned_loss=0.03036, over 3729716.86 frames. ], batch size: 82, lr: 4.31e-03, grad_scale: 16.0 2023-12-04 15:22:12,259 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=336133.3333333333, ans=0.2 2023-12-04 15:22:15,782 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.80 vs. limit=15.0 2023-12-04 15:22:18,312 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=336200.0, ans=0.2 2023-12-04 15:22:21,791 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.63 vs. limit=22.5 2023-12-04 15:22:36,825 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=336266.6666666667, ans=0.0 2023-12-04 15:22:41,778 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=336266.6666666667, ans=0.125 2023-12-04 15:22:55,239 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=336333.3333333333, ans=0.125 2023-12-04 15:23:03,643 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.18 vs. limit=12.0 2023-12-04 15:23:12,476 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 15:23:19,052 INFO [train.py:1087] (1/4) Epoch 57, batch 350, loss[loss=0.1503, simple_loss=0.2418, pruned_loss=0.02939, over 24580.00 frames. ], tot_loss[loss=0.1537, simple_loss=0.2463, pruned_loss=0.03056, over 3979701.51 frames. ], batch size: 65, lr: 4.31e-03, grad_scale: 16.0 2023-12-04 15:23:38,532 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=336533.3333333333, ans=0.0 2023-12-04 15:23:45,591 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.137e+02 1.295e+02 1.376e+02 1.514e+02 1.932e+02, threshold=2.753e+02, percent-clipped=0.0 2023-12-04 15:24:15,800 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.50 vs. limit=6.0 2023-12-04 15:24:35,373 INFO [train.py:1087] (1/4) Epoch 57, batch 400, loss[loss=0.1532, simple_loss=0.2486, pruned_loss=0.02887, over 24809.00 frames. ], tot_loss[loss=0.154, simple_loss=0.2465, pruned_loss=0.03069, over 4159557.94 frames. 
], batch size: 72, lr: 4.31e-03, grad_scale: 32.0 2023-12-04 15:24:38,753 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=336800.0, ans=0.0 2023-12-04 15:24:45,872 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=336800.0, ans=0.125 2023-12-04 15:25:04,612 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=336866.6666666667, ans=0.0 2023-12-04 15:25:10,735 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=336933.3333333333, ans=0.0 2023-12-04 15:25:18,296 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=336933.3333333333, ans=0.125 2023-12-04 15:25:29,019 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-12-04 15:25:38,182 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.26 vs. limit=12.0 2023-12-04 15:25:42,746 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.84 vs. limit=22.5 2023-12-04 15:25:47,080 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=337066.6666666667, ans=0.125 2023-12-04 15:25:53,934 INFO [train.py:1087] (1/4) Epoch 57, batch 450, loss[loss=0.1481, simple_loss=0.2413, pruned_loss=0.02744, over 24607.00 frames. ], tot_loss[loss=0.154, simple_loss=0.2467, pruned_loss=0.03072, over 4286078.13 frames. ], batch size: 68, lr: 4.30e-03, grad_scale: 32.0 2023-12-04 15:25:54,398 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=337133.3333333333, ans=0.125 2023-12-04 15:26:03,019 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=337133.3333333333, ans=0.125 2023-12-04 15:26:09,230 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=337200.0, ans=0.1 2023-12-04 15:26:10,689 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=337200.0, ans=0.125 2023-12-04 15:26:16,382 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=337200.0, ans=0.125 2023-12-04 15:26:20,994 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.099e+02 1.284e+02 1.352e+02 1.480e+02 2.036e+02, threshold=2.705e+02, percent-clipped=0.0 2023-12-04 15:26:21,477 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=337200.0, ans=0.2 2023-12-04 15:26:25,804 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=337266.6666666667, ans=0.125 2023-12-04 15:27:12,849 INFO [train.py:1087] (1/4) Epoch 57, batch 500, loss[loss=0.1465, simple_loss=0.2443, pruned_loss=0.0244, over 24719.00 frames. ], tot_loss[loss=0.1545, simple_loss=0.2471, pruned_loss=0.03098, over 4378693.67 frames. 
], batch size: 69, lr: 4.30e-03, grad_scale: 32.0 2023-12-04 15:27:13,265 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=337466.6666666667, ans=0.125 2023-12-04 15:27:36,185 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=337533.3333333333, ans=0.04949747468305833 2023-12-04 15:27:45,683 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=337600.0, ans=0.0 2023-12-04 15:28:06,411 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=337666.6666666667, ans=0.125 2023-12-04 15:28:06,586 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=337666.6666666667, ans=0.2 2023-12-04 15:28:21,604 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 15:28:31,640 INFO [train.py:1087] (1/4) Epoch 57, batch 550, loss[loss=0.1567, simple_loss=0.2471, pruned_loss=0.03311, over 24761.00 frames. ], tot_loss[loss=0.1539, simple_loss=0.2466, pruned_loss=0.03059, over 4484117.10 frames. ], batch size: 66, lr: 4.30e-03, grad_scale: 16.0 2023-12-04 15:28:35,243 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.80 vs. limit=15.0 2023-12-04 15:28:35,262 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.13 vs. limit=15.0 2023-12-04 15:28:50,089 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=337866.6666666667, ans=0.125 2023-12-04 15:28:56,033 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=337866.6666666667, ans=0.07 2023-12-04 15:28:58,571 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.115e+02 1.296e+02 1.366e+02 1.475e+02 1.863e+02, threshold=2.731e+02, percent-clipped=0.0 2023-12-04 15:28:59,342 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.41 vs. limit=15.0 2023-12-04 15:29:13,376 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 15:29:42,203 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=338066.6666666667, ans=0.1 2023-12-04 15:29:48,225 INFO [train.py:1087] (1/4) Epoch 57, batch 600, loss[loss=0.1594, simple_loss=0.2554, pruned_loss=0.03165, over 24560.00 frames. ], tot_loss[loss=0.1537, simple_loss=0.2463, pruned_loss=0.03056, over 4569869.62 frames. 
], batch size: 62, lr: 4.30e-03, grad_scale: 16.0 2023-12-04 15:29:50,006 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=338133.3333333333, ans=0.0 2023-12-04 15:30:11,235 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=338200.0, ans=0.5 2023-12-04 15:30:57,131 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=338400.0, ans=0.125 2023-12-04 15:31:06,695 INFO [train.py:1087] (1/4) Epoch 57, batch 650, loss[loss=0.1497, simple_loss=0.2413, pruned_loss=0.02903, over 24174.00 frames. ], tot_loss[loss=0.1534, simple_loss=0.2463, pruned_loss=0.03027, over 4634621.15 frames. ], batch size: 58, lr: 4.30e-03, grad_scale: 16.0 2023-12-04 15:31:31,815 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=338533.3333333333, ans=0.2 2023-12-04 15:31:34,229 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.138e+02 1.254e+02 1.345e+02 1.461e+02 1.814e+02, threshold=2.691e+02, percent-clipped=0.0 2023-12-04 15:31:40,373 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=338600.0, ans=0.125 2023-12-04 15:31:44,065 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=338600.0, ans=0.05 2023-12-04 15:32:23,342 INFO [train.py:1087] (1/4) Epoch 57, batch 700, loss[loss=0.1525, simple_loss=0.245, pruned_loss=0.02997, over 24810.00 frames. ], tot_loss[loss=0.1533, simple_loss=0.2463, pruned_loss=0.03017, over 4689163.58 frames. ], batch size: 62, lr: 4.29e-03, grad_scale: 16.0 2023-12-04 15:32:57,921 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=338933.3333333333, ans=0.1 2023-12-04 15:33:12,445 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.36 vs. limit=15.0 2023-12-04 15:33:36,075 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=339066.6666666667, ans=0.125 2023-12-04 15:33:37,348 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=339066.6666666667, ans=0.0 2023-12-04 15:33:41,803 INFO [train.py:1087] (1/4) Epoch 57, batch 750, loss[loss=0.1603, simple_loss=0.2448, pruned_loss=0.03787, over 24737.00 frames. ], tot_loss[loss=0.1526, simple_loss=0.2456, pruned_loss=0.02981, over 4728714.32 frames. ], batch size: 63, lr: 4.29e-03, grad_scale: 16.0 2023-12-04 15:34:05,869 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.27 vs. 
limit=6.0 2023-12-04 15:34:07,019 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=339200.0, ans=0.125 2023-12-04 15:34:09,572 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.164e+02 1.274e+02 1.366e+02 1.466e+02 1.796e+02, threshold=2.732e+02, percent-clipped=0.0 2023-12-04 15:34:21,297 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=339266.6666666667, ans=0.125 2023-12-04 15:34:40,911 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=339333.3333333333, ans=0.125 2023-12-04 15:34:58,957 INFO [train.py:1087] (1/4) Epoch 57, batch 800, loss[loss=0.1473, simple_loss=0.2411, pruned_loss=0.02674, over 24543.00 frames. ], tot_loss[loss=0.1527, simple_loss=0.2456, pruned_loss=0.02987, over 4755974.56 frames. ], batch size: 62, lr: 4.29e-03, grad_scale: 32.0 2023-12-04 15:35:35,115 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=339600.0, ans=0.0 2023-12-04 15:36:04,461 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 15:36:09,426 INFO [train.py:1087] (1/4) Epoch 57, batch 850, loss[loss=0.1563, simple_loss=0.244, pruned_loss=0.03427, over 24506.00 frames. ], tot_loss[loss=0.1529, simple_loss=0.2458, pruned_loss=0.03004, over 4774853.27 frames. ], batch size: 77, lr: 4.29e-03, grad_scale: 32.0 2023-12-04 15:36:15,538 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=339800.0, ans=0.125 2023-12-04 15:36:33,913 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.047e+02 1.261e+02 1.353e+02 1.424e+02 1.970e+02, threshold=2.706e+02, percent-clipped=0.0 2023-12-04 15:37:22,807 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=340100.0, ans=0.0 2023-12-04 15:37:36,198 INFO [train.py:1087] (1/4) Epoch 58, batch 0, loss[loss=0.1401, simple_loss=0.2333, pruned_loss=0.02343, over 24793.00 frames. ], tot_loss[loss=0.1401, simple_loss=0.2333, pruned_loss=0.02343, over 24793.00 frames. ], batch size: 72, lr: 4.25e-03, grad_scale: 32.0 2023-12-04 15:37:36,200 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 15:37:52,586 INFO [train.py:1119] (1/4) Epoch 58, validation: loss=0.1514, simple_loss=0.2484, pruned_loss=0.02714, over 944034.00 frames. 2023-12-04 15:37:52,588 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 15:37:52,809 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=340100.0, ans=0.125 2023-12-04 15:38:05,261 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.93 vs. limit=15.0 2023-12-04 15:38:09,673 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.03 vs. 
limit=22.5 2023-12-04 15:38:31,693 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=340233.3333333333, ans=0.2 2023-12-04 15:38:39,243 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=340300.0, ans=0.0 2023-12-04 15:39:09,597 INFO [train.py:1087] (1/4) Epoch 58, batch 50, loss[loss=0.1431, simple_loss=0.238, pruned_loss=0.02409, over 24743.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.2449, pruned_loss=0.02962, over 1092557.17 frames. ], batch size: 63, lr: 4.25e-03, grad_scale: 32.0 2023-12-04 15:39:44,474 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.141e+02 1.274e+02 1.368e+02 1.509e+02 2.414e+02, threshold=2.736e+02, percent-clipped=0.0 2023-12-04 15:39:47,764 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=340566.6666666667, ans=0.0 2023-12-04 15:40:25,857 INFO [train.py:1087] (1/4) Epoch 58, batch 100, loss[loss=0.1487, simple_loss=0.2419, pruned_loss=0.02779, over 23702.00 frames. ], tot_loss[loss=0.153, simple_loss=0.2457, pruned_loss=0.0301, over 1923799.60 frames. ], batch size: 57, lr: 4.24e-03, grad_scale: 16.0 2023-12-04 15:40:30,605 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=340766.6666666667, ans=22.5 2023-12-04 15:40:38,388 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=340766.6666666667, ans=0.125 2023-12-04 15:40:40,267 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-12-04 15:40:41,675 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.16 vs. limit=22.5 2023-12-04 15:41:00,962 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=340900.0, ans=0.2 2023-12-04 15:41:05,628 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=340900.0, ans=0.0 2023-12-04 15:41:18,346 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=340966.6666666667, ans=0.125 2023-12-04 15:41:19,732 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=340966.6666666667, ans=0.2 2023-12-04 15:41:21,093 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=340966.6666666667, ans=0.02 2023-12-04 15:41:28,383 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=341033.3333333333, ans=10.0 2023-12-04 15:41:28,517 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.40 vs. limit=22.5 2023-12-04 15:41:31,373 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=341033.3333333333, ans=0.05 2023-12-04 15:41:42,188 INFO [train.py:1087] (1/4) Epoch 58, batch 150, loss[loss=0.1644, simple_loss=0.2563, pruned_loss=0.03624, over 22771.00 frames. 
], tot_loss[loss=0.1525, simple_loss=0.2453, pruned_loss=0.02987, over 2570450.91 frames. ], batch size: 106, lr: 4.24e-03, grad_scale: 16.0 2023-12-04 15:42:19,233 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.107e+02 1.290e+02 1.368e+02 1.495e+02 2.252e+02, threshold=2.736e+02, percent-clipped=0.0 2023-12-04 15:42:31,104 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=341300.0, ans=0.125 2023-12-04 15:42:32,678 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=341300.0, ans=0.07 2023-12-04 15:42:35,376 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=341300.0, ans=0.025 2023-12-04 15:42:44,242 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=341366.6666666667, ans=0.0 2023-12-04 15:42:56,932 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=341366.6666666667, ans=0.1 2023-12-04 15:42:57,000 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=341366.6666666667, ans=0.0 2023-12-04 15:42:59,686 INFO [train.py:1087] (1/4) Epoch 58, batch 200, loss[loss=0.1552, simple_loss=0.247, pruned_loss=0.03165, over 24520.00 frames. ], tot_loss[loss=0.1529, simple_loss=0.2455, pruned_loss=0.03009, over 3070620.69 frames. ], batch size: 75, lr: 4.24e-03, grad_scale: 16.0 2023-12-04 15:43:01,465 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=341433.3333333333, ans=0.125 2023-12-04 15:43:20,646 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.99 vs. limit=15.0 2023-12-04 15:43:23,431 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=341500.0, ans=0.125 2023-12-04 15:43:25,743 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.10 vs. limit=22.5 2023-12-04 15:43:26,502 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=341500.0, ans=0.0 2023-12-04 15:43:31,841 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=341566.6666666667, ans=0.0 2023-12-04 15:44:05,002 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=341700.0, ans=0.1 2023-12-04 15:44:15,195 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.17 vs. limit=15.0 2023-12-04 15:44:17,121 INFO [train.py:1087] (1/4) Epoch 58, batch 250, loss[loss=0.1434, simple_loss=0.2373, pruned_loss=0.02476, over 24563.00 frames. ], tot_loss[loss=0.1525, simple_loss=0.2453, pruned_loss=0.02987, over 3452229.41 frames. 
], batch size: 66, lr: 4.24e-03, grad_scale: 16.0 2023-12-04 15:44:54,086 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.072e+02 1.304e+02 1.398e+02 1.543e+02 2.127e+02, threshold=2.795e+02, percent-clipped=0.0 2023-12-04 15:44:59,320 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=341900.0, ans=0.125 2023-12-04 15:45:08,280 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=341966.6666666667, ans=0.0 2023-12-04 15:45:30,241 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=342033.3333333333, ans=0.05 2023-12-04 15:45:34,529 INFO [train.py:1087] (1/4) Epoch 58, batch 300, loss[loss=0.1508, simple_loss=0.2434, pruned_loss=0.02907, over 24602.00 frames. ], tot_loss[loss=0.1527, simple_loss=0.2455, pruned_loss=0.02994, over 3754148.71 frames. ], batch size: 68, lr: 4.24e-03, grad_scale: 16.0 2023-12-04 15:45:45,566 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=342100.0, ans=0.125 2023-12-04 15:45:53,195 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.48 vs. limit=15.0 2023-12-04 15:46:12,789 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=342233.3333333333, ans=0.0 2023-12-04 15:46:23,022 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=342300.0, ans=0.0 2023-12-04 15:46:30,254 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=342300.0, ans=0.125 2023-12-04 15:46:40,770 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=342366.6666666667, ans=0.125 2023-12-04 15:46:44,218 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.69 vs. limit=15.0 2023-12-04 15:46:49,366 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=342433.3333333333, ans=0.0 2023-12-04 15:46:49,610 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 15:46:50,649 INFO [train.py:1087] (1/4) Epoch 58, batch 350, loss[loss=0.1474, simple_loss=0.2425, pruned_loss=0.02614, over 24728.00 frames. ], tot_loss[loss=0.1535, simple_loss=0.2462, pruned_loss=0.0304, over 3970567.51 frames. ], batch size: 74, lr: 4.23e-03, grad_scale: 16.0 2023-12-04 15:47:25,557 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.108e+02 1.239e+02 1.325e+02 1.427e+02 1.705e+02, threshold=2.650e+02, percent-clipped=0.0 2023-12-04 15:48:03,442 INFO [train.py:1087] (1/4) Epoch 58, batch 400, loss[loss=0.1591, simple_loss=0.2548, pruned_loss=0.03171, over 24856.00 frames. ], tot_loss[loss=0.1536, simple_loss=0.2462, pruned_loss=0.03044, over 4146851.21 frames. 
], batch size: 68, lr: 4.23e-03, grad_scale: 32.0 2023-12-04 15:48:24,986 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=342833.3333333333, ans=0.0 2023-12-04 15:48:42,441 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=342900.0, ans=0.125 2023-12-04 15:48:53,640 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=342966.6666666667, ans=0.2 2023-12-04 15:49:14,451 INFO [train.py:1087] (1/4) Epoch 58, batch 450, loss[loss=0.1611, simple_loss=0.2504, pruned_loss=0.03586, over 24312.00 frames. ], tot_loss[loss=0.154, simple_loss=0.2467, pruned_loss=0.03065, over 4270801.60 frames. ], batch size: 79, lr: 4.23e-03, grad_scale: 16.0 2023-12-04 15:49:17,414 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=343100.0, ans=0.1 2023-12-04 15:49:19,228 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.65 vs. limit=15.0 2023-12-04 15:49:51,096 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.088e+02 1.271e+02 1.365e+02 1.481e+02 1.907e+02, threshold=2.729e+02, percent-clipped=0.0 2023-12-04 15:50:07,627 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=343300.0, ans=0.125 2023-12-04 15:50:14,644 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=343366.6666666667, ans=0.07 2023-12-04 15:50:22,562 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.41 vs. limit=15.0 2023-12-04 15:50:25,080 INFO [train.py:1087] (1/4) Epoch 58, batch 500, loss[loss=0.1398, simple_loss=0.2309, pruned_loss=0.02439, over 24762.00 frames. ], tot_loss[loss=0.1538, simple_loss=0.2463, pruned_loss=0.03063, over 4377156.62 frames. ], batch size: 66, lr: 4.23e-03, grad_scale: 8.0 2023-12-04 15:50:25,550 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=343433.3333333333, ans=0.025 2023-12-04 15:50:30,795 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=343433.3333333333, ans=0.1 2023-12-04 15:50:32,180 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=343433.3333333333, ans=0.0 2023-12-04 15:50:51,378 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=343566.6666666667, ans=0.2 2023-12-04 15:51:01,936 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=343566.6666666667, ans=0.125 2023-12-04 15:51:11,591 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=343633.3333333333, ans=0.0 2023-12-04 15:51:35,267 INFO [train.py:1087] (1/4) Epoch 58, batch 550, loss[loss=0.1502, simple_loss=0.2422, pruned_loss=0.02908, over 24804.00 frames. ], tot_loss[loss=0.1537, simple_loss=0.2462, pruned_loss=0.03058, over 4473066.69 frames. 
], batch size: 62, lr: 4.23e-03, grad_scale: 8.0 2023-12-04 15:51:35,579 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=343766.6666666667, ans=0.125 2023-12-04 15:51:36,749 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=343766.6666666667, ans=0.2 2023-12-04 15:51:36,967 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=343766.6666666667, ans=0.0 2023-12-04 15:52:11,259 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.128e+02 1.282e+02 1.355e+02 1.434e+02 2.119e+02, threshold=2.711e+02, percent-clipped=0.0 2023-12-04 15:52:44,450 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=344100.0, ans=0.0 2023-12-04 15:52:45,350 INFO [train.py:1087] (1/4) Epoch 58, batch 600, loss[loss=0.1523, simple_loss=0.2465, pruned_loss=0.02902, over 24354.00 frames. ], tot_loss[loss=0.1537, simple_loss=0.2463, pruned_loss=0.03053, over 4554841.52 frames. ], batch size: 79, lr: 4.22e-03, grad_scale: 8.0 2023-12-04 15:52:45,777 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=344100.0, ans=0.0 2023-12-04 15:52:57,023 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=344100.0, ans=0.125 2023-12-04 15:53:07,155 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=344166.6666666667, ans=0.2 2023-12-04 15:53:11,712 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=344166.6666666667, ans=0.0 2023-12-04 15:53:15,989 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.81 vs. limit=6.0 2023-12-04 15:53:27,209 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=344300.0, ans=0.0 2023-12-04 15:53:55,581 INFO [train.py:1087] (1/4) Epoch 58, batch 650, loss[loss=0.1567, simple_loss=0.2446, pruned_loss=0.03437, over 24195.00 frames. ], tot_loss[loss=0.1533, simple_loss=0.2458, pruned_loss=0.03041, over 4604668.06 frames. ], batch size: 58, lr: 4.22e-03, grad_scale: 8.0 2023-12-04 15:53:58,898 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.24 vs. limit=15.0 2023-12-04 15:54:12,755 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=344500.0, ans=0.04949747468305833 2023-12-04 15:54:22,532 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=344566.6666666667, ans=0.125 2023-12-04 15:54:31,386 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.115e+02 1.273e+02 1.359e+02 1.478e+02 1.827e+02, threshold=2.719e+02, percent-clipped=0.0 2023-12-04 15:54:32,207 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.68 vs. 
limit=15.0 2023-12-04 15:54:35,121 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=344566.6666666667, ans=0.125 2023-12-04 15:54:44,357 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 15:54:47,084 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=344633.3333333333, ans=0.1 2023-12-04 15:54:54,700 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=344700.0, ans=0.1 2023-12-04 15:54:58,742 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.02 vs. limit=15.0 2023-12-04 15:55:04,678 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=344766.6666666667, ans=0.125 2023-12-04 15:55:05,460 INFO [train.py:1087] (1/4) Epoch 58, batch 700, loss[loss=0.1408, simple_loss=0.2347, pruned_loss=0.02348, over 24778.00 frames. ], tot_loss[loss=0.153, simple_loss=0.2455, pruned_loss=0.03023, over 4649552.09 frames. ], batch size: 70, lr: 4.22e-03, grad_scale: 8.0 2023-12-04 15:55:09,668 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=344766.6666666667, ans=0.1 2023-12-04 15:55:17,356 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=344766.6666666667, ans=10.0 2023-12-04 15:55:20,169 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0 2023-12-04 15:55:48,286 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.80 vs. limit=6.0 2023-12-04 15:56:02,744 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=345033.3333333333, ans=0.0 2023-12-04 15:56:08,351 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=345033.3333333333, ans=0.04949747468305833 2023-12-04 15:56:15,859 INFO [train.py:1087] (1/4) Epoch 58, batch 750, loss[loss=0.146, simple_loss=0.2381, pruned_loss=0.02691, over 24774.00 frames. ], tot_loss[loss=0.1529, simple_loss=0.2454, pruned_loss=0.03023, over 4684465.93 frames. ], batch size: 71, lr: 4.22e-03, grad_scale: 8.0 2023-12-04 15:56:17,592 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=345100.0, ans=0.0 2023-12-04 15:56:31,127 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=345166.6666666667, ans=0.125 2023-12-04 15:56:38,457 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=345166.6666666667, ans=0.0 2023-12-04 15:56:45,287 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.35 vs. 
limit=15.0 2023-12-04 15:56:53,209 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.255e+02 1.327e+02 1.411e+02 2.341e+02, threshold=2.654e+02, percent-clipped=0.0 2023-12-04 15:57:05,015 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=345300.0, ans=0.0 2023-12-04 15:57:06,675 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=345300.0, ans=0.125 2023-12-04 15:57:13,575 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.66 vs. limit=22.5 2023-12-04 15:57:27,052 INFO [train.py:1087] (1/4) Epoch 58, batch 800, loss[loss=0.1448, simple_loss=0.2373, pruned_loss=0.02621, over 24549.00 frames. ], tot_loss[loss=0.1527, simple_loss=0.2452, pruned_loss=0.03014, over 4696011.36 frames. ], batch size: 63, lr: 4.22e-03, grad_scale: 16.0 2023-12-04 15:57:52,906 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=345566.6666666667, ans=0.125 2023-12-04 15:58:01,320 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=345566.6666666667, ans=0.125 2023-12-04 15:58:14,498 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=345633.3333333333, ans=0.1 2023-12-04 15:58:28,589 INFO [train.py:1087] (1/4) Epoch 58, batch 850, loss[loss=0.1475, simple_loss=0.2466, pruned_loss=0.02421, over 24814.00 frames. ], tot_loss[loss=0.1527, simple_loss=0.2452, pruned_loss=0.03009, over 4716196.34 frames. ], batch size: 72, lr: 4.21e-03, grad_scale: 16.0 2023-12-04 15:58:58,523 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=345900.0, ans=0.1 2023-12-04 15:59:00,496 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.124e+02 1.280e+02 1.372e+02 1.495e+02 2.014e+02, threshold=2.743e+02, percent-clipped=0.0 2023-12-04 15:59:04,613 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=345966.6666666667, ans=0.2 2023-12-04 15:59:19,135 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=346033.3333333333, ans=0.125 2023-12-04 15:59:44,937 INFO [train.py:1087] (1/4) Epoch 59, batch 0, loss[loss=0.1515, simple_loss=0.2451, pruned_loss=0.02891, over 24338.00 frames. ], tot_loss[loss=0.1515, simple_loss=0.2451, pruned_loss=0.02891, over 24338.00 frames. ], batch size: 79, lr: 4.18e-03, grad_scale: 32.0 2023-12-04 15:59:44,938 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 15:59:56,221 INFO [zipformer.py:1876] (1/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.8690, 4.5019, 4.6765, 4.2934], device='cuda:1') 2023-12-04 16:00:01,547 INFO [train.py:1119] (1/4) Epoch 59, validation: loss=0.151, simple_loss=0.2482, pruned_loss=0.02689, over 944034.00 frames. 
2023-12-04 16:00:01,549 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 16:00:05,683 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=346066.6666666667, ans=0.09899494936611666 2023-12-04 16:00:13,357 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 16:00:13,467 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=346133.3333333333, ans=0.125 2023-12-04 16:00:41,336 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=346266.6666666667, ans=0.125 2023-12-04 16:01:10,436 INFO [train.py:1087] (1/4) Epoch 59, batch 50, loss[loss=0.1712, simple_loss=0.2595, pruned_loss=0.04147, over 24722.00 frames. ], tot_loss[loss=0.1529, simple_loss=0.2455, pruned_loss=0.03017, over 1086002.24 frames. ], batch size: 61, lr: 4.17e-03, grad_scale: 32.0 2023-12-04 16:01:42,274 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=346533.3333333333, ans=0.0 2023-12-04 16:01:53,320 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.144e+02 1.301e+02 1.399e+02 1.506e+02 1.775e+02, threshold=2.799e+02, percent-clipped=0.0 2023-12-04 16:02:22,203 INFO [train.py:1087] (1/4) Epoch 59, batch 100, loss[loss=0.1518, simple_loss=0.246, pruned_loss=0.0288, over 24722.00 frames. ], tot_loss[loss=0.1541, simple_loss=0.2469, pruned_loss=0.0307, over 1900188.18 frames. ], batch size: 67, lr: 4.17e-03, grad_scale: 16.0 2023-12-04 16:02:30,256 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=346733.3333333333, ans=0.125 2023-12-04 16:02:31,568 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=346733.3333333333, ans=0.125 2023-12-04 16:02:40,758 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=346800.0, ans=0.125 2023-12-04 16:03:02,207 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=346866.6666666667, ans=0.2 2023-12-04 16:03:02,315 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=346866.6666666667, ans=0.05 2023-12-04 16:03:19,331 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=347000.0, ans=0.0 2023-12-04 16:03:19,725 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.76 vs. limit=15.0 2023-12-04 16:03:31,126 INFO [train.py:1087] (1/4) Epoch 59, batch 150, loss[loss=0.1519, simple_loss=0.2474, pruned_loss=0.02824, over 24755.00 frames. ], tot_loss[loss=0.1534, simple_loss=0.2462, pruned_loss=0.03026, over 2540400.19 frames. ], batch size: 70, lr: 4.17e-03, grad_scale: 8.0 2023-12-04 16:03:53,580 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.38 vs. 
limit=22.5 2023-12-04 16:04:16,221 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.125e+02 1.284e+02 1.386e+02 1.532e+02 2.042e+02, threshold=2.772e+02, percent-clipped=0.0 2023-12-04 16:04:39,896 INFO [train.py:1087] (1/4) Epoch 59, batch 200, loss[loss=0.1518, simple_loss=0.2475, pruned_loss=0.02807, over 24717.00 frames. ], tot_loss[loss=0.153, simple_loss=0.2458, pruned_loss=0.03011, over 3047615.27 frames. ], batch size: 61, lr: 4.17e-03, grad_scale: 8.0 2023-12-04 16:04:56,899 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.91 vs. limit=15.0 2023-12-04 16:05:02,569 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=347466.6666666667, ans=0.125 2023-12-04 16:05:48,964 INFO [train.py:1087] (1/4) Epoch 59, batch 250, loss[loss=0.1473, simple_loss=0.241, pruned_loss=0.02682, over 24574.00 frames. ], tot_loss[loss=0.1535, simple_loss=0.2461, pruned_loss=0.03039, over 3417992.34 frames. ], batch size: 64, lr: 4.17e-03, grad_scale: 8.0 2023-12-04 16:05:57,606 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.99 vs. limit=15.0 2023-12-04 16:05:59,903 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=347733.3333333333, ans=0.125 2023-12-04 16:06:10,873 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.94 vs. limit=15.0 2023-12-04 16:06:26,523 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=347866.6666666667, ans=0.04949747468305833 2023-12-04 16:06:33,871 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.115e+02 1.290e+02 1.373e+02 1.506e+02 1.799e+02, threshold=2.747e+02, percent-clipped=0.0 2023-12-04 16:06:42,267 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-12-04 16:06:43,628 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=348000.0, ans=0.125 2023-12-04 16:06:51,575 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=348000.0, ans=0.0 2023-12-04 16:06:57,408 INFO [train.py:1087] (1/4) Epoch 59, batch 300, loss[loss=0.1571, simple_loss=0.2479, pruned_loss=0.03314, over 24560.00 frames. ], tot_loss[loss=0.1532, simple_loss=0.2458, pruned_loss=0.03032, over 3713470.81 frames. ], batch size: 66, lr: 4.16e-03, grad_scale: 8.0 2023-12-04 16:07:03,784 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.34 vs. 
limit=22.5 2023-12-04 16:07:10,205 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=348133.3333333333, ans=0.1 2023-12-04 16:07:24,737 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=348200.0, ans=0.125 2023-12-04 16:07:35,445 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=348200.0, ans=0.125 2023-12-04 16:07:35,754 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.63 vs. limit=10.0 2023-12-04 16:07:47,084 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=348266.6666666667, ans=0.1 2023-12-04 16:08:05,374 INFO [train.py:1087] (1/4) Epoch 59, batch 350, loss[loss=0.1616, simple_loss=0.2492, pruned_loss=0.03702, over 24749.00 frames. ], tot_loss[loss=0.1525, simple_loss=0.2452, pruned_loss=0.02986, over 3961613.36 frames. ], batch size: 61, lr: 4.16e-03, grad_scale: 8.0 2023-12-04 16:08:19,087 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=348466.6666666667, ans=0.125 2023-12-04 16:08:26,199 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=348466.6666666667, ans=0.07 2023-12-04 16:08:31,401 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=348466.6666666667, ans=0.125 2023-12-04 16:08:41,230 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=348533.3333333333, ans=10.0 2023-12-04 16:08:51,082 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.058e+02 1.283e+02 1.381e+02 1.500e+02 2.104e+02, threshold=2.763e+02, percent-clipped=0.0 2023-12-04 16:09:04,231 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=348666.6666666667, ans=0.0 2023-12-04 16:09:13,748 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=348666.6666666667, ans=0.1 2023-12-04 16:09:14,998 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=348733.3333333333, ans=0.0 2023-12-04 16:09:15,960 INFO [train.py:1087] (1/4) Epoch 59, batch 400, loss[loss=0.1567, simple_loss=0.248, pruned_loss=0.03272, over 21920.00 frames. ], tot_loss[loss=0.1528, simple_loss=0.2455, pruned_loss=0.03, over 4154922.61 frames. ], batch size: 128, lr: 4.16e-03, grad_scale: 16.0 2023-12-04 16:09:27,164 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.20 vs. 
limit=15.0 2023-12-04 16:09:29,628 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=348800.0, ans=0.0 2023-12-04 16:09:56,680 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=348933.3333333333, ans=0.0 2023-12-04 16:10:01,919 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=348933.3333333333, ans=0.125 2023-12-04 16:10:24,494 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=349066.6666666667, ans=0.0 2023-12-04 16:10:25,336 INFO [train.py:1087] (1/4) Epoch 59, batch 450, loss[loss=0.1737, simple_loss=0.2621, pruned_loss=0.04263, over 24291.00 frames. ], tot_loss[loss=0.1527, simple_loss=0.2455, pruned_loss=0.02989, over 4309267.12 frames. ], batch size: 79, lr: 4.16e-03, grad_scale: 16.0 2023-12-04 16:10:59,447 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.73 vs. limit=15.0 2023-12-04 16:11:09,787 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.084e+02 1.274e+02 1.345e+02 1.464e+02 2.678e+02, threshold=2.690e+02, percent-clipped=0.0 2023-12-04 16:11:21,385 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=349333.3333333333, ans=0.125 2023-12-04 16:11:25,480 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=349333.3333333333, ans=0.125 2023-12-04 16:11:34,066 INFO [train.py:1087] (1/4) Epoch 59, batch 500, loss[loss=0.1477, simple_loss=0.2432, pruned_loss=0.02604, over 24717.00 frames. ], tot_loss[loss=0.1527, simple_loss=0.2456, pruned_loss=0.02992, over 4420659.12 frames. ], batch size: 69, lr: 4.16e-03, grad_scale: 16.0 2023-12-04 16:11:35,717 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=349400.0, ans=0.2 2023-12-04 16:11:40,970 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=349400.0, ans=0.1 2023-12-04 16:12:24,842 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=349600.0, ans=0.5 2023-12-04 16:12:31,946 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=349666.6666666667, ans=0.1 2023-12-04 16:12:31,962 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=349666.6666666667, ans=0.125 2023-12-04 16:12:41,934 INFO [train.py:1087] (1/4) Epoch 59, batch 550, loss[loss=0.153, simple_loss=0.2479, pruned_loss=0.02899, over 24797.00 frames. ], tot_loss[loss=0.1527, simple_loss=0.2456, pruned_loss=0.02988, over 4503526.93 frames. 
], batch size: 62, lr: 4.15e-03, grad_scale: 16.0 2023-12-04 16:12:54,671 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=349733.3333333333, ans=0.2 2023-12-04 16:13:09,560 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=349866.6666666667, ans=0.2 2023-12-04 16:13:28,810 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.073e+02 1.263e+02 1.357e+02 1.457e+02 1.883e+02, threshold=2.715e+02, percent-clipped=0.0 2023-12-04 16:13:33,755 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=349933.3333333333, ans=0.125 2023-12-04 16:13:45,457 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=350000.0, ans=0.0 2023-12-04 16:13:52,088 INFO [train.py:1087] (1/4) Epoch 59, batch 600, loss[loss=0.1598, simple_loss=0.2505, pruned_loss=0.03456, over 24017.00 frames. ], tot_loss[loss=0.1528, simple_loss=0.2458, pruned_loss=0.02993, over 4570590.44 frames. ], batch size: 87, lr: 4.15e-03, grad_scale: 16.0 2023-12-04 16:13:52,693 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.03 vs. limit=22.5 2023-12-04 16:14:10,637 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=350133.3333333333, ans=0.1 2023-12-04 16:14:32,179 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=350266.6666666667, ans=0.0 2023-12-04 16:14:44,685 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.80 vs. limit=15.0 2023-12-04 16:14:53,950 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=350333.3333333333, ans=0.0 2023-12-04 16:15:01,660 INFO [train.py:1087] (1/4) Epoch 59, batch 650, loss[loss=0.1365, simple_loss=0.229, pruned_loss=0.02194, over 24734.00 frames. ], tot_loss[loss=0.1525, simple_loss=0.2454, pruned_loss=0.02977, over 4622331.00 frames. ], batch size: 67, lr: 4.15e-03, grad_scale: 16.0 2023-12-04 16:15:35,028 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.32 vs. limit=15.0 2023-12-04 16:15:35,813 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=350533.3333333333, ans=0.0 2023-12-04 16:15:36,234 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.69 vs. limit=15.0 2023-12-04 16:15:47,221 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.096e+02 1.290e+02 1.349e+02 1.494e+02 3.039e+02, threshold=2.697e+02, percent-clipped=1.0 2023-12-04 16:15:56,442 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.12 vs. 
limit=15.0 2023-12-04 16:15:59,548 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=350666.6666666667, ans=0.0 2023-12-04 16:16:08,436 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=350666.6666666667, ans=0.0 2023-12-04 16:16:12,220 INFO [train.py:1087] (1/4) Epoch 59, batch 700, loss[loss=0.1499, simple_loss=0.2421, pruned_loss=0.0288, over 24725.00 frames. ], tot_loss[loss=0.1522, simple_loss=0.2452, pruned_loss=0.0296, over 4669068.21 frames. ], batch size: 67, lr: 4.15e-03, grad_scale: 16.0 2023-12-04 16:16:36,990 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=350800.0, ans=0.0 2023-12-04 16:16:47,688 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=350866.6666666667, ans=0.125 2023-12-04 16:17:20,279 INFO [train.py:1087] (1/4) Epoch 59, batch 750, loss[loss=0.1507, simple_loss=0.2442, pruned_loss=0.02856, over 24755.00 frames. ], tot_loss[loss=0.152, simple_loss=0.245, pruned_loss=0.02947, over 4711427.98 frames. ], batch size: 66, lr: 4.15e-03, grad_scale: 16.0 2023-12-04 16:17:20,635 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=351066.6666666667, ans=0.0 2023-12-04 16:17:27,052 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=351066.6666666667, ans=0.1 2023-12-04 16:17:39,885 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=351133.3333333333, ans=0.2 2023-12-04 16:17:49,154 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=351200.0, ans=0.125 2023-12-04 16:18:01,502 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=351266.6666666667, ans=0.2 2023-12-04 16:18:06,839 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.102e+02 1.253e+02 1.352e+02 1.498e+02 1.873e+02, threshold=2.703e+02, percent-clipped=0.0 2023-12-04 16:18:22,957 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=351333.3333333333, ans=0.0 2023-12-04 16:18:29,374 INFO [train.py:1087] (1/4) Epoch 59, batch 800, loss[loss=0.1451, simple_loss=0.2341, pruned_loss=0.02804, over 24764.00 frames. ], tot_loss[loss=0.1519, simple_loss=0.2448, pruned_loss=0.02951, over 4745825.75 frames. ], batch size: 65, lr: 4.15e-03, grad_scale: 32.0 2023-12-04 16:18:32,170 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=351400.0, ans=0.125 2023-12-04 16:18:43,429 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=351466.6666666667, ans=0.125 2023-12-04 16:19:01,609 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.54 vs. 
limit=22.5 2023-12-04 16:19:14,666 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=351600.0, ans=0.5 2023-12-04 16:19:17,137 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=351600.0, ans=0.125 2023-12-04 16:19:20,025 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.98 vs. limit=12.0 2023-12-04 16:19:32,095 INFO [train.py:1087] (1/4) Epoch 59, batch 850, loss[loss=0.1615, simple_loss=0.2535, pruned_loss=0.03479, over 22839.00 frames. ], tot_loss[loss=0.1517, simple_loss=0.2448, pruned_loss=0.02933, over 4763612.40 frames. ], batch size: 106, lr: 4.14e-03, grad_scale: 16.0 2023-12-04 16:19:40,904 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-12-04 16:19:50,061 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=351800.0, ans=0.07 2023-12-04 16:19:50,119 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=351800.0, ans=0.0 2023-12-04 16:19:51,237 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 16:20:05,965 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=351866.6666666667, ans=0.125 2023-12-04 16:20:12,963 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.060e+02 1.281e+02 1.392e+02 1.518e+02 2.446e+02, threshold=2.783e+02, percent-clipped=0.0 2023-12-04 16:20:16,966 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=351933.3333333333, ans=0.07 2023-12-04 16:20:42,999 INFO [train.py:1087] (1/4) Epoch 60, batch 0, loss[loss=0.1546, simple_loss=0.2469, pruned_loss=0.03118, over 22745.00 frames. ], tot_loss[loss=0.1546, simple_loss=0.2469, pruned_loss=0.03118, over 22745.00 frames. ], batch size: 106, lr: 4.11e-03, grad_scale: 32.0 2023-12-04 16:20:43,002 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 16:20:59,314 INFO [train.py:1119] (1/4) Epoch 60, validation: loss=0.1512, simple_loss=0.2484, pruned_loss=0.027, over 944034.00 frames. 
2023-12-04 16:20:59,315 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 16:21:39,463 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=352233.3333333333, ans=0.0 2023-12-04 16:21:45,907 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=352233.3333333333, ans=0.0 2023-12-04 16:21:47,259 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=352233.3333333333, ans=0.0 2023-12-04 16:21:47,293 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=352233.3333333333, ans=0.125 2023-12-04 16:21:47,342 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=352233.3333333333, ans=0.125 2023-12-04 16:22:08,451 INFO [train.py:1087] (1/4) Epoch 60, batch 50, loss[loss=0.1761, simple_loss=0.2628, pruned_loss=0.0447, over 16985.00 frames. ], tot_loss[loss=0.1554, simple_loss=0.2479, pruned_loss=0.0314, over 1068486.05 frames. ], batch size: 177, lr: 4.10e-03, grad_scale: 32.0 2023-12-04 16:22:11,514 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=352366.6666666667, ans=0.125 2023-12-04 16:22:14,212 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=352366.6666666667, ans=0.125 2023-12-04 16:22:14,838 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.32 vs. limit=6.0 2023-12-04 16:22:24,674 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=352433.3333333333, ans=0.1 2023-12-04 16:22:32,590 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=352433.3333333333, ans=0.0 2023-12-04 16:22:38,202 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=352500.0, ans=0.1 2023-12-04 16:23:02,007 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.142e+02 1.270e+02 1.378e+02 1.472e+02 2.466e+02, threshold=2.756e+02, percent-clipped=0.0 2023-12-04 16:23:17,457 INFO [train.py:1087] (1/4) Epoch 60, batch 100, loss[loss=0.1536, simple_loss=0.2543, pruned_loss=0.02642, over 21215.00 frames. ], tot_loss[loss=0.1535, simple_loss=0.2465, pruned_loss=0.03027, over 1895085.46 frames. 
], batch size: 127, lr: 4.10e-03, grad_scale: 32.0 2023-12-04 16:23:19,127 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=352700.0, ans=0.2 2023-12-04 16:23:30,425 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=352766.6666666667, ans=0.125 2023-12-04 16:23:31,580 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=352766.6666666667, ans=0.05 2023-12-04 16:23:44,274 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=352833.3333333333, ans=0.0 2023-12-04 16:23:44,948 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.06 vs. limit=10.0 2023-12-04 16:24:03,870 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.25 vs. limit=15.0 2023-12-04 16:24:25,940 INFO [train.py:1087] (1/4) Epoch 60, batch 150, loss[loss=0.1668, simple_loss=0.2599, pruned_loss=0.03686, over 22982.00 frames. ], tot_loss[loss=0.1532, simple_loss=0.2464, pruned_loss=0.03005, over 2546206.86 frames. ], batch size: 106, lr: 4.10e-03, grad_scale: 16.0 2023-12-04 16:24:32,009 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=353033.3333333333, ans=0.125 2023-12-04 16:24:35,149 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=353033.3333333333, ans=0.0 2023-12-04 16:24:49,061 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=353100.0, ans=0.125 2023-12-04 16:25:22,243 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.071e+02 1.281e+02 1.355e+02 1.539e+02 2.129e+02, threshold=2.710e+02, percent-clipped=0.0 2023-12-04 16:25:36,395 INFO [train.py:1087] (1/4) Epoch 60, batch 200, loss[loss=0.1458, simple_loss=0.2391, pruned_loss=0.02631, over 24759.00 frames. ], tot_loss[loss=0.1528, simple_loss=0.2457, pruned_loss=0.02992, over 3042359.68 frames. ], batch size: 66, lr: 4.10e-03, grad_scale: 16.0 2023-12-04 16:25:46,287 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-12-04 16:26:44,647 INFO [train.py:1087] (1/4) Epoch 60, batch 250, loss[loss=0.147, simple_loss=0.2389, pruned_loss=0.02756, over 24541.00 frames. ], tot_loss[loss=0.1516, simple_loss=0.2447, pruned_loss=0.02925, over 3451823.94 frames. ], batch size: 62, lr: 4.10e-03, grad_scale: 16.0 2023-12-04 16:27:39,687 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.54 vs. limit=15.0 2023-12-04 16:27:40,359 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.098e+02 1.267e+02 1.344e+02 1.454e+02 1.870e+02, threshold=2.689e+02, percent-clipped=0.0 2023-12-04 16:27:54,526 INFO [train.py:1087] (1/4) Epoch 60, batch 300, loss[loss=0.1533, simple_loss=0.2475, pruned_loss=0.02949, over 24806.00 frames. ], tot_loss[loss=0.1515, simple_loss=0.2445, pruned_loss=0.02921, over 3747269.63 frames. 
], batch size: 62, lr: 4.09e-03, grad_scale: 16.0 2023-12-04 16:28:00,171 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=354033.3333333333, ans=0.0 2023-12-04 16:28:52,031 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=354300.0, ans=0.125 2023-12-04 16:28:54,627 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=354300.0, ans=0.1 2023-12-04 16:28:57,370 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=354300.0, ans=0.125 2023-12-04 16:29:02,436 INFO [train.py:1087] (1/4) Epoch 60, batch 350, loss[loss=0.1551, simple_loss=0.2442, pruned_loss=0.03299, over 24801.00 frames. ], tot_loss[loss=0.1517, simple_loss=0.2447, pruned_loss=0.02935, over 3993233.75 frames. ], batch size: 72, lr: 4.09e-03, grad_scale: 16.0 2023-12-04 16:29:05,749 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.11 vs. limit=12.0 2023-12-04 16:29:12,979 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=354366.6666666667, ans=0.125 2023-12-04 16:29:16,928 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=354433.3333333333, ans=0.125 2023-12-04 16:29:36,369 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 16:29:42,433 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=354566.6666666667, ans=0.125 2023-12-04 16:29:43,742 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 16:29:57,650 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.098e+02 1.285e+02 1.349e+02 1.453e+02 2.187e+02, threshold=2.698e+02, percent-clipped=0.0 2023-12-04 16:30:11,208 INFO [train.py:1087] (1/4) Epoch 60, batch 400, loss[loss=0.1546, simple_loss=0.2461, pruned_loss=0.03152, over 24550.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.245, pruned_loss=0.02957, over 4179441.23 frames. ], batch size: 63, lr: 4.09e-03, grad_scale: 32.0 2023-12-04 16:30:43,328 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=354833.3333333333, ans=0.125 2023-12-04 16:30:43,735 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.35 vs. limit=15.0 2023-12-04 16:30:51,481 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.30 vs. limit=22.5 2023-12-04 16:31:19,408 INFO [train.py:1087] (1/4) Epoch 60, batch 450, loss[loss=0.1513, simple_loss=0.2469, pruned_loss=0.0279, over 24114.00 frames. ], tot_loss[loss=0.1524, simple_loss=0.2453, pruned_loss=0.02974, over 4301487.15 frames. 
], batch size: 58, lr: 4.09e-03, grad_scale: 32.0 2023-12-04 16:31:28,805 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=355033.3333333333, ans=0.125 2023-12-04 16:31:46,109 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=355166.6666666667, ans=0.125 2023-12-04 16:31:52,656 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=355166.6666666667, ans=0.125 2023-12-04 16:32:13,259 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.147e+02 1.286e+02 1.410e+02 1.537e+02 2.012e+02, threshold=2.820e+02, percent-clipped=0.0 2023-12-04 16:32:28,602 INFO [train.py:1087] (1/4) Epoch 60, batch 500, loss[loss=0.1505, simple_loss=0.2451, pruned_loss=0.0279, over 24793.00 frames. ], tot_loss[loss=0.1523, simple_loss=0.2453, pruned_loss=0.02968, over 4404452.03 frames. ], batch size: 72, lr: 4.09e-03, grad_scale: 32.0 2023-12-04 16:32:44,864 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=355433.3333333333, ans=0.125 2023-12-04 16:32:45,117 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.17 vs. limit=15.0 2023-12-04 16:33:13,694 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=355566.6666666667, ans=0.125 2023-12-04 16:33:18,092 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=10.35 vs. limit=15.0 2023-12-04 16:33:18,765 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=355566.6666666667, ans=0.125 2023-12-04 16:33:20,413 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 16:33:28,112 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=355633.3333333333, ans=0.125 2023-12-04 16:33:37,222 INFO [train.py:1087] (1/4) Epoch 60, batch 550, loss[loss=0.1557, simple_loss=0.2494, pruned_loss=0.03102, over 21747.00 frames. ], tot_loss[loss=0.1524, simple_loss=0.2453, pruned_loss=0.02979, over 4481440.79 frames. ], batch size: 127, lr: 4.09e-03, grad_scale: 32.0 2023-12-04 16:33:51,637 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=355766.6666666667, ans=0.025 2023-12-04 16:34:01,861 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=355766.6666666667, ans=0.2 2023-12-04 16:34:18,963 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.05 vs. 
limit=15.0 2023-12-04 16:34:31,613 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=355966.6666666667, ans=15.0 2023-12-04 16:34:32,287 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.128e+02 1.271e+02 1.366e+02 1.477e+02 1.822e+02, threshold=2.733e+02, percent-clipped=0.0 2023-12-04 16:34:44,931 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=356033.3333333333, ans=0.09899494936611666 2023-12-04 16:34:45,839 INFO [train.py:1087] (1/4) Epoch 60, batch 600, loss[loss=0.1495, simple_loss=0.2424, pruned_loss=0.02825, over 24710.00 frames. ], tot_loss[loss=0.1523, simple_loss=0.2451, pruned_loss=0.02974, over 4566754.17 frames. ], batch size: 69, lr: 4.08e-03, grad_scale: 32.0 2023-12-04 16:34:59,892 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=356100.0, ans=0.125 2023-12-04 16:35:17,245 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=356166.6666666667, ans=0.2 2023-12-04 16:35:41,635 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=356300.0, ans=0.1 2023-12-04 16:35:44,035 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=356300.0, ans=0.125 2023-12-04 16:35:46,309 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.23 vs. limit=15.0 2023-12-04 16:35:56,922 INFO [train.py:1087] (1/4) Epoch 60, batch 650, loss[loss=0.1438, simple_loss=0.2427, pruned_loss=0.02244, over 24804.00 frames. ], tot_loss[loss=0.152, simple_loss=0.2448, pruned_loss=0.02955, over 4623032.66 frames. ], batch size: 72, lr: 4.08e-03, grad_scale: 32.0 2023-12-04 16:36:27,393 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.65 vs. limit=15.0 2023-12-04 16:36:41,217 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=356566.6666666667, ans=0.125 2023-12-04 16:36:50,189 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=356566.6666666667, ans=0.2 2023-12-04 16:36:52,436 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.038e+02 1.271e+02 1.373e+02 1.481e+02 1.948e+02, threshold=2.747e+02, percent-clipped=0.0 2023-12-04 16:36:59,115 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.25 vs. limit=15.0 2023-12-04 16:37:05,031 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.70 vs. limit=15.0 2023-12-04 16:37:07,562 INFO [train.py:1087] (1/4) Epoch 60, batch 700, loss[loss=0.1411, simple_loss=0.2349, pruned_loss=0.02369, over 24549.00 frames. ], tot_loss[loss=0.1515, simple_loss=0.2445, pruned_loss=0.02925, over 4667847.74 frames. ], batch size: 63, lr: 4.08e-03, grad_scale: 32.0 2023-12-04 16:37:22,066 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.55 vs. 
limit=12.0 2023-12-04 16:37:45,194 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.28 vs. limit=10.0 2023-12-04 16:37:54,659 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=356900.0, ans=0.125 2023-12-04 16:38:01,108 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.73 vs. limit=22.5 2023-12-04 16:38:15,938 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.92 vs. limit=10.0 2023-12-04 16:38:16,509 INFO [train.py:1087] (1/4) Epoch 60, batch 750, loss[loss=0.164, simple_loss=0.2547, pruned_loss=0.03668, over 23588.00 frames. ], tot_loss[loss=0.1514, simple_loss=0.2444, pruned_loss=0.02923, over 4694807.68 frames. ], batch size: 94, lr: 4.08e-03, grad_scale: 32.0 2023-12-04 16:38:36,077 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=357100.0, ans=0.125 2023-12-04 16:38:44,687 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=357166.6666666667, ans=0.5 2023-12-04 16:38:50,105 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 16:39:09,331 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=357233.3333333333, ans=0.0 2023-12-04 16:39:12,771 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.101e+02 1.281e+02 1.351e+02 1.479e+02 1.994e+02, threshold=2.703e+02, percent-clipped=0.0 2023-12-04 16:39:14,988 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.31 vs. limit=15.0 2023-12-04 16:39:26,555 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=357366.6666666667, ans=0.2 2023-12-04 16:39:27,500 INFO [train.py:1087] (1/4) Epoch 60, batch 800, loss[loss=0.1531, simple_loss=0.2444, pruned_loss=0.03092, over 24759.00 frames. ], tot_loss[loss=0.1518, simple_loss=0.2447, pruned_loss=0.0295, over 4710423.73 frames. ], batch size: 65, lr: 4.08e-03, grad_scale: 32.0 2023-12-04 16:39:53,073 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=357500.0, ans=0.125 2023-12-04 16:39:55,533 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=357500.0, ans=0.1 2023-12-04 16:40:10,167 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=357566.6666666667, ans=0.2 2023-12-04 16:40:25,041 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.20 vs. limit=22.5 2023-12-04 16:40:30,187 INFO [train.py:1087] (1/4) Epoch 60, batch 850, loss[loss=0.1385, simple_loss=0.2291, pruned_loss=0.02394, over 24578.00 frames. ], tot_loss[loss=0.152, simple_loss=0.245, pruned_loss=0.02954, over 4725076.12 frames. 
], batch size: 65, lr: 4.07e-03, grad_scale: 32.0 2023-12-04 16:40:33,905 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=357700.0, ans=0.125 2023-12-04 16:40:42,621 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=357766.6666666667, ans=0.2 2023-12-04 16:40:43,917 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=357766.6666666667, ans=0.0 2023-12-04 16:40:55,247 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys.whitening_limit, batch_count=357833.3333333333, ans=6.0 2023-12-04 16:40:55,428 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.92 vs. limit=15.0 2023-12-04 16:41:00,808 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=357833.3333333333, ans=0.0 2023-12-04 16:41:10,237 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=357900.0, ans=0.125 2023-12-04 16:41:16,879 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.27 vs. limit=15.0 2023-12-04 16:41:18,300 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.077e+02 1.265e+02 1.348e+02 1.477e+02 2.020e+02, threshold=2.695e+02, percent-clipped=0.0 2023-12-04 16:41:18,538 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=357966.6666666667, ans=0.0 2023-12-04 16:41:45,581 INFO [train.py:1087] (1/4) Epoch 61, batch 0, loss[loss=0.1406, simple_loss=0.2365, pruned_loss=0.0223, over 24706.00 frames. ], tot_loss[loss=0.1406, simple_loss=0.2365, pruned_loss=0.0223, over 24706.00 frames. ], batch size: 74, lr: 4.04e-03, grad_scale: 32.0 2023-12-04 16:41:45,582 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 16:42:02,321 INFO [train.py:1119] (1/4) Epoch 61, validation: loss=0.1508, simple_loss=0.248, pruned_loss=0.0268, over 944034.00 frames. 2023-12-04 16:42:02,322 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 16:42:18,308 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=358066.6666666667, ans=0.1 2023-12-04 16:42:52,695 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=358200.0, ans=0.125 2023-12-04 16:42:56,514 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=358266.6666666667, ans=0.0 2023-12-04 16:43:12,538 INFO [train.py:1087] (1/4) Epoch 61, batch 50, loss[loss=0.1653, simple_loss=0.2565, pruned_loss=0.03711, over 22796.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.2445, pruned_loss=0.02983, over 1081422.18 frames. 
], batch size: 106, lr: 4.04e-03, grad_scale: 32.0 2023-12-04 16:43:14,266 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=358333.3333333333, ans=0.125 2023-12-04 16:43:15,730 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=358333.3333333333, ans=0.125 2023-12-04 16:43:23,521 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.86 vs. limit=15.0 2023-12-04 16:43:38,855 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=358466.6666666667, ans=0.125 2023-12-04 16:43:52,942 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=358533.3333333333, ans=0.1 2023-12-04 16:44:13,608 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.089e+02 1.266e+02 1.367e+02 1.459e+02 2.673e+02, threshold=2.733e+02, percent-clipped=0.0 2023-12-04 16:44:17,071 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.37 vs. limit=6.0 2023-12-04 16:44:20,741 INFO [train.py:1087] (1/4) Epoch 61, batch 100, loss[loss=0.1521, simple_loss=0.2455, pruned_loss=0.02932, over 24787.00 frames. ], tot_loss[loss=0.151, simple_loss=0.2439, pruned_loss=0.02904, over 1920038.46 frames. ], batch size: 62, lr: 4.04e-03, grad_scale: 32.0 2023-12-04 16:44:35,388 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=358733.3333333333, ans=0.125 2023-12-04 16:44:51,763 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=358800.0, ans=0.0 2023-12-04 16:44:53,547 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.65 vs. limit=15.0 2023-12-04 16:44:58,689 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=358800.0, ans=0.0 2023-12-04 16:45:30,334 INFO [train.py:1087] (1/4) Epoch 61, batch 150, loss[loss=0.1521, simple_loss=0.2479, pruned_loss=0.02816, over 24703.00 frames. ], tot_loss[loss=0.1514, simple_loss=0.2444, pruned_loss=0.02918, over 2549611.25 frames. ], batch size: 69, lr: 4.03e-03, grad_scale: 32.0 2023-12-04 16:45:33,378 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=359000.0, ans=0.09899494936611666 2023-12-04 16:45:34,856 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=359000.0, ans=0.125 2023-12-04 16:45:37,985 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.30 vs. 
limit=22.5 2023-12-04 16:46:03,203 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=359133.3333333333, ans=0.0 2023-12-04 16:46:13,888 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=359200.0, ans=0.2 2023-12-04 16:46:33,180 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.104e+02 1.244e+02 1.317e+02 1.412e+02 1.866e+02, threshold=2.633e+02, percent-clipped=0.0 2023-12-04 16:46:39,783 INFO [train.py:1087] (1/4) Epoch 61, batch 200, loss[loss=0.1638, simple_loss=0.2515, pruned_loss=0.03805, over 24484.00 frames. ], tot_loss[loss=0.1515, simple_loss=0.2444, pruned_loss=0.02926, over 3046063.27 frames. ], batch size: 75, lr: 4.03e-03, grad_scale: 32.0 2023-12-04 16:47:28,681 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=359533.3333333333, ans=0.125 2023-12-04 16:47:38,954 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=359600.0, ans=0.125 2023-12-04 16:47:49,383 INFO [train.py:1087] (1/4) Epoch 61, batch 250, loss[loss=0.1511, simple_loss=0.2423, pruned_loss=0.02991, over 24779.00 frames. ], tot_loss[loss=0.1515, simple_loss=0.2442, pruned_loss=0.02942, over 3445742.18 frames. ], batch size: 70, lr: 4.03e-03, grad_scale: 32.0 2023-12-04 16:47:54,133 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.05 vs. limit=22.5 2023-12-04 16:48:17,188 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=359800.0, ans=0.125 2023-12-04 16:48:18,873 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=359800.0, ans=0.125 2023-12-04 16:48:35,231 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=359866.6666666667, ans=0.125 2023-12-04 16:48:37,765 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=359866.6666666667, ans=0.125 2023-12-04 16:48:51,444 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.112e+02 1.284e+02 1.364e+02 1.504e+02 1.772e+02, threshold=2.728e+02, percent-clipped=0.0 2023-12-04 16:48:56,557 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=359933.3333333333, ans=0.2 2023-12-04 16:48:59,327 INFO [train.py:1087] (1/4) Epoch 61, batch 300, loss[loss=0.1562, simple_loss=0.2483, pruned_loss=0.032, over 24521.00 frames. ], tot_loss[loss=0.1516, simple_loss=0.2444, pruned_loss=0.02943, over 3747313.62 frames. ], batch size: 77, lr: 4.03e-03, grad_scale: 32.0 2023-12-04 16:49:17,684 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=360066.6666666667, ans=0.1 2023-12-04 16:50:07,507 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=360266.6666666667, ans=0.0 2023-12-04 16:50:07,980 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.08 vs. 
limit=15.0 2023-12-04 16:50:09,572 INFO [train.py:1087] (1/4) Epoch 61, batch 350, loss[loss=0.1574, simple_loss=0.2513, pruned_loss=0.03173, over 24137.00 frames. ], tot_loss[loss=0.1517, simple_loss=0.2447, pruned_loss=0.02938, over 3985932.17 frames. ], batch size: 58, lr: 4.03e-03, grad_scale: 16.0 2023-12-04 16:50:17,112 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=360333.3333333333, ans=0.0 2023-12-04 16:50:22,582 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-12-04 16:50:30,708 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=360400.0, ans=0.0 2023-12-04 16:50:38,364 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=360466.6666666667, ans=0.0 2023-12-04 16:51:14,391 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.133e+02 1.266e+02 1.361e+02 1.516e+02 2.406e+02, threshold=2.722e+02, percent-clipped=0.0 2023-12-04 16:51:19,702 INFO [train.py:1087] (1/4) Epoch 61, batch 400, loss[loss=0.1538, simple_loss=0.2454, pruned_loss=0.03112, over 24740.00 frames. ], tot_loss[loss=0.1515, simple_loss=0.2444, pruned_loss=0.02929, over 4175176.02 frames. ], batch size: 61, lr: 4.02e-03, grad_scale: 32.0 2023-12-04 16:51:42,848 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=360733.3333333333, ans=0.125 2023-12-04 16:51:52,617 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=360800.0, ans=0.1 2023-12-04 16:51:58,119 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=360800.0, ans=0.2 2023-12-04 16:52:00,517 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=360866.6666666667, ans=0.125 2023-12-04 16:52:29,220 INFO [train.py:1087] (1/4) Epoch 61, batch 450, loss[loss=0.1545, simple_loss=0.2543, pruned_loss=0.02739, over 21052.00 frames. ], tot_loss[loss=0.1516, simple_loss=0.2446, pruned_loss=0.02932, over 4316006.60 frames. ], batch size: 127, lr: 4.02e-03, grad_scale: 16.0 2023-12-04 16:52:32,498 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.33 vs. 
limit=12.0 2023-12-04 16:52:37,406 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=361000.0, ans=0.125 2023-12-04 16:52:44,134 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=361066.6666666667, ans=0.0 2023-12-04 16:52:55,737 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=361133.3333333333, ans=0.1 2023-12-04 16:53:19,527 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=361200.0, ans=0.125 2023-12-04 16:53:34,669 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.144e+02 1.242e+02 1.346e+02 1.491e+02 2.478e+02, threshold=2.691e+02, percent-clipped=0.0 2023-12-04 16:53:40,304 INFO [train.py:1087] (1/4) Epoch 61, batch 500, loss[loss=0.1453, simple_loss=0.2398, pruned_loss=0.02542, over 24848.00 frames. ], tot_loss[loss=0.1517, simple_loss=0.2447, pruned_loss=0.02933, over 4411545.28 frames. ], batch size: 68, lr: 4.02e-03, grad_scale: 16.0 2023-12-04 16:53:44,555 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.72 vs. limit=10.0 2023-12-04 16:53:51,811 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=361333.3333333333, ans=0.0 2023-12-04 16:54:05,221 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=361400.0, ans=0.0 2023-12-04 16:54:09,034 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=361466.6666666667, ans=0.0 2023-12-04 16:54:41,097 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.65 vs. limit=15.0 2023-12-04 16:54:43,616 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=361600.0, ans=0.0 2023-12-04 16:54:47,922 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-12-04 16:54:49,777 INFO [train.py:1087] (1/4) Epoch 61, batch 550, loss[loss=0.1411, simple_loss=0.2338, pruned_loss=0.02419, over 24566.00 frames. ], tot_loss[loss=0.1517, simple_loss=0.2446, pruned_loss=0.02942, over 4477951.86 frames. ], batch size: 64, lr: 4.02e-03, grad_scale: 16.0 2023-12-04 16:54:56,082 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=361666.6666666667, ans=0.1 2023-12-04 16:54:58,243 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.13 vs. 
limit=15.0 2023-12-04 16:55:03,154 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=361666.6666666667, ans=0.1 2023-12-04 16:55:05,805 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=361733.3333333333, ans=0.125 2023-12-04 16:55:14,963 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=361733.3333333333, ans=0.0 2023-12-04 16:55:27,000 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.94 vs. limit=22.5 2023-12-04 16:55:47,327 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=361933.3333333333, ans=0.125 2023-12-04 16:55:55,315 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=361933.3333333333, ans=0.0 2023-12-04 16:55:57,441 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.074e+02 1.276e+02 1.374e+02 1.486e+02 2.079e+02, threshold=2.748e+02, percent-clipped=0.0 2023-12-04 16:56:00,080 INFO [train.py:1087] (1/4) Epoch 61, batch 600, loss[loss=0.166, simple_loss=0.2551, pruned_loss=0.03842, over 24329.00 frames. ], tot_loss[loss=0.152, simple_loss=0.2448, pruned_loss=0.02955, over 4552366.32 frames. ], batch size: 79, lr: 4.02e-03, grad_scale: 8.0 2023-12-04 16:56:49,890 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=362200.0, ans=0.125 2023-12-04 16:57:09,904 INFO [train.py:1087] (1/4) Epoch 61, batch 650, loss[loss=0.15, simple_loss=0.2453, pruned_loss=0.0274, over 24610.00 frames. ], tot_loss[loss=0.1515, simple_loss=0.2444, pruned_loss=0.02926, over 4617944.59 frames. ], batch size: 68, lr: 4.01e-03, grad_scale: 8.0 2023-12-04 16:57:10,370 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=362333.3333333333, ans=0.0 2023-12-04 16:57:24,932 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=362400.0, ans=0.125 2023-12-04 16:57:46,267 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.30 vs. limit=12.0 2023-12-04 16:58:02,786 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.75 vs. limit=15.0 2023-12-04 16:58:16,590 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.291e+02 1.410e+02 1.493e+02 2.580e+02, threshold=2.821e+02, percent-clipped=0.0 2023-12-04 16:58:19,700 INFO [train.py:1087] (1/4) Epoch 61, batch 700, loss[loss=0.1798, simple_loss=0.2649, pruned_loss=0.04732, over 16318.00 frames. ], tot_loss[loss=0.1517, simple_loss=0.2446, pruned_loss=0.02941, over 4653702.15 frames. 
], batch size: 177, lr: 4.01e-03, grad_scale: 8.0 2023-12-04 16:58:21,531 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=362666.6666666667, ans=0.125 2023-12-04 16:58:22,812 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=362666.6666666667, ans=0.0 2023-12-04 16:58:24,491 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.48 vs. limit=15.0 2023-12-04 16:58:50,077 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 16:58:52,091 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=362800.0, ans=0.0 2023-12-04 16:58:58,987 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-12-04 16:59:15,334 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=362933.3333333333, ans=0.0 2023-12-04 16:59:20,628 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=362933.3333333333, ans=0.125 2023-12-04 16:59:30,035 INFO [train.py:1087] (1/4) Epoch 61, batch 750, loss[loss=0.1718, simple_loss=0.2632, pruned_loss=0.04024, over 22752.00 frames. ], tot_loss[loss=0.152, simple_loss=0.2449, pruned_loss=0.02951, over 4660328.07 frames. ], batch size: 106, lr: 4.01e-03, grad_scale: 8.0 2023-12-04 16:59:31,065 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-12-04 16:59:40,160 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=363000.0, ans=0.0 2023-12-04 16:59:55,442 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=363066.6666666667, ans=0.1 2023-12-04 16:59:59,257 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=363133.3333333333, ans=0.125 2023-12-04 17:00:32,704 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=363266.6666666667, ans=0.125 2023-12-04 17:00:34,338 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=363266.6666666667, ans=0.0 2023-12-04 17:00:36,602 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.146e+02 1.303e+02 1.435e+02 1.618e+02 2.179e+02, threshold=2.869e+02, percent-clipped=0.0 2023-12-04 17:00:39,240 INFO [train.py:1087] (1/4) Epoch 61, batch 800, loss[loss=0.1563, simple_loss=0.2522, pruned_loss=0.03019, over 24058.00 frames. ], tot_loss[loss=0.1517, simple_loss=0.2447, pruned_loss=0.02939, over 4700102.50 frames. 
], batch size: 87, lr: 4.01e-03, grad_scale: 16.0 2023-12-04 17:00:40,816 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=363333.3333333333, ans=0.035 2023-12-04 17:00:43,827 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=363333.3333333333, ans=0.07 2023-12-04 17:00:53,566 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.09 vs. limit=15.0 2023-12-04 17:01:22,269 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.66 vs. limit=12.0 2023-12-04 17:01:28,864 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=363600.0, ans=0.125 2023-12-04 17:01:33,698 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=363600.0, ans=0.07 2023-12-04 17:01:41,739 INFO [train.py:1087] (1/4) Epoch 61, batch 850, loss[loss=0.1614, simple_loss=0.2503, pruned_loss=0.03629, over 24504.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.245, pruned_loss=0.02955, over 4717686.23 frames. ], batch size: 75, lr: 4.01e-03, grad_scale: 16.0 2023-12-04 17:01:43,641 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.44 vs. limit=15.0 2023-12-04 17:02:14,465 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=363800.0, ans=0.125 2023-12-04 17:02:21,780 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=363866.6666666667, ans=0.125 2023-12-04 17:02:29,018 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=363866.6666666667, ans=0.1 2023-12-04 17:02:48,870 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=363966.6666666667, ans=0.125 2023-12-04 17:03:02,154 INFO [train.py:1087] (1/4) Epoch 62, batch 0, loss[loss=0.1494, simple_loss=0.2454, pruned_loss=0.02668, over 24227.00 frames. ], tot_loss[loss=0.1494, simple_loss=0.2454, pruned_loss=0.02668, over 24227.00 frames. ], batch size: 82, lr: 3.97e-03, grad_scale: 32.0 2023-12-04 17:03:02,155 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 17:03:18,750 INFO [train.py:1119] (1/4) Epoch 62, validation: loss=0.1507, simple_loss=0.2477, pruned_loss=0.02683, over 944034.00 frames. 2023-12-04 17:03:18,751 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 17:03:22,805 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.085e+02 1.310e+02 1.390e+02 1.540e+02 2.512e+02, threshold=2.781e+02, percent-clipped=0.0 2023-12-04 17:03:34,409 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=364033.3333333333, ans=0.0 2023-12-04 17:03:39,199 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.16 vs. 
limit=22.5 2023-12-04 17:03:48,354 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=364100.0, ans=0.07 2023-12-04 17:03:54,336 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=364100.0, ans=0.125 2023-12-04 17:04:01,525 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=364166.6666666667, ans=0.0 2023-12-04 17:04:16,117 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=364233.3333333333, ans=0.0 2023-12-04 17:04:27,693 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=364300.0, ans=0.0 2023-12-04 17:04:28,807 INFO [train.py:1087] (1/4) Epoch 62, batch 50, loss[loss=0.1588, simple_loss=0.2527, pruned_loss=0.0324, over 21419.00 frames. ], tot_loss[loss=0.154, simple_loss=0.2474, pruned_loss=0.03026, over 1072924.37 frames. ], batch size: 127, lr: 3.97e-03, grad_scale: 32.0 2023-12-04 17:04:35,276 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=2.98 vs. limit=12.0 2023-12-04 17:04:36,207 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=364300.0, ans=0.0 2023-12-04 17:05:11,891 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=364500.0, ans=0.09899494936611666 2023-12-04 17:05:35,415 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=364566.6666666667, ans=0.125 2023-12-04 17:05:38,095 INFO [train.py:1087] (1/4) Epoch 62, batch 100, loss[loss=0.1539, simple_loss=0.2464, pruned_loss=0.03068, over 24282.00 frames. ], tot_loss[loss=0.152, simple_loss=0.2454, pruned_loss=0.0293, over 1912514.91 frames. ], batch size: 79, lr: 3.97e-03, grad_scale: 32.0 2023-12-04 17:05:42,782 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.257e+02 1.334e+02 1.479e+02 2.248e+02, threshold=2.667e+02, percent-clipped=0.0 2023-12-04 17:05:44,536 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=364633.3333333333, ans=0.09899494936611666 2023-12-04 17:06:13,687 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0 2023-12-04 17:06:17,569 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=364766.6666666667, ans=0.0 2023-12-04 17:06:31,979 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=364833.3333333333, ans=0.125 2023-12-04 17:06:32,499 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.26 vs. 
limit=15.0 2023-12-04 17:06:37,328 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=364900.0, ans=0.125 2023-12-04 17:06:44,446 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=364900.0, ans=0.0 2023-12-04 17:06:48,100 INFO [train.py:1087] (1/4) Epoch 62, batch 150, loss[loss=0.141, simple_loss=0.2357, pruned_loss=0.02313, over 24795.00 frames. ], tot_loss[loss=0.1524, simple_loss=0.2456, pruned_loss=0.02967, over 2542035.10 frames. ], batch size: 73, lr: 3.97e-03, grad_scale: 32.0 2023-12-04 17:07:04,916 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=365033.3333333333, ans=0.125 2023-12-04 17:07:15,215 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 17:07:23,489 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=365100.0, ans=0.125 2023-12-04 17:07:33,344 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=365166.6666666667, ans=0.125 2023-12-04 17:07:40,576 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=365166.6666666667, ans=0.1 2023-12-04 17:07:58,194 INFO [train.py:1087] (1/4) Epoch 62, batch 200, loss[loss=0.1502, simple_loss=0.2429, pruned_loss=0.02876, over 24467.00 frames. ], tot_loss[loss=0.1516, simple_loss=0.2448, pruned_loss=0.02922, over 3055063.49 frames. ], batch size: 77, lr: 3.97e-03, grad_scale: 16.0 2023-12-04 17:08:03,390 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.115e+02 1.286e+02 1.429e+02 1.574e+02 2.172e+02, threshold=2.858e+02, percent-clipped=0.0 2023-12-04 17:08:49,136 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.79 vs. limit=22.5 2023-12-04 17:08:55,747 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 17:09:07,628 INFO [train.py:1087] (1/4) Epoch 62, batch 250, loss[loss=0.1434, simple_loss=0.2363, pruned_loss=0.02525, over 24731.00 frames. ], tot_loss[loss=0.1519, simple_loss=0.2451, pruned_loss=0.02934, over 3437770.03 frames. 
], batch size: 67, lr: 3.96e-03, grad_scale: 16.0 2023-12-04 17:09:08,213 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=365633.3333333333, ans=0.0 2023-12-04 17:09:10,701 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 17:09:10,833 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=365633.3333333333, ans=0.125 2023-12-04 17:09:37,804 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=365766.6666666667, ans=0.125 2023-12-04 17:09:37,942 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=365766.6666666667, ans=0.125 2023-12-04 17:09:56,223 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.49 vs. limit=15.0 2023-12-04 17:09:57,453 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=22.5 2023-12-04 17:10:01,169 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=365833.3333333333, ans=0.1 2023-12-04 17:10:18,257 INFO [train.py:1087] (1/4) Epoch 62, batch 300, loss[loss=0.1544, simple_loss=0.2474, pruned_loss=0.03076, over 23938.00 frames. ], tot_loss[loss=0.1519, simple_loss=0.2448, pruned_loss=0.02944, over 3741463.13 frames. ], batch size: 87, lr: 3.96e-03, grad_scale: 16.0 2023-12-04 17:10:23,644 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.157e+02 1.294e+02 1.414e+02 1.524e+02 2.154e+02, threshold=2.827e+02, percent-clipped=0.0 2023-12-04 17:10:24,141 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=365966.6666666667, ans=0.07 2023-12-04 17:10:31,519 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=366033.3333333333, ans=0.0 2023-12-04 17:10:43,001 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=366033.3333333333, ans=0.0 2023-12-04 17:10:49,605 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=366100.0, ans=0.125 2023-12-04 17:11:24,712 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=366233.3333333333, ans=0.125 2023-12-04 17:11:27,259 INFO [train.py:1087] (1/4) Epoch 62, batch 350, loss[loss=0.1487, simple_loss=0.2417, pruned_loss=0.02784, over 24857.00 frames. ], tot_loss[loss=0.1519, simple_loss=0.2449, pruned_loss=0.02948, over 3984623.35 frames. ], batch size: 68, lr: 3.96e-03, grad_scale: 16.0 2023-12-04 17:11:46,247 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=366366.6666666667, ans=0.0 2023-12-04 17:11:51,174 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.56 vs. 
limit=15.0 2023-12-04 17:12:01,537 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=366433.3333333333, ans=0.1 2023-12-04 17:12:10,752 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=366500.0, ans=10.0 2023-12-04 17:12:12,021 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 17:12:37,906 INFO [train.py:1087] (1/4) Epoch 62, batch 400, loss[loss=0.155, simple_loss=0.2478, pruned_loss=0.03112, over 23584.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.2451, pruned_loss=0.02957, over 4154657.01 frames. ], batch size: 94, lr: 3.96e-03, grad_scale: 32.0 2023-12-04 17:12:43,082 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.130e+02 1.289e+02 1.370e+02 1.489e+02 1.950e+02, threshold=2.740e+02, percent-clipped=0.0 2023-12-04 17:13:02,715 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.40 vs. limit=6.0 2023-12-04 17:13:08,742 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=366766.6666666667, ans=10.0 2023-12-04 17:13:30,072 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=366833.3333333333, ans=0.125 2023-12-04 17:13:36,034 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-12-04 17:13:44,252 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.01 vs. limit=10.0 2023-12-04 17:13:49,177 INFO [train.py:1087] (1/4) Epoch 62, batch 450, loss[loss=0.1651, simple_loss=0.2597, pruned_loss=0.03529, over 21470.00 frames. ], tot_loss[loss=0.1526, simple_loss=0.2455, pruned_loss=0.02984, over 4274939.79 frames. ], batch size: 127, lr: 3.96e-03, grad_scale: 32.0 2023-12-04 17:14:10,262 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.68 vs. limit=15.0 2023-12-04 17:14:19,994 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=367100.0, ans=0.125 2023-12-04 17:14:30,873 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=367166.6666666667, ans=0.2 2023-12-04 17:14:33,496 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=367166.6666666667, ans=0.0 2023-12-04 17:14:58,539 INFO [train.py:1087] (1/4) Epoch 62, batch 500, loss[loss=0.1528, simple_loss=0.2436, pruned_loss=0.03098, over 24103.00 frames. ], tot_loss[loss=0.1522, simple_loss=0.2451, pruned_loss=0.02968, over 4384797.59 frames. 
], batch size: 87, lr: 3.96e-03, grad_scale: 32.0 2023-12-04 17:15:01,507 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=367300.0, ans=0.5 2023-12-04 17:15:04,486 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.096e+02 1.254e+02 1.390e+02 1.611e+02 2.032e+02, threshold=2.780e+02, percent-clipped=0.0 2023-12-04 17:15:12,003 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=367366.6666666667, ans=0.125 2023-12-04 17:16:08,859 INFO [train.py:1087] (1/4) Epoch 62, batch 550, loss[loss=0.1529, simple_loss=0.2528, pruned_loss=0.02655, over 24771.00 frames. ], tot_loss[loss=0.1522, simple_loss=0.2453, pruned_loss=0.02954, over 4485902.39 frames. ], batch size: 64, lr: 3.95e-03, grad_scale: 32.0 2023-12-04 17:16:13,867 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=367633.3333333333, ans=0.125 2023-12-04 17:16:24,042 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=367700.0, ans=0.2 2023-12-04 17:16:25,458 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=367700.0, ans=0.09899494936611666 2023-12-04 17:16:29,052 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=5.99 vs. limit=12.0 2023-12-04 17:16:32,467 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=367700.0, ans=0.0 2023-12-04 17:16:39,645 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.07 vs. limit=15.0 2023-12-04 17:16:43,561 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.23 vs. limit=15.0 2023-12-04 17:16:46,381 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=367766.6666666667, ans=0.125 2023-12-04 17:17:03,661 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=367833.3333333333, ans=0.0 2023-12-04 17:17:19,977 INFO [train.py:1087] (1/4) Epoch 62, batch 600, loss[loss=0.1639, simple_loss=0.254, pruned_loss=0.0369, over 24203.00 frames. ], tot_loss[loss=0.1524, simple_loss=0.2454, pruned_loss=0.02973, over 4557775.05 frames. 
], batch size: 82, lr: 3.95e-03, grad_scale: 32.0 2023-12-04 17:17:25,846 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.120e+02 1.279e+02 1.362e+02 1.462e+02 1.926e+02, threshold=2.724e+02, percent-clipped=0.0 2023-12-04 17:17:26,331 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=367966.6666666667, ans=0.0 2023-12-04 17:17:32,554 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=367966.6666666667, ans=0.2 2023-12-04 17:17:36,557 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=368033.3333333333, ans=0.0 2023-12-04 17:18:07,131 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=368166.6666666667, ans=0.2 2023-12-04 17:18:30,221 INFO [train.py:1087] (1/4) Epoch 62, batch 650, loss[loss=0.1531, simple_loss=0.2544, pruned_loss=0.02593, over 24576.00 frames. ], tot_loss[loss=0.152, simple_loss=0.2452, pruned_loss=0.02943, over 4616857.47 frames. ], batch size: 65, lr: 3.95e-03, grad_scale: 32.0 2023-12-04 17:18:46,632 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=368366.6666666667, ans=0.125 2023-12-04 17:18:55,617 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.59 vs. limit=12.0 2023-12-04 17:19:02,142 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=368433.3333333333, ans=0.125 2023-12-04 17:19:07,190 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=368433.3333333333, ans=0.125 2023-12-04 17:19:29,307 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.66 vs. limit=12.0 2023-12-04 17:19:31,599 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=368566.6666666667, ans=0.0 2023-12-04 17:19:40,839 INFO [train.py:1087] (1/4) Epoch 62, batch 700, loss[loss=0.1565, simple_loss=0.2504, pruned_loss=0.03133, over 23055.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.2451, pruned_loss=0.02951, over 4652189.95 frames. ], batch size: 106, lr: 3.95e-03, grad_scale: 32.0 2023-12-04 17:19:46,064 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.235e+02 1.327e+02 1.428e+02 1.711e+02, threshold=2.654e+02, percent-clipped=0.0 2023-12-04 17:20:07,683 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=368766.6666666667, ans=0.125 2023-12-04 17:20:26,755 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=368833.3333333333, ans=0.0 2023-12-04 17:20:28,081 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=368833.3333333333, ans=0.0 2023-12-04 17:20:43,992 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=368900.0, ans=0.0 2023-12-04 17:20:50,478 INFO [train.py:1087] (1/4) Epoch 62, batch 750, loss[loss=0.1551, simple_loss=0.2499, pruned_loss=0.03021, over 23446.00 frames. 
], tot_loss[loss=0.1522, simple_loss=0.2452, pruned_loss=0.02961, over 4675916.09 frames. ], batch size: 94, lr: 3.95e-03, grad_scale: 16.0 2023-12-04 17:21:06,987 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=369033.3333333333, ans=0.125 2023-12-04 17:21:16,582 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.89 vs. limit=12.0 2023-12-04 17:21:24,056 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=369100.0, ans=0.0 2023-12-04 17:21:27,952 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-04 17:21:34,403 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=369166.6666666667, ans=0.04949747468305833 2023-12-04 17:21:35,871 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=369166.6666666667, ans=0.0 2023-12-04 17:21:58,805 INFO [train.py:1087] (1/4) Epoch 62, batch 800, loss[loss=0.1489, simple_loss=0.2365, pruned_loss=0.03062, over 24490.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.2451, pruned_loss=0.0296, over 4700439.53 frames. ], batch size: 77, lr: 3.94e-03, grad_scale: 32.0 2023-12-04 17:22:04,881 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.72 vs. limit=15.0 2023-12-04 17:22:05,819 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.079e+02 1.274e+02 1.367e+02 1.528e+02 1.973e+02, threshold=2.734e+02, percent-clipped=0.0 2023-12-04 17:22:08,542 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=369300.0, ans=0.0 2023-12-04 17:22:14,763 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=369366.6666666667, ans=0.0 2023-12-04 17:22:21,897 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=369366.6666666667, ans=0.0 2023-12-04 17:22:27,916 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=369433.3333333333, ans=0.1 2023-12-04 17:22:36,420 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=369500.0, ans=0.0 2023-12-04 17:22:53,889 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.33 vs. limit=15.0 2023-12-04 17:22:54,648 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=369566.6666666667, ans=0.0 2023-12-04 17:23:00,475 INFO [train.py:1087] (1/4) Epoch 62, batch 850, loss[loss=0.1514, simple_loss=0.246, pruned_loss=0.02834, over 24728.00 frames. ], tot_loss[loss=0.1518, simple_loss=0.2448, pruned_loss=0.02942, over 4741560.09 frames. 
], batch size: 69, lr: 3.94e-03, grad_scale: 32.0 2023-12-04 17:23:06,710 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=369633.3333333333, ans=0.2 2023-12-04 17:23:09,031 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=369633.3333333333, ans=0.125 2023-12-04 17:23:20,943 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=369700.0, ans=0.125 2023-12-04 17:23:35,102 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=369833.3333333333, ans=0.2 2023-12-04 17:23:49,580 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=369900.0, ans=0.1 2023-12-04 17:24:05,457 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=369933.3333333333, ans=0.0 2023-12-04 17:24:17,811 INFO [train.py:1087] (1/4) Epoch 63, batch 0, loss[loss=0.1452, simple_loss=0.2371, pruned_loss=0.02667, over 24550.00 frames. ], tot_loss[loss=0.1452, simple_loss=0.2371, pruned_loss=0.02667, over 24550.00 frames. ], batch size: 66, lr: 3.91e-03, grad_scale: 32.0 2023-12-04 17:24:17,813 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 17:24:28,961 INFO [zipformer.py:1876] (1/4) name=encoder.encoders.4.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([1.8113, 3.9987, 3.3319, 3.8119], device='cuda:1') 2023-12-04 17:24:33,804 INFO [train.py:1119] (1/4) Epoch 63, validation: loss=0.1507, simple_loss=0.2477, pruned_loss=0.02691, over 944034.00 frames. 2023-12-04 17:24:33,805 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 17:24:47,338 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.085e+02 1.268e+02 1.369e+02 1.467e+02 2.072e+02, threshold=2.738e+02, percent-clipped=0.0 2023-12-04 17:24:52,256 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.44 vs. limit=15.0 2023-12-04 17:25:39,561 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=370200.0, ans=0.05 2023-12-04 17:25:41,309 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.29 vs. limit=22.5 2023-12-04 17:25:43,157 INFO [train.py:1087] (1/4) Epoch 63, batch 50, loss[loss=0.1607, simple_loss=0.2517, pruned_loss=0.03489, over 24091.00 frames. ], tot_loss[loss=0.1533, simple_loss=0.2466, pruned_loss=0.02996, over 1083043.74 frames. ], batch size: 87, lr: 3.91e-03, grad_scale: 32.0 2023-12-04 17:25:51,910 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=370266.6666666667, ans=0.1 2023-12-04 17:26:52,599 INFO [train.py:1087] (1/4) Epoch 63, batch 100, loss[loss=0.1465, simple_loss=0.2419, pruned_loss=0.02561, over 24719.00 frames. ], tot_loss[loss=0.1519, simple_loss=0.2451, pruned_loss=0.02937, over 1913819.03 frames. 
], batch size: 69, lr: 3.91e-03, grad_scale: 32.0 2023-12-04 17:26:58,779 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=370600.0, ans=0.1 2023-12-04 17:27:07,273 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.073e+02 1.263e+02 1.363e+02 1.431e+02 2.144e+02, threshold=2.726e+02, percent-clipped=0.0 2023-12-04 17:27:20,178 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=370733.3333333333, ans=0.0 2023-12-04 17:27:33,699 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 17:27:36,851 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=370800.0, ans=0.125 2023-12-04 17:27:41,236 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.46 vs. limit=10.0 2023-12-04 17:27:45,042 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.59 vs. limit=15.0 2023-12-04 17:27:56,655 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=370866.6666666667, ans=0.125 2023-12-04 17:28:03,074 INFO [train.py:1087] (1/4) Epoch 63, batch 150, loss[loss=0.1513, simple_loss=0.2475, pruned_loss=0.02752, over 23042.00 frames. ], tot_loss[loss=0.1518, simple_loss=0.245, pruned_loss=0.02925, over 2555499.56 frames. ], batch size: 106, lr: 3.90e-03, grad_scale: 32.0 2023-12-04 17:28:41,064 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=371066.6666666667, ans=0.2 2023-12-04 17:29:13,702 INFO [train.py:1087] (1/4) Epoch 63, batch 200, loss[loss=0.1616, simple_loss=0.2588, pruned_loss=0.03223, over 24311.00 frames. ], tot_loss[loss=0.1517, simple_loss=0.245, pruned_loss=0.02926, over 3060483.38 frames. ], batch size: 79, lr: 3.90e-03, grad_scale: 32.0 2023-12-04 17:29:17,137 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.12 vs. limit=15.0 2023-12-04 17:29:18,439 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-12-04 17:29:26,805 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.071e+02 1.284e+02 1.367e+02 1.487e+02 1.776e+02, threshold=2.735e+02, percent-clipped=0.0 2023-12-04 17:29:38,717 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=371333.3333333333, ans=0.0 2023-12-04 17:30:08,049 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=371533.3333333333, ans=0.125 2023-12-04 17:30:08,158 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=371533.3333333333, ans=0.125 2023-12-04 17:30:08,604 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.79 vs. 
limit=15.0 2023-12-04 17:30:21,836 INFO [train.py:1087] (1/4) Epoch 63, batch 250, loss[loss=0.1588, simple_loss=0.2515, pruned_loss=0.03299, over 24587.00 frames. ], tot_loss[loss=0.1517, simple_loss=0.2447, pruned_loss=0.02931, over 3451379.17 frames. ], batch size: 65, lr: 3.90e-03, grad_scale: 32.0 2023-12-04 17:30:38,999 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=371666.6666666667, ans=0.5 2023-12-04 17:30:41,496 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=371666.6666666667, ans=0.125 2023-12-04 17:30:44,205 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=371666.6666666667, ans=0.125 2023-12-04 17:30:45,909 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=15.0 2023-12-04 17:31:01,262 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.68 vs. limit=15.0 2023-12-04 17:31:30,991 INFO [train.py:1087] (1/4) Epoch 63, batch 300, loss[loss=0.159, simple_loss=0.25, pruned_loss=0.03398, over 23940.00 frames. ], tot_loss[loss=0.1507, simple_loss=0.2439, pruned_loss=0.02878, over 3767896.48 frames. ], batch size: 87, lr: 3.90e-03, grad_scale: 32.0 2023-12-04 17:31:42,949 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.14 vs. limit=22.5 2023-12-04 17:31:45,099 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.111e+02 1.246e+02 1.351e+02 1.452e+02 1.759e+02, threshold=2.702e+02, percent-clipped=0.0 2023-12-04 17:32:05,013 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.92 vs. limit=10.0 2023-12-04 17:32:19,060 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=372133.3333333333, ans=0.0 2023-12-04 17:32:19,509 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.74 vs. limit=15.0 2023-12-04 17:32:23,080 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=372133.3333333333, ans=0.2 2023-12-04 17:32:40,445 INFO [train.py:1087] (1/4) Epoch 63, batch 350, loss[loss=0.1476, simple_loss=0.2407, pruned_loss=0.02725, over 24798.00 frames. ], tot_loss[loss=0.1514, simple_loss=0.2444, pruned_loss=0.02921, over 4003283.69 frames. ], batch size: 73, lr: 3.90e-03, grad_scale: 32.0 2023-12-04 17:32:43,586 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=372266.6666666667, ans=0.125 2023-12-04 17:33:11,976 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=372400.0, ans=0.125 2023-12-04 17:33:21,833 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.58 vs. 
limit=15.0 2023-12-04 17:33:40,867 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=372533.3333333333, ans=0.0 2023-12-04 17:33:48,548 INFO [train.py:1087] (1/4) Epoch 63, batch 400, loss[loss=0.1541, simple_loss=0.2505, pruned_loss=0.02879, over 24794.00 frames. ], tot_loss[loss=0.1519, simple_loss=0.2447, pruned_loss=0.02953, over 4172450.88 frames. ], batch size: 72, lr: 3.90e-03, grad_scale: 32.0 2023-12-04 17:33:50,281 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=372600.0, ans=0.125 2023-12-04 17:33:56,805 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=372600.0, ans=0.125 2023-12-04 17:33:58,472 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=372600.0, ans=0.0 2023-12-04 17:34:01,188 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.67 vs. limit=10.0 2023-12-04 17:34:02,378 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.145e+02 1.304e+02 1.404e+02 1.512e+02 1.872e+02, threshold=2.808e+02, percent-clipped=0.0 2023-12-04 17:34:26,269 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=372733.3333333333, ans=0.0 2023-12-04 17:34:27,622 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=372733.3333333333, ans=0.125 2023-12-04 17:34:35,756 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.60 vs. limit=15.0 2023-12-04 17:34:47,501 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=372866.6666666667, ans=0.125 2023-12-04 17:34:49,015 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=372866.6666666667, ans=0.125 2023-12-04 17:34:57,666 INFO [train.py:1087] (1/4) Epoch 63, batch 450, loss[loss=0.1742, simple_loss=0.2588, pruned_loss=0.0448, over 16525.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.2448, pruned_loss=0.02968, over 4306727.71 frames. ], batch size: 177, lr: 3.89e-03, grad_scale: 32.0 2023-12-04 17:35:04,186 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.89 vs. limit=15.0 2023-12-04 17:35:05,484 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.90 vs. limit=6.0 2023-12-04 17:35:21,083 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.14 vs. limit=12.0 2023-12-04 17:35:27,727 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.13 vs. 
limit=22.5 2023-12-04 17:35:46,249 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=373133.3333333333, ans=0.125 2023-12-04 17:36:08,734 INFO [train.py:1087] (1/4) Epoch 63, batch 500, loss[loss=0.1498, simple_loss=0.2408, pruned_loss=0.0294, over 24781.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.245, pruned_loss=0.02963, over 4426395.72 frames. ], batch size: 71, lr: 3.89e-03, grad_scale: 32.0 2023-12-04 17:36:11,892 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=373266.6666666667, ans=0.125 2023-12-04 17:36:18,470 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=373266.6666666667, ans=0.1 2023-12-04 17:36:26,474 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.037e+02 1.250e+02 1.321e+02 1.428e+02 2.145e+02, threshold=2.643e+02, percent-clipped=0.0 2023-12-04 17:36:44,217 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=373400.0, ans=0.0 2023-12-04 17:36:53,151 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=373400.0, ans=0.0 2023-12-04 17:36:58,418 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=373466.6666666667, ans=0.0 2023-12-04 17:37:21,409 INFO [train.py:1087] (1/4) Epoch 63, batch 550, loss[loss=0.1465, simple_loss=0.2404, pruned_loss=0.02631, over 24791.00 frames. ], tot_loss[loss=0.1521, simple_loss=0.245, pruned_loss=0.02962, over 4506639.84 frames. ], batch size: 72, lr: 3.89e-03, grad_scale: 32.0 2023-12-04 17:37:22,237 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.93 vs. limit=22.5 2023-12-04 17:37:27,061 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=5.00 vs. limit=15.0 2023-12-04 17:37:34,885 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=373666.6666666667, ans=0.0 2023-12-04 17:37:39,978 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=373666.6666666667, ans=0.0 2023-12-04 17:37:50,964 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.88 vs. limit=15.0 2023-12-04 17:38:29,463 INFO [train.py:1087] (1/4) Epoch 63, batch 600, loss[loss=0.1429, simple_loss=0.2338, pruned_loss=0.02603, over 24558.00 frames. ], tot_loss[loss=0.1518, simple_loss=0.2448, pruned_loss=0.02939, over 4577949.67 frames. 
], batch size: 66, lr: 3.89e-03, grad_scale: 32.0 2023-12-04 17:38:32,452 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=373933.3333333333, ans=0.0 2023-12-04 17:38:40,351 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=373933.3333333333, ans=0.125 2023-12-04 17:38:43,611 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.107e+02 1.277e+02 1.352e+02 1.461e+02 2.062e+02, threshold=2.704e+02, percent-clipped=0.0 2023-12-04 17:39:02,883 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.87 vs. limit=15.0 2023-12-04 17:39:14,163 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=374133.3333333333, ans=0.125 2023-12-04 17:39:16,904 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.31 vs. limit=6.0 2023-12-04 17:39:30,845 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=374200.0, ans=0.125 2023-12-04 17:39:33,497 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=374200.0, ans=0.125 2023-12-04 17:39:36,450 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-12-04 17:39:36,774 INFO [train.py:1087] (1/4) Epoch 63, batch 650, loss[loss=0.1538, simple_loss=0.2456, pruned_loss=0.03102, over 24530.00 frames. ], tot_loss[loss=0.1519, simple_loss=0.2448, pruned_loss=0.02949, over 4632282.63 frames. ], batch size: 75, lr: 3.89e-03, grad_scale: 32.0 2023-12-04 17:39:39,769 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=374266.6666666667, ans=0.0 2023-12-04 17:39:46,668 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=374266.6666666667, ans=0.0 2023-12-04 17:39:57,864 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=374333.3333333333, ans=0.0 2023-12-04 17:40:09,883 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=374400.0, ans=0.025 2023-12-04 17:40:40,377 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=374533.3333333333, ans=0.0 2023-12-04 17:40:44,454 INFO [train.py:1087] (1/4) Epoch 63, batch 700, loss[loss=0.1483, simple_loss=0.2434, pruned_loss=0.02658, over 24805.00 frames. ], tot_loss[loss=0.1516, simple_loss=0.2446, pruned_loss=0.02933, over 4665388.01 frames. ], batch size: 73, lr: 3.89e-03, grad_scale: 32.0 2023-12-04 17:40:58,411 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.094e+02 1.326e+02 1.433e+02 1.543e+02 2.132e+02, threshold=2.865e+02, percent-clipped=0.0 2023-12-04 17:41:06,949 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.24 vs. 
limit=15.0 2023-12-04 17:41:09,269 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.43 vs. limit=15.0 2023-12-04 17:41:24,807 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.08 vs. limit=15.0 2023-12-04 17:41:26,991 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=374800.0, ans=0.125 2023-12-04 17:41:32,611 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=374800.0, ans=6.0 2023-12-04 17:41:35,482 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=374800.0, ans=0.0 2023-12-04 17:41:52,551 INFO [train.py:1087] (1/4) Epoch 63, batch 750, loss[loss=0.1659, simple_loss=0.2528, pruned_loss=0.03952, over 24161.00 frames. ], tot_loss[loss=0.1516, simple_loss=0.2446, pruned_loss=0.02928, over 4706855.87 frames. ], batch size: 87, lr: 3.88e-03, grad_scale: 32.0 2023-12-04 17:42:17,913 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=375066.6666666667, ans=0.125 2023-12-04 17:42:28,981 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.27 vs. limit=15.0 2023-12-04 17:42:34,032 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=375133.3333333333, ans=0.125 2023-12-04 17:42:59,298 INFO [train.py:1087] (1/4) Epoch 63, batch 800, loss[loss=0.1492, simple_loss=0.2427, pruned_loss=0.02783, over 24767.00 frames. ], tot_loss[loss=0.1511, simple_loss=0.2441, pruned_loss=0.02903, over 4740581.27 frames. ], batch size: 66, lr: 3.88e-03, grad_scale: 32.0 2023-12-04 17:43:13,168 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.147e+02 1.287e+02 1.371e+02 1.480e+02 1.982e+02, threshold=2.741e+02, percent-clipped=0.0 2023-12-04 17:43:33,145 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=375400.0, ans=0.125 2023-12-04 17:43:43,708 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=375466.6666666667, ans=0.125 2023-12-04 17:43:45,307 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=15.0 2023-12-04 17:43:59,459 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=375600.0, ans=0.0 2023-12-04 17:44:00,204 INFO [train.py:1087] (1/4) Epoch 63, batch 850, loss[loss=0.1552, simple_loss=0.2426, pruned_loss=0.03394, over 24463.00 frames. ], tot_loss[loss=0.1516, simple_loss=0.2445, pruned_loss=0.02938, over 4758977.71 frames. 
], batch size: 77, lr: 3.88e-03, grad_scale: 16.0 2023-12-04 17:44:31,021 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 17:44:37,851 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=375800.0, ans=0.0 2023-12-04 17:44:43,579 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=375800.0, ans=0.1 2023-12-04 17:44:44,785 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=375800.0, ans=0.0 2023-12-04 17:44:47,993 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=375866.6666666667, ans=0.0 2023-12-04 17:45:07,834 INFO [train.py:1087] (1/4) Epoch 64, batch 0, loss[loss=0.139, simple_loss=0.2324, pruned_loss=0.02281, over 24702.00 frames. ], tot_loss[loss=0.139, simple_loss=0.2324, pruned_loss=0.02281, over 24702.00 frames. ], batch size: 69, lr: 3.85e-03, grad_scale: 32.0 2023-12-04 17:45:07,836 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 17:45:23,277 INFO [train.py:1119] (1/4) Epoch 64, validation: loss=0.1503, simple_loss=0.2474, pruned_loss=0.02664, over 944034.00 frames. 2023-12-04 17:45:23,277 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 17:45:25,244 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.41 vs. limit=15.0 2023-12-04 17:45:28,533 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=375900.0, ans=0.0 2023-12-04 17:45:43,719 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=375966.6666666667, ans=0.0 2023-12-04 17:45:43,818 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=375966.6666666667, ans=0.125 2023-12-04 17:45:44,664 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.120e+02 1.332e+02 1.453e+02 1.597e+02 2.099e+02, threshold=2.906e+02, percent-clipped=0.0 2023-12-04 17:45:46,200 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.07 vs. limit=15.0 2023-12-04 17:46:03,548 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=376100.0, ans=0.1 2023-12-04 17:46:30,843 INFO [train.py:1087] (1/4) Epoch 64, batch 50, loss[loss=0.1415, simple_loss=0.2356, pruned_loss=0.02372, over 24797.00 frames. ], tot_loss[loss=0.1504, simple_loss=0.2437, pruned_loss=0.02859, over 1090115.53 frames. ], batch size: 71, lr: 3.85e-03, grad_scale: 32.0 2023-12-04 17:46:41,482 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=376233.3333333333, ans=0.125 2023-12-04 17:46:43,940 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=376300.0, ans=0.125 2023-12-04 17:47:13,709 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=8.67 vs. 
limit=15.0 2023-12-04 17:47:18,509 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=376433.3333333333, ans=0.0 2023-12-04 17:47:34,170 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.71 vs. limit=15.0 2023-12-04 17:47:36,069 INFO [train.py:1087] (1/4) Epoch 64, batch 100, loss[loss=0.1452, simple_loss=0.2396, pruned_loss=0.02538, over 24781.00 frames. ], tot_loss[loss=0.1496, simple_loss=0.2431, pruned_loss=0.02809, over 1918639.10 frames. ], batch size: 73, lr: 3.85e-03, grad_scale: 32.0 2023-12-04 17:47:44,135 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=376566.6666666667, ans=0.0 2023-12-04 17:47:53,015 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=376633.3333333333, ans=0.125 2023-12-04 17:47:53,754 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.36 vs. limit=10.0 2023-12-04 17:47:57,753 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.095e+02 1.310e+02 1.417e+02 1.571e+02 2.494e+02, threshold=2.835e+02, percent-clipped=0.0 2023-12-04 17:48:05,844 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=376700.0, ans=0.0 2023-12-04 17:48:20,538 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=376766.6666666667, ans=0.125 2023-12-04 17:48:24,047 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.06 vs. limit=15.0 2023-12-04 17:48:36,304 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.30 vs. limit=15.0 2023-12-04 17:48:42,182 INFO [train.py:1087] (1/4) Epoch 64, batch 150, loss[loss=0.1496, simple_loss=0.243, pruned_loss=0.02805, over 21655.00 frames. ], tot_loss[loss=0.1499, simple_loss=0.2428, pruned_loss=0.02846, over 2571546.23 frames. ], batch size: 128, lr: 3.84e-03, grad_scale: 16.0 2023-12-04 17:49:03,648 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=376966.6666666667, ans=0.95 2023-12-04 17:49:03,689 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=376966.6666666667, ans=0.125 2023-12-04 17:49:20,236 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=377033.3333333333, ans=0.125 2023-12-04 17:49:50,773 INFO [train.py:1087] (1/4) Epoch 64, batch 200, loss[loss=0.1431, simple_loss=0.2364, pruned_loss=0.02486, over 24797.00 frames. ], tot_loss[loss=0.1508, simple_loss=0.2436, pruned_loss=0.02894, over 3040638.68 frames. 
], batch size: 71, lr: 3.84e-03, grad_scale: 16.0 2023-12-04 17:50:01,592 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=377233.3333333333, ans=0.125 2023-12-04 17:50:04,629 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.38 vs. limit=22.5 2023-12-04 17:50:13,828 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.134e+02 1.291e+02 1.424e+02 1.631e+02 2.278e+02, threshold=2.847e+02, percent-clipped=0.0 2023-12-04 17:50:29,273 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.04 vs. limit=10.0 2023-12-04 17:50:57,099 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=377566.6666666667, ans=0.125 2023-12-04 17:50:57,984 INFO [train.py:1087] (1/4) Epoch 64, batch 250, loss[loss=0.1631, simple_loss=0.2595, pruned_loss=0.03334, over 21300.00 frames. ], tot_loss[loss=0.1512, simple_loss=0.2441, pruned_loss=0.02916, over 3426351.09 frames. ], batch size: 128, lr: 3.84e-03, grad_scale: 16.0 2023-12-04 17:51:30,991 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=377700.0, ans=0.125 2023-12-04 17:51:32,934 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=12.0 2023-12-04 17:51:47,198 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=377766.6666666667, ans=0.125 2023-12-04 17:52:01,319 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=377833.3333333333, ans=0.1 2023-12-04 17:52:05,731 INFO [train.py:1087] (1/4) Epoch 64, batch 300, loss[loss=0.1571, simple_loss=0.2462, pruned_loss=0.03394, over 24559.00 frames. ], tot_loss[loss=0.1516, simple_loss=0.2443, pruned_loss=0.02942, over 3713431.58 frames. 
], batch size: 66, lr: 3.84e-03, grad_scale: 16.0 2023-12-04 17:52:08,680 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=377900.0, ans=0.125 2023-12-04 17:52:13,905 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=377900.0, ans=0.0 2023-12-04 17:52:19,337 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 17:52:23,558 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=377966.6666666667, ans=0.125 2023-12-04 17:52:24,820 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=377966.6666666667, ans=0.125 2023-12-04 17:52:28,297 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.095e+02 1.270e+02 1.343e+02 1.484e+02 1.945e+02, threshold=2.686e+02, percent-clipped=0.0 2023-12-04 17:52:54,949 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=378100.0, ans=0.125 2023-12-04 17:53:13,689 INFO [train.py:1087] (1/4) Epoch 64, batch 350, loss[loss=0.1586, simple_loss=0.2499, pruned_loss=0.03363, over 23509.00 frames. ], tot_loss[loss=0.1516, simple_loss=0.2443, pruned_loss=0.02944, over 3956071.24 frames. ], batch size: 94, lr: 3.84e-03, grad_scale: 16.0 2023-12-04 17:53:16,367 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=378233.3333333333, ans=0.0 2023-12-04 17:53:16,389 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=378233.3333333333, ans=0.125 2023-12-04 17:53:28,817 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.99 vs. limit=15.0 2023-12-04 17:53:38,851 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=378300.0, ans=0.2 2023-12-04 17:53:51,772 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-12-04 17:54:00,686 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=378433.3333333333, ans=0.2 2023-12-04 17:54:13,466 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=378500.0, ans=0.0 2023-12-04 17:54:19,829 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=378500.0, ans=0.125 2023-12-04 17:54:22,730 INFO [train.py:1087] (1/4) Epoch 64, batch 400, loss[loss=0.1556, simple_loss=0.2462, pruned_loss=0.03246, over 24487.00 frames. ], tot_loss[loss=0.1513, simple_loss=0.2441, pruned_loss=0.02929, over 4139838.04 frames. 
], batch size: 75, lr: 3.84e-03, grad_scale: 16.0 2023-12-04 17:54:26,896 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=378566.6666666667, ans=0.2 2023-12-04 17:54:28,371 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=378566.6666666667, ans=0.125 2023-12-04 17:54:47,360 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.121e+02 1.264e+02 1.356e+02 1.518e+02 1.714e+02, threshold=2.711e+02, percent-clipped=0.0 2023-12-04 17:54:53,998 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=378700.0, ans=0.125 2023-12-04 17:54:56,528 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=378700.0, ans=0.125 2023-12-04 17:55:10,669 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=378766.6666666667, ans=0.125 2023-12-04 17:55:16,236 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.98 vs. limit=15.0 2023-12-04 17:55:29,800 INFO [train.py:1087] (1/4) Epoch 64, batch 450, loss[loss=0.1441, simple_loss=0.2406, pruned_loss=0.02387, over 24765.00 frames. ], tot_loss[loss=0.1508, simple_loss=0.2437, pruned_loss=0.02898, over 4294387.34 frames. ], batch size: 71, lr: 3.83e-03, grad_scale: 16.0 2023-12-04 17:55:50,403 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=378966.6666666667, ans=0.04949747468305833 2023-12-04 17:55:50,417 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378966.6666666667, ans=0.1 2023-12-04 17:56:06,044 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=379033.3333333333, ans=0.125 2023-12-04 17:56:18,358 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-12-04 17:56:38,606 INFO [train.py:1087] (1/4) Epoch 64, batch 500, loss[loss=0.1538, simple_loss=0.246, pruned_loss=0.03081, over 24782.00 frames. ], tot_loss[loss=0.1506, simple_loss=0.2436, pruned_loss=0.02885, over 4409347.29 frames. ], batch size: 70, lr: 3.83e-03, grad_scale: 16.0 2023-12-04 17:57:01,428 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.261e+02 1.354e+02 1.492e+02 2.080e+02, threshold=2.707e+02, percent-clipped=0.0 2023-12-04 17:57:08,816 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=379366.6666666667, ans=0.1 2023-12-04 17:57:14,632 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=379366.6666666667, ans=0.125 2023-12-04 17:57:25,043 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=379433.3333333333, ans=12.0 2023-12-04 17:57:26,534 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.34 vs. 
limit=15.0 2023-12-04 17:57:27,290 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=379433.3333333333, ans=0.125 2023-12-04 17:57:29,739 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=379433.3333333333, ans=0.0 2023-12-04 17:57:45,063 INFO [train.py:1087] (1/4) Epoch 64, batch 550, loss[loss=0.1476, simple_loss=0.236, pruned_loss=0.02959, over 24472.00 frames. ], tot_loss[loss=0.1505, simple_loss=0.2434, pruned_loss=0.02876, over 4498514.74 frames. ], batch size: 75, lr: 3.83e-03, grad_scale: 16.0 2023-12-04 17:57:52,779 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=379566.6666666667, ans=0.125 2023-12-04 17:57:55,452 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=379566.6666666667, ans=0.0 2023-12-04 17:58:18,675 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 17:58:40,973 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=379833.3333333333, ans=0.125 2023-12-04 17:58:54,854 INFO [train.py:1087] (1/4) Epoch 64, batch 600, loss[loss=0.1627, simple_loss=0.2566, pruned_loss=0.03437, over 24230.00 frames. ], tot_loss[loss=0.151, simple_loss=0.2439, pruned_loss=0.02908, over 4565832.84 frames. ], batch size: 58, lr: 3.83e-03, grad_scale: 16.0 2023-12-04 17:59:20,045 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.164e+02 1.313e+02 1.387e+02 1.529e+02 2.312e+02, threshold=2.774e+02, percent-clipped=0.0 2023-12-04 17:59:28,081 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=380033.3333333333, ans=0.125 2023-12-04 17:59:59,143 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=380166.6666666667, ans=0.1 2023-12-04 18:00:04,005 INFO [train.py:1087] (1/4) Epoch 64, batch 650, loss[loss=0.1511, simple_loss=0.2434, pruned_loss=0.02939, over 24155.00 frames. ], tot_loss[loss=0.1509, simple_loss=0.2438, pruned_loss=0.02899, over 4630691.66 frames. ], batch size: 58, lr: 3.83e-03, grad_scale: 16.0 2023-12-04 18:00:17,392 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=380300.0, ans=0.2 2023-12-04 18:00:18,688 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=380300.0, ans=0.125 2023-12-04 18:00:20,172 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=380300.0, ans=0.125 2023-12-04 18:00:26,485 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.21 vs. 
limit=6.0 2023-12-04 18:00:45,838 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=380433.3333333333, ans=0.125 2023-12-04 18:00:59,247 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=380500.0, ans=0.0 2023-12-04 18:01:11,456 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=380500.0, ans=0.2 2023-12-04 18:01:13,802 INFO [train.py:1087] (1/4) Epoch 64, batch 700, loss[loss=0.1515, simple_loss=0.2481, pruned_loss=0.02748, over 21630.00 frames. ], tot_loss[loss=0.1505, simple_loss=0.2436, pruned_loss=0.02869, over 4682503.29 frames. ], batch size: 127, lr: 3.83e-03, grad_scale: 16.0 2023-12-04 18:01:20,320 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.82 vs. limit=10.0 2023-12-04 18:01:31,531 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=380633.3333333333, ans=0.0 2023-12-04 18:01:36,761 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=380633.3333333333, ans=0.1 2023-12-04 18:01:37,714 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.134e+02 1.273e+02 1.345e+02 1.468e+02 2.516e+02, threshold=2.691e+02, percent-clipped=0.0 2023-12-04 18:01:49,886 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=380700.0, ans=0.09899494936611666 2023-12-04 18:01:49,928 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=380700.0, ans=0.125 2023-12-04 18:01:59,535 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=380766.6666666667, ans=0.0 2023-12-04 18:01:59,571 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=380766.6666666667, ans=0.1 2023-12-04 18:02:11,495 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=380833.3333333333, ans=0.125 2023-12-04 18:02:22,225 INFO [train.py:1087] (1/4) Epoch 64, batch 750, loss[loss=0.1375, simple_loss=0.2344, pruned_loss=0.0203, over 24799.00 frames. ], tot_loss[loss=0.1508, simple_loss=0.2439, pruned_loss=0.02886, over 4689969.26 frames. ], batch size: 72, lr: 3.82e-03, grad_scale: 16.0 2023-12-04 18:02:50,115 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=381033.3333333333, ans=0.1 2023-12-04 18:02:54,121 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=381033.3333333333, ans=0.125 2023-12-04 18:02:54,187 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=381033.3333333333, ans=0.125 2023-12-04 18:02:55,753 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.00 vs. 
limit=10.0 2023-12-04 18:03:25,824 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=381166.6666666667, ans=0.0 2023-12-04 18:03:30,798 INFO [train.py:1087] (1/4) Epoch 64, batch 800, loss[loss=0.1434, simple_loss=0.2361, pruned_loss=0.02535, over 24593.00 frames. ], tot_loss[loss=0.1505, simple_loss=0.2436, pruned_loss=0.02871, over 4721862.67 frames. ], batch size: 68, lr: 3.82e-03, grad_scale: 32.0 2023-12-04 18:03:43,704 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=381300.0, ans=0.125 2023-12-04 18:03:55,029 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.099e+02 1.286e+02 1.344e+02 1.464e+02 1.844e+02, threshold=2.688e+02, percent-clipped=0.0 2023-12-04 18:04:02,446 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=381366.6666666667, ans=0.1 2023-12-04 18:04:15,882 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=381433.3333333333, ans=0.0 2023-12-04 18:04:34,038 INFO [train.py:1087] (1/4) Epoch 64, batch 850, loss[loss=0.1638, simple_loss=0.2586, pruned_loss=0.03451, over 21431.00 frames. ], tot_loss[loss=0.1503, simple_loss=0.2435, pruned_loss=0.02858, over 4751304.09 frames. ], batch size: 127, lr: 3.82e-03, grad_scale: 32.0 2023-12-04 18:04:43,921 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=15.0 2023-12-04 18:04:49,716 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=381633.3333333333, ans=0.0 2023-12-04 18:05:46,375 INFO [train.py:1087] (1/4) Epoch 65, batch 0, loss[loss=0.1605, simple_loss=0.2519, pruned_loss=0.03459, over 23484.00 frames. ], tot_loss[loss=0.1605, simple_loss=0.2519, pruned_loss=0.03459, over 23484.00 frames. ], batch size: 94, lr: 3.79e-03, grad_scale: 32.0 2023-12-04 18:05:46,376 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 18:06:02,504 INFO [train.py:1119] (1/4) Epoch 65, validation: loss=0.1513, simple_loss=0.2479, pruned_loss=0.02732, over 944034.00 frames. 2023-12-04 18:06:02,505 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 18:06:25,852 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.63 vs. 
limit=15.0 2023-12-04 18:06:26,703 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=381933.3333333333, ans=0.125 2023-12-04 18:06:34,377 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.289e+02 1.379e+02 1.555e+02 2.140e+02, threshold=2.758e+02, percent-clipped=0.0 2023-12-04 18:06:40,337 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=382000.0, ans=0.1 2023-12-04 18:06:45,433 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=382066.6666666667, ans=0.1 2023-12-04 18:06:54,238 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=382066.6666666667, ans=0.125 2023-12-04 18:07:01,381 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=382133.3333333333, ans=0.125 2023-12-04 18:07:10,721 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=382200.0, ans=0.0 2023-12-04 18:07:11,816 INFO [train.py:1087] (1/4) Epoch 65, batch 50, loss[loss=0.1456, simple_loss=0.2358, pruned_loss=0.02775, over 24570.00 frames. ], tot_loss[loss=0.1508, simple_loss=0.2442, pruned_loss=0.02871, over 1071966.26 frames. ], batch size: 65, lr: 3.79e-03, grad_scale: 32.0 2023-12-04 18:07:22,592 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=382200.0, ans=0.1 2023-12-04 18:07:30,502 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.18 vs. limit=22.5 2023-12-04 18:07:35,246 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=382266.6666666667, ans=0.1 2023-12-04 18:08:17,700 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=382533.3333333333, ans=0.1 2023-12-04 18:08:18,651 INFO [train.py:1087] (1/4) Epoch 65, batch 100, loss[loss=0.1492, simple_loss=0.2418, pruned_loss=0.02832, over 24508.00 frames. ], tot_loss[loss=0.1516, simple_loss=0.2449, pruned_loss=0.02917, over 1884100.01 frames. 
], batch size: 75, lr: 3.79e-03, grad_scale: 32.0 2023-12-04 18:08:42,494 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=382600.0, ans=0.125 2023-12-04 18:08:49,047 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.114e+02 1.282e+02 1.383e+02 1.510e+02 1.893e+02, threshold=2.766e+02, percent-clipped=0.0 2023-12-04 18:09:02,348 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=382733.3333333333, ans=0.2 2023-12-04 18:09:08,785 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=382733.3333333333, ans=0.07 2023-12-04 18:09:13,189 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=382800.0, ans=0.1 2023-12-04 18:09:21,867 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=382800.0, ans=0.0 2023-12-04 18:09:23,268 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=382800.0, ans=0.0 2023-12-04 18:09:25,495 INFO [train.py:1087] (1/4) Epoch 65, batch 150, loss[loss=0.1469, simple_loss=0.2396, pruned_loss=0.02715, over 24702.00 frames. ], tot_loss[loss=0.1514, simple_loss=0.2449, pruned_loss=0.02896, over 2527702.74 frames. ], batch size: 74, lr: 3.78e-03, grad_scale: 32.0 2023-12-04 18:09:28,381 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=382866.6666666667, ans=0.0 2023-12-04 18:09:30,035 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.62 vs. limit=22.5 2023-12-04 18:09:42,344 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.59 vs. limit=15.0 2023-12-04 18:09:48,919 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=382933.3333333333, ans=0.125 2023-12-04 18:09:55,456 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=383000.0, ans=0.0 2023-12-04 18:10:33,461 INFO [train.py:1087] (1/4) Epoch 65, batch 200, loss[loss=0.1366, simple_loss=0.2309, pruned_loss=0.02115, over 24792.00 frames. ], tot_loss[loss=0.1511, simple_loss=0.2447, pruned_loss=0.02875, over 3043028.51 frames. 
], batch size: 72, lr: 3.78e-03, grad_scale: 32.0 2023-12-04 18:10:40,454 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=383200.0, ans=0.125 2023-12-04 18:10:56,316 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=383266.6666666667, ans=0.125 2023-12-04 18:11:05,116 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.093e+02 1.275e+02 1.374e+02 1.477e+02 1.891e+02, threshold=2.748e+02, percent-clipped=0.0 2023-12-04 18:11:18,185 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=383400.0, ans=0.2 2023-12-04 18:11:38,965 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=383466.6666666667, ans=0.125 2023-12-04 18:11:40,373 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=383466.6666666667, ans=0.0 2023-12-04 18:11:42,634 INFO [train.py:1087] (1/4) Epoch 65, batch 250, loss[loss=0.1498, simple_loss=0.243, pruned_loss=0.02828, over 24570.00 frames. ], tot_loss[loss=0.1508, simple_loss=0.2441, pruned_loss=0.02875, over 3441755.88 frames. ], batch size: 65, lr: 3.78e-03, grad_scale: 32.0 2023-12-04 18:11:44,200 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=383533.3333333333, ans=0.125 2023-12-04 18:11:49,258 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=383533.3333333333, ans=0.125 2023-12-04 18:11:56,473 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=383600.0, ans=0.1 2023-12-04 18:11:56,561 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=383600.0, ans=0.07 2023-12-04 18:12:01,633 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=383600.0, ans=0.0 2023-12-04 18:12:27,807 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=383733.3333333333, ans=0.1 2023-12-04 18:12:31,506 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=383733.3333333333, ans=0.125 2023-12-04 18:12:48,548 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=383800.0, ans=0.1 2023-12-04 18:12:48,607 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=383800.0, ans=0.1 2023-12-04 18:12:50,925 INFO [train.py:1087] (1/4) Epoch 65, batch 300, loss[loss=0.1554, simple_loss=0.2485, pruned_loss=0.03121, over 24724.00 frames. ], tot_loss[loss=0.1501, simple_loss=0.2435, pruned_loss=0.0284, over 3757196.63 frames. 
], batch size: 61, lr: 3.78e-03, grad_scale: 32.0 2023-12-04 18:13:04,509 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=383933.3333333333, ans=0.125 2023-12-04 18:13:05,830 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=383933.3333333333, ans=0.0 2023-12-04 18:13:22,455 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.084e+02 1.249e+02 1.342e+02 1.427e+02 2.138e+02, threshold=2.684e+02, percent-clipped=0.0 2023-12-04 18:13:33,483 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=384066.6666666667, ans=0.125 2023-12-04 18:13:51,664 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=384133.3333333333, ans=0.125 2023-12-04 18:13:59,751 INFO [train.py:1087] (1/4) Epoch 65, batch 350, loss[loss=0.1435, simple_loss=0.2338, pruned_loss=0.0266, over 24757.00 frames. ], tot_loss[loss=0.1504, simple_loss=0.2436, pruned_loss=0.02857, over 3995077.22 frames. ], batch size: 66, lr: 3.78e-03, grad_scale: 32.0 2023-12-04 18:14:47,951 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=384400.0, ans=0.2 2023-12-04 18:14:52,305 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.89 vs. limit=15.0 2023-12-04 18:14:54,861 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.25 vs. limit=6.0 2023-12-04 18:15:04,977 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 18:15:08,622 INFO [train.py:1087] (1/4) Epoch 65, batch 400, loss[loss=0.144, simple_loss=0.2368, pruned_loss=0.02559, over 24546.00 frames. ], tot_loss[loss=0.1501, simple_loss=0.2433, pruned_loss=0.02842, over 4183040.80 frames. ], batch size: 62, lr: 3.78e-03, grad_scale: 32.0 2023-12-04 18:15:41,144 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.068e+02 1.269e+02 1.375e+02 1.530e+02 1.934e+02, threshold=2.749e+02, percent-clipped=0.0 2023-12-04 18:16:16,389 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-12-04 18:16:18,104 INFO [train.py:1087] (1/4) Epoch 65, batch 450, loss[loss=0.1459, simple_loss=0.2395, pruned_loss=0.02617, over 24788.00 frames. ], tot_loss[loss=0.1502, simple_loss=0.2434, pruned_loss=0.0285, over 4330323.51 frames. 
], batch size: 73, lr: 3.77e-03, grad_scale: 32.0 2023-12-04 18:16:19,625 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=384866.6666666667, ans=0.0 2023-12-04 18:16:22,219 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=384866.6666666667, ans=0.125 2023-12-04 18:16:28,680 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=384866.6666666667, ans=0.125 2023-12-04 18:16:47,561 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=385000.0, ans=0.0 2023-12-04 18:17:02,006 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=385066.6666666667, ans=0.2 2023-12-04 18:17:06,514 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.96 vs. limit=22.5 2023-12-04 18:17:14,114 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=385133.3333333333, ans=0.1 2023-12-04 18:17:26,537 INFO [train.py:1087] (1/4) Epoch 65, batch 500, loss[loss=0.1558, simple_loss=0.246, pruned_loss=0.03283, over 24288.00 frames. ], tot_loss[loss=0.1503, simple_loss=0.2436, pruned_loss=0.0285, over 4433683.00 frames. ], batch size: 79, lr: 3.77e-03, grad_scale: 32.0 2023-12-04 18:17:38,677 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 18:17:57,032 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.140e+02 1.262e+02 1.332e+02 1.433e+02 2.134e+02, threshold=2.663e+02, percent-clipped=0.0 2023-12-04 18:18:01,359 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=385333.3333333333, ans=10.0 2023-12-04 18:18:06,918 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=385400.0, ans=0.5 2023-12-04 18:18:14,939 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=385400.0, ans=0.025 2023-12-04 18:18:29,517 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=385466.6666666667, ans=0.0 2023-12-04 18:18:34,268 INFO [train.py:1087] (1/4) Epoch 65, batch 550, loss[loss=0.1617, simple_loss=0.2508, pruned_loss=0.03634, over 17344.00 frames. ], tot_loss[loss=0.1506, simple_loss=0.2436, pruned_loss=0.02873, over 4499245.73 frames. 
], batch size: 177, lr: 3.77e-03, grad_scale: 32.0 2023-12-04 18:18:34,706 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=385533.3333333333, ans=0.1 2023-12-04 18:19:11,680 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=385666.6666666667, ans=0.125 2023-12-04 18:19:25,822 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=385733.3333333333, ans=0.0 2023-12-04 18:19:27,292 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=385733.3333333333, ans=0.125 2023-12-04 18:19:43,707 INFO [train.py:1087] (1/4) Epoch 65, batch 600, loss[loss=0.1525, simple_loss=0.2384, pruned_loss=0.03329, over 24510.00 frames. ], tot_loss[loss=0.151, simple_loss=0.244, pruned_loss=0.029, over 4556086.16 frames. ], batch size: 75, lr: 3.77e-03, grad_scale: 32.0 2023-12-04 18:20:17,302 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.122e+02 1.252e+02 1.364e+02 1.482e+02 2.086e+02, threshold=2.729e+02, percent-clipped=0.0 2023-12-04 18:20:21,464 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=386000.0, ans=0.2 2023-12-04 18:20:21,979 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.21 vs. limit=6.0 2023-12-04 18:20:54,482 INFO [train.py:1087] (1/4) Epoch 65, batch 650, loss[loss=0.1471, simple_loss=0.2401, pruned_loss=0.02707, over 24804.00 frames. ], tot_loss[loss=0.1508, simple_loss=0.2436, pruned_loss=0.02894, over 4617049.23 frames. ], batch size: 72, lr: 3.77e-03, grad_scale: 32.0 2023-12-04 18:21:04,566 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2023-12-04 18:21:35,678 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=386400.0, ans=0.1 2023-12-04 18:21:48,664 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=386466.6666666667, ans=0.1 2023-12-04 18:21:48,730 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=386466.6666666667, ans=0.125 2023-12-04 18:21:56,403 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=386466.6666666667, ans=0.2 2023-12-04 18:22:02,274 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=386466.6666666667, ans=0.0 2023-12-04 18:22:03,491 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=386533.3333333333, ans=0.1 2023-12-04 18:22:04,482 INFO [train.py:1087] (1/4) Epoch 65, batch 700, loss[loss=0.1459, simple_loss=0.2404, pruned_loss=0.0257, over 24778.00 frames. ], tot_loss[loss=0.1507, simple_loss=0.2437, pruned_loss=0.0289, over 4672731.54 frames. 
], batch size: 70, lr: 3.77e-03, grad_scale: 32.0 2023-12-04 18:22:13,210 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=386533.3333333333, ans=0.125 2023-12-04 18:22:27,770 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=386600.0, ans=0.125 2023-12-04 18:22:30,846 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=386666.6666666667, ans=0.1 2023-12-04 18:22:36,833 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.292e+02 1.390e+02 1.496e+02 2.025e+02, threshold=2.779e+02, percent-clipped=0.0 2023-12-04 18:22:38,558 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=386666.6666666667, ans=0.0 2023-12-04 18:22:43,870 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=386666.6666666667, ans=0.0 2023-12-04 18:23:03,527 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.90 vs. limit=15.0 2023-12-04 18:23:09,114 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=386800.0, ans=0.0 2023-12-04 18:23:15,137 INFO [train.py:1087] (1/4) Epoch 65, batch 750, loss[loss=0.1577, simple_loss=0.2487, pruned_loss=0.0333, over 24797.00 frames. ], tot_loss[loss=0.1506, simple_loss=0.2436, pruned_loss=0.02877, over 4700948.38 frames. ], batch size: 62, lr: 3.76e-03, grad_scale: 32.0 2023-12-04 18:23:19,548 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=386866.6666666667, ans=0.125 2023-12-04 18:23:40,718 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=387000.0, ans=0.125 2023-12-04 18:24:00,463 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.35 vs. limit=10.0 2023-12-04 18:24:08,260 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.93 vs. limit=22.5 2023-12-04 18:24:22,883 INFO [train.py:1087] (1/4) Epoch 65, batch 800, loss[loss=0.1491, simple_loss=0.244, pruned_loss=0.02713, over 24727.00 frames. ], tot_loss[loss=0.1508, simple_loss=0.2439, pruned_loss=0.02889, over 4706990.25 frames. 
], batch size: 67, lr: 3.76e-03, grad_scale: 32.0 2023-12-04 18:24:36,069 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 18:24:52,390 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.108e+02 1.265e+02 1.343e+02 1.484e+02 1.892e+02, threshold=2.686e+02, percent-clipped=0.0 2023-12-04 18:24:56,360 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=387333.3333333333, ans=0.2 2023-12-04 18:24:57,552 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=387333.3333333333, ans=0.0 2023-12-04 18:24:58,618 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=387333.3333333333, ans=0.09899494936611666 2023-12-04 18:25:24,810 INFO [train.py:1087] (1/4) Epoch 65, batch 850, loss[loss=0.1463, simple_loss=0.2426, pruned_loss=0.02501, over 24768.00 frames. ], tot_loss[loss=0.1511, simple_loss=0.244, pruned_loss=0.02905, over 4737794.05 frames. ], batch size: 71, lr: 3.76e-03, grad_scale: 32.0 2023-12-04 18:25:35,241 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.16 vs. limit=22.5 2023-12-04 18:25:39,658 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=387600.0, ans=0.2 2023-12-04 18:25:43,239 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=387600.0, ans=0.125 2023-12-04 18:25:51,845 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.12 vs. limit=12.0 2023-12-04 18:25:53,845 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=387666.6666666667, ans=0.125 2023-12-04 18:25:57,946 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=387666.6666666667, ans=0.125 2023-12-04 18:26:01,758 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=387733.3333333333, ans=0.0 2023-12-04 18:26:15,964 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=387800.0, ans=0.125 2023-12-04 18:26:37,639 INFO [train.py:1087] (1/4) Epoch 66, batch 0, loss[loss=0.1448, simple_loss=0.2429, pruned_loss=0.02339, over 24735.00 frames. ], tot_loss[loss=0.1448, simple_loss=0.2429, pruned_loss=0.02339, over 24735.00 frames. ], batch size: 74, lr: 3.73e-03, grad_scale: 32.0 2023-12-04 18:26:37,640 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 18:26:49,444 INFO [zipformer.py:1876] (1/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.1232, 4.2358, 3.8572, 3.8678], device='cuda:1') 2023-12-04 18:26:53,166 INFO [train.py:1119] (1/4) Epoch 66, validation: loss=0.1505, simple_loss=0.2474, pruned_loss=0.02677, over 944034.00 frames. 
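A note on reading the records above, with a minimal sketch below; the helper names are illustrative only, not functions defined in train.py or optim.py. The per-batch loss[...] value is consistent with the pruned-transducer combination loss = simple_loss_scale * simple_loss + pruned_loss with simple_loss_scale = 0.5 from the run configuration, e.g. 0.5 * 0.2474 + 0.02677 ≈ 0.1505 for the Epoch 66 validation record just above. The tot_loss[... over N frames] field behaves like an exponentially decaying, frame-weighted running average: its frame count climbs toward roughly reset_interval (200) times the per-batch frame count instead of growing without bound. The optim.py clipping threshold appears to be Clipping_scale times the middle value of the reported grad-norm quartiles, e.g. 2.0 * 1.343e+02 = 2.686e+02 in the record above, with percent-clipped giving the fraction of recent batches whose gradient norm exceeded that threshold.

# Illustrative sketch only -- these helpers reconstruct the arithmetic seen in
# the log records above; they are not code from train.py or optim.py.

def combined_loss(simple_loss, pruned_loss, simple_loss_scale=0.5):
    # Per-batch "loss[...]" value, e.g. 0.5 * 0.2474 + 0.02677 ~= 0.1505.
    return simple_loss_scale * simple_loss + pruned_loss

def update_tot_loss(tot_sum, tot_frames, batch_loss, batch_frames,
                    reset_interval=200):
    # Decaying accumulation consistent with how "tot_loss[... over N frames]"
    # evolves: the frame count plateaus near reset_interval * batch_frames.
    decay = 1.0 - 1.0 / reset_interval
    tot_sum = tot_sum * decay + batch_loss * batch_frames
    tot_frames = tot_frames * decay + batch_frames
    return tot_sum, tot_frames  # reported tot_loss = tot_sum / tot_frames

def clip_threshold(median_grad_norm, clipping_scale=2.0):
    # Threshold printed by optim.py, e.g. 2.0 * 1.343e+02 = 2.686e+02.
    return clipping_scale * median_grad_norm

if __name__ == "__main__":
    print(round(combined_loss(0.2474, 0.02677), 4))  # 0.1505, as in the validation record
    print(clip_threshold(1.343e2))                   # 268.6, matching threshold=2.686e+02

The grad_scale value in the same records (32.0, dropping to 16.0 and later returning to 32.0) is consistent with dynamic fp16 loss scaling, where the scale is reduced after an overflow and grown back as training stabilises.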
2023-12-04 18:26:53,167 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 18:26:57,494 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_na.min_abs, batch_count=387833.3333333333, ans=0.02 2023-12-04 18:27:13,622 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=387900.0, ans=0.0 2023-12-04 18:27:13,981 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.04 vs. limit=10.0 2023-12-04 18:27:32,106 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.103e+02 1.283e+02 1.412e+02 1.512e+02 1.954e+02, threshold=2.825e+02, percent-clipped=0.0 2023-12-04 18:27:49,156 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=388100.0, ans=0.09899494936611666 2023-12-04 18:28:03,096 INFO [train.py:1087] (1/4) Epoch 66, batch 50, loss[loss=0.1395, simple_loss=0.2376, pruned_loss=0.02071, over 24575.00 frames. ], tot_loss[loss=0.1516, simple_loss=0.2451, pruned_loss=0.029, over 1087329.40 frames. ], batch size: 65, lr: 3.73e-03, grad_scale: 32.0 2023-12-04 18:28:12,639 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=388166.6666666667, ans=0.125 2023-12-04 18:28:41,819 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=388300.0, ans=0.2 2023-12-04 18:28:42,933 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=388366.6666666667, ans=0.125 2023-12-04 18:29:10,107 INFO [train.py:1087] (1/4) Epoch 66, batch 100, loss[loss=0.1552, simple_loss=0.2507, pruned_loss=0.02983, over 24054.00 frames. ], tot_loss[loss=0.1507, simple_loss=0.2445, pruned_loss=0.02849, over 1911476.62 frames. ], batch size: 87, lr: 3.73e-03, grad_scale: 32.0 2023-12-04 18:29:23,072 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=388500.0, ans=0.0 2023-12-04 18:29:28,562 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=388566.6666666667, ans=0.04949747468305833 2023-12-04 18:29:39,131 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.76 vs. limit=15.0 2023-12-04 18:29:48,533 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.055e+02 1.248e+02 1.303e+02 1.427e+02 1.853e+02, threshold=2.607e+02, percent-clipped=0.0 2023-12-04 18:30:18,594 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=388833.3333333333, ans=0.125 2023-12-04 18:30:19,462 INFO [train.py:1087] (1/4) Epoch 66, batch 150, loss[loss=0.1495, simple_loss=0.2422, pruned_loss=0.02841, over 24738.00 frames. ], tot_loss[loss=0.1504, simple_loss=0.244, pruned_loss=0.02842, over 2554003.94 frames. 
], batch size: 61, lr: 3.73e-03, grad_scale: 32.0 2023-12-04 18:30:52,904 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=388966.6666666667, ans=0.035 2023-12-04 18:30:56,208 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=22.5 2023-12-04 18:31:06,019 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=389033.3333333333, ans=0.125 2023-12-04 18:31:24,990 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=389100.0, ans=0.1 2023-12-04 18:31:29,879 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=389166.6666666667, ans=0.125 2023-12-04 18:31:30,889 INFO [train.py:1087] (1/4) Epoch 66, batch 200, loss[loss=0.1412, simple_loss=0.2384, pruned_loss=0.02201, over 24564.00 frames. ], tot_loss[loss=0.1508, simple_loss=0.2441, pruned_loss=0.02879, over 3043341.54 frames. ], batch size: 65, lr: 3.72e-03, grad_scale: 32.0 2023-12-04 18:31:36,685 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-12-04 18:31:58,996 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.77 vs. limit=22.5 2023-12-04 18:32:02,747 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=389300.0, ans=0.0 2023-12-04 18:32:09,661 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.154e+02 1.275e+02 1.356e+02 1.462e+02 1.717e+02, threshold=2.712e+02, percent-clipped=0.0 2023-12-04 18:32:19,625 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=389366.6666666667, ans=0.1 2023-12-04 18:32:27,781 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=22.5 2023-12-04 18:32:39,986 INFO [train.py:1087] (1/4) Epoch 66, batch 250, loss[loss=0.1492, simple_loss=0.2429, pruned_loss=0.02777, over 24573.00 frames. ], tot_loss[loss=0.1513, simple_loss=0.2447, pruned_loss=0.02898, over 3426788.51 frames. ], batch size: 64, lr: 3.72e-03, grad_scale: 32.0 2023-12-04 18:33:24,447 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=389700.0, ans=0.0 2023-12-04 18:33:48,609 INFO [train.py:1087] (1/4) Epoch 66, batch 300, loss[loss=0.1486, simple_loss=0.2457, pruned_loss=0.0257, over 24768.00 frames. ], tot_loss[loss=0.1512, simple_loss=0.2444, pruned_loss=0.02898, over 3731212.99 frames. 
], batch size: 71, lr: 3.72e-03, grad_scale: 32.0 2023-12-04 18:33:59,751 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=389833.3333333333, ans=0.0 2023-12-04 18:34:01,076 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=389900.0, ans=0.125 2023-12-04 18:34:27,121 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.086e+02 1.271e+02 1.351e+02 1.464e+02 1.947e+02, threshold=2.703e+02, percent-clipped=0.0 2023-12-04 18:34:35,565 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=390033.3333333333, ans=0.125 2023-12-04 18:34:35,928 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.24 vs. limit=15.0 2023-12-04 18:34:47,951 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=390100.0, ans=0.125 2023-12-04 18:34:56,323 INFO [train.py:1087] (1/4) Epoch 66, batch 350, loss[loss=0.151, simple_loss=0.2479, pruned_loss=0.02701, over 22833.00 frames. ], tot_loss[loss=0.1513, simple_loss=0.2445, pruned_loss=0.02909, over 3965488.04 frames. ], batch size: 106, lr: 3.72e-03, grad_scale: 32.0 2023-12-04 18:35:13,773 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=9.28 vs. limit=12.0 2023-12-04 18:35:15,722 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=390233.3333333333, ans=0.0 2023-12-04 18:35:32,074 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=390300.0, ans=0.025 2023-12-04 18:35:35,417 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.02 vs. limit=22.5 2023-12-04 18:35:54,676 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.39 vs. limit=15.0 2023-12-04 18:35:56,016 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-12-04 18:36:04,976 INFO [train.py:1087] (1/4) Epoch 66, batch 400, loss[loss=0.1672, simple_loss=0.2572, pruned_loss=0.03858, over 24447.00 frames. ], tot_loss[loss=0.1513, simple_loss=0.2443, pruned_loss=0.02914, over 4142025.27 frames. ], batch size: 77, lr: 3.72e-03, grad_scale: 32.0 2023-12-04 18:36:22,427 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.57 vs. limit=15.0 2023-12-04 18:36:37,087 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.54 vs. 
limit=15.0 2023-12-04 18:36:43,259 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.065e+02 1.262e+02 1.332e+02 1.446e+02 1.853e+02, threshold=2.665e+02, percent-clipped=0.0 2023-12-04 18:36:44,890 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=390700.0, ans=0.2 2023-12-04 18:36:49,628 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.88 vs. limit=15.0 2023-12-04 18:36:57,176 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=390700.0, ans=0.0 2023-12-04 18:37:08,107 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=390766.6666666667, ans=0.125 2023-12-04 18:37:11,027 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.70 vs. limit=12.0 2023-12-04 18:37:12,199 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=390766.6666666667, ans=0.5 2023-12-04 18:37:14,347 INFO [train.py:1087] (1/4) Epoch 66, batch 450, loss[loss=0.1506, simple_loss=0.2489, pruned_loss=0.02619, over 22847.00 frames. ], tot_loss[loss=0.1509, simple_loss=0.2439, pruned_loss=0.02892, over 4294053.03 frames. ], batch size: 106, lr: 3.72e-03, grad_scale: 32.0 2023-12-04 18:37:16,159 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=390833.3333333333, ans=0.125 2023-12-04 18:37:29,795 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=390900.0, ans=0.0 2023-12-04 18:38:23,173 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=391166.6666666667, ans=0.035 2023-12-04 18:38:24,175 INFO [train.py:1087] (1/4) Epoch 66, batch 500, loss[loss=0.1404, simple_loss=0.2322, pruned_loss=0.02427, over 24607.00 frames. ], tot_loss[loss=0.1506, simple_loss=0.2437, pruned_loss=0.02873, over 4407255.16 frames. 
], batch size: 68, lr: 3.72e-03, grad_scale: 32.0 2023-12-04 18:38:28,508 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=391166.6666666667, ans=0.125 2023-12-04 18:38:41,879 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=391233.3333333333, ans=0.125 2023-12-04 18:39:02,976 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.093e+02 1.271e+02 1.374e+02 1.457e+02 1.879e+02, threshold=2.748e+02, percent-clipped=0.0 2023-12-04 18:39:05,921 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=391366.6666666667, ans=0.125 2023-12-04 18:39:06,009 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=391366.6666666667, ans=0.2 2023-12-04 18:39:15,394 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=391366.6666666667, ans=0.125 2023-12-04 18:39:18,443 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.03 vs. limit=15.0 2023-12-04 18:39:27,420 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.59 vs. limit=15.0 2023-12-04 18:39:32,567 INFO [train.py:1087] (1/4) Epoch 66, batch 550, loss[loss=0.1551, simple_loss=0.2488, pruned_loss=0.0307, over 22682.00 frames. ], tot_loss[loss=0.1507, simple_loss=0.244, pruned_loss=0.02866, over 4500706.52 frames. ], batch size: 106, lr: 3.71e-03, grad_scale: 32.0 2023-12-04 18:39:57,409 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=391566.6666666667, ans=0.125 2023-12-04 18:40:17,067 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=391700.0, ans=0.09899494936611666 2023-12-04 18:40:41,244 INFO [train.py:1087] (1/4) Epoch 66, batch 600, loss[loss=0.1687, simple_loss=0.2594, pruned_loss=0.039, over 24335.00 frames. ], tot_loss[loss=0.1501, simple_loss=0.2435, pruned_loss=0.02837, over 4574726.28 frames. ], batch size: 79, lr: 3.71e-03, grad_scale: 32.0 2023-12-04 18:40:56,270 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=391900.0, ans=0.125 2023-12-04 18:40:59,461 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=391900.0, ans=0.0 2023-12-04 18:41:07,970 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.21 vs. limit=15.0 2023-12-04 18:41:17,191 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=391966.6666666667, ans=0.1 2023-12-04 18:41:20,747 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.105e+02 1.262e+02 1.356e+02 1.449e+02 2.093e+02, threshold=2.711e+02, percent-clipped=0.0 2023-12-04 18:41:51,265 INFO [train.py:1087] (1/4) Epoch 66, batch 650, loss[loss=0.1444, simple_loss=0.2373, pruned_loss=0.02576, over 24763.00 frames. ], tot_loss[loss=0.1507, simple_loss=0.2438, pruned_loss=0.02874, over 4602401.36 frames. 
], batch size: 64, lr: 3.71e-03, grad_scale: 64.0 2023-12-04 18:41:58,548 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.00 vs. limit=15.0 2023-12-04 18:42:05,320 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.87 vs. limit=15.0 2023-12-04 18:42:10,088 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=392233.3333333333, ans=0.125 2023-12-04 18:42:25,715 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.54 vs. limit=10.0 2023-12-04 18:42:54,651 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=392433.3333333333, ans=0.1 2023-12-04 18:43:02,102 INFO [train.py:1087] (1/4) Epoch 66, batch 700, loss[loss=0.1439, simple_loss=0.2392, pruned_loss=0.02425, over 24694.00 frames. ], tot_loss[loss=0.1504, simple_loss=0.2438, pruned_loss=0.02854, over 4651017.27 frames. ], batch size: 74, lr: 3.71e-03, grad_scale: 32.0 2023-12-04 18:43:13,070 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=392500.0, ans=0.125 2023-12-04 18:43:17,779 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.07 vs. limit=15.0 2023-12-04 18:43:43,672 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.081e+02 1.264e+02 1.339e+02 1.470e+02 1.975e+02, threshold=2.678e+02, percent-clipped=0.0 2023-12-04 18:44:07,485 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.82 vs. limit=15.0 2023-12-04 18:44:13,093 INFO [train.py:1087] (1/4) Epoch 66, batch 750, loss[loss=0.1473, simple_loss=0.2418, pruned_loss=0.02638, over 24695.00 frames. ], tot_loss[loss=0.1507, simple_loss=0.2441, pruned_loss=0.02864, over 4685491.46 frames. ], batch size: 74, lr: 3.71e-03, grad_scale: 32.0 2023-12-04 18:44:48,107 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=392966.6666666667, ans=0.1 2023-12-04 18:45:19,047 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=393100.0, ans=0.0 2023-12-04 18:45:22,758 INFO [train.py:1087] (1/4) Epoch 66, batch 800, loss[loss=0.1472, simple_loss=0.2383, pruned_loss=0.02806, over 24746.00 frames. ], tot_loss[loss=0.1506, simple_loss=0.2439, pruned_loss=0.02864, over 4717849.36 frames. 
], batch size: 66, lr: 3.71e-03, grad_scale: 32.0 2023-12-04 18:45:28,759 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=393166.6666666667, ans=0.0 2023-12-04 18:45:50,693 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=393300.0, ans=0.2 2023-12-04 18:46:00,575 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.146e+02 1.263e+02 1.349e+02 1.436e+02 2.203e+02, threshold=2.698e+02, percent-clipped=0.0 2023-12-04 18:46:15,254 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=393433.3333333333, ans=0.2 2023-12-04 18:46:20,380 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=393433.3333333333, ans=0.125 2023-12-04 18:46:23,926 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=393433.3333333333, ans=0.125 2023-12-04 18:46:27,419 INFO [train.py:1087] (1/4) Epoch 66, batch 850, loss[loss=0.1954, simple_loss=0.2767, pruned_loss=0.05707, over 16288.00 frames. ], tot_loss[loss=0.1505, simple_loss=0.2437, pruned_loss=0.02862, over 4735621.99 frames. ], batch size: 177, lr: 3.70e-03, grad_scale: 32.0 2023-12-04 18:46:45,574 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.75 vs. limit=22.5 2023-12-04 18:46:53,052 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.71 vs. limit=15.0 2023-12-04 18:47:45,727 INFO [train.py:1087] (1/4) Epoch 67, batch 0, loss[loss=0.1388, simple_loss=0.2407, pruned_loss=0.01849, over 24705.00 frames. ], tot_loss[loss=0.1388, simple_loss=0.2407, pruned_loss=0.01849, over 24705.00 frames. ], batch size: 74, lr: 3.68e-03, grad_scale: 32.0 2023-12-04 18:47:45,730 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 18:48:02,848 INFO [train.py:1119] (1/4) Epoch 67, validation: loss=0.1507, simple_loss=0.2474, pruned_loss=0.02701, over 944034.00 frames. 2023-12-04 18:48:02,850 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 18:48:04,721 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=393800.0, ans=0.0 2023-12-04 18:48:13,958 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=393800.0, ans=0.0 2023-12-04 18:48:50,409 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.049e+02 1.306e+02 1.374e+02 1.528e+02 2.366e+02, threshold=2.749e+02, percent-clipped=0.0 2023-12-04 18:49:14,005 INFO [train.py:1087] (1/4) Epoch 67, batch 50, loss[loss=0.149, simple_loss=0.2429, pruned_loss=0.02756, over 23737.00 frames. ], tot_loss[loss=0.1514, simple_loss=0.2446, pruned_loss=0.02914, over 1075917.94 frames. 
], batch size: 57, lr: 3.67e-03, grad_scale: 32.0 2023-12-04 18:49:16,983 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=394133.3333333333, ans=0.0 2023-12-04 18:49:41,069 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=394266.6666666667, ans=0.1 2023-12-04 18:49:44,197 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 18:49:56,326 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=394333.3333333333, ans=0.2 2023-12-04 18:50:02,776 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=394333.3333333333, ans=0.125 2023-12-04 18:50:09,205 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=394400.0, ans=0.125 2023-12-04 18:50:09,272 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=394400.0, ans=0.0 2023-12-04 18:50:16,563 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.83 vs. limit=15.0 2023-12-04 18:50:22,288 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=394466.6666666667, ans=0.0 2023-12-04 18:50:23,072 INFO [train.py:1087] (1/4) Epoch 67, batch 100, loss[loss=0.1472, simple_loss=0.2433, pruned_loss=0.02552, over 24000.00 frames. ], tot_loss[loss=0.1507, simple_loss=0.2442, pruned_loss=0.02862, over 1907422.62 frames. ], batch size: 87, lr: 3.67e-03, grad_scale: 32.0 2023-12-04 18:50:28,615 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=394466.6666666667, ans=10.0 2023-12-04 18:50:29,211 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.74 vs. limit=15.0 2023-12-04 18:50:30,658 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=394466.6666666667, ans=0.5 2023-12-04 18:50:32,674 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.67 vs. limit=22.5 2023-12-04 18:50:36,233 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=394466.6666666667, ans=0.0 2023-12-04 18:50:37,630 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=394533.3333333333, ans=0.1 2023-12-04 18:50:43,031 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.59 vs. 
limit=15.0 2023-12-04 18:50:44,059 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=394533.3333333333, ans=0.125 2023-12-04 18:51:04,090 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=394666.6666666667, ans=0.0 2023-12-04 18:51:08,891 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=394666.6666666667, ans=0.125 2023-12-04 18:51:11,858 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.107e+02 1.252e+02 1.351e+02 1.505e+02 2.091e+02, threshold=2.703e+02, percent-clipped=0.0 2023-12-04 18:51:32,819 INFO [train.py:1087] (1/4) Epoch 67, batch 150, loss[loss=0.1488, simple_loss=0.2393, pruned_loss=0.02913, over 24723.00 frames. ], tot_loss[loss=0.1503, simple_loss=0.2438, pruned_loss=0.02842, over 2559561.86 frames. ], batch size: 63, lr: 3.67e-03, grad_scale: 32.0 2023-12-04 18:51:53,339 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=394866.6666666667, ans=0.125 2023-12-04 18:51:55,928 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=394866.6666666667, ans=0.0 2023-12-04 18:52:23,372 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=395000.0, ans=0.1 2023-12-04 18:52:42,565 INFO [train.py:1087] (1/4) Epoch 67, batch 200, loss[loss=0.1499, simple_loss=0.2468, pruned_loss=0.0265, over 24697.00 frames. ], tot_loss[loss=0.1501, simple_loss=0.2435, pruned_loss=0.02834, over 3058018.30 frames. ], batch size: 69, lr: 3.67e-03, grad_scale: 32.0 2023-12-04 18:52:45,708 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=395133.3333333333, ans=0.1 2023-12-04 18:52:58,120 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=395200.0, ans=0.2 2023-12-04 18:52:59,733 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-12-04 18:53:21,885 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=395266.6666666667, ans=0.2 2023-12-04 18:53:24,912 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=12.0 2023-12-04 18:53:25,826 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=395333.3333333333, ans=0.125 2023-12-04 18:53:30,519 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.095e+02 1.241e+02 1.329e+02 1.420e+02 2.138e+02, threshold=2.658e+02, percent-clipped=0.0 2023-12-04 18:53:33,902 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=395333.3333333333, ans=0.125 2023-12-04 18:53:40,406 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=395400.0, ans=0.125 2023-12-04 18:53:53,696 INFO [train.py:1087] (1/4) Epoch 67, batch 250, loss[loss=0.1492, simple_loss=0.2434, pruned_loss=0.02744, over 23977.00 frames. 
], tot_loss[loss=0.1499, simple_loss=0.2433, pruned_loss=0.02832, over 3452968.07 frames. ], batch size: 87, lr: 3.67e-03, grad_scale: 32.0 2023-12-04 18:54:17,885 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=395533.3333333333, ans=0.125 2023-12-04 18:54:21,857 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=395600.0, ans=0.2 2023-12-04 18:54:32,127 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=395600.0, ans=0.0 2023-12-04 18:54:56,181 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.24 vs. limit=15.0 2023-12-04 18:55:04,089 INFO [train.py:1087] (1/4) Epoch 67, batch 300, loss[loss=0.1476, simple_loss=0.2429, pruned_loss=0.02618, over 24747.00 frames. ], tot_loss[loss=0.1501, simple_loss=0.2434, pruned_loss=0.02841, over 3744778.11 frames. ], batch size: 63, lr: 3.67e-03, grad_scale: 32.0 2023-12-04 18:55:53,470 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.050e+02 1.265e+02 1.360e+02 1.434e+02 1.941e+02, threshold=2.720e+02, percent-clipped=0.0 2023-12-04 18:56:16,000 INFO [train.py:1087] (1/4) Epoch 67, batch 350, loss[loss=0.1496, simple_loss=0.2414, pruned_loss=0.02888, over 24799.00 frames. ], tot_loss[loss=0.1505, simple_loss=0.2437, pruned_loss=0.02863, over 3973444.92 frames. ], batch size: 62, lr: 3.66e-03, grad_scale: 32.0 2023-12-04 18:56:26,294 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.56 vs. limit=15.0 2023-12-04 18:56:33,566 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=396200.0, ans=0.125 2023-12-04 18:56:53,117 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=396266.6666666667, ans=0.0 2023-12-04 18:56:53,201 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=396266.6666666667, ans=0.02 2023-12-04 18:56:54,413 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 18:57:03,921 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.28 vs. limit=22.5 2023-12-04 18:57:09,592 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=396333.3333333333, ans=10.0 2023-12-04 18:57:28,732 INFO [train.py:1087] (1/4) Epoch 67, batch 400, loss[loss=0.155, simple_loss=0.2503, pruned_loss=0.02988, over 24793.00 frames. ], tot_loss[loss=0.1501, simple_loss=0.2435, pruned_loss=0.02834, over 4175839.20 frames. 
], batch size: 73, lr: 3.66e-03, grad_scale: 32.0 2023-12-04 18:57:31,037 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=396466.6666666667, ans=0.0 2023-12-04 18:57:33,900 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=396466.6666666667, ans=0.1 2023-12-04 18:57:57,923 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=396600.0, ans=0.125 2023-12-04 18:57:57,975 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=396600.0, ans=0.125 2023-12-04 18:58:02,441 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.67 vs. limit=12.0 2023-12-04 18:58:10,423 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=396600.0, ans=0.2 2023-12-04 18:58:11,708 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=396666.6666666667, ans=0.2 2023-12-04 18:58:18,880 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.095e+02 1.226e+02 1.325e+02 1.425e+02 1.742e+02, threshold=2.649e+02, percent-clipped=0.0 2023-12-04 18:58:41,489 INFO [train.py:1087] (1/4) Epoch 67, batch 450, loss[loss=0.1486, simple_loss=0.2406, pruned_loss=0.0283, over 24619.00 frames. ], tot_loss[loss=0.1504, simple_loss=0.2438, pruned_loss=0.02855, over 4309879.72 frames. ], batch size: 68, lr: 3.66e-03, grad_scale: 16.0 2023-12-04 18:58:51,437 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.96 vs. limit=12.0 2023-12-04 18:59:20,196 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-12-04 18:59:28,900 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=397000.0, ans=0.125 2023-12-04 18:59:33,146 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=397000.0, ans=0.0 2023-12-04 18:59:40,302 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=397066.6666666667, ans=0.04949747468305833 2023-12-04 18:59:51,807 INFO [train.py:1087] (1/4) Epoch 67, batch 500, loss[loss=0.1395, simple_loss=0.2313, pruned_loss=0.02379, over 24771.00 frames. ], tot_loss[loss=0.1498, simple_loss=0.243, pruned_loss=0.02827, over 4432179.03 frames. 
], batch size: 70, lr: 3.66e-03, grad_scale: 16.0 2023-12-04 19:00:08,923 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=397200.0, ans=0.2 2023-12-04 19:00:16,774 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=397200.0, ans=0.125 2023-12-04 19:00:41,470 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.088e+02 1.259e+02 1.344e+02 1.490e+02 1.984e+02, threshold=2.687e+02, percent-clipped=0.0 2023-12-04 19:00:49,913 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=397400.0, ans=0.0 2023-12-04 19:00:58,317 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.03 vs. limit=10.0 2023-12-04 19:01:01,417 INFO [train.py:1087] (1/4) Epoch 67, batch 550, loss[loss=0.1434, simple_loss=0.2387, pruned_loss=0.02408, over 24725.00 frames. ], tot_loss[loss=0.1501, simple_loss=0.2433, pruned_loss=0.02839, over 4512171.16 frames. ], batch size: 67, lr: 3.66e-03, grad_scale: 16.0 2023-12-04 19:01:09,227 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.47 vs. limit=12.0 2023-12-04 19:01:16,305 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=397533.3333333333, ans=0.2 2023-12-04 19:01:46,820 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=397666.6666666667, ans=0.125 2023-12-04 19:02:02,111 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.59 vs. limit=15.0 2023-12-04 19:02:12,012 INFO [train.py:1087] (1/4) Epoch 67, batch 600, loss[loss=0.1583, simple_loss=0.2516, pruned_loss=0.03247, over 24297.00 frames. ], tot_loss[loss=0.1499, simple_loss=0.2433, pruned_loss=0.02826, over 4570975.46 frames. ], batch size: 79, lr: 3.66e-03, grad_scale: 16.0 2023-12-04 19:02:24,663 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=397866.6666666667, ans=0.1 2023-12-04 19:02:36,825 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.44 vs. limit=22.5 2023-12-04 19:02:41,571 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=397933.3333333333, ans=0.125 2023-12-04 19:03:00,018 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.065e+02 1.244e+02 1.311e+02 1.453e+02 1.975e+02, threshold=2.621e+02, percent-clipped=0.0 2023-12-04 19:03:13,076 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=6.90 vs. limit=15.0 2023-12-04 19:03:21,699 INFO [train.py:1087] (1/4) Epoch 67, batch 650, loss[loss=0.157, simple_loss=0.2458, pruned_loss=0.03414, over 24456.00 frames. ], tot_loss[loss=0.1498, simple_loss=0.2433, pruned_loss=0.02815, over 4632638.54 frames. 
], batch size: 77, lr: 3.66e-03, grad_scale: 16.0 2023-12-04 19:03:29,097 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-12-04 19:03:38,849 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=398200.0, ans=0.0 2023-12-04 19:03:42,800 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=398200.0, ans=0.125 2023-12-04 19:03:45,920 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.03 vs. limit=10.0 2023-12-04 19:03:52,460 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-12-04 19:03:58,802 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten.whitening_limit, batch_count=398266.6666666667, ans=15.0 2023-12-04 19:04:03,814 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=398333.3333333333, ans=0.125 2023-12-04 19:04:13,135 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=398333.3333333333, ans=0.125 2023-12-04 19:04:26,946 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.17 vs. limit=6.0 2023-12-04 19:04:31,944 INFO [train.py:1087] (1/4) Epoch 67, batch 700, loss[loss=0.1571, simple_loss=0.2525, pruned_loss=0.03087, over 24181.00 frames. ], tot_loss[loss=0.1498, simple_loss=0.2434, pruned_loss=0.02814, over 4666590.18 frames. ], batch size: 82, lr: 3.65e-03, grad_scale: 16.0 2023-12-04 19:04:42,932 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=398466.6666666667, ans=0.0 2023-12-04 19:04:55,011 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=398533.3333333333, ans=0.2 2023-12-04 19:05:02,335 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.67 vs. limit=22.5 2023-12-04 19:05:22,004 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.089e+02 1.250e+02 1.327e+02 1.419e+02 2.097e+02, threshold=2.655e+02, percent-clipped=0.0 2023-12-04 19:05:29,385 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.62 vs. limit=12.0 2023-12-04 19:05:35,643 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=398733.3333333333, ans=0.015 2023-12-04 19:05:43,133 INFO [train.py:1087] (1/4) Epoch 67, batch 750, loss[loss=0.1426, simple_loss=0.2405, pruned_loss=0.0224, over 24766.00 frames. ], tot_loss[loss=0.1497, simple_loss=0.2431, pruned_loss=0.02816, over 4689083.87 frames. 
], batch size: 70, lr: 3.65e-03, grad_scale: 16.0 2023-12-04 19:05:46,715 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=398800.0, ans=0.125 2023-12-04 19:05:49,492 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=398800.0, ans=0.125 2023-12-04 19:05:54,943 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=398800.0, ans=0.2 2023-12-04 19:06:01,840 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=398866.6666666667, ans=0.125 2023-12-04 19:06:09,525 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=398933.3333333333, ans=0.125 2023-12-04 19:06:52,770 INFO [train.py:1087] (1/4) Epoch 67, batch 800, loss[loss=0.1518, simple_loss=0.2414, pruned_loss=0.03115, over 24865.00 frames. ], tot_loss[loss=0.1494, simple_loss=0.2428, pruned_loss=0.02801, over 4725258.59 frames. ], batch size: 68, lr: 3.65e-03, grad_scale: 32.0 2023-12-04 19:07:07,371 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=399200.0, ans=0.0 2023-12-04 19:07:25,059 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=15.0 2023-12-04 19:07:27,196 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=399266.6666666667, ans=0.125 2023-12-04 19:07:38,271 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.095e+02 1.255e+02 1.337e+02 1.436e+02 1.884e+02, threshold=2.675e+02, percent-clipped=0.0 2023-12-04 19:07:48,843 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-12-04 19:07:49,511 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=399400.0, ans=0.2 2023-12-04 19:07:56,361 INFO [train.py:1087] (1/4) Epoch 67, batch 850, loss[loss=0.1418, simple_loss=0.2337, pruned_loss=0.02497, over 24747.00 frames. ], tot_loss[loss=0.1497, simple_loss=0.2429, pruned_loss=0.02819, over 4745316.03 frames. ], batch size: 66, lr: 3.65e-03, grad_scale: 32.0 2023-12-04 19:07:58,004 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=399466.6666666667, ans=0.125 2023-12-04 19:08:11,247 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=399533.3333333333, ans=0.1 2023-12-04 19:08:36,650 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=399666.6666666667, ans=0.0 2023-12-04 19:08:39,511 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=399666.6666666667, ans=0.0 2023-12-04 19:09:08,802 INFO [train.py:1087] (1/4) Epoch 68, batch 0, loss[loss=0.1556, simple_loss=0.2465, pruned_loss=0.03231, over 23404.00 frames. ], tot_loss[loss=0.1556, simple_loss=0.2465, pruned_loss=0.03231, over 23404.00 frames. 
], batch size: 94, lr: 3.62e-03, grad_scale: 32.0 2023-12-04 19:09:08,808 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 19:09:25,585 INFO [train.py:1119] (1/4) Epoch 68, validation: loss=0.151, simple_loss=0.2474, pruned_loss=0.02727, over 944034.00 frames. 2023-12-04 19:09:25,586 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 19:09:59,836 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=399900.0, ans=0.0 2023-12-04 19:10:01,306 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=399900.0, ans=0.0 2023-12-04 19:10:12,071 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.37 vs. limit=15.0 2023-12-04 19:10:20,724 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=399966.6666666667, ans=0.125 2023-12-04 19:10:25,402 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.105e+02 1.238e+02 1.333e+02 1.431e+02 2.439e+02, threshold=2.665e+02, percent-clipped=0.0 2023-12-04 19:10:26,338 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.87 vs. limit=15.0 2023-12-04 19:10:36,797 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=400033.3333333333, ans=0.125 2023-12-04 19:10:40,301 INFO [train.py:1087] (1/4) Epoch 68, batch 50, loss[loss=0.1548, simple_loss=0.2488, pruned_loss=0.03042, over 24562.00 frames. ], tot_loss[loss=0.1516, simple_loss=0.2446, pruned_loss=0.02924, over 1091138.81 frames. ], batch size: 64, lr: 3.62e-03, grad_scale: 32.0 2023-12-04 19:10:55,229 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=400166.6666666667, ans=0.1 2023-12-04 19:11:00,054 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.83 vs. limit=15.0 2023-12-04 19:11:15,652 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=400233.3333333333, ans=0.125 2023-12-04 19:11:27,632 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=400300.0, ans=0.125 2023-12-04 19:11:29,153 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=400300.0, ans=0.125 2023-12-04 19:11:49,752 INFO [train.py:1087] (1/4) Epoch 68, batch 100, loss[loss=0.1745, simple_loss=0.2613, pruned_loss=0.0439, over 16653.00 frames. ], tot_loss[loss=0.1505, simple_loss=0.2438, pruned_loss=0.02863, over 1906534.24 frames. 
], batch size: 177, lr: 3.62e-03, grad_scale: 16.0 2023-12-04 19:11:56,294 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=400433.3333333333, ans=0.07 2023-12-04 19:11:57,559 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=400433.3333333333, ans=0.125 2023-12-04 19:11:59,210 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.13 vs. limit=10.0 2023-12-04 19:12:00,134 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 19:12:43,265 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=400633.3333333333, ans=0.1 2023-12-04 19:12:46,696 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.140e+02 1.320e+02 1.380e+02 1.502e+02 2.105e+02, threshold=2.761e+02, percent-clipped=0.0 2023-12-04 19:12:58,642 INFO [train.py:1087] (1/4) Epoch 68, batch 150, loss[loss=0.1482, simple_loss=0.2376, pruned_loss=0.02938, over 24338.00 frames. ], tot_loss[loss=0.1504, simple_loss=0.2437, pruned_loss=0.0286, over 2543618.16 frames. ], batch size: 79, lr: 3.62e-03, grad_scale: 16.0 2023-12-04 19:13:12,924 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=400833.3333333333, ans=0.1 2023-12-04 19:14:07,721 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=401100.0, ans=0.125 2023-12-04 19:14:08,825 INFO [train.py:1087] (1/4) Epoch 68, batch 200, loss[loss=0.1478, simple_loss=0.244, pruned_loss=0.02583, over 24793.00 frames. ], tot_loss[loss=0.1504, simple_loss=0.2437, pruned_loss=0.02854, over 3056266.81 frames. ], batch size: 71, lr: 3.61e-03, grad_scale: 16.0 2023-12-04 19:14:09,200 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=401100.0, ans=0.125 2023-12-04 19:14:24,745 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=401166.6666666667, ans=0.0 2023-12-04 19:14:26,182 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=401166.6666666667, ans=0.0 2023-12-04 19:15:07,828 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.128e+02 1.273e+02 1.370e+02 1.496e+02 1.748e+02, threshold=2.741e+02, percent-clipped=0.0 2023-12-04 19:15:15,213 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.89 vs. limit=10.0 2023-12-04 19:15:19,779 INFO [train.py:1087] (1/4) Epoch 68, batch 250, loss[loss=0.1415, simple_loss=0.2359, pruned_loss=0.02353, over 24774.00 frames. ], tot_loss[loss=0.1499, simple_loss=0.2431, pruned_loss=0.02833, over 3450419.21 frames. ], batch size: 65, lr: 3.61e-03, grad_scale: 16.0 2023-12-04 19:15:25,339 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=401433.3333333333, ans=0.0 2023-12-04 19:15:44,379 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.23 vs. 
limit=15.0 2023-12-04 19:15:55,492 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-12-04 19:16:05,462 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=401633.3333333333, ans=0.0 2023-12-04 19:16:07,001 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=401633.3333333333, ans=0.0 2023-12-04 19:16:15,931 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=401700.0, ans=0.125 2023-12-04 19:16:17,622 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=401700.0, ans=0.125 2023-12-04 19:16:23,445 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=401700.0, ans=0.125 2023-12-04 19:16:25,656 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=401700.0, ans=0.125 2023-12-04 19:16:29,602 INFO [train.py:1087] (1/4) Epoch 68, batch 300, loss[loss=0.1398, simple_loss=0.2355, pruned_loss=0.02202, over 24798.00 frames. ], tot_loss[loss=0.1498, simple_loss=0.2432, pruned_loss=0.02823, over 3756557.20 frames. ], batch size: 73, lr: 3.61e-03, grad_scale: 16.0 2023-12-04 19:17:25,421 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.065e+02 1.247e+02 1.339e+02 1.460e+02 2.398e+02, threshold=2.677e+02, percent-clipped=0.0 2023-12-04 19:17:31,424 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.36 vs. limit=22.5 2023-12-04 19:17:37,293 INFO [train.py:1087] (1/4) Epoch 68, batch 350, loss[loss=0.1379, simple_loss=0.2284, pruned_loss=0.0237, over 24735.00 frames. ], tot_loss[loss=0.1493, simple_loss=0.2426, pruned_loss=0.02799, over 3994347.11 frames. ], batch size: 63, lr: 3.61e-03, grad_scale: 16.0 2023-12-04 19:17:53,101 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.80 vs. limit=15.0 2023-12-04 19:18:07,106 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=402233.3333333333, ans=0.0 2023-12-04 19:18:09,779 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=402233.3333333333, ans=0.125 2023-12-04 19:18:35,579 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=402366.6666666667, ans=0.125 2023-12-04 19:18:47,046 INFO [train.py:1087] (1/4) Epoch 68, batch 400, loss[loss=0.1494, simple_loss=0.2456, pruned_loss=0.02661, over 24777.00 frames. ], tot_loss[loss=0.1497, simple_loss=0.243, pruned_loss=0.02822, over 4169599.01 frames. ], batch size: 71, lr: 3.61e-03, grad_scale: 32.0 2023-12-04 19:18:53,209 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.70 vs. 
limit=12.0 2023-12-04 19:18:57,595 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=402433.3333333333, ans=0.0 2023-12-04 19:19:04,087 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=402500.0, ans=0.0 2023-12-04 19:19:17,809 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=402566.6666666667, ans=0.1 2023-12-04 19:19:30,001 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=402633.3333333333, ans=0.125 2023-12-04 19:19:41,195 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=402633.3333333333, ans=0.0 2023-12-04 19:19:44,727 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 19:19:45,413 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.111e+02 1.227e+02 1.311e+02 1.473e+02 1.957e+02, threshold=2.622e+02, percent-clipped=0.0 2023-12-04 19:19:45,827 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=402700.0, ans=0.1 2023-12-04 19:19:47,177 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 19:19:49,626 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=402700.0, ans=0.5 2023-12-04 19:19:52,390 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=402700.0, ans=10.0 2023-12-04 19:19:57,925 INFO [train.py:1087] (1/4) Epoch 68, batch 450, loss[loss=0.1593, simple_loss=0.2493, pruned_loss=0.03471, over 23710.00 frames. ], tot_loss[loss=0.1497, simple_loss=0.243, pruned_loss=0.02823, over 4304496.41 frames. ], batch size: 57, lr: 3.61e-03, grad_scale: 32.0 2023-12-04 19:20:12,606 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=402833.3333333333, ans=0.2 2023-12-04 19:21:00,624 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 19:21:06,550 INFO [train.py:1087] (1/4) Epoch 68, batch 500, loss[loss=0.1414, simple_loss=0.2323, pruned_loss=0.02528, over 24760.00 frames. ], tot_loss[loss=0.1499, simple_loss=0.2433, pruned_loss=0.02822, over 4403722.05 frames. ], batch size: 65, lr: 3.61e-03, grad_scale: 32.0 2023-12-04 19:21:35,934 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=403233.3333333333, ans=0.125 2023-12-04 19:21:54,715 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.94 vs. limit=10.0 2023-12-04 19:22:04,953 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.145e+02 1.270e+02 1.394e+02 1.461e+02 1.900e+02, threshold=2.787e+02, percent-clipped=0.0 2023-12-04 19:22:16,554 INFO [train.py:1087] (1/4) Epoch 68, batch 550, loss[loss=0.1548, simple_loss=0.2459, pruned_loss=0.03184, over 24041.00 frames. ], tot_loss[loss=0.1503, simple_loss=0.2435, pruned_loss=0.0285, over 4491347.35 frames. 
], batch size: 87, lr: 3.60e-03, grad_scale: 16.0 2023-12-04 19:22:20,081 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=403433.3333333333, ans=10.0 2023-12-04 19:22:24,032 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=403433.3333333333, ans=0.1 2023-12-04 19:22:27,877 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=403433.3333333333, ans=0.0 2023-12-04 19:22:30,681 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=403500.0, ans=0.0 2023-12-04 19:22:34,575 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=403500.0, ans=0.1 2023-12-04 19:22:36,053 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=403500.0, ans=0.0 2023-12-04 19:22:44,603 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 19:23:12,724 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 19:23:12,990 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.40 vs. limit=22.5 2023-12-04 19:23:25,412 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0 2023-12-04 19:23:25,790 INFO [train.py:1087] (1/4) Epoch 68, batch 600, loss[loss=0.1648, simple_loss=0.2633, pruned_loss=0.03321, over 22767.00 frames. ], tot_loss[loss=0.1501, simple_loss=0.2434, pruned_loss=0.02844, over 4572536.34 frames. ], batch size: 106, lr: 3.60e-03, grad_scale: 16.0 2023-12-04 19:23:30,714 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-12-04 19:23:39,825 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=403833.3333333333, ans=0.125 2023-12-04 19:24:03,300 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=403900.0, ans=0.1 2023-12-04 19:24:25,072 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.102e+02 1.276e+02 1.333e+02 1.458e+02 1.844e+02, threshold=2.667e+02, percent-clipped=0.0 2023-12-04 19:24:29,504 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=404033.3333333333, ans=0.125 2023-12-04 19:24:35,503 INFO [train.py:1087] (1/4) Epoch 68, batch 650, loss[loss=0.1333, simple_loss=0.2268, pruned_loss=0.01992, over 24854.00 frames. ], tot_loss[loss=0.1501, simple_loss=0.2433, pruned_loss=0.02846, over 4631325.84 frames. 
], batch size: 68, lr: 3.60e-03, grad_scale: 16.0 2023-12-04 19:24:52,861 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=404166.6666666667, ans=0.125 2023-12-04 19:25:14,876 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=404233.3333333333, ans=0.1 2023-12-04 19:25:26,303 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.08 vs. limit=12.0 2023-12-04 19:25:35,931 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=404366.6666666667, ans=0.2 2023-12-04 19:25:35,981 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=404366.6666666667, ans=0.0 2023-12-04 19:25:44,700 INFO [train.py:1087] (1/4) Epoch 68, batch 700, loss[loss=0.1447, simple_loss=0.2349, pruned_loss=0.02722, over 24575.00 frames. ], tot_loss[loss=0.1501, simple_loss=0.2433, pruned_loss=0.02846, over 4673672.28 frames. ], batch size: 64, lr: 3.60e-03, grad_scale: 16.0 2023-12-04 19:25:49,122 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=404433.3333333333, ans=0.0 2023-12-04 19:26:03,893 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=404500.0, ans=0.2 2023-12-04 19:26:31,616 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=404633.3333333333, ans=0.1 2023-12-04 19:26:41,763 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.090e+02 1.275e+02 1.354e+02 1.444e+02 2.220e+02, threshold=2.709e+02, percent-clipped=0.0 2023-12-04 19:26:53,517 INFO [train.py:1087] (1/4) Epoch 68, batch 750, loss[loss=0.1669, simple_loss=0.2588, pruned_loss=0.03755, over 24490.00 frames. ], tot_loss[loss=0.1502, simple_loss=0.2434, pruned_loss=0.02853, over 4700095.32 frames. ], batch size: 75, lr: 3.60e-03, grad_scale: 16.0 2023-12-04 19:27:07,391 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.21 vs. limit=6.0 2023-12-04 19:27:27,493 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=404900.0, ans=0.125 2023-12-04 19:27:36,914 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=404966.6666666667, ans=0.2 2023-12-04 19:28:01,065 INFO [train.py:1087] (1/4) Epoch 68, batch 800, loss[loss=0.1454, simple_loss=0.2347, pruned_loss=0.02802, over 24327.00 frames. ], tot_loss[loss=0.1505, simple_loss=0.2435, pruned_loss=0.02874, over 4722304.53 frames. ], batch size: 79, lr: 3.60e-03, grad_scale: 32.0 2023-12-04 19:28:04,063 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=405100.0, ans=0.1 2023-12-04 19:28:07,361 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.52 vs. 
limit=22.5 2023-12-04 19:28:16,523 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=405166.6666666667, ans=0.0 2023-12-04 19:28:21,999 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=15.0 2023-12-04 19:28:46,934 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=405300.0, ans=0.0 2023-12-04 19:28:51,207 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.30 vs. limit=15.0 2023-12-04 19:28:51,842 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=405366.6666666667, ans=0.125 2023-12-04 19:28:53,973 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.053e+02 1.256e+02 1.383e+02 1.523e+02 1.980e+02, threshold=2.767e+02, percent-clipped=0.0 2023-12-04 19:28:54,847 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.93 vs. limit=15.0 2023-12-04 19:29:03,556 INFO [train.py:1087] (1/4) Epoch 68, batch 850, loss[loss=0.1544, simple_loss=0.2461, pruned_loss=0.03132, over 24754.00 frames. ], tot_loss[loss=0.1507, simple_loss=0.2438, pruned_loss=0.02883, over 4752865.65 frames. ], batch size: 70, lr: 3.60e-03, grad_scale: 32.0 2023-12-04 19:29:16,834 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=405500.0, ans=0.025 2023-12-04 19:29:19,907 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=15.0 2023-12-04 19:29:29,065 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=405566.6666666667, ans=0.1 2023-12-04 19:30:18,008 INFO [train.py:1087] (1/4) Epoch 69, batch 0, loss[loss=0.1473, simple_loss=0.2475, pruned_loss=0.02353, over 24715.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.2475, pruned_loss=0.02353, over 24715.00 frames. ], batch size: 69, lr: 3.57e-03, grad_scale: 32.0 2023-12-04 19:30:18,009 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 19:30:34,633 INFO [train.py:1119] (1/4) Epoch 69, validation: loss=0.151, simple_loss=0.2473, pruned_loss=0.02734, over 944034.00 frames. 2023-12-04 19:30:34,634 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 19:30:50,842 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=405800.0, ans=0.125 2023-12-04 19:30:55,098 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=405800.0, ans=0.2 2023-12-04 19:30:55,529 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.28 vs. 
limit=12.0 2023-12-04 19:31:32,039 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=406000.0, ans=0.0 2023-12-04 19:31:38,940 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.167e+02 1.273e+02 1.368e+02 1.465e+02 2.041e+02, threshold=2.736e+02, percent-clipped=0.0 2023-12-04 19:31:43,013 INFO [train.py:1087] (1/4) Epoch 69, batch 50, loss[loss=0.1521, simple_loss=0.2474, pruned_loss=0.02836, over 23958.00 frames. ], tot_loss[loss=0.1498, simple_loss=0.243, pruned_loss=0.02831, over 1095036.40 frames. ], batch size: 87, lr: 3.57e-03, grad_scale: 32.0 2023-12-04 19:31:48,611 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=406066.6666666667, ans=0.2 2023-12-04 19:32:09,783 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=406200.0, ans=0.125 2023-12-04 19:32:32,264 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.08 vs. limit=15.0 2023-12-04 19:32:51,506 INFO [train.py:1087] (1/4) Epoch 69, batch 100, loss[loss=0.1525, simple_loss=0.2459, pruned_loss=0.02949, over 24760.00 frames. ], tot_loss[loss=0.1506, simple_loss=0.244, pruned_loss=0.02855, over 1909879.37 frames. ], batch size: 65, lr: 3.57e-03, grad_scale: 32.0 2023-12-04 19:33:24,408 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=406533.3333333333, ans=0.125 2023-12-04 19:33:49,630 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=406666.6666666667, ans=0.1 2023-12-04 19:33:56,387 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.284e+02 1.376e+02 1.523e+02 2.091e+02, threshold=2.753e+02, percent-clipped=0.0 2023-12-04 19:33:59,389 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=406733.3333333333, ans=0.2 2023-12-04 19:34:00,290 INFO [train.py:1087] (1/4) Epoch 69, batch 150, loss[loss=0.1709, simple_loss=0.2586, pruned_loss=0.04161, over 16449.00 frames. ], tot_loss[loss=0.151, simple_loss=0.2442, pruned_loss=0.02887, over 2541313.07 frames. ], batch size: 177, lr: 3.56e-03, grad_scale: 32.0 2023-12-04 19:34:14,209 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.90 vs. limit=10.0 2023-12-04 19:34:20,499 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=406800.0, ans=0.125 2023-12-04 19:34:29,767 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=406866.6666666667, ans=0.125 2023-12-04 19:34:51,716 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.05 vs. 
limit=15.0 2023-12-04 19:34:52,602 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=406933.3333333333, ans=0.0 2023-12-04 19:35:02,854 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=407000.0, ans=0.1 2023-12-04 19:35:08,916 INFO [train.py:1087] (1/4) Epoch 69, batch 200, loss[loss=0.1401, simple_loss=0.2328, pruned_loss=0.02368, over 24781.00 frames. ], tot_loss[loss=0.1509, simple_loss=0.244, pruned_loss=0.02892, over 3024584.20 frames. ], batch size: 70, lr: 3.56e-03, grad_scale: 32.0 2023-12-04 19:35:27,024 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=407133.3333333333, ans=0.125 2023-12-04 19:35:28,079 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=407133.3333333333, ans=0.125 2023-12-04 19:35:28,339 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=407133.3333333333, ans=0.0 2023-12-04 19:35:54,130 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=407266.6666666667, ans=0.125 2023-12-04 19:36:11,060 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=407333.3333333333, ans=0.2 2023-12-04 19:36:14,378 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.124e+02 1.284e+02 1.362e+02 1.487e+02 2.201e+02, threshold=2.723e+02, percent-clipped=0.0 2023-12-04 19:36:16,480 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.05 vs. limit=10.0 2023-12-04 19:36:18,335 INFO [train.py:1087] (1/4) Epoch 69, batch 250, loss[loss=0.1481, simple_loss=0.245, pruned_loss=0.02558, over 24808.00 frames. ], tot_loss[loss=0.1511, simple_loss=0.2441, pruned_loss=0.02902, over 3410701.59 frames. ], batch size: 62, lr: 3.56e-03, grad_scale: 32.0 2023-12-04 19:37:07,489 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=407600.0, ans=0.0 2023-12-04 19:37:10,443 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=407600.0, ans=0.125 2023-12-04 19:37:23,305 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=407666.6666666667, ans=0.2 2023-12-04 19:37:26,289 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=407733.3333333333, ans=0.1 2023-12-04 19:37:27,280 INFO [train.py:1087] (1/4) Epoch 69, batch 300, loss[loss=0.1365, simple_loss=0.2321, pruned_loss=0.02044, over 24680.00 frames. ], tot_loss[loss=0.1504, simple_loss=0.2436, pruned_loss=0.02861, over 3722117.62 frames. 
], batch size: 74, lr: 3.56e-03, grad_scale: 32.0 2023-12-04 19:37:50,899 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=407800.0, ans=0.2 2023-12-04 19:37:57,550 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=407866.6666666667, ans=0.0 2023-12-04 19:38:13,589 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.11 vs. limit=22.5 2023-12-04 19:38:30,203 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.056e+02 1.268e+02 1.359e+02 1.465e+02 1.790e+02, threshold=2.718e+02, percent-clipped=0.0 2023-12-04 19:38:34,709 INFO [train.py:1087] (1/4) Epoch 69, batch 350, loss[loss=0.1343, simple_loss=0.2289, pruned_loss=0.01992, over 24710.00 frames. ], tot_loss[loss=0.1495, simple_loss=0.2428, pruned_loss=0.02809, over 3981103.09 frames. ], batch size: 74, lr: 3.56e-03, grad_scale: 32.0 2023-12-04 19:38:36,427 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=408066.6666666667, ans=0.125 2023-12-04 19:38:49,627 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=408133.3333333333, ans=0.125 2023-12-04 19:38:57,445 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=408133.3333333333, ans=0.125 2023-12-04 19:38:57,542 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=408133.3333333333, ans=0.125 2023-12-04 19:38:58,710 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=408133.3333333333, ans=0.0 2023-12-04 19:39:09,319 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=408200.0, ans=0.2 2023-12-04 19:39:34,995 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 19:39:43,591 INFO [train.py:1087] (1/4) Epoch 69, batch 400, loss[loss=0.16, simple_loss=0.2476, pruned_loss=0.0362, over 24300.00 frames. ], tot_loss[loss=0.1493, simple_loss=0.2426, pruned_loss=0.02805, over 4174982.41 frames. ], batch size: 79, lr: 3.56e-03, grad_scale: 32.0 2023-12-04 19:39:47,863 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=408400.0, ans=0.1 2023-12-04 19:40:34,935 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=408600.0, ans=0.125 2023-12-04 19:40:48,497 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=408666.6666666667, ans=0.0 2023-12-04 19:40:49,272 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.060e+02 1.255e+02 1.353e+02 1.467e+02 1.842e+02, threshold=2.706e+02, percent-clipped=0.0 2023-12-04 19:40:52,396 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=408733.3333333333, ans=0.125 2023-12-04 19:40:53,296 INFO [train.py:1087] (1/4) Epoch 69, batch 450, loss[loss=0.1409, simple_loss=0.2349, pruned_loss=0.0235, over 24806.00 frames. 
], tot_loss[loss=0.1493, simple_loss=0.2427, pruned_loss=0.02801, over 4313325.34 frames. ], batch size: 71, lr: 3.56e-03, grad_scale: 32.0 2023-12-04 19:40:59,100 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=408733.3333333333, ans=0.125 2023-12-04 19:41:25,034 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.85 vs. limit=15.0 2023-12-04 19:41:27,072 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=408866.6666666667, ans=0.125 2023-12-04 19:41:32,334 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=408866.6666666667, ans=0.125 2023-12-04 19:41:36,296 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=408933.3333333333, ans=0.0 2023-12-04 19:41:58,440 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=409000.0, ans=0.125 2023-12-04 19:42:02,485 INFO [train.py:1087] (1/4) Epoch 69, batch 500, loss[loss=0.1453, simple_loss=0.238, pruned_loss=0.02626, over 24768.00 frames. ], tot_loss[loss=0.15, simple_loss=0.2432, pruned_loss=0.02839, over 4418752.39 frames. ], batch size: 71, lr: 3.55e-03, grad_scale: 32.0 2023-12-04 19:42:04,378 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=409066.6666666667, ans=0.125 2023-12-04 19:42:07,221 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=409066.6666666667, ans=0.0 2023-12-04 19:42:08,405 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=409066.6666666667, ans=0.1 2023-12-04 19:42:35,690 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=409200.0, ans=0.1 2023-12-04 19:42:39,291 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.31 vs. limit=15.0 2023-12-04 19:42:47,873 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.95 vs. limit=15.0 2023-12-04 19:42:55,330 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=409266.6666666667, ans=0.125 2023-12-04 19:43:07,065 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.142e+02 1.242e+02 1.345e+02 1.487e+02 1.908e+02, threshold=2.690e+02, percent-clipped=0.0 2023-12-04 19:43:11,102 INFO [train.py:1087] (1/4) Epoch 69, batch 550, loss[loss=0.152, simple_loss=0.2455, pruned_loss=0.02926, over 24566.00 frames. ], tot_loss[loss=0.1498, simple_loss=0.2431, pruned_loss=0.02825, over 4515466.60 frames. ], batch size: 63, lr: 3.55e-03, grad_scale: 32.0 2023-12-04 19:43:55,045 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 19:43:57,071 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.83 vs. 
limit=15.0 2023-12-04 19:44:02,729 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=409600.0, ans=0.125 2023-12-04 19:44:13,219 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=409666.6666666667, ans=0.0 2023-12-04 19:44:20,580 INFO [train.py:1087] (1/4) Epoch 69, batch 600, loss[loss=0.1481, simple_loss=0.2403, pruned_loss=0.02793, over 24813.00 frames. ], tot_loss[loss=0.1492, simple_loss=0.2425, pruned_loss=0.02797, over 4586880.62 frames. ], batch size: 73, lr: 3.55e-03, grad_scale: 32.0 2023-12-04 19:44:35,194 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.27 vs. limit=15.0 2023-12-04 19:44:54,299 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=409866.6666666667, ans=0.125 2023-12-04 19:44:55,816 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.14 vs. limit=12.0 2023-12-04 19:45:13,215 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten.whitening_limit, batch_count=409933.3333333333, ans=15.0 2023-12-04 19:45:24,355 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=410000.0, ans=0.125 2023-12-04 19:45:25,234 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.138e+02 1.291e+02 1.390e+02 1.536e+02 2.156e+02, threshold=2.780e+02, percent-clipped=0.0 2023-12-04 19:45:29,264 INFO [train.py:1087] (1/4) Epoch 69, batch 650, loss[loss=0.1403, simple_loss=0.2349, pruned_loss=0.02282, over 23725.00 frames. ], tot_loss[loss=0.1492, simple_loss=0.2425, pruned_loss=0.02801, over 4642357.94 frames. ], batch size: 57, lr: 3.55e-03, grad_scale: 32.0 2023-12-04 19:45:58,907 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.60 vs. limit=22.5 2023-12-04 19:45:59,535 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=410200.0, ans=0.04949747468305833 2023-12-04 19:46:09,404 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=410266.6666666667, ans=0.125 2023-12-04 19:46:32,707 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=410333.3333333333, ans=0.0 2023-12-04 19:46:34,074 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 19:46:37,726 INFO [train.py:1087] (1/4) Epoch 69, batch 700, loss[loss=0.144, simple_loss=0.2388, pruned_loss=0.02466, over 24584.00 frames. ], tot_loss[loss=0.1492, simple_loss=0.2423, pruned_loss=0.02803, over 4677373.37 frames. 
], batch size: 65, lr: 3.55e-03, grad_scale: 32.0 2023-12-04 19:46:54,311 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=410466.6666666667, ans=0.1 2023-12-04 19:47:02,047 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=410466.6666666667, ans=0.125 2023-12-04 19:47:10,969 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=410533.3333333333, ans=0.0 2023-12-04 19:47:33,289 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-12-04 19:47:38,456 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=15.0 2023-12-04 19:47:41,419 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.061e+02 1.290e+02 1.377e+02 1.496e+02 2.401e+02, threshold=2.755e+02, percent-clipped=0.0 2023-12-04 19:47:46,306 INFO [train.py:1087] (1/4) Epoch 69, batch 750, loss[loss=0.1535, simple_loss=0.2475, pruned_loss=0.02979, over 24876.00 frames. ], tot_loss[loss=0.1496, simple_loss=0.2425, pruned_loss=0.02834, over 4690956.18 frames. ], batch size: 68, lr: 3.55e-03, grad_scale: 32.0 2023-12-04 19:47:56,289 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 19:47:56,476 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.58 vs. limit=15.0 2023-12-04 19:48:00,767 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.84 vs. limit=6.0 2023-12-04 19:48:03,351 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0 2023-12-04 19:48:33,197 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=410933.3333333333, ans=0.125 2023-12-04 19:48:36,020 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=410933.3333333333, ans=0.125 2023-12-04 19:48:38,547 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 19:48:38,646 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=410933.3333333333, ans=0.125 2023-12-04 19:48:39,998 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=411000.0, ans=0.2 2023-12-04 19:48:42,783 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=411000.0, ans=0.125 2023-12-04 19:48:54,606 INFO [train.py:1087] (1/4) Epoch 69, batch 800, loss[loss=0.149, simple_loss=0.242, pruned_loss=0.02803, over 24705.00 frames. ], tot_loss[loss=0.1494, simple_loss=0.2425, pruned_loss=0.02813, over 4727369.67 frames. 
], batch size: 74, lr: 3.55e-03, grad_scale: 32.0 2023-12-04 19:48:55,081 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=411066.6666666667, ans=0.0 2023-12-04 19:49:03,966 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=411066.6666666667, ans=0.125 2023-12-04 19:49:18,865 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=411133.3333333333, ans=0.5 2023-12-04 19:49:20,047 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=411200.0, ans=0.2 2023-12-04 19:49:20,121 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=411200.0, ans=0.0 2023-12-04 19:49:35,514 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=411266.6666666667, ans=0.07 2023-12-04 19:49:36,868 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=411266.6666666667, ans=0.125 2023-12-04 19:49:43,124 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.31 vs. limit=15.0 2023-12-04 19:49:50,792 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=411333.3333333333, ans=0.0 2023-12-04 19:49:52,926 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.121e+02 1.292e+02 1.387e+02 1.504e+02 2.186e+02, threshold=2.774e+02, percent-clipped=0.0 2023-12-04 19:49:56,658 INFO [train.py:1087] (1/4) Epoch 69, batch 850, loss[loss=0.1477, simple_loss=0.2397, pruned_loss=0.02786, over 24766.00 frames. ], tot_loss[loss=0.1497, simple_loss=0.2429, pruned_loss=0.02828, over 4737982.53 frames. ], batch size: 70, lr: 3.54e-03, grad_scale: 32.0 2023-12-04 19:50:17,481 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=411466.6666666667, ans=0.125 2023-12-04 19:50:28,606 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.32 vs. limit=6.0 2023-12-04 19:50:37,133 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.49 vs. limit=15.0 2023-12-04 19:50:44,818 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=411666.6666666667, ans=0.125 2023-12-04 19:50:48,166 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=411666.6666666667, ans=0.125 2023-12-04 19:51:06,628 INFO [train.py:1087] (1/4) Epoch 70, batch 0, loss[loss=0.148, simple_loss=0.2445, pruned_loss=0.02575, over 21629.00 frames. ], tot_loss[loss=0.148, simple_loss=0.2445, pruned_loss=0.02575, over 21629.00 frames. 
], batch size: 128, lr: 3.52e-03, grad_scale: 32.0 2023-12-04 19:51:06,629 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 19:51:19,891 INFO [zipformer.py:1876] (1/4) name=encoder.encoders.2.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([1.7466, 3.8666, 3.9395, 4.2572], device='cuda:1') 2023-12-04 19:51:22,674 INFO [train.py:1119] (1/4) Epoch 70, validation: loss=0.1509, simple_loss=0.2473, pruned_loss=0.02724, over 944034.00 frames. 2023-12-04 19:51:22,675 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 19:51:43,973 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=411766.6666666667, ans=0.0 2023-12-04 19:52:04,969 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=411900.0, ans=0.0 2023-12-04 19:52:05,150 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=411900.0, ans=0.125 2023-12-04 19:52:24,804 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=411966.6666666667, ans=0.0 2023-12-04 19:52:29,722 INFO [train.py:1087] (1/4) Epoch 70, batch 50, loss[loss=0.152, simple_loss=0.247, pruned_loss=0.0285, over 24863.00 frames. ], tot_loss[loss=0.149, simple_loss=0.2433, pruned_loss=0.02735, over 1098520.78 frames. ], batch size: 68, lr: 3.52e-03, grad_scale: 16.0 2023-12-04 19:52:35,037 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.097e+02 1.274e+02 1.365e+02 1.527e+02 2.483e+02, threshold=2.729e+02, percent-clipped=0.0 2023-12-04 19:53:14,960 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.88 vs. limit=6.0 2023-12-04 19:53:14,987 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.23 vs. limit=15.0 2023-12-04 19:53:24,485 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.72 vs. limit=15.0 2023-12-04 19:53:24,636 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0 2023-12-04 19:53:25,312 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=412300.0, ans=0.1 2023-12-04 19:53:30,712 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=412300.0, ans=0.1 2023-12-04 19:53:36,470 INFO [train.py:1087] (1/4) Epoch 70, batch 100, loss[loss=0.1606, simple_loss=0.2553, pruned_loss=0.033, over 22859.00 frames. ], tot_loss[loss=0.1489, simple_loss=0.2426, pruned_loss=0.02755, over 1927846.44 frames. 
], batch size: 106, lr: 3.51e-03, grad_scale: 16.0 2023-12-04 19:53:37,371 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=412366.6666666667, ans=0.1 2023-12-04 19:53:45,547 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=412366.6666666667, ans=0.125 2023-12-04 19:54:00,395 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=412433.3333333333, ans=0.0 2023-12-04 19:54:28,571 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=412566.6666666667, ans=0.0 2023-12-04 19:54:40,258 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.85 vs. limit=15.0 2023-12-04 19:54:44,493 INFO [train.py:1087] (1/4) Epoch 70, batch 150, loss[loss=0.1455, simple_loss=0.2366, pruned_loss=0.02717, over 24762.00 frames. ], tot_loss[loss=0.1491, simple_loss=0.2431, pruned_loss=0.0276, over 2564020.81 frames. ], batch size: 65, lr: 3.51e-03, grad_scale: 16.0 2023-12-04 19:54:49,695 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.081e+02 1.259e+02 1.327e+02 1.407e+02 1.863e+02, threshold=2.654e+02, percent-clipped=0.0 2023-12-04 19:54:51,882 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.23 vs. limit=22.5 2023-12-04 19:54:52,713 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=412700.0, ans=0.125 2023-12-04 19:55:43,445 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=412966.6666666667, ans=0.125 2023-12-04 19:55:51,979 INFO [train.py:1087] (1/4) Epoch 70, batch 200, loss[loss=0.1427, simple_loss=0.2397, pruned_loss=0.02285, over 24706.00 frames. ], tot_loss[loss=0.149, simple_loss=0.2427, pruned_loss=0.02762, over 3064319.82 frames. ], batch size: 69, lr: 3.51e-03, grad_scale: 16.0 2023-12-04 19:55:59,494 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-12-04 19:56:12,162 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=413100.0, ans=0.125 2023-12-04 19:56:47,635 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.53 vs. limit=22.5 2023-12-04 19:56:59,678 INFO [train.py:1087] (1/4) Epoch 70, batch 250, loss[loss=0.1527, simple_loss=0.2462, pruned_loss=0.02958, over 24605.00 frames. ], tot_loss[loss=0.1493, simple_loss=0.2428, pruned_loss=0.02791, over 3433588.95 frames. ], batch size: 68, lr: 3.51e-03, grad_scale: 16.0 2023-12-04 19:57:04,766 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.085e+02 1.288e+02 1.394e+02 1.582e+02 1.956e+02, threshold=2.788e+02, percent-clipped=0.0 2023-12-04 19:57:10,930 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.01 vs. 
limit=15.0 2023-12-04 19:57:31,524 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=413500.0, ans=0.0 2023-12-04 19:57:51,769 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=413633.3333333333, ans=0.125 2023-12-04 19:57:58,562 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=413633.3333333333, ans=0.2 2023-12-04 19:58:07,195 INFO [train.py:1087] (1/4) Epoch 70, batch 300, loss[loss=0.1466, simple_loss=0.2375, pruned_loss=0.02783, over 24729.00 frames. ], tot_loss[loss=0.1494, simple_loss=0.2427, pruned_loss=0.02806, over 3721907.63 frames. ], batch size: 67, lr: 3.51e-03, grad_scale: 16.0 2023-12-04 19:58:19,214 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=413766.6666666667, ans=0.0 2023-12-04 19:58:30,414 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=413766.6666666667, ans=0.0 2023-12-04 19:59:00,764 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.05 vs. limit=15.0 2023-12-04 19:59:08,883 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.72 vs. limit=12.0 2023-12-04 19:59:12,488 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=414033.3333333333, ans=0.0 2023-12-04 19:59:13,306 INFO [train.py:1087] (1/4) Epoch 70, batch 350, loss[loss=0.1517, simple_loss=0.2492, pruned_loss=0.02705, over 22057.00 frames. ], tot_loss[loss=0.1495, simple_loss=0.2428, pruned_loss=0.02808, over 3960327.13 frames. ], batch size: 53, lr: 3.51e-03, grad_scale: 16.0 2023-12-04 19:59:18,496 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.128e+02 1.237e+02 1.319e+02 1.438e+02 1.651e+02, threshold=2.639e+02, percent-clipped=0.0 2023-12-04 19:59:49,924 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=414166.6666666667, ans=0.0 2023-12-04 20:00:09,533 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=414300.0, ans=0.0 2023-12-04 20:00:13,765 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=414300.0, ans=0.125 2023-12-04 20:00:22,629 INFO [train.py:1087] (1/4) Epoch 70, batch 400, loss[loss=0.1488, simple_loss=0.2435, pruned_loss=0.02702, over 24580.00 frames. ], tot_loss[loss=0.1494, simple_loss=0.2427, pruned_loss=0.02799, over 4150116.77 frames. ], batch size: 65, lr: 3.51e-03, grad_scale: 32.0 2023-12-04 20:00:22,902 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=414366.6666666667, ans=0.1 2023-12-04 20:00:22,964 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=414366.6666666667, ans=0.2 2023-12-04 20:00:26,207 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.05 vs. 
limit=22.5 2023-12-04 20:00:36,035 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=414433.3333333333, ans=0.125 2023-12-04 20:00:44,207 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=414433.3333333333, ans=0.2 2023-12-04 20:00:54,034 INFO [scaling.py:1022] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.60 vs. limit=8.0 2023-12-04 20:00:56,169 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=414500.0, ans=0.125 2023-12-04 20:01:17,800 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.81 vs. limit=15.0 2023-12-04 20:01:31,548 INFO [train.py:1087] (1/4) Epoch 70, batch 450, loss[loss=0.1526, simple_loss=0.2428, pruned_loss=0.03119, over 24562.00 frames. ], tot_loss[loss=0.1493, simple_loss=0.2428, pruned_loss=0.02785, over 4304969.99 frames. ], batch size: 62, lr: 3.50e-03, grad_scale: 32.0 2023-12-04 20:01:31,962 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=414700.0, ans=0.125 2023-12-04 20:01:36,544 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.116e+02 1.278e+02 1.347e+02 1.512e+02 1.947e+02, threshold=2.695e+02, percent-clipped=0.0 2023-12-04 20:01:44,723 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=414766.6666666667, ans=0.07 2023-12-04 20:02:07,279 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2.whitening_limit, batch_count=414833.3333333333, ans=15.0 2023-12-04 20:02:13,488 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=414900.0, ans=0.04949747468305833 2023-12-04 20:02:18,684 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=414900.0, ans=0.09899494936611666 2023-12-04 20:02:39,593 INFO [train.py:1087] (1/4) Epoch 70, batch 500, loss[loss=0.1452, simple_loss=0.2404, pruned_loss=0.02499, over 24758.00 frames. ], tot_loss[loss=0.1492, simple_loss=0.2429, pruned_loss=0.02781, over 4424522.73 frames. ], batch size: 66, lr: 3.50e-03, grad_scale: 16.0 2023-12-04 20:02:49,529 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=415033.3333333333, ans=0.125 2023-12-04 20:03:47,568 INFO [train.py:1087] (1/4) Epoch 70, batch 550, loss[loss=0.148, simple_loss=0.2454, pruned_loss=0.02533, over 24796.00 frames. ], tot_loss[loss=0.1496, simple_loss=0.2432, pruned_loss=0.028, over 4496554.45 frames. ], batch size: 71, lr: 3.50e-03, grad_scale: 16.0 2023-12-04 20:03:52,747 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.75 vs. 
limit=15.0 2023-12-04 20:03:55,051 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.105e+02 1.240e+02 1.315e+02 1.412e+02 1.948e+02, threshold=2.630e+02, percent-clipped=0.0 2023-12-04 20:03:58,377 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=415366.6666666667, ans=0.125 2023-12-04 20:03:59,811 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=415366.6666666667, ans=0.0 2023-12-04 20:04:09,839 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=415433.3333333333, ans=0.125 2023-12-04 20:04:10,137 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.72 vs. limit=15.0 2023-12-04 20:04:23,862 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=415500.0, ans=10.0 2023-12-04 20:04:30,776 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=415566.6666666667, ans=0.125 2023-12-04 20:04:34,002 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=22.5 2023-12-04 20:04:52,097 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. limit=10.0 2023-12-04 20:04:52,931 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=415633.3333333333, ans=0.125 2023-12-04 20:04:56,339 INFO [train.py:1087] (1/4) Epoch 70, batch 600, loss[loss=0.1437, simple_loss=0.2384, pruned_loss=0.02451, over 24755.00 frames. ], tot_loss[loss=0.1495, simple_loss=0.2429, pruned_loss=0.02801, over 4567638.92 frames. ], batch size: 65, lr: 3.50e-03, grad_scale: 16.0 2023-12-04 20:04:58,204 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=415700.0, ans=0.0 2023-12-04 20:05:00,751 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=415700.0, ans=0.125 2023-12-04 20:05:11,639 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=415766.6666666667, ans=0.0 2023-12-04 20:05:29,841 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.38 vs. limit=22.5 2023-12-04 20:06:04,815 INFO [train.py:1087] (1/4) Epoch 70, batch 650, loss[loss=0.1688, simple_loss=0.2598, pruned_loss=0.03893, over 24471.00 frames. ], tot_loss[loss=0.1495, simple_loss=0.2428, pruned_loss=0.02806, over 4625980.89 frames. ], batch size: 75, lr: 3.50e-03, grad_scale: 16.0 2023-12-04 20:06:11,809 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.073e+02 1.223e+02 1.306e+02 1.426e+02 1.890e+02, threshold=2.613e+02, percent-clipped=0.0 2023-12-04 20:06:15,387 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.86 vs. 
limit=15.0 2023-12-04 20:06:23,213 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.70 vs. limit=15.0 2023-12-04 20:06:32,698 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=416166.6666666667, ans=0.125 2023-12-04 20:06:32,926 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=416166.6666666667, ans=0.125 2023-12-04 20:06:36,892 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=416166.6666666667, ans=0.0 2023-12-04 20:06:47,759 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.69 vs. limit=15.0 2023-12-04 20:06:47,794 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.81 vs. limit=15.0 2023-12-04 20:07:00,693 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=416300.0, ans=0.125 2023-12-04 20:07:13,110 INFO [train.py:1087] (1/4) Epoch 70, batch 700, loss[loss=0.1457, simple_loss=0.2397, pruned_loss=0.02585, over 24578.00 frames. ], tot_loss[loss=0.1493, simple_loss=0.2427, pruned_loss=0.02798, over 4668785.98 frames. ], batch size: 64, lr: 3.50e-03, grad_scale: 16.0 2023-12-04 20:07:34,846 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=416433.3333333333, ans=0.125 2023-12-04 20:07:40,200 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=416500.0, ans=0.0 2023-12-04 20:07:42,817 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=416500.0, ans=0.1 2023-12-04 20:07:56,366 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=416566.6666666667, ans=0.2 2023-12-04 20:07:57,448 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=416566.6666666667, ans=0.125 2023-12-04 20:08:03,946 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=416566.6666666667, ans=0.1 2023-12-04 20:08:20,013 INFO [train.py:1087] (1/4) Epoch 70, batch 750, loss[loss=0.1434, simple_loss=0.2412, pruned_loss=0.02277, over 24755.00 frames. ], tot_loss[loss=0.1495, simple_loss=0.2429, pruned_loss=0.02802, over 4696360.75 frames. 
], batch size: 70, lr: 3.50e-03, grad_scale: 16.0 2023-12-04 20:08:27,499 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.147e+02 1.247e+02 1.315e+02 1.424e+02 2.210e+02, threshold=2.629e+02, percent-clipped=0.0 2023-12-04 20:08:31,742 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=416700.0, ans=0.07 2023-12-04 20:08:51,947 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=416833.3333333333, ans=0.2 2023-12-04 20:09:15,417 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=416966.6666666667, ans=0.125 2023-12-04 20:09:28,683 INFO [train.py:1087] (1/4) Epoch 70, batch 800, loss[loss=0.1407, simple_loss=0.2395, pruned_loss=0.02098, over 24786.00 frames. ], tot_loss[loss=0.1496, simple_loss=0.243, pruned_loss=0.02808, over 4713208.78 frames. ], batch size: 73, lr: 3.49e-03, grad_scale: 32.0 2023-12-04 20:09:34,255 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=417033.3333333333, ans=0.1 2023-12-04 20:09:49,909 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=417100.0, ans=0.1 2023-12-04 20:10:10,227 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=417233.3333333333, ans=0.125 2023-12-04 20:10:16,549 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.11 vs. limit=15.0 2023-12-04 20:10:30,571 INFO [train.py:1087] (1/4) Epoch 70, batch 850, loss[loss=0.1466, simple_loss=0.2413, pruned_loss=0.02593, over 24754.00 frames. ], tot_loss[loss=0.1501, simple_loss=0.2435, pruned_loss=0.02838, over 4719442.54 frames. ], batch size: 61, lr: 3.49e-03, grad_scale: 32.0 2023-12-04 20:10:36,606 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.137e+02 1.298e+02 1.372e+02 1.484e+02 2.066e+02, threshold=2.743e+02, percent-clipped=0.0 2023-12-04 20:10:45,324 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=417433.3333333333, ans=0.1 2023-12-04 20:10:56,485 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=417500.0, ans=0.2 2023-12-04 20:11:03,715 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=417500.0, ans=0.125 2023-12-04 20:11:17,045 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=417566.6666666667, ans=0.1 2023-12-04 20:11:40,513 INFO [train.py:1087] (1/4) Epoch 71, batch 0, loss[loss=0.1542, simple_loss=0.2451, pruned_loss=0.03162, over 23383.00 frames. ], tot_loss[loss=0.1542, simple_loss=0.2451, pruned_loss=0.03162, over 23383.00 frames. ], batch size: 94, lr: 3.47e-03, grad_scale: 32.0 2023-12-04 20:11:40,514 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 20:11:56,070 INFO [train.py:1119] (1/4) Epoch 71, validation: loss=0.1506, simple_loss=0.247, pruned_loss=0.02716, over 944034.00 frames. 
2023-12-04 20:11:56,071 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 20:12:11,630 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=417733.3333333333, ans=0.0 2023-12-04 20:12:14,150 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=417733.3333333333, ans=0.2 2023-12-04 20:12:34,830 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=417866.6666666667, ans=0.0 2023-12-04 20:12:37,048 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.64 vs. limit=10.0 2023-12-04 20:13:02,937 INFO [train.py:1087] (1/4) Epoch 71, batch 50, loss[loss=0.1606, simple_loss=0.259, pruned_loss=0.03109, over 24098.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2424, pruned_loss=0.02728, over 1094242.37 frames. ], batch size: 87, lr: 3.47e-03, grad_scale: 32.0 2023-12-04 20:13:16,537 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.105e+02 1.267e+02 1.344e+02 1.443e+02 2.088e+02, threshold=2.688e+02, percent-clipped=0.0 2023-12-04 20:13:25,814 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=418066.6666666667, ans=0.0 2023-12-04 20:13:26,968 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 20:13:43,338 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.61 vs. limit=10.0 2023-12-04 20:14:09,116 INFO [train.py:1087] (1/4) Epoch 71, batch 100, loss[loss=0.1518, simple_loss=0.2442, pruned_loss=0.02967, over 24441.00 frames. ], tot_loss[loss=0.1493, simple_loss=0.2433, pruned_loss=0.02767, over 1929148.41 frames. ], batch size: 77, lr: 3.46e-03, grad_scale: 32.0 2023-12-04 20:14:16,763 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=418333.3333333333, ans=0.0 2023-12-04 20:14:47,364 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=418533.3333333333, ans=15.0 2023-12-04 20:15:09,813 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=418600.0, ans=0.2 2023-12-04 20:15:13,492 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=418666.6666666667, ans=0.125 2023-12-04 20:15:13,657 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=418666.6666666667, ans=0.125 2023-12-04 20:15:14,847 INFO [train.py:1087] (1/4) Epoch 71, batch 150, loss[loss=0.1422, simple_loss=0.2364, pruned_loss=0.02403, over 24566.00 frames. ], tot_loss[loss=0.15, simple_loss=0.2438, pruned_loss=0.02812, over 2564476.02 frames. 
], batch size: 63, lr: 3.46e-03, grad_scale: 32.0 2023-12-04 20:15:15,155 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=418666.6666666667, ans=0.1 2023-12-04 20:15:28,858 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.112e+02 1.240e+02 1.368e+02 1.477e+02 1.934e+02, threshold=2.736e+02, percent-clipped=0.0 2023-12-04 20:15:54,558 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=418866.6666666667, ans=0.125 2023-12-04 20:15:59,321 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.35 vs. limit=15.0 2023-12-04 20:16:10,236 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=418933.3333333333, ans=0.0 2023-12-04 20:16:22,819 INFO [train.py:1087] (1/4) Epoch 71, batch 200, loss[loss=0.1555, simple_loss=0.2477, pruned_loss=0.03168, over 24791.00 frames. ], tot_loss[loss=0.1497, simple_loss=0.2432, pruned_loss=0.02807, over 3069128.04 frames. ], batch size: 62, lr: 3.46e-03, grad_scale: 32.0 2023-12-04 20:16:36,065 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=419066.6666666667, ans=0.125 2023-12-04 20:17:31,519 INFO [train.py:1087] (1/4) Epoch 71, batch 250, loss[loss=0.1542, simple_loss=0.2468, pruned_loss=0.03081, over 24768.00 frames. ], tot_loss[loss=0.1493, simple_loss=0.2427, pruned_loss=0.02797, over 3459606.72 frames. ], batch size: 70, lr: 3.46e-03, grad_scale: 32.0 2023-12-04 20:17:44,477 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.090e+02 1.265e+02 1.365e+02 1.452e+02 1.844e+02, threshold=2.730e+02, percent-clipped=0.0 2023-12-04 20:18:11,334 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=419533.3333333333, ans=0.2 2023-12-04 20:18:28,384 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=419600.0, ans=0.1 2023-12-04 20:18:39,727 INFO [train.py:1087] (1/4) Epoch 71, batch 300, loss[loss=0.1396, simple_loss=0.2334, pruned_loss=0.02286, over 24788.00 frames. ], tot_loss[loss=0.1495, simple_loss=0.2429, pruned_loss=0.02811, over 3747883.08 frames. ], batch size: 73, lr: 3.46e-03, grad_scale: 32.0 2023-12-04 20:18:48,080 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=419666.6666666667, ans=0.125 2023-12-04 20:19:07,340 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=419800.0, ans=0.1 2023-12-04 20:19:16,014 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. 
limit=6.0 2023-12-04 20:19:16,826 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=419800.0, ans=0.125 2023-12-04 20:19:28,528 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=419866.6666666667, ans=0.2 2023-12-04 20:19:32,224 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=419933.3333333333, ans=0.05 2023-12-04 20:19:33,572 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=419933.3333333333, ans=0.5 2023-12-04 20:19:46,782 INFO [train.py:1087] (1/4) Epoch 71, batch 350, loss[loss=0.1574, simple_loss=0.2436, pruned_loss=0.0356, over 22867.00 frames. ], tot_loss[loss=0.1489, simple_loss=0.242, pruned_loss=0.02787, over 3984893.90 frames. ], batch size: 55, lr: 3.46e-03, grad_scale: 16.0 2023-12-04 20:19:50,079 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-12-04 20:20:02,809 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.098e+02 1.263e+02 1.346e+02 1.433e+02 1.740e+02, threshold=2.692e+02, percent-clipped=0.0 2023-12-04 20:20:06,039 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-12-04 20:20:11,244 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.57 vs. limit=12.0 2023-12-04 20:20:24,449 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=420133.3333333333, ans=10.0 2023-12-04 20:20:30,878 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=420200.0, ans=0.125 2023-12-04 20:20:43,988 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=420266.6666666667, ans=0.125 2023-12-04 20:20:55,225 INFO [train.py:1087] (1/4) Epoch 71, batch 400, loss[loss=0.1461, simple_loss=0.2448, pruned_loss=0.02368, over 24578.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2417, pruned_loss=0.02765, over 4179925.54 frames. ], batch size: 65, lr: 3.46e-03, grad_scale: 32.0 2023-12-04 20:20:55,444 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=420333.3333333333, ans=0.2 2023-12-04 20:21:07,400 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=420400.0, ans=0.0 2023-12-04 20:21:18,609 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=420400.0, ans=0.125 2023-12-04 20:21:39,158 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 20:21:57,968 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.90 vs. limit=15.0 2023-12-04 20:22:04,161 INFO [train.py:1087] (1/4) Epoch 71, batch 450, loss[loss=0.1433, simple_loss=0.2372, pruned_loss=0.02469, over 24691.00 frames. 
], tot_loss[loss=0.1483, simple_loss=0.2417, pruned_loss=0.02749, over 4322392.11 frames. ], batch size: 74, lr: 3.45e-03, grad_scale: 32.0 2023-12-04 20:22:10,864 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=420666.6666666667, ans=0.2 2023-12-04 20:22:18,177 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.143e+02 1.263e+02 1.348e+02 1.441e+02 1.914e+02, threshold=2.695e+02, percent-clipped=0.0 2023-12-04 20:22:20,185 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.27 vs. limit=12.0 2023-12-04 20:22:45,118 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=420866.6666666667, ans=0.0 2023-12-04 20:23:00,480 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=420933.3333333333, ans=0.125 2023-12-04 20:23:11,322 INFO [train.py:1087] (1/4) Epoch 71, batch 500, loss[loss=0.1513, simple_loss=0.2447, pruned_loss=0.02896, over 23477.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2419, pruned_loss=0.02753, over 4435427.50 frames. ], batch size: 94, lr: 3.45e-03, grad_scale: 32.0 2023-12-04 20:23:19,958 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=421000.0, ans=0.2 2023-12-04 20:23:23,793 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=421066.6666666667, ans=0.0 2023-12-04 20:23:23,870 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=421066.6666666667, ans=0.125 2023-12-04 20:23:38,227 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=421133.3333333333, ans=0.5 2023-12-04 20:23:38,235 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=421133.3333333333, ans=0.125 2023-12-04 20:23:41,689 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.35 vs. limit=6.0 2023-12-04 20:24:07,655 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=421266.6666666667, ans=0.0 2023-12-04 20:24:17,818 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=421333.3333333333, ans=0.0 2023-12-04 20:24:19,595 INFO [train.py:1087] (1/4) Epoch 71, batch 550, loss[loss=0.144, simple_loss=0.2345, pruned_loss=0.02675, over 24856.00 frames. ], tot_loss[loss=0.149, simple_loss=0.2423, pruned_loss=0.02788, over 4504646.24 frames. 
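An aside on the ScheduledFloat entries: each one records the current value ("ans") of a scheduled hyperparameter (skip rates, dropout probabilities, bypass scale minima, balancer targets) as a function of the global batch_count. A minimal sketch of such a schedule, assuming piecewise-linear interpolation between (batch_count, value) breakpoints with clamping at both ends, is given below; this illustrates the idea only and is not the scaling.py class.

import bisect

class PiecewiseSchedule:
    # Illustrative piecewise-linear schedule keyed on batch_count (assumed behaviour).
    def __init__(self, *points):
        # points: (batch_count, value) pairs, e.g. (0, 0.2), (4000, 0.0)
        self.xs = [float(x) for x, _ in points]
        self.ys = [float(y) for _, y in points]

    def __call__(self, batch_count):
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1, y0, y1 = self.xs[i - 1], self.xs[i], self.ys[i - 1], self.ys[i]
        return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)

# Hypothetical schedule: a conv_skip_rate that decays from 0.2 to 0.0 over the first 4000 batches.
conv_skip_rate = PiecewiseSchedule((0, 0.2), (4000, 0.0))
print(conv_skip_rate(420666.67))   # -> 0.0; most rates in this part of the log sit at their final value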
], batch size: 68, lr: 3.45e-03, grad_scale: 32.0 2023-12-04 20:24:27,546 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=421333.3333333333, ans=0.125 2023-12-04 20:24:34,972 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.112e+02 1.289e+02 1.391e+02 1.501e+02 1.847e+02, threshold=2.783e+02, percent-clipped=0.0 2023-12-04 20:24:39,190 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=421400.0, ans=0.1 2023-12-04 20:25:17,177 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=421600.0, ans=0.125 2023-12-04 20:25:27,737 INFO [train.py:1087] (1/4) Epoch 71, batch 600, loss[loss=0.1411, simple_loss=0.2382, pruned_loss=0.02204, over 24807.00 frames. ], tot_loss[loss=0.1491, simple_loss=0.2424, pruned_loss=0.02788, over 4566347.17 frames. ], batch size: 72, lr: 3.45e-03, grad_scale: 32.0 2023-12-04 20:25:30,606 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=421666.6666666667, ans=0.125 2023-12-04 20:25:57,991 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=421800.0, ans=0.1 2023-12-04 20:26:04,769 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=421800.0, ans=0.0 2023-12-04 20:26:05,338 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.48 vs. limit=15.0 2023-12-04 20:26:23,956 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=421933.3333333333, ans=0.0 2023-12-04 20:26:29,067 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=421933.3333333333, ans=0.2 2023-12-04 20:26:35,199 INFO [train.py:1087] (1/4) Epoch 71, batch 650, loss[loss=0.143, simple_loss=0.2334, pruned_loss=0.02634, over 24705.00 frames. ], tot_loss[loss=0.1493, simple_loss=0.2425, pruned_loss=0.02807, over 4618120.50 frames. ], batch size: 69, lr: 3.45e-03, grad_scale: 32.0 2023-12-04 20:26:38,408 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=422000.0, ans=0.125 2023-12-04 20:26:40,865 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=422000.0, ans=0.125 2023-12-04 20:26:51,248 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.134e+02 1.308e+02 1.425e+02 1.537e+02 2.318e+02, threshold=2.850e+02, percent-clipped=0.0 2023-12-04 20:27:17,676 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=422200.0, ans=0.125 2023-12-04 20:27:33,000 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2023-12-04 20:27:40,763 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=422266.6666666667, ans=0.07 2023-12-04 20:27:44,296 INFO [train.py:1087] (1/4) Epoch 71, batch 700, loss[loss=0.1532, simple_loss=0.2478, pruned_loss=0.0293, over 24767.00 frames. 
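An aside on the loss triplets: the logged values are consistent with the reported loss being the pruned loss plus half the simple loss. For the Epoch 71, batch 600 record just above, 0.5 x 0.2382 + 0.02204 is about 0.1411, and the same relation holds for the tot_loss figures. The snippet below replays that arithmetic as a sanity check; the 0.5 weight is inferred from the logged numbers (it can differ during warm-up), so treat it as an observation about this log rather than the training code itself.

# Check loss ~= 0.5 * simple_loss + pruned_loss on values copied from entries above.
records = [
    (0.1411, 0.2382, 0.02204),   # Epoch 71, batch 600 (per-batch loss)
    (0.1491, 0.2424, 0.02788),   # Epoch 71, batch 600 (tot_loss)
    (0.1485, 0.2419, 0.02753),   # Epoch 71, batch 500 (tot_loss)
]
for loss, simple, pruned in records:
    recomputed = 0.5 * simple + pruned
    assert abs(recomputed - loss) < 5e-4, (loss, recomputed)
    print(f"logged={loss:.4f}  0.5*simple+pruned={recomputed:.4f}")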
], tot_loss[loss=0.1495, simple_loss=0.2427, pruned_loss=0.02819, over 4646016.83 frames. ], batch size: 64, lr: 3.45e-03, grad_scale: 16.0 2023-12-04 20:27:51,124 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=422333.3333333333, ans=0.05 2023-12-04 20:27:55,397 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.40 vs. limit=15.0 2023-12-04 20:27:56,416 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=422400.0, ans=0.0 2023-12-04 20:28:06,355 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.52 vs. limit=15.0 2023-12-04 20:28:15,869 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=422466.6666666667, ans=0.025 2023-12-04 20:28:22,112 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=422466.6666666667, ans=0.125 2023-12-04 20:28:22,403 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=422466.6666666667, ans=0.0 2023-12-04 20:28:38,920 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=422600.0, ans=0.125 2023-12-04 20:28:53,160 INFO [train.py:1087] (1/4) Epoch 71, batch 750, loss[loss=0.1484, simple_loss=0.2445, pruned_loss=0.02608, over 23956.00 frames. ], tot_loss[loss=0.1498, simple_loss=0.2428, pruned_loss=0.0284, over 4660738.72 frames. ], batch size: 87, lr: 3.45e-03, grad_scale: 8.0 2023-12-04 20:28:53,521 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=422666.6666666667, ans=0.125 2023-12-04 20:28:59,613 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=422666.6666666667, ans=0.1 2023-12-04 20:29:06,593 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.28 vs. limit=15.0 2023-12-04 20:29:10,976 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.165e+02 1.267e+02 1.335e+02 1.436e+02 1.922e+02, threshold=2.670e+02, percent-clipped=0.0 2023-12-04 20:29:19,211 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=422800.0, ans=0.0 2023-12-04 20:29:24,119 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=422800.0, ans=0.125 2023-12-04 20:30:01,125 INFO [train.py:1087] (1/4) Epoch 71, batch 800, loss[loss=0.1548, simple_loss=0.2473, pruned_loss=0.03115, over 24674.00 frames. ], tot_loss[loss=0.1496, simple_loss=0.2428, pruned_loss=0.02823, over 4702903.32 frames. ], batch size: 74, lr: 3.45e-03, grad_scale: 16.0 2023-12-04 20:30:21,128 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=423066.6666666667, ans=0.125 2023-12-04 20:30:31,111 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.89 vs. 
limit=12.0 2023-12-04 20:30:38,101 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=423133.3333333333, ans=0.1 2023-12-04 20:30:56,135 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.20 vs. limit=15.0 2023-12-04 20:31:04,101 INFO [train.py:1087] (1/4) Epoch 71, batch 850, loss[loss=0.1546, simple_loss=0.25, pruned_loss=0.02959, over 24559.00 frames. ], tot_loss[loss=0.1498, simple_loss=0.2429, pruned_loss=0.02835, over 4729113.85 frames. ], batch size: 62, lr: 3.44e-03, grad_scale: 16.0 2023-12-04 20:31:19,244 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.091e+02 1.244e+02 1.335e+02 1.450e+02 1.789e+02, threshold=2.670e+02, percent-clipped=0.0 2023-12-04 20:31:40,035 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.86 vs. limit=22.5 2023-12-04 20:31:46,521 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=423533.3333333333, ans=0.125 2023-12-04 20:31:50,544 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=423600.0, ans=0.0 2023-12-04 20:31:51,661 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=423600.0, ans=0.125 2023-12-04 20:32:11,649 INFO [train.py:1087] (1/4) Epoch 72, batch 0, loss[loss=0.1463, simple_loss=0.2406, pruned_loss=0.02595, over 24856.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.2406, pruned_loss=0.02595, over 24856.00 frames. ], batch size: 68, lr: 3.42e-03, grad_scale: 32.0 2023-12-04 20:32:11,651 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 20:32:27,128 INFO [train.py:1119] (1/4) Epoch 72, validation: loss=0.1505, simple_loss=0.2467, pruned_loss=0.02715, over 944034.00 frames. 2023-12-04 20:32:27,129 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 20:32:28,624 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 20:32:51,699 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=423700.0, ans=0.125 2023-12-04 20:33:34,136 INFO [train.py:1087] (1/4) Epoch 72, batch 50, loss[loss=0.1677, simple_loss=0.2546, pruned_loss=0.04041, over 16366.00 frames. ], tot_loss[loss=0.1486, simple_loss=0.2421, pruned_loss=0.02753, over 1085532.37 frames. ], batch size: 176, lr: 3.42e-03, grad_scale: 32.0 2023-12-04 20:33:35,966 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=423966.6666666667, ans=0.125 2023-12-04 20:33:42,098 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=423966.6666666667, ans=0.125 2023-12-04 20:33:43,724 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.63 vs. 
limit=10.0 2023-12-04 20:33:52,320 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=424033.3333333333, ans=0.2 2023-12-04 20:33:58,346 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.099e+02 1.316e+02 1.426e+02 1.600e+02 2.159e+02, threshold=2.851e+02, percent-clipped=0.0 2023-12-04 20:34:20,742 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=424166.6666666667, ans=0.2 2023-12-04 20:34:40,873 INFO [train.py:1087] (1/4) Epoch 72, batch 100, loss[loss=0.1433, simple_loss=0.2382, pruned_loss=0.02417, over 24564.00 frames. ], tot_loss[loss=0.1489, simple_loss=0.2424, pruned_loss=0.02775, over 1916768.74 frames. ], batch size: 65, lr: 3.42e-03, grad_scale: 32.0 2023-12-04 20:35:05,142 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-12-04 20:35:09,656 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=424433.3333333333, ans=0.125 2023-12-04 20:35:09,829 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=424433.3333333333, ans=0.125 2023-12-04 20:35:19,247 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=424433.3333333333, ans=0.0 2023-12-04 20:35:19,310 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=424433.3333333333, ans=0.2 2023-12-04 20:35:26,411 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=424500.0, ans=0.0 2023-12-04 20:35:45,161 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=424566.6666666667, ans=0.2 2023-12-04 20:35:48,528 INFO [train.py:1087] (1/4) Epoch 72, batch 150, loss[loss=0.1506, simple_loss=0.247, pruned_loss=0.02714, over 24546.00 frames. ], tot_loss[loss=0.1498, simple_loss=0.2429, pruned_loss=0.02832, over 2560473.10 frames. ], batch size: 62, lr: 3.41e-03, grad_scale: 32.0 2023-12-04 20:35:54,325 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.91 vs. limit=10.0 2023-12-04 20:36:12,718 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.054e+02 1.281e+02 1.343e+02 1.454e+02 1.783e+02, threshold=2.686e+02, percent-clipped=0.0 2023-12-04 20:36:26,560 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=424766.6666666667, ans=0.1 2023-12-04 20:36:34,522 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=424833.3333333333, ans=0.1 2023-12-04 20:36:44,841 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=424900.0, ans=0.09899494936611666 2023-12-04 20:36:49,044 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.83 vs. limit=15.0 2023-12-04 20:36:56,257 INFO [train.py:1087] (1/4) Epoch 72, batch 200, loss[loss=0.1466, simple_loss=0.244, pruned_loss=0.02462, over 23443.00 frames. 
], tot_loss[loss=0.1498, simple_loss=0.243, pruned_loss=0.02829, over 3049776.23 frames. ], batch size: 94, lr: 3.41e-03, grad_scale: 32.0 2023-12-04 20:36:56,632 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=424966.6666666667, ans=0.125 2023-12-04 20:36:56,767 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=424966.6666666667, ans=0.2 2023-12-04 20:37:14,390 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=425033.3333333333, ans=0.0 2023-12-04 20:37:49,383 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=425233.3333333333, ans=0.07 2023-12-04 20:37:53,466 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.92 vs. limit=22.5 2023-12-04 20:37:55,659 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=425233.3333333333, ans=0.0 2023-12-04 20:38:04,437 INFO [train.py:1087] (1/4) Epoch 72, batch 250, loss[loss=0.1718, simple_loss=0.2649, pruned_loss=0.03932, over 20860.00 frames. ], tot_loss[loss=0.1497, simple_loss=0.2431, pruned_loss=0.02818, over 3440846.22 frames. ], batch size: 50, lr: 3.41e-03, grad_scale: 32.0 2023-12-04 20:38:25,427 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=425366.6666666667, ans=0.0 2023-12-04 20:38:28,883 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.288e+02 1.353e+02 1.422e+02 1.761e+02, threshold=2.705e+02, percent-clipped=0.0 2023-12-04 20:38:40,578 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=425433.3333333333, ans=0.0 2023-12-04 20:39:06,758 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=425566.6666666667, ans=0.0 2023-12-04 20:39:12,448 INFO [train.py:1087] (1/4) Epoch 72, batch 300, loss[loss=0.1429, simple_loss=0.236, pruned_loss=0.02496, over 24544.00 frames. ], tot_loss[loss=0.1499, simple_loss=0.2431, pruned_loss=0.02831, over 3750391.32 frames. ], batch size: 62, lr: 3.41e-03, grad_scale: 32.0 2023-12-04 20:39:38,655 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=425766.6666666667, ans=0.125 2023-12-04 20:39:45,984 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2023-12-04 20:40:19,163 INFO [train.py:1087] (1/4) Epoch 72, batch 350, loss[loss=0.1537, simple_loss=0.2534, pruned_loss=0.02706, over 21466.00 frames. ], tot_loss[loss=0.1495, simple_loss=0.2429, pruned_loss=0.02805, over 3981507.73 frames. 
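An aside on the Whitening entries: each "metric=X vs. limit=Y" record compares a measured whitening statistic for a named group of activations against its scheduled limit; the self_attn2.whiten check just above reports 21.92 against a limit of 22.5. The sketch below conveys the shape of such a check using the eigenvalue spread of the feature covariance as a stand-in metric; the actual metric and the penalty applied when it exceeds the limit live in scaling.py and are more involved, so this is an assumption-laden illustration only.

import torch

def whitening_metric(x, num_groups=1):
    # x: (num_frames, num_channels). Proxy metric: max eigenvalue / mean eigenvalue of the
    # per-group covariance; 1.0 means perfectly "white". Stand-in formula, not the real one.
    n = x.shape[0]
    metrics = []
    for g in x.chunk(num_groups, dim=1):
        g = g - g.mean(dim=0, keepdim=True)
        cov = (g.T @ g) / max(n - 1, 1)
        eig = torch.linalg.eigvalsh(cov)
        metrics.append((eig.max() / eig.mean().clamp(min=1e-8)).item())
    return max(metrics)

x = torch.randn(1000, 384)                  # well-conditioned activations
metric, limit = whitening_metric(x), 22.5
if metric > limit:                          # the log prints "metric=... vs. limit=..."
    print(f"whitening penalty would activate: {metric:.2f} vs. limit={limit}")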
], batch size: 127, lr: 3.41e-03, grad_scale: 32.0 2023-12-04 20:40:20,147 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=425966.6666666667, ans=15.0 2023-12-04 20:40:41,887 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=426033.3333333333, ans=0.125 2023-12-04 20:40:42,565 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.62 vs. limit=22.5 2023-12-04 20:40:44,214 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.112e+02 1.286e+02 1.389e+02 1.472e+02 1.982e+02, threshold=2.777e+02, percent-clipped=0.0 2023-12-04 20:40:47,153 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=426100.0, ans=0.1 2023-12-04 20:41:07,193 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.06 vs. limit=22.5 2023-12-04 20:41:27,076 INFO [train.py:1087] (1/4) Epoch 72, batch 400, loss[loss=0.147, simple_loss=0.2388, pruned_loss=0.02765, over 24763.00 frames. ], tot_loss[loss=0.1496, simple_loss=0.2429, pruned_loss=0.0281, over 4156693.83 frames. ], batch size: 64, lr: 3.41e-03, grad_scale: 32.0 2023-12-04 20:41:33,743 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=426300.0, ans=0.125 2023-12-04 20:42:00,934 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.57 vs. limit=10.0 2023-12-04 20:42:15,302 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=426500.0, ans=0.1 2023-12-04 20:42:25,000 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=426566.6666666667, ans=0.125 2023-12-04 20:42:35,239 INFO [train.py:1087] (1/4) Epoch 72, batch 450, loss[loss=0.1554, simple_loss=0.2443, pruned_loss=0.03326, over 24552.00 frames. ], tot_loss[loss=0.1497, simple_loss=0.2431, pruned_loss=0.02822, over 4284878.63 frames. ], batch size: 63, lr: 3.41e-03, grad_scale: 32.0 2023-12-04 20:42:48,982 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=426633.3333333333, ans=0.0 2023-12-04 20:43:04,165 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.146e+02 1.260e+02 1.358e+02 1.461e+02 1.966e+02, threshold=2.717e+02, percent-clipped=0.0 2023-12-04 20:43:35,687 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=426900.0, ans=0.1 2023-12-04 20:43:37,037 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=426900.0, ans=0.05 2023-12-04 20:43:43,605 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=426900.0, ans=0.0 2023-12-04 20:43:47,705 INFO [train.py:1087] (1/4) Epoch 72, batch 500, loss[loss=0.1515, simple_loss=0.2437, pruned_loss=0.02962, over 24570.00 frames. ], tot_loss[loss=0.1496, simple_loss=0.2427, pruned_loss=0.02828, over 4399388.62 frames. 
], batch size: 65, lr: 3.41e-03, grad_scale: 32.0 2023-12-04 20:44:31,653 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=427166.6666666667, ans=0.0 2023-12-04 20:44:47,137 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=427233.3333333333, ans=0.0 2023-12-04 20:44:48,341 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=427233.3333333333, ans=0.0 2023-12-04 20:44:54,316 INFO [train.py:1087] (1/4) Epoch 72, batch 550, loss[loss=0.1405, simple_loss=0.2372, pruned_loss=0.02193, over 24609.00 frames. ], tot_loss[loss=0.1495, simple_loss=0.2426, pruned_loss=0.02823, over 4502181.34 frames. ], batch size: 68, lr: 3.40e-03, grad_scale: 32.0 2023-12-04 20:44:56,016 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=427300.0, ans=0.09899494936611666 2023-12-04 20:44:59,389 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 20:45:14,021 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=427366.6666666667, ans=0.1 2023-12-04 20:45:18,709 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.096e+02 1.261e+02 1.376e+02 1.500e+02 2.224e+02, threshold=2.751e+02, percent-clipped=0.0 2023-12-04 20:45:35,917 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=427500.0, ans=15.0 2023-12-04 20:45:51,899 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=427566.6666666667, ans=0.125 2023-12-04 20:46:01,918 INFO [train.py:1087] (1/4) Epoch 72, batch 600, loss[loss=0.15, simple_loss=0.2442, pruned_loss=0.02787, over 24738.00 frames. ], tot_loss[loss=0.1491, simple_loss=0.2423, pruned_loss=0.0279, over 4581421.64 frames. ], batch size: 67, lr: 3.40e-03, grad_scale: 32.0 2023-12-04 20:46:37,699 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=427766.6666666667, ans=0.2 2023-12-04 20:46:39,405 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.26 vs. limit=15.0 2023-12-04 20:46:40,696 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.70 vs. limit=15.0 2023-12-04 20:47:10,173 INFO [train.py:1087] (1/4) Epoch 72, batch 650, loss[loss=0.1509, simple_loss=0.2445, pruned_loss=0.02869, over 24561.00 frames. ], tot_loss[loss=0.1492, simple_loss=0.2424, pruned_loss=0.02799, over 4626556.44 frames. 
], batch size: 63, lr: 3.40e-03, grad_scale: 32.0 2023-12-04 20:47:11,927 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=427966.6666666667, ans=0.125 2023-12-04 20:47:18,551 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=427966.6666666667, ans=0.1 2023-12-04 20:47:25,008 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=428033.3333333333, ans=0.125 2023-12-04 20:47:34,546 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.137e+02 1.250e+02 1.369e+02 1.461e+02 1.871e+02, threshold=2.739e+02, percent-clipped=0.0 2023-12-04 20:47:53,085 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=428166.6666666667, ans=0.125 2023-12-04 20:48:15,704 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=428300.0, ans=0.025 2023-12-04 20:48:16,623 INFO [train.py:1087] (1/4) Epoch 72, batch 700, loss[loss=0.1498, simple_loss=0.243, pruned_loss=0.02826, over 24545.00 frames. ], tot_loss[loss=0.1491, simple_loss=0.2423, pruned_loss=0.02794, over 4673788.94 frames. ], batch size: 63, lr: 3.40e-03, grad_scale: 32.0 2023-12-04 20:48:20,959 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=428300.0, ans=0.125 2023-12-04 20:48:37,652 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=428366.6666666667, ans=0.125 2023-12-04 20:48:39,156 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=428366.6666666667, ans=0.0 2023-12-04 20:48:40,618 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.10 vs. limit=22.5 2023-12-04 20:48:41,727 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=428433.3333333333, ans=0.02 2023-12-04 20:48:45,513 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.56 vs. limit=15.0 2023-12-04 20:48:49,327 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=428433.3333333333, ans=0.2 2023-12-04 20:48:57,054 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=428500.0, ans=0.125 2023-12-04 20:48:58,239 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=428500.0, ans=0.0 2023-12-04 20:49:20,931 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=428566.6666666667, ans=0.125 2023-12-04 20:49:23,498 INFO [train.py:1087] (1/4) Epoch 72, batch 750, loss[loss=0.1668, simple_loss=0.2566, pruned_loss=0.03852, over 23554.00 frames. ], tot_loss[loss=0.1493, simple_loss=0.2425, pruned_loss=0.02809, over 4692446.74 frames. 
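An aside on the tot_loss bookkeeping: the "over N frames" counter attached to tot_loss grows by roughly 20-25k frames per batch (compare the per-batch entries above), so tot_loss is a per-frame average accumulated over the batches seen so far in the epoch, not just the last batch. A minimal frame-weighted tracker in that spirit is sketched below; the fractional frame counts in the log suggest the real tracker also applies some decay or scaling, so this is an approximation of the idea only.

class RunningLoss:
    # Frame-weighted running average, as an illustrative stand-in for tot_loss[...] in the log.
    def __init__(self):
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss, batch_frames):
        # batch_loss is assumed to already be a per-frame average over the batch
        self.loss_sum += batch_loss * batch_frames
        self.frames += batch_frames

    @property
    def value(self):
        return self.loss_sum / max(self.frames, 1.0)

tracker = RunningLoss()
tracker.update(0.1498, 24545.0)   # shaped like the Epoch 72, batch 700 entry above
tracker.update(0.1668, 23554.0)   # shaped like the Epoch 72, batch 750 entry above
print(f"tot_loss={tracker.value:.4f} over {tracker.frames:.2f} frames")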
], batch size: 94, lr: 3.40e-03, grad_scale: 32.0 2023-12-04 20:49:32,145 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=428633.3333333333, ans=0.125 2023-12-04 20:49:47,397 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.106e+02 1.245e+02 1.331e+02 1.423e+02 2.357e+02, threshold=2.662e+02, percent-clipped=0.0 2023-12-04 20:49:59,807 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=428766.6666666667, ans=0.0 2023-12-04 20:50:07,085 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=428833.3333333333, ans=0.125 2023-12-04 20:50:22,270 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=428900.0, ans=0.125 2023-12-04 20:50:26,413 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=428900.0, ans=0.0 2023-12-04 20:50:29,027 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=428966.6666666667, ans=0.0 2023-12-04 20:50:29,947 INFO [train.py:1087] (1/4) Epoch 72, batch 800, loss[loss=0.144, simple_loss=0.2406, pruned_loss=0.02373, over 24779.00 frames. ], tot_loss[loss=0.149, simple_loss=0.2423, pruned_loss=0.02786, over 4736605.67 frames. ], batch size: 62, lr: 3.40e-03, grad_scale: 32.0 2023-12-04 20:50:47,318 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=429033.3333333333, ans=0.2 2023-12-04 20:50:59,999 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=429100.0, ans=0.2 2023-12-04 20:51:19,811 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=429233.3333333333, ans=0.09899494936611666 2023-12-04 20:51:24,906 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.66 vs. limit=15.0 2023-12-04 20:51:29,965 INFO [train.py:1087] (1/4) Epoch 72, batch 850, loss[loss=0.1417, simple_loss=0.239, pruned_loss=0.02216, over 24815.00 frames. ], tot_loss[loss=0.1494, simple_loss=0.2427, pruned_loss=0.02809, over 4747131.53 frames. ], batch size: 73, lr: 3.40e-03, grad_scale: 32.0 2023-12-04 20:51:30,129 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=429300.0, ans=0.0 2023-12-04 20:51:32,586 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=429300.0, ans=0.2 2023-12-04 20:51:51,955 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.084e+02 1.290e+02 1.355e+02 1.497e+02 2.215e+02, threshold=2.711e+02, percent-clipped=0.0 2023-12-04 20:51:52,248 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=429366.6666666667, ans=0.2 2023-12-04 20:51:59,773 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=429433.3333333333, ans=0.125 2023-12-04 20:52:01,269 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.18 vs. 
limit=10.0 2023-12-04 20:52:05,636 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=429500.0, ans=0.125 2023-12-04 20:52:11,565 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=429500.0, ans=0.1 2023-12-04 20:52:15,587 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=22.5 2023-12-04 20:52:19,662 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=429566.6666666667, ans=0.2 2023-12-04 20:52:19,784 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=429566.6666666667, ans=0.05 2023-12-04 20:52:43,337 INFO [train.py:1087] (1/4) Epoch 73, batch 0, loss[loss=0.1455, simple_loss=0.24, pruned_loss=0.02547, over 24744.00 frames. ], tot_loss[loss=0.1455, simple_loss=0.24, pruned_loss=0.02547, over 24744.00 frames. ], batch size: 63, lr: 3.37e-03, grad_scale: 32.0 2023-12-04 20:52:43,343 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 20:52:58,766 INFO [train.py:1119] (1/4) Epoch 73, validation: loss=0.1503, simple_loss=0.2466, pruned_loss=0.02702, over 944034.00 frames. 2023-12-04 20:52:58,768 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 20:53:29,402 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=429733.3333333333, ans=0.1 2023-12-04 20:53:37,747 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.56 vs. limit=12.0 2023-12-04 20:54:05,243 INFO [train.py:1087] (1/4) Epoch 73, batch 50, loss[loss=0.1487, simple_loss=0.2458, pruned_loss=0.02579, over 24794.00 frames. ], tot_loss[loss=0.148, simple_loss=0.2424, pruned_loss=0.02681, over 1091302.62 frames. ], batch size: 73, lr: 3.37e-03, grad_scale: 32.0 2023-12-04 20:54:11,734 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=429933.3333333333, ans=0.0 2023-12-04 20:54:13,298 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=429933.3333333333, ans=15.0 2023-12-04 20:54:27,212 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=430000.0, ans=0.1 2023-12-04 20:54:27,256 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=430000.0, ans=0.0 2023-12-04 20:54:27,322 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=430000.0, ans=0.125 2023-12-04 20:54:36,085 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.072e+02 1.287e+02 1.367e+02 1.508e+02 1.992e+02, threshold=2.734e+02, percent-clipped=0.0 2023-12-04 20:54:44,212 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=430133.3333333333, ans=0.2 2023-12-04 20:54:44,499 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.68 vs. 
limit=12.0 2023-12-04 20:55:11,499 INFO [train.py:1087] (1/4) Epoch 73, batch 100, loss[loss=0.1559, simple_loss=0.2509, pruned_loss=0.03046, over 24155.00 frames. ], tot_loss[loss=0.1484, simple_loss=0.2424, pruned_loss=0.02717, over 1905673.50 frames. ], batch size: 82, lr: 3.37e-03, grad_scale: 32.0 2023-12-04 20:55:20,485 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=430266.6666666667, ans=0.2 2023-12-04 20:56:07,178 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=430533.3333333333, ans=0.05 2023-12-04 20:56:08,735 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.59 vs. limit=15.0 2023-12-04 20:56:18,505 INFO [train.py:1087] (1/4) Epoch 73, batch 150, loss[loss=0.1505, simple_loss=0.2465, pruned_loss=0.02723, over 24551.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2425, pruned_loss=0.02725, over 2557259.02 frames. ], batch size: 63, lr: 3.37e-03, grad_scale: 32.0 2023-12-04 20:56:28,042 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=430600.0, ans=0.125 2023-12-04 20:56:33,214 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=430666.6666666667, ans=0.125 2023-12-04 20:56:49,241 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.122e+02 1.311e+02 1.423e+02 1.594e+02 2.186e+02, threshold=2.846e+02, percent-clipped=0.0 2023-12-04 20:57:10,738 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=430800.0, ans=0.0 2023-12-04 20:57:19,477 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=430866.6666666667, ans=0.125 2023-12-04 20:57:25,452 INFO [train.py:1087] (1/4) Epoch 73, batch 200, loss[loss=0.1511, simple_loss=0.2444, pruned_loss=0.02883, over 24758.00 frames. ], tot_loss[loss=0.1484, simple_loss=0.2419, pruned_loss=0.02747, over 3067708.96 frames. ], batch size: 66, lr: 3.37e-03, grad_scale: 32.0 2023-12-04 20:57:33,308 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=430933.3333333333, ans=0.2 2023-12-04 20:57:53,361 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=431066.6666666667, ans=0.09899494936611666 2023-12-04 20:57:58,539 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=431066.6666666667, ans=0.09899494936611666 2023-12-04 20:57:58,547 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=431066.6666666667, ans=0.05 2023-12-04 20:58:15,008 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.61 vs. limit=12.0 2023-12-04 20:58:26,483 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.44 vs. limit=15.0 2023-12-04 20:58:32,507 INFO [train.py:1087] (1/4) Epoch 73, batch 250, loss[loss=0.1469, simple_loss=0.2375, pruned_loss=0.0281, over 24752.00 frames. 
], tot_loss[loss=0.1484, simple_loss=0.2418, pruned_loss=0.0275, over 3464127.57 frames. ], batch size: 65, lr: 3.36e-03, grad_scale: 32.0 2023-12-04 20:58:53,114 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=431333.3333333333, ans=0.5 2023-12-04 20:59:02,550 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.106e+02 1.285e+02 1.373e+02 1.461e+02 1.891e+02, threshold=2.747e+02, percent-clipped=0.0 2023-12-04 20:59:12,399 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=431466.6666666667, ans=0.0 2023-12-04 20:59:38,333 INFO [train.py:1087] (1/4) Epoch 73, batch 300, loss[loss=0.1568, simple_loss=0.2482, pruned_loss=0.03273, over 24339.00 frames. ], tot_loss[loss=0.1486, simple_loss=0.242, pruned_loss=0.02761, over 3763839.05 frames. ], batch size: 79, lr: 3.36e-03, grad_scale: 32.0 2023-12-04 20:59:39,662 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.53 vs. limit=15.0 2023-12-04 20:59:46,291 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2023-12-04 20:59:46,912 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=431600.0, ans=0.125 2023-12-04 20:59:57,258 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=431666.6666666667, ans=0.1 2023-12-04 21:00:32,164 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.12 vs. limit=15.0 2023-12-04 21:00:43,851 INFO [train.py:1087] (1/4) Epoch 73, batch 350, loss[loss=0.1413, simple_loss=0.2357, pruned_loss=0.02344, over 24554.00 frames. ], tot_loss[loss=0.1492, simple_loss=0.2427, pruned_loss=0.02787, over 3987618.69 frames. ], batch size: 66, lr: 3.36e-03, grad_scale: 32.0 2023-12-04 21:01:12,707 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=432066.6666666667, ans=0.025 2023-12-04 21:01:14,896 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.084e+02 1.268e+02 1.341e+02 1.476e+02 1.875e+02, threshold=2.682e+02, percent-clipped=0.0 2023-12-04 21:01:32,776 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=432133.3333333333, ans=0.09899494936611666 2023-12-04 21:01:38,581 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=432200.0, ans=0.1 2023-12-04 21:01:50,795 INFO [train.py:1087] (1/4) Epoch 73, batch 400, loss[loss=0.1445, simple_loss=0.2437, pruned_loss=0.02263, over 24773.00 frames. ], tot_loss[loss=0.1489, simple_loss=0.2423, pruned_loss=0.02776, over 4182640.07 frames. ], batch size: 70, lr: 3.36e-03, grad_scale: 32.0 2023-12-04 21:01:53,664 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=432266.6666666667, ans=0.2 2023-12-04 21:02:01,250 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.20 vs. 
limit=22.5 2023-12-04 21:02:49,106 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=432533.3333333333, ans=0.125 2023-12-04 21:02:50,416 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=432533.3333333333, ans=0.1 2023-12-04 21:02:56,773 INFO [train.py:1087] (1/4) Epoch 73, batch 450, loss[loss=0.1414, simple_loss=0.2329, pruned_loss=0.02496, over 24814.00 frames. ], tot_loss[loss=0.1489, simple_loss=0.2424, pruned_loss=0.02769, over 4315239.73 frames. ], batch size: 72, lr: 3.36e-03, grad_scale: 32.0 2023-12-04 21:03:23,373 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.78 vs. limit=15.0 2023-12-04 21:03:26,804 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.109e+02 1.266e+02 1.338e+02 1.464e+02 1.805e+02, threshold=2.676e+02, percent-clipped=0.0 2023-12-04 21:03:36,743 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=432800.0, ans=0.0 2023-12-04 21:03:41,784 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=432800.0, ans=0.125 2023-12-04 21:03:45,400 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=432800.0, ans=0.1 2023-12-04 21:03:50,555 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=432866.6666666667, ans=0.0 2023-12-04 21:04:02,556 INFO [train.py:1087] (1/4) Epoch 73, batch 500, loss[loss=0.1802, simple_loss=0.2644, pruned_loss=0.04798, over 16919.00 frames. ], tot_loss[loss=0.1487, simple_loss=0.2423, pruned_loss=0.02755, over 4436403.79 frames. ], batch size: 178, lr: 3.36e-03, grad_scale: 32.0 2023-12-04 21:04:08,208 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=432933.3333333333, ans=0.0 2023-12-04 21:04:20,285 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=433000.0, ans=0.1 2023-12-04 21:05:08,149 INFO [train.py:1087] (1/4) Epoch 73, batch 550, loss[loss=0.1402, simple_loss=0.2395, pruned_loss=0.02042, over 21670.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2421, pruned_loss=0.02741, over 4528009.69 frames. ], batch size: 128, lr: 3.36e-03, grad_scale: 32.0 2023-12-04 21:05:28,871 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=433333.3333333333, ans=0.0 2023-12-04 21:05:39,653 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.111e+02 1.258e+02 1.337e+02 1.412e+02 2.063e+02, threshold=2.674e+02, percent-clipped=0.0 2023-12-04 21:05:42,466 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=433400.0, ans=0.0 2023-12-04 21:06:15,627 INFO [train.py:1087] (1/4) Epoch 73, batch 600, loss[loss=0.1631, simple_loss=0.252, pruned_loss=0.03713, over 24342.00 frames. ], tot_loss[loss=0.1488, simple_loss=0.2423, pruned_loss=0.02764, over 4564211.78 frames. 
], batch size: 79, lr: 3.36e-03, grad_scale: 32.0 2023-12-04 21:06:19,898 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=433600.0, ans=0.125 2023-12-04 21:06:24,154 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=433600.0, ans=0.0 2023-12-04 21:06:42,961 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=433733.3333333333, ans=0.07 2023-12-04 21:06:44,261 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=433733.3333333333, ans=0.0 2023-12-04 21:06:51,074 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=433733.3333333333, ans=0.0 2023-12-04 21:06:52,263 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 21:06:55,073 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=433800.0, ans=0.125 2023-12-04 21:06:59,203 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.92 vs. limit=15.0 2023-12-04 21:07:00,121 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=433800.0, ans=0.125 2023-12-04 21:07:19,077 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=433866.6666666667, ans=0.125 2023-12-04 21:07:22,680 INFO [train.py:1087] (1/4) Epoch 73, batch 650, loss[loss=0.1561, simple_loss=0.2495, pruned_loss=0.03139, over 24460.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2418, pruned_loss=0.02755, over 4639379.63 frames. ], batch size: 77, lr: 3.35e-03, grad_scale: 32.0 2023-12-04 21:07:40,433 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=434000.0, ans=0.125 2023-12-04 21:07:54,266 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.143e+02 1.262e+02 1.372e+02 1.499e+02 1.934e+02, threshold=2.743e+02, percent-clipped=0.0 2023-12-04 21:08:18,067 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=434200.0, ans=0.125 2023-12-04 21:08:23,626 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=434200.0, ans=0.125 2023-12-04 21:08:30,286 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.82 vs. limit=15.0 2023-12-04 21:08:30,734 INFO [train.py:1087] (1/4) Epoch 73, batch 700, loss[loss=0.1458, simple_loss=0.243, pruned_loss=0.02432, over 24798.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2418, pruned_loss=0.02761, over 4682948.38 frames. 
], batch size: 72, lr: 3.35e-03, grad_scale: 32.0 2023-12-04 21:08:43,554 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=434333.3333333333, ans=0.125 2023-12-04 21:09:00,467 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=434400.0, ans=0.2 2023-12-04 21:09:11,888 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=434466.6666666667, ans=0.125 2023-12-04 21:09:24,604 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=434533.3333333333, ans=0.1 2023-12-04 21:09:28,962 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.92 vs. limit=15.0 2023-12-04 21:09:37,104 INFO [train.py:1087] (1/4) Epoch 73, batch 750, loss[loss=0.1449, simple_loss=0.2395, pruned_loss=0.02513, over 24713.00 frames. ], tot_loss[loss=0.1488, simple_loss=0.2421, pruned_loss=0.02773, over 4708884.81 frames. ], batch size: 69, lr: 3.35e-03, grad_scale: 32.0 2023-12-04 21:09:52,560 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.08 vs. limit=15.0 2023-12-04 21:09:57,459 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=434666.6666666667, ans=0.1 2023-12-04 21:10:00,357 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.36 vs. limit=22.5 2023-12-04 21:10:07,604 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.279e+02 1.357e+02 1.516e+02 2.377e+02, threshold=2.715e+02, percent-clipped=0.0 2023-12-04 21:10:09,303 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=434733.3333333333, ans=0.125 2023-12-04 21:10:29,804 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=434866.6666666667, ans=0.2 2023-12-04 21:10:31,277 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=434866.6666666667, ans=0.5 2023-12-04 21:10:43,642 INFO [train.py:1087] (1/4) Epoch 73, batch 800, loss[loss=0.1517, simple_loss=0.2449, pruned_loss=0.02931, over 24569.00 frames. ], tot_loss[loss=0.1487, simple_loss=0.242, pruned_loss=0.02769, over 4739887.43 frames. ], batch size: 64, lr: 3.35e-03, grad_scale: 32.0 2023-12-04 21:10:46,611 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=434933.3333333333, ans=0.125 2023-12-04 21:11:09,374 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=435066.6666666667, ans=0.0 2023-12-04 21:11:11,220 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.24 vs. 
limit=15.0 2023-12-04 21:11:38,272 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=435200.0, ans=0.1 2023-12-04 21:11:43,667 INFO [train.py:1087] (1/4) Epoch 73, batch 850, loss[loss=0.1515, simple_loss=0.2454, pruned_loss=0.02881, over 24544.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2417, pruned_loss=0.02759, over 4776207.35 frames. ], batch size: 63, lr: 3.35e-03, grad_scale: 32.0 2023-12-04 21:11:43,934 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=435266.6666666667, ans=10.0 2023-12-04 21:12:10,278 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.092e+02 1.277e+02 1.358e+02 1.521e+02 2.220e+02, threshold=2.716e+02, percent-clipped=0.0 2023-12-04 21:12:13,816 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=435400.0, ans=0.95 2023-12-04 21:12:22,353 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=435466.6666666667, ans=0.1 2023-12-04 21:12:47,980 INFO [train.py:1087] (1/4) Epoch 74, batch 0, loss[loss=0.1551, simple_loss=0.2481, pruned_loss=0.03105, over 23535.00 frames. ], tot_loss[loss=0.1551, simple_loss=0.2481, pruned_loss=0.03105, over 23535.00 frames. ], batch size: 94, lr: 3.33e-03, grad_scale: 32.0 2023-12-04 21:12:47,981 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 21:12:56,937 INFO [zipformer.py:1876] (1/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([6.0426, 5.6172, 5.7321, 5.6862], device='cuda:1') 2023-12-04 21:13:02,603 INFO [train.py:1119] (1/4) Epoch 74, validation: loss=0.1507, simple_loss=0.2468, pruned_loss=0.02733, over 944034.00 frames. 2023-12-04 21:13:02,604 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 21:13:03,556 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.65 vs. limit=10.0 2023-12-04 21:13:07,779 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=435566.6666666667, ans=0.0 2023-12-04 21:13:36,645 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=435700.0, ans=0.1 2023-12-04 21:13:41,542 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=435766.6666666667, ans=0.0 2023-12-04 21:13:44,137 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=435766.6666666667, ans=0.125 2023-12-04 21:13:49,044 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=435766.6666666667, ans=0.125 2023-12-04 21:14:07,801 INFO [train.py:1087] (1/4) Epoch 74, batch 50, loss[loss=0.148, simple_loss=0.2437, pruned_loss=0.02613, over 24801.00 frames. ], tot_loss[loss=0.1499, simple_loss=0.2434, pruned_loss=0.02822, over 1077011.36 frames. 
], batch size: 71, lr: 3.32e-03, grad_scale: 64.0 2023-12-04 21:14:11,880 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 21:14:18,933 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=435900.0, ans=0.125 2023-12-04 21:14:35,980 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=436033.3333333333, ans=0.125 2023-12-04 21:14:44,575 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.109e+02 1.267e+02 1.359e+02 1.516e+02 1.990e+02, threshold=2.718e+02, percent-clipped=0.0 2023-12-04 21:14:46,441 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 21:14:58,565 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=436100.0, ans=0.0 2023-12-04 21:15:03,639 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=436166.6666666667, ans=0.1 2023-12-04 21:15:03,640 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=436166.6666666667, ans=0.2 2023-12-04 21:15:14,000 INFO [train.py:1087] (1/4) Epoch 74, batch 100, loss[loss=0.156, simple_loss=0.2473, pruned_loss=0.03234, over 22821.00 frames. ], tot_loss[loss=0.1499, simple_loss=0.2436, pruned_loss=0.02808, over 1902352.83 frames. ], batch size: 106, lr: 3.32e-03, grad_scale: 64.0 2023-12-04 21:15:42,679 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=436366.6666666667, ans=0.2 2023-12-04 21:15:43,921 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=436366.6666666667, ans=0.125 2023-12-04 21:15:44,999 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=436366.6666666667, ans=0.125 2023-12-04 21:15:57,836 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=436433.3333333333, ans=0.125 2023-12-04 21:16:17,160 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=436500.0, ans=0.1 2023-12-04 21:16:19,459 INFO [train.py:1087] (1/4) Epoch 74, batch 150, loss[loss=0.1413, simple_loss=0.2353, pruned_loss=0.02365, over 24773.00 frames. ], tot_loss[loss=0.1487, simple_loss=0.2422, pruned_loss=0.0276, over 2560602.34 frames. ], batch size: 70, lr: 3.32e-03, grad_scale: 32.0 2023-12-04 21:16:26,689 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.92 vs. 
limit=12.0 2023-12-04 21:16:34,869 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 21:16:37,248 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=436633.3333333333, ans=0.125 2023-12-04 21:16:40,019 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 21:16:57,618 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.082e+02 1.269e+02 1.335e+02 1.449e+02 2.080e+02, threshold=2.669e+02, percent-clipped=0.0 2023-12-04 21:17:10,178 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=436766.6666666667, ans=0.0 2023-12-04 21:17:11,489 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=436833.3333333333, ans=0.125 2023-12-04 21:17:15,872 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-12-04 21:17:25,263 INFO [train.py:1087] (1/4) Epoch 74, batch 200, loss[loss=0.1503, simple_loss=0.2434, pruned_loss=0.02861, over 24741.00 frames. ], tot_loss[loss=0.1491, simple_loss=0.2428, pruned_loss=0.02772, over 3063562.01 frames. ], batch size: 69, lr: 3.32e-03, grad_scale: 32.0 2023-12-04 21:17:29,591 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=436900.0, ans=0.0 2023-12-04 21:17:51,067 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=437033.3333333333, ans=0.1 2023-12-04 21:17:56,530 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.12 vs. limit=15.0 2023-12-04 21:18:10,445 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=437100.0, ans=0.125 2023-12-04 21:18:31,447 INFO [train.py:1087] (1/4) Epoch 74, batch 250, loss[loss=0.1511, simple_loss=0.2481, pruned_loss=0.02701, over 24251.00 frames. ], tot_loss[loss=0.1487, simple_loss=0.2423, pruned_loss=0.0276, over 3442856.38 frames. ], batch size: 82, lr: 3.32e-03, grad_scale: 32.0 2023-12-04 21:18:32,145 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.00 vs. limit=15.0 2023-12-04 21:18:40,426 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=437233.3333333333, ans=0.5 2023-12-04 21:19:07,484 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=437366.6666666667, ans=0.0 2023-12-04 21:19:09,605 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.079e+02 1.279e+02 1.357e+02 1.509e+02 2.105e+02, threshold=2.714e+02, percent-clipped=0.0 2023-12-04 21:19:24,355 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.35 vs. 
limit=15.0 2023-12-04 21:19:33,270 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=437500.0, ans=0.1 2023-12-04 21:19:37,280 INFO [train.py:1087] (1/4) Epoch 74, batch 300, loss[loss=0.143, simple_loss=0.2332, pruned_loss=0.02644, over 24562.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.242, pruned_loss=0.02752, over 3752455.21 frames. ], batch size: 66, lr: 3.32e-03, grad_scale: 32.0 2023-12-04 21:19:57,176 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.05 vs. limit=6.0 2023-12-04 21:20:04,761 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=15.0 2023-12-04 21:20:18,152 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.48 vs. limit=15.0 2023-12-04 21:20:19,087 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=437766.6666666667, ans=0.125 2023-12-04 21:20:28,143 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=437766.6666666667, ans=0.2 2023-12-04 21:20:30,611 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=437833.3333333333, ans=0.2 2023-12-04 21:20:42,673 INFO [train.py:1087] (1/4) Epoch 74, batch 350, loss[loss=0.1538, simple_loss=0.2415, pruned_loss=0.03308, over 24121.00 frames. ], tot_loss[loss=0.1489, simple_loss=0.2422, pruned_loss=0.02779, over 3995467.84 frames. ], batch size: 82, lr: 3.32e-03, grad_scale: 32.0 2023-12-04 21:20:44,208 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=437900.0, ans=0.2 2023-12-04 21:20:56,694 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=437966.6666666667, ans=0.125 2023-12-04 21:21:21,344 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.136e+02 1.265e+02 1.357e+02 1.512e+02 2.007e+02, threshold=2.714e+02, percent-clipped=0.0 2023-12-04 21:21:37,459 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=438166.6666666667, ans=0.1 2023-12-04 21:21:49,765 INFO [train.py:1087] (1/4) Epoch 74, batch 400, loss[loss=0.1469, simple_loss=0.241, pruned_loss=0.02638, over 24744.00 frames. ], tot_loss[loss=0.1486, simple_loss=0.2421, pruned_loss=0.02759, over 4182457.54 frames. ], batch size: 63, lr: 3.32e-03, grad_scale: 32.0 2023-12-04 21:22:24,591 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=438366.6666666667, ans=0.125 2023-12-04 21:22:56,785 INFO [train.py:1087] (1/4) Epoch 74, batch 450, loss[loss=0.1422, simple_loss=0.2404, pruned_loss=0.02203, over 24561.00 frames. ], tot_loss[loss=0.1483, simple_loss=0.242, pruned_loss=0.02732, over 4337905.38 frames. ], batch size: 66, lr: 3.31e-03, grad_scale: 32.0 2023-12-04 21:23:29,508 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.00 vs. 
limit=10.0 2023-12-04 21:23:34,857 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.047e+02 1.260e+02 1.354e+02 1.483e+02 1.828e+02, threshold=2.707e+02, percent-clipped=0.0 2023-12-04 21:24:02,993 INFO [train.py:1087] (1/4) Epoch 74, batch 500, loss[loss=0.1456, simple_loss=0.2355, pruned_loss=0.02784, over 24252.00 frames. ], tot_loss[loss=0.1481, simple_loss=0.2416, pruned_loss=0.02731, over 4440423.76 frames. ], batch size: 79, lr: 3.31e-03, grad_scale: 32.0 2023-12-04 21:24:21,286 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=438966.6666666667, ans=0.0 2023-12-04 21:24:29,827 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=439033.3333333333, ans=0.05 2023-12-04 21:24:29,986 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=439033.3333333333, ans=0.1 2023-12-04 21:25:05,750 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=439166.6666666667, ans=0.125 2023-12-04 21:25:07,891 INFO [train.py:1087] (1/4) Epoch 74, batch 550, loss[loss=0.151, simple_loss=0.2464, pruned_loss=0.02778, over 24557.00 frames. ], tot_loss[loss=0.1486, simple_loss=0.242, pruned_loss=0.02755, over 4507560.40 frames. ], batch size: 63, lr: 3.31e-03, grad_scale: 32.0 2023-12-04 21:25:21,258 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.09 vs. limit=15.0 2023-12-04 21:25:24,371 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=439300.0, ans=0.0 2023-12-04 21:25:36,507 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 21:25:46,156 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.133e+02 1.258e+02 1.337e+02 1.431e+02 1.814e+02, threshold=2.673e+02, percent-clipped=0.0 2023-12-04 21:25:55,462 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=439433.3333333333, ans=0.125 2023-12-04 21:26:08,538 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=439500.0, ans=0.2 2023-12-04 21:26:14,516 INFO [train.py:1087] (1/4) Epoch 74, batch 600, loss[loss=0.1454, simple_loss=0.2412, pruned_loss=0.02479, over 21402.00 frames. ], tot_loss[loss=0.1486, simple_loss=0.2421, pruned_loss=0.02753, over 4579790.70 frames. ], batch size: 128, lr: 3.31e-03, grad_scale: 32.0 2023-12-04 21:26:15,295 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.82 vs. limit=15.0 2023-12-04 21:26:17,360 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=439566.6666666667, ans=0.125 2023-12-04 21:26:34,212 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.83 vs. 
limit=15.0 2023-12-04 21:26:36,509 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=439633.3333333333, ans=0.0 2023-12-04 21:26:40,750 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=439700.0, ans=0.125 2023-12-04 21:26:53,209 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=439766.6666666667, ans=0.125 2023-12-04 21:27:14,861 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=439833.3333333333, ans=0.125 2023-12-04 21:27:15,157 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.38 vs. limit=22.5 2023-12-04 21:27:21,259 INFO [train.py:1087] (1/4) Epoch 74, batch 650, loss[loss=0.1423, simple_loss=0.2339, pruned_loss=0.02533, over 24802.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2418, pruned_loss=0.02753, over 4627905.80 frames. ], batch size: 71, lr: 3.31e-03, grad_scale: 32.0 2023-12-04 21:27:28,091 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=439900.0, ans=0.125 2023-12-04 21:27:32,326 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.13 vs. limit=6.0 2023-12-04 21:27:48,263 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=6.016e-03 2023-12-04 21:27:57,558 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.86 vs. limit=12.0 2023-12-04 21:27:59,965 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.287e+02 1.377e+02 1.543e+02 1.843e+02, threshold=2.755e+02, percent-clipped=0.0 2023-12-04 21:28:14,347 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=440166.6666666667, ans=0.125 2023-12-04 21:28:28,089 INFO [train.py:1087] (1/4) Epoch 74, batch 700, loss[loss=0.1633, simple_loss=0.256, pruned_loss=0.03527, over 24198.00 frames. ], tot_loss[loss=0.1486, simple_loss=0.2419, pruned_loss=0.02768, over 4667217.51 frames. ], batch size: 82, lr: 3.31e-03, grad_scale: 32.0 2023-12-04 21:28:29,657 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=440233.3333333333, ans=0.125 2023-12-04 21:29:05,798 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=440366.6666666667, ans=0.0 2023-12-04 21:29:10,050 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.85 vs. limit=6.0 2023-12-04 21:29:11,003 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=440433.3333333333, ans=0.0 2023-12-04 21:29:11,370 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.54 vs. 
limit=15.0 2023-12-04 21:29:16,143 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=440433.3333333333, ans=0.2 2023-12-04 21:29:19,912 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=440500.0, ans=0.0 2023-12-04 21:29:20,423 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.74 vs. limit=6.0 2023-12-04 21:29:25,333 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=440500.0, ans=0.125 2023-12-04 21:29:34,216 INFO [train.py:1087] (1/4) Epoch 74, batch 750, loss[loss=0.1324, simple_loss=0.2214, pruned_loss=0.02172, over 24752.00 frames. ], tot_loss[loss=0.1487, simple_loss=0.2419, pruned_loss=0.02773, over 4700906.28 frames. ], batch size: 63, lr: 3.31e-03, grad_scale: 32.0 2023-12-04 21:29:34,421 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=440566.6666666667, ans=0.0 2023-12-04 21:29:54,251 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-12-04 21:30:03,260 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=440700.0, ans=0.0 2023-12-04 21:30:03,324 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=440700.0, ans=0.125 2023-12-04 21:30:04,643 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=440700.0, ans=0.0 2023-12-04 21:30:12,278 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.019e+02 1.271e+02 1.351e+02 1.450e+02 1.892e+02, threshold=2.703e+02, percent-clipped=0.0 2023-12-04 21:30:33,660 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.48 vs. limit=15.0 2023-12-04 21:30:39,691 INFO [train.py:1087] (1/4) Epoch 74, batch 800, loss[loss=0.1502, simple_loss=0.2436, pruned_loss=0.0284, over 24702.00 frames. ], tot_loss[loss=0.1484, simple_loss=0.2417, pruned_loss=0.02756, over 4730749.69 frames. ], batch size: 74, lr: 3.31e-03, grad_scale: 32.0 2023-12-04 21:30:41,381 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=440900.0, ans=0.0 2023-12-04 21:31:03,482 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.65 vs. limit=22.5 2023-12-04 21:31:15,991 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=441100.0, ans=0.0 2023-12-04 21:31:19,511 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=441100.0, ans=0.125 2023-12-04 21:31:26,371 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=441100.0, ans=0.125 2023-12-04 21:31:38,455 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.05 vs. 
limit=15.0 2023-12-04 21:31:40,125 INFO [train.py:1087] (1/4) Epoch 74, batch 850, loss[loss=0.162, simple_loss=0.2553, pruned_loss=0.03436, over 24193.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2418, pruned_loss=0.02764, over 4743124.77 frames. ], batch size: 82, lr: 3.30e-03, grad_scale: 32.0 2023-12-04 21:31:46,640 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.02 vs. limit=15.0 2023-12-04 21:32:10,724 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.32 vs. limit=22.5 2023-12-04 21:32:14,709 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.116e+02 1.298e+02 1.361e+02 1.454e+02 1.998e+02, threshold=2.723e+02, percent-clipped=0.0 2023-12-04 21:32:16,178 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=441433.3333333333, ans=0.125 2023-12-04 21:32:50,937 INFO [train.py:1087] (1/4) Epoch 75, batch 0, loss[loss=0.1404, simple_loss=0.2346, pruned_loss=0.02307, over 24765.00 frames. ], tot_loss[loss=0.1404, simple_loss=0.2346, pruned_loss=0.02307, over 24765.00 frames. ], batch size: 66, lr: 3.28e-03, grad_scale: 32.0 2023-12-04 21:32:50,938 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 21:33:05,648 INFO [train.py:1119] (1/4) Epoch 75, validation: loss=0.1512, simple_loss=0.247, pruned_loss=0.02763, over 944034.00 frames. 2023-12-04 21:33:05,650 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 21:33:09,779 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=441533.3333333333, ans=0.2 2023-12-04 21:33:23,614 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=441600.0, ans=0.1 2023-12-04 21:33:30,760 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=441666.6666666667, ans=0.1 2023-12-04 21:34:11,148 INFO [train.py:1087] (1/4) Epoch 75, batch 50, loss[loss=0.1478, simple_loss=0.2461, pruned_loss=0.02477, over 24849.00 frames. ], tot_loss[loss=0.1483, simple_loss=0.242, pruned_loss=0.02736, over 1101412.69 frames. ], batch size: 68, lr: 3.28e-03, grad_scale: 32.0 2023-12-04 21:34:17,695 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=441866.6666666667, ans=0.125 2023-12-04 21:34:30,522 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=441933.3333333333, ans=0.125 2023-12-04 21:34:38,157 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=442000.0, ans=0.0 2023-12-04 21:34:55,357 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.136e+02 1.309e+02 1.399e+02 1.562e+02 2.287e+02, threshold=2.798e+02, percent-clipped=0.0 2023-12-04 21:35:15,736 INFO [train.py:1087] (1/4) Epoch 75, batch 100, loss[loss=0.148, simple_loss=0.2422, pruned_loss=0.02691, over 24774.00 frames. ], tot_loss[loss=0.1497, simple_loss=0.2431, pruned_loss=0.02817, over 1907351.66 frames. 
], batch size: 72, lr: 3.28e-03, grad_scale: 32.0 2023-12-04 21:35:16,169 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=442200.0, ans=0.125 2023-12-04 21:35:19,076 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=442200.0, ans=0.125 2023-12-04 21:35:33,518 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=442266.6666666667, ans=0.1 2023-12-04 21:35:41,274 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=442333.3333333333, ans=0.1 2023-12-04 21:36:03,081 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 21:36:11,667 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=442466.6666666667, ans=0.1 2023-12-04 21:36:13,309 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.45 vs. limit=15.0 2023-12-04 21:36:15,973 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.64 vs. limit=22.5 2023-12-04 21:36:20,513 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=442533.3333333333, ans=0.125 2023-12-04 21:36:20,928 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.57 vs. limit=22.5 2023-12-04 21:36:22,003 INFO [train.py:1087] (1/4) Epoch 75, batch 150, loss[loss=0.1406, simple_loss=0.2346, pruned_loss=0.0233, over 23510.00 frames. ], tot_loss[loss=0.1489, simple_loss=0.2421, pruned_loss=0.02785, over 2544458.20 frames. ], batch size: 94, lr: 3.28e-03, grad_scale: 32.0 2023-12-04 21:36:38,204 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=442600.0, ans=0.1 2023-12-04 21:36:42,009 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=442600.0, ans=0.125 2023-12-04 21:36:44,440 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 21:37:02,632 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=22.5 2023-12-04 21:37:03,638 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=442733.3333333333, ans=0.95 2023-12-04 21:37:05,690 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.108e+02 1.287e+02 1.378e+02 1.496e+02 1.803e+02, threshold=2.756e+02, percent-clipped=0.0 2023-12-04 21:37:13,338 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.06 vs. 
limit=15.0 2023-12-04 21:37:24,791 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=442866.6666666667, ans=0.125 2023-12-04 21:37:25,775 INFO [train.py:1087] (1/4) Epoch 75, batch 200, loss[loss=0.182, simple_loss=0.2676, pruned_loss=0.04824, over 16809.00 frames. ], tot_loss[loss=0.1487, simple_loss=0.2421, pruned_loss=0.02766, over 3038515.45 frames. ], batch size: 181, lr: 3.28e-03, grad_scale: 32.0 2023-12-04 21:37:27,360 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=442866.6666666667, ans=0.125 2023-12-04 21:37:54,024 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=443000.0, ans=0.0 2023-12-04 21:38:06,472 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=443066.6666666667, ans=0.0 2023-12-04 21:38:10,130 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=443066.6666666667, ans=0.2 2023-12-04 21:38:30,001 INFO [train.py:1087] (1/4) Epoch 75, batch 250, loss[loss=0.1479, simple_loss=0.2426, pruned_loss=0.02662, over 24811.00 frames. ], tot_loss[loss=0.1487, simple_loss=0.242, pruned_loss=0.0277, over 3435546.26 frames. ], batch size: 73, lr: 3.27e-03, grad_scale: 32.0 2023-12-04 21:38:36,458 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=443200.0, ans=0.1 2023-12-04 21:38:49,091 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.00 vs. limit=10.0 2023-12-04 21:38:56,476 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=443333.3333333333, ans=0.125 2023-12-04 21:38:56,561 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=443333.3333333333, ans=0.07 2023-12-04 21:38:58,176 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=443333.3333333333, ans=0.0 2023-12-04 21:39:12,908 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.106e+02 1.284e+02 1.356e+02 1.481e+02 1.863e+02, threshold=2.712e+02, percent-clipped=0.0 2023-12-04 21:39:13,702 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.35 vs. limit=15.0 2023-12-04 21:39:33,067 INFO [train.py:1087] (1/4) Epoch 75, batch 300, loss[loss=0.1527, simple_loss=0.2488, pruned_loss=0.02827, over 24793.00 frames. ], tot_loss[loss=0.1486, simple_loss=0.2422, pruned_loss=0.02756, over 3755081.32 frames. ], batch size: 62, lr: 3.27e-03, grad_scale: 32.0 2023-12-04 21:39:39,521 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=443533.3333333333, ans=0.02 2023-12-04 21:39:44,824 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.38 vs. limit=22.5 2023-12-04 21:39:52,373 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.42 vs. 
limit=15.0 2023-12-04 21:39:55,636 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=443600.0, ans=0.1 2023-12-04 21:40:17,886 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=443733.3333333333, ans=0.125 2023-12-04 21:40:32,049 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=443800.0, ans=0.1 2023-12-04 21:40:37,625 INFO [train.py:1087] (1/4) Epoch 75, batch 350, loss[loss=0.1479, simple_loss=0.2423, pruned_loss=0.02677, over 23996.00 frames. ], tot_loss[loss=0.1484, simple_loss=0.2418, pruned_loss=0.02746, over 3994682.50 frames. ], batch size: 87, lr: 3.27e-03, grad_scale: 32.0 2023-12-04 21:40:54,258 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.10 vs. limit=22.5 2023-12-04 21:40:58,860 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=443933.3333333333, ans=0.125 2023-12-04 21:41:02,731 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=444000.0, ans=0.125 2023-12-04 21:41:19,169 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 21:41:21,268 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.100e+02 1.270e+02 1.334e+02 1.449e+02 1.837e+02, threshold=2.668e+02, percent-clipped=0.0 2023-12-04 21:41:22,772 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=444066.6666666667, ans=0.025 2023-12-04 21:41:41,400 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=444200.0, ans=0.0 2023-12-04 21:41:42,450 INFO [train.py:1087] (1/4) Epoch 75, batch 400, loss[loss=0.1467, simple_loss=0.2375, pruned_loss=0.02791, over 24710.00 frames. ], tot_loss[loss=0.1484, simple_loss=0.2415, pruned_loss=0.02763, over 4185994.10 frames. ], batch size: 67, lr: 3.27e-03, grad_scale: 32.0 2023-12-04 21:41:53,962 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=444266.6666666667, ans=0.1 2023-12-04 21:42:05,868 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=444266.6666666667, ans=0.125 2023-12-04 21:42:35,957 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 21:42:36,140 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=444466.6666666667, ans=0.1 2023-12-04 21:42:47,158 INFO [train.py:1087] (1/4) Epoch 75, batch 450, loss[loss=0.1388, simple_loss=0.2354, pruned_loss=0.02106, over 24755.00 frames. ], tot_loss[loss=0.148, simple_loss=0.2412, pruned_loss=0.02743, over 4338002.72 frames. ], batch size: 70, lr: 3.27e-03, grad_scale: 32.0 2023-12-04 21:42:56,484 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.09 vs. 
limit=22.5 2023-12-04 21:43:09,950 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.62 vs. limit=15.0 2023-12-04 21:43:23,068 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=444666.6666666667, ans=0.0 2023-12-04 21:43:26,175 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=444733.3333333333, ans=0.125 2023-12-04 21:43:30,953 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.121e+02 1.270e+02 1.368e+02 1.485e+02 2.151e+02, threshold=2.736e+02, percent-clipped=0.0 2023-12-04 21:43:39,812 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=444800.0, ans=0.1 2023-12-04 21:43:47,144 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=444800.0, ans=0.0 2023-12-04 21:43:50,517 INFO [train.py:1087] (1/4) Epoch 75, batch 500, loss[loss=0.1396, simple_loss=0.2366, pruned_loss=0.02126, over 24557.00 frames. ], tot_loss[loss=0.148, simple_loss=0.2412, pruned_loss=0.02744, over 4452110.90 frames. ], batch size: 62, lr: 3.27e-03, grad_scale: 32.0 2023-12-04 21:43:52,463 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=444866.6666666667, ans=0.125 2023-12-04 21:43:52,822 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-12-04 21:44:04,969 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=444933.3333333333, ans=0.1 2023-12-04 21:44:17,798 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=445000.0, ans=0.125 2023-12-04 21:44:35,739 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.69 vs. limit=15.0 2023-12-04 21:44:36,772 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=445066.6666666667, ans=0.125 2023-12-04 21:44:48,236 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=445133.3333333333, ans=0.125 2023-12-04 21:44:48,371 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=445133.3333333333, ans=0.125 2023-12-04 21:44:56,512 INFO [train.py:1087] (1/4) Epoch 75, batch 550, loss[loss=0.1446, simple_loss=0.2371, pruned_loss=0.02606, over 24541.00 frames. ], tot_loss[loss=0.1481, simple_loss=0.2415, pruned_loss=0.02739, over 4547471.61 frames. 
], batch size: 62, lr: 3.27e-03, grad_scale: 32.0 2023-12-04 21:44:58,193 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=445200.0, ans=0.125 2023-12-04 21:45:05,757 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=445200.0, ans=0.125 2023-12-04 21:45:09,223 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=445266.6666666667, ans=0.2 2023-12-04 21:45:11,622 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=445266.6666666667, ans=0.125 2023-12-04 21:45:12,012 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.75 vs. limit=15.0 2023-12-04 21:45:14,167 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=445266.6666666667, ans=0.0 2023-12-04 21:45:40,178 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.137e+02 1.270e+02 1.369e+02 1.475e+02 1.946e+02, threshold=2.737e+02, percent-clipped=0.0 2023-12-04 21:45:42,521 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=445400.0, ans=0.0 2023-12-04 21:46:02,143 INFO [train.py:1087] (1/4) Epoch 75, batch 600, loss[loss=0.1437, simple_loss=0.2365, pruned_loss=0.02544, over 24770.00 frames. ], tot_loss[loss=0.1483, simple_loss=0.2417, pruned_loss=0.02747, over 4587805.70 frames. ], batch size: 64, lr: 3.27e-03, grad_scale: 32.0 2023-12-04 21:46:11,125 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=445533.3333333333, ans=0.0 2023-12-04 21:46:20,710 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=445600.0, ans=0.125 2023-12-04 21:46:27,011 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=445666.6666666667, ans=0.125 2023-12-04 21:47:05,447 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=445800.0, ans=0.1 2023-12-04 21:47:05,827 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.53 vs. limit=15.0 2023-12-04 21:47:07,593 INFO [train.py:1087] (1/4) Epoch 75, batch 650, loss[loss=0.1454, simple_loss=0.2408, pruned_loss=0.02498, over 24549.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2419, pruned_loss=0.02754, over 4620134.52 frames. ], batch size: 66, lr: 3.27e-03, grad_scale: 32.0 2023-12-04 21:47:12,168 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.08 vs. limit=15.0 2023-12-04 21:47:25,058 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.68 vs. 
limit=15.0 2023-12-04 21:47:25,171 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=445933.3333333333, ans=15.0 2023-12-04 21:47:33,939 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=446000.0, ans=0.125 2023-12-04 21:47:38,468 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=446000.0, ans=0.2 2023-12-04 21:47:41,475 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.50 vs. limit=15.0 2023-12-04 21:47:51,534 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.131e+02 1.294e+02 1.373e+02 1.463e+02 1.900e+02, threshold=2.746e+02, percent-clipped=0.0 2023-12-04 21:47:53,080 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=446066.6666666667, ans=0.125 2023-12-04 21:48:06,386 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.19 vs. limit=15.0 2023-12-04 21:48:12,770 INFO [train.py:1087] (1/4) Epoch 75, batch 700, loss[loss=0.1421, simple_loss=0.2364, pruned_loss=0.02394, over 24690.00 frames. ], tot_loss[loss=0.1484, simple_loss=0.2418, pruned_loss=0.02749, over 4669531.23 frames. ], batch size: 69, lr: 3.26e-03, grad_scale: 32.0 2023-12-04 21:48:40,497 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=446333.3333333333, ans=0.125 2023-12-04 21:48:41,787 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=446333.3333333333, ans=0.1 2023-12-04 21:48:59,354 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=446400.0, ans=0.0 2023-12-04 21:49:02,771 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=446400.0, ans=0.125 2023-12-04 21:49:04,425 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.01 vs. limit=22.5 2023-12-04 21:49:13,114 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.54 vs. limit=22.5 2023-12-04 21:49:17,446 INFO [train.py:1087] (1/4) Epoch 75, batch 750, loss[loss=0.1456, simple_loss=0.2337, pruned_loss=0.02878, over 24449.00 frames. ], tot_loss[loss=0.1486, simple_loss=0.2419, pruned_loss=0.02768, over 4692555.04 frames. 
], batch size: 77, lr: 3.26e-03, grad_scale: 16.0 2023-12-04 21:49:37,677 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=446600.0, ans=0.125 2023-12-04 21:49:42,556 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=446666.6666666667, ans=0.2 2023-12-04 21:49:44,187 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=446666.6666666667, ans=0.125 2023-12-04 21:50:03,652 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.085e+02 1.259e+02 1.344e+02 1.448e+02 1.875e+02, threshold=2.689e+02, percent-clipped=0.0 2023-12-04 21:50:08,933 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=446800.0, ans=0.05 2023-12-04 21:50:21,875 INFO [train.py:1087] (1/4) Epoch 75, batch 800, loss[loss=0.1425, simple_loss=0.2375, pruned_loss=0.02373, over 24728.00 frames. ], tot_loss[loss=0.1482, simple_loss=0.2416, pruned_loss=0.02742, over 4717740.64 frames. ], batch size: 67, lr: 3.26e-03, grad_scale: 32.0 2023-12-04 21:50:52,264 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.29 vs. limit=15.0 2023-12-04 21:51:06,006 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.66 vs. limit=22.5 2023-12-04 21:51:10,129 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=447133.3333333333, ans=0.07 2023-12-04 21:51:20,975 INFO [train.py:1087] (1/4) Epoch 75, batch 850, loss[loss=0.1456, simple_loss=0.2429, pruned_loss=0.0242, over 24853.00 frames. ], tot_loss[loss=0.148, simple_loss=0.2414, pruned_loss=0.02734, over 4746306.57 frames. ], batch size: 68, lr: 3.26e-03, grad_scale: 32.0 2023-12-04 21:51:21,225 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=447200.0, ans=0.125 2023-12-04 21:51:26,702 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=447200.0, ans=0.125 2023-12-04 21:51:35,911 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=447266.6666666667, ans=0.125 2023-12-04 21:51:42,512 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=447333.3333333333, ans=0.0 2023-12-04 21:52:00,230 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.158e+02 1.286e+02 1.370e+02 1.501e+02 2.135e+02, threshold=2.740e+02, percent-clipped=0.0 2023-12-04 21:52:00,433 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=447400.0, ans=0.125 2023-12-04 21:52:08,412 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=447466.6666666667, ans=0.125 2023-12-04 21:52:29,115 INFO [train.py:1087] (1/4) Epoch 76, batch 0, loss[loss=0.1452, simple_loss=0.2392, pruned_loss=0.02561, over 24559.00 frames. ], tot_loss[loss=0.1452, simple_loss=0.2392, pruned_loss=0.02561, over 24559.00 frames. 
], batch size: 66, lr: 3.24e-03, grad_scale: 32.0 2023-12-04 21:52:29,116 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 21:52:43,467 INFO [train.py:1119] (1/4) Epoch 76, validation: loss=0.1514, simple_loss=0.2471, pruned_loss=0.02786, over 944034.00 frames. 2023-12-04 21:52:43,468 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 21:52:43,787 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=447500.0, ans=0.0 2023-12-04 21:52:54,948 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=447566.6666666667, ans=0.125 2023-12-04 21:53:06,818 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=447566.6666666667, ans=0.125 2023-12-04 21:53:47,239 INFO [train.py:1087] (1/4) Epoch 76, batch 50, loss[loss=0.1558, simple_loss=0.2471, pruned_loss=0.03228, over 23550.00 frames. ], tot_loss[loss=0.1482, simple_loss=0.2415, pruned_loss=0.02745, over 1086393.85 frames. ], batch size: 94, lr: 3.24e-03, grad_scale: 16.0 2023-12-04 21:54:05,748 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.83 vs. limit=15.0 2023-12-04 21:54:13,983 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=447966.6666666667, ans=0.0 2023-12-04 21:54:14,078 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=447966.6666666667, ans=0.0 2023-12-04 21:54:25,462 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=448033.3333333333, ans=0.0 2023-12-04 21:54:38,793 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.125e+02 1.295e+02 1.384e+02 1.507e+02 2.552e+02, threshold=2.768e+02, percent-clipped=0.0 2023-12-04 21:54:49,815 INFO [train.py:1087] (1/4) Epoch 76, batch 100, loss[loss=0.152, simple_loss=0.2462, pruned_loss=0.02891, over 24576.00 frames. ], tot_loss[loss=0.1482, simple_loss=0.2418, pruned_loss=0.02734, over 1902679.80 frames. ], batch size: 64, lr: 3.24e-03, grad_scale: 16.0 2023-12-04 21:55:04,781 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=448233.3333333333, ans=0.125 2023-12-04 21:55:13,247 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 21:55:34,754 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.87 vs. limit=15.0 2023-12-04 21:55:54,148 INFO [train.py:1087] (1/4) Epoch 76, batch 150, loss[loss=0.1533, simple_loss=0.2497, pruned_loss=0.0285, over 24860.00 frames. ], tot_loss[loss=0.1484, simple_loss=0.242, pruned_loss=0.02745, over 2537194.79 frames. 
], batch size: 68, lr: 3.23e-03, grad_scale: 16.0 2023-12-04 21:55:55,647 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=448500.0, ans=0.0 2023-12-04 21:56:20,449 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=448633.3333333333, ans=0.125 2023-12-04 21:56:30,510 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=448633.3333333333, ans=0.125 2023-12-04 21:56:44,014 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=448700.0, ans=0.125 2023-12-04 21:56:47,487 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.122e+02 1.239e+02 1.337e+02 1.421e+02 1.871e+02, threshold=2.674e+02, percent-clipped=0.0 2023-12-04 21:56:47,813 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=448766.6666666667, ans=0.1 2023-12-04 21:56:59,068 INFO [train.py:1087] (1/4) Epoch 76, batch 200, loss[loss=0.1639, simple_loss=0.2551, pruned_loss=0.03638, over 17173.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.242, pruned_loss=0.02748, over 3032850.92 frames. ], batch size: 177, lr: 3.23e-03, grad_scale: 16.0 2023-12-04 21:57:05,620 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=448833.3333333333, ans=0.125 2023-12-04 21:57:06,727 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=448833.3333333333, ans=0.125 2023-12-04 21:57:26,704 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=448966.6666666667, ans=0.0 2023-12-04 21:57:29,279 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=448966.6666666667, ans=0.0 2023-12-04 21:57:38,736 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.02 vs. limit=6.0 2023-12-04 21:57:42,043 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=449033.3333333333, ans=0.04949747468305833 2023-12-04 21:57:59,392 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.27 vs. limit=15.0 2023-12-04 21:58:03,459 INFO [train.py:1087] (1/4) Epoch 76, batch 250, loss[loss=0.1517, simple_loss=0.2446, pruned_loss=0.0294, over 24573.00 frames. ], tot_loss[loss=0.1484, simple_loss=0.2419, pruned_loss=0.02742, over 3428813.79 frames. ], batch size: 65, lr: 3.23e-03, grad_scale: 16.0 2023-12-04 21:58:04,929 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=449166.6666666667, ans=0.125 2023-12-04 21:58:06,393 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.94 vs. 
limit=15.0 2023-12-04 21:58:12,333 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=449166.6666666667, ans=0.0 2023-12-04 21:58:30,101 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=449300.0, ans=0.2 2023-12-04 21:58:55,189 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=449433.3333333333, ans=0.0 2023-12-04 21:58:55,905 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.125e+02 1.278e+02 1.372e+02 1.483e+02 1.898e+02, threshold=2.745e+02, percent-clipped=0.0 2023-12-04 21:59:07,000 INFO [train.py:1087] (1/4) Epoch 76, batch 300, loss[loss=0.1576, simple_loss=0.2448, pruned_loss=0.03516, over 24540.00 frames. ], tot_loss[loss=0.1483, simple_loss=0.2418, pruned_loss=0.02737, over 3743729.19 frames. ], batch size: 75, lr: 3.23e-03, grad_scale: 16.0 2023-12-04 21:59:21,772 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=449566.6666666667, ans=0.1 2023-12-04 21:59:23,948 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=449566.6666666667, ans=0.0 2023-12-04 21:59:32,964 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=449633.3333333333, ans=0.05 2023-12-04 21:59:35,591 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=449633.3333333333, ans=0.0 2023-12-04 21:59:46,937 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=449700.0, ans=0.2 2023-12-04 22:00:01,309 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 22:00:11,242 INFO [train.py:1087] (1/4) Epoch 76, batch 350, loss[loss=0.1475, simple_loss=0.2475, pruned_loss=0.02374, over 24791.00 frames. ], tot_loss[loss=0.1481, simple_loss=0.2417, pruned_loss=0.02723, over 3983920.49 frames. ], batch size: 71, lr: 3.23e-03, grad_scale: 16.0 2023-12-04 22:00:15,519 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=449833.3333333333, ans=0.125 2023-12-04 22:00:16,037 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.66 vs. 
limit=22.5 2023-12-04 22:00:20,309 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=449833.3333333333, ans=0.125 2023-12-04 22:00:36,713 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=449966.6666666667, ans=0.125 2023-12-04 22:00:44,249 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=449966.6666666667, ans=0.125 2023-12-04 22:00:48,355 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=449966.6666666667, ans=0.0 2023-12-04 22:00:51,860 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=450033.3333333333, ans=0.0 2023-12-04 22:01:04,960 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.128e+02 1.318e+02 1.386e+02 1.479e+02 1.987e+02, threshold=2.773e+02, percent-clipped=0.0 2023-12-04 22:01:08,132 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-12-04 22:01:16,005 INFO [train.py:1087] (1/4) Epoch 76, batch 400, loss[loss=0.1468, simple_loss=0.2409, pruned_loss=0.02632, over 24565.00 frames. ], tot_loss[loss=0.1486, simple_loss=0.2419, pruned_loss=0.0276, over 4169426.07 frames. ], batch size: 64, lr: 3.23e-03, grad_scale: 32.0 2023-12-04 22:01:32,876 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=450233.3333333333, ans=0.0 2023-12-04 22:01:35,138 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=450233.3333333333, ans=0.1 2023-12-04 22:01:38,020 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=450233.3333333333, ans=0.95 2023-12-04 22:01:38,202 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=450233.3333333333, ans=0.1 2023-12-04 22:01:39,307 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=450233.3333333333, ans=0.05 2023-12-04 22:02:00,959 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=450366.6666666667, ans=0.5 2023-12-04 22:02:16,497 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=450433.3333333333, ans=0.125 2023-12-04 22:02:21,032 INFO [train.py:1087] (1/4) Epoch 76, batch 450, loss[loss=0.1515, simple_loss=0.2474, pruned_loss=0.0278, over 24706.00 frames. ], tot_loss[loss=0.1486, simple_loss=0.2419, pruned_loss=0.02762, over 4300246.71 frames. ], batch size: 69, lr: 3.23e-03, grad_scale: 32.0 2023-12-04 22:02:38,713 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=450566.6666666667, ans=0.125 2023-12-04 22:02:39,372 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.47 vs. 
limit=10.0 2023-12-04 22:02:58,001 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=450633.3333333333, ans=0.125 2023-12-04 22:03:00,366 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=450700.0, ans=0.0 2023-12-04 22:03:06,581 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=450700.0, ans=0.125 2023-12-04 22:03:13,404 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.145e+02 1.288e+02 1.371e+02 1.459e+02 1.960e+02, threshold=2.742e+02, percent-clipped=0.0 2023-12-04 22:03:25,622 INFO [train.py:1087] (1/4) Epoch 76, batch 500, loss[loss=0.1478, simple_loss=0.2416, pruned_loss=0.02698, over 24846.00 frames. ], tot_loss[loss=0.1482, simple_loss=0.2417, pruned_loss=0.02739, over 4422593.93 frames. ], batch size: 68, lr: 3.23e-03, grad_scale: 32.0 2023-12-04 22:04:28,686 INFO [train.py:1087] (1/4) Epoch 76, batch 550, loss[loss=0.1344, simple_loss=0.2279, pruned_loss=0.0204, over 24676.00 frames. ], tot_loss[loss=0.1482, simple_loss=0.2415, pruned_loss=0.02743, over 4522127.88 frames. ], batch size: 74, lr: 3.22e-03, grad_scale: 32.0 2023-12-04 22:04:33,552 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=451166.6666666667, ans=0.125 2023-12-04 22:04:34,734 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=451166.6666666667, ans=0.125 2023-12-04 22:04:40,533 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=451166.6666666667, ans=0.035 2023-12-04 22:04:44,416 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=451233.3333333333, ans=0.0 2023-12-04 22:05:02,439 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=451300.0, ans=0.125 2023-12-04 22:05:18,683 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 22:05:21,753 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.076e+02 1.266e+02 1.377e+02 1.479e+02 2.187e+02, threshold=2.754e+02, percent-clipped=0.0 2023-12-04 22:05:32,890 INFO [train.py:1087] (1/4) Epoch 76, batch 600, loss[loss=0.1434, simple_loss=0.2371, pruned_loss=0.0248, over 24724.00 frames. ], tot_loss[loss=0.1482, simple_loss=0.2416, pruned_loss=0.02744, over 4586611.76 frames. 
], batch size: 69, lr: 3.22e-03, grad_scale: 32.0 2023-12-04 22:05:40,985 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=451500.0, ans=0.09899494936611666 2023-12-04 22:05:44,652 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=451566.6666666667, ans=0.0 2023-12-04 22:05:54,653 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=451566.6666666667, ans=0.125 2023-12-04 22:05:57,074 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=451566.6666666667, ans=0.1 2023-12-04 22:06:03,481 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.32 vs. limit=15.0 2023-12-04 22:06:03,645 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.54 vs. limit=15.0 2023-12-04 22:06:06,763 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=451633.3333333333, ans=0.125 2023-12-04 22:06:37,306 INFO [train.py:1087] (1/4) Epoch 76, batch 650, loss[loss=0.1444, simple_loss=0.2342, pruned_loss=0.02728, over 24756.00 frames. ], tot_loss[loss=0.1482, simple_loss=0.2414, pruned_loss=0.02744, over 4636338.82 frames. ], batch size: 65, lr: 3.22e-03, grad_scale: 32.0 2023-12-04 22:06:57,700 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=451900.0, ans=10.0 2023-12-04 22:07:05,314 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=451966.6666666667, ans=0.1 2023-12-04 22:07:14,379 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=8.083e-03 2023-12-04 22:07:29,902 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.170e+02 1.268e+02 1.338e+02 1.426e+02 2.110e+02, threshold=2.676e+02, percent-clipped=0.0 2023-12-04 22:07:31,479 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=452100.0, ans=0.0 2023-12-04 22:07:40,653 INFO [train.py:1087] (1/4) Epoch 76, batch 700, loss[loss=0.1432, simple_loss=0.2395, pruned_loss=0.02342, over 24662.00 frames. ], tot_loss[loss=0.1483, simple_loss=0.2415, pruned_loss=0.02756, over 4676731.40 frames. 
], batch size: 74, lr: 3.22e-03, grad_scale: 16.0 2023-12-04 22:07:42,295 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=452166.6666666667, ans=0.125 2023-12-04 22:08:12,314 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=452300.0, ans=10.0 2023-12-04 22:08:24,340 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=452366.6666666667, ans=0.0 2023-12-04 22:08:30,249 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=452433.3333333333, ans=0.0 2023-12-04 22:08:36,140 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=452433.3333333333, ans=0.125 2023-12-04 22:08:43,691 INFO [train.py:1087] (1/4) Epoch 76, batch 750, loss[loss=0.142, simple_loss=0.2355, pruned_loss=0.02422, over 24810.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.2417, pruned_loss=0.02763, over 4684893.75 frames. ], batch size: 62, lr: 3.22e-03, grad_scale: 16.0 2023-12-04 22:09:33,904 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=452766.6666666667, ans=0.125 2023-12-04 22:09:36,903 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.131e+02 1.251e+02 1.333e+02 1.395e+02 1.598e+02, threshold=2.666e+02, percent-clipped=0.0 2023-12-04 22:09:46,933 INFO [train.py:1087] (1/4) Epoch 76, batch 800, loss[loss=0.1432, simple_loss=0.2357, pruned_loss=0.02531, over 24539.00 frames. ], tot_loss[loss=0.1478, simple_loss=0.2411, pruned_loss=0.02725, over 4725338.38 frames. ], batch size: 62, lr: 3.22e-03, grad_scale: 32.0 2023-12-04 22:09:59,870 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=452900.0, ans=0.125 2023-12-04 22:10:11,900 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=452966.6666666667, ans=0.0 2023-12-04 22:10:29,769 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=453033.3333333333, ans=0.125 2023-12-04 22:10:43,674 INFO [train.py:1087] (1/4) Epoch 76, batch 850, loss[loss=0.1518, simple_loss=0.2534, pruned_loss=0.0251, over 22794.00 frames. ], tot_loss[loss=0.1478, simple_loss=0.2412, pruned_loss=0.02723, over 4749483.76 frames. ], batch size: 106, lr: 3.22e-03, grad_scale: 32.0 2023-12-04 22:11:08,952 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.88 vs. limit=10.0 2023-12-04 22:11:16,316 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=453300.0, ans=0.125 2023-12-04 22:11:16,666 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.79 vs. limit=15.0 2023-12-04 22:11:25,643 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.23 vs. 
limit=6.0 2023-12-04 22:11:26,381 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=453366.6666666667, ans=0.0 2023-12-04 22:11:26,669 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.55 vs. limit=15.0 2023-12-04 22:11:30,888 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=453433.3333333333, ans=0.125 2023-12-04 22:11:33,620 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.179e+02 1.275e+02 1.393e+02 1.516e+02 2.346e+02, threshold=2.786e+02, percent-clipped=0.0 2023-12-04 22:11:45,018 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=453466.6666666667, ans=0.125 2023-12-04 22:11:53,946 INFO [train.py:1087] (1/4) Epoch 77, batch 0, loss[loss=0.1424, simple_loss=0.2405, pruned_loss=0.02222, over 24698.00 frames. ], tot_loss[loss=0.1424, simple_loss=0.2405, pruned_loss=0.02222, over 24698.00 frames. ], batch size: 69, lr: 3.20e-03, grad_scale: 32.0 2023-12-04 22:11:53,947 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 22:12:07,719 INFO [train.py:1119] (1/4) Epoch 77, validation: loss=0.1509, simple_loss=0.2467, pruned_loss=0.02756, over 944034.00 frames. 2023-12-04 22:12:07,720 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 22:12:22,853 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.16 vs. limit=15.0 2023-12-04 22:12:26,560 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=453533.3333333333, ans=0.0 2023-12-04 22:12:38,252 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=453600.0, ans=0.1 2023-12-04 22:13:09,707 INFO [train.py:1087] (1/4) Epoch 77, batch 50, loss[loss=0.1433, simple_loss=0.2317, pruned_loss=0.0275, over 24577.00 frames. ], tot_loss[loss=0.148, simple_loss=0.2422, pruned_loss=0.02691, over 1085898.97 frames. ], batch size: 64, lr: 3.19e-03, grad_scale: 16.0 2023-12-04 22:13:29,202 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.41 vs. 
limit=22.5 2023-12-04 22:13:40,228 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=453933.3333333333, ans=0.04949747468305833 2023-12-04 22:13:48,999 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=454000.0, ans=0.2 2023-12-04 22:13:52,855 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=454000.0, ans=0.035 2023-12-04 22:13:56,393 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=454000.0, ans=0.125 2023-12-04 22:13:56,553 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=454000.0, ans=0.125 2023-12-04 22:14:05,027 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=454066.6666666667, ans=0.125 2023-12-04 22:14:07,547 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.25 vs. limit=6.0 2023-12-04 22:14:09,210 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.144e+02 1.276e+02 1.348e+02 1.437e+02 2.275e+02, threshold=2.696e+02, percent-clipped=0.0 2023-12-04 22:14:11,606 INFO [train.py:1087] (1/4) Epoch 77, batch 100, loss[loss=0.1524, simple_loss=0.2457, pruned_loss=0.02955, over 23396.00 frames. ], tot_loss[loss=0.1476, simple_loss=0.2415, pruned_loss=0.02688, over 1925194.49 frames. ], batch size: 94, lr: 3.19e-03, grad_scale: 16.0 2023-12-04 22:14:24,437 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=454200.0, ans=0.125 2023-12-04 22:14:42,552 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=454266.6666666667, ans=0.125 2023-12-04 22:14:47,193 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=454333.3333333333, ans=0.1 2023-12-04 22:15:08,545 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=454400.0, ans=0.09899494936611666 2023-12-04 22:15:09,935 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.80 vs. limit=22.5 2023-12-04 22:15:12,894 INFO [train.py:1087] (1/4) Epoch 77, batch 150, loss[loss=0.1416, simple_loss=0.2325, pruned_loss=0.02529, over 24784.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2411, pruned_loss=0.0265, over 2576534.52 frames. ], batch size: 71, lr: 3.19e-03, grad_scale: 16.0 2023-12-04 22:15:28,214 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=454533.3333333333, ans=0.0 2023-12-04 22:15:32,063 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.42 vs. limit=22.5 2023-12-04 22:15:47,063 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=454600.0, ans=0.0 2023-12-04 22:15:50,035 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.04 vs. 
limit=15.0 2023-12-04 22:16:02,764 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 22:16:08,050 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.52 vs. limit=15.0 2023-12-04 22:16:13,575 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.072e+02 1.230e+02 1.302e+02 1.377e+02 1.981e+02, threshold=2.604e+02, percent-clipped=0.0 2023-12-04 22:16:16,003 INFO [train.py:1087] (1/4) Epoch 77, batch 200, loss[loss=0.1474, simple_loss=0.2417, pruned_loss=0.02651, over 24561.00 frames. ], tot_loss[loss=0.1476, simple_loss=0.2414, pruned_loss=0.02692, over 3066301.92 frames. ], batch size: 63, lr: 3.19e-03, grad_scale: 16.0 2023-12-04 22:16:34,960 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 22:16:48,134 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=454933.3333333333, ans=0.125 2023-12-04 22:16:48,169 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=454933.3333333333, ans=0.125 2023-12-04 22:17:06,903 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=455066.6666666667, ans=10.0 2023-12-04 22:17:16,187 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=455066.6666666667, ans=0.05 2023-12-04 22:17:19,423 INFO [train.py:1087] (1/4) Epoch 77, batch 250, loss[loss=0.1433, simple_loss=0.2374, pruned_loss=0.02454, over 24764.00 frames. ], tot_loss[loss=0.1478, simple_loss=0.2416, pruned_loss=0.027, over 3461357.53 frames. ], batch size: 65, lr: 3.19e-03, grad_scale: 16.0 2023-12-04 22:17:34,976 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=455200.0, ans=0.1 2023-12-04 22:17:43,597 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=455266.6666666667, ans=0.0 2023-12-04 22:17:54,370 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.71 vs. limit=12.0 2023-12-04 22:17:55,344 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=455266.6666666667, ans=0.0 2023-12-04 22:18:19,575 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.137e+02 1.256e+02 1.393e+02 1.475e+02 1.772e+02, threshold=2.785e+02, percent-clipped=0.0 2023-12-04 22:18:22,524 INFO [train.py:1087] (1/4) Epoch 77, batch 300, loss[loss=0.1402, simple_loss=0.2349, pruned_loss=0.0228, over 24276.00 frames. ], tot_loss[loss=0.1475, simple_loss=0.2411, pruned_loss=0.02691, over 3759856.12 frames. 
], batch size: 79, lr: 3.19e-03, grad_scale: 16.0 2023-12-04 22:18:32,900 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=455466.6666666667, ans=0.2 2023-12-04 22:18:37,936 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=455533.3333333333, ans=0.1 2023-12-04 22:18:48,170 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.85 vs. limit=15.0 2023-12-04 22:18:49,032 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=455600.0, ans=0.0 2023-12-04 22:19:08,267 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=455666.6666666667, ans=0.125 2023-12-04 22:19:09,409 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=455666.6666666667, ans=0.0 2023-12-04 22:19:14,747 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=455733.3333333333, ans=0.125 2023-12-04 22:19:25,314 INFO [train.py:1087] (1/4) Epoch 77, batch 350, loss[loss=0.1483, simple_loss=0.2405, pruned_loss=0.028, over 24777.00 frames. ], tot_loss[loss=0.1476, simple_loss=0.2412, pruned_loss=0.02696, over 3995038.85 frames. ], batch size: 64, lr: 3.19e-03, grad_scale: 16.0 2023-12-04 22:19:37,643 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=455866.6666666667, ans=0.1 2023-12-04 22:19:46,235 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.96 vs. limit=15.0 2023-12-04 22:20:11,066 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=456000.0, ans=0.2 2023-12-04 22:20:27,041 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.086e+02 1.291e+02 1.396e+02 1.522e+02 1.787e+02, threshold=2.793e+02, percent-clipped=0.0 2023-12-04 22:20:29,533 INFO [train.py:1087] (1/4) Epoch 77, batch 400, loss[loss=0.1358, simple_loss=0.2321, pruned_loss=0.01972, over 24683.00 frames. ], tot_loss[loss=0.1482, simple_loss=0.2417, pruned_loss=0.02738, over 4149544.52 frames. ], batch size: 74, lr: 3.19e-03, grad_scale: 32.0 2023-12-04 22:20:40,234 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.15 vs. limit=22.5 2023-12-04 22:20:47,937 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=456200.0, ans=0.125 2023-12-04 22:20:56,029 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=456266.6666666667, ans=0.05 2023-12-04 22:21:00,137 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=456266.6666666667, ans=0.2 2023-12-04 22:21:24,654 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.55 vs. 
limit=10.0 2023-12-04 22:21:32,185 INFO [train.py:1087] (1/4) Epoch 77, batch 450, loss[loss=0.1471, simple_loss=0.2386, pruned_loss=0.0278, over 24559.00 frames. ], tot_loss[loss=0.1482, simple_loss=0.2415, pruned_loss=0.02742, over 4296266.96 frames. ], batch size: 65, lr: 3.18e-03, grad_scale: 32.0 2023-12-04 22:21:33,633 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 22:21:38,771 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=456466.6666666667, ans=0.0 2023-12-04 22:21:59,966 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.76 vs. limit=15.0 2023-12-04 22:22:08,547 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.47 vs. limit=22.5 2023-12-04 22:22:11,015 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=456666.6666666667, ans=0.125 2023-12-04 22:22:14,814 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.55 vs. limit=15.0 2023-12-04 22:22:27,021 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.56 vs. limit=12.0 2023-12-04 22:22:32,351 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.092e+02 1.262e+02 1.346e+02 1.442e+02 1.753e+02, threshold=2.693e+02, percent-clipped=0.0 2023-12-04 22:22:35,704 INFO [train.py:1087] (1/4) Epoch 77, batch 500, loss[loss=0.1486, simple_loss=0.2373, pruned_loss=0.02992, over 24775.00 frames. ], tot_loss[loss=0.148, simple_loss=0.2413, pruned_loss=0.02736, over 4413755.94 frames. ], batch size: 62, lr: 3.18e-03, grad_scale: 32.0 2023-12-04 22:22:54,213 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=456866.6666666667, ans=0.125 2023-12-04 22:23:05,148 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=456933.3333333333, ans=0.125 2023-12-04 22:23:10,153 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.51 vs. limit=12.0 2023-12-04 22:23:19,690 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=457000.0, ans=0.125 2023-12-04 22:23:28,467 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=457066.6666666667, ans=0.125 2023-12-04 22:23:36,996 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=457133.3333333333, ans=0.125 2023-12-04 22:23:37,760 INFO [train.py:1087] (1/4) Epoch 77, batch 550, loss[loss=0.1408, simple_loss=0.2348, pruned_loss=0.02345, over 24817.00 frames. ], tot_loss[loss=0.1475, simple_loss=0.2409, pruned_loss=0.02708, over 4509267.18 frames. 
], batch size: 72, lr: 3.18e-03, grad_scale: 16.0 2023-12-04 22:23:41,702 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=457133.3333333333, ans=0.125 2023-12-04 22:23:45,108 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=457133.3333333333, ans=0.1 2023-12-04 22:23:46,306 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=457133.3333333333, ans=0.0 2023-12-04 22:24:00,488 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=457200.0, ans=0.125 2023-12-04 22:24:09,050 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=457266.6666666667, ans=0.0 2023-12-04 22:24:38,651 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=457400.0, ans=0.04949747468305833 2023-12-04 22:24:40,638 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.140e+02 1.270e+02 1.347e+02 1.504e+02 2.289e+02, threshold=2.694e+02, percent-clipped=0.0 2023-12-04 22:24:41,833 INFO [train.py:1087] (1/4) Epoch 77, batch 600, loss[loss=0.1413, simple_loss=0.2322, pruned_loss=0.02517, over 24561.00 frames. ], tot_loss[loss=0.1475, simple_loss=0.2409, pruned_loss=0.02708, over 4583804.33 frames. ], batch size: 65, lr: 3.18e-03, grad_scale: 16.0 2023-12-04 22:24:43,769 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.53 vs. limit=15.0 2023-12-04 22:24:45,951 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=457466.6666666667, ans=0.125 2023-12-04 22:25:02,387 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=457533.3333333333, ans=0.125 2023-12-04 22:25:19,701 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=457666.6666666667, ans=0.0 2023-12-04 22:25:20,681 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=457666.6666666667, ans=0.125 2023-12-04 22:25:22,564 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.35 vs. limit=10.0 2023-12-04 22:25:33,603 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=6.0 2023-12-04 22:25:44,101 INFO [train.py:1087] (1/4) Epoch 77, batch 650, loss[loss=0.1544, simple_loss=0.2546, pruned_loss=0.0271, over 21485.00 frames. ], tot_loss[loss=0.1474, simple_loss=0.2408, pruned_loss=0.02697, over 4635513.72 frames. 
], batch size: 127, lr: 3.18e-03, grad_scale: 16.0 2023-12-04 22:26:04,874 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=457866.6666666667, ans=0.125 2023-12-04 22:26:15,325 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=457933.3333333333, ans=0.125 2023-12-04 22:26:36,052 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.19 vs. limit=22.5 2023-12-04 22:26:41,039 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=458066.6666666667, ans=0.125 2023-12-04 22:26:43,200 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=458066.6666666667, ans=0.0 2023-12-04 22:26:44,059 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.091e+02 1.277e+02 1.365e+02 1.551e+02 2.272e+02, threshold=2.729e+02, percent-clipped=0.0 2023-12-04 22:26:45,290 INFO [train.py:1087] (1/4) Epoch 77, batch 700, loss[loss=0.1356, simple_loss=0.229, pruned_loss=0.02111, over 24730.00 frames. ], tot_loss[loss=0.1474, simple_loss=0.2408, pruned_loss=0.02698, over 4682401.67 frames. ], batch size: 74, lr: 3.18e-03, grad_scale: 16.0 2023-12-04 22:27:00,299 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=458200.0, ans=0.0 2023-12-04 22:27:07,091 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=458200.0, ans=0.0 2023-12-04 22:27:07,290 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=458200.0, ans=0.2 2023-12-04 22:27:16,009 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=458266.6666666667, ans=0.0 2023-12-04 22:27:16,658 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.22 vs. limit=22.5 2023-12-04 22:27:17,755 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.45 vs. limit=15.0 2023-12-04 22:27:21,796 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=458333.3333333333, ans=0.125 2023-12-04 22:27:27,879 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=458333.3333333333, ans=0.1 2023-12-04 22:27:32,976 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=458400.0, ans=0.125 2023-12-04 22:27:38,676 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=458400.0, ans=0.2 2023-12-04 22:27:44,972 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.45 vs. limit=15.0 2023-12-04 22:27:45,590 INFO [train.py:1087] (1/4) Epoch 77, batch 750, loss[loss=0.1414, simple_loss=0.236, pruned_loss=0.02337, over 24797.00 frames. ], tot_loss[loss=0.1472, simple_loss=0.2408, pruned_loss=0.02686, over 4716640.40 frames. 
], batch size: 62, lr: 3.18e-03, grad_scale: 16.0 2023-12-04 22:27:52,602 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=458466.6666666667, ans=0.125 2023-12-04 22:27:53,868 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=458466.6666666667, ans=0.0 2023-12-04 22:28:05,625 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=458533.3333333333, ans=0.1 2023-12-04 22:28:40,434 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=458733.3333333333, ans=0.1 2023-12-04 22:28:45,508 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=458733.3333333333, ans=0.125 2023-12-04 22:28:46,384 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.069e+02 1.260e+02 1.355e+02 1.461e+02 1.889e+02, threshold=2.710e+02, percent-clipped=0.0 2023-12-04 22:28:47,550 INFO [train.py:1087] (1/4) Epoch 77, batch 800, loss[loss=0.1706, simple_loss=0.2563, pruned_loss=0.04247, over 16442.00 frames. ], tot_loss[loss=0.1472, simple_loss=0.2407, pruned_loss=0.02685, over 4737283.85 frames. ], batch size: 177, lr: 3.18e-03, grad_scale: 32.0 2023-12-04 22:28:52,888 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.29 vs. limit=15.0 2023-12-04 22:29:01,902 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-12-04 22:29:10,099 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=458933.3333333333, ans=0.125 2023-12-04 22:29:10,149 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=458933.3333333333, ans=0.125 2023-12-04 22:29:21,096 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=459000.0, ans=0.0 2023-12-04 22:29:32,961 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=459066.6666666667, ans=0.125 2023-12-04 22:29:38,259 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=459066.6666666667, ans=0.125 2023-12-04 22:29:40,824 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=459066.6666666667, ans=0.1 2023-12-04 22:29:43,825 INFO [train.py:1087] (1/4) Epoch 77, batch 850, loss[loss=0.1477, simple_loss=0.2394, pruned_loss=0.02801, over 24571.00 frames. ], tot_loss[loss=0.1475, simple_loss=0.2409, pruned_loss=0.02706, over 4757972.93 frames. 
], batch size: 64, lr: 3.18e-03, grad_scale: 32.0 2023-12-04 22:29:59,399 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=459200.0, ans=0.125 2023-12-04 22:30:02,792 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=459200.0, ans=0.125 2023-12-04 22:30:03,000 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.05 vs. limit=15.0 2023-12-04 22:30:05,036 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=459266.6666666667, ans=0.2 2023-12-04 22:30:12,511 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=459266.6666666667, ans=0.125 2023-12-04 22:30:28,562 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=459400.0, ans=0.1 2023-12-04 22:30:43,584 INFO [train.py:1087] (1/4) Epoch 78, batch 0, loss[loss=0.1556, simple_loss=0.2496, pruned_loss=0.03081, over 24500.00 frames. ], tot_loss[loss=0.1556, simple_loss=0.2496, pruned_loss=0.03081, over 24500.00 frames. ], batch size: 75, lr: 3.15e-03, grad_scale: 32.0 2023-12-04 22:30:43,585 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 22:30:51,870 INFO [zipformer.py:1876] (1/4) name=encoder.encoders.3.encoder.layers.3.self_attn_weights, attn_weights_entropy = tensor([2.5006, 3.4803, 3.7559, 3.5110, 2.9918, 3.5492, 3.7555, 3.0726], device='cuda:1') 2023-12-04 22:30:56,962 INFO [train.py:1119] (1/4) Epoch 78, validation: loss=0.1512, simple_loss=0.2469, pruned_loss=0.02777, over 944034.00 frames. 2023-12-04 22:30:56,964 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 22:31:01,571 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.151e+02 1.318e+02 1.400e+02 1.529e+02 2.377e+02, threshold=2.801e+02, percent-clipped=0.0 2023-12-04 22:31:02,409 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.51 vs. limit=10.0 2023-12-04 22:31:03,061 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=459433.3333333333, ans=0.125 2023-12-04 22:31:58,215 INFO [train.py:1087] (1/4) Epoch 78, batch 50, loss[loss=0.1521, simple_loss=0.2422, pruned_loss=0.03105, over 24280.00 frames. ], tot_loss[loss=0.1492, simple_loss=0.2423, pruned_loss=0.02804, over 1063711.16 frames. ], batch size: 79, lr: 3.15e-03, grad_scale: 32.0 2023-12-04 22:32:41,073 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.21 vs. limit=6.0 2023-12-04 22:32:56,555 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=460033.3333333333, ans=0.1 2023-12-04 22:32:58,609 INFO [train.py:1087] (1/4) Epoch 78, batch 100, loss[loss=0.1474, simple_loss=0.2426, pruned_loss=0.02605, over 24864.00 frames. ], tot_loss[loss=0.1486, simple_loss=0.242, pruned_loss=0.02761, over 1890854.05 frames. 
], batch size: 68, lr: 3.15e-03, grad_scale: 32.0 2023-12-04 22:33:03,697 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.093e+02 1.269e+02 1.352e+02 1.523e+02 1.962e+02, threshold=2.704e+02, percent-clipped=0.0 2023-12-04 22:33:06,480 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=460100.0, ans=0.125 2023-12-04 22:33:11,162 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=460166.6666666667, ans=0.0 2023-12-04 22:33:34,503 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=460300.0, ans=0.1 2023-12-04 22:33:34,614 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=460300.0, ans=0.125 2023-12-04 22:33:56,151 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=460366.6666666667, ans=0.1 2023-12-04 22:33:59,202 INFO [train.py:1087] (1/4) Epoch 78, batch 150, loss[loss=0.1505, simple_loss=0.2426, pruned_loss=0.02925, over 24521.00 frames. ], tot_loss[loss=0.148, simple_loss=0.2415, pruned_loss=0.02725, over 2547758.65 frames. ], batch size: 75, lr: 3.15e-03, grad_scale: 32.0 2023-12-04 22:34:04,439 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=460433.3333333333, ans=0.0 2023-12-04 22:34:20,967 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=460500.0, ans=0.1 2023-12-04 22:34:34,104 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=460633.3333333333, ans=0.0 2023-12-04 22:34:59,863 INFO [train.py:1087] (1/4) Epoch 78, batch 200, loss[loss=0.1355, simple_loss=0.232, pruned_loss=0.01952, over 24784.00 frames. ], tot_loss[loss=0.1474, simple_loss=0.2408, pruned_loss=0.02702, over 3060750.63 frames. ], batch size: 73, lr: 3.15e-03, grad_scale: 32.0 2023-12-04 22:35:04,567 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.144e+02 1.245e+02 1.353e+02 1.456e+02 1.804e+02, threshold=2.706e+02, percent-clipped=0.0 2023-12-04 22:35:49,571 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=461033.3333333333, ans=0.07 2023-12-04 22:35:49,932 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.71 vs. limit=10.0 2023-12-04 22:36:01,505 INFO [train.py:1087] (1/4) Epoch 78, batch 250, loss[loss=0.1495, simple_loss=0.2441, pruned_loss=0.02747, over 24560.00 frames. ], tot_loss[loss=0.1476, simple_loss=0.2411, pruned_loss=0.02706, over 3446420.16 frames. ], batch size: 65, lr: 3.15e-03, grad_scale: 32.0 2023-12-04 22:36:37,591 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 22:36:41,131 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=461300.0, ans=0.125 2023-12-04 22:36:43,791 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.32 vs. 
limit=15.0 2023-12-04 22:37:04,054 INFO [train.py:1087] (1/4) Epoch 78, batch 300, loss[loss=0.1535, simple_loss=0.2472, pruned_loss=0.02986, over 24793.00 frames. ], tot_loss[loss=0.1478, simple_loss=0.2413, pruned_loss=0.02719, over 3759378.22 frames. ], batch size: 73, lr: 3.15e-03, grad_scale: 16.0 2023-12-04 22:37:09,275 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=461433.3333333333, ans=0.1 2023-12-04 22:37:10,080 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.083e+02 1.267e+02 1.351e+02 1.468e+02 1.970e+02, threshold=2.702e+02, percent-clipped=0.0 2023-12-04 22:37:10,402 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=461433.3333333333, ans=0.025 2023-12-04 22:37:41,239 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=461633.3333333333, ans=0.0 2023-12-04 22:37:41,538 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.07 vs. limit=6.0 2023-12-04 22:37:48,076 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=461633.3333333333, ans=0.0 2023-12-04 22:37:49,595 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.76 vs. limit=15.0 2023-12-04 22:38:05,857 INFO [train.py:1087] (1/4) Epoch 78, batch 350, loss[loss=0.1649, simple_loss=0.2573, pruned_loss=0.03631, over 23942.00 frames. ], tot_loss[loss=0.1478, simple_loss=0.2411, pruned_loss=0.02721, over 3989621.95 frames. ], batch size: 87, lr: 3.15e-03, grad_scale: 16.0 2023-12-04 22:38:07,408 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=461766.6666666667, ans=0.125 2023-12-04 22:38:14,246 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=461766.6666666667, ans=0.0 2023-12-04 22:38:27,143 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=461833.3333333333, ans=0.0 2023-12-04 22:38:59,443 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-12-04 22:39:04,902 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=462033.3333333333, ans=0.0 2023-12-04 22:39:07,177 INFO [train.py:1087] (1/4) Epoch 78, batch 400, loss[loss=0.1424, simple_loss=0.24, pruned_loss=0.02239, over 24796.00 frames. ], tot_loss[loss=0.1478, simple_loss=0.2411, pruned_loss=0.02722, over 4170142.46 frames. 
], batch size: 72, lr: 3.15e-03, grad_scale: 32.0 2023-12-04 22:39:13,453 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.151e+02 1.316e+02 1.421e+02 1.551e+02 2.009e+02, threshold=2.843e+02, percent-clipped=0.0 2023-12-04 22:39:13,811 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=462100.0, ans=0.125 2023-12-04 22:39:28,681 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=462166.6666666667, ans=0.125 2023-12-04 22:39:44,069 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=462300.0, ans=0.04949747468305833 2023-12-04 22:39:45,458 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.29 vs. limit=22.5 2023-12-04 22:39:59,819 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=462366.6666666667, ans=0.025 2023-12-04 22:40:09,003 INFO [train.py:1087] (1/4) Epoch 78, batch 450, loss[loss=0.1501, simple_loss=0.2466, pruned_loss=0.02673, over 24564.00 frames. ], tot_loss[loss=0.1482, simple_loss=0.2415, pruned_loss=0.02746, over 4298096.68 frames. ], batch size: 65, lr: 3.14e-03, grad_scale: 32.0 2023-12-04 22:40:18,747 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=462433.3333333333, ans=0.0 2023-12-04 22:40:30,606 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=462500.0, ans=0.125 2023-12-04 22:40:32,151 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=462500.0, ans=0.0 2023-12-04 22:40:33,519 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=462566.6666666667, ans=0.0 2023-12-04 22:41:04,369 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=462700.0, ans=0.0 2023-12-04 22:41:05,698 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=462700.0, ans=0.0 2023-12-04 22:41:11,605 INFO [train.py:1087] (1/4) Epoch 78, batch 500, loss[loss=0.142, simple_loss=0.2355, pruned_loss=0.02427, over 24568.00 frames. ], tot_loss[loss=0.1477, simple_loss=0.2411, pruned_loss=0.02718, over 4418628.58 frames. ], batch size: 66, lr: 3.14e-03, grad_scale: 32.0 2023-12-04 22:41:17,376 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.128e+02 1.273e+02 1.355e+02 1.434e+02 2.086e+02, threshold=2.711e+02, percent-clipped=0.0 2023-12-04 22:41:32,544 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.09 vs. limit=22.5 2023-12-04 22:41:42,072 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=462900.0, ans=0.1 2023-12-04 22:41:57,796 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=462966.6666666667, ans=0.2 2023-12-04 22:42:12,089 INFO [train.py:1087] (1/4) Epoch 78, batch 550, loss[loss=0.1405, simple_loss=0.2349, pruned_loss=0.02308, over 24547.00 frames. 
], tot_loss[loss=0.1478, simple_loss=0.2411, pruned_loss=0.02727, over 4494301.82 frames. ], batch size: 63, lr: 3.14e-03, grad_scale: 16.0 2023-12-04 22:42:21,473 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.59 vs. limit=22.5 2023-12-04 22:42:35,605 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=463233.3333333333, ans=0.125 2023-12-04 22:42:35,936 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.02 vs. limit=15.0 2023-12-04 22:42:49,052 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=463300.0, ans=0.0 2023-12-04 22:42:53,807 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=463300.0, ans=0.1 2023-12-04 22:42:54,204 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.76 vs. limit=15.0 2023-12-04 22:43:12,741 INFO [train.py:1087] (1/4) Epoch 78, batch 600, loss[loss=0.1408, simple_loss=0.2373, pruned_loss=0.02214, over 24573.00 frames. ], tot_loss[loss=0.1475, simple_loss=0.2409, pruned_loss=0.02706, over 4570377.12 frames. ], batch size: 64, lr: 3.14e-03, grad_scale: 16.0 2023-12-04 22:43:20,561 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.076e+02 1.286e+02 1.376e+02 1.492e+02 2.367e+02, threshold=2.752e+02, percent-clipped=0.0 2023-12-04 22:44:12,663 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.27 vs. limit=12.0 2023-12-04 22:44:14,359 INFO [train.py:1087] (1/4) Epoch 78, batch 650, loss[loss=0.142, simple_loss=0.2371, pruned_loss=0.02343, over 24741.00 frames. ], tot_loss[loss=0.1477, simple_loss=0.2412, pruned_loss=0.02713, over 4608701.28 frames. ], batch size: 61, lr: 3.14e-03, grad_scale: 16.0 2023-12-04 22:44:23,186 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.35 vs. limit=15.0 2023-12-04 22:44:43,568 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=463900.0, ans=0.2 2023-12-04 22:45:00,041 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=463966.6666666667, ans=0.125 2023-12-04 22:45:02,477 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=464033.3333333333, ans=0.07 2023-12-04 22:45:02,760 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.08 vs. limit=15.0 2023-12-04 22:45:15,132 INFO [train.py:1087] (1/4) Epoch 78, batch 700, loss[loss=0.1467, simple_loss=0.242, pruned_loss=0.02564, over 24796.00 frames. ], tot_loss[loss=0.1478, simple_loss=0.2413, pruned_loss=0.0272, over 4658405.76 frames. 
], batch size: 71, lr: 3.14e-03, grad_scale: 16.0 2023-12-04 22:45:18,821 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=464100.0, ans=0.0 2023-12-04 22:45:22,025 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.095e+02 1.303e+02 1.388e+02 1.509e+02 1.902e+02, threshold=2.776e+02, percent-clipped=0.0 2023-12-04 22:45:22,414 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=464100.0, ans=0.09899494936611666 2023-12-04 22:45:59,199 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=464300.0, ans=0.1 2023-12-04 22:46:16,907 INFO [train.py:1087] (1/4) Epoch 78, batch 750, loss[loss=0.1443, simple_loss=0.2393, pruned_loss=0.02467, over 24575.00 frames. ], tot_loss[loss=0.1479, simple_loss=0.2413, pruned_loss=0.02725, over 4683489.70 frames. ], batch size: 64, lr: 3.14e-03, grad_scale: 16.0 2023-12-04 22:46:22,952 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=464433.3333333333, ans=0.1 2023-12-04 22:46:25,191 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=464433.3333333333, ans=0.0 2023-12-04 22:46:40,124 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.68 vs. limit=15.0 2023-12-04 22:46:50,103 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.29 vs. limit=22.5 2023-12-04 22:47:05,487 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=464700.0, ans=0.125 2023-12-04 22:47:12,601 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=464700.0, ans=0.2 2023-12-04 22:47:16,391 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=464766.6666666667, ans=0.125 2023-12-04 22:47:17,367 INFO [train.py:1087] (1/4) Epoch 78, batch 800, loss[loss=0.1361, simple_loss=0.2313, pruned_loss=0.02048, over 24696.00 frames. ], tot_loss[loss=0.1474, simple_loss=0.2409, pruned_loss=0.02694, over 4718351.48 frames. 
], batch size: 69, lr: 3.14e-03, grad_scale: 32.0 2023-12-04 22:47:24,167 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=464766.6666666667, ans=0.0 2023-12-04 22:47:24,954 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.094e+02 1.273e+02 1.372e+02 1.455e+02 1.988e+02, threshold=2.744e+02, percent-clipped=0.0 2023-12-04 22:47:30,141 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=464833.3333333333, ans=0.125 2023-12-04 22:47:53,336 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=464966.6666666667, ans=0.125 2023-12-04 22:48:09,519 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=465033.3333333333, ans=0.0 2023-12-04 22:48:12,568 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=465100.0, ans=0.0 2023-12-04 22:48:13,427 INFO [train.py:1087] (1/4) Epoch 78, batch 850, loss[loss=0.149, simple_loss=0.2441, pruned_loss=0.02699, over 24153.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.2408, pruned_loss=0.02695, over 4727354.28 frames. ], batch size: 82, lr: 3.14e-03, grad_scale: 32.0 2023-12-04 22:48:17,230 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.90 vs. limit=6.0 2023-12-04 22:48:19,557 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.47 vs. limit=6.0 2023-12-04 22:48:27,652 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=465166.6666666667, ans=0.125 2023-12-04 22:48:33,628 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=465166.6666666667, ans=0.1 2023-12-04 22:48:34,725 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=465233.3333333333, ans=0.09899494936611666 2023-12-04 22:48:49,772 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=465300.0, ans=0.2 2023-12-04 22:49:18,208 INFO [train.py:1087] (1/4) Epoch 79, batch 0, loss[loss=0.1529, simple_loss=0.2451, pruned_loss=0.03035, over 24306.00 frames. ], tot_loss[loss=0.1529, simple_loss=0.2451, pruned_loss=0.03035, over 24306.00 frames. ], batch size: 79, lr: 3.11e-03, grad_scale: 32.0 2023-12-04 22:49:18,209 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 22:49:26,978 INFO [zipformer.py:1876] (1/4) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.0187, 2.9603, 2.6998, 3.7265, 3.5840, 3.1758, 3.3645, 3.3473], device='cuda:1') 2023-12-04 22:49:31,401 INFO [train.py:1119] (1/4) Epoch 79, validation: loss=0.1512, simple_loss=0.2468, pruned_loss=0.02781, over 944034.00 frames. 
2023-12-04 22:49:31,402 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 22:49:44,326 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.122e+02 1.283e+02 1.371e+02 1.514e+02 2.003e+02, threshold=2.742e+02, percent-clipped=0.0 2023-12-04 22:50:02,034 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=465533.3333333333, ans=0.125 2023-12-04 22:50:11,886 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=465600.0, ans=0.2 2023-12-04 22:50:32,854 INFO [train.py:1087] (1/4) Epoch 79, batch 50, loss[loss=0.1441, simple_loss=0.2401, pruned_loss=0.02406, over 24560.00 frames. ], tot_loss[loss=0.1484, simple_loss=0.2422, pruned_loss=0.02729, over 1088425.63 frames. ], batch size: 64, lr: 3.11e-03, grad_scale: 32.0 2023-12-04 22:51:04,765 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=465866.6666666667, ans=0.125 2023-12-04 22:51:18,672 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 22:51:21,005 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=466000.0, ans=0.125 2023-12-04 22:51:30,495 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=466000.0, ans=0.2 2023-12-04 22:51:30,885 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.31 vs. limit=15.0 2023-12-04 22:51:32,896 INFO [train.py:1087] (1/4) Epoch 79, batch 100, loss[loss=0.1472, simple_loss=0.2406, pruned_loss=0.02685, over 24851.00 frames. ], tot_loss[loss=0.1478, simple_loss=0.2414, pruned_loss=0.0271, over 1933663.94 frames. ], batch size: 68, lr: 3.11e-03, grad_scale: 32.0 2023-12-04 22:51:36,065 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=466066.6666666667, ans=0.2 2023-12-04 22:51:43,104 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=466066.6666666667, ans=0.0 2023-12-04 22:51:43,207 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=466066.6666666667, ans=0.1 2023-12-04 22:51:46,215 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.313e+02 1.387e+02 1.492e+02 1.840e+02, threshold=2.775e+02, percent-clipped=0.0 2023-12-04 22:52:00,314 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.15 vs. limit=10.0 2023-12-04 22:52:15,202 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.66 vs. limit=15.0 2023-12-04 22:52:33,016 INFO [train.py:1087] (1/4) Epoch 79, batch 150, loss[loss=0.1514, simple_loss=0.2441, pruned_loss=0.02932, over 24778.00 frames. ], tot_loss[loss=0.1477, simple_loss=0.2416, pruned_loss=0.02691, over 2554149.61 frames. 
], batch size: 70, lr: 3.11e-03, grad_scale: 32.0 2023-12-04 22:52:47,473 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=466466.6666666667, ans=0.2 2023-12-04 22:53:18,908 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=466600.0, ans=0.2 2023-12-04 22:53:27,233 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=466666.6666666667, ans=0.125 2023-12-04 22:53:27,287 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=466666.6666666667, ans=0.125 2023-12-04 22:53:33,008 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=1.068e-01 2023-12-04 22:53:33,909 INFO [train.py:1087] (1/4) Epoch 79, batch 200, loss[loss=0.1485, simple_loss=0.2419, pruned_loss=0.02756, over 24043.00 frames. ], tot_loss[loss=0.1471, simple_loss=0.2408, pruned_loss=0.02675, over 3061337.04 frames. ], batch size: 87, lr: 3.11e-03, grad_scale: 32.0 2023-12-04 22:53:43,834 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=466733.3333333333, ans=0.125 2023-12-04 22:53:46,904 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.118e+02 1.258e+02 1.330e+02 1.443e+02 2.425e+02, threshold=2.659e+02, percent-clipped=0.0 2023-12-04 22:53:58,014 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=466866.6666666667, ans=0.07 2023-12-04 22:54:09,636 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=466933.3333333333, ans=0.125 2023-12-04 22:54:11,907 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=466933.3333333333, ans=0.2 2023-12-04 22:54:15,223 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=466933.3333333333, ans=0.2 2023-12-04 22:54:35,267 INFO [train.py:1087] (1/4) Epoch 79, batch 250, loss[loss=0.1556, simple_loss=0.2482, pruned_loss=0.03155, over 24511.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2406, pruned_loss=0.02667, over 3465453.12 frames. ], batch size: 75, lr: 3.11e-03, grad_scale: 32.0 2023-12-04 22:54:36,712 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=467066.6666666667, ans=0.125 2023-12-04 22:54:40,416 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 22:54:54,773 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.52 vs. limit=15.0 2023-12-04 22:54:58,298 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.48 vs. 
limit=15.0 2023-12-04 22:55:00,674 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=467200.0, ans=0.125 2023-12-04 22:55:14,147 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=467266.6666666667, ans=0.0 2023-12-04 22:55:19,014 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=467266.6666666667, ans=0.0 2023-12-04 22:55:24,000 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.44 vs. limit=10.0 2023-12-04 22:55:28,188 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=467333.3333333333, ans=0.0 2023-12-04 22:55:36,065 INFO [train.py:1087] (1/4) Epoch 79, batch 300, loss[loss=0.1496, simple_loss=0.2448, pruned_loss=0.02722, over 24577.00 frames. ], tot_loss[loss=0.1472, simple_loss=0.2408, pruned_loss=0.02679, over 3765088.04 frames. ], batch size: 64, lr: 3.11e-03, grad_scale: 16.0 2023-12-04 22:55:47,970 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.90 vs. limit=15.0 2023-12-04 22:55:50,633 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.098e+02 1.277e+02 1.357e+02 1.469e+02 1.891e+02, threshold=2.714e+02, percent-clipped=0.0 2023-12-04 22:55:51,961 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=467466.6666666667, ans=0.1 2023-12-04 22:55:58,266 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=467466.6666666667, ans=0.05 2023-12-04 22:56:19,411 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.85 vs. limit=15.0 2023-12-04 22:56:26,705 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=467666.6666666667, ans=0.125 2023-12-04 22:56:35,548 INFO [train.py:1087] (1/4) Epoch 79, batch 350, loss[loss=0.1483, simple_loss=0.2414, pruned_loss=0.02754, over 24296.00 frames. ], tot_loss[loss=0.148, simple_loss=0.2413, pruned_loss=0.02733, over 3972187.63 frames. ], batch size: 79, lr: 3.11e-03, grad_scale: 16.0 2023-12-04 22:56:38,115 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=467733.3333333333, ans=0.125 2023-12-04 22:56:46,308 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=467733.3333333333, ans=0.0 2023-12-04 22:56:51,001 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=467800.0, ans=0.0 2023-12-04 22:56:52,060 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=467800.0, ans=0.125 2023-12-04 22:56:58,303 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.95 vs. 
limit=15.0 2023-12-04 22:57:28,090 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.75 vs. limit=12.0 2023-12-04 22:57:36,761 INFO [train.py:1087] (1/4) Epoch 79, batch 400, loss[loss=0.1451, simple_loss=0.2378, pruned_loss=0.02617, over 24801.00 frames. ], tot_loss[loss=0.1475, simple_loss=0.241, pruned_loss=0.027, over 4159275.53 frames. ], batch size: 62, lr: 3.11e-03, grad_scale: 32.0 2023-12-04 22:57:48,959 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=468133.3333333333, ans=0.125 2023-12-04 22:57:50,851 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.119e+02 1.294e+02 1.402e+02 1.517e+02 2.343e+02, threshold=2.805e+02, percent-clipped=0.0 2023-12-04 22:57:57,851 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-12-04 22:57:59,176 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=468133.3333333333, ans=0.125 2023-12-04 22:58:17,828 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=468266.6666666667, ans=0.0 2023-12-04 22:58:37,193 INFO [train.py:1087] (1/4) Epoch 79, batch 450, loss[loss=0.1425, simple_loss=0.2402, pruned_loss=0.02239, over 24550.00 frames. ], tot_loss[loss=0.1472, simple_loss=0.2408, pruned_loss=0.02676, over 4317523.16 frames. ], batch size: 66, lr: 3.10e-03, grad_scale: 32.0 2023-12-04 22:58:38,777 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=468400.0, ans=0.07 2023-12-04 22:58:51,773 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-12-04 22:58:56,450 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=468466.6666666667, ans=0.05 2023-12-04 22:59:15,580 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=468600.0, ans=0.0 2023-12-04 22:59:20,215 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=468600.0, ans=0.2 2023-12-04 22:59:27,450 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=468666.6666666667, ans=0.125 2023-12-04 22:59:28,700 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=468666.6666666667, ans=0.125 2023-12-04 22:59:31,003 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=468666.6666666667, ans=0.2 2023-12-04 22:59:36,829 INFO [train.py:1087] (1/4) Epoch 79, batch 500, loss[loss=0.1514, simple_loss=0.2439, pruned_loss=0.02949, over 24602.00 frames. ], tot_loss[loss=0.1476, simple_loss=0.2412, pruned_loss=0.02704, over 4433568.65 frames. 
], batch size: 68, lr: 3.10e-03, grad_scale: 32.0 2023-12-04 22:59:40,513 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=468733.3333333333, ans=0.125 2023-12-04 22:59:51,127 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.129e+02 1.271e+02 1.358e+02 1.492e+02 2.025e+02, threshold=2.716e+02, percent-clipped=0.0 2023-12-04 23:00:20,308 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=468933.3333333333, ans=0.1 2023-12-04 23:00:36,481 INFO [train.py:1087] (1/4) Epoch 79, batch 550, loss[loss=0.1757, simple_loss=0.2643, pruned_loss=0.04361, over 17248.00 frames. ], tot_loss[loss=0.1482, simple_loss=0.2417, pruned_loss=0.02732, over 4485473.69 frames. ], batch size: 177, lr: 3.10e-03, grad_scale: 32.0 2023-12-04 23:00:46,656 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=469066.6666666667, ans=0.0 2023-12-04 23:00:58,468 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=469133.3333333333, ans=0.125 2023-12-04 23:01:03,081 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=469200.0, ans=0.95 2023-12-04 23:01:07,737 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=469200.0, ans=0.1 2023-12-04 23:01:36,554 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=469400.0, ans=0.125 2023-12-04 23:01:37,301 INFO [train.py:1087] (1/4) Epoch 79, batch 600, loss[loss=0.1418, simple_loss=0.2363, pruned_loss=0.02367, over 24731.00 frames. ], tot_loss[loss=0.1483, simple_loss=0.2418, pruned_loss=0.02736, over 4547376.77 frames. ], batch size: 67, lr: 3.10e-03, grad_scale: 32.0 2023-12-04 23:01:40,858 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=469400.0, ans=0.2 2023-12-04 23:01:42,083 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=469400.0, ans=0.125 2023-12-04 23:01:52,265 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.071e+02 1.255e+02 1.326e+02 1.396e+02 1.933e+02, threshold=2.652e+02, percent-clipped=0.0 2023-12-04 23:02:07,966 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=469533.3333333333, ans=0.125 2023-12-04 23:02:27,157 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=469666.6666666667, ans=0.1 2023-12-04 23:02:29,745 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.31 vs. limit=15.0 2023-12-04 23:02:38,294 INFO [train.py:1087] (1/4) Epoch 79, batch 650, loss[loss=0.1461, simple_loss=0.241, pruned_loss=0.02561, over 24791.00 frames. ], tot_loss[loss=0.1482, simple_loss=0.2417, pruned_loss=0.02732, over 4593085.66 frames. 
], batch size: 62, lr: 3.10e-03, grad_scale: 32.0 2023-12-04 23:02:52,768 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=469800.0, ans=0.2 2023-12-04 23:03:07,689 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=469866.6666666667, ans=0.125 2023-12-04 23:03:34,552 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=470000.0, ans=0.125 2023-12-04 23:03:36,779 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=470000.0, ans=0.0 2023-12-04 23:03:38,865 INFO [train.py:1087] (1/4) Epoch 79, batch 700, loss[loss=0.1368, simple_loss=0.2287, pruned_loss=0.02248, over 24761.00 frames. ], tot_loss[loss=0.1478, simple_loss=0.2412, pruned_loss=0.02716, over 4645613.38 frames. ], batch size: 70, lr: 3.10e-03, grad_scale: 32.0 2023-12-04 23:03:53,565 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.122e+02 1.295e+02 1.372e+02 1.521e+02 2.112e+02, threshold=2.745e+02, percent-clipped=0.0 2023-12-04 23:04:11,327 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=470200.0, ans=0.125 2023-12-04 23:04:11,341 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=470200.0, ans=0.125 2023-12-04 23:04:12,025 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.14 vs. limit=15.0 2023-12-04 23:04:28,842 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=470333.3333333333, ans=0.0 2023-12-04 23:04:38,915 INFO [train.py:1087] (1/4) Epoch 79, batch 750, loss[loss=0.1658, simple_loss=0.2584, pruned_loss=0.03661, over 21281.00 frames. ], tot_loss[loss=0.1478, simple_loss=0.2413, pruned_loss=0.02718, over 4680890.95 frames. ], batch size: 127, lr: 3.10e-03, grad_scale: 16.0 2023-12-04 23:05:17,641 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=470600.0, ans=0.125 2023-12-04 23:05:20,925 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=470600.0, ans=0.035 2023-12-04 23:05:23,530 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.96 vs. limit=15.0 2023-12-04 23:05:32,719 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=470666.6666666667, ans=0.0 2023-12-04 23:05:38,504 INFO [train.py:1087] (1/4) Epoch 79, batch 800, loss[loss=0.1469, simple_loss=0.2407, pruned_loss=0.02654, over 24843.00 frames. ], tot_loss[loss=0.1478, simple_loss=0.2413, pruned_loss=0.02711, over 4722507.64 frames. 
], batch size: 68, lr: 3.10e-03, grad_scale: 32.0 2023-12-04 23:05:41,112 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:05:48,073 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=470733.3333333333, ans=0.125 2023-12-04 23:05:54,532 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.097e+02 1.232e+02 1.327e+02 1.422e+02 1.884e+02, threshold=2.655e+02, percent-clipped=0.0 2023-12-04 23:05:55,727 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=470800.0, ans=0.0 2023-12-04 23:05:59,101 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=470800.0, ans=0.125 2023-12-04 23:06:17,392 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=470933.3333333333, ans=0.2 2023-12-04 23:06:34,415 INFO [train.py:1087] (1/4) Epoch 79, batch 850, loss[loss=0.1565, simple_loss=0.2514, pruned_loss=0.03085, over 23454.00 frames. ], tot_loss[loss=0.1476, simple_loss=0.2414, pruned_loss=0.02695, over 4734836.61 frames. ], batch size: 94, lr: 3.10e-03, grad_scale: 16.0 2023-12-04 23:06:38,876 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=471066.6666666667, ans=0.1 2023-12-04 23:06:38,916 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=471066.6666666667, ans=0.2 2023-12-04 23:06:46,432 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=471133.3333333333, ans=0.1 2023-12-04 23:06:53,890 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=471133.3333333333, ans=0.125 2023-12-04 23:06:54,431 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.47 vs. limit=12.0 2023-12-04 23:07:00,460 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=471200.0, ans=0.125 2023-12-04 23:07:12,249 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=471266.6666666667, ans=0.0 2023-12-04 23:07:17,059 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.23 vs. limit=15.0 2023-12-04 23:07:30,745 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=471366.6666666667, ans=0.125 2023-12-04 23:07:38,922 INFO [train.py:1087] (1/4) Epoch 80, batch 0, loss[loss=0.134, simple_loss=0.2292, pruned_loss=0.01938, over 24706.00 frames. ], tot_loss[loss=0.134, simple_loss=0.2292, pruned_loss=0.01938, over 24706.00 frames. ], batch size: 69, lr: 3.07e-03, grad_scale: 32.0 2023-12-04 23:07:38,923 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 23:07:52,528 INFO [train.py:1119] (1/4) Epoch 80, validation: loss=0.151, simple_loss=0.2466, pruned_loss=0.02767, over 944034.00 frames. 
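
The [train.py:1087] records repeat one fixed layout (Epoch, batch, per-batch loss[...], tot_loss[...], batch size, lr, grad_scale), so the running tot_loss can be pulled out for plotting with a small regex pass. A sketch under the assumption that the whole log is read as one string (records here wrap across physical lines); the file name is hypothetical:

    import re

    # Hypothetical file name; the pattern relies only on the record layout shown above.
    text = open("train-log.txt", encoding="utf-8").read()

    pattern = re.compile(
        r"Epoch (\d+), batch (\d+),.*?"          # epoch and batch index
        r"tot_loss\[loss=([\d.]+), simple_loss=([\d.]+), pruned_loss=([\d.]+)",
        re.DOTALL,                                # records may wrap across lines
    )
    for m in pattern.finditer(text):
        epoch, batch = int(m.group(1)), int(m.group(2))
        tot, simple, pruned = (float(g) for g in m.groups()[2:])
        print(epoch, batch, tot, simple, pruned)
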
2023-12-04 23:07:52,529 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 23:07:56,198 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=471366.6666666667, ans=0.2 2023-12-04 23:07:56,577 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.97 vs. limit=12.0 2023-12-04 23:08:00,873 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=471366.6666666667, ans=0.125 2023-12-04 23:08:08,833 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=471433.3333333333, ans=0.125 2023-12-04 23:08:15,514 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.119e+02 1.302e+02 1.425e+02 1.549e+02 2.145e+02, threshold=2.850e+02, percent-clipped=0.0 2023-12-04 23:08:44,927 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=471633.3333333333, ans=0.125 2023-12-04 23:08:51,154 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=471633.3333333333, ans=0.0 2023-12-04 23:08:53,209 INFO [train.py:1087] (1/4) Epoch 80, batch 50, loss[loss=0.1458, simple_loss=0.2436, pruned_loss=0.02393, over 24770.00 frames. ], tot_loss[loss=0.148, simple_loss=0.2416, pruned_loss=0.02715, over 1084117.31 frames. ], batch size: 64, lr: 3.07e-03, grad_scale: 32.0 2023-12-04 23:08:53,390 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=471700.0, ans=0.125 2023-12-04 23:09:10,832 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=471766.6666666667, ans=0.0 2023-12-04 23:09:15,427 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=471766.6666666667, ans=0.125 2023-12-04 23:09:16,472 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=471833.3333333333, ans=0.125 2023-12-04 23:09:17,524 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=471833.3333333333, ans=0.0 2023-12-04 23:09:46,336 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.87 vs. limit=15.0 2023-12-04 23:09:47,508 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=471966.6666666667, ans=0.125 2023-12-04 23:09:53,340 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=472033.3333333333, ans=0.035 2023-12-04 23:09:54,314 INFO [train.py:1087] (1/4) Epoch 80, batch 100, loss[loss=0.1597, simple_loss=0.2513, pruned_loss=0.03402, over 23448.00 frames. ], tot_loss[loss=0.1474, simple_loss=0.2414, pruned_loss=0.02666, over 1920647.20 frames. 
], batch size: 94, lr: 3.07e-03, grad_scale: 32.0 2023-12-04 23:10:06,257 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=472100.0, ans=0.125 2023-12-04 23:10:12,603 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=472100.0, ans=0.0 2023-12-04 23:10:12,611 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=472100.0, ans=0.125 2023-12-04 23:10:18,042 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.112e+02 1.286e+02 1.368e+02 1.525e+02 1.879e+02, threshold=2.737e+02, percent-clipped=0.0 2023-12-04 23:10:30,296 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=472233.3333333333, ans=0.125 2023-12-04 23:10:39,674 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=472233.3333333333, ans=0.125 2023-12-04 23:10:47,839 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.36 vs. limit=15.0 2023-12-04 23:10:56,391 INFO [train.py:1087] (1/4) Epoch 80, batch 150, loss[loss=0.1446, simple_loss=0.2392, pruned_loss=0.025, over 24779.00 frames. ], tot_loss[loss=0.1476, simple_loss=0.2414, pruned_loss=0.02692, over 2567400.73 frames. ], batch size: 71, lr: 3.07e-03, grad_scale: 32.0 2023-12-04 23:10:56,579 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=472366.6666666667, ans=0.1 2023-12-04 23:11:22,087 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=472500.0, ans=0.2 2023-12-04 23:11:24,885 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=472500.0, ans=0.2 2023-12-04 23:11:26,100 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=472500.0, ans=0.125 2023-12-04 23:11:34,058 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=472566.6666666667, ans=0.125 2023-12-04 23:11:54,676 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.76 vs. limit=15.0 2023-12-04 23:11:57,563 INFO [train.py:1087] (1/4) Epoch 80, batch 200, loss[loss=0.1505, simple_loss=0.2406, pruned_loss=0.03014, over 24582.00 frames. ], tot_loss[loss=0.1476, simple_loss=0.2411, pruned_loss=0.02707, over 3071030.47 frames. ], batch size: 64, lr: 3.07e-03, grad_scale: 32.0 2023-12-04 23:12:05,626 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.29 vs. 
limit=15.0 2023-12-04 23:12:21,644 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.119e+02 1.292e+02 1.417e+02 1.507e+02 2.188e+02, threshold=2.834e+02, percent-clipped=0.0 2023-12-04 23:12:22,136 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=472833.3333333333, ans=0.0 2023-12-04 23:12:23,113 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=472833.3333333333, ans=0.125 2023-12-04 23:12:30,091 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=472833.3333333333, ans=0.1 2023-12-04 23:12:30,120 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=472833.3333333333, ans=0.0 2023-12-04 23:12:43,540 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=472900.0, ans=0.125 2023-12-04 23:12:59,026 INFO [train.py:1087] (1/4) Epoch 80, batch 250, loss[loss=0.153, simple_loss=0.2425, pruned_loss=0.03176, over 24719.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.2409, pruned_loss=0.02687, over 3453587.71 frames. ], batch size: 74, lr: 3.07e-03, grad_scale: 32.0 2023-12-04 23:13:01,702 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=473033.3333333333, ans=0.1 2023-12-04 23:13:12,196 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=473100.0, ans=0.1 2023-12-04 23:13:16,445 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=473100.0, ans=0.2 2023-12-04 23:13:59,756 INFO [train.py:1087] (1/4) Epoch 80, batch 300, loss[loss=0.1512, simple_loss=0.2436, pruned_loss=0.02934, over 24291.00 frames. ], tot_loss[loss=0.1471, simple_loss=0.2406, pruned_loss=0.02675, over 3756642.86 frames. ], batch size: 79, lr: 3.07e-03, grad_scale: 32.0 2023-12-04 23:14:09,460 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.43 vs. limit=15.0 2023-12-04 23:14:22,725 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.053e+02 1.246e+02 1.309e+02 1.433e+02 1.869e+02, threshold=2.619e+02, percent-clipped=0.0 2023-12-04 23:14:57,356 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=8.43 vs. limit=12.0 2023-12-04 23:15:01,689 INFO [train.py:1087] (1/4) Epoch 80, batch 350, loss[loss=0.1398, simple_loss=0.2359, pruned_loss=0.02182, over 24776.00 frames. ], tot_loss[loss=0.1476, simple_loss=0.241, pruned_loss=0.02711, over 3979818.13 frames. ], batch size: 73, lr: 3.07e-03, grad_scale: 32.0 2023-12-04 23:15:15,506 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.41 vs. 
limit=22.5 2023-12-04 23:15:23,039 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=473766.6666666667, ans=0.2 2023-12-04 23:15:24,422 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=473766.6666666667, ans=0.0 2023-12-04 23:15:35,867 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=473833.3333333333, ans=0.1 2023-12-04 23:15:39,182 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-12-04 23:15:49,674 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=473900.0, ans=0.2 2023-12-04 23:15:54,738 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.33 vs. limit=15.0 2023-12-04 23:16:01,378 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=473966.6666666667, ans=0.0 2023-12-04 23:16:03,402 INFO [train.py:1087] (1/4) Epoch 80, batch 400, loss[loss=0.1413, simple_loss=0.2344, pruned_loss=0.02414, over 24711.00 frames. ], tot_loss[loss=0.1476, simple_loss=0.2411, pruned_loss=0.02703, over 4172884.84 frames. ], batch size: 67, lr: 3.07e-03, grad_scale: 32.0 2023-12-04 23:16:14,931 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=474100.0, ans=0.1 2023-12-04 23:16:26,875 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.069e+02 1.273e+02 1.337e+02 1.434e+02 1.847e+02, threshold=2.674e+02, percent-clipped=0.0 2023-12-04 23:16:27,158 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=474166.6666666667, ans=0.0 2023-12-04 23:16:27,188 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=474166.6666666667, ans=0.07 2023-12-04 23:16:31,921 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=474166.6666666667, ans=0.1 2023-12-04 23:16:37,729 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=474166.6666666667, ans=0.5 2023-12-04 23:16:43,738 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=474233.3333333333, ans=0.025 2023-12-04 23:16:51,214 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=474300.0, ans=0.0 2023-12-04 23:16:59,247 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=474300.0, ans=10.0 2023-12-04 23:17:04,781 INFO [train.py:1087] (1/4) Epoch 80, batch 450, loss[loss=0.1351, simple_loss=0.2319, pruned_loss=0.01914, over 24737.00 frames. ], tot_loss[loss=0.1474, simple_loss=0.2409, pruned_loss=0.02695, over 4318558.82 frames. 
], batch size: 63, lr: 3.07e-03, grad_scale: 32.0 2023-12-04 23:17:15,697 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=474433.3333333333, ans=0.0 2023-12-04 23:17:21,347 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=474433.3333333333, ans=0.0 2023-12-04 23:17:44,912 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=474566.6666666667, ans=0.1 2023-12-04 23:17:48,389 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=474566.6666666667, ans=0.0 2023-12-04 23:18:06,323 INFO [train.py:1087] (1/4) Epoch 80, batch 500, loss[loss=0.1508, simple_loss=0.2402, pruned_loss=0.03074, over 24492.00 frames. ], tot_loss[loss=0.1475, simple_loss=0.2408, pruned_loss=0.02713, over 4406076.30 frames. ], batch size: 75, lr: 3.06e-03, grad_scale: 32.0 2023-12-04 23:18:10,010 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=474700.0, ans=0.1 2023-12-04 23:18:14,810 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=474700.0, ans=0.125 2023-12-04 23:18:19,488 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=474766.6666666667, ans=0.2 2023-12-04 23:18:28,516 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.131e+02 1.264e+02 1.348e+02 1.436e+02 1.913e+02, threshold=2.696e+02, percent-clipped=0.0 2023-12-04 23:18:29,881 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=474833.3333333333, ans=0.0 2023-12-04 23:18:31,142 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=474833.3333333333, ans=0.1 2023-12-04 23:18:45,828 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.41 vs. limit=15.0 2023-12-04 23:18:51,310 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=474900.0, ans=0.125 2023-12-04 23:18:51,394 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=474900.0, ans=0.125 2023-12-04 23:18:58,659 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.18 vs. limit=15.0 2023-12-04 23:19:03,635 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.13 vs. limit=15.0 2023-12-04 23:19:06,462 INFO [train.py:1087] (1/4) Epoch 80, batch 550, loss[loss=0.134, simple_loss=0.2277, pruned_loss=0.02014, over 24801.00 frames. ], tot_loss[loss=0.1475, simple_loss=0.2409, pruned_loss=0.02706, over 4480583.76 frames. 
], batch size: 73, lr: 3.06e-03, grad_scale: 32.0 2023-12-04 23:19:13,215 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=475033.3333333333, ans=0.5 2023-12-04 23:19:25,618 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=475100.0, ans=0.0 2023-12-04 23:19:34,220 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=475166.6666666667, ans=0.125 2023-12-04 23:19:35,705 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.60 vs. limit=15.0 2023-12-04 23:19:39,914 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=475166.6666666667, ans=0.125 2023-12-04 23:19:40,031 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=475166.6666666667, ans=0.125 2023-12-04 23:19:55,708 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=475300.0, ans=0.125 2023-12-04 23:19:57,204 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.69 vs. limit=22.5 2023-12-04 23:20:05,951 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=475300.0, ans=0.125 2023-12-04 23:20:06,592 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.68 vs. limit=15.0 2023-12-04 23:20:08,003 INFO [train.py:1087] (1/4) Epoch 80, batch 600, loss[loss=0.1437, simple_loss=0.2407, pruned_loss=0.02336, over 23729.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.2406, pruned_loss=0.02695, over 4561713.46 frames. ], batch size: 57, lr: 3.06e-03, grad_scale: 32.0 2023-12-04 23:20:31,890 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.148e+02 1.302e+02 1.395e+02 1.504e+02 2.408e+02, threshold=2.791e+02, percent-clipped=0.0 2023-12-04 23:20:34,481 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=475500.0, ans=0.0 2023-12-04 23:20:48,469 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=475566.6666666667, ans=0.125 2023-12-04 23:20:56,208 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=475633.3333333333, ans=0.1 2023-12-04 23:21:03,910 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=475633.3333333333, ans=0.5 2023-12-04 23:21:10,058 INFO [train.py:1087] (1/4) Epoch 80, batch 650, loss[loss=0.1541, simple_loss=0.2504, pruned_loss=0.02886, over 24481.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.2406, pruned_loss=0.02694, over 4610696.09 frames. 
], batch size: 75, lr: 3.06e-03, grad_scale: 32.0 2023-12-04 23:21:18,743 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=475700.0, ans=0.125 2023-12-04 23:21:46,767 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=475900.0, ans=10.0 2023-12-04 23:22:00,685 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=475966.6666666667, ans=0.1 2023-12-04 23:22:05,513 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.61 vs. limit=12.0 2023-12-04 23:22:11,478 INFO [train.py:1087] (1/4) Epoch 80, batch 700, loss[loss=0.142, simple_loss=0.2394, pruned_loss=0.0223, over 24789.00 frames. ], tot_loss[loss=0.1472, simple_loss=0.2408, pruned_loss=0.02682, over 4662826.50 frames. ], batch size: 72, lr: 3.06e-03, grad_scale: 32.0 2023-12-04 23:22:11,691 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=476033.3333333333, ans=0.125 2023-12-04 23:22:22,947 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.22 vs. limit=15.0 2023-12-04 23:22:33,975 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.145e+02 1.276e+02 1.339e+02 1.452e+02 1.981e+02, threshold=2.679e+02, percent-clipped=0.0 2023-12-04 23:22:45,816 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=476166.6666666667, ans=0.125 2023-12-04 23:22:51,680 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=476233.3333333333, ans=0.0 2023-12-04 23:23:00,294 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=476300.0, ans=0.125 2023-12-04 23:23:11,819 INFO [train.py:1087] (1/4) Epoch 80, batch 750, loss[loss=0.144, simple_loss=0.2408, pruned_loss=0.02365, over 24567.00 frames. ], tot_loss[loss=0.1471, simple_loss=0.2407, pruned_loss=0.02675, over 4695004.56 frames. ], batch size: 65, lr: 3.06e-03, grad_scale: 32.0 2023-12-04 23:23:17,775 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=476366.6666666667, ans=0.1 2023-12-04 23:23:22,336 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=476366.6666666667, ans=0.125 2023-12-04 23:23:35,933 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=476500.0, ans=0.5 2023-12-04 23:23:52,389 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.71 vs. 
limit=15.0 2023-12-04 23:23:52,922 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=476566.6666666667, ans=0.0 2023-12-04 23:24:03,230 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=476633.3333333333, ans=10.0 2023-12-04 23:24:12,650 INFO [train.py:1087] (1/4) Epoch 80, batch 800, loss[loss=0.1567, simple_loss=0.2502, pruned_loss=0.03159, over 21530.00 frames. ], tot_loss[loss=0.1469, simple_loss=0.2404, pruned_loss=0.02672, over 4720152.55 frames. ], batch size: 127, lr: 3.06e-03, grad_scale: 32.0 2023-12-04 23:24:14,127 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=476700.0, ans=0.125 2023-12-04 23:24:20,348 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=476700.0, ans=0.05 2023-12-04 23:24:35,158 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.046e+02 1.250e+02 1.373e+02 1.517e+02 1.879e+02, threshold=2.745e+02, percent-clipped=0.0 2023-12-04 23:24:35,538 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=476833.3333333333, ans=0.1 2023-12-04 23:24:36,578 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=476833.3333333333, ans=0.125 2023-12-04 23:24:37,670 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=476833.3333333333, ans=0.125 2023-12-04 23:25:09,015 INFO [train.py:1087] (1/4) Epoch 80, batch 850, loss[loss=0.157, simple_loss=0.2468, pruned_loss=0.03364, over 24024.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2405, pruned_loss=0.02677, over 4744597.60 frames. ], batch size: 87, lr: 3.06e-03, grad_scale: 32.0 2023-12-04 23:25:27,114 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=477100.0, ans=0.0 2023-12-04 23:25:32,379 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=477166.6666666667, ans=0.125 2023-12-04 23:25:33,023 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.15 vs. limit=22.5 2023-12-04 23:25:33,563 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=477166.6666666667, ans=0.125 2023-12-04 23:25:37,787 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=477166.6666666667, ans=0.125 2023-12-04 23:26:14,830 INFO [train.py:1087] (1/4) Epoch 81, batch 0, loss[loss=0.1353, simple_loss=0.2323, pruned_loss=0.01912, over 24809.00 frames. ], tot_loss[loss=0.1353, simple_loss=0.2323, pruned_loss=0.01912, over 24809.00 frames. ], batch size: 71, lr: 3.04e-03, grad_scale: 32.0 2023-12-04 23:26:14,832 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 23:26:28,304 INFO [train.py:1119] (1/4) Epoch 81, validation: loss=0.151, simple_loss=0.2464, pruned_loss=0.02775, over 944034.00 frames. 
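
The [optim.py:468] lines report five grad-norm summary values plus a threshold, and in these records the threshold matches Clipping_scale times the middle value, up to display rounding (e.g. in the 23:18:28 record above, 2.0 * 1.348e+02 = 2.696e+02). A small illustrative sketch of that bookkeeping, not the optimizer's actual code; treating the five values as min/quartiles/max of a window of recent per-step gradient norms is an assumption:

    import torch

    # Illustrative only: summarize a window of recent gradient norms in the
    # shape the optim.py lines above use (five summary values, a threshold of
    # clipping_scale times the middle value, and the fraction of steps clipped).
    def summarize_grad_norms(norms: list[float], clipping_scale: float = 2.0):
        t = torch.tensor(norms)
        quartiles = torch.quantile(t, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * quartiles[2]          # scale times the median
        percent_clipped = 100.0 * (t > threshold).float().mean()
        return quartiles.tolist(), threshold.item(), percent_clipped.item()

    # Values copied from the 23:18:28 record above.
    quartiles, threshold, pct = summarize_grad_norms([113.1, 126.4, 134.8, 143.6, 191.3])
    print(quartiles, threshold, pct)   # threshold = 2 * 134.8 = 269.6, pct = 0.0
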
2023-12-04 23:26:28,305 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 23:26:47,962 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=477400.0, ans=0.05 2023-12-04 23:26:56,771 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.126e+02 1.252e+02 1.352e+02 1.455e+02 2.085e+02, threshold=2.704e+02, percent-clipped=0.0 2023-12-04 23:27:08,932 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=477533.3333333333, ans=0.1 2023-12-04 23:27:25,348 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.29 vs. limit=10.0 2023-12-04 23:27:28,247 INFO [train.py:1087] (1/4) Epoch 81, batch 50, loss[loss=0.1411, simple_loss=0.2333, pruned_loss=0.02445, over 24750.00 frames. ], tot_loss[loss=0.1474, simple_loss=0.241, pruned_loss=0.02687, over 1099732.67 frames. ], batch size: 63, lr: 3.04e-03, grad_scale: 32.0 2023-12-04 23:28:06,108 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=477866.6666666667, ans=0.125 2023-12-04 23:28:28,274 INFO [train.py:1087] (1/4) Epoch 81, batch 100, loss[loss=0.1441, simple_loss=0.241, pruned_loss=0.02358, over 22864.00 frames. ], tot_loss[loss=0.1481, simple_loss=0.2416, pruned_loss=0.0273, over 1909619.07 frames. ], batch size: 106, lr: 3.03e-03, grad_scale: 32.0 2023-12-04 23:28:47,508 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=478066.6666666667, ans=0.1 2023-12-04 23:28:49,917 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=478066.6666666667, ans=0.0 2023-12-04 23:28:56,570 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.160e+02 1.303e+02 1.380e+02 1.472e+02 1.883e+02, threshold=2.760e+02, percent-clipped=0.0 2023-12-04 23:29:08,250 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=478200.0, ans=10.0 2023-12-04 23:29:28,119 INFO [train.py:1087] (1/4) Epoch 81, batch 150, loss[loss=0.1441, simple_loss=0.2393, pruned_loss=0.02442, over 24766.00 frames. ], tot_loss[loss=0.1478, simple_loss=0.2413, pruned_loss=0.02716, over 2557323.51 frames. ], batch size: 70, lr: 3.03e-03, grad_scale: 32.0 2023-12-04 23:29:28,236 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=478333.3333333333, ans=0.015 2023-12-04 23:29:32,152 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.67 vs. limit=12.0 2023-12-04 23:29:48,671 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=478400.0, ans=0.125 2023-12-04 23:29:57,272 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=5.80 vs. 
limit=15.0 2023-12-04 23:30:06,258 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=478533.3333333333, ans=0.125 2023-12-04 23:30:23,821 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=478600.0, ans=0.0 2023-12-04 23:30:28,451 INFO [train.py:1087] (1/4) Epoch 81, batch 200, loss[loss=0.1446, simple_loss=0.2396, pruned_loss=0.02487, over 24549.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.2412, pruned_loss=0.02673, over 3041859.17 frames. ], batch size: 62, lr: 3.03e-03, grad_scale: 32.0 2023-12-04 23:30:37,262 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=478666.6666666667, ans=0.05 2023-12-04 23:30:45,292 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.72 vs. limit=6.0 2023-12-04 23:30:52,747 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=478800.0, ans=0.125 2023-12-04 23:30:57,198 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.124e+02 1.273e+02 1.365e+02 1.482e+02 1.977e+02, threshold=2.729e+02, percent-clipped=0.0 2023-12-04 23:31:10,973 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.83 vs. limit=15.0 2023-12-04 23:31:17,887 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:31:23,307 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=478933.3333333333, ans=0.125 2023-12-04 23:31:28,941 INFO [train.py:1087] (1/4) Epoch 81, batch 250, loss[loss=0.1536, simple_loss=0.2477, pruned_loss=0.02973, over 24461.00 frames. ], tot_loss[loss=0.1472, simple_loss=0.2411, pruned_loss=0.02664, over 3449476.97 frames. ], batch size: 77, lr: 3.03e-03, grad_scale: 32.0 2023-12-04 23:31:48,185 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=479066.6666666667, ans=0.1 2023-12-04 23:31:49,479 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=479066.6666666667, ans=0.0 2023-12-04 23:31:50,691 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:31:50,739 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=479066.6666666667, ans=0.125 2023-12-04 23:32:01,949 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=479133.3333333333, ans=0.1 2023-12-04 23:32:09,803 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=479200.0, ans=0.1 2023-12-04 23:32:12,167 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=479200.0, ans=0.0 2023-12-04 23:32:29,536 INFO [train.py:1087] (1/4) Epoch 81, batch 300, loss[loss=0.1441, simple_loss=0.236, pruned_loss=0.02604, over 24489.00 frames. 
], tot_loss[loss=0.1471, simple_loss=0.2408, pruned_loss=0.02667, over 3748562.79 frames. ], batch size: 75, lr: 3.03e-03, grad_scale: 32.0 2023-12-04 23:32:29,725 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=479333.3333333333, ans=0.0 2023-12-04 23:32:41,687 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=479400.0, ans=0.125 2023-12-04 23:32:57,845 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.123e+02 1.265e+02 1.364e+02 1.472e+02 1.855e+02, threshold=2.728e+02, percent-clipped=0.0 2023-12-04 23:33:29,019 INFO [train.py:1087] (1/4) Epoch 81, batch 350, loss[loss=0.1417, simple_loss=0.2388, pruned_loss=0.0223, over 24771.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2406, pruned_loss=0.02675, over 3977562.04 frames. ], batch size: 70, lr: 3.03e-03, grad_scale: 32.0 2023-12-04 23:33:37,266 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=479666.6666666667, ans=0.125 2023-12-04 23:33:37,336 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=479666.6666666667, ans=0.125 2023-12-04 23:33:44,754 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=479733.3333333333, ans=0.0 2023-12-04 23:34:04,444 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=479866.6666666667, ans=0.1 2023-12-04 23:34:06,154 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.92 vs. limit=10.0 2023-12-04 23:34:08,318 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.94 vs. limit=15.0 2023-12-04 23:34:18,668 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=479933.3333333333, ans=0.1 2023-12-04 23:34:32,786 INFO [train.py:1087] (1/4) Epoch 81, batch 400, loss[loss=0.1444, simple_loss=0.2374, pruned_loss=0.02568, over 24761.00 frames. ], tot_loss[loss=0.1471, simple_loss=0.2406, pruned_loss=0.02679, over 4156683.91 frames. ], batch size: 65, lr: 3.03e-03, grad_scale: 32.0 2023-12-04 23:35:02,674 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.125e+02 1.269e+02 1.346e+02 1.459e+02 1.874e+02, threshold=2.692e+02, percent-clipped=0.0 2023-12-04 23:35:10,321 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.89 vs. limit=10.0 2023-12-04 23:35:27,846 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=480266.6666666667, ans=0.1 2023-12-04 23:35:29,242 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.64 vs. 
limit=12.0 2023-12-04 23:35:32,440 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=480266.6666666667, ans=0.2 2023-12-04 23:35:34,512 INFO [train.py:1087] (1/4) Epoch 81, batch 450, loss[loss=0.1421, simple_loss=0.2386, pruned_loss=0.02277, over 24732.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.2407, pruned_loss=0.02692, over 4286774.45 frames. ], batch size: 69, lr: 3.03e-03, grad_scale: 32.0 2023-12-04 23:35:37,446 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.50 vs. limit=15.0 2023-12-04 23:35:45,436 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=480400.0, ans=0.125 2023-12-04 23:35:46,747 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=480400.0, ans=0.2 2023-12-04 23:35:50,193 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=480400.0, ans=0.125 2023-12-04 23:36:06,911 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=480466.6666666667, ans=0.2 2023-12-04 23:36:28,444 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=480600.0, ans=0.125 2023-12-04 23:36:34,927 INFO [train.py:1087] (1/4) Epoch 81, batch 500, loss[loss=0.1536, simple_loss=0.2453, pruned_loss=0.0309, over 24548.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.2408, pruned_loss=0.02691, over 4414673.94 frames. ], batch size: 62, lr: 3.03e-03, grad_scale: 32.0 2023-12-04 23:36:47,453 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.30 vs. limit=15.0 2023-12-04 23:36:51,787 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=480733.3333333333, ans=0.125 2023-12-04 23:37:03,580 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.098e+02 1.241e+02 1.323e+02 1.432e+02 2.037e+02, threshold=2.646e+02, percent-clipped=0.0 2023-12-04 23:37:20,515 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:37:32,639 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.54 vs. limit=6.0 2023-12-04 23:37:35,457 INFO [train.py:1087] (1/4) Epoch 81, batch 550, loss[loss=0.143, simple_loss=0.238, pruned_loss=0.02397, over 24714.00 frames. ], tot_loss[loss=0.1476, simple_loss=0.2411, pruned_loss=0.02706, over 4476058.51 frames. 
], batch size: 67, lr: 3.03e-03, grad_scale: 32.0 2023-12-04 23:37:48,982 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=481066.6666666667, ans=0.0 2023-12-04 23:37:52,590 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=481066.6666666667, ans=0.05 2023-12-04 23:37:58,477 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=481066.6666666667, ans=0.125 2023-12-04 23:38:07,701 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=481133.3333333333, ans=0.125 2023-12-04 23:38:31,089 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:38:37,236 INFO [train.py:1087] (1/4) Epoch 81, batch 600, loss[loss=0.1511, simple_loss=0.2441, pruned_loss=0.02908, over 24760.00 frames. ], tot_loss[loss=0.1474, simple_loss=0.2408, pruned_loss=0.02703, over 4545163.43 frames. ], batch size: 70, lr: 3.02e-03, grad_scale: 32.0 2023-12-04 23:38:41,015 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=481333.3333333333, ans=0.125 2023-12-04 23:38:54,086 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-12-04 23:39:03,205 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=481466.6666666667, ans=0.125 2023-12-04 23:39:05,707 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=481466.6666666667, ans=0.0 2023-12-04 23:39:06,410 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.122e+02 1.246e+02 1.329e+02 1.442e+02 1.824e+02, threshold=2.657e+02, percent-clipped=0.0 2023-12-04 23:39:32,151 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=481600.0, ans=0.2 2023-12-04 23:39:38,083 INFO [train.py:1087] (1/4) Epoch 81, batch 650, loss[loss=0.1385, simple_loss=0.2358, pruned_loss=0.02058, over 24705.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2405, pruned_loss=0.02671, over 4603110.15 frames. ], batch size: 74, lr: 3.02e-03, grad_scale: 32.0 2023-12-04 23:39:38,442 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=481666.6666666667, ans=0.1 2023-12-04 23:39:52,119 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.78 vs. limit=15.0 2023-12-04 23:40:02,491 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=481800.0, ans=0.0 2023-12-04 23:40:10,567 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=481800.0, ans=0.125 2023-12-04 23:40:17,897 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=481866.6666666667, ans=0.125 2023-12-04 23:40:20,960 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.52 vs. 
limit=15.0 2023-12-04 23:40:22,889 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=481866.6666666667, ans=0.1 2023-12-04 23:40:33,817 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-12-04 23:40:39,253 INFO [train.py:1087] (1/4) Epoch 81, batch 700, loss[loss=0.1561, simple_loss=0.2463, pruned_loss=0.03295, over 24142.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2405, pruned_loss=0.02671, over 4637762.65 frames. ], batch size: 82, lr: 3.02e-03, grad_scale: 32.0 2023-12-04 23:40:39,547 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=482000.0, ans=0.5 2023-12-04 23:40:40,692 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=482000.0, ans=0.2 2023-12-04 23:40:41,813 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=482000.0, ans=0.125 2023-12-04 23:40:46,825 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.51 vs. limit=15.0 2023-12-04 23:40:54,512 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.56 vs. limit=10.0 2023-12-04 23:41:08,947 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.089e+02 1.264e+02 1.382e+02 1.480e+02 1.897e+02, threshold=2.765e+02, percent-clipped=0.0 2023-12-04 23:41:12,753 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=482133.3333333333, ans=0.125 2023-12-04 23:41:12,779 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=482133.3333333333, ans=0.2 2023-12-04 23:41:35,668 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=482266.6666666667, ans=0.0 2023-12-04 23:41:38,565 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=482266.6666666667, ans=0.2 2023-12-04 23:41:41,171 INFO [train.py:1087] (1/4) Epoch 81, batch 750, loss[loss=0.1536, simple_loss=0.247, pruned_loss=0.03006, over 24148.00 frames. ], tot_loss[loss=0.1472, simple_loss=0.2406, pruned_loss=0.02689, over 4658686.93 frames. ], batch size: 82, lr: 3.02e-03, grad_scale: 32.0 2023-12-04 23:41:58,660 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=3.49 vs. limit=15.0 2023-12-04 23:42:30,868 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=482600.0, ans=0.125 2023-12-04 23:42:31,966 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=482600.0, ans=0.1 2023-12-04 23:42:35,738 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.12 vs. limit=22.5 2023-12-04 23:42:41,426 INFO [train.py:1087] (1/4) Epoch 81, batch 800, loss[loss=0.1457, simple_loss=0.2406, pruned_loss=0.02541, over 21334.00 frames. 
], tot_loss[loss=0.147, simple_loss=0.2404, pruned_loss=0.02677, over 4683601.49 frames. ], batch size: 127, lr: 3.02e-03, grad_scale: 32.0 2023-12-04 23:42:55,048 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=482733.3333333333, ans=0.1 2023-12-04 23:42:56,362 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=482733.3333333333, ans=0.125 2023-12-04 23:43:03,810 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=482800.0, ans=0.2 2023-12-04 23:43:08,914 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.107e+02 1.262e+02 1.349e+02 1.462e+02 1.885e+02, threshold=2.698e+02, percent-clipped=0.0 2023-12-04 23:43:37,393 INFO [train.py:1087] (1/4) Epoch 81, batch 850, loss[loss=0.1559, simple_loss=0.2476, pruned_loss=0.0321, over 24223.00 frames. ], tot_loss[loss=0.1471, simple_loss=0.2403, pruned_loss=0.0269, over 4713856.93 frames. ], batch size: 82, lr: 3.02e-03, grad_scale: 32.0 2023-12-04 23:44:01,827 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=483133.3333333333, ans=0.2 2023-12-04 23:44:06,040 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=483133.3333333333, ans=0.125 2023-12-04 23:44:08,070 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=483133.3333333333, ans=0.0 2023-12-04 23:44:10,334 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=483200.0, ans=0.0 2023-12-04 23:44:18,869 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=483200.0, ans=0.0 2023-12-04 23:44:19,029 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.63 vs. limit=15.0 2023-12-04 23:44:42,184 INFO [train.py:1087] (1/4) Epoch 82, batch 0, loss[loss=0.1442, simple_loss=0.2404, pruned_loss=0.02405, over 24547.00 frames. ], tot_loss[loss=0.1442, simple_loss=0.2404, pruned_loss=0.02405, over 24547.00 frames. ], batch size: 62, lr: 3.00e-03, grad_scale: 32.0 2023-12-04 23:44:42,189 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-04 23:44:55,456 INFO [train.py:1119] (1/4) Epoch 82, validation: loss=0.1511, simple_loss=0.2466, pruned_loss=0.02783, over 944034.00 frames. 2023-12-04 23:44:55,457 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-04 23:45:25,268 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.77 vs. 
limit=22.5 2023-12-04 23:45:30,354 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.104e+02 1.295e+02 1.378e+02 1.508e+02 2.453e+02, threshold=2.755e+02, percent-clipped=0.0 2023-12-04 23:45:30,748 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=483500.0, ans=0.07 2023-12-04 23:45:33,415 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=483500.0, ans=0.125 2023-12-04 23:45:43,917 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=483566.6666666667, ans=0.07 2023-12-04 23:45:51,770 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.24 vs. limit=15.0 2023-12-04 23:45:56,414 INFO [train.py:1087] (1/4) Epoch 82, batch 50, loss[loss=0.1476, simple_loss=0.2431, pruned_loss=0.02604, over 24557.00 frames. ], tot_loss[loss=0.1485, simple_loss=0.242, pruned_loss=0.02755, over 1078012.22 frames. ], batch size: 62, lr: 3.00e-03, grad_scale: 32.0 2023-12-04 23:46:14,484 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=483700.0, ans=0.1 2023-12-04 23:46:22,964 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=483766.6666666667, ans=0.0 2023-12-04 23:46:24,336 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=483766.6666666667, ans=0.0 2023-12-04 23:46:31,128 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=483766.6666666667, ans=0.2 2023-12-04 23:46:57,132 INFO [train.py:1087] (1/4) Epoch 82, batch 100, loss[loss=0.1484, simple_loss=0.2442, pruned_loss=0.02623, over 21574.00 frames. ], tot_loss[loss=0.1484, simple_loss=0.2423, pruned_loss=0.02726, over 1899034.36 frames. ], batch size: 128, lr: 3.00e-03, grad_scale: 16.0 2023-12-04 23:47:03,518 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=483966.6666666667, ans=15.0 2023-12-04 23:47:05,743 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=483966.6666666667, ans=0.0 2023-12-04 23:47:34,849 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.080e+02 1.246e+02 1.323e+02 1.441e+02 1.748e+02, threshold=2.646e+02, percent-clipped=0.0 2023-12-04 23:47:37,372 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=484166.6666666667, ans=0.2 2023-12-04 23:47:38,482 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=484166.6666666667, ans=0.125 2023-12-04 23:47:50,020 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=484233.3333333333, ans=0.125 2023-12-04 23:47:54,503 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=484233.3333333333, ans=0.125 2023-12-04 23:47:58,372 INFO [train.py:1087] (1/4) Epoch 82, batch 150, loss[loss=0.1405, simple_loss=0.2355, pruned_loss=0.02277, over 24753.00 frames. 
], tot_loss[loss=0.1484, simple_loss=0.242, pruned_loss=0.02737, over 2546293.23 frames. ], batch size: 66, lr: 3.00e-03, grad_scale: 16.0 2023-12-04 23:48:07,670 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=484300.0, ans=0.1 2023-12-04 23:48:13,006 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=484366.6666666667, ans=0.2 2023-12-04 23:48:59,553 INFO [train.py:1087] (1/4) Epoch 82, batch 200, loss[loss=0.1507, simple_loss=0.2415, pruned_loss=0.02992, over 24849.00 frames. ], tot_loss[loss=0.1479, simple_loss=0.2414, pruned_loss=0.02723, over 3036650.74 frames. ], batch size: 68, lr: 3.00e-03, grad_scale: 8.0 2023-12-04 23:49:27,184 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.95 vs. limit=15.0 2023-12-04 23:49:36,789 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.146e+02 1.311e+02 1.433e+02 1.635e+02 2.108e+02, threshold=2.866e+02, percent-clipped=0.0 2023-12-04 23:50:00,378 INFO [train.py:1087] (1/4) Epoch 82, batch 250, loss[loss=0.1457, simple_loss=0.24, pruned_loss=0.02571, over 24710.00 frames. ], tot_loss[loss=0.1475, simple_loss=0.2412, pruned_loss=0.02695, over 3425037.82 frames. ], batch size: 69, lr: 2.99e-03, grad_scale: 8.0 2023-12-04 23:50:03,109 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=484966.6666666667, ans=0.125 2023-12-04 23:50:34,706 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=485100.0, ans=0.1 2023-12-04 23:50:54,640 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.89 vs. limit=10.0 2023-12-04 23:51:01,015 INFO [train.py:1087] (1/4) Epoch 82, batch 300, loss[loss=0.1466, simple_loss=0.2376, pruned_loss=0.02778, over 24860.00 frames. ], tot_loss[loss=0.1474, simple_loss=0.241, pruned_loss=0.0269, over 3720741.62 frames. ], batch size: 68, lr: 2.99e-03, grad_scale: 8.0 2023-12-04 23:51:03,575 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=485300.0, ans=0.125 2023-12-04 23:51:04,749 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=485300.0, ans=0.0 2023-12-04 23:51:28,499 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.16 vs. limit=15.0 2023-12-04 23:51:38,091 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.131e+02 1.293e+02 1.389e+02 1.563e+02 1.990e+02, threshold=2.777e+02, percent-clipped=0.0 2023-12-04 23:51:53,593 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=485566.6666666667, ans=0.125 2023-12-04 23:52:00,140 INFO [train.py:1087] (1/4) Epoch 82, batch 350, loss[loss=0.1398, simple_loss=0.2372, pruned_loss=0.02125, over 24558.00 frames. ], tot_loss[loss=0.1475, simple_loss=0.2411, pruned_loss=0.02691, over 3963080.22 frames. 
], batch size: 63, lr: 2.99e-03, grad_scale: 8.0 2023-12-04 23:52:05,358 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=485633.3333333333, ans=0.125 2023-12-04 23:52:08,621 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=485633.3333333333, ans=0.125 2023-12-04 23:52:10,997 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=485633.3333333333, ans=0.1 2023-12-04 23:52:38,848 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.65 vs. limit=22.5 2023-12-04 23:53:01,245 INFO [train.py:1087] (1/4) Epoch 82, batch 400, loss[loss=0.1466, simple_loss=0.2416, pruned_loss=0.02577, over 24761.00 frames. ], tot_loss[loss=0.1482, simple_loss=0.2414, pruned_loss=0.02748, over 4117836.31 frames. ], batch size: 64, lr: 2.99e-03, grad_scale: 16.0 2023-12-04 23:53:03,895 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=485966.6666666667, ans=0.1 2023-12-04 23:53:12,908 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.44 vs. limit=12.0 2023-12-04 23:53:17,304 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=486033.3333333333, ans=0.0 2023-12-04 23:53:28,421 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=486100.0, ans=0.0 2023-12-04 23:53:39,801 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.142e+02 1.288e+02 1.362e+02 1.482e+02 1.838e+02, threshold=2.724e+02, percent-clipped=0.0 2023-12-04 23:54:03,644 INFO [train.py:1087] (1/4) Epoch 82, batch 450, loss[loss=0.1706, simple_loss=0.2585, pruned_loss=0.04138, over 17295.00 frames. ], tot_loss[loss=0.1477, simple_loss=0.2409, pruned_loss=0.02726, over 4271638.22 frames. ], batch size: 179, lr: 2.99e-03, grad_scale: 16.0 2023-12-04 23:54:05,007 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=486300.0, ans=0.0 2023-12-04 23:54:32,819 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=486433.3333333333, ans=0.125 2023-12-04 23:54:47,435 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=486500.0, ans=0.0 2023-12-04 23:54:54,225 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=486566.6666666667, ans=0.0 2023-12-04 23:55:03,902 INFO [train.py:1087] (1/4) Epoch 82, batch 500, loss[loss=0.1401, simple_loss=0.2357, pruned_loss=0.02222, over 24555.00 frames. ], tot_loss[loss=0.1472, simple_loss=0.2405, pruned_loss=0.027, over 4402623.98 frames. 
], batch size: 62, lr: 2.99e-03, grad_scale: 16.0 2023-12-04 23:55:05,381 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=486633.3333333333, ans=0.125 2023-12-04 23:55:22,518 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=486700.0, ans=0.0 2023-12-04 23:55:33,498 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=486766.6666666667, ans=0.0 2023-12-04 23:55:34,995 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=486766.6666666667, ans=0.125 2023-12-04 23:55:39,723 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:55:41,585 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.108e+02 1.266e+02 1.337e+02 1.455e+02 2.001e+02, threshold=2.675e+02, percent-clipped=0.0 2023-12-04 23:56:03,070 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=486966.6666666667, ans=0.125 2023-12-04 23:56:04,013 INFO [train.py:1087] (1/4) Epoch 82, batch 550, loss[loss=0.1451, simple_loss=0.2384, pruned_loss=0.0259, over 24696.00 frames. ], tot_loss[loss=0.1472, simple_loss=0.2404, pruned_loss=0.027, over 4483203.38 frames. ], batch size: 69, lr: 2.99e-03, grad_scale: 16.0 2023-12-04 23:56:04,125 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=486966.6666666667, ans=0.125 2023-12-04 23:56:15,390 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=487033.3333333333, ans=0.125 2023-12-04 23:56:41,903 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=487166.6666666667, ans=0.125 2023-12-04 23:56:51,456 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=487233.3333333333, ans=0.07 2023-12-04 23:56:55,489 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.whiten.whitening_limit, batch_count=487233.3333333333, ans=12.0 2023-12-04 23:57:04,687 INFO [train.py:1087] (1/4) Epoch 82, batch 600, loss[loss=0.1645, simple_loss=0.2502, pruned_loss=0.03944, over 17671.00 frames. ], tot_loss[loss=0.1475, simple_loss=0.2407, pruned_loss=0.02712, over 4526364.10 frames. ], batch size: 177, lr: 2.99e-03, grad_scale: 16.0 2023-12-04 23:57:16,529 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=487366.6666666667, ans=0.1 2023-12-04 23:57:16,717 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.33 vs. 
limit=10.0 2023-12-04 23:57:19,789 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=487366.6666666667, ans=0.2 2023-12-04 23:57:24,175 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=487366.6666666667, ans=0.0 2023-12-04 23:57:25,340 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=487366.6666666667, ans=0.0 2023-12-04 23:57:31,143 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-04 23:57:38,980 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.83 vs. limit=12.0 2023-12-04 23:57:41,667 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.120e+02 1.269e+02 1.328e+02 1.424e+02 1.653e+02, threshold=2.655e+02, percent-clipped=0.0 2023-12-04 23:58:04,395 INFO [train.py:1087] (1/4) Epoch 82, batch 650, loss[loss=0.1485, simple_loss=0.2435, pruned_loss=0.02671, over 21351.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.2408, pruned_loss=0.02692, over 4582502.08 frames. ], batch size: 127, lr: 2.99e-03, grad_scale: 16.0 2023-12-04 23:58:04,737 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=487633.3333333333, ans=0.125 2023-12-04 23:58:26,351 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=487700.0, ans=0.125 2023-12-04 23:58:34,442 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=487766.6666666667, ans=0.0 2023-12-04 23:58:35,793 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=487766.6666666667, ans=0.0 2023-12-04 23:58:50,026 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=487833.3333333333, ans=0.125 2023-12-04 23:58:52,854 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=487900.0, ans=0.125 2023-12-04 23:58:59,126 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=487900.0, ans=0.125 2023-12-04 23:59:05,947 INFO [train.py:1087] (1/4) Epoch 82, batch 700, loss[loss=0.1479, simple_loss=0.2446, pruned_loss=0.02563, over 24750.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.2409, pruned_loss=0.02684, over 4649669.49 frames. ], batch size: 66, lr: 2.99e-03, grad_scale: 16.0 2023-12-04 23:59:10,774 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=487966.6666666667, ans=0.0 2023-12-04 23:59:23,172 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=488033.3333333333, ans=0.0 2023-12-04 23:59:29,749 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=488100.0, ans=0.125 2023-12-04 23:59:35,574 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.49 vs. 
limit=15.0 2023-12-04 23:59:40,669 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=488100.0, ans=0.125 2023-12-04 23:59:41,862 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=488166.6666666667, ans=0.0 2023-12-04 23:59:43,846 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.120e+02 1.242e+02 1.310e+02 1.427e+02 1.837e+02, threshold=2.620e+02, percent-clipped=0.0 2023-12-04 23:59:44,040 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=488166.6666666667, ans=0.125 2023-12-04 23:59:48,207 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.22 vs. limit=15.0 2023-12-04 23:59:48,701 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=488166.6666666667, ans=0.0 2023-12-04 23:59:53,491 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.68 vs. limit=15.0 2023-12-04 23:59:54,448 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=488233.3333333333, ans=0.0 2023-12-05 00:00:06,597 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=488300.0, ans=0.125 2023-12-05 00:00:07,474 INFO [train.py:1087] (1/4) Epoch 82, batch 750, loss[loss=0.1437, simple_loss=0.236, pruned_loss=0.02565, over 24766.00 frames. ], tot_loss[loss=0.1471, simple_loss=0.2408, pruned_loss=0.02669, over 4690680.98 frames. ], batch size: 71, lr: 2.98e-03, grad_scale: 16.0 2023-12-05 00:00:49,645 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=488500.0, ans=0.0 2023-12-05 00:00:54,025 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=488566.6666666667, ans=0.0 2023-12-05 00:01:00,071 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=488566.6666666667, ans=0.125 2023-12-05 00:01:03,783 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.68 vs. limit=15.0 2023-12-05 00:01:07,137 INFO [train.py:1087] (1/4) Epoch 82, batch 800, loss[loss=0.1415, simple_loss=0.2364, pruned_loss=0.02336, over 24728.00 frames. ], tot_loss[loss=0.1475, simple_loss=0.241, pruned_loss=0.02701, over 4694932.42 frames. 
], batch size: 67, lr: 2.98e-03, grad_scale: 32.0 2023-12-05 00:01:30,842 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=488766.6666666667, ans=0.125 2023-12-05 00:01:34,982 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=488766.6666666667, ans=0.025 2023-12-05 00:01:40,577 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=488833.3333333333, ans=0.125 2023-12-05 00:01:42,593 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.108e+02 1.299e+02 1.380e+02 1.506e+02 1.865e+02, threshold=2.761e+02, percent-clipped=0.0 2023-12-05 00:01:55,754 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=488900.0, ans=0.125 2023-12-05 00:02:02,886 INFO [train.py:1087] (1/4) Epoch 82, batch 850, loss[loss=0.1525, simple_loss=0.2513, pruned_loss=0.02681, over 24696.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.2408, pruned_loss=0.02689, over 4724782.20 frames. ], batch size: 69, lr: 2.98e-03, grad_scale: 32.0 2023-12-05 00:02:16,022 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=489033.3333333333, ans=0.125 2023-12-05 00:02:22,540 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=489033.3333333333, ans=0.125 2023-12-05 00:02:23,757 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=489100.0, ans=0.125 2023-12-05 00:02:26,837 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=489100.0, ans=0.125 2023-12-05 00:03:07,173 INFO [train.py:1087] (1/4) Epoch 83, batch 0, loss[loss=0.1441, simple_loss=0.2395, pruned_loss=0.02431, over 24564.00 frames. ], tot_loss[loss=0.1441, simple_loss=0.2395, pruned_loss=0.02431, over 24564.00 frames. ], batch size: 63, lr: 2.96e-03, grad_scale: 32.0 2023-12-05 00:03:07,174 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-05 00:03:21,101 INFO [train.py:1119] (1/4) Epoch 83, validation: loss=0.1508, simple_loss=0.2463, pruned_loss=0.02768, over 944034.00 frames. 
2023-12-05 00:03:21,102 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-05 00:03:26,064 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=489266.6666666667, ans=0.0 2023-12-05 00:03:39,793 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=489333.3333333333, ans=0.0 2023-12-05 00:03:43,049 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=489333.3333333333, ans=0.125 2023-12-05 00:04:04,723 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.101e+02 1.317e+02 1.403e+02 1.518e+02 2.335e+02, threshold=2.805e+02, percent-clipped=0.0 2023-12-05 00:04:06,156 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=489466.6666666667, ans=0.1 2023-12-05 00:04:09,588 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=489533.3333333333, ans=0.0 2023-12-05 00:04:21,592 INFO [train.py:1087] (1/4) Epoch 83, batch 50, loss[loss=0.1336, simple_loss=0.223, pruned_loss=0.02214, over 24705.00 frames. ], tot_loss[loss=0.1474, simple_loss=0.2412, pruned_loss=0.02683, over 1105229.76 frames. ], batch size: 74, lr: 2.96e-03, grad_scale: 32.0 2023-12-05 00:04:29,749 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=489600.0, ans=15.0 2023-12-05 00:04:31,377 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=489600.0, ans=0.125 2023-12-05 00:04:51,957 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=489733.3333333333, ans=0.125 2023-12-05 00:04:54,198 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=489733.3333333333, ans=0.125 2023-12-05 00:05:11,846 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=489866.6666666667, ans=0.125 2023-12-05 00:05:13,995 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=489866.6666666667, ans=0.0 2023-12-05 00:05:20,975 INFO [train.py:1087] (1/4) Epoch 83, batch 100, loss[loss=0.1536, simple_loss=0.2482, pruned_loss=0.02945, over 24025.00 frames. ], tot_loss[loss=0.1474, simple_loss=0.2408, pruned_loss=0.02695, over 1927679.21 frames. ], batch size: 87, lr: 2.96e-03, grad_scale: 32.0 2023-12-05 00:05:23,809 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=489933.3333333333, ans=0.05 2023-12-05 00:05:27,528 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=489933.3333333333, ans=0.125 2023-12-05 00:05:29,752 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=489933.3333333333, ans=0.0 2023-12-05 00:05:32,277 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.49 vs. 
limit=15.0 2023-12-05 00:06:04,450 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.049e+02 1.256e+02 1.358e+02 1.501e+02 1.956e+02, threshold=2.715e+02, percent-clipped=0.0 2023-12-05 00:06:06,928 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=490133.3333333333, ans=0.2 2023-12-05 00:06:19,924 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=490266.6666666667, ans=0.125 2023-12-05 00:06:20,841 INFO [train.py:1087] (1/4) Epoch 83, batch 150, loss[loss=0.1437, simple_loss=0.237, pruned_loss=0.02519, over 24778.00 frames. ], tot_loss[loss=0.1467, simple_loss=0.2403, pruned_loss=0.02654, over 2584227.06 frames. ], batch size: 70, lr: 2.96e-03, grad_scale: 32.0 2023-12-05 00:06:45,992 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-05 00:06:51,254 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=490400.0, ans=0.1 2023-12-05 00:07:21,716 INFO [train.py:1087] (1/4) Epoch 83, batch 200, loss[loss=0.1328, simple_loss=0.2291, pruned_loss=0.01828, over 24775.00 frames. ], tot_loss[loss=0.1468, simple_loss=0.2404, pruned_loss=0.0266, over 3092784.98 frames. ], batch size: 66, lr: 2.96e-03, grad_scale: 16.0 2023-12-05 00:07:27,010 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=490600.0, ans=0.0 2023-12-05 00:07:41,916 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=490666.6666666667, ans=0.125 2023-12-05 00:07:43,008 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=490666.6666666667, ans=0.05 2023-12-05 00:07:51,850 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=490733.3333333333, ans=0.0 2023-12-05 00:08:07,060 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.119e+02 1.244e+02 1.327e+02 1.425e+02 1.700e+02, threshold=2.653e+02, percent-clipped=0.0 2023-12-05 00:08:16,774 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.08 vs. limit=12.0 2023-12-05 00:08:23,345 INFO [train.py:1087] (1/4) Epoch 83, batch 250, loss[loss=0.1541, simple_loss=0.2471, pruned_loss=0.03056, over 24812.00 frames. ], tot_loss[loss=0.1465, simple_loss=0.2402, pruned_loss=0.02637, over 3478592.68 frames. 
], batch size: 71, lr: 2.96e-03, grad_scale: 16.0 2023-12-05 00:08:29,399 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=490933.3333333333, ans=0.0 2023-12-05 00:08:54,055 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=491066.6666666667, ans=0.125 2023-12-05 00:08:58,110 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=491066.6666666667, ans=0.125 2023-12-05 00:09:04,920 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=491133.3333333333, ans=0.2 2023-12-05 00:09:23,698 INFO [train.py:1087] (1/4) Epoch 83, batch 300, loss[loss=0.1402, simple_loss=0.2309, pruned_loss=0.02475, over 24612.00 frames. ], tot_loss[loss=0.1468, simple_loss=0.2404, pruned_loss=0.02666, over 3767866.22 frames. ], batch size: 68, lr: 2.96e-03, grad_scale: 16.0 2023-12-05 00:09:32,899 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=491266.6666666667, ans=0.125 2023-12-05 00:09:53,921 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=491400.0, ans=0.125 2023-12-05 00:10:07,833 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=491466.6666666667, ans=0.125 2023-12-05 00:10:08,635 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.140e+02 1.277e+02 1.353e+02 1.463e+02 1.799e+02, threshold=2.707e+02, percent-clipped=0.0 2023-12-05 00:10:17,045 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=491533.3333333333, ans=0.2 2023-12-05 00:10:20,581 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.44 vs. limit=15.0 2023-12-05 00:10:23,427 INFO [train.py:1087] (1/4) Epoch 83, batch 350, loss[loss=0.1415, simple_loss=0.232, pruned_loss=0.02547, over 24808.00 frames. ], tot_loss[loss=0.1468, simple_loss=0.2403, pruned_loss=0.02667, over 4008717.88 frames. ], batch size: 62, lr: 2.96e-03, grad_scale: 16.0 2023-12-05 00:11:08,740 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=22.5 2023-12-05 00:11:12,128 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=491866.6666666667, ans=0.1 2023-12-05 00:11:18,940 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=22.5 2023-12-05 00:11:25,199 INFO [train.py:1087] (1/4) Epoch 83, batch 400, loss[loss=0.1474, simple_loss=0.2444, pruned_loss=0.02518, over 24803.00 frames. ], tot_loss[loss=0.1468, simple_loss=0.2403, pruned_loss=0.02668, over 4183719.30 frames. 
], batch size: 62, lr: 2.95e-03, grad_scale: 32.0 2023-12-05 00:11:52,457 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=492066.6666666667, ans=0.035 2023-12-05 00:11:52,580 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=492066.6666666667, ans=0.1 2023-12-05 00:12:00,692 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=492133.3333333333, ans=0.125 2023-12-05 00:12:01,803 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.max_abs, batch_count=492133.3333333333, ans=10.0 2023-12-05 00:12:06,361 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=492133.3333333333, ans=0.2 2023-12-05 00:12:09,423 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.140e+02 1.296e+02 1.398e+02 1.505e+02 2.033e+02, threshold=2.797e+02, percent-clipped=0.0 2023-12-05 00:12:16,418 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=492200.0, ans=0.0 2023-12-05 00:12:21,598 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=492200.0, ans=0.1 2023-12-05 00:12:22,629 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=492200.0, ans=0.125 2023-12-05 00:12:25,987 INFO [train.py:1087] (1/4) Epoch 83, batch 450, loss[loss=0.1447, simple_loss=0.234, pruned_loss=0.02772, over 24187.00 frames. ], tot_loss[loss=0.1464, simple_loss=0.24, pruned_loss=0.02637, over 4323158.69 frames. ], batch size: 82, lr: 2.95e-03, grad_scale: 32.0 2023-12-05 00:12:31,998 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=492266.6666666667, ans=0.125 2023-12-05 00:12:36,691 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=492333.3333333333, ans=0.0 2023-12-05 00:12:43,892 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=492333.3333333333, ans=0.1 2023-12-05 00:13:08,285 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-12-05 00:13:16,864 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=492533.3333333333, ans=0.035 2023-12-05 00:13:20,638 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.64 vs. limit=15.0 2023-12-05 00:13:24,810 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=492533.3333333333, ans=0.0 2023-12-05 00:13:27,566 INFO [train.py:1087] (1/4) Epoch 83, batch 500, loss[loss=0.1536, simple_loss=0.247, pruned_loss=0.03006, over 24739.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.2399, pruned_loss=0.02638, over 4435374.64 frames. 
], batch size: 63, lr: 2.95e-03, grad_scale: 32.0 2023-12-05 00:13:34,178 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.59 vs. limit=22.5 2023-12-05 00:13:43,136 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=492666.6666666667, ans=0.1 2023-12-05 00:13:46,398 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=492666.6666666667, ans=0.1 2023-12-05 00:13:47,594 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=492666.6666666667, ans=0.2 2023-12-05 00:13:57,748 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.30 vs. limit=15.0 2023-12-05 00:14:02,576 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=492800.0, ans=0.2 2023-12-05 00:14:12,438 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.097e+02 1.267e+02 1.360e+02 1.459e+02 2.108e+02, threshold=2.720e+02, percent-clipped=0.0 2023-12-05 00:14:21,487 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-12-05 00:14:27,534 INFO [train.py:1087] (1/4) Epoch 83, batch 550, loss[loss=0.1369, simple_loss=0.2314, pruned_loss=0.02121, over 24491.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.2396, pruned_loss=0.02626, over 4516872.89 frames. ], batch size: 75, lr: 2.95e-03, grad_scale: 32.0 2023-12-05 00:14:27,654 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=492933.3333333333, ans=0.125 2023-12-05 00:14:34,303 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-05 00:14:41,135 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=15.0 2023-12-05 00:15:16,084 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-12-05 00:15:20,324 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=493200.0, ans=0.0 2023-12-05 00:15:28,172 INFO [train.py:1087] (1/4) Epoch 83, batch 600, loss[loss=0.1401, simple_loss=0.2364, pruned_loss=0.02195, over 24703.00 frames. ], tot_loss[loss=0.1462, simple_loss=0.2397, pruned_loss=0.02635, over 4576710.85 frames. 
], batch size: 74, lr: 2.95e-03, grad_scale: 16.0 2023-12-05 00:15:29,827 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=493266.6666666667, ans=0.1 2023-12-05 00:15:34,553 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=493266.6666666667, ans=0.125 2023-12-05 00:16:02,104 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=493400.0, ans=0.125 2023-12-05 00:16:15,480 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.085e+02 1.284e+02 1.384e+02 1.504e+02 1.924e+02, threshold=2.768e+02, percent-clipped=0.0 2023-12-05 00:16:24,513 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=493533.3333333333, ans=0.125 2023-12-05 00:16:29,473 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=493600.0, ans=0.07 2023-12-05 00:16:30,197 INFO [train.py:1087] (1/4) Epoch 83, batch 650, loss[loss=0.1489, simple_loss=0.2399, pruned_loss=0.02893, over 24576.00 frames. ], tot_loss[loss=0.1464, simple_loss=0.24, pruned_loss=0.0264, over 4630555.95 frames. ], batch size: 64, lr: 2.95e-03, grad_scale: 16.0 2023-12-05 00:16:32,987 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=493600.0, ans=0.0 2023-12-05 00:16:54,074 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.88 vs. limit=10.0 2023-12-05 00:17:03,497 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=493733.3333333333, ans=0.1 2023-12-05 00:17:31,467 INFO [train.py:1087] (1/4) Epoch 83, batch 700, loss[loss=0.1422, simple_loss=0.2368, pruned_loss=0.02381, over 24762.00 frames. ], tot_loss[loss=0.1466, simple_loss=0.2402, pruned_loss=0.02654, over 4664047.18 frames. ], batch size: 65, lr: 2.95e-03, grad_scale: 16.0 2023-12-05 00:17:33,299 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.34 vs. limit=15.0 2023-12-05 00:17:50,676 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=494000.0, ans=0.125 2023-12-05 00:18:17,567 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.105e+02 1.272e+02 1.376e+02 1.523e+02 1.930e+02, threshold=2.752e+02, percent-clipped=0.0 2023-12-05 00:18:33,261 INFO [train.py:1087] (1/4) Epoch 83, batch 750, loss[loss=0.1485, simple_loss=0.2388, pruned_loss=0.02914, over 24806.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2406, pruned_loss=0.02671, over 4687065.60 frames. ], batch size: 62, lr: 2.95e-03, grad_scale: 16.0 2023-12-05 00:18:35,949 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=494266.6666666667, ans=0.125 2023-12-05 00:19:04,184 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=494400.0, ans=0.0 2023-12-05 00:19:11,717 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.85 vs. 
limit=12.0 2023-12-05 00:19:33,756 INFO [train.py:1087] (1/4) Epoch 83, batch 800, loss[loss=0.1759, simple_loss=0.2583, pruned_loss=0.04671, over 17040.00 frames. ], tot_loss[loss=0.1469, simple_loss=0.2404, pruned_loss=0.02668, over 4710281.80 frames. ], batch size: 178, lr: 2.95e-03, grad_scale: 32.0 2023-12-05 00:19:50,697 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=494666.6666666667, ans=0.02 2023-12-05 00:19:58,778 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=494733.3333333333, ans=0.125 2023-12-05 00:20:01,158 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.76 vs. limit=6.0 2023-12-05 00:20:05,308 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=494733.3333333333, ans=0.125 2023-12-05 00:20:17,144 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.277e+02 1.353e+02 1.486e+02 1.861e+02, threshold=2.707e+02, percent-clipped=0.0 2023-12-05 00:20:19,773 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=494866.6666666667, ans=0.125 2023-12-05 00:20:20,174 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.18 vs. limit=22.5 2023-12-05 00:20:30,208 INFO [train.py:1087] (1/4) Epoch 83, batch 850, loss[loss=0.1405, simple_loss=0.2314, pruned_loss=0.02483, over 24491.00 frames. ], tot_loss[loss=0.1472, simple_loss=0.2406, pruned_loss=0.02693, over 4708370.48 frames. ], batch size: 75, lr: 2.95e-03, grad_scale: 32.0 2023-12-05 00:20:32,596 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=494933.3333333333, ans=0.1 2023-12-05 00:20:50,118 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=495000.0, ans=0.0 2023-12-05 00:21:00,590 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.83 vs. limit=15.0 2023-12-05 00:21:03,199 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=495133.3333333333, ans=0.0 2023-12-05 00:21:04,429 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=495133.3333333333, ans=0.1 2023-12-05 00:21:09,749 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=495133.3333333333, ans=0.125 2023-12-05 00:21:17,131 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=495200.0, ans=0.1 2023-12-05 00:21:30,954 INFO [train.py:1087] (1/4) Epoch 84, batch 0, loss[loss=0.1437, simple_loss=0.2378, pruned_loss=0.02481, over 24808.00 frames. ], tot_loss[loss=0.1437, simple_loss=0.2378, pruned_loss=0.02481, over 24808.00 frames. 
], batch size: 62, lr: 2.93e-03, grad_scale: 32.0 2023-12-05 00:21:30,955 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-05 00:21:44,640 INFO [train.py:1119] (1/4) Epoch 84, validation: loss=0.1501, simple_loss=0.2459, pruned_loss=0.0271, over 944034.00 frames. 2023-12-05 00:21:44,641 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-05 00:22:00,192 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=495300.0, ans=0.125 2023-12-05 00:22:00,461 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.38 vs. limit=15.0 2023-12-05 00:22:05,957 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-12-05 00:22:29,455 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=495433.3333333333, ans=0.1 2023-12-05 00:22:36,157 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.099e+02 1.295e+02 1.385e+02 1.530e+02 2.174e+02, threshold=2.770e+02, percent-clipped=0.0 2023-12-05 00:22:45,457 INFO [train.py:1087] (1/4) Epoch 84, batch 50, loss[loss=0.1528, simple_loss=0.2436, pruned_loss=0.03096, over 24865.00 frames. ], tot_loss[loss=0.1473, simple_loss=0.2408, pruned_loss=0.02689, over 1089021.07 frames. ], batch size: 68, lr: 2.93e-03, grad_scale: 32.0 2023-12-05 00:23:14,906 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.72 vs. limit=22.5 2023-12-05 00:23:29,811 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.96 vs. limit=10.0 2023-12-05 00:23:45,204 INFO [train.py:1087] (1/4) Epoch 84, batch 100, loss[loss=0.1405, simple_loss=0.2356, pruned_loss=0.02271, over 24721.00 frames. ], tot_loss[loss=0.1466, simple_loss=0.2402, pruned_loss=0.02647, over 1929331.33 frames. ], batch size: 67, lr: 2.93e-03, grad_scale: 32.0 2023-12-05 00:23:57,528 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.01 vs. limit=15.0 2023-12-05 00:24:19,166 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=496033.3333333333, ans=0.0 2023-12-05 00:24:33,875 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=496166.6666666667, ans=0.125 2023-12-05 00:24:38,155 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.134e+02 1.257e+02 1.321e+02 1.447e+02 1.790e+02, threshold=2.642e+02, percent-clipped=0.0 2023-12-05 00:24:41,336 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=496166.6666666667, ans=0.1 2023-12-05 00:24:46,785 INFO [train.py:1087] (1/4) Epoch 84, batch 150, loss[loss=0.1478, simple_loss=0.24, pruned_loss=0.02777, over 24503.00 frames. ], tot_loss[loss=0.1468, simple_loss=0.2403, pruned_loss=0.02666, over 2565126.42 frames. 
], batch size: 77, lr: 2.92e-03, grad_scale: 32.0 2023-12-05 00:24:47,012 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=496233.3333333333, ans=0.125 2023-12-05 00:24:47,102 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=496233.3333333333, ans=0.125 2023-12-05 00:25:03,437 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=496300.0, ans=0.125 2023-12-05 00:25:12,744 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=496366.6666666667, ans=0.125 2023-12-05 00:25:18,202 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-12-05 00:25:25,240 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=496433.3333333333, ans=15.0 2023-12-05 00:25:31,086 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.08 vs. limit=15.0 2023-12-05 00:25:35,608 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=496500.0, ans=0.1 2023-12-05 00:25:44,987 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=496500.0, ans=0.0 2023-12-05 00:25:48,167 INFO [train.py:1087] (1/4) Epoch 84, batch 200, loss[loss=0.1569, simple_loss=0.2525, pruned_loss=0.03063, over 24807.00 frames. ], tot_loss[loss=0.1465, simple_loss=0.2402, pruned_loss=0.02643, over 3068858.11 frames. ], batch size: 62, lr: 2.92e-03, grad_scale: 16.0 2023-12-05 00:26:11,784 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=496700.0, ans=0.2 2023-12-05 00:26:18,390 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=496700.0, ans=0.125 2023-12-05 00:26:19,551 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=496700.0, ans=0.1 2023-12-05 00:26:24,225 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=496766.6666666667, ans=0.125 2023-12-05 00:26:40,415 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.157e+02 1.262e+02 1.348e+02 1.464e+02 1.857e+02, threshold=2.696e+02, percent-clipped=0.0 2023-12-05 00:26:41,755 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=496833.3333333333, ans=0.125 2023-12-05 00:26:48,346 INFO [train.py:1087] (1/4) Epoch 84, batch 250, loss[loss=0.1579, simple_loss=0.2458, pruned_loss=0.03495, over 24479.00 frames. ], tot_loss[loss=0.1469, simple_loss=0.2404, pruned_loss=0.02668, over 3453637.46 frames. 
], batch size: 75, lr: 2.92e-03, grad_scale: 16.0 2023-12-05 00:27:19,341 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=497033.3333333333, ans=0.05 2023-12-05 00:27:46,372 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=497166.6666666667, ans=0.0 2023-12-05 00:27:48,460 INFO [train.py:1087] (1/4) Epoch 84, batch 300, loss[loss=0.1469, simple_loss=0.24, pruned_loss=0.02687, over 24850.00 frames. ], tot_loss[loss=0.1468, simple_loss=0.2403, pruned_loss=0.02663, over 3761407.86 frames. ], batch size: 68, lr: 2.92e-03, grad_scale: 16.0 2023-12-05 00:28:11,788 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=497366.6666666667, ans=0.0 2023-12-05 00:28:27,354 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=497433.3333333333, ans=0.125 2023-12-05 00:28:39,222 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.07 vs. limit=15.0 2023-12-05 00:28:40,034 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=497500.0, ans=0.0 2023-12-05 00:28:42,152 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.298e+02 1.372e+02 1.505e+02 1.725e+02, threshold=2.744e+02, percent-clipped=0.0 2023-12-05 00:28:49,142 INFO [train.py:1087] (1/4) Epoch 84, batch 350, loss[loss=0.1431, simple_loss=0.2364, pruned_loss=0.02491, over 24558.00 frames. ], tot_loss[loss=0.1469, simple_loss=0.2403, pruned_loss=0.02677, over 4006731.65 frames. ], batch size: 63, lr: 2.92e-03, grad_scale: 16.0 2023-12-05 00:28:51,666 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=497566.6666666667, ans=0.125 2023-12-05 00:29:14,223 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=497700.0, ans=0.125 2023-12-05 00:29:16,808 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=497700.0, ans=0.0 2023-12-05 00:29:17,750 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=497700.0, ans=0.0 2023-12-05 00:29:27,858 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=497766.6666666667, ans=0.1 2023-12-05 00:29:36,421 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=497766.6666666667, ans=0.125 2023-12-05 00:29:50,404 INFO [train.py:1087] (1/4) Epoch 84, batch 400, loss[loss=0.1358, simple_loss=0.2334, pruned_loss=0.01906, over 24752.00 frames. ], tot_loss[loss=0.1465, simple_loss=0.2399, pruned_loss=0.02658, over 4191999.83 frames. ], batch size: 65, lr: 2.92e-03, grad_scale: 32.0 2023-12-05 00:30:19,964 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=498033.3333333333, ans=0.1 2023-12-05 00:30:37,696 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.63 vs. 
limit=15.0 2023-12-05 00:30:42,376 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-05 00:30:43,586 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.111e+02 1.306e+02 1.394e+02 1.478e+02 1.847e+02, threshold=2.788e+02, percent-clipped=0.0 2023-12-05 00:30:44,414 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=498166.6666666667, ans=0.0 2023-12-05 00:30:50,926 INFO [train.py:1087] (1/4) Epoch 84, batch 450, loss[loss=0.1628, simple_loss=0.249, pruned_loss=0.03825, over 16628.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.2399, pruned_loss=0.02637, over 4331602.06 frames. ], batch size: 177, lr: 2.92e-03, grad_scale: 32.0 2023-12-05 00:31:17,559 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=498366.6666666667, ans=0.1 2023-12-05 00:31:33,970 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=498433.3333333333, ans=0.1 2023-12-05 00:31:51,040 INFO [train.py:1087] (1/4) Epoch 84, batch 500, loss[loss=0.1364, simple_loss=0.2275, pruned_loss=0.02262, over 24750.00 frames. ], tot_loss[loss=0.1462, simple_loss=0.2399, pruned_loss=0.0263, over 4446194.30 frames. ], batch size: 61, lr: 2.92e-03, grad_scale: 32.0 2023-12-05 00:32:06,024 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.34 vs. limit=10.0 2023-12-05 00:32:25,142 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=498700.0, ans=0.125 2023-12-05 00:32:41,978 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.07 vs. limit=15.0 2023-12-05 00:32:45,961 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.131e+02 1.250e+02 1.315e+02 1.443e+02 1.623e+02, threshold=2.631e+02, percent-clipped=0.0 2023-12-05 00:32:46,297 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=498833.3333333333, ans=0.125 2023-12-05 00:32:49,857 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=498833.3333333333, ans=0.0 2023-12-05 00:32:52,134 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=498900.0, ans=0.125 2023-12-05 00:32:53,035 INFO [train.py:1087] (1/4) Epoch 84, batch 550, loss[loss=0.1503, simple_loss=0.244, pruned_loss=0.02827, over 24743.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.2399, pruned_loss=0.02616, over 4524526.64 frames. ], batch size: 63, lr: 2.92e-03, grad_scale: 32.0 2023-12-05 00:32:57,098 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=498900.0, ans=0.125 2023-12-05 00:33:16,143 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.75 vs. 
limit=22.5 2023-12-05 00:33:25,114 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=499033.3333333333, ans=22.5 2023-12-05 00:33:53,649 INFO [train.py:1087] (1/4) Epoch 84, batch 600, loss[loss=0.1514, simple_loss=0.2447, pruned_loss=0.02904, over 24708.00 frames. ], tot_loss[loss=0.1465, simple_loss=0.2402, pruned_loss=0.02639, over 4577688.97 frames. ], batch size: 67, lr: 2.92e-03, grad_scale: 32.0 2023-12-05 00:34:01,188 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.59 vs. limit=15.0 2023-12-05 00:34:47,829 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.125e+02 1.265e+02 1.379e+02 1.489e+02 1.749e+02, threshold=2.757e+02, percent-clipped=0.0 2023-12-05 00:34:55,401 INFO [train.py:1087] (1/4) Epoch 84, batch 650, loss[loss=0.1375, simple_loss=0.2288, pruned_loss=0.02315, over 24544.00 frames. ], tot_loss[loss=0.1464, simple_loss=0.24, pruned_loss=0.02644, over 4624940.02 frames. ], batch size: 66, lr: 2.91e-03, grad_scale: 32.0 2023-12-05 00:34:57,999 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=499566.6666666667, ans=0.0 2023-12-05 00:35:07,642 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.08 vs. limit=6.0 2023-12-05 00:35:15,406 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=499633.3333333333, ans=0.125 2023-12-05 00:35:24,788 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-12-05 00:35:25,695 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=499700.0, ans=0.0 2023-12-05 00:35:25,832 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=499700.0, ans=0.125 2023-12-05 00:35:26,044 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.58 vs. limit=22.5 2023-12-05 00:35:31,918 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=499766.6666666667, ans=0.125 2023-12-05 00:35:33,157 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=499766.6666666667, ans=0.125 2023-12-05 00:35:41,416 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=499766.6666666667, ans=0.0 2023-12-05 00:35:56,158 INFO [train.py:1087] (1/4) Epoch 84, batch 700, loss[loss=0.1467, simple_loss=0.2385, pruned_loss=0.02746, over 24767.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.24, pruned_loss=0.02633, over 4660112.35 frames. 
], batch size: 64, lr: 2.91e-03, grad_scale: 32.0 2023-12-05 00:35:57,688 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=499900.0, ans=0.125 2023-12-05 00:35:57,810 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=499900.0, ans=0.1 2023-12-05 00:36:06,887 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=499966.6666666667, ans=0.125 2023-12-05 00:36:08,083 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=499966.6666666667, ans=0.0 2023-12-05 00:36:38,932 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=500100.0, ans=0.1 2023-12-05 00:36:49,268 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.098e+02 1.278e+02 1.395e+02 1.500e+02 1.937e+02, threshold=2.790e+02, percent-clipped=0.0 2023-12-05 00:36:54,298 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=500166.6666666667, ans=0.2 2023-12-05 00:36:56,326 INFO [train.py:1087] (1/4) Epoch 84, batch 750, loss[loss=0.1473, simple_loss=0.2416, pruned_loss=0.02647, over 24771.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2396, pruned_loss=0.02624, over 4694595.19 frames. ], batch size: 70, lr: 2.91e-03, grad_scale: 32.0 2023-12-05 00:36:58,871 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=500233.3333333333, ans=0.0 2023-12-05 00:37:01,643 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=500233.3333333333, ans=0.1 2023-12-05 00:37:06,771 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=500233.3333333333, ans=0.95 2023-12-05 00:37:12,733 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=500300.0, ans=0.0 2023-12-05 00:37:45,867 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=500500.0, ans=0.0 2023-12-05 00:37:57,167 INFO [train.py:1087] (1/4) Epoch 84, batch 800, loss[loss=0.1428, simple_loss=0.2349, pruned_loss=0.02532, over 24764.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2396, pruned_loss=0.02619, over 4724195.93 frames. ], batch size: 64, lr: 2.91e-03, grad_scale: 32.0 2023-12-05 00:38:21,905 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=500700.0, ans=0.09899494936611666 2023-12-05 00:38:21,975 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=500700.0, ans=0.125 2023-12-05 00:38:26,119 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=500700.0, ans=0.95 2023-12-05 00:38:48,043 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.143e+02 1.297e+02 1.393e+02 1.552e+02 2.763e+02, threshold=2.785e+02, percent-clipped=0.0 2023-12-05 00:38:54,462 INFO [train.py:1087] (1/4) Epoch 84, batch 850, loss[loss=0.1465, simple_loss=0.2422, pruned_loss=0.02542, over 24322.00 frames. 
], tot_loss[loss=0.1461, simple_loss=0.2398, pruned_loss=0.0262, over 4758951.67 frames. ], batch size: 79, lr: 2.91e-03, grad_scale: 32.0 2023-12-05 00:38:57,874 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=500900.0, ans=0.1 2023-12-05 00:39:27,342 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=501100.0, ans=0.125 2023-12-05 00:39:31,772 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=501100.0, ans=0.125 2023-12-05 00:39:33,662 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=501100.0, ans=0.125 2023-12-05 00:39:59,459 INFO [train.py:1087] (1/4) Epoch 85, batch 0, loss[loss=0.138, simple_loss=0.2376, pruned_loss=0.01915, over 24680.00 frames. ], tot_loss[loss=0.138, simple_loss=0.2376, pruned_loss=0.01915, over 24680.00 frames. ], batch size: 74, lr: 2.89e-03, grad_scale: 32.0 2023-12-05 00:39:59,461 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-05 00:40:12,993 INFO [train.py:1119] (1/4) Epoch 85, validation: loss=0.1507, simple_loss=0.2462, pruned_loss=0.02756, over 944034.00 frames. 2023-12-05 00:40:12,994 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-05 00:40:14,879 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.75 vs. limit=22.5 2023-12-05 00:40:19,977 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=501200.0, ans=0.125 2023-12-05 00:40:27,010 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=501266.6666666667, ans=0.2 2023-12-05 00:40:28,115 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=501266.6666666667, ans=0.125 2023-12-05 00:41:00,193 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.27 vs. limit=15.0 2023-12-05 00:41:00,845 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=501466.6666666667, ans=0.125 2023-12-05 00:41:02,190 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=501466.6666666667, ans=0.125 2023-12-05 00:41:10,012 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=501466.6666666667, ans=0.025 2023-12-05 00:41:12,003 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.093e+02 1.320e+02 1.398e+02 1.650e+02 2.233e+02, threshold=2.795e+02, percent-clipped=0.0 2023-12-05 00:41:13,232 INFO [train.py:1087] (1/4) Epoch 85, batch 50, loss[loss=0.1534, simple_loss=0.2458, pruned_loss=0.03052, over 23572.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.2393, pruned_loss=0.0265, over 1084707.30 frames. 
], batch size: 94, lr: 2.89e-03, grad_scale: 32.0 2023-12-05 00:41:13,661 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=501533.3333333333, ans=0.2 2023-12-05 00:41:22,709 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=501533.3333333333, ans=0.0 2023-12-05 00:41:26,823 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=501600.0, ans=0.2 2023-12-05 00:41:34,989 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=501600.0, ans=0.1 2023-12-05 00:42:11,872 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=501866.6666666667, ans=0.125 2023-12-05 00:42:12,730 INFO [train.py:1087] (1/4) Epoch 85, batch 100, loss[loss=0.1462, simple_loss=0.2415, pruned_loss=0.02549, over 24760.00 frames. ], tot_loss[loss=0.146, simple_loss=0.24, pruned_loss=0.02607, over 1918696.84 frames. ], batch size: 64, lr: 2.89e-03, grad_scale: 32.0 2023-12-05 00:42:14,182 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=501866.6666666667, ans=0.1 2023-12-05 00:42:15,298 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=501866.6666666667, ans=0.125 2023-12-05 00:42:17,399 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.87 vs. limit=6.0 2023-12-05 00:42:19,358 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=501866.6666666667, ans=0.125 2023-12-05 00:42:31,427 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.83 vs. limit=22.5 2023-12-05 00:42:53,041 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=502066.6666666667, ans=0.1 2023-12-05 00:42:57,839 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=502066.6666666667, ans=0.125 2023-12-05 00:43:02,344 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=502133.3333333333, ans=0.1 2023-12-05 00:43:09,310 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=502133.3333333333, ans=0.125 2023-12-05 00:43:11,812 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.083e+02 1.240e+02 1.330e+02 1.421e+02 1.906e+02, threshold=2.659e+02, percent-clipped=0.0 2023-12-05 00:43:13,043 INFO [train.py:1087] (1/4) Epoch 85, batch 150, loss[loss=0.1521, simple_loss=0.2455, pruned_loss=0.0294, over 24542.00 frames. ], tot_loss[loss=0.1466, simple_loss=0.2404, pruned_loss=0.02635, over 2562504.48 frames. 
], batch size: 75, lr: 2.89e-03, grad_scale: 32.0 2023-12-05 00:43:14,416 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=502200.0, ans=0.1 2023-12-05 00:43:18,073 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=502200.0, ans=0.125 2023-12-05 00:43:19,253 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=502200.0, ans=0.125 2023-12-05 00:43:20,687 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.33 vs. limit=15.0 2023-12-05 00:43:23,264 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-05 00:43:52,822 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=502400.0, ans=0.125 2023-12-05 00:44:14,073 INFO [train.py:1087] (1/4) Epoch 85, batch 200, loss[loss=0.1614, simple_loss=0.2537, pruned_loss=0.0345, over 24476.00 frames. ], tot_loss[loss=0.1475, simple_loss=0.241, pruned_loss=0.02696, over 3038853.13 frames. ], batch size: 75, lr: 2.89e-03, grad_scale: 32.0 2023-12-05 00:44:28,031 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=502600.0, ans=0.125 2023-12-05 00:44:52,318 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=502733.3333333333, ans=0.0 2023-12-05 00:45:06,119 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=502800.0, ans=0.125 2023-12-05 00:45:15,834 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.129e+02 1.274e+02 1.339e+02 1.453e+02 1.938e+02, threshold=2.677e+02, percent-clipped=0.0 2023-12-05 00:45:16,981 INFO [train.py:1087] (1/4) Epoch 85, batch 250, loss[loss=0.143, simple_loss=0.2384, pruned_loss=0.02374, over 24562.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2406, pruned_loss=0.02674, over 3424746.28 frames. ], batch size: 63, lr: 2.89e-03, grad_scale: 32.0 2023-12-05 00:45:27,802 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=502933.3333333333, ans=0.0 2023-12-05 00:45:39,608 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=503000.0, ans=0.125 2023-12-05 00:46:10,018 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.97 vs. limit=6.0 2023-12-05 00:46:18,517 INFO [train.py:1087] (1/4) Epoch 85, batch 300, loss[loss=0.1408, simple_loss=0.2348, pruned_loss=0.02335, over 24691.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.2398, pruned_loss=0.02619, over 3749632.55 frames. 
], batch size: 74, lr: 2.89e-03, grad_scale: 32.0 2023-12-05 00:46:24,597 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=503200.0, ans=0.125 2023-12-05 00:46:38,515 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=503266.6666666667, ans=0.125 2023-12-05 00:46:47,067 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.89 vs. limit=6.0 2023-12-05 00:47:17,382 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.102e+02 1.225e+02 1.286e+02 1.410e+02 1.666e+02, threshold=2.573e+02, percent-clipped=0.0 2023-12-05 00:47:18,595 INFO [train.py:1087] (1/4) Epoch 85, batch 350, loss[loss=0.1361, simple_loss=0.231, pruned_loss=0.02064, over 24551.00 frames. ], tot_loss[loss=0.1459, simple_loss=0.2395, pruned_loss=0.02618, over 3973288.83 frames. ], batch size: 66, lr: 2.89e-03, grad_scale: 32.0 2023-12-05 00:47:20,015 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=503533.3333333333, ans=0.125 2023-12-05 00:47:22,492 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=503533.3333333333, ans=0.0 2023-12-05 00:47:22,499 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=503533.3333333333, ans=0.1 2023-12-05 00:47:38,060 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=503600.0, ans=0.125 2023-12-05 00:47:39,655 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.36 vs. limit=15.0 2023-12-05 00:47:43,918 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=503666.6666666667, ans=0.09899494936611666 2023-12-05 00:47:45,194 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=503666.6666666667, ans=0.0 2023-12-05 00:47:52,093 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=503666.6666666667, ans=0.0 2023-12-05 00:47:52,128 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=503666.6666666667, ans=0.0 2023-12-05 00:48:20,738 INFO [train.py:1087] (1/4) Epoch 85, batch 400, loss[loss=0.1426, simple_loss=0.2407, pruned_loss=0.02228, over 22785.00 frames. ], tot_loss[loss=0.1456, simple_loss=0.2394, pruned_loss=0.02594, over 4162074.96 frames. 
], batch size: 106, lr: 2.89e-03, grad_scale: 32.0 2023-12-05 00:48:21,045 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=503866.6666666667, ans=0.1 2023-12-05 00:48:42,413 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=503933.3333333333, ans=0.125 2023-12-05 00:48:44,165 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=503933.3333333333, ans=0.0 2023-12-05 00:49:14,921 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=504133.3333333333, ans=22.5 2023-12-05 00:49:21,843 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.086e+02 1.308e+02 1.368e+02 1.432e+02 1.757e+02, threshold=2.737e+02, percent-clipped=0.0 2023-12-05 00:49:22,975 INFO [train.py:1087] (1/4) Epoch 85, batch 450, loss[loss=0.1369, simple_loss=0.2285, pruned_loss=0.02261, over 24803.00 frames. ], tot_loss[loss=0.1453, simple_loss=0.239, pruned_loss=0.02575, over 4306819.51 frames. ], batch size: 73, lr: 2.88e-03, grad_scale: 32.0 2023-12-05 00:49:46,748 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=504333.3333333333, ans=0.125 2023-12-05 00:50:00,561 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-05 00:50:00,581 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=504400.0, ans=0.125 2023-12-05 00:50:02,873 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=504400.0, ans=0.125 2023-12-05 00:50:22,219 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.68 vs. limit=15.0 2023-12-05 00:50:25,080 INFO [train.py:1087] (1/4) Epoch 85, batch 500, loss[loss=0.142, simple_loss=0.2327, pruned_loss=0.02569, over 24776.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2396, pruned_loss=0.02615, over 4405674.89 frames. ], batch size: 71, lr: 2.88e-03, grad_scale: 32.0 2023-12-05 00:50:31,286 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=504533.3333333333, ans=0.125 2023-12-05 00:50:45,139 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.53 vs. limit=15.0 2023-12-05 00:50:58,619 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=504666.6666666667, ans=0.125 2023-12-05 00:51:19,172 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=504800.0, ans=0.125 2023-12-05 00:51:24,570 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.116e+02 1.256e+02 1.341e+02 1.457e+02 1.904e+02, threshold=2.682e+02, percent-clipped=0.0 2023-12-05 00:51:25,731 INFO [train.py:1087] (1/4) Epoch 85, batch 550, loss[loss=0.1457, simple_loss=0.2402, pruned_loss=0.02561, over 24592.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2397, pruned_loss=0.02613, over 4499097.35 frames. 
], batch size: 68, lr: 2.88e-03, grad_scale: 32.0 2023-12-05 00:51:47,495 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-05 00:51:49,869 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=505000.0, ans=0.125 2023-12-05 00:52:06,266 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.49 vs. limit=15.0 2023-12-05 00:52:08,627 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.84 vs. limit=6.0 2023-12-05 00:52:09,727 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.17 vs. limit=15.0 2023-12-05 00:52:10,585 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=505066.6666666667, ans=0.0 2023-12-05 00:52:16,650 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.01 vs. limit=12.0 2023-12-05 00:52:24,026 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=505133.3333333333, ans=0.125 2023-12-05 00:52:24,161 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=505133.3333333333, ans=0.125 2023-12-05 00:52:27,391 INFO [train.py:1087] (1/4) Epoch 85, batch 600, loss[loss=0.1438, simple_loss=0.2356, pruned_loss=0.02601, over 24767.00 frames. ], tot_loss[loss=0.1462, simple_loss=0.2399, pruned_loss=0.02622, over 4560361.76 frames. ], batch size: 64, lr: 2.88e-03, grad_scale: 32.0 2023-12-05 00:52:40,593 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=505266.6666666667, ans=0.1 2023-12-05 00:53:02,000 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.22 vs. limit=22.5 2023-12-05 00:53:28,368 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.130e+02 1.257e+02 1.345e+02 1.467e+02 2.022e+02, threshold=2.691e+02, percent-clipped=0.0 2023-12-05 00:53:29,583 INFO [train.py:1087] (1/4) Epoch 85, batch 650, loss[loss=0.1351, simple_loss=0.2259, pruned_loss=0.02216, over 24800.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2398, pruned_loss=0.02611, over 4620481.44 frames. ], batch size: 62, lr: 2.88e-03, grad_scale: 32.0 2023-12-05 00:53:39,156 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=15.0 2023-12-05 00:53:51,755 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.15 vs. limit=22.5 2023-12-05 00:53:52,607 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=505600.0, ans=0.125 2023-12-05 00:53:59,055 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.31 vs. 
limit=15.0 2023-12-05 00:54:00,992 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=505666.6666666667, ans=0.0 2023-12-05 00:54:04,576 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=505666.6666666667, ans=0.04949747468305833 2023-12-05 00:54:10,464 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=505733.3333333333, ans=0.0 2023-12-05 00:54:31,744 INFO [train.py:1087] (1/4) Epoch 85, batch 700, loss[loss=0.1479, simple_loss=0.2396, pruned_loss=0.02812, over 24865.00 frames. ], tot_loss[loss=0.1459, simple_loss=0.2397, pruned_loss=0.02605, over 4665236.39 frames. ], batch size: 68, lr: 2.88e-03, grad_scale: 32.0 2023-12-05 00:54:34,346 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=505866.6666666667, ans=0.2 2023-12-05 00:54:37,807 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=505866.6666666667, ans=0.125 2023-12-05 00:54:39,425 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.34 vs. limit=15.0 2023-12-05 00:54:40,295 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=505866.6666666667, ans=0.2 2023-12-05 00:54:40,409 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2023-12-05 00:54:43,867 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=505933.3333333333, ans=0.0 2023-12-05 00:54:46,110 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=505933.3333333333, ans=0.125 2023-12-05 00:54:49,412 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2023-12-05 00:54:51,377 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=505933.3333333333, ans=0.125 2023-12-05 00:54:59,338 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=11.72 vs. limit=22.5 2023-12-05 00:55:01,462 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=506000.0, ans=0.125 2023-12-05 00:55:15,752 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=506066.6666666667, ans=0.2 2023-12-05 00:55:31,834 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.090e+02 1.262e+02 1.357e+02 1.466e+02 1.953e+02, threshold=2.715e+02, percent-clipped=0.0 2023-12-05 00:55:33,017 INFO [train.py:1087] (1/4) Epoch 85, batch 750, loss[loss=0.1491, simple_loss=0.2413, pruned_loss=0.02847, over 24776.00 frames. ], tot_loss[loss=0.1459, simple_loss=0.2397, pruned_loss=0.02605, over 4707737.48 frames. 
], batch size: 70, lr: 2.88e-03, grad_scale: 32.0 2023-12-05 00:55:40,931 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0 2023-12-05 00:55:56,882 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-05 00:56:01,082 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.88 vs. limit=15.0 2023-12-05 00:56:13,668 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=506400.0, ans=0.125 2023-12-05 00:56:16,422 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.14 vs. limit=15.0 2023-12-05 00:56:24,255 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=506466.6666666667, ans=0.0 2023-12-05 00:56:33,330 INFO [train.py:1087] (1/4) Epoch 85, batch 800, loss[loss=0.1419, simple_loss=0.2353, pruned_loss=0.02422, over 24761.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2398, pruned_loss=0.02609, over 4728196.50 frames. ], batch size: 70, lr: 2.88e-03, grad_scale: 32.0 2023-12-05 00:56:39,351 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=506533.3333333333, ans=0.125 2023-12-05 00:56:54,059 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=506600.0, ans=0.1 2023-12-05 00:57:00,727 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=506666.6666666667, ans=0.2 2023-12-05 00:57:12,857 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=506733.3333333333, ans=0.125 2023-12-05 00:57:22,707 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=506800.0, ans=0.0 2023-12-05 00:57:32,460 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.125e+02 1.281e+02 1.341e+02 1.448e+02 1.887e+02, threshold=2.683e+02, percent-clipped=0.0 2023-12-05 00:57:33,628 INFO [train.py:1087] (1/4) Epoch 85, batch 850, loss[loss=0.1601, simple_loss=0.2556, pruned_loss=0.03226, over 23810.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2398, pruned_loss=0.02607, over 4757445.21 frames. ], batch size: 95, lr: 2.88e-03, grad_scale: 32.0 2023-12-05 00:57:43,563 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=506933.3333333333, ans=0.125 2023-12-05 00:57:50,826 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=506933.3333333333, ans=0.125 2023-12-05 00:57:53,369 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.88 vs. 
limit=15.0 2023-12-05 00:58:12,351 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=507066.6666666667, ans=0.125 2023-12-05 00:58:14,904 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=507066.6666666667, ans=0.125 2023-12-05 00:58:17,516 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.92 vs. limit=10.0 2023-12-05 00:58:19,173 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=507133.3333333333, ans=0.125 2023-12-05 00:58:32,920 INFO [train.py:1087] (1/4) Epoch 86, batch 0, loss[loss=0.1461, simple_loss=0.2401, pruned_loss=0.02608, over 24563.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.2401, pruned_loss=0.02608, over 24563.00 frames. ], batch size: 64, lr: 2.86e-03, grad_scale: 32.0 2023-12-05 00:58:32,921 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-05 00:58:46,425 INFO [train.py:1119] (1/4) Epoch 86, validation: loss=0.151, simple_loss=0.2462, pruned_loss=0.02793, over 944034.00 frames. 2023-12-05 00:58:46,426 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-05 00:58:58,305 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=507233.3333333333, ans=0.1 2023-12-05 00:59:12,302 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=507300.0, ans=0.125 2023-12-05 00:59:13,468 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=507300.0, ans=0.125 2023-12-05 00:59:13,629 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-05 00:59:45,530 INFO [train.py:1087] (1/4) Epoch 86, batch 50, loss[loss=0.1786, simple_loss=0.2653, pruned_loss=0.04599, over 17386.00 frames. ], tot_loss[loss=0.1467, simple_loss=0.2402, pruned_loss=0.02661, over 1082944.85 frames. ], batch size: 176, lr: 2.86e-03, grad_scale: 32.0 2023-12-05 00:59:50,555 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.112e+02 1.300e+02 1.379e+02 1.470e+02 1.879e+02, threshold=2.758e+02, percent-clipped=0.0 2023-12-05 01:00:07,000 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=507566.6666666667, ans=0.0 2023-12-05 01:00:16,114 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=507633.3333333333, ans=0.1 2023-12-05 01:00:38,363 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-05 01:00:44,849 INFO [train.py:1087] (1/4) Epoch 86, batch 100, loss[loss=0.1341, simple_loss=0.2239, pruned_loss=0.02215, over 24724.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2403, pruned_loss=0.02682, over 1899923.54 frames. 
], batch size: 67, lr: 2.86e-03, grad_scale: 32.0 2023-12-05 01:00:45,594 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=507833.3333333333, ans=0.125 2023-12-05 01:01:06,901 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-12-05 01:01:08,133 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.71 vs. limit=15.0 2023-12-05 01:01:13,570 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=507966.6666666667, ans=0.125 2023-12-05 01:01:22,522 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=508033.3333333333, ans=0.125 2023-12-05 01:01:45,401 INFO [train.py:1087] (1/4) Epoch 86, batch 150, loss[loss=0.1309, simple_loss=0.2261, pruned_loss=0.01778, over 24755.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2397, pruned_loss=0.02618, over 2545182.34 frames. ], batch size: 61, lr: 2.86e-03, grad_scale: 32.0 2023-12-05 01:01:45,718 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=508166.6666666667, ans=0.125 2023-12-05 01:01:49,934 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.111e+02 1.280e+02 1.367e+02 1.450e+02 1.965e+02, threshold=2.735e+02, percent-clipped=0.0 2023-12-05 01:01:52,461 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=508166.6666666667, ans=0.125 2023-12-05 01:02:11,899 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=508300.0, ans=0.2 2023-12-05 01:02:31,762 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=508366.6666666667, ans=0.125 2023-12-05 01:02:45,701 INFO [train.py:1087] (1/4) Epoch 86, batch 200, loss[loss=0.147, simple_loss=0.2367, pruned_loss=0.02866, over 24578.00 frames. ], tot_loss[loss=0.1464, simple_loss=0.2401, pruned_loss=0.02639, over 3058474.11 frames. ], batch size: 64, lr: 2.86e-03, grad_scale: 32.0 2023-12-05 01:02:55,969 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-12-05 01:03:17,487 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=508633.3333333333, ans=0.0 2023-12-05 01:03:22,092 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=508700.0, ans=0.0 2023-12-05 01:03:35,176 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=508766.6666666667, ans=0.0 2023-12-05 01:03:45,989 INFO [train.py:1087] (1/4) Epoch 86, batch 250, loss[loss=0.153, simple_loss=0.2488, pruned_loss=0.02862, over 24552.00 frames. ], tot_loss[loss=0.1472, simple_loss=0.2406, pruned_loss=0.02693, over 3430557.20 frames. 
], batch size: 63, lr: 2.85e-03, grad_scale: 32.0 2023-12-05 01:03:50,703 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.128e+02 1.297e+02 1.391e+02 1.499e+02 1.858e+02, threshold=2.782e+02, percent-clipped=0.0 2023-12-05 01:04:03,594 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.67 vs. limit=15.0 2023-12-05 01:04:16,313 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.32 vs. limit=22.5 2023-12-05 01:04:33,342 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=509100.0, ans=0.125 2023-12-05 01:04:33,760 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.88 vs. limit=10.0 2023-12-05 01:04:37,603 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.46 vs. limit=6.0 2023-12-05 01:04:42,024 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=509100.0, ans=0.1 2023-12-05 01:04:46,716 INFO [train.py:1087] (1/4) Epoch 86, batch 300, loss[loss=0.138, simple_loss=0.23, pruned_loss=0.02305, over 24770.00 frames. ], tot_loss[loss=0.1471, simple_loss=0.2405, pruned_loss=0.02683, over 3720460.55 frames. ], batch size: 72, lr: 2.85e-03, grad_scale: 32.0 2023-12-05 01:04:50,566 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.15 vs. limit=6.0 2023-12-05 01:04:51,278 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=509166.6666666667, ans=0.1 2023-12-05 01:04:56,487 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.96 vs. limit=15.0 2023-12-05 01:05:06,624 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=509233.3333333333, ans=0.0 2023-12-05 01:05:09,053 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=509233.3333333333, ans=0.07 2023-12-05 01:05:24,095 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=509366.6666666667, ans=0.125 2023-12-05 01:05:47,676 INFO [train.py:1087] (1/4) Epoch 86, batch 350, loss[loss=0.1386, simple_loss=0.2326, pruned_loss=0.02229, over 24860.00 frames. ], tot_loss[loss=0.1467, simple_loss=0.2403, pruned_loss=0.0266, over 3970759.60 frames. 
], batch size: 68, lr: 2.85e-03, grad_scale: 32.0 2023-12-05 01:05:47,903 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=509500.0, ans=0.04949747468305833 2023-12-05 01:05:51,864 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=509500.0, ans=0.025 2023-12-05 01:05:52,640 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.117e+02 1.308e+02 1.400e+02 1.542e+02 2.076e+02, threshold=2.800e+02, percent-clipped=0.0 2023-12-05 01:06:16,358 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=509633.3333333333, ans=0.04949747468305833 2023-12-05 01:06:27,675 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.52 vs. limit=22.5 2023-12-05 01:06:29,528 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=509700.0, ans=0.0 2023-12-05 01:06:32,145 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=509700.0, ans=0.0 2023-12-05 01:06:36,708 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=509766.6666666667, ans=0.0 2023-12-05 01:06:38,929 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-12-05 01:06:49,003 INFO [train.py:1087] (1/4) Epoch 86, batch 400, loss[loss=0.1622, simple_loss=0.2488, pruned_loss=0.03779, over 24490.00 frames. ], tot_loss[loss=0.147, simple_loss=0.2405, pruned_loss=0.02677, over 4152789.85 frames. ], batch size: 77, lr: 2.85e-03, grad_scale: 64.0 2023-12-05 01:06:56,224 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=509833.3333333333, ans=0.125 2023-12-05 01:06:59,737 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-05 01:07:00,752 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=509900.0, ans=0.2 2023-12-05 01:07:17,377 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=509966.6666666667, ans=0.0 2023-12-05 01:07:26,951 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=510033.3333333333, ans=0.0 2023-12-05 01:07:50,441 INFO [train.py:1087] (1/4) Epoch 86, batch 450, loss[loss=0.1501, simple_loss=0.2496, pruned_loss=0.02528, over 24717.00 frames. ], tot_loss[loss=0.1469, simple_loss=0.2405, pruned_loss=0.0266, over 4289360.82 frames. ], batch size: 67, lr: 2.85e-03, grad_scale: 64.0 2023-12-05 01:07:51,995 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=510166.6666666667, ans=0.0 2023-12-05 01:07:55,004 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.161e+02 1.256e+02 1.336e+02 1.478e+02 2.181e+02, threshold=2.672e+02, percent-clipped=0.0 2023-12-05 01:07:56,982 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.38 vs. 
limit=15.0 2023-12-05 01:08:26,977 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=510366.6666666667, ans=0.125 2023-12-05 01:08:51,071 INFO [train.py:1087] (1/4) Epoch 86, batch 500, loss[loss=0.151, simple_loss=0.2469, pruned_loss=0.02756, over 24540.00 frames. ], tot_loss[loss=0.1469, simple_loss=0.2405, pruned_loss=0.02662, over 4406495.63 frames. ], batch size: 77, lr: 2.85e-03, grad_scale: 64.0 2023-12-05 01:08:54,192 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=510500.0, ans=0.125 2023-12-05 01:09:01,684 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.40 vs. limit=22.5 2023-12-05 01:09:16,606 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=510633.3333333333, ans=0.0 2023-12-05 01:09:20,927 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.82 vs. limit=15.0 2023-12-05 01:09:23,755 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=510633.3333333333, ans=0.0 2023-12-05 01:09:47,941 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=510766.6666666667, ans=0.1 2023-12-05 01:09:50,388 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=510766.6666666667, ans=0.125 2023-12-05 01:09:52,313 INFO [train.py:1087] (1/4) Epoch 86, batch 550, loss[loss=0.1363, simple_loss=0.2287, pruned_loss=0.02198, over 24725.00 frames. ], tot_loss[loss=0.1466, simple_loss=0.2403, pruned_loss=0.02645, over 4491486.52 frames. ], batch size: 69, lr: 2.85e-03, grad_scale: 64.0 2023-12-05 01:09:57,405 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.129e+02 1.286e+02 1.391e+02 1.520e+02 1.971e+02, threshold=2.782e+02, percent-clipped=0.0 2023-12-05 01:10:04,661 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=510900.0, ans=0.0 2023-12-05 01:10:10,290 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=510900.0, ans=0.125 2023-12-05 01:10:17,642 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=510966.6666666667, ans=0.0 2023-12-05 01:10:23,798 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-12-05 01:10:33,217 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=511033.3333333333, ans=0.125 2023-12-05 01:10:34,303 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=511033.3333333333, ans=0.2 2023-12-05 01:10:53,860 INFO [train.py:1087] (1/4) Epoch 86, batch 600, loss[loss=0.1613, simple_loss=0.2545, pruned_loss=0.03404, over 21694.00 frames. ], tot_loss[loss=0.1465, simple_loss=0.2402, pruned_loss=0.02638, over 4565167.46 frames. 
], batch size: 128, lr: 2.85e-03, grad_scale: 64.0 2023-12-05 01:10:55,554 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=511166.6666666667, ans=0.125 2023-12-05 01:11:10,749 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=511233.3333333333, ans=0.1 2023-12-05 01:11:16,533 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=511233.3333333333, ans=0.0 2023-12-05 01:11:31,829 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=511366.6666666667, ans=0.035 2023-12-05 01:11:33,159 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=511366.6666666667, ans=0.125 2023-12-05 01:11:56,643 INFO [train.py:1087] (1/4) Epoch 86, batch 650, loss[loss=0.1545, simple_loss=0.247, pruned_loss=0.03106, over 24557.00 frames. ], tot_loss[loss=0.1465, simple_loss=0.2401, pruned_loss=0.02647, over 4598631.45 frames. ], batch size: 63, lr: 2.85e-03, grad_scale: 32.0 2023-12-05 01:12:02,416 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.119e+02 1.278e+02 1.366e+02 1.457e+02 1.861e+02, threshold=2.732e+02, percent-clipped=0.0 2023-12-05 01:12:07,356 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=511566.6666666667, ans=0.1 2023-12-05 01:12:16,207 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=511566.6666666667, ans=0.1 2023-12-05 01:12:31,924 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=511633.3333333333, ans=0.125 2023-12-05 01:12:58,712 INFO [train.py:1087] (1/4) Epoch 86, batch 700, loss[loss=0.1417, simple_loss=0.2369, pruned_loss=0.02328, over 24771.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.2398, pruned_loss=0.02624, over 4644622.74 frames. ], batch size: 71, lr: 2.85e-03, grad_scale: 32.0 2023-12-05 01:13:11,644 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.66 vs. limit=10.0 2023-12-05 01:13:14,790 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=511900.0, ans=0.125 2023-12-05 01:13:43,179 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=512033.3333333333, ans=0.125 2023-12-05 01:14:00,331 INFO [train.py:1087] (1/4) Epoch 86, batch 750, loss[loss=0.1397, simple_loss=0.2306, pruned_loss=0.0244, over 24565.00 frames. ], tot_loss[loss=0.1457, simple_loss=0.2395, pruned_loss=0.02598, over 4685125.20 frames. 
], batch size: 62, lr: 2.85e-03, grad_scale: 32.0 2023-12-05 01:14:01,799 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=512166.6666666667, ans=0.125 2023-12-05 01:14:07,293 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.140e+02 1.276e+02 1.356e+02 1.468e+02 2.035e+02, threshold=2.712e+02, percent-clipped=0.0 2023-12-05 01:14:13,530 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=512233.3333333333, ans=0.1 2023-12-05 01:14:16,021 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=512233.3333333333, ans=0.2 2023-12-05 01:14:31,641 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=512300.0, ans=0.0 2023-12-05 01:14:54,182 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.00 vs. limit=15.0 2023-12-05 01:15:01,783 INFO [train.py:1087] (1/4) Epoch 86, batch 800, loss[loss=0.1383, simple_loss=0.2355, pruned_loss=0.02049, over 24712.00 frames. ], tot_loss[loss=0.1454, simple_loss=0.2391, pruned_loss=0.02586, over 4714974.87 frames. ], batch size: 74, lr: 2.84e-03, grad_scale: 32.0 2023-12-05 01:15:15,384 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=512566.6666666667, ans=0.125 2023-12-05 01:15:21,306 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=512566.6666666667, ans=0.0 2023-12-05 01:15:25,635 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.47 vs. limit=15.0 2023-12-05 01:15:32,867 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=512633.3333333333, ans=0.1 2023-12-05 01:15:37,816 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.whiten.whitening_limit, batch_count=512700.0, ans=12.0 2023-12-05 01:15:41,629 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=512700.0, ans=0.125 2023-12-05 01:15:42,649 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=512700.0, ans=0.125 2023-12-05 01:15:59,071 INFO [train.py:1087] (1/4) Epoch 86, batch 850, loss[loss=0.1458, simple_loss=0.2387, pruned_loss=0.02643, over 24799.00 frames. ], tot_loss[loss=0.1455, simple_loss=0.2392, pruned_loss=0.02593, over 4734358.49 frames. 
], batch size: 73, lr: 2.84e-03, grad_scale: 32.0 2023-12-05 01:16:04,745 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=512833.3333333333, ans=0.2 2023-12-05 01:16:05,599 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.087e+02 1.258e+02 1.364e+02 1.461e+02 1.903e+02, threshold=2.728e+02, percent-clipped=0.0 2023-12-05 01:16:15,762 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=512900.0, ans=0.125 2023-12-05 01:16:16,913 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=512900.0, ans=0.125 2023-12-05 01:16:58,329 INFO [train.py:1087] (1/4) Epoch 87, batch 0, loss[loss=0.1489, simple_loss=0.2466, pruned_loss=0.02554, over 24573.00 frames. ], tot_loss[loss=0.1489, simple_loss=0.2466, pruned_loss=0.02554, over 24573.00 frames. ], batch size: 63, lr: 2.83e-03, grad_scale: 32.0 2023-12-05 01:16:58,330 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-05 01:17:07,843 INFO [zipformer.py:1876] (1/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.1515, 4.9263, 4.3579, 4.5164], device='cuda:1') 2023-12-05 01:17:11,954 INFO [train.py:1119] (1/4) Epoch 87, validation: loss=0.1509, simple_loss=0.246, pruned_loss=0.02789, over 944034.00 frames. 2023-12-05 01:17:11,955 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-05 01:17:12,175 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=513133.3333333333, ans=0.1 2023-12-05 01:17:24,674 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=513200.0, ans=0.04949747468305833 2023-12-05 01:17:26,173 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-12-05 01:17:27,444 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=5.40 vs. limit=15.0 2023-12-05 01:17:34,748 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=513266.6666666667, ans=0.125 2023-12-05 01:17:50,570 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=513333.3333333333, ans=0.125 2023-12-05 01:18:09,503 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=513400.0, ans=0.2 2023-12-05 01:18:11,618 INFO [train.py:1087] (1/4) Epoch 87, batch 50, loss[loss=0.1511, simple_loss=0.2441, pruned_loss=0.02903, over 24304.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.2405, pruned_loss=0.0259, over 1104057.09 frames. 
], batch size: 79, lr: 2.83e-03, grad_scale: 32.0 2023-12-05 01:18:17,937 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=513466.6666666667, ans=0.2 2023-12-05 01:18:20,132 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=513466.6666666667, ans=0.1 2023-12-05 01:18:24,582 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.166e+02 1.275e+02 1.354e+02 1.461e+02 2.447e+02, threshold=2.709e+02, percent-clipped=0.0 2023-12-05 01:18:34,601 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=513600.0, ans=0.2 2023-12-05 01:18:45,177 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=513600.0, ans=0.04949747468305833 2023-12-05 01:19:03,552 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=4.93 vs. limit=15.0 2023-12-05 01:19:08,930 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=513733.3333333333, ans=0.04949747468305833 2023-12-05 01:19:11,037 INFO [train.py:1087] (1/4) Epoch 87, batch 100, loss[loss=0.1523, simple_loss=0.248, pruned_loss=0.02831, over 22078.00 frames. ], tot_loss[loss=0.1453, simple_loss=0.2398, pruned_loss=0.02544, over 1920973.14 frames. ], batch size: 53, lr: 2.82e-03, grad_scale: 32.0 2023-12-05 01:19:14,249 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-12-05 01:19:34,824 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=513933.3333333333, ans=0.125 2023-12-05 01:19:41,108 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=513933.3333333333, ans=0.0 2023-12-05 01:19:52,163 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.45 vs. limit=15.0 2023-12-05 01:20:05,947 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.60 vs. limit=15.0 2023-12-05 01:20:09,961 INFO [train.py:1087] (1/4) Epoch 87, batch 150, loss[loss=0.1517, simple_loss=0.2461, pruned_loss=0.02866, over 24766.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2402, pruned_loss=0.02593, over 2559296.87 frames. 
], batch size: 64, lr: 2.82e-03, grad_scale: 32.0 2023-12-05 01:20:11,267 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=514133.3333333333, ans=0.125 2023-12-05 01:20:24,195 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.100e+02 1.275e+02 1.353e+02 1.490e+02 2.022e+02, threshold=2.707e+02, percent-clipped=0.0 2023-12-05 01:20:24,518 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=514200.0, ans=0.1 2023-12-05 01:20:40,791 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=514266.6666666667, ans=0.125 2023-12-05 01:20:43,123 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=514266.6666666667, ans=0.0 2023-12-05 01:20:43,135 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=514266.6666666667, ans=0.125 2023-12-05 01:20:47,669 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=514333.3333333333, ans=0.125 2023-12-05 01:21:06,639 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.05 vs. limit=22.5 2023-12-05 01:21:07,588 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=514400.0, ans=0.0 2023-12-05 01:21:08,706 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=514400.0, ans=0.09899494936611666 2023-12-05 01:21:10,735 INFO [train.py:1087] (1/4) Epoch 87, batch 200, loss[loss=0.1394, simple_loss=0.2377, pruned_loss=0.02048, over 24562.00 frames. ], tot_loss[loss=0.1458, simple_loss=0.2398, pruned_loss=0.02596, over 3063537.15 frames. ], batch size: 63, lr: 2.82e-03, grad_scale: 32.0 2023-12-05 01:21:29,654 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=514533.3333333333, ans=0.125 2023-12-05 01:21:30,835 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=514533.3333333333, ans=0.125 2023-12-05 01:21:57,903 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=514733.3333333333, ans=0.125 2023-12-05 01:22:11,183 INFO [train.py:1087] (1/4) Epoch 87, batch 250, loss[loss=0.1484, simple_loss=0.246, pruned_loss=0.02538, over 23532.00 frames. ], tot_loss[loss=0.1459, simple_loss=0.2397, pruned_loss=0.026, over 3452275.09 frames. ], batch size: 94, lr: 2.82e-03, grad_scale: 32.0 2023-12-05 01:22:14,373 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.03 vs. 
limit=15.0 2023-12-05 01:22:23,908 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.113e+02 1.316e+02 1.383e+02 1.508e+02 1.759e+02, threshold=2.767e+02, percent-clipped=0.0 2023-12-05 01:22:24,166 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=514866.6666666667, ans=0.0 2023-12-05 01:22:25,416 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-05 01:22:28,880 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=514866.6666666667, ans=0.1 2023-12-05 01:22:31,180 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=514866.6666666667, ans=0.0 2023-12-05 01:22:32,173 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=514866.6666666667, ans=0.035 2023-12-05 01:22:37,672 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=514933.3333333333, ans=0.125 2023-12-05 01:22:42,603 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.93 vs. limit=15.0 2023-12-05 01:22:43,343 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=514933.3333333333, ans=0.0 2023-12-05 01:22:50,610 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.15 vs. limit=15.0 2023-12-05 01:22:52,673 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=515000.0, ans=0.0 2023-12-05 01:23:05,519 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=515066.6666666667, ans=0.0 2023-12-05 01:23:10,272 INFO [train.py:1087] (1/4) Epoch 87, batch 300, loss[loss=0.1406, simple_loss=0.2388, pruned_loss=0.02117, over 24688.00 frames. ], tot_loss[loss=0.1456, simple_loss=0.2397, pruned_loss=0.02579, over 3765430.56 frames. ], batch size: 74, lr: 2.82e-03, grad_scale: 32.0 2023-12-05 01:23:19,461 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=515133.3333333333, ans=0.125 2023-12-05 01:23:21,961 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=515200.0, ans=0.1 2023-12-05 01:23:40,149 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=515266.6666666667, ans=0.025 2023-12-05 01:24:10,031 INFO [train.py:1087] (1/4) Epoch 87, batch 350, loss[loss=0.1367, simple_loss=0.2314, pruned_loss=0.02098, over 24800.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2399, pruned_loss=0.0261, over 4000588.84 frames. 
], batch size: 73, lr: 2.82e-03, grad_scale: 32.0 2023-12-05 01:24:16,936 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=515466.6666666667, ans=0.125 2023-12-05 01:24:17,056 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=515466.6666666667, ans=0.0 2023-12-05 01:24:23,139 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=515533.3333333333, ans=0.125 2023-12-05 01:24:24,327 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.098e+02 1.250e+02 1.331e+02 1.413e+02 1.864e+02, threshold=2.662e+02, percent-clipped=0.0 2023-12-05 01:24:30,444 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=515533.3333333333, ans=0.0 2023-12-05 01:24:32,020 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.95 vs. limit=22.5 2023-12-05 01:24:32,727 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=515533.3333333333, ans=0.125 2023-12-05 01:25:10,265 INFO [train.py:1087] (1/4) Epoch 87, batch 400, loss[loss=0.1466, simple_loss=0.2414, pruned_loss=0.02591, over 24573.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.2396, pruned_loss=0.02625, over 4166891.58 frames. ], batch size: 64, lr: 2.82e-03, grad_scale: 32.0 2023-12-05 01:25:10,590 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=515800.0, ans=0.0 2023-12-05 01:25:23,628 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=515866.6666666667, ans=0.1 2023-12-05 01:25:37,148 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=515933.3333333333, ans=0.125 2023-12-05 01:26:02,878 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=516066.6666666667, ans=0.0 2023-12-05 01:26:11,014 INFO [train.py:1087] (1/4) Epoch 87, batch 450, loss[loss=0.1724, simple_loss=0.2548, pruned_loss=0.04498, over 17054.00 frames. ], tot_loss[loss=0.1458, simple_loss=0.2393, pruned_loss=0.02615, over 4299623.19 frames. ], batch size: 177, lr: 2.82e-03, grad_scale: 32.0 2023-12-05 01:26:15,999 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=516133.3333333333, ans=0.0 2023-12-05 01:26:21,872 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=516200.0, ans=0.125 2023-12-05 01:26:23,848 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.102e+02 1.247e+02 1.316e+02 1.414e+02 1.848e+02, threshold=2.632e+02, percent-clipped=0.0 2023-12-05 01:26:33,744 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.72 vs. 
limit=15.0 2023-12-05 01:26:39,105 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=516266.6666666667, ans=0.0 2023-12-05 01:26:46,269 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=516333.3333333333, ans=0.07 2023-12-05 01:26:54,007 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.75 vs. limit=22.5 2023-12-05 01:27:09,811 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=516400.0, ans=0.0 2023-12-05 01:27:11,715 INFO [train.py:1087] (1/4) Epoch 87, batch 500, loss[loss=0.1478, simple_loss=0.2449, pruned_loss=0.02532, over 24582.00 frames. ], tot_loss[loss=0.1457, simple_loss=0.2392, pruned_loss=0.0261, over 4407268.80 frames. ], batch size: 64, lr: 2.82e-03, grad_scale: 32.0 2023-12-05 01:27:25,344 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=516533.3333333333, ans=0.125 2023-12-05 01:27:47,929 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=516666.6666666667, ans=0.125 2023-12-05 01:28:11,476 INFO [train.py:1087] (1/4) Epoch 87, batch 550, loss[loss=0.146, simple_loss=0.2456, pruned_loss=0.02319, over 24575.00 frames. ], tot_loss[loss=0.1457, simple_loss=0.2395, pruned_loss=0.02602, over 4484280.25 frames. ], batch size: 63, lr: 2.82e-03, grad_scale: 32.0 2023-12-05 01:28:24,756 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.107e+02 1.296e+02 1.354e+02 1.450e+02 2.128e+02, threshold=2.708e+02, percent-clipped=0.0 2023-12-05 01:28:51,762 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=517000.0, ans=0.125 2023-12-05 01:28:52,938 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=517000.0, ans=0.125 2023-12-05 01:29:05,930 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=517066.6666666667, ans=0.0 2023-12-05 01:29:07,114 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=517066.6666666667, ans=0.05 2023-12-05 01:29:11,293 INFO [train.py:1087] (1/4) Epoch 87, batch 600, loss[loss=0.1471, simple_loss=0.243, pruned_loss=0.02561, over 24559.00 frames. ], tot_loss[loss=0.1457, simple_loss=0.2394, pruned_loss=0.02606, over 4546899.26 frames. ], batch size: 63, lr: 2.82e-03, grad_scale: 32.0 2023-12-05 01:29:17,775 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=517133.3333333333, ans=0.125 2023-12-05 01:29:28,944 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=517200.0, ans=0.125 2023-12-05 01:30:06,560 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=517400.0, ans=0.0 2023-12-05 01:30:12,102 INFO [train.py:1087] (1/4) Epoch 87, batch 650, loss[loss=0.1528, simple_loss=0.2486, pruned_loss=0.02853, over 24555.00 frames. ], tot_loss[loss=0.1457, simple_loss=0.2393, pruned_loss=0.02601, over 4616739.77 frames. 
], batch size: 66, lr: 2.81e-03, grad_scale: 32.0 2023-12-05 01:30:20,838 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=517466.6666666667, ans=0.125 2023-12-05 01:30:22,077 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=517466.6666666667, ans=0.2 2023-12-05 01:30:25,007 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.098e+02 1.241e+02 1.373e+02 1.498e+02 1.949e+02, threshold=2.746e+02, percent-clipped=0.0 2023-12-05 01:30:26,872 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=517533.3333333333, ans=0.95 2023-12-05 01:30:26,976 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=517533.3333333333, ans=0.125 2023-12-05 01:30:33,056 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=517533.3333333333, ans=0.1 2023-12-05 01:31:02,842 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=517733.3333333333, ans=0.0 2023-12-05 01:31:02,850 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=517733.3333333333, ans=0.2 2023-12-05 01:31:05,260 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=517733.3333333333, ans=0.0 2023-12-05 01:31:12,112 INFO [train.py:1087] (1/4) Epoch 87, batch 700, loss[loss=0.1436, simple_loss=0.2363, pruned_loss=0.02546, over 24562.00 frames. ], tot_loss[loss=0.1455, simple_loss=0.2391, pruned_loss=0.02593, over 4651957.86 frames. ], batch size: 63, lr: 2.81e-03, grad_scale: 32.0 2023-12-05 01:31:19,515 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=517800.0, ans=10.0 2023-12-05 01:31:22,907 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=517866.6666666667, ans=0.125 2023-12-05 01:31:28,960 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-05 01:31:41,718 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=517933.3333333333, ans=0.0 2023-12-05 01:31:42,949 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=517933.3333333333, ans=0.125 2023-12-05 01:31:47,368 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=518000.0, ans=0.1 2023-12-05 01:31:57,820 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.35 vs. limit=15.0 2023-12-05 01:32:10,627 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=518133.3333333333, ans=0.0 2023-12-05 01:32:11,960 INFO [train.py:1087] (1/4) Epoch 87, batch 750, loss[loss=0.1469, simple_loss=0.2442, pruned_loss=0.0248, over 24745.00 frames. ], tot_loss[loss=0.1456, simple_loss=0.2392, pruned_loss=0.02602, over 4683319.26 frames. 
], batch size: 61, lr: 2.81e-03, grad_scale: 32.0 2023-12-05 01:32:21,239 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=518133.3333333333, ans=0.125 2023-12-05 01:32:24,537 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.104e+02 1.264e+02 1.322e+02 1.416e+02 1.779e+02, threshold=2.644e+02, percent-clipped=0.0 2023-12-05 01:32:29,581 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=518200.0, ans=0.0 2023-12-05 01:32:41,676 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=518266.6666666667, ans=0.025 2023-12-05 01:32:45,547 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=518266.6666666667, ans=0.0 2023-12-05 01:33:10,545 INFO [train.py:1087] (1/4) Epoch 87, batch 800, loss[loss=0.1311, simple_loss=0.2226, pruned_loss=0.01976, over 24782.00 frames. ], tot_loss[loss=0.1455, simple_loss=0.2391, pruned_loss=0.02591, over 4709674.93 frames. ], batch size: 64, lr: 2.81e-03, grad_scale: 32.0 2023-12-05 01:33:10,872 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=518466.6666666667, ans=0.2 2023-12-05 01:33:47,741 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=518666.6666666667, ans=0.125 2023-12-05 01:33:57,643 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=518733.3333333333, ans=0.0 2023-12-05 01:34:05,973 INFO [train.py:1087] (1/4) Epoch 87, batch 850, loss[loss=0.144, simple_loss=0.241, pruned_loss=0.02346, over 24620.00 frames. ], tot_loss[loss=0.1452, simple_loss=0.239, pruned_loss=0.02576, over 4736002.14 frames. ], batch size: 68, lr: 2.81e-03, grad_scale: 32.0 2023-12-05 01:34:07,347 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=518800.0, ans=0.0 2023-12-05 01:34:09,724 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.42 vs. limit=15.0 2023-12-05 01:34:17,650 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.150e+02 1.326e+02 1.393e+02 1.545e+02 2.140e+02, threshold=2.786e+02, percent-clipped=0.0 2023-12-05 01:34:24,833 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.23 vs. limit=15.0 2023-12-05 01:34:25,478 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=518866.6666666667, ans=0.0 2023-12-05 01:34:49,207 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=519066.6666666667, ans=0.0 2023-12-05 01:35:10,934 INFO [train.py:1087] (1/4) Epoch 88, batch 0, loss[loss=0.1403, simple_loss=0.2292, pruned_loss=0.02575, over 24780.00 frames. ], tot_loss[loss=0.1403, simple_loss=0.2292, pruned_loss=0.02575, over 24780.00 frames. 
], batch size: 64, lr: 2.79e-03, grad_scale: 32.0 2023-12-05 01:35:10,935 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-05 01:35:19,262 INFO [zipformer.py:1876] (1/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.7728, 4.1808, 3.3063, 4.4868], device='cuda:1') 2023-12-05 01:35:24,428 INFO [train.py:1119] (1/4) Epoch 88, validation: loss=0.1507, simple_loss=0.2459, pruned_loss=0.02777, over 944034.00 frames. 2023-12-05 01:35:24,429 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-05 01:35:26,035 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=519100.0, ans=0.0 2023-12-05 01:35:41,895 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=519166.6666666667, ans=0.125 2023-12-05 01:36:06,974 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=519300.0, ans=10.0 2023-12-05 01:36:20,935 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=5.13 vs. limit=15.0 2023-12-05 01:36:23,746 INFO [train.py:1087] (1/4) Epoch 88, batch 50, loss[loss=0.1392, simple_loss=0.2328, pruned_loss=0.02278, over 24818.00 frames. ], tot_loss[loss=0.1456, simple_loss=0.2393, pruned_loss=0.02589, over 1083183.53 frames. ], batch size: 68, lr: 2.79e-03, grad_scale: 32.0 2023-12-05 01:36:36,973 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=519500.0, ans=0.125 2023-12-05 01:36:42,464 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.145e+02 1.338e+02 1.439e+02 1.555e+02 2.123e+02, threshold=2.877e+02, percent-clipped=0.0 2023-12-05 01:36:42,708 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=519500.0, ans=0.1 2023-12-05 01:37:05,349 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=2.87 vs. limit=12.0 2023-12-05 01:37:11,132 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.84 vs. limit=15.0 2023-12-05 01:37:16,604 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=519700.0, ans=0.125 2023-12-05 01:37:23,404 INFO [train.py:1087] (1/4) Epoch 88, batch 100, loss[loss=0.1323, simple_loss=0.232, pruned_loss=0.01628, over 24805.00 frames. ], tot_loss[loss=0.1457, simple_loss=0.2393, pruned_loss=0.02609, over 1913112.38 frames. 
], batch size: 72, lr: 2.79e-03, grad_scale: 32.0 2023-12-05 01:37:33,190 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=519766.6666666667, ans=0.125 2023-12-05 01:37:34,280 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=519833.3333333333, ans=0.0 2023-12-05 01:37:34,420 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=519833.3333333333, ans=0.0 2023-12-05 01:37:58,047 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=519966.6666666667, ans=0.1 2023-12-05 01:38:03,196 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=519966.6666666667, ans=0.04949747468305833 2023-12-05 01:38:10,258 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=520033.3333333333, ans=0.125 2023-12-05 01:38:16,643 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-05 01:38:23,192 INFO [train.py:1087] (1/4) Epoch 88, batch 150, loss[loss=0.144, simple_loss=0.2393, pruned_loss=0.02436, over 24738.00 frames. ], tot_loss[loss=0.1454, simple_loss=0.2389, pruned_loss=0.02591, over 2563168.45 frames. ], batch size: 63, lr: 2.79e-03, grad_scale: 32.0 2023-12-05 01:38:27,092 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=520100.0, ans=0.125 2023-12-05 01:38:42,591 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.102e+02 1.250e+02 1.326e+02 1.416e+02 1.687e+02, threshold=2.651e+02, percent-clipped=0.0 2023-12-05 01:38:53,114 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.78 vs. limit=15.0 2023-12-05 01:38:58,359 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=520300.0, ans=0.125 2023-12-05 01:39:20,697 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=520366.6666666667, ans=0.125 2023-12-05 01:39:23,938 INFO [train.py:1087] (1/4) Epoch 88, batch 200, loss[loss=0.1416, simple_loss=0.2368, pruned_loss=0.02327, over 24759.00 frames. ], tot_loss[loss=0.1456, simple_loss=0.2393, pruned_loss=0.02599, over 3063358.66 frames. 
], batch size: 64, lr: 2.79e-03, grad_scale: 32.0 2023-12-05 01:39:24,143 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=520433.3333333333, ans=0.035 2023-12-05 01:39:35,096 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=520500.0, ans=0.0 2023-12-05 01:39:48,864 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=520566.6666666667, ans=0.125 2023-12-05 01:39:53,501 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=520566.6666666667, ans=0.125 2023-12-05 01:40:00,556 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.44 vs. limit=22.5 2023-12-05 01:40:01,856 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=520633.3333333333, ans=0.1 2023-12-05 01:40:24,710 INFO [train.py:1087] (1/4) Epoch 88, batch 250, loss[loss=0.1534, simple_loss=0.2451, pruned_loss=0.0309, over 24154.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2395, pruned_loss=0.02619, over 3448102.58 frames. ], batch size: 82, lr: 2.79e-03, grad_scale: 32.0 2023-12-05 01:40:34,174 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=520766.6666666667, ans=0.125 2023-12-05 01:40:43,723 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.172e+02 1.286e+02 1.349e+02 1.447e+02 1.873e+02, threshold=2.697e+02, percent-clipped=0.0 2023-12-05 01:40:47,017 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.59 vs. limit=15.0 2023-12-05 01:40:51,845 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=520900.0, ans=0.125 2023-12-05 01:41:11,540 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=521033.3333333333, ans=0.0 2023-12-05 01:41:21,881 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=521033.3333333333, ans=0.1 2023-12-05 01:41:25,467 INFO [train.py:1087] (1/4) Epoch 88, batch 300, loss[loss=0.1352, simple_loss=0.2291, pruned_loss=0.02067, over 24852.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.2397, pruned_loss=0.02626, over 3741598.85 frames. ], batch size: 68, lr: 2.79e-03, grad_scale: 32.0 2023-12-05 01:41:28,146 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=521100.0, ans=0.1 2023-12-05 01:41:30,596 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=521100.0, ans=0.1 2023-12-05 01:41:32,731 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=521100.0, ans=0.1 2023-12-05 01:41:52,364 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.93 vs. 
limit=15.0 2023-12-05 01:42:12,775 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.56 vs. limit=10.0 2023-12-05 01:42:25,210 INFO [train.py:1087] (1/4) Epoch 88, batch 350, loss[loss=0.1489, simple_loss=0.2477, pruned_loss=0.02501, over 21445.00 frames. ], tot_loss[loss=0.1464, simple_loss=0.24, pruned_loss=0.02644, over 3963174.29 frames. ], batch size: 127, lr: 2.79e-03, grad_scale: 32.0 2023-12-05 01:42:26,956 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=521433.3333333333, ans=0.125 2023-12-05 01:42:34,655 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=521433.3333333333, ans=0.125 2023-12-05 01:42:36,961 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=521500.0, ans=0.125 2023-12-05 01:42:39,360 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=521500.0, ans=0.125 2023-12-05 01:42:44,771 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.131e+02 1.283e+02 1.389e+02 1.504e+02 1.993e+02, threshold=2.777e+02, percent-clipped=0.0 2023-12-05 01:42:56,612 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=521566.6666666667, ans=0.125 2023-12-05 01:43:11,606 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=521633.3333333333, ans=0.125 2023-12-05 01:43:20,918 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=521700.0, ans=0.125 2023-12-05 01:43:25,148 INFO [train.py:1087] (1/4) Epoch 88, batch 400, loss[loss=0.1428, simple_loss=0.238, pruned_loss=0.02383, over 24561.00 frames. ], tot_loss[loss=0.1458, simple_loss=0.2394, pruned_loss=0.02607, over 4162240.75 frames. 
], batch size: 66, lr: 2.79e-03, grad_scale: 32.0 2023-12-05 01:43:26,547 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=521766.6666666667, ans=0.125 2023-12-05 01:43:28,986 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=521766.6666666667, ans=0.0 2023-12-05 01:43:42,274 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=521833.3333333333, ans=0.0 2023-12-05 01:43:45,723 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=521833.3333333333, ans=0.125 2023-12-05 01:43:50,369 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=521900.0, ans=0.125 2023-12-05 01:43:51,493 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=521900.0, ans=0.125 2023-12-05 01:44:15,899 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=522033.3333333333, ans=0.125 2023-12-05 01:44:25,954 INFO [train.py:1087] (1/4) Epoch 88, batch 450, loss[loss=0.1442, simple_loss=0.2358, pruned_loss=0.02628, over 24769.00 frames. ], tot_loss[loss=0.1459, simple_loss=0.2394, pruned_loss=0.02619, over 4302182.65 frames. ], batch size: 64, lr: 2.79e-03, grad_scale: 32.0 2023-12-05 01:44:36,896 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.83 vs. limit=22.5 2023-12-05 01:44:45,802 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.120e+02 1.274e+02 1.381e+02 1.491e+02 2.054e+02, threshold=2.762e+02, percent-clipped=0.0 2023-12-05 01:44:47,143 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=522166.6666666667, ans=0.125 2023-12-05 01:44:54,724 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.40 vs. limit=22.5 2023-12-05 01:45:03,322 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=522300.0, ans=0.0 2023-12-05 01:45:25,905 INFO [train.py:1087] (1/4) Epoch 88, batch 500, loss[loss=0.1488, simple_loss=0.2408, pruned_loss=0.02835, over 24796.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.2397, pruned_loss=0.02628, over 4403946.47 frames. ], batch size: 62, lr: 2.78e-03, grad_scale: 32.0 2023-12-05 01:45:54,237 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.93 vs. 
limit=15.0 2023-12-05 01:46:04,637 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=522633.3333333333, ans=0.1 2023-12-05 01:46:08,573 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=522633.3333333333, ans=0.125 2023-12-05 01:46:16,531 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=522700.0, ans=0.95 2023-12-05 01:46:25,641 INFO [train.py:1087] (1/4) Epoch 88, batch 550, loss[loss=0.1553, simple_loss=0.2523, pruned_loss=0.02915, over 24766.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.2399, pruned_loss=0.02635, over 4492981.50 frames. ], batch size: 65, lr: 2.78e-03, grad_scale: 32.0 2023-12-05 01:46:28,545 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=522766.6666666667, ans=10.0 2023-12-05 01:46:30,762 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=522766.6666666667, ans=0.5 2023-12-05 01:46:43,035 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=522833.3333333333, ans=0.0 2023-12-05 01:46:44,288 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=522833.3333333333, ans=0.025 2023-12-05 01:46:45,076 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.139e+02 1.255e+02 1.343e+02 1.426e+02 1.828e+02, threshold=2.686e+02, percent-clipped=0.0 2023-12-05 01:46:54,935 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=522900.0, ans=0.125 2023-12-05 01:47:03,683 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=522966.6666666667, ans=0.1 2023-12-05 01:47:23,794 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.41 vs. limit=22.5 2023-12-05 01:47:25,516 INFO [train.py:1087] (1/4) Epoch 88, batch 600, loss[loss=0.1519, simple_loss=0.2492, pruned_loss=0.02728, over 24502.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2396, pruned_loss=0.02615, over 4553160.78 frames. ], batch size: 75, lr: 2.78e-03, grad_scale: 32.0 2023-12-05 01:47:42,803 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.16 vs. limit=15.0 2023-12-05 01:47:52,158 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.91 vs. limit=15.0 2023-12-05 01:48:21,247 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=523366.6666666667, ans=0.035 2023-12-05 01:48:25,787 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.56 vs. limit=15.0 2023-12-05 01:48:26,043 INFO [train.py:1087] (1/4) Epoch 88, batch 650, loss[loss=0.1525, simple_loss=0.2458, pruned_loss=0.02964, over 24810.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2395, pruned_loss=0.02629, over 4606125.77 frames. 
], batch size: 62, lr: 2.78e-03, grad_scale: 32.0 2023-12-05 01:48:34,350 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=523433.3333333333, ans=0.125 2023-12-05 01:48:34,405 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=523433.3333333333, ans=0.1 2023-12-05 01:48:34,775 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.77 vs. limit=22.5 2023-12-05 01:48:45,120 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.133e+02 1.280e+02 1.378e+02 1.457e+02 1.837e+02, threshold=2.755e+02, percent-clipped=0.0 2023-12-05 01:48:52,222 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=523566.6666666667, ans=0.0 2023-12-05 01:48:55,712 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=523566.6666666667, ans=0.125 2023-12-05 01:48:56,741 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=523566.6666666667, ans=0.1 2023-12-05 01:49:03,325 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.73 vs. limit=10.0 2023-12-05 01:49:07,539 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=523633.3333333333, ans=0.1 2023-12-05 01:49:25,642 INFO [train.py:1087] (1/4) Epoch 88, batch 700, loss[loss=0.1472, simple_loss=0.2382, pruned_loss=0.02815, over 24773.00 frames. ], tot_loss[loss=0.1458, simple_loss=0.2393, pruned_loss=0.02613, over 4669412.42 frames. ], batch size: 64, lr: 2.78e-03, grad_scale: 32.0 2023-12-05 01:49:25,860 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=523766.6666666667, ans=0.125 2023-12-05 01:49:26,987 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=523766.6666666667, ans=0.125 2023-12-05 01:49:27,317 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.13 vs. limit=22.5 2023-12-05 01:49:28,499 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.19 vs. limit=15.0 2023-12-05 01:49:31,520 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=523766.6666666667, ans=0.0 2023-12-05 01:49:37,237 INFO [scaling.py:1022] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.37 vs. limit=8.0 2023-12-05 01:49:44,936 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.88 vs. 
limit=12.0 2023-12-05 01:49:52,709 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=523900.0, ans=0.04949747468305833 2023-12-05 01:50:12,690 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=524033.3333333333, ans=0.0 2023-12-05 01:50:23,970 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.12 vs. limit=6.0 2023-12-05 01:50:25,574 INFO [train.py:1087] (1/4) Epoch 88, batch 750, loss[loss=0.1481, simple_loss=0.2412, pruned_loss=0.02745, over 24536.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2394, pruned_loss=0.02624, over 4703879.85 frames. ], batch size: 66, lr: 2.78e-03, grad_scale: 32.0 2023-12-05 01:50:25,832 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=524100.0, ans=0.0 2023-12-05 01:50:34,895 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=524100.0, ans=0.125 2023-12-05 01:50:35,993 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=524166.6666666667, ans=0.125 2023-12-05 01:50:43,991 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.193e+02 1.285e+02 1.365e+02 1.514e+02 2.030e+02, threshold=2.729e+02, percent-clipped=0.0 2023-12-05 01:50:44,218 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=524166.6666666667, ans=0.0 2023-12-05 01:50:44,326 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=524166.6666666667, ans=0.0 2023-12-05 01:50:46,825 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5 2023-12-05 01:50:48,371 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.78 vs. limit=15.0 2023-12-05 01:51:01,277 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=524300.0, ans=0.0 2023-12-05 01:51:14,255 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-12-05 01:51:16,546 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.95 vs. limit=15.0 2023-12-05 01:51:17,267 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=524366.6666666666, ans=0.0 2023-12-05 01:51:22,014 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=524366.6666666666, ans=0.125 2023-12-05 01:51:24,698 INFO [train.py:1087] (1/4) Epoch 88, batch 800, loss[loss=0.145, simple_loss=0.2365, pruned_loss=0.02672, over 24791.00 frames. ], tot_loss[loss=0.1457, simple_loss=0.2393, pruned_loss=0.0261, over 4736139.46 frames. 
], batch size: 70, lr: 2.78e-03, grad_scale: 32.0 2023-12-05 01:51:34,148 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-05 01:51:41,867 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=524500.0, ans=0.1 2023-12-05 01:51:46,694 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-12-05 01:51:48,327 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=524566.6666666666, ans=0.125 2023-12-05 01:51:53,558 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=524566.6666666666, ans=0.2 2023-12-05 01:52:01,266 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-12-05 01:52:19,124 INFO [train.py:1087] (1/4) Epoch 88, batch 850, loss[loss=0.1484, simple_loss=0.242, pruned_loss=0.02745, over 24818.00 frames. ], tot_loss[loss=0.1456, simple_loss=0.239, pruned_loss=0.02607, over 4763785.93 frames. ], batch size: 62, lr: 2.78e-03, grad_scale: 32.0 2023-12-05 01:52:23,867 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.02 vs. limit=12.0 2023-12-05 01:52:30,970 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=524833.3333333334, ans=0.125 2023-12-05 01:52:33,175 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-05 01:52:35,476 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.69 vs. limit=15.0 2023-12-05 01:52:36,086 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.077e+02 1.284e+02 1.365e+02 1.470e+02 3.303e+02, threshold=2.731e+02, percent-clipped=1.0 2023-12-05 01:52:38,913 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=524833.3333333334, ans=0.0 2023-12-05 01:52:39,094 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=524833.3333333334, ans=0.0 2023-12-05 01:52:44,361 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=524900.0, ans=0.0 2023-12-05 01:52:58,532 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=524966.6666666666, ans=0.2 2023-12-05 01:53:18,538 INFO [train.py:1087] (1/4) Epoch 89, batch 0, loss[loss=0.148, simple_loss=0.2422, pruned_loss=0.02693, over 24334.00 frames. ], tot_loss[loss=0.148, simple_loss=0.2422, pruned_loss=0.02693, over 24334.00 frames. 
], batch size: 79, lr: 2.76e-03, grad_scale: 32.0 2023-12-05 01:53:18,539 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-05 01:53:26,839 INFO [zipformer.py:1876] (1/4) name=encoder.encoders.3.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([2.5676, 3.6370, 3.1085, 3.8187, 3.3697, 3.2094, 3.7270, 3.6697], device='cuda:1') 2023-12-05 01:53:32,046 INFO [train.py:1119] (1/4) Epoch 89, validation: loss=0.1517, simple_loss=0.2464, pruned_loss=0.02848, over 944034.00 frames. 2023-12-05 01:53:32,047 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-05 01:53:38,043 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=525066.6666666666, ans=0.1 2023-12-05 01:54:25,413 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=525333.3333333334, ans=0.125 2023-12-05 01:54:31,146 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=525400.0, ans=0.0 2023-12-05 01:54:31,913 INFO [train.py:1087] (1/4) Epoch 89, batch 50, loss[loss=0.1493, simple_loss=0.2454, pruned_loss=0.02664, over 24720.00 frames. ], tot_loss[loss=0.1463, simple_loss=0.2401, pruned_loss=0.02625, over 1079228.67 frames. ], batch size: 69, lr: 2.76e-03, grad_scale: 32.0 2023-12-05 01:54:35,644 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=525400.0, ans=0.125 2023-12-05 01:54:56,619 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.154e+02 1.263e+02 1.333e+02 1.502e+02 2.119e+02, threshold=2.665e+02, percent-clipped=0.0 2023-12-05 01:55:00,031 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=525533.3333333334, ans=0.0 2023-12-05 01:55:05,750 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=525533.3333333334, ans=0.2 2023-12-05 01:55:31,468 INFO [train.py:1087] (1/4) Epoch 89, batch 100, loss[loss=0.1515, simple_loss=0.244, pruned_loss=0.02951, over 23551.00 frames. ], tot_loss[loss=0.1456, simple_loss=0.2394, pruned_loss=0.02592, over 1917022.35 frames. ], batch size: 94, lr: 2.76e-03, grad_scale: 32.0 2023-12-05 01:55:31,661 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=525733.3333333334, ans=0.125 2023-12-05 01:55:42,256 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.85 vs. limit=10.0 2023-12-05 01:55:42,971 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=525800.0, ans=0.0 2023-12-05 01:55:42,993 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=525800.0, ans=0.125 2023-12-05 01:55:49,152 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0 2023-12-05 01:55:53,312 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.98 vs. 
limit=12.0 2023-12-05 01:56:01,436 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=525866.6666666666, ans=0.125 2023-12-05 01:56:06,774 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.60 vs. limit=12.0 2023-12-05 01:56:14,396 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.28 vs. limit=6.0 2023-12-05 01:56:31,446 INFO [train.py:1087] (1/4) Epoch 89, batch 150, loss[loss=0.1407, simple_loss=0.2326, pruned_loss=0.02441, over 24776.00 frames. ], tot_loss[loss=0.1453, simple_loss=0.2391, pruned_loss=0.02575, over 2563752.64 frames. ], batch size: 65, lr: 2.76e-03, grad_scale: 32.0 2023-12-05 01:56:39,432 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=6.00 vs. limit=15.0 2023-12-05 01:56:53,380 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=526133.3333333334, ans=0.2 2023-12-05 01:56:56,266 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.104e+02 1.268e+02 1.348e+02 1.503e+02 1.865e+02, threshold=2.695e+02, percent-clipped=0.0 2023-12-05 01:57:26,638 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=526333.3333333334, ans=0.07 2023-12-05 01:57:30,798 INFO [train.py:1087] (1/4) Epoch 89, batch 200, loss[loss=0.1415, simple_loss=0.236, pruned_loss=0.02356, over 24793.00 frames. ], tot_loss[loss=0.1454, simple_loss=0.2392, pruned_loss=0.02581, over 3055994.88 frames. ], batch size: 62, lr: 2.76e-03, grad_scale: 64.0 2023-12-05 01:57:31,000 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=526400.0, ans=0.125 2023-12-05 01:57:33,383 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-05 01:57:53,157 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=526466.6666666666, ans=0.125 2023-12-05 01:58:07,208 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.26 vs. limit=15.0 2023-12-05 01:58:17,693 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=526666.6666666666, ans=0.0 2023-12-05 01:58:23,492 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=526666.6666666666, ans=0.09899494936611666 2023-12-05 01:58:31,078 INFO [train.py:1087] (1/4) Epoch 89, batch 250, loss[loss=0.1438, simple_loss=0.2357, pruned_loss=0.02598, over 24548.00 frames. ], tot_loss[loss=0.1451, simple_loss=0.2388, pruned_loss=0.02572, over 3448143.20 frames. ], batch size: 62, lr: 2.76e-03, grad_scale: 64.0 2023-12-05 01:58:33,654 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=526733.3333333334, ans=0.125 2023-12-05 01:58:34,249 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.65 vs. 
limit=15.0 2023-12-05 01:58:34,768 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=526733.3333333334, ans=0.125 2023-12-05 01:58:44,855 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=526800.0, ans=0.0 2023-12-05 01:58:47,071 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=526800.0, ans=0.125 2023-12-05 01:58:56,432 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.092e+02 1.263e+02 1.358e+02 1.436e+02 1.753e+02, threshold=2.715e+02, percent-clipped=0.0 2023-12-05 01:59:03,775 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=8.62 vs. limit=15.0 2023-12-05 01:59:04,897 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.39 vs. limit=10.0 2023-12-05 01:59:30,778 INFO [train.py:1087] (1/4) Epoch 89, batch 300, loss[loss=0.1446, simple_loss=0.2407, pruned_loss=0.02425, over 24581.00 frames. ], tot_loss[loss=0.1456, simple_loss=0.2394, pruned_loss=0.02593, over 3753004.47 frames. ], batch size: 64, lr: 2.76e-03, grad_scale: 64.0 2023-12-05 01:59:44,931 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-12-05 01:59:54,033 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=527200.0, ans=0.0 2023-12-05 02:00:00,274 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=22.5 2023-12-05 02:00:20,749 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=527333.3333333334, ans=0.2 2023-12-05 02:00:29,560 INFO [train.py:1087] (1/4) Epoch 89, batch 350, loss[loss=0.1469, simple_loss=0.2395, pruned_loss=0.02709, over 24305.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2397, pruned_loss=0.0261, over 3993828.98 frames. ], batch size: 79, lr: 2.76e-03, grad_scale: 64.0 2023-12-05 02:00:45,452 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=527466.6666666666, ans=0.1 2023-12-05 02:00:52,063 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=527466.6666666666, ans=0.125 2023-12-05 02:00:55,288 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.138e+02 1.292e+02 1.378e+02 1.472e+02 1.847e+02, threshold=2.756e+02, percent-clipped=0.0 2023-12-05 02:01:11,320 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=527600.0, ans=0.125 2023-12-05 02:01:23,962 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.49 vs. limit=22.5 2023-12-05 02:01:30,214 INFO [train.py:1087] (1/4) Epoch 89, batch 400, loss[loss=0.1488, simple_loss=0.2459, pruned_loss=0.02583, over 24720.00 frames. ], tot_loss[loss=0.1459, simple_loss=0.2397, pruned_loss=0.02611, over 4183615.28 frames. 
], batch size: 69, lr: 2.76e-03, grad_scale: 64.0 2023-12-05 02:01:34,070 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=527733.3333333334, ans=0.0 2023-12-05 02:02:15,906 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.64 vs. limit=12.0 2023-12-05 02:02:18,036 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-05 02:02:21,099 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.57 vs. limit=15.0 2023-12-05 02:02:26,354 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=528000.0, ans=0.2 2023-12-05 02:02:29,849 INFO [train.py:1087] (1/4) Epoch 89, batch 450, loss[loss=0.1385, simple_loss=0.2363, pruned_loss=0.02035, over 24748.00 frames. ], tot_loss[loss=0.1454, simple_loss=0.2392, pruned_loss=0.02584, over 4340541.55 frames. ], batch size: 70, lr: 2.75e-03, grad_scale: 64.0 2023-12-05 02:02:30,141 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=528066.6666666666, ans=0.125 2023-12-05 02:02:32,416 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=528066.6666666666, ans=0.125 2023-12-05 02:02:54,096 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.114e+02 1.292e+02 1.369e+02 1.510e+02 2.020e+02, threshold=2.739e+02, percent-clipped=0.0 2023-12-05 02:02:54,934 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=528200.0, ans=0.2 2023-12-05 02:03:00,248 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=528200.0, ans=0.125 2023-12-05 02:03:10,972 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=528266.6666666666, ans=0.125 2023-12-05 02:03:28,776 INFO [train.py:1087] (1/4) Epoch 89, batch 500, loss[loss=0.1481, simple_loss=0.2438, pruned_loss=0.02618, over 24795.00 frames. ], tot_loss[loss=0.1456, simple_loss=0.2393, pruned_loss=0.02595, over 4441848.07 frames. ], batch size: 72, lr: 2.75e-03, grad_scale: 32.0 2023-12-05 02:03:32,631 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=528400.0, ans=0.125 2023-12-05 02:03:39,935 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=528466.6666666666, ans=10.0 2023-12-05 02:03:50,186 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=528466.6666666666, ans=0.125 2023-12-05 02:04:28,315 INFO [train.py:1087] (1/4) Epoch 89, batch 550, loss[loss=0.1535, simple_loss=0.2435, pruned_loss=0.03175, over 24541.00 frames. ], tot_loss[loss=0.1456, simple_loss=0.2392, pruned_loss=0.02597, over 4535413.79 frames. 
], batch size: 62, lr: 2.75e-03, grad_scale: 32.0 2023-12-05 02:04:28,540 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=528733.3333333334, ans=0.125 2023-12-05 02:04:37,049 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=528733.3333333334, ans=0.0 2023-12-05 02:04:54,039 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.094e+02 1.299e+02 1.416e+02 1.516e+02 2.054e+02, threshold=2.831e+02, percent-clipped=0.0 2023-12-05 02:05:08,425 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=528933.3333333334, ans=10.0 2023-12-05 02:05:09,584 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=528933.3333333334, ans=0.125 2023-12-05 02:05:27,936 INFO [train.py:1087] (1/4) Epoch 89, batch 600, loss[loss=0.1424, simple_loss=0.2346, pruned_loss=0.02507, over 24714.00 frames. ], tot_loss[loss=0.1455, simple_loss=0.2393, pruned_loss=0.0259, over 4609394.43 frames. ], batch size: 67, lr: 2.75e-03, grad_scale: 32.0 2023-12-05 02:05:54,349 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=529200.0, ans=0.1 2023-12-05 02:06:10,669 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=529266.6666666666, ans=0.125 2023-12-05 02:06:24,511 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=529333.3333333334, ans=0.0 2023-12-05 02:06:27,688 INFO [train.py:1087] (1/4) Epoch 89, batch 650, loss[loss=0.1405, simple_loss=0.2331, pruned_loss=0.02398, over 24775.00 frames. ], tot_loss[loss=0.1453, simple_loss=0.2392, pruned_loss=0.02573, over 4671632.63 frames. ], batch size: 70, lr: 2.75e-03, grad_scale: 16.0 2023-12-05 02:06:55,363 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.122e+02 1.271e+02 1.358e+02 1.474e+02 2.858e+02, threshold=2.716e+02, percent-clipped=1.0 2023-12-05 02:07:05,980 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=529600.0, ans=0.0 2023-12-05 02:07:16,560 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=529666.6666666666, ans=0.5 2023-12-05 02:07:16,732 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=529666.6666666666, ans=0.0 2023-12-05 02:07:27,032 INFO [train.py:1087] (1/4) Epoch 89, batch 700, loss[loss=0.1414, simple_loss=0.2346, pruned_loss=0.02406, over 24565.00 frames. ], tot_loss[loss=0.1459, simple_loss=0.2397, pruned_loss=0.02604, over 4676180.01 frames. ], batch size: 62, lr: 2.75e-03, grad_scale: 16.0 2023-12-05 02:07:53,564 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.22 vs. 
limit=15.0 2023-12-05 02:07:56,722 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=529866.6666666666, ans=0.125 2023-12-05 02:08:15,293 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=530000.0, ans=0.125 2023-12-05 02:08:26,374 INFO [train.py:1087] (1/4) Epoch 89, batch 750, loss[loss=0.15, simple_loss=0.2407, pruned_loss=0.0296, over 24518.00 frames. ], tot_loss[loss=0.1458, simple_loss=0.2395, pruned_loss=0.02603, over 4696866.08 frames. ], batch size: 75, lr: 2.75e-03, grad_scale: 16.0 2023-12-05 02:08:47,523 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.58 vs. limit=12.0 2023-12-05 02:08:52,912 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.095e+02 1.289e+02 1.385e+02 1.489e+02 1.926e+02, threshold=2.771e+02, percent-clipped=0.0 2023-12-05 02:09:04,164 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=530266.6666666666, ans=0.125 2023-12-05 02:09:07,620 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=530266.6666666666, ans=0.125 2023-12-05 02:09:07,818 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=3.70 vs. limit=10.0 2023-12-05 02:09:12,148 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=530333.3333333334, ans=0.1 2023-12-05 02:09:20,157 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=530333.3333333334, ans=0.07 2023-12-05 02:09:25,214 INFO [train.py:1087] (1/4) Epoch 89, batch 800, loss[loss=0.1537, simple_loss=0.241, pruned_loss=0.03322, over 24144.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2395, pruned_loss=0.02623, over 4727716.04 frames. ], batch size: 58, lr: 2.75e-03, grad_scale: 32.0 2023-12-05 02:09:54,801 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-12-05 02:09:58,085 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=530600.0, ans=0.2 2023-12-05 02:10:11,113 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=530666.6666666666, ans=0.1 2023-12-05 02:10:17,473 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=530666.6666666666, ans=0.1 2023-12-05 02:10:17,526 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=530666.6666666666, ans=0.0 2023-12-05 02:10:18,523 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=530733.3333333334, ans=0.125 2023-12-05 02:10:18,843 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.67 vs. limit=15.0 2023-12-05 02:10:19,325 INFO [train.py:1087] (1/4) Epoch 89, batch 850, loss[loss=0.1485, simple_loss=0.2399, pruned_loss=0.0285, over 24534.00 frames. 
], tot_loss[loss=0.1459, simple_loss=0.2396, pruned_loss=0.0261, over 4743374.18 frames. ], batch size: 75, lr: 2.75e-03, grad_scale: 32.0 2023-12-05 02:10:23,720 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=530733.3333333334, ans=0.1 2023-12-05 02:10:34,919 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.35 vs. limit=10.0 2023-12-05 02:10:37,220 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-12-05 02:10:37,760 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=530800.0, ans=0.125 2023-12-05 02:10:37,871 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=530800.0, ans=0.0 2023-12-05 02:10:44,187 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.126e+02 1.313e+02 1.399e+02 1.494e+02 1.955e+02, threshold=2.797e+02, percent-clipped=0.0 2023-12-05 02:10:44,261 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=530866.6666666666, ans=0.125 2023-12-05 02:10:54,342 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.07 vs. limit=15.0 2023-12-05 02:11:05,529 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=531000.0, ans=0.1 2023-12-05 02:11:17,970 INFO [train.py:1087] (1/4) Epoch 90, batch 0, loss[loss=0.1466, simple_loss=0.2407, pruned_loss=0.02619, over 24556.00 frames. ], tot_loss[loss=0.1466, simple_loss=0.2407, pruned_loss=0.02619, over 24556.00 frames. ], batch size: 63, lr: 2.73e-03, grad_scale: 32.0 2023-12-05 02:11:17,971 INFO [train.py:1110] (1/4) Computing validation loss 2023-12-05 02:11:31,297 INFO [train.py:1119] (1/4) Epoch 90, validation: loss=0.1508, simple_loss=0.2458, pruned_loss=0.02792, over 944034.00 frames. 2023-12-05 02:11:31,298 INFO [train.py:1120] (1/4) Maximum memory allocated so far is 16610MB 2023-12-05 02:11:39,629 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=531033.3333333334, ans=0.1 2023-12-05 02:11:54,122 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=531166.6666666666, ans=0.2 2023-12-05 02:12:10,281 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=531233.3333333334, ans=0.125 2023-12-05 02:12:11,304 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=531233.3333333334, ans=0.1 2023-12-05 02:12:31,028 INFO [train.py:1087] (1/4) Epoch 90, batch 50, loss[loss=0.1347, simple_loss=0.2274, pruned_loss=0.021, over 24600.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.24, pruned_loss=0.02613, over 1073646.70 frames. 
], batch size: 68, lr: 2.73e-03, grad_scale: 32.0 2023-12-05 02:12:53,710 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=531500.0, ans=0.125 2023-12-05 02:12:56,357 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=531500.0, ans=0.125 2023-12-05 02:13:03,787 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.134e+02 1.265e+02 1.348e+02 1.429e+02 2.050e+02, threshold=2.696e+02, percent-clipped=0.0 2023-12-05 02:13:14,734 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=531566.6666666666, ans=0.0 2023-12-05 02:13:24,804 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=531633.3333333334, ans=0.125 2023-12-05 02:13:29,636 INFO [train.py:1087] (1/4) Epoch 90, batch 100, loss[loss=0.1485, simple_loss=0.2448, pruned_loss=0.0261, over 24745.00 frames. ], tot_loss[loss=0.1458, simple_loss=0.2396, pruned_loss=0.02601, over 1921152.57 frames. ], batch size: 63, lr: 2.73e-03, grad_scale: 32.0 2023-12-05 02:13:32,557 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=531700.0, ans=0.2 2023-12-05 02:13:32,672 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=531700.0, ans=0.0 2023-12-05 02:13:44,744 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.83 vs. limit=15.0 2023-12-05 02:13:51,466 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.11 vs. limit=22.5 2023-12-05 02:13:54,307 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=531833.3333333334, ans=0.125 2023-12-05 02:13:55,934 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=8.71 vs. limit=12.0 2023-12-05 02:14:09,156 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=531900.0, ans=0.1 2023-12-05 02:14:28,650 INFO [train.py:1087] (1/4) Epoch 90, batch 150, loss[loss=0.1448, simple_loss=0.2394, pruned_loss=0.02515, over 24557.00 frames. ], tot_loss[loss=0.1457, simple_loss=0.2392, pruned_loss=0.0261, over 2557164.57 frames. ], batch size: 66, lr: 2.73e-03, grad_scale: 32.0 2023-12-05 02:14:32,166 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=532033.3333333334, ans=0.0 2023-12-05 02:14:35,787 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=532033.3333333334, ans=0.0 2023-12-05 02:14:48,740 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.44 vs. limit=22.5 2023-12-05 02:14:53,103 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=532166.6666666666, ans=0.125 2023-12-05 02:14:54,525 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.03 vs. 
limit=15.0 2023-12-05 02:14:56,680 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=532166.6666666666, ans=0.0 2023-12-05 02:14:57,655 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=532166.6666666666, ans=0.125 2023-12-05 02:14:59,980 INFO [scaling.py:1118] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-12-05 02:15:01,843 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.135e+02 1.270e+02 1.373e+02 1.508e+02 1.981e+02, threshold=2.747e+02, percent-clipped=0.0 2023-12-05 02:15:07,742 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=532233.3333333334, ans=0.125 2023-12-05 02:15:14,242 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=532233.3333333334, ans=0.1 2023-12-05 02:15:23,959 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=532300.0, ans=0.1 2023-12-05 02:15:25,082 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=532300.0, ans=0.125 2023-12-05 02:15:28,238 INFO [train.py:1087] (1/4) Epoch 90, batch 200, loss[loss=0.1393, simple_loss=0.2351, pruned_loss=0.02179, over 24787.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.2396, pruned_loss=0.02634, over 3049083.31 frames. ], batch size: 71, lr: 2.73e-03, grad_scale: 32.0 2023-12-05 02:15:32,643 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.39 vs. limit=15.0 2023-12-05 02:16:07,151 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=532566.6666666666, ans=0.125 2023-12-05 02:16:13,406 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.73 vs. limit=6.0 2023-12-05 02:16:28,519 INFO [train.py:1087] (1/4) Epoch 90, batch 250, loss[loss=0.1483, simple_loss=0.2413, pruned_loss=0.02764, over 24765.00 frames. ], tot_loss[loss=0.1468, simple_loss=0.2402, pruned_loss=0.02668, over 3419709.75 frames. 
], batch size: 64, lr: 2.73e-03, grad_scale: 32.0 2023-12-05 02:16:31,150 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=532700.0, ans=0.125 2023-12-05 02:16:42,822 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=532766.6666666666, ans=0.1 2023-12-05 02:16:45,001 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=532766.6666666666, ans=0.125 2023-12-05 02:16:45,062 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=532766.6666666666, ans=0.125 2023-12-05 02:16:45,102 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=532766.6666666666, ans=0.125 2023-12-05 02:16:46,238 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=532766.6666666666, ans=0.05 2023-12-05 02:16:47,234 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=532766.6666666666, ans=0.0 2023-12-05 02:16:55,316 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=532833.3333333334, ans=0.0 2023-12-05 02:17:01,773 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.117e+02 1.285e+02 1.409e+02 1.536e+02 2.196e+02, threshold=2.819e+02, percent-clipped=0.0 2023-12-05 02:17:27,312 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=533033.3333333334, ans=0.125 2023-12-05 02:17:28,265 INFO [train.py:1087] (1/4) Epoch 90, batch 300, loss[loss=0.1393, simple_loss=0.2325, pruned_loss=0.02304, over 24809.00 frames. ], tot_loss[loss=0.1467, simple_loss=0.24, pruned_loss=0.02669, over 3714097.87 frames. ], batch size: 71, lr: 2.73e-03, grad_scale: 32.0 2023-12-05 02:17:35,353 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=533033.3333333334, ans=0.0 2023-12-05 02:17:46,922 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=533100.0, ans=0.125 2023-12-05 02:17:53,153 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.31 vs. limit=22.5 2023-12-05 02:18:17,227 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=533300.0, ans=0.125 2023-12-05 02:18:25,402 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.43 vs. limit=15.0 2023-12-05 02:18:29,396 INFO [train.py:1087] (1/4) Epoch 90, batch 350, loss[loss=0.1474, simple_loss=0.245, pruned_loss=0.02484, over 24800.00 frames. ], tot_loss[loss=0.1465, simple_loss=0.2399, pruned_loss=0.02651, over 3952229.68 frames. 
], batch size: 62, lr: 2.73e-03, grad_scale: 32.0 2023-12-05 02:18:38,412 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=533366.6666666666, ans=0.0 2023-12-05 02:18:59,458 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=533500.0, ans=0.125 2023-12-05 02:19:00,434 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=533500.0, ans=0.05 2023-12-05 02:19:02,404 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.080e+02 1.245e+02 1.317e+02 1.420e+02 1.727e+02, threshold=2.633e+02, percent-clipped=0.0 2023-12-05 02:19:06,580 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=533566.6666666666, ans=0.125 2023-12-05 02:19:28,709 INFO [train.py:1087] (1/4) Epoch 90, batch 400, loss[loss=0.1496, simple_loss=0.2438, pruned_loss=0.02764, over 24711.00 frames. ], tot_loss[loss=0.1467, simple_loss=0.24, pruned_loss=0.02669, over 4140925.55 frames. ], batch size: 74, lr: 2.72e-03, grad_scale: 32.0 2023-12-05 02:19:28,926 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=533700.0, ans=0.0 2023-12-05 02:19:31,124 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=533700.0, ans=0.125 2023-12-05 02:19:45,950 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.30 vs. limit=15.0 2023-12-05 02:19:52,987 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=533833.3333333334, ans=0.07 2023-12-05 02:20:00,944 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=533833.3333333334, ans=0.0 2023-12-05 02:20:06,692 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=533900.0, ans=0.125 2023-12-05 02:20:10,353 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=533900.0, ans=0.125 2023-12-05 02:20:25,021 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=533966.6666666666, ans=0.1 2023-12-05 02:20:28,283 INFO [train.py:1087] (1/4) Epoch 90, batch 450, loss[loss=0.1369, simple_loss=0.2287, pruned_loss=0.02255, over 24575.00 frames. ], tot_loss[loss=0.1464, simple_loss=0.2396, pruned_loss=0.0266, over 4287896.48 frames. 
], batch size: 65, lr: 2.72e-03, grad_scale: 32.0 2023-12-05 02:20:41,100 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=534100.0, ans=0.0 2023-12-05 02:20:46,003 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=534100.0, ans=0.125 2023-12-05 02:20:53,865 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=534166.6666666666, ans=0.125 2023-12-05 02:20:56,249 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.04 vs. limit=15.0 2023-12-05 02:21:01,374 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.157e+02 1.287e+02 1.360e+02 1.496e+02 1.959e+02, threshold=2.719e+02, percent-clipped=0.0 2023-12-05 02:21:27,493 INFO [train.py:1087] (1/4) Epoch 90, batch 500, loss[loss=0.1591, simple_loss=0.2517, pruned_loss=0.03331, over 24466.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.2394, pruned_loss=0.02638, over 4412990.95 frames. ], batch size: 75, lr: 2.72e-03, grad_scale: 32.0 2023-12-05 02:21:42,911 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2023-12-05 02:21:46,992 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=534433.3333333334, ans=0.125 2023-12-05 02:21:57,502 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=534500.0, ans=0.125 2023-12-05 02:22:18,236 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.33 vs. limit=15.0 2023-12-05 02:22:25,774 INFO [train.py:1087] (1/4) Epoch 90, batch 550, loss[loss=0.1445, simple_loss=0.2343, pruned_loss=0.02738, over 24475.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2394, pruned_loss=0.02635, over 4509962.34 frames. ], batch size: 75, lr: 2.72e-03, grad_scale: 32.0 2023-12-05 02:22:32,946 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.47 vs. limit=15.0 2023-12-05 02:22:38,682 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=22.5 2023-12-05 02:22:40,735 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.52 vs. limit=15.0 2023-12-05 02:22:46,132 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=534766.6666666666, ans=0.125 2023-12-05 02:22:52,113 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.98 vs. limit=10.0 2023-12-05 02:22:58,187 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.130e+02 1.275e+02 1.357e+02 1.447e+02 1.780e+02, threshold=2.714e+02, percent-clipped=0.0 2023-12-05 02:23:24,891 INFO [train.py:1087] (1/4) Epoch 90, batch 600, loss[loss=0.136, simple_loss=0.2329, pruned_loss=0.01957, over 24776.00 frames. 
], tot_loss[loss=0.1457, simple_loss=0.239, pruned_loss=0.02618, over 4573062.23 frames. ], batch size: 65, lr: 2.72e-03, grad_scale: 32.0 2023-12-05 02:23:27,498 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=535033.3333333334, ans=0.0 2023-12-05 02:23:31,965 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=535033.3333333334, ans=0.0 2023-12-05 02:23:36,052 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=535100.0, ans=0.125 2023-12-05 02:23:37,022 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=535100.0, ans=0.125 2023-12-05 02:23:42,121 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=535100.0, ans=0.125 2023-12-05 02:23:46,553 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=535100.0, ans=0.2 2023-12-05 02:24:07,646 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=535233.3333333334, ans=0.1 2023-12-05 02:24:21,371 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=535300.0, ans=0.1 2023-12-05 02:24:23,410 INFO [train.py:1087] (1/4) Epoch 90, batch 650, loss[loss=0.1392, simple_loss=0.2356, pruned_loss=0.02137, over 24714.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2394, pruned_loss=0.02628, over 4618331.67 frames. ], batch size: 69, lr: 2.72e-03, grad_scale: 32.0 2023-12-05 02:24:30,358 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=535366.6666666666, ans=0.125 2023-12-05 02:24:55,911 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.117e+02 1.319e+02 1.391e+02 1.531e+02 2.026e+02, threshold=2.782e+02, percent-clipped=0.0 2023-12-05 02:25:02,397 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.82 vs. limit=15.0 2023-12-05 02:25:16,538 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=535633.3333333334, ans=0.0 2023-12-05 02:25:21,073 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=535700.0, ans=0.125 2023-12-05 02:25:21,892 INFO [train.py:1087] (1/4) Epoch 90, batch 700, loss[loss=0.1603, simple_loss=0.2489, pruned_loss=0.03584, over 24449.00 frames. ], tot_loss[loss=0.1459, simple_loss=0.2395, pruned_loss=0.0262, over 4644598.51 frames. ], batch size: 75, lr: 2.72e-03, grad_scale: 32.0 2023-12-05 02:25:30,649 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=535700.0, ans=0.1 2023-12-05 02:25:37,747 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.91 vs. 
limit=15.0 2023-12-05 02:25:39,582 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=535766.6666666666, ans=0.125 2023-12-05 02:26:05,637 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=535900.0, ans=0.125 2023-12-05 02:26:10,099 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=535966.6666666666, ans=0.125 2023-12-05 02:26:15,897 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=535966.6666666666, ans=0.125 2023-12-05 02:26:21,116 INFO [train.py:1087] (1/4) Epoch 90, batch 750, loss[loss=0.1438, simple_loss=0.2427, pruned_loss=0.02247, over 24755.00 frames. ], tot_loss[loss=0.1461, simple_loss=0.2396, pruned_loss=0.0263, over 4687465.81 frames. ], batch size: 66, lr: 2.72e-03, grad_scale: 32.0 2023-12-05 02:26:30,387 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=536033.3333333334, ans=0.125 2023-12-05 02:26:31,647 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=536100.0, ans=0.125 2023-12-05 02:26:35,398 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=536100.0, ans=0.125 2023-12-05 02:26:53,938 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.120e+02 1.283e+02 1.414e+02 1.604e+02 3.255e+02, threshold=2.828e+02, percent-clipped=0.0 2023-12-05 02:26:54,158 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=536166.6666666666, ans=0.0 2023-12-05 02:27:01,101 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=536233.3333333334, ans=15.0 2023-12-05 02:27:18,380 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=536366.6666666666, ans=0.125 2023-12-05 02:27:18,510 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=536366.6666666666, ans=0.125 2023-12-05 02:27:19,322 INFO [train.py:1087] (1/4) Epoch 90, batch 800, loss[loss=0.1542, simple_loss=0.2434, pruned_loss=0.0325, over 24310.00 frames. ], tot_loss[loss=0.146, simple_loss=0.2394, pruned_loss=0.02631, over 4704541.00 frames. ], batch size: 79, lr: 2.72e-03, grad_scale: 32.0 2023-12-05 02:27:48,805 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=536500.0, ans=0.125 2023-12-05 02:27:49,925 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=536500.0, ans=0.0 2023-12-05 02:27:51,407 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.99 vs. limit=15.0 2023-12-05 02:28:10,186 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten.whitening_limit, batch_count=536633.3333333334, ans=15.0 2023-12-05 02:28:14,287 INFO [train.py:1087] (1/4) Epoch 90, batch 850, loss[loss=0.1453, simple_loss=0.2408, pruned_loss=0.02485, over 24550.00 frames. 
], tot_loss[loss=0.1459, simple_loss=0.2393, pruned_loss=0.02625, over 4730734.19 frames. ], batch size: 66, lr: 2.72e-03, grad_scale: 32.0 2023-12-05 02:28:15,538 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=536700.0, ans=0.2 2023-12-05 02:28:20,077 INFO [scaling.py:1022] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=4.82 vs. limit=15.0 2023-12-05 02:28:40,706 INFO [scaling.py:213] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=536833.3333333334, ans=0.04949747468305833 2023-12-05 02:28:44,746 INFO [optim.py:468] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.119e+02 1.273e+02 1.345e+02 1.462e+02 2.000e+02, threshold=2.691e+02, percent-clipped=1.0 2023-12-05 02:29:01,588 INFO [train.py:1352] (1/4) Done!