diff --git "a/wandb/run-20220503_172048-zotxt8wa/files/output.log" "b/wandb/run-20220503_172048-zotxt8wa/files/output.log" --- "a/wandb/run-20220503_172048-zotxt8wa/files/output.log" +++ "b/wandb/run-20220503_172048-zotxt8wa/files/output.log" @@ -93796,5 +93796,10484 @@ To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... 
+To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.4237, 'learning_rate': 0.0002813167554103239, 'epoch': 0.69} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|███████████████▋ | 4501/19440 [13:29:22<4731:47:29, 1140.27s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.2378, 'learning_rate': 0.00028129793320000605, 'epoch': 0.69} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|███████████████▉ | 4502/19440 [13:29:27<3317:42:11, 799.55s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.2826, 'learning_rate': 0.00028127911098968825, 'epoch': 0.69} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|███████████████▉ | 4503/19440 [13:29:31<2327:39:41, 560.99s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.1289, 'learning_rate': 0.0002812602887793704, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|███████████████▉ | 4504/19440 [13:29:36<1634:26:10, 393.95s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.919, 'learning_rate': 0.0002812414665690526, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|███████████████▉ | 4505/19440 [13:29:40<1149:11:51, 277.01s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.0, 'learning_rate': 0.00028122264435873474, 'epoch': 0.7} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▏ | 4506/19440 [13:29:44<809:30:21, 195.14s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▏ | 4507/19440 [13:29:48<571:45:03, 137.84s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.8547, 'learning_rate': 0.0002812038221484169, 'epoch': 0.7} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.8928, 'learning_rate': 0.00028118499993809903, 'epoch': 0.7} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4508/19440 [13:29:52<405:08:06, 97.68s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.7831, 'learning_rate': 0.00028116617772778123, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4509/19440 [13:29:56<288:31:28, 69.57s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.7818, 'learning_rate': 0.00028114735551746343, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4510/19440 [13:30:00<206:57:36, 49.90s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.8456, 'learning_rate': 0.0002811285333071456, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4511/19440 [13:30:04<149:45:21, 36.11s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.7198, 'learning_rate': 0.0002811097110968278, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4512/19440 [13:30:08<109:41:31, 26.45s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▋ | 4513/19440 [13:30:12<82:19:49, 19.86s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.6283, 'learning_rate': 0.0002810908888865099, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.751, 'learning_rate': 0.00028107206667619207, 'epoch': 0.7} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▋ | 4514/19440 [13:30:16<62:25:51, 15.06s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5684, 'learning_rate': 0.0002810532444658742, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▋ | 4515/19440 [13:30:20<48:24:51, 11.68s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4482, 'learning_rate': 0.0002810344222555564, 'epoch': 0.7} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▋ | 4516/19440 [13:30:24<38:33:13, 9.30s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.6299, 'learning_rate': 0.00028101560004523856, 'epoch': 0.7} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▋ | 4517/19440 [13:30:27<31:35:23, 7.62s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5827, 'learning_rate': 0.00028099677783492076, 'epoch': 0.7} + 23%|████████████████▋ | 4518/19440 [13:30:31<26:39:44, 6.43s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▋ | 4519/19440 [13:30:35<23:05:03, 5.57s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5993, 'learning_rate': 0.00028097795562460296, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5308, 'learning_rate': 0.0002809591334142851, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▋ | 4520/19440 [13:30:38<20:33:28, 4.96s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6817, 'learning_rate': 0.00028094031120396725, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▋ | 4521/19440 [13:30:42<18:49:41, 4.54s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5082, 'learning_rate': 0.0002809214889936494, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▋ | 4522/19440 [13:30:45<17:35:39, 4.25s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4523/19440 [13:30:49<16:36:10, 4.01s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4242, 'learning_rate': 0.0002809026667833316, 'epoch': 0.7} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.125, 'learning_rate': 0.00028088384457301374, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4524/19440 [13:30:52<15:56:19, 3.85s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.2727, 'learning_rate': 0.00028086502236269594, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4525/19440 [13:30:56<15:57:44, 3.85s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6254, 'learning_rate': 0.0002808462001523781, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4526/19440 [13:30:59<15:28:40, 3.74s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4527/19440 [13:31:03<15:09:32, 3.66s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.408, 'learning_rate': 0.0002808273779420603, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.3624, 'learning_rate': 0.00028080855573174243, 'epoch': 0.7} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4528/19440 [13:31:06<14:43:24, 3.55s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4529/19440 [13:31:09<14:12:58, 3.43s/it] + 23%|████████████████▊ | 4529/19440 [13:31:09<14:12:58, 3.43s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.2004, 'learning_rate': 0.0002807709113111068, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4530/19440 [13:31:12<13:38:15, 3.29s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4531/19440 [13:31:15<13:14:02, 3.20s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.3322, 'learning_rate': 0.0002807520891007889, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.9912, 'learning_rate': 0.0002807332668904711, 'epoch': 0.7} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4532/19440 [13:31:18<12:55:11, 3.12s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4533/19440 [13:31:21<12:38:25, 3.05s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.244, 'learning_rate': 0.00028071444468015326, 'epoch': 0.7} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.1275, 'learning_rate': 0.0002806956224698354, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4534/19440 [13:31:24<12:26:54, 3.01s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4535/19440 [13:31:27<12:17:09, 2.97s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.9826, 'learning_rate': 0.0002806768002595176, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.9892, 'learning_rate': 0.00028065797804919975, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4536/19440 [13:31:30<12:10:03, 2.94s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.148, 'learning_rate': 0.00028063915583888195, 'epoch': 0.7} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4537/19440 [13:31:33<12:01:08, 2.90s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +{'loss': 6.0443, 'learning_rate': 0.0002806203336285641, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4538/19440 [13:31:36<12:27:48, 3.01s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.6847, 'learning_rate': 0.0002806015114182463, 'epoch': 0.7} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4539/19440 [13:31:39<12:14:52, 2.96s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4540/19440 [13:31:41<11:57:55, 2.89s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 5.6434, 'learning_rate': 0.00028058268920792844, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.7759, 'learning_rate': 0.0002805638669976106, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4541/19440 [13:31:44<11:44:27, 2.84s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.8346, 'learning_rate': 0.00028054504478729274, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4542/19440 [13:31:47<11:32:42, 2.79s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4543/19440 [13:31:49<11:19:09, 2.74s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.7956, 'learning_rate': 0.00028052622257697493, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.5726, 'learning_rate': 0.00028050740036665713, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4544/19440 [13:31:52<11:08:27, 2.69s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.2464, 'learning_rate': 0.0002804885781563393, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4545/19440 [13:31:55<10:58:08, 2.65s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4546/19440 [13:31:57<10:50:35, 2.62s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.0498, 'learning_rate': 0.0002804697559460215, 'epoch': 0.7} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4547/19440 [13:32:00<10:37:43, 2.57s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.2867, 'learning_rate': 0.0002804509337357036, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 5.1079, 'learning_rate': 0.00028043211152538577, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4548/19440 [13:32:02<10:23:53, 2.51s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.8891, 'learning_rate': 0.0002804132893150679, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4549/19440 [13:32:04<10:14:40, 2.48s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.8687, 'learning_rate': 0.0002803944671047501, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4550/19440 [13:32:07<10:25:50, 2.52s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.3228, 'learning_rate': 0.00028037564489443226, 'epoch': 0.7} + 23%|████████████████▊ | 4551/19440 [13:32:12<12:59:38, 3.14s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4552/19440 [13:32:16<14:16:59, 3.45s/it] + 23%|████████████████▊ | 4552/19440 [13:32:16<14:16:59, 3.45s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4553/19440 [13:32:20<14:59:36, 3.63s/it] + 23%|████████████████▊ | 4553/19440 [13:32:20<14:59:36, 3.63s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4554/19440 [13:32:24<15:23:15, 3.72s/it] + 23%|████████████████▊ | 4554/19440 [13:32:24<15:23:15, 3.72s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4555/19440 [13:32:28<15:30:29, 3.75s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.1995, 'learning_rate': 0.00028030035605316095, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▊ | 4556/19440 [13:32:31<15:37:36, 3.78s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.9675, 'learning_rate': 0.0002802815338428431, 'epoch': 0.7} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▉ | 4557/19440 [13:32:35<15:40:17, 3.79s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.7945, 'learning_rate': 0.0002802627116325253, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.7677, 'learning_rate': 0.00028024388942220744, 'epoch': 0.7} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▉ | 4558/19440 [13:32:39<15:30:42, 3.75s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.7621, 'learning_rate': 0.00028022506721188964, 'epoch': 0.7} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▉ | 4559/19440 [13:32:43<15:22:46, 3.72s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.9578, 'learning_rate': 0.0002802062450015718, 'epoch': 0.7} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▉ | 4560/19440 [13:32:46<15:13:35, 3.68s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▉ | 4561/19440 [13:32:50<15:01:33, 3.64s/it] + 23%|████████████████▉ | 4561/19440 [13:32:50<15:01:33, 3.64s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.713, 'learning_rate': 0.00028016860058093613, 'epoch': 0.7} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▉ | 4562/19440 [13:32:53<14:47:54, 3.58s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.8682, 'learning_rate': 0.0002801497783706183, 'epoch': 0.7} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▉ | 4563/19440 [13:32:57<15:06:21, 3.66s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5688, 'learning_rate': 0.0002801309561603005, 'epoch': 0.7} + 23%|████████████████▉ | 4564/19440 [13:33:00<14:53:01, 3.60s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▉ | 4565/19440 [13:33:04<14:42:38, 3.56s/it] + 23%|████████████████▉ | 4565/19440 [13:33:04<14:42:38, 3.56s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5789, 'learning_rate': 0.0002800933117396648, 'epoch': 0.7} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▉ | 4566/19440 [13:33:07<14:25:41, 3.49s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5343, 'learning_rate': 0.00028007448952934697, 'epoch': 0.7} + 23%|████████████████▉ | 4567/19440 [13:33:11<14:12:11, 3.44s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▉ | 4568/19440 [13:33:14<13:58:05, 3.38s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.3431, 'learning_rate': 0.0002800556673190291, 'epoch': 0.7} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.7636, 'learning_rate': 0.00028003684510871126, 'epoch': 0.71} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|████████████████▉ | 4569/19440 [13:33:17<13:48:16, 3.34s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5224, 'learning_rate': 0.00028001802289839346, 'epoch': 0.71} + 24%|████████████████▉ | 4570/19440 [13:33:20<13:48:53, 3.34s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3541, 'learning_rate': 0.00027999920068807566, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|████████████████▉ | 4571/19440 [13:33:24<13:38:22, 3.30s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5071, 'learning_rate': 0.0002799803784777578, 'epoch': 0.71} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|████████████████▉ | 4572/19440 [13:33:27<13:29:05, 3.27s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|████████████████▉ | 4573/19440 [13:33:30<13:20:24, 3.23s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.2892, 'learning_rate': 0.00027996155626744, 'epoch': 0.71} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.4813, 'learning_rate': 0.00027994273405712215, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|████████████████▉ | 4574/19440 [13:33:33<13:16:22, 3.21s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.4107, 'learning_rate': 0.0002799239118468043, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|████████████████▉ | 4575/19440 [13:33:37<13:43:16, 3.32s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|████████████████▉ | 4576/19440 [13:33:40<13:32:47, 3.28s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.237, 'learning_rate': 0.00027990508963648644, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1458, 'learning_rate': 0.00027988626742616864, 'epoch': 0.71} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|████████████████▉ | 4577/19440 [13:33:43<13:23:03, 3.24s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|████████████████▉ | 4578/19440 [13:33:46<13:07:05, 3.18s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.2139, 'learning_rate': 0.00027986744521585084, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.2677, 'learning_rate': 0.000279848623005533, 'epoch': 0.71} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|████████████████▉ | 4579/19440 [13:33:49<12:50:40, 3.11s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.101, 'learning_rate': 0.0002798298007952152, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|████████████████▉ | 4580/19440 [13:33:52<12:39:05, 3.07s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.0691, 'learning_rate': 0.0002798109785848973, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|████████████████▉ | 4581/19440 [13:33:55<12:30:02, 3.03s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.9382, 'learning_rate': 0.00027979215637457947, 'epoch': 0.71} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|████████████████▉ | 4582/19440 [13:33:58<12:20:46, 2.99s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|████████████████▉ | 4583/19440 [13:34:01<12:11:34, 2.95s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.06, 'learning_rate': 0.0002797733341642616, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.2101, 'learning_rate': 0.0002797545119539438, 'epoch': 0.71} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|████████████████▉ | 4584/19440 [13:34:03<12:02:06, 2.92s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|████████████████▉ | 4585/19440 [13:34:06<11:52:41, 2.88s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.7082, 'learning_rate': 0.00027973568974362596, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3537, 'learning_rate': 0.00027971686753330816, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|████████████████▉ | 4586/19440 [13:34:09<11:51:28, 2.87s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1404, 'learning_rate': 0.0002796980453229903, 'epoch': 0.71} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|████████████████▉ | 4587/19440 [13:34:12<11:44:01, 2.84s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.805, 'learning_rate': 0.00027967922311267245, 'epoch': 0.71} + 24%|████████████████▉ | 4588/19440 [13:34:15<12:08:59, 2.95s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.9558, 'learning_rate': 0.00027966040090235465, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|████████████████▉ | 4589/19440 [13:34:18<11:59:06, 2.91s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4590/19440 [13:34:21<11:43:17, 2.84s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.7895, 'learning_rate': 0.0002796415786920368, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.7649, 'learning_rate': 0.000279622756481719, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4591/19440 [13:34:23<11:28:37, 2.78s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4592/19440 [13:34:26<11:15:55, 2.73s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.6782, 'learning_rate': 0.00027960393427140114, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4593/19440 [13:34:28<11:07:24, 2.70s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.5654, 'learning_rate': 0.00027958511206108334, 'epoch': 0.71} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4594/19440 [13:34:31<10:57:19, 2.66s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 5.4962, 'learning_rate': 0.0002795662898507655, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.5134, 'learning_rate': 0.00027954746764044763, 'epoch': 0.71} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4595/19440 [13:34:34<10:43:59, 2.60s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4596/19440 [13:34:36<10:32:19, 2.56s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.0668, 'learning_rate': 0.00027952864543012983, 'epoch': 0.71} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4597/19440 [13:34:38<10:21:27, 2.51s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.0836, 'learning_rate': 0.000279509823219812, 'epoch': 0.71} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4598/19440 [13:34:41<10:14:11, 2.48s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.1468, 'learning_rate': 0.0002794910010094942, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4599/19440 [13:34:43<10:04:17, 2.44s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.8469, 'learning_rate': 0.0002794721787991763, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.495, 'learning_rate': 0.0002794533565888585, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4600/19440 [13:34:46<10:22:37, 2.52s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4601/19440 [13:34:51<13:04:23, 3.17s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.313, 'learning_rate': 0.00027943453437854067, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4602/19440 [13:34:55<14:20:08, 3.48s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.2561, 'learning_rate': 0.0002794157121682228, 'epoch': 0.71} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4603/19440 [13:34:59<15:02:45, 3.65s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.1925, 'learning_rate': 0.00027939688995790496, 'epoch': 0.71} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4604/19440 [13:35:03<15:26:07, 3.75s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.9173, 'learning_rate': 0.00027937806774758716, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4605/19440 [13:35:07<15:36:32, 3.79s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.1908, 'learning_rate': 0.00027935924553726936, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4606/19440 [13:35:10<15:36:53, 3.79s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.0544, 'learning_rate': 0.0002793404233269515, 'epoch': 0.71} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4607/19440 [13:35:14<15:46:20, 3.83s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.9934, 'learning_rate': 0.0002793216011166337, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4608/19440 [13:35:18<15:42:39, 3.81s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.7927, 'learning_rate': 0.00027930277890631585, 'epoch': 0.71} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.8547, 'learning_rate': 0.000279283956695998, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4609/19440 [13:35:22<15:34:03, 3.78s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4610/19440 [13:35:25<15:24:18, 3.74s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6559, 'learning_rate': 0.00027926513448568014, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4611/19440 [13:35:29<15:13:15, 3.70s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.8104, 'learning_rate': 0.00027924631227536234, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4612/19440 [13:35:33<14:59:37, 3.64s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.7559, 'learning_rate': 0.0002792274900650445, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4613/19440 [13:35:36<15:17:34, 3.71s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.716, 'learning_rate': 0.0002792086678547267, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4389, 'learning_rate': 0.0002791898456444088, 'epoch': 0.71} + 24%|█████████████████ | 4614/19440 [13:35:40<15:02:43, 3.65s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4615/19440 [13:35:43<14:44:48, 3.58s/it] + 24%|█████████████████ | 4615/19440 [13:35:43<14:44:48, 3.58s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4616/19440 [13:35:47<14:32:19, 3.53s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.936, 'learning_rate': 0.00027915220122377317, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6493, 'learning_rate': 0.0002791333790134553, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4617/19440 [13:35:50<14:25:52, 3.50s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4618/19440 [13:35:53<14:08:24, 3.43s/it] + 24%|█████████████████ | 4618/19440 [13:35:53<14:08:24, 3.43s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4619/19440 [13:35:57<13:53:40, 3.37s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4923, 'learning_rate': 0.00027909573459281966, 'epoch': 0.71} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6779, 'learning_rate': 0.00027907691238250186, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4620/19440 [13:36:00<13:38:52, 3.32s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4621/19440 [13:36:03<13:29:26, 3.28s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.4646, 'learning_rate': 0.000279058090172184, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4581, 'learning_rate': 0.00027903926796186615, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4622/19440 [13:36:06<13:20:48, 3.24s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████ | 4623/19440 [13:36:09<13:11:36, 3.21s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4344, 'learning_rate': 0.00027902044575154835, 'epoch': 0.71} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4624/19440 [13:36:12<13:01:13, 3.16s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.2495, 'learning_rate': 0.0002790016235412305, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.3407, 'learning_rate': 0.0002789828013309127, 'epoch': 0.71} + 24%|█████████████████▏ | 4625/19440 [13:36:16<13:23:41, 3.25s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4626/19440 [13:36:19<13:26:07, 3.26s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1946, 'learning_rate': 0.00027896397912059484, 'epoch': 0.71} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.2175, 'learning_rate': 0.00027894515691027704, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4627/19440 [13:36:22<13:09:22, 3.20s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4628/19440 [13:36:25<12:55:48, 3.14s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.3475, 'learning_rate': 0.0002789263346999592, 'epoch': 0.71} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.0805, 'learning_rate': 0.00027890751248964133, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4629/19440 [13:36:28<12:42:07, 3.09s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4630/19440 [13:36:31<12:34:13, 3.06s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.1501, 'learning_rate': 0.00027888869027932353, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.215, 'learning_rate': 0.0002788698680690057, 'epoch': 0.71} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4631/19440 [13:36:34<12:25:45, 3.02s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4632/19440 [13:36:37<12:14:06, 2.97s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1541, 'learning_rate': 0.0002788510458586879, 'epoch': 0.71} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.1569, 'learning_rate': 0.00027883222364837, 'epoch': 0.71} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4633/19440 [13:36:40<12:11:42, 2.96s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4634/19440 [13:36:43<12:08:09, 2.95s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.9354, 'learning_rate': 0.0002788134014380522, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4635/19440 [13:36:46<12:11:15, 2.96s/it] + 24%|█████████████████▏ | 4635/19440 [13:36:46<12:11:15, 2.96s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4636/19440 [13:36:49<12:30:26, 3.04s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.2107, 'learning_rate': 0.0002787757570174165, 'epoch': 0.72} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.0549, 'learning_rate': 0.00027875693480709866, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4637/19440 [13:36:52<12:34:33, 3.06s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4638/19440 [13:36:56<13:01:57, 3.17s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.9145, 'learning_rate': 0.00027873811259678086, 'epoch': 0.72} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.9033, 'learning_rate': 0.00027871929038646306, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4639/19440 [13:36:59<12:48:10, 3.11s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4640/19440 [13:37:01<12:29:59, 3.04s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.8004, 'learning_rate': 0.0002787004681761452, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.7907, 'learning_rate': 0.00027868164596582735, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4641/19440 [13:37:04<12:15:52, 2.98s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4642/19440 [13:37:07<12:06:38, 2.95s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.5941, 'learning_rate': 0.0002786628237555095, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.7125, 'learning_rate': 0.0002786440015451917, 'epoch': 0.72} + 24%|█████████████████▏ | 4643/19440 [13:37:10<12:09:27, 2.96s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4644/19440 [13:37:13<12:11:36, 2.97s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.887, 'learning_rate': 0.00027862517933487384, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4645/19440 [13:37:16<12:14:10, 2.98s/it] + 24%|█████████████████▏ | 4645/19440 [13:37:16<12:14:10, 2.98s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4646/19440 [13:37:19<12:13:26, 2.97s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.3773, 'learning_rate': 0.0002785875349142382, 'epoch': 0.72} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4647/19440 [13:37:22<12:09:07, 2.96s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 5.0636, 'learning_rate': 0.0002785687127039204, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.9951, 'learning_rate': 0.00027854989049360253, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4648/19440 [13:37:25<11:53:11, 2.89s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4649/19440 [13:37:28<11:53:03, 2.89s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.115, 'learning_rate': 0.0002785310682832847, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.9107, 'learning_rate': 0.00027851224607296687, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4650/19440 [13:37:31<12:01:14, 2.93s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▏ | 4651/19440 [13:37:36<14:41:40, 3.58s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.348, 'learning_rate': 0.000278493423862649, 'epoch': 0.72} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 7.0331, 'learning_rate': 0.0002784746016523312, 'epoch': 0.72}
+ 24%|█████████████████▏ | 4652/19440 [13:37:41<16:07:20, 3.92s/it]
+ 24%|█████████████████▏ | 4653/19440 [13:37:45<17:04:39, 4.16s/it]
+{'loss': 7.074, 'learning_rate': 0.00027845577944201336, 'epoch': 0.72}
+ 24%|█████████████████▏ | 4654/19440 [13:37:49<17:07:20, 4.17s/it]
+{'loss': 6.8843, 'learning_rate': 0.00027843695723169556, 'epoch': 0.72}
+ 24%|█████████████████▏ | 4655/19440 [13:37:53<16:55:48, 4.12s/it]
+{'loss': 7.0542, 'learning_rate': 0.0002784181350213777, 'epoch': 0.72}
+ 24%|█████████████████▏ | 4656/19440 [13:37:58<16:56:49, 4.13s/it]
+{'loss': 6.7422, 'learning_rate': 0.00027839931281105985, 'epoch': 0.72}
+{'loss': 6.981, 'learning_rate': 0.00027838049060074205, 'epoch': 0.72}
+ 24%|█████████████████▏ | 4657/19440 [13:38:02<16:44:10, 4.08s/it]
+{'loss': 6.8078, 'learning_rate': 0.0002783616683904242, 'epoch': 0.72}
+ 24%|█████████████████▎ | 4658/19440 [13:38:05<16:34:42, 4.04s/it]
+{'loss': 6.8195, 'learning_rate': 0.0002783428461801064, 'epoch': 0.72}
+ 24%|█████████████████▎ | 4659/19440 [13:38:09<16:29:09, 4.02s/it]
+{'loss': 6.6207, 'learning_rate': 0.00027832402396978854, 'epoch': 0.72}
+ 24%|█████████████████▎ | 4660/19440 [13:38:13<16:20:58, 3.98s/it]
+{'loss': 6.7993, 'learning_rate': 0.00027830520175947074, 'epoch': 0.72}
+ 24%|█████████████████▎ | 4661/19440 [13:38:17<16:12:41, 3.95s/it]
+{'loss': 6.7186, 'learning_rate': 0.0002782863795491529, 'epoch': 0.72}
+ 24%|█████████████████▎ | 4662/19440 [13:38:21<16:05:15, 3.92s/it]
+{'loss': 6.7545, 'learning_rate': 0.00027826755733883503, 'epoch': 0.72}
+ 24%|█████████████████▎ | 4663/19440 [13:38:25<16:15:07, 3.96s/it]
+{'loss': 6.723, 'learning_rate': 0.0002782487351285172, 'epoch': 0.72}
+ 24%|█████████████████▎ | 4664/19440 [13:38:29<16:16:39, 3.97s/it]
+{'loss': 6.7797, 'learning_rate': 0.0002782299129181994, 'epoch': 0.72}
+ 24%|█████████████████▎ | 4665/19440 [13:38:33<16:12:51, 3.95s/it]
+{'loss': 6.5621, 'learning_rate': 0.0002782110907078816, 'epoch': 0.72}
+ 24%|█████████████████▎ | 4666/19440 [13:38:37<16:12:35, 3.95s/it]
+{'loss': 6.6034, 'learning_rate': 0.0002781922684975637, 'epoch': 0.72}
+ 24%|█████████████████▎ | 4667/19440 [13:38:41<16:12:27, 3.95s/it]
+{'loss': 6.5688, 'learning_rate': 0.00027817344628724587, 'epoch': 0.72}
+ 24%|█████████████████▎ | 4668/19440 [13:38:45<16:02:41, 3.91s/it]
+{'loss': 6.4596, 'learning_rate': 0.000278154624076928, 'epoch': 0.72}
+ 24%|█████████████████▎ | 4669/19440 [13:38:48<15:53:22, 3.87s/it]
+{'loss': 6.2746, 'learning_rate': 0.0002781358018666102, 'epoch': 0.72}
+ 24%|█████████████████▎ | 4670/19440 [13:38:52<15:32:55, 3.79s/it]
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.6376, 'learning_rate': 0.00027811697965629236, 'epoch': 0.72} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▎ | 4671/19440 [13:38:56<15:24:03, 3.75s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4536, 'learning_rate': 0.00027809815744597456, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▎ | 4672/19440 [13:38:59<15:03:03, 3.67s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3652, 'learning_rate': 0.00027807933523565676, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▎ | 4673/19440 [13:39:03<14:49:05, 3.61s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4557, 'learning_rate': 0.0002780605130253389, 'epoch': 0.72} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▎ | 4674/19440 [13:39:06<14:38:37, 3.57s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3577, 'learning_rate': 0.00027804169081502105, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▎ | 4675/19440 [13:39:10<15:11:03, 3.70s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3268, 'learning_rate': 0.0002780228686047032, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▎ | 4676/19440 [13:39:14<15:04:24, 3.68s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▎ | 4677/19440 [13:39:17<14:38:37, 3.57s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.2078, 'learning_rate': 0.0002780040463943854, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.0041, 'learning_rate': 0.00027798522418406754, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▎ | 4678/19440 [13:39:20<14:20:38, 3.50s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.1512, 'learning_rate': 0.00027796640197374974, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▎ | 4679/19440 [13:39:24<14:10:07, 3.46s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4513, 'learning_rate': 0.0002779475797634319, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▎ | 4680/19440 [13:39:27<13:54:13, 3.39s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.1786, 'learning_rate': 0.0002779287575531141, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▎ | 4681/19440 [13:39:30<13:37:24, 3.32s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▎ | 4682/19440 [13:39:33<13:28:31, 3.29s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.0944, 'learning_rate': 0.00027790993534279623, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1413, 'learning_rate': 0.0002778911131324784, 'epoch': 0.72} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▎ | 4683/19440 [13:39:37<13:19:29, 3.25s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▎ | 4684/19440 [13:39:40<13:12:31, 3.22s/it] + 24%|█████████████████▎ | 4684/19440 [13:39:40<13:12:31, 3.22s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.889, 'learning_rate': 0.0002778534687118427, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▎ | 4685/19440 [13:39:43<13:04:54, 3.19s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.8464, 'learning_rate': 0.0002778346465015249, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▎ | 4686/19440 [13:39:46<13:09:11, 3.21s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.0663, 'learning_rate': 0.00027781582429120706, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▎ | 4687/19440 [13:39:49<13:00:47, 3.18s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.9391, 'learning_rate': 0.00027779700208088926, 'epoch': 0.72} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▎ | 4688/19440 [13:39:53<13:26:14, 3.28s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▎ | 4689/19440 [13:39:56<13:05:50, 3.20s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.8678, 'learning_rate': 0.0002777781798705714, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.8005, 'learning_rate': 0.00027775935766025355, 'epoch': 0.72} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▎ | 4690/19440 [13:39:59<12:47:33, 3.12s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.6473, 'learning_rate': 0.00027774053544993575, 'epoch': 0.72} + 24%|█████████████████▎ | 4691/19440 [13:40:02<12:35:58, 3.08s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.6299, 'learning_rate': 0.0002777217132396179, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4692/19440 [13:40:05<12:27:09, 3.04s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.5183, 'learning_rate': 0.0002777028910293001, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4693/19440 [13:40:08<12:22:10, 3.02s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.4435, 'learning_rate': 0.00027768406881898224, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4694/19440 [13:40:10<12:11:20, 2.98s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.4231, 'learning_rate': 0.0002776652466086644, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4695/19440 [13:40:13<11:56:42, 2.92s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.4258, 'learning_rate': 0.00027764642439834653, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4696/19440 [13:40:16<11:40:38, 2.85s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.1854, 'learning_rate': 0.00027762760218802873, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4697/19440 [13:40:19<11:33:02, 2.82s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.1783, 'learning_rate': 0.0002776087799777109, 'epoch': 0.72} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4698/19440 [13:40:21<11:16:59, 2.76s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4699/19440 [13:40:24<10:51:10, 2.65s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.5861, 'learning_rate': 0.0002775899577673931, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 4.7934, 'learning_rate': 0.0002775711355570753, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4700/19440 [13:40:26<10:55:56, 2.67s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.589, 'learning_rate': 0.0002775523133467574, 'epoch': 0.73} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4701/19440 [13:40:31<13:25:19, 3.28s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4702/19440 [13:40:35<14:33:48, 3.56s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.2765, 'learning_rate': 0.00027753349113643957, 'epoch': 0.73} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4703/19440 [13:40:39<15:12:24, 3.71s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.1871, 'learning_rate': 0.0002775146689261217, 'epoch': 0.73} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.9412, 'learning_rate': 0.0002774958467158039, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4704/19440 [13:40:43<15:31:18, 3.79s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.0798, 'learning_rate': 0.00027747702450548606, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4705/19440 [13:40:47<15:36:42, 3.81s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.9433, 'learning_rate': 0.00027745820229516826, 'epoch': 0.73} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4706/19440 [13:40:51<15:40:57, 3.83s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.045, 'learning_rate': 0.0002774393800848504, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4707/19440 [13:40:55<15:35:06, 3.81s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.8228, 'learning_rate': 0.0002774205578745326, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4708/19440 [13:40:59<15:27:56, 3.78s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4709/19440 [13:41:02<15:20:35, 3.75s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.9051, 'learning_rate': 0.00027740173566421475, 'epoch': 0.73} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4710/19440 [13:41:06<15:13:25, 3.72s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.6681, 'learning_rate': 0.0002773829134538969, 'epoch': 0.73} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.7665, 'learning_rate': 0.0002773640912435791, 'epoch': 0.73} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4711/19440 [13:41:09<15:01:23, 3.67s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4712/19440 [13:41:13<14:52:05, 3.63s/it] + 24%|█████████████████▍ | 4712/19440 [13:41:13<14:52:05, 3.63s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4713/19440 [13:41:17<15:12:25, 3.72s/it] + 24%|█████████████████▍ | 4713/19440 [13:41:17<15:12:25, 3.72s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4714/19440 [13:41:20<14:55:20, 3.65s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.7514, 'learning_rate': 0.0002773076246126256, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.2819, 'learning_rate': 0.0002772888024023078, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4715/19440 [13:41:24<14:35:38, 3.57s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.6231, 'learning_rate': 0.00027726998019198993, 'epoch': 0.73} + 24%|█████████████████▍ | 4716/19440 [13:41:27<14:26:27, 3.53s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4717/19440 [13:41:31<14:12:40, 3.47s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6114, 'learning_rate': 0.0002772511579816721, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5138, 'learning_rate': 0.0002772323357713543, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4718/19440 [13:41:34<13:55:34, 3.41s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4719/19440 [13:41:37<13:42:38, 3.35s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.4041, 'learning_rate': 0.0002772135135610364, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4720/19440 [13:41:40<13:33:31, 3.32s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.429, 'learning_rate': 0.0002771946913507186, 'epoch': 0.73} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5116, 'learning_rate': 0.00027717586914040076, 'epoch': 0.73} + 24%|█████████████████▍ | 4721/19440 [13:41:43<13:25:39, 3.28s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4722/19440 [13:41:47<13:21:14, 3.27s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5101, 'learning_rate': 0.0002771570469300829, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5043, 'learning_rate': 0.00027713822471976506, 'epoch': 0.73} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4723/19440 [13:41:50<13:17:24, 3.25s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▍ | 4724/19440 [13:41:53<13:08:24, 3.21s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.3563, 'learning_rate': 0.00027711940250944726, 'epoch': 0.73} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4725/19440 [13:41:57<13:35:39, 3.33s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4411, 'learning_rate': 0.00027710058029912945, 'epoch': 0.73} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.389, 'learning_rate': 0.0002770817580888116, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4726/19440 [13:42:00<13:23:18, 3.28s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4727/19440 [13:42:03<13:07:20, 3.21s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.273, 'learning_rate': 0.0002770629358784938, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3196, 'learning_rate': 0.00027704411366817594, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4728/19440 [13:42:06<12:53:28, 3.15s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4729/19440 [13:42:09<12:44:02, 3.12s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3879, 'learning_rate': 0.0002770252914578581, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.161, 'learning_rate': 0.00027700646924754024, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4730/19440 [13:42:12<12:31:16, 3.06s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4731/19440 [13:42:15<12:22:56, 3.03s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.033, 'learning_rate': 0.00027698764703722244, 'epoch': 0.73} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.9967, 'learning_rate': 0.0002769688248269046, 'epoch': 0.73} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4732/19440 [13:42:18<12:15:21, 3.00s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4733/19440 [13:42:21<12:10:54, 2.98s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.0794, 'learning_rate': 0.0002769500026165868, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.0235, 'learning_rate': 0.000276931180406269, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4734/19440 [13:42:24<12:06:26, 2.96s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4735/19440 [13:42:26<11:59:07, 2.93s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.0298, 'learning_rate': 0.0002769123581959511, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4736/19440 [13:42:29<11:50:53, 2.90s/it] + 24%|█████████████████▌ | 4736/19440 [13:42:29<11:50:53, 2.90s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.895, 'learning_rate': 0.0002768747137753154, 'epoch': 0.73} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4737/19440 [13:42:32<11:46:15, 2.88s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4738/19440 [13:42:35<12:08:38, 2.97s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 5.7284, 'learning_rate': 0.0002768558915649976, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.8782, 'learning_rate': 0.00027683706935467976, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4739/19440 [13:42:38<12:02:30, 2.95s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4740/19440 [13:42:41<11:48:53, 2.89s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.8922, 'learning_rate': 0.00027681824714436196, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4741/19440 [13:42:44<11:34:27, 2.83s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.7636, 'learning_rate': 0.0002767994249340441, 'epoch': 0.73} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4742/19440 [13:42:46<11:26:54, 2.80s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.4597, 'learning_rate': 0.0002767806027237263, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4743/19440 [13:42:49<11:17:07, 2.76s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.3201, 'learning_rate': 0.00027676178051340845, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.4241, 'learning_rate': 0.0002767429583030906, 'epoch': 0.73} + 24%|█████████████████▌ | 4744/19440 [13:42:52<11:03:23, 2.71s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.3724, 'learning_rate': 0.0002767241360927728, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4745/19440 [13:42:54<11:02:35, 2.71s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4746/19440 [13:42:57<10:50:12, 2.66s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.2164, 'learning_rate': 0.00027670531388245494, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4747/19440 [13:42:59<10:36:10, 2.60s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.9254, 'learning_rate': 0.00027668649167213714, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.0058, 'learning_rate': 0.0002766676694618193, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4748/19440 [13:43:02<10:28:33, 2.57s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.8257, 'learning_rate': 0.0002766488472515015, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4749/19440 [13:43:04<10:13:52, 2.51s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4750/19440 [13:43:07<10:28:05, 2.57s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 4.4699, 'learning_rate': 0.00027663002504118363, 'epoch': 0.73} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4751/19440 [13:43:12<13:09:23, 3.22s/it] + 24%|█████████████████▌ | 4751/19440 [13:43:12<13:09:23, 3.22s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4752/19440 [13:43:16<14:32:32, 3.56s/it] + 24%|█████████████████▌ | 4752/19440 [13:43:16<14:32:32, 3.56s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.2062, 'learning_rate': 0.0002765735584102301, 'epoch': 0.73} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4753/19440 [13:43:20<15:12:51, 3.73s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 7.1004, 'learning_rate': 0.0002765547361999123, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4754/19440 [13:43:24<15:33:41, 3.81s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.9636, 'learning_rate': 0.00027653591398959447, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4755/19440 [13:43:28<15:44:11, 3.86s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.9021, 'learning_rate': 0.0002765170917792766, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4756/19440 [13:43:32<15:45:00, 3.86s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4757/19440 [13:43:36<15:47:07, 3.87s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.8594, 'learning_rate': 0.00027649826956895876, 'epoch': 0.73} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▌ | 4758/19440 [13:43:40<15:36:53, 3.83s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.014, 'learning_rate': 0.00027647944735864096, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▋ | 4759/19440 [13:43:43<15:26:48, 3.79s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.9357, 'learning_rate': 0.0002764606251483231, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▋ | 4760/19440 [13:43:47<15:18:50, 3.76s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.9399, 'learning_rate': 0.0002764418029380053, 'epoch': 0.73} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.7303, 'learning_rate': 0.0002764229807276875, 'epoch': 0.73} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▋ | 4761/19440 [13:43:51<15:06:47, 3.71s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 24%|█████████████████▋ | 4762/19440 [13:43:54<14:55:42, 3.66s/it] + 24%|█████████████████▋ | 4762/19440 [13:43:54<14:55:42, 3.66s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4763/19440 [13:43:58<15:13:23, 3.73s/it] + 25%|█████████████████▋ | 4763/19440 [13:43:58<15:13:23, 3.73s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4764/19440 [13:44:02<14:57:17, 3.67s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4275, 'learning_rate': 0.00027636651409673394, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4765/19440 [13:44:05<14:42:36, 3.61s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.7445, 'learning_rate': 0.00027634769188641614, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4766/19440 [13:44:08<14:26:33, 3.54s/it] + 25%|█████████████████▋ | 4766/19440 [13:44:08<14:26:33, 3.54s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4767/19440 [13:44:12<14:15:08, 3.50s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6948, 'learning_rate': 0.0002763100474657805, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4768/19440 [13:44:15<13:58:22, 3.43s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5109, 'learning_rate': 0.0002762912252554627, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4769/19440 [13:44:18<13:44:45, 3.37s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4878, 'learning_rate': 0.0002762724030451448, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4770/19440 [13:44:22<13:38:46, 3.35s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5263, 'learning_rate': 0.00027625358083482697, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5089, 'learning_rate': 0.0002762347586245091, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4771/19440 [13:44:25<13:33:19, 3.33s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4772/19440 [13:44:28<13:27:04, 3.30s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.692, 'learning_rate': 0.0002762159364141913, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4773/19440 [13:44:31<13:18:41, 3.27s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3741, 'learning_rate': 0.00027619711420387346, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4774/19440 [13:44:34<13:07:03, 3.22s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.3142, 'learning_rate': 0.00027617829199355566, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4775/19440 [13:44:38<13:33:12, 3.33s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1901, 'learning_rate': 0.0002761594697832378, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.2671, 'learning_rate': 0.00027614064757292, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4776/19440 [13:44:41<13:22:44, 3.28s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4777/19440 [13:44:44<13:08:26, 3.23s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.1154, 'learning_rate': 0.00027612182536260215, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4778/19440 [13:44:47<12:57:24, 3.18s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.2491, 'learning_rate': 0.0002761030031522843, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4779/19440 [13:44:50<12:45:24, 3.13s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.211, 'learning_rate': 0.0002760841809419665, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4780/19440 [13:44:53<12:35:19, 3.09s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.0488, 'learning_rate': 0.00027606535873164864, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4781/19440 [13:44:56<12:27:47, 3.06s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1039, 'learning_rate': 0.00027604653652133084, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4782/19440 [13:44:59<12:21:26, 3.03s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.2226, 'learning_rate': 0.000276027714311013, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4783/19440 [13:45:02<12:13:47, 3.00s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.9709, 'learning_rate': 0.00027600889210069513, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4784/19440 [13:45:05<12:09:03, 2.98s/it] + 25%|█████████████████▋ | 4784/19440 [13:45:05<12:09:03, 2.98s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4785/19440 [13:45:08<12:06:11, 2.97s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.3233, 'learning_rate': 0.0002759712476800595, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4786/19440 [13:45:11<12:02:33, 2.96s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.0436, 'learning_rate': 0.0002759524254697417, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4787/19440 [13:45:14<11:56:02, 2.93s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.7645, 'learning_rate': 0.0002759336032594238, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.9731, 'learning_rate': 0.000275914781049106, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4788/19440 [13:45:17<12:19:21, 3.03s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4789/19440 [13:45:20<12:06:45, 2.98s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.9301, 'learning_rate': 0.00027589595883878817, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4790/19440 [13:45:23<11:52:03, 2.92s/it] + 25%|█████████████████▋ | 4790/19440 [13:45:23<11:52:03, 2.92s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4791/19440 [13:45:26<11:39:04, 2.86s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.9246, 'learning_rate': 0.00027585831441815246, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▋ | 4792/19440 [13:45:28<11:27:26, 2.82s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.5285, 'learning_rate': 0.00027583949220783466, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4793/19440 [13:45:31<11:16:01, 2.77s/it] + 25%|█████████████████▊ | 4793/19440 [13:45:31<11:16:01, 2.77s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4794/19440 [13:45:34<11:08:44, 2.74s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +{'loss': 5.4692, 'learning_rate': 0.000275801847787199, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4795/19440 [13:45:36<10:56:03, 2.69s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.2974, 'learning_rate': 0.0002757830255768812, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4796/19440 [13:45:39<10:44:29, 2.64s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.2711, 'learning_rate': 0.00027576420336656335, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.2314, 'learning_rate': 0.0002757453811562455, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4797/19440 [13:45:41<10:36:19, 2.61s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4798/19440 [13:45:44<10:30:38, 2.58s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.0352, 'learning_rate': 0.00027572655894592764, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4799/19440 [13:45:46<10:20:34, 2.54s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 4.781, 'learning_rate': 0.00027570773673560984, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4800/19440 [13:45:49<10:47:16, 2.65s/it] + 25%|█████████████████▊ | 4800/19440 [13:45:49<10:47:16, 2.65s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4801/19440 [13:45:54<13:33:24, 3.33s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.2947, 'learning_rate': 0.0002756700923149742, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4802/19440 [13:45:58<14:49:05, 3.64s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.4005, 'learning_rate': 0.00027565127010465633, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4803/19440 [13:46:03<15:25:18, 3.79s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.1867, 'learning_rate': 0.0002756324478943385, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4804/19440 [13:46:07<15:39:46, 3.85s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.2521, 'learning_rate': 0.00027561362568402067, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4805/19440 [13:46:10<15:43:43, 3.87s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.0579, 'learning_rate': 0.0002755948034737028, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4806/19440 [13:46:14<15:41:57, 3.86s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.0671, 'learning_rate': 0.000275575981263385, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4807/19440 [13:46:18<15:48:02, 3.89s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.8629, 'learning_rate': 0.00027555715905306716, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4808/19440 [13:46:22<15:40:47, 3.86s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.8529, 'learning_rate': 0.00027553833684274936, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4809/19440 [13:46:26<15:31:36, 3.82s/it] + 25%|█████████████████▊ | 4809/19440 [13:46:26<15:31:36, 3.82s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4810/19440 [13:46:29<15:19:05, 3.77s/it] + 25%|█████████████████▊ | 4810/19440 [13:46:29<15:19:05, 3.77s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.9694, 'learning_rate': 0.00027548187021179585, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4811/19440 [13:46:33<15:04:47, 3.71s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.8707, 'learning_rate': 0.000275463048001478, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4812/19440 [13:46:37<14:51:48, 3.66s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.54, 'learning_rate': 0.0002754442257911602, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4813/19440 [13:46:41<15:12:08, 3.74s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.8488, 'learning_rate': 0.00027542540358084234, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4814/19440 [13:46:44<14:55:06, 3.67s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.6953, 'learning_rate': 0.00027540658137052454, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4815/19440 [13:46:47<14:39:17, 3.61s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4816/19440 [13:46:51<14:24:42, 3.55s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5627, 'learning_rate': 0.0002753877591602067, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5748, 'learning_rate': 0.00027536893694988883, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4817/19440 [13:46:54<14:14:30, 3.51s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6648, 'learning_rate': 0.000275350114739571, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4818/19440 [13:46:58<14:00:46, 3.45s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5565, 'learning_rate': 0.0002753312925292532, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4819/19440 [13:47:01<13:49:41, 3.40s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.2167, 'learning_rate': 0.0002753124703189354, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4820/19440 [13:47:04<13:36:56, 3.35s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3719, 'learning_rate': 0.0002752936481086175, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4821/19440 [13:47:07<13:25:55, 3.31s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.3988, 'learning_rate': 0.0002752748258982997, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4822/19440 [13:47:11<13:19:58, 3.28s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4968, 'learning_rate': 0.00027525600368798187, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4823/19440 [13:47:14<13:11:43, 3.25s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.283, 'learning_rate': 0.000275237181477664, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4824/19440 [13:47:17<13:06:42, 3.23s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.5509, 'learning_rate': 0.00027521835926734616, 'epoch': 0.74} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4825/19440 [13:47:21<13:33:04, 3.34s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.2809, 'learning_rate': 0.00027519953705702836, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▊ | 4826/19440 [13:47:24<13:26:10, 3.31s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.478, 'learning_rate': 0.0002751807148467105, 'epoch': 0.74} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4827/19440 [13:47:27<13:19:37, 3.28s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4069, 'learning_rate': 0.0002751618926363927, 'epoch': 0.75} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4828/19440 [13:47:30<13:05:15, 3.22s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.1734, 'learning_rate': 0.0002751430704260749, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4829/19440 [13:47:33<12:55:25, 3.18s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.2004, 'learning_rate': 0.00027512424821575705, 'epoch': 0.75} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4830/19440 [13:47:36<12:44:52, 3.14s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.4747, 'learning_rate': 0.0002751054260054392, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4831/19440 [13:47:39<12:35:22, 3.10s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.8532, 'learning_rate': 0.00027508660379512134, 'epoch': 0.75} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4832/19440 [13:47:42<12:24:46, 3.06s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.9937, 'learning_rate': 0.00027506778158480354, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4833/19440 [13:47:45<12:14:34, 3.02s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.135, 'learning_rate': 0.0002750489593744857, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4834/19440 [13:47:48<12:05:24, 2.98s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.9562, 'learning_rate': 0.0002750301371641679, 'epoch': 0.75} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4835/19440 [13:47:51<11:59:25, 2.96s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.0763, 'learning_rate': 0.00027501131495385003, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4836/19440 [13:47:54<11:55:40, 2.94s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.014, 'learning_rate': 0.0002749924927435322, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4837/19440 [13:47:57<11:50:20, 2.92s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.8597, 'learning_rate': 0.0002749736705332144, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4838/19440 [13:48:00<12:16:41, 3.03s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.9722, 'learning_rate': 0.0002749548483228965, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4839/19440 [13:48:03<12:12:28, 3.01s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.8003, 'learning_rate': 0.0002749360261125787, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4840/19440 [13:48:06<11:59:54, 2.96s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.6318, 'learning_rate': 0.00027491720390226086, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4841/19440 [13:48:08<11:43:07, 2.89s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.8095, 'learning_rate': 0.00027489838169194306, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4842/19440 [13:48:11<11:27:40, 2.83s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.4658, 'learning_rate': 0.0002748795594816252, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4843/19440 [13:48:14<11:16:17, 2.78s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.5432, 'learning_rate': 0.00027486073727130735, 'epoch': 0.75} + 25%|█████████████████▉ | 4844/19440 [13:48:16<11:05:38, 2.74s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.3245, 'learning_rate': 0.0002748419150609895, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4845/19440 [13:48:19<10:55:56, 2.70s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.2222, 'learning_rate': 0.0002748230928506717, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4846/19440 [13:48:22<10:43:52, 2.65s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.9979, 'learning_rate': 0.0002748042706403539, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4847/19440 [13:48:24<10:35:46, 2.61s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.0529, 'learning_rate': 0.00027478544843003604, 'epoch': 0.75} + 25%|█████████████████▉ | 4848/19440 [13:48:27<10:25:24, 2.57s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4849/19440 [13:48:29<10:14:25, 2.53s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.8162, 'learning_rate': 0.00027476662621971824, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4850/19440 [13:48:32<10:26:40, 2.58s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 4.3599, 'learning_rate': 0.0002747478040094004, 'epoch': 0.75} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4851/19440 [13:48:36<13:09:31, 3.25s/it] + 25%|█████████████████▉ | 4851/19440 [13:48:36<13:09:31, 3.25s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.2122, 'learning_rate': 0.0002747101595887647, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4852/19440 [13:48:41<14:28:36, 3.57s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.7006, 'learning_rate': 0.0002746913373784469, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4853/19440 [13:48:45<15:08:26, 3.74s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4854/19440 [13:48:49<15:25:02, 3.81s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.0065, 'learning_rate': 0.0002746725151681291, 'epoch': 0.75} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4855/19440 [13:48:53<15:30:13, 3.83s/it] + 25%|█████████████████▉ | 4855/19440 [13:48:53<15:30:13, 3.83s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.9263, 'learning_rate': 0.0002746348707474934, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4856/19440 [13:48:57<15:31:55, 3.83s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.0855, 'learning_rate': 0.00027461604853717557, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4857/19440 [13:49:01<15:38:19, 3.86s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.7751, 'learning_rate': 0.0002745972263268577, 'epoch': 0.75} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4858/19440 [13:49:04<15:27:16, 3.82s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.6826, 'learning_rate': 0.00027457840411653986, 'epoch': 0.75} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|█████████████████▉ | 4859/19440 [13:49:08<15:18:15, 3.78s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.8831, 'learning_rate': 0.00027455958190622206, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4860/19440 [13:49:12<15:08:42, 3.74s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.7364, 'learning_rate': 0.0002745407596959042, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4861/19440 [13:49:15<15:09:43, 3.74s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5898, 'learning_rate': 0.0002745219374855864, 'epoch': 0.75} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4862/19440 [13:49:19<14:55:33, 3.69s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.6043, 'learning_rate': 0.00027450311527526855, 'epoch': 0.75} + 25%|██████████████████ | 4863/19440 [13:49:23<15:11:57, 3.75s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.7575, 'learning_rate': 0.0002744842930649507, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4864/19440 [13:49:26<14:54:19, 3.68s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6094, 'learning_rate': 0.0002744654708546329, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4865/19440 [13:49:30<14:32:50, 3.59s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6242, 'learning_rate': 0.00027444664864431504, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4866/19440 [13:49:33<14:17:50, 3.53s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.7218, 'learning_rate': 0.00027442782643399724, 'epoch': 0.75} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4867/19440 [13:49:36<14:05:05, 3.48s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4868/19440 [13:49:40<13:51:58, 3.43s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6493, 'learning_rate': 0.0002744090042236794, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4869/19440 [13:49:43<13:40:22, 3.38s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.1999, 'learning_rate': 0.0002743901820133616, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4870/19440 [13:49:46<13:31:54, 3.34s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.6203, 'learning_rate': 0.00027437135980304373, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4871/19440 [13:49:49<13:19:56, 3.29s/it] + 25%|██████████████████ | 4871/19440 [13:49:49<13:19:56, 3.29s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.408, 'learning_rate': 0.0002743337153824081, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4872/19440 [13:49:53<13:08:38, 3.25s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4555, 'learning_rate': 0.0002743148931720902, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4873/19440 [13:49:56<12:58:04, 3.20s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4002, 'learning_rate': 0.0002742960709617724, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4874/19440 [13:49:59<12:50:34, 3.17s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5174, 'learning_rate': 0.00027427724875145456, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4875/19440 [13:50:02<13:19:01, 3.29s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.2615, 'learning_rate': 0.00027425842654113676, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4876/19440 [13:50:06<13:11:35, 3.26s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.441, 'learning_rate': 0.0002742396043308189, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4877/19440 [13:50:09<12:57:42, 3.20s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4878/19440 [13:50:12<12:46:48, 3.16s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3628, 'learning_rate': 0.00027422078212050105, 'epoch': 0.75} +{'loss': 6.2503, 'learning_rate': 0.0002742019599101832, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4879/19440 [13:50:15<12:37:25, 3.12s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.1378, 'learning_rate': 0.0002741831376998654, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4880/19440 [13:50:18<12:30:45, 3.09s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4845, 'learning_rate': 0.0002741643154895476, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4881/19440 [13:50:21<12:23:32, 3.06s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.2495, 'learning_rate': 0.00027414549327922974, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4882/19440 [13:50:24<12:17:38, 3.04s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.2519, 'learning_rate': 0.00027412667106891194, 'epoch': 0.75} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4883/19440 [13:50:27<12:07:37, 3.00s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4884/19440 [13:50:30<12:00:27, 2.97s/it] + 25%|██████████████████ | 4884/19440 [13:50:30<12:00:27, 2.97s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.9846, 'learning_rate': 0.00027408902664827623, 'epoch': 0.75} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4885/19440 [13:50:32<11:51:43, 2.93s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.8919, 'learning_rate': 0.0002740702044379584, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4886/19440 [13:50:35<11:46:59, 2.91s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.2938, 'learning_rate': 0.0002740513822276406, 'epoch': 0.75} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4887/19440 [13:50:38<11:38:29, 2.88s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.6733, 'learning_rate': 0.0002740325600173227, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4888/19440 [13:50:41<12:04:36, 2.99s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.8034, 'learning_rate': 0.0002740137378070049, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4889/19440 [13:50:44<11:55:49, 2.95s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.0186, 'learning_rate': 0.00027399491559668707, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4890/19440 [13:50:47<11:42:50, 2.90s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4891/19440 [13:50:50<11:28:57, 2.84s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.7093, 'learning_rate': 0.0002739760933863692, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.624, 'learning_rate': 0.0002739572711760514, 'epoch': 0.75} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4892/19440 [13:50:52<11:16:51, 2.79s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.8464, 'learning_rate': 0.00027393844896573356, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████ | 4893/19440 [13:50:55<11:08:15, 2.76s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.4867, 'learning_rate': 0.00027391962675541576, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4894/19440 [13:50:58<10:55:59, 2.71s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.4445, 'learning_rate': 0.0002739008045450979, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4895/19440 [13:51:00<10:46:13, 2.67s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.6013, 'learning_rate': 0.0002738819823347801, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4896/19440 [13:51:03<10:36:59, 2.63s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.2468, 'learning_rate': 0.00027386316012446225, 'epoch': 0.76} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4897/19440 [13:51:05<10:25:31, 2.58s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 4.8301, 'learning_rate': 0.0002738443379141444, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4898/19440 [13:51:08<10:15:06, 2.54s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 4.7085, 'learning_rate': 0.0002738255157038266, 'epoch': 0.76} + 25%|██████████████████▏ | 4899/19440 [13:51:10<10:05:48, 2.50s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.5608, 'learning_rate': 0.00027380669349350874, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4900/19440 [13:51:13<10:23:45, 2.57s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.1923, 'learning_rate': 0.00027378787128319094, 'epoch': 0.76} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4901/19440 [13:51:17<12:59:27, 3.22s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.179, 'learning_rate': 0.0002737690490728731, 'epoch': 0.76} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4902/19440 [13:51:22<14:11:52, 3.52s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.2539, 'learning_rate': 0.0002737502268625553, 'epoch': 0.76} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4903/19440 [13:51:26<14:51:10, 3.68s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.9662, 'learning_rate': 0.00027373140465223743, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4904/19440 [13:51:30<15:06:18, 3.74s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.8825, 'learning_rate': 0.0002737125824419196, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4905/19440 [13:51:33<15:13:14, 3.77s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.7272, 'learning_rate': 0.0002736937602316018, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4906/19440 [13:51:37<15:20:56, 3.80s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.8392, 'learning_rate': 0.0002736749380212839, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4907/19440 [13:51:41<15:24:03, 3.82s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.7329, 'learning_rate': 0.0002736561158109661, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4908/19440 [13:51:45<15:17:27, 3.79s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.7501, 'learning_rate': 0.00027363729360064827, 'epoch': 0.76} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4909/19440 [13:51:49<15:08:43, 3.75s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4910/19440 [13:51:52<14:57:33, 3.71s/it] + 25%|██████████████████▏ | 4910/19440 [13:51:52<14:57:33, 3.71s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.7851, 'learning_rate': 0.0002735996491800126, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4911/19440 [13:51:56<14:46:38, 3.66s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.8397, 'learning_rate': 0.00027358082696969476, 'epoch': 0.76} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4912/19440 [13:51:59<14:34:23, 3.61s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.6595, 'learning_rate': 0.0002735620047593769, 'epoch': 0.76} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4913/19440 [13:52:03<15:06:54, 3.75s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6558, 'learning_rate': 0.0002735431825490591, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4914/19440 [13:52:07<14:43:47, 3.65s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6714, 'learning_rate': 0.0002735243603387413, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4915/19440 [13:52:10<14:24:56, 3.57s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.7129, 'learning_rate': 0.00027350553812842345, 'epoch': 0.76} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4916/19440 [13:52:13<14:09:23, 3.51s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.6867, 'learning_rate': 0.0002734867159181056, 'epoch': 0.76} + 25%|██████████████████▏ | 4917/19440 [13:52:17<13:53:33, 3.44s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.5279, 'learning_rate': 0.00027346789370778774, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4918/19440 [13:52:20<13:41:59, 3.40s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4237, 'learning_rate': 0.00027344907149746994, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4919/19440 [13:52:23<13:31:28, 3.35s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4920/19440 [13:52:27<13:21:39, 3.31s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4714, 'learning_rate': 0.0002734302492871521, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4741, 'learning_rate': 0.0002734114270768343, 'epoch': 0.76} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4921/19440 [13:52:30<13:12:22, 3.27s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4922/19440 [13:52:33<13:05:02, 3.24s/it] + 25%|██████████████████▏ | 4922/19440 [13:52:33<13:05:02, 3.24s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.2436, 'learning_rate': 0.0002733737826561986, 'epoch': 0.76} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4923/19440 [13:52:36<12:59:02, 3.22s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4666, 'learning_rate': 0.00027335496044588077, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4924/19440 [13:52:39<12:51:48, 3.19s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4925/19440 [13:52:43<13:13:05, 3.28s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.2857, 'learning_rate': 0.0002733361382355629, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1286, 'learning_rate': 0.0002733173160252451, 'epoch': 0.76} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4926/19440 [13:52:46<13:07:15, 3.25s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▏ | 4927/19440 [13:52:49<12:51:32, 3.19s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.2811, 'learning_rate': 0.00027329849381492726, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.4048, 'learning_rate': 0.00027327967160460946, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4928/19440 [13:52:52<12:42:44, 3.15s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4929/19440 [13:52:55<12:30:37, 3.10s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.2248, 'learning_rate': 0.0002732608493942916, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1655, 'learning_rate': 0.0002732420271839738, 'epoch': 0.76} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4930/19440 [13:52:58<12:21:35, 3.07s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4931/19440 [13:53:01<12:19:53, 3.06s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.054, 'learning_rate': 0.00027322320497365595, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1326, 'learning_rate': 0.0002732043827633381, 'epoch': 0.76} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4932/19440 [13:53:04<12:15:05, 3.04s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1716, 'learning_rate': 0.0002731855605530203, 'epoch': 0.76} + 25%|██████████████████▎ | 4933/19440 [13:53:07<12:06:50, 3.01s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.9764, 'learning_rate': 0.00027316673834270244, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4934/19440 [13:53:10<11:58:50, 2.97s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4935/19440 [13:53:13<11:50:34, 2.94s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.7058, 'learning_rate': 0.00027314791613238464, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.8308, 'learning_rate': 0.0002731290939220668, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4936/19440 [13:53:16<11:44:47, 2.92s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.9954, 'learning_rate': 0.000273110271711749, 'epoch': 0.76} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4937/19440 [13:53:18<11:37:12, 2.88s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 5.7846, 'learning_rate': 0.00027309144950143113, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4938/19440 [13:53:22<12:02:33, 2.99s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.9522, 'learning_rate': 0.0002730726272911133, 'epoch': 0.76} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4939/19440 [13:53:24<11:50:47, 2.94s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.8006, 'learning_rate': 0.0002730538050807954, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4940/19440 [13:53:27<11:38:40, 2.89s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.5435, 'learning_rate': 0.0002730349828704776, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4941/19440 [13:53:30<11:28:14, 2.85s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.7108, 'learning_rate': 0.0002730161606601598, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4942/19440 [13:53:33<11:17:26, 2.80s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.6292, 'learning_rate': 0.00027299733844984197, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4943/19440 [13:53:35<11:06:25, 2.76s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.4479, 'learning_rate': 0.0002729785162395241, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4944/19440 [13:53:38<10:56:14, 2.72s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.3075, 'learning_rate': 0.00027295969402920626, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4945/19440 [13:53:40<10:44:42, 2.67s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4946/19440 [13:53:43<10:34:07, 2.63s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.1371, 'learning_rate': 0.00027294087181888846, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 5.37, 'learning_rate': 0.0002729220496085706, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4947/19440 [13:53:45<10:23:47, 2.58s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.1046, 'learning_rate': 0.0002729032273982528, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4948/19440 [13:53:48<10:14:07, 2.54s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.7352, 'learning_rate': 0.000272884405187935, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4949/19440 [13:53:50<10:02:18, 2.49s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4950/19440 [13:53:53<10:19:01, 2.56s/it] + 25%|██████████████████▎ | 4950/19440 [13:53:53<10:19:01, 2.56s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.2163, 'learning_rate': 0.0002728467607672993, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4951/19440 [13:53:58<12:53:37, 3.20s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.1437, 'learning_rate': 0.00027282793855698144, 'epoch': 0.76} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4952/19440 [13:54:02<14:09:33, 3.52s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.94, 'learning_rate': 0.00027280911634666364, 'epoch': 0.76} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4953/19440 [13:54:06<14:51:03, 3.69s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.1933, 'learning_rate': 0.0002727902941363458, 'epoch': 0.76} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4954/19440 [13:54:10<15:12:58, 3.78s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.989, 'learning_rate': 0.000272771471926028, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4955/19440 [13:54:14<15:17:32, 3.80s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.0769, 'learning_rate': 0.0002727526497157101, 'epoch': 0.76} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4956/19440 [13:54:18<15:18:52, 3.81s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.895, 'learning_rate': 0.0002727338275053923, 'epoch': 0.76} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 25%|██████████████████▎ | 4957/19440 [13:54:22<15:25:41, 3.83s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.1744, 'learning_rate': 0.00027271500529507447, 'epoch': 0.77} + 26%|██████████████████▎ | 4958/19440 [13:54:25<15:25:04, 3.83s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 26%|██████████████████▎ | 4959/19440 [13:54:29<15:16:06, 3.80s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.758, 'learning_rate': 0.0002726961830847566, 'epoch': 0.77} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.8793, 'learning_rate': 0.0002726773608744388, 'epoch': 0.77} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 26%|██████████████████▎ | 4960/19440 [13:54:33<15:08:02, 3.76s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.8321, 'learning_rate': 0.00027265853866412096, 'epoch': 0.77} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 26%|██████████████████▎ | 4961/19440 [13:54:36<14:58:20, 3.72s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.8824, 'learning_rate': 0.00027263971645380316, 'epoch': 0.77} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 26%|██████████████████▍ | 4962/19440 [13:54:40<14:47:54, 3.68s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.6725, 'learning_rate': 0.0002726208942434853, 'epoch': 0.77} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 26%|██████████████████▍ | 4963/19440 [13:54:44<15:08:21, 3.76s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 26%|██████████████████▍ | 4964/19440 [13:54:48<14:52:31, 3.70s/it] + 26%|██████████████████▍ | 4964/19440 [13:54:48<14:52:31, 3.70s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.555, 'learning_rate': 0.00027258324982284965, 'epoch': 0.77} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 26%|██████████████████▍ | 4965/19440 [13:54:51<14:33:17, 3.62s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5654, 'learning_rate': 0.0002725644276125318, 'epoch': 0.77} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 26%|██████████████████▍ | 4966/19440 [13:54:54<14:15:55, 3.55s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 26%|██████████████████▍ | 4967/19440 [13:54:58<14:03:22, 3.50s/it] + 26%|██████████████████▍ | 4967/19440 [13:54:58<14:03:22, 3.50s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5612, 'learning_rate': 0.00027252678319189614, 'epoch': 0.77} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 26%|██████████████████▍ | 4968/19440 [13:55:01<13:54:53, 3.46s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3996, 'learning_rate': 0.00027250796098157834, 'epoch': 0.77} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 26%|██████████████████▍ | 4969/19440 [13:55:04<13:44:18, 3.42s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 26%|██████████████████▍ | 4970/19440 [13:55:08<13:34:16, 3.38s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5953, 'learning_rate': 0.0002724891387712605, 'epoch': 0.77} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4306, 'learning_rate': 0.00027247031656094263, 'epoch': 0.77} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 26%|██████████████████▍ | 4971/19440 [13:55:11<13:39:33, 3.40s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5167, 'learning_rate': 0.0002724514943506248, 'epoch': 0.77} + 26%|██████████████████▍ | 4972/19440 [13:55:14<13:25:13, 3.34s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 26%|██████████████████▍ | 4973/19440 [13:55:18<13:11:46, 3.28s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.316, 'learning_rate': 0.000272432672140307, 'epoch': 0.77} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.4702, 'learning_rate': 0.0002724138499299891, 'epoch': 0.77} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 26%|██████████████████▍ | 4974/19440 [13:55:21<13:00:41, 3.24s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 26%|██████████████████▍ | 4975/19440 [13:55:24<13:20:36, 3.32s/it] + 26%|██████████████████▍ | 4975/19440 [13:55:24<13:20:36, 3.32s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`...
+Could not estimate the number of tokens of the input, floating-point operations will not be computed
+{'loss': 6.2876, 'learning_rate': 0.0002723762055093535, 'epoch': 0.77}
+ 26%|██████████████████▍ | 4976/19440 [13:55:27<13:05:25, 3.26s/it]
+ 26%|██████████████████▍ | 4977/19440 [13:55:30<12:50:55, 3.20s/it]
+{'loss': 6.2382, 'learning_rate': 0.0002723385610887178, 'epoch': 0.77}
+ 26%|██████████████████▍ | 4978/19440 [13:55:33<12:38:42, 3.15s/it]
+ 26%|██████████████████▍ | 4979/19440 [13:55:36<12:30:47, 3.12s/it]
+{'loss': 6.2757, 'learning_rate': 0.00027230091666808216, 'epoch': 0.77}
+ 26%|██████████████████▍ | 4980/19440 [13:55:39<12:21:04, 3.08s/it]
+ 26%|██████████████████▍ | 4981/19440 [13:55:42<12:11:59, 3.04s/it]
+{'loss': 6.0067, 'learning_rate': 0.0002722820944577643, 'epoch': 0.77}
+{'loss': 6.1627, 'learning_rate': 0.0002722632722474465, 'epoch': 0.77}
+ 26%|██████████████████▍ | 4982/19440 [13:55:45<12:07:02, 3.02s/it]
+ 26%|██████████████████▍ | 4983/19440 [13:55:48<11:58:50, 2.98s/it]
+{'loss': 5.9756, 'learning_rate': 0.00027224445003712865, 'epoch': 0.77}
+{'loss': 6.0353, 'learning_rate': 0.00027222562782681085, 'epoch': 0.77}
+ 26%|██████████████████▍ | 4984/19440 [13:55:51<11:52:19, 2.96s/it]
+ 26%|██████████████████▍ | 4985/19440 [13:55:54<11:43:18, 2.92s/it]
+{'loss': 5.9127, 'learning_rate': 0.000272206805616493, 'epoch': 0.77}
+{'loss': 6.1197, 'learning_rate': 0.00027218798340617514, 'epoch': 0.77}
+ 26%|██████████████████▍ | 4986/19440 [13:55:57<11:36:14, 2.89s/it]
+{'loss': 5.8954, 'learning_rate': 0.00027216916119585734, 'epoch': 0.77}
+ 26%|██████████████████▍ | 4987/19440 [13:56:00<11:29:14, 2.86s/it]
+{'loss': 5.9576, 'learning_rate': 0.0002721503389855395, 'epoch': 0.77}
+ 26%|██████████████████▍ | 4988/19440 [13:56:03<11:58:10, 2.98s/it]
+{'loss': 5.7178, 'learning_rate': 0.0002721315167752217, 'epoch': 0.77}
+ 26%|██████████████████▍ | 4989/19440 [13:56:06<11:49:29, 2.95s/it]
+ 26%|██████████████████▍ | 4990/19440 [13:56:08<11:35:14, 2.89s/it]
+{'loss': 5.7248, 'learning_rate': 0.00027211269456490383, 'epoch': 0.77}
+{'loss': 5.5454, 'learning_rate': 0.00027209387235458603, 'epoch': 0.77}
+ 26%|██████████████████▍ | 4991/19440 [13:56:11<11:26:12, 2.85s/it]
+{'loss': 5.5028, 'learning_rate': 0.00027207505014426817, 'epoch': 0.77}
+ 26%|██████████████████▍ | 4992/19440 [13:56:14<11:14:15, 2.80s/it]
+ 26%|██████████████████▍ | 4993/19440 [13:56:17<11:06:59, 2.77s/it]
+{'loss': 5.5581, 'learning_rate': 0.0002720562279339503, 'epoch': 0.77}
+{'loss': 5.6783, 'learning_rate': 0.0002720374057236325, 'epoch': 0.77}
+ 26%|██████████████████▍ | 4994/19440 [13:56:19<10:55:10, 2.72s/it]
+{'loss': 5.4042, 'learning_rate': 0.00027201858351331466, 'epoch': 0.77}
+ 26%|██████████████████▌ | 4995/19440 [13:56:22<10:45:48, 2.68s/it]
+ 26%|██████████████████▌ | 4996/19440 [13:56:24<10:35:04, 2.64s/it]
+{'loss': 5.385, 'learning_rate': 0.00027199976130299686, 'epoch': 0.77}
+ 26%|██████████████████▌ | 4997/19440 [13:56:27<10:25:42, 2.60s/it]
+{'loss': 5.3179, 'learning_rate': 0.000271980939092679, 'epoch': 0.77}
+{'loss': 5.1026, 'learning_rate': 0.00027196211688236115, 'epoch': 0.77}
+ 26%|██████████████████▌ | 4998/19440 [13:56:29<10:14:31, 2.55s/it]
+`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
+{'loss': 4.8541, 'learning_rate': 0.0002719432946720433, 'epoch': 0.77} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 26%|██████████████████▌ | 4999/19440 [13:56:32<10:07:20, 2.52s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 26%|██████████████████▌ | 5000/19440 [13:56:34<10:20:30, 2.58s/it]The following columns in the evaluation set don't have a corresponding argument in `SpeechEncoderDecoderModel.forward` and have been ignored: length, lang. If length, lang are not expected by `SpeechEncoderDecoderModel.forward`, you can safely ignore this message. +***** Running Evaluation ***** + Num examples = 14760 + Batch size = 4 +{'loss': 4.673, 'learning_rate': 0.0002719244724617255, 'epoch': 0.77} + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +100%|█████████████████████████████████████████████████████████████████████████████| 3690/3690 [1:01:02<00:00, 1.05it/s] + +Configuration saved in ./checkpoint-5000/config.json +Model weights saved in ./checkpoint-5000/pytorch_model.bin +Feature extractor saved in ./checkpoint-5000/preprocessor_config.json +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +Feature extractor saved in ./preprocessor_config.json +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... 
+To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... 
+To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... 
+To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... 
+To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +Adding files tracked by Git LFS: ['wandb/run-20220503_172048-zotxt8wa/logs/debug-internal.log']. This may take a bit of time if the files are large. 
+huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +05/04/2022 08:20:36 - WARNING - huggingface_hub.repository - Adding files tracked by Git LFS: ['wandb/run-20220503_172048-zotxt8wa/logs/debug-internal.log']. This may take a bit of time if the files are large. +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible