diff --git "a/wandb/run-20220503_172048-zotxt8wa/files/output.log" "b/wandb/run-20220503_172048-zotxt8wa/files/output.log" --- "a/wandb/run-20220503_172048-zotxt8wa/files/output.log" +++ "b/wandb/run-20220503_172048-zotxt8wa/files/output.log" @@ -83332,5 +83332,10469 @@ To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... 
+To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.6209, 'learning_rate': 0.0002907278605692402, 'epoch': 0.62} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|█████████████▉ | 4001/19440 [11:58:52<4878:29:01, 1137.54s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▏ | 4002/19440 [11:58:56<3420:34:02, 797.64s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.2168, 'learning_rate': 0.0002907090383589223, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▏ | 4003/19440 [11:59:01<2399:48:48, 559.65s/it] + 21%|██████████████▏ | 4003/19440 [11:59:01<2399:48:48, 559.65s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▏ | 4004/19440 [11:59:05<1685:06:50, 393.00s/it] + 21%|██████████████▏ | 4004/19440 [11:59:05<1685:06:50, 393.00s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.9345, 'learning_rate': 0.0002906525717279688, 'epoch': 0.62} + 21%|██████████████▏ | 4005/19440 [11:59:09<1184:47:20, 276.34s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▍ | 4006/19440 [11:59:13<834:34:43, 194.67s/it] + 21%|██████████████▍ | 4006/19440 [11:59:13<834:34:43, 194.67s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▍ | 4007/19440 [11:59:17<589:14:29, 137.45s/it] + 21%|██████████████▍ | 4007/19440 [11:59:17<589:14:29, 137.45s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▋ | 4008/19440 [11:59:21<417:33:08, 97.41s/it] + 21%|██████████████▋ | 4008/19440 [11:59:21<417:33:08, 97.41s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▋ | 4009/19440 [11:59:25<297:16:14, 69.35s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6812, 'learning_rate': 0.0002905772828866975, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.7345, 'learning_rate': 0.00029055846067637964, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▋ | 4010/19440 [11:59:29<213:01:27, 49.70s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▋ | 4011/19440 [11:59:33<154:18:12, 36.00s/it] + 21%|██████████████▋ | 4011/19440 [11:59:33<154:18:12, 36.00s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6214, 'learning_rate': 0.000290520816255744, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▋ | 4012/19440 [11:59:36<112:44:53, 26.31s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5201, 'learning_rate': 0.00029050199404542613, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▊ | 4013/19440 [11:59:41<84:08:43, 19.64s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5148, 'learning_rate': 0.00029048317183510833, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▊ | 4014/19440 [11:59:44<63:35:35, 14.84s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.739, 'learning_rate': 0.0002904643496247905, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▊ | 4015/19440 [11:59:48<49:07:12, 11.46s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5955, 'learning_rate': 0.0002904455274144726, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▊ | 4016/19440 [11:59:51<39:03:05, 9.11s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4017/19440 [11:59:55<31:57:27, 7.46s/it] + 21%|██████████████▉ | 4017/19440 [11:59:55<31:57:27, 7.46s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.3319, 'learning_rate': 0.00029040788299383697, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4018/19440 [11:59:59<27:03:20, 6.32s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.3989, 'learning_rate': 0.00029038906078351917, 'epoch': 0.62} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4019/19440 [12:00:02<23:30:11, 5.49s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5491, 'learning_rate': 0.0002903702385732013, 'epoch': 0.62} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4020/19440 [12:00:06<20:59:33, 4.90s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4021/19440 [12:00:09<19:10:45, 4.48s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3435, 'learning_rate': 0.0002903514163628835, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5403, 'learning_rate': 0.00029033259415256566, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4022/19440 [12:00:13<17:45:23, 4.15s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3489, 'learning_rate': 0.0002903137719422478, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4023/19440 [12:00:16<16:44:54, 3.91s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4024/19440 [12:00:19<15:58:36, 3.73s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.1119, 'learning_rate': 0.00029029494973192995, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.3606, 'learning_rate': 0.00029027612752161215, 'epoch': 0.62} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4025/19440 [12:00:23<15:56:29, 3.72s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.0764, 'learning_rate': 0.00029025730531129435, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4026/19440 [12:00:26<15:25:16, 3.60s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4027/19440 [12:00:30<14:59:26, 3.50s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.2766, 'learning_rate': 0.0002902384831009765, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4028/19440 [12:00:33<14:37:15, 3.42s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.285, 'learning_rate': 0.0002902196608906587, 'epoch': 0.62} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.1086, 'learning_rate': 0.00029020083868034084, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4029/19440 [12:00:36<14:13:57, 3.32s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4030/19440 [12:00:39<13:54:00, 3.25s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1745, 'learning_rate': 0.000290182016470023, 'epoch': 0.62} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.2576, 'learning_rate': 0.00029016319425970513, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4031/19440 [12:00:42<13:42:07, 3.20s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4032/19440 [12:00:45<13:21:28, 3.12s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.047, 'learning_rate': 0.00029014437204938733, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉                                                       | 4033/19440 [12:00:48<12:59:58, 3.04s/it] + 21%|██████████████▉                                                       | 4033/19440 [12:00:48<12:59:58, 3.04s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.8287, 'learning_rate': 0.0002901067276287517, 'epoch': 0.62} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4034/19440 [12:00:51<12:45:17, 2.98s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4035/19440 [12:00:53<12:33:07, 2.93s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.0954, 'learning_rate': 0.0002900879054184339, 'epoch': 0.62} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.9639, 'learning_rate': 0.000290069083208116, 'epoch': 0.62} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4036/19440 [12:00:56<12:21:29, 2.89s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4037/19440 [12:00:59<12:12:13, 2.85s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.8288, 'learning_rate': 0.00029005026099779817, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.6629, 'learning_rate': 0.0002900314387874803, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4038/19440 [12:01:02<12:33:21, 2.93s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.6601, 'learning_rate': 0.0002900126165771625, 'epoch': 0.62} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4039/19440 [12:01:05<12:23:18, 2.90s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4040/19440 [12:01:08<12:13:18, 2.86s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 5.7512, 'learning_rate': 0.00028999379436684466, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.6249, 'learning_rate': 0.00028997497215652685, 'epoch': 0.62} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4041/19440 [12:01:10<11:59:38, 2.80s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4042/19440 [12:01:13<11:49:49, 2.77s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.4859, 'learning_rate': 0.000289956149946209, 'epoch': 0.62} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4043/19440 [12:01:16<11:38:04, 2.72s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.5689, 'learning_rate': 0.00028993732773589115, 'epoch': 0.62} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.563, 'learning_rate': 0.00028991850552557335, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4044/19440 [12:01:18<11:27:47, 2.68s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.2026, 'learning_rate': 0.0002898996833152555, 'epoch': 0.62} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4045/19440 [12:01:21<11:14:15, 2.63s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4046/19440 [12:01:23<11:02:04, 2.58s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.0934, 'learning_rate': 0.0002898808611049377, 'epoch': 0.62} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4047/19440 [12:01:26<10:51:46, 2.54s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.319, 'learning_rate': 0.00028986203889461984, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4048/19440 [12:01:28<10:43:00, 2.51s/it] + 21%|██████████████▉ | 4048/19440 [12:01:28<10:43:00, 2.51s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.8576, 'learning_rate': 0.0002898243944739842, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|██████████████▉ | 4049/19440 [12:01:30<10:30:14, 2.46s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4050/19440 [12:01:33<10:45:45, 2.52s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 4.5397, 'learning_rate': 0.0002898055722636663, 'epoch': 0.62} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4051/19440 [12:01:38<13:41:32, 3.20s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.5636, 'learning_rate': 0.00028978675005334847, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4052/19440 [12:01:42<15:07:51, 3.54s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.3149, 'learning_rate': 0.00028976792784303067, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4053/19440 [12:01:46<15:49:33, 3.70s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.2909, 'learning_rate': 0.00028974910563271287, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4054/19440 [12:01:50<16:09:03, 3.78s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.1745, 'learning_rate': 0.000289730283422395, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4055/19440 [12:01:54<16:13:49, 3.80s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.0612, 'learning_rate': 0.0002897114612120772, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4056/19440 [12:01:58<16:10:51, 3.79s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.0639, 'learning_rate': 0.00028969263900175936, 'epoch': 0.63} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4057/19440 [12:02:02<16:13:18, 3.80s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.9767, 'learning_rate': 0.0002896738167914415, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.7882, 'learning_rate': 0.00028965499458112365, 'epoch': 0.63} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4058/19440 [12:02:05<16:08:19, 3.78s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.8028, 'learning_rate': 0.00028963617237080585, 'epoch': 0.63} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4059/19440 [12:02:09<15:57:20, 3.73s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.9765, 'learning_rate': 0.000289617350160488, 'epoch': 0.63} + 21%|███████████████ | 4060/19440 [12:02:13<15:50:27, 3.71s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4061/19440 [12:02:16<15:39:49, 3.67s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5255, 'learning_rate': 0.0002895985279501702, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4062/19440 [12:02:20<15:26:55, 3.62s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5829, 'learning_rate': 0.0002895797057398524, 'epoch': 0.63} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4063/19440 [12:02:24<15:45:46, 3.69s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5827, 'learning_rate': 0.00028956088352953454, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4997, 'learning_rate': 0.0002895420613192167, 'epoch': 0.63} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4064/19440 [12:02:27<15:30:09, 3.63s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4065/19440 [12:02:31<15:12:40, 3.56s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.7945, 'learning_rate': 0.00028952323910889883, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4066/19440 [12:02:34<14:57:10, 3.50s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.8563, 'learning_rate': 0.00028950441689858103, 'epoch': 0.63} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.6859, 'learning_rate': 0.0002894855946882632, 'epoch': 0.63} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4067/19440 [12:02:37<14:55:10, 3.49s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4068/19440 [12:02:41<14:38:12, 3.43s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5684, 'learning_rate': 0.0002894667724779454, 'epoch': 0.63} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4069/19440 [12:02:44<14:21:05, 3.36s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.7205, 'learning_rate': 0.0002894479502676275, 'epoch': 0.63} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4070/19440 [12:02:47<14:13:34, 3.33s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.409, 'learning_rate': 0.00028942912805730967, 'epoch': 0.63} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4071/19440 [12:02:50<13:59:48, 3.28s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5879, 'learning_rate': 0.00028941030584699187, 'epoch': 0.63} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5773, 'learning_rate': 0.000289391483636674, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4072/19440 [12:02:53<13:49:26, 3.24s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4073/19440 [12:02:57<13:42:01, 3.21s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5505, 'learning_rate': 0.0002893726614263562, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.2269, 'learning_rate': 0.00028935383921603836, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4074/19440 [12:03:00<13:34:37, 3.18s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4075/19440 [12:03:03<14:05:04, 3.30s/it] + 21%|███████████████ | 4075/19440 [12:03:03<14:05:04, 3.30s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4076/19440 [12:03:06<13:53:14, 3.25s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3612, 'learning_rate': 0.0002893161947954027, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.2829, 'learning_rate': 0.00028929737258508485, 'epoch': 0.63} + 21%|███████████████ | 4077/19440 [12:03:09<13:36:33, 3.19s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4078/19440 [12:03:12<13:22:31, 3.13s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.2689, 'learning_rate': 0.00028927855037476705, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4079/19440 [12:03:15<13:12:41, 3.10s/it] + 21%|███████████████ | 4079/19440 [12:03:15<13:12:41, 3.10s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4080/19440 [12:03:18<13:04:49, 3.07s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1275, 'learning_rate': 0.0002892409059541314, 'epoch': 0.63} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4081/19440 [12:03:21<13:00:58, 3.05s/it] + 21%|███████████████ | 4081/19440 [12:03:21<13:00:58, 3.05s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4082/19440 [12:03:24<12:51:32, 3.01s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.2545, 'learning_rate': 0.00028920326153349574, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████ | 4083/19440 [12:03:27<12:45:42, 2.99s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.1477, 'learning_rate': 0.0002891844393231779, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4084/19440 [12:03:30<12:34:11, 2.95s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.0396, 'learning_rate': 0.00028916561711286, 'epoch': 0.63} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4085/19440 [12:03:33<12:27:23, 2.92s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.9855, 'learning_rate': 0.00028914679490254217, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.9789, 'learning_rate': 0.00028912797269222437, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4086/19440 [12:03:36<12:17:46, 2.88s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.7781, 'learning_rate': 0.00028910915048190657, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4087/19440 [12:03:39<12:11:14, 2.86s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.7223, 'learning_rate': 0.0002890903282715887, 'epoch': 0.63} + 21%|███████████████▏ | 4088/19440 [12:03:42<12:37:21, 2.96s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.7373, 'learning_rate': 0.0002890715060612709, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4089/19440 [12:03:45<12:29:03, 2.93s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.6686, 'learning_rate': 0.00028905268385095306, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4090/19440 [12:03:47<12:14:14, 2.87s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.6666, 'learning_rate': 0.0002890338616406352, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4091/19440 [12:03:50<12:00:30, 2.82s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.7747, 'learning_rate': 0.00028901503943031735, 'epoch': 0.63} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4092/19440 [12:03:53<11:52:12, 2.78s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4093/19440 [12:03:56<11:44:12, 2.75s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.5299, 'learning_rate': 0.00028899621721999955, 'epoch': 0.63} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.6523, 'learning_rate': 0.0002889773950096817, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4094/19440 [12:03:58<11:34:15, 2.71s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.2963, 'learning_rate': 0.0002889585727993639, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4095/19440 [12:04:01<11:24:40, 2.68s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.0663, 'learning_rate': 0.00028893975058904604, 'epoch': 0.63} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4096/19440 [12:04:03<11:12:56, 2.63s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.035, 'learning_rate': 0.0002889209283787282, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4097/19440 [12:04:06<11:01:06, 2.59s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.9599, 'learning_rate': 0.0002889021061684104, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4098/19440 [12:04:08<10:48:18, 2.54s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 4.586, 'learning_rate': 0.00028888328395809253, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4099/19440 [12:04:11<10:36:04, 2.49s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.6846, 'learning_rate': 0.00028886446174777473, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4100/19440 [12:04:13<10:50:12, 2.54s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.4493, 'learning_rate': 0.0002888456395374569, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4101/19440 [12:04:18<13:38:35, 3.20s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.2733, 'learning_rate': 0.0002888268173271391, 'epoch': 0.63} + 21%|███████████████▏ | 4102/19440 [12:04:22<15:06:43, 3.55s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.1913, 'learning_rate': 0.0002888079951168212, 'epoch': 0.63} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4103/19440 [12:04:26<15:50:34, 3.72s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.1478, 'learning_rate': 0.00028878917290650337, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4104/19440 [12:04:30<16:11:09, 3.80s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 7.1309, 'learning_rate': 0.00028877035069618557, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4105/19440 [12:04:34<16:15:03, 3.82s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.8023, 'learning_rate': 0.0002887515284858677, 'epoch': 0.63} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4106/19440 [12:04:38<16:15:05, 3.82s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.9508, 'learning_rate': 0.0002887327062755499, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4107/19440 [12:04:42<16:18:54, 3.83s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.0762, 'learning_rate': 0.00028871388406523206, 'epoch': 0.63} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4108/19440 [12:04:46<16:08:00, 3.79s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.9367, 'learning_rate': 0.00028869506185491426, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4109/19440 [12:04:49<15:55:37, 3.74s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.9071, 'learning_rate': 0.0002886762396445964, 'epoch': 0.63} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4110/19440 [12:04:53<15:45:00, 3.70s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.8528, 'learning_rate': 0.00028865741743427855, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4111/19440 [12:04:56<15:33:59, 3.66s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5382, 'learning_rate': 0.0002886385952239607, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4112/19440 [12:05:00<15:24:41, 3.62s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.7237, 'learning_rate': 0.0002886197730136429, 'epoch': 0.63} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4113/19440 [12:05:04<15:45:47, 3.70s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.7397, 'learning_rate': 0.0002886009508033251, 'epoch': 0.63} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4114/19440 [12:05:07<15:30:29, 3.64s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.7425, 'learning_rate': 0.00028858212859300724, 'epoch': 0.63} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4115/19440 [12:05:11<15:08:55, 3.56s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.7312, 'learning_rate': 0.00028856330638268944, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4116/19440 [12:05:14<14:52:03, 3.49s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.688, 'learning_rate': 0.0002885444841723716, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▏ | 4117/19440 [12:05:17<14:38:04, 3.44s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5864, 'learning_rate': 0.00028852566196205373, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4118/19440 [12:05:21<14:28:55, 3.40s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4728, 'learning_rate': 0.0002885068397517359, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4119/19440 [12:05:24<14:41:15, 3.45s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5923, 'learning_rate': 0.00028848801754141807, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4120/19440 [12:05:28<14:45:06, 3.47s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4526, 'learning_rate': 0.00028846919533110027, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4121/19440 [12:05:31<14:42:41, 3.46s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4897, 'learning_rate': 0.0002884503731207824, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4122/19440 [12:05:34<14:26:07, 3.39s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4963, 'learning_rate': 0.00028843155091046456, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4123/19440 [12:05:38<14:24:53, 3.39s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4132, 'learning_rate': 0.0002884127287001467, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4124/19440 [12:05:41<14:34:37, 3.43s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4125/19440 [12:05:45<14:57:53, 3.52s/it] + 21%|███████████████▎ | 4125/19440 [12:05:45<14:57:53, 3.52s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4254, 'learning_rate': 0.00028837508427951105, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4126/19440 [12:05:49<14:56:40, 3.51s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.2935, 'learning_rate': 0.00028835626206919325, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4127/19440 [12:05:52<14:56:44, 3.51s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3533, 'learning_rate': 0.0002883374398588754, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4128/19440 [12:05:56<14:57:27, 3.52s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4129/19440 [12:05:59<14:53:30, 3.50s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.2586, 'learning_rate': 0.0002883186176485576, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.2003, 'learning_rate': 0.00028829979543823974, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4130/19440 [12:06:02<14:46:25, 3.47s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.047, 'learning_rate': 0.0002882809732279219, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4131/19440 [12:06:06<14:27:52, 3.40s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4132/19440 [12:06:09<14:22:42, 3.38s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.1578, 'learning_rate': 0.0002882621510176041, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.0584, 'learning_rate': 0.00028824332880728623, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4133/19440 [12:06:12<14:07:16, 3.32s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4134/19440 [12:06:15<13:54:42, 3.27s/it] + 21%|███████████████▎ | 4134/19440 [12:06:15<13:54:42, 3.27s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.299, 'learning_rate': 0.0002882056843866506, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4135/19440 [12:06:19<13:51:06, 3.26s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 5.9251, 'learning_rate': 0.0002881868621763328, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4136/19440 [12:06:22<13:46:32, 3.24s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.0811, 'learning_rate': 0.0002881680399660149, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4137/19440 [12:06:25<13:41:29, 3.22s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.9194, 'learning_rate': 0.00028814921775569707, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4138/19440 [12:06:29<14:05:40, 3.32s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.6849, 'learning_rate': 0.00028813039554537927, 'epoch': 0.64} + 21%|███████████████▎ | 4139/19440 [12:06:32<13:52:52, 3.27s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.9783, 'learning_rate': 0.0002881115733350614, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4140/19440 [12:06:35<13:35:06, 3.20s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.7117, 'learning_rate': 0.0002880927511247436, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4141/19440 [12:06:38<13:49:34, 3.25s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 5.6391, 'learning_rate': 0.00028807392891442576, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4142/19440 [12:06:41<13:15:41, 3.12s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.3183, 'learning_rate': 0.00028805510670410796, 'epoch': 0.64} + 21%|███████████████▎ | 4143/19440 [12:06:44<13:00:33, 3.06s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.4163, 'learning_rate': 0.0002880362844937901, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4144/19440 [12:06:47<12:51:09, 3.02s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4145/19440 [12:06:50<12:41:03, 2.99s/it] + 21%|███████████████▎ | 4145/19440 [12:06:50<12:41:03, 2.99s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.3176, 'learning_rate': 0.0002879986400731544, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4146/19440 [12:06:52<12:26:58, 2.93s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.0132, 'learning_rate': 0.0002879798178628366, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4147/19440 [12:06:55<12:15:19, 2.88s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 4.8872, 'learning_rate': 0.0002879609956525188, 'epoch': 0.64} + 21%|███████████████▎ | 4148/19440 [12:06:58<11:54:35, 2.80s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 4.6997, 'learning_rate': 0.00028794217344220094, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4149/19440 [12:07:00<11:41:05, 2.75s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.7156, 'learning_rate': 0.0002879233512318831, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4150/19440 [12:07:03<11:37:38, 2.74s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.6921, 'learning_rate': 0.00028790452902156523, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▎ | 4151/19440 [12:07:08<14:25:07, 3.40s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.4635, 'learning_rate': 0.00028788570681124743, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4152/19440 [12:07:13<15:48:41, 3.72s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 7.4226, 'learning_rate': 0.0002878668846009296, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4153/19440 [12:07:17<16:37:21, 3.91s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4154/19440 [12:07:21<17:09:48, 4.04s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.3193, 'learning_rate': 0.0002878480623906118, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4155/19440 [12:07:25<17:15:35, 4.07s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.3413, 'learning_rate': 0.0002878292401802939, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4156/19440 [12:07:30<17:19:08, 4.08s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.0342, 'learning_rate': 0.0002878104179699761, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4157/19440 [12:07:34<17:21:52, 4.09s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.0477, 'learning_rate': 0.00028779159575965826, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4158/19440 [12:07:38<17:12:35, 4.05s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.1651, 'learning_rate': 0.0002877727735493404, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4159/19440 [12:07:42<17:02:55, 4.02s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.688, 'learning_rate': 0.0002877539513390226, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4160/19440 [12:07:45<16:53:18, 3.98s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.8952, 'learning_rate': 0.00028773512912870475, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6562, 'learning_rate': 0.00028771630691838695, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4161/19440 [12:07:49<16:38:40, 3.92s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6186, 'learning_rate': 0.0002876974847080691, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4162/19440 [12:07:53<16:32:45, 3.90s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5378, 'learning_rate': 0.0002876786624977513, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4163/19440 [12:07:57<16:54:56, 3.99s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.837, 'learning_rate': 0.00028765984028743344, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4164/19440 [12:08:01<16:45:39, 3.95s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.7261, 'learning_rate': 0.0002876410180771156, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4165/19440 [12:08:05<16:33:55, 3.90s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6076, 'learning_rate': 0.0002876221958667978, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4166/19440 [12:08:09<16:28:00, 3.88s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4167/19440 [12:08:12<16:17:24, 3.84s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4168/19440 [12:08:16<16:03:26, 3.79s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.8272, 'learning_rate': 0.00028758455144616213, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5541, 'learning_rate': 0.0002875657292358443, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4169/19440 [12:08:20<15:49:50, 3.73s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4321, 'learning_rate': 0.0002875469070255265, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4170/19440 [12:08:23<15:41:50, 3.70s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.627, 'learning_rate': 0.0002875280848152086, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4171/19440 [12:08:27<15:34:01, 3.67s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4172/19440 [12:08:31<15:27:55, 3.65s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.555, 'learning_rate': 0.00028750926260489077, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4173/19440 [12:08:34<15:13:57, 3.59s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5621, 'learning_rate': 0.00028749044039457297, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3367, 'learning_rate': 0.0002874716181842551, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4174/19440 [12:08:37<15:01:41, 3.54s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6232, 'learning_rate': 0.0002874527959739373, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4175/19440 [12:08:41<15:30:15, 3.66s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4176/19440 [12:08:45<15:13:30, 3.59s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4695, 'learning_rate': 0.00028743397376361946, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4177/19440 [12:08:48<14:56:59, 3.53s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.2948, 'learning_rate': 0.0002874151515533016, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.2307, 'learning_rate': 0.00028739632934298375, 'epoch': 0.64} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4178/19440 [12:08:52<14:45:28, 3.48s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 21%|███████████████▍ | 4179/19440 [12:08:55<14:33:32, 3.43s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.564, 'learning_rate': 0.00028737750713266595, 'epoch': 0.64} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.3109, 'learning_rate': 0.0002873586849223481, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▍ | 4180/19440 [12:08:58<14:10:42, 3.34s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▍ | 4181/19440 [12:09:01<13:59:19, 3.30s/it] + 22%|███████████████▍ | 4181/19440 [12:09:01<13:59:19, 3.30s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▍ | 4182/19440 [12:09:04<13:33:56, 3.20s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1662, 'learning_rate': 0.0002873210405017125, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▍ | 4183/19440 [12:09:07<13:14:54, 3.13s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.9465, 'learning_rate': 0.00028730221829139464, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.2281, 'learning_rate': 0.0002872833960810768, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▍ | 4184/19440 [12:09:10<12:58:07, 3.06s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4185/19440 [12:09:13<12:43:02, 3.00s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.7681, 'learning_rate': 0.00028726457387075893, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1946, 'learning_rate': 0.00028724575166044113, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4186/19440 [12:09:16<12:32:57, 2.96s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4187/19440 [12:09:19<12:22:43, 2.92s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.8592, 'learning_rate': 0.0002872269294501233, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.7909, 'learning_rate': 0.0002872081072398055, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4188/19440 [12:09:22<12:49:58, 3.03s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4189/19440 [12:09:25<12:37:23, 2.98s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.9072, 'learning_rate': 0.0002871892850294876, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4190/19440 [12:09:28<12:21:10, 2.92s/it] + 22%|███████████████▌ | 4190/19440 [12:09:28<12:21:10, 2.92s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4191/19440 [12:09:30<12:08:01, 2.86s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.7992, 'learning_rate': 0.00028715164060885196, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4192/19440 [12:09:33<11:53:42, 2.81s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.5843, 'learning_rate': 0.0002871328183985341, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4193/19440 [12:09:36<11:45:46, 2.78s/it] + 22%|███████████████▌ | 4193/19440 [12:09:36<11:45:46, 2.78s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4194/19440 [12:09:38<11:35:57, 2.74s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.5064, 'learning_rate': 0.00028709517397789845, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4195/19440 [12:09:41<11:24:58, 2.70s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.32, 'learning_rate': 0.00028707635176758065, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.984, 'learning_rate': 0.0002870575295572628, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4196/19440 [12:09:43<11:14:17, 2.65s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 5.0051, 'learning_rate': 0.000287038707346945, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4197/19440 [12:09:46<11:04:29, 2.62s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4198/19440 [12:09:48<10:51:06, 2.56s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.9895, 'learning_rate': 0.00028701988513662714, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4199/19440 [12:09:51<10:40:39, 2.52s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.0347, 'learning_rate': 0.0002870010629263093, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.4387, 'learning_rate': 0.0002869822407159915, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4200/19440 [12:09:54<10:55:43, 2.58s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.2514, 'learning_rate': 0.00028696341850567363, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4201/19440 [12:09:58<13:43:10, 3.24s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4202/19440 [12:10:03<14:59:56, 3.54s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.2392, 'learning_rate': 0.00028694459629535583, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.1267, 'learning_rate': 0.000286925774085038, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4203/19440 [12:10:07<15:39:08, 3.70s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4204/19440 [12:10:11<15:59:55, 3.78s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.0179, 'learning_rate': 0.0002869069518747201, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4205/19440 [12:10:15<16:08:02, 3.81s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.0629, 'learning_rate': 0.00028688812966440227, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4206/19440 [12:10:18<16:12:30, 3.83s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.0559, 'learning_rate': 0.00028686930745408447, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.868, 'learning_rate': 0.0002868504852437666, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4207/19440 [12:10:22<16:13:01, 3.83s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4208/19440 [12:10:26<16:02:04, 3.79s/it] + 22%|███████████████▌ | 4208/19440 [12:10:26<16:02:04, 3.79s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4209/19440 [12:10:30<15:53:15, 3.76s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.9825, 'learning_rate': 0.000286812840823131, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4210/19440 [12:10:33<15:45:43, 3.73s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.6333, 'learning_rate': 0.00028679401861281316, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4211/19440 [12:10:37<15:34:45, 3.68s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.7773, 'learning_rate': 0.0002867751964024953, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.8487, 'learning_rate': 0.00028675637419217745, 'epoch': 0.65} + 22%|███████████████▌ | 4212/19440 [12:10:40<15:22:25, 3.63s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4213/19440 [12:10:44<15:48:13, 3.74s/it] + 22%|███████████████▌ | 4213/19440 [12:10:44<15:48:13, 3.74s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4214/19440 [12:10:48<15:34:00, 3.68s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5263, 'learning_rate': 0.0002867187297715418, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4215/19440 [12:10:51<15:18:02, 3.62s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5562, 'learning_rate': 0.000286699907561224, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.478, 'learning_rate': 0.0002866810853509062, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4216/19440 [12:10:55<14:59:26, 3.54s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4217/19440 [12:10:58<14:43:41, 3.48s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.7222, 'learning_rate': 0.00028666226314058834, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▌ | 4218/19440 [12:11:01<14:33:00, 3.44s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.596, 'learning_rate': 0.0002866434409302705, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3337, 'learning_rate': 0.00028662461871995263, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4219/19440 [12:11:05<14:21:55, 3.40s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4220/19440 [12:11:08<14:11:59, 3.36s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6255, 'learning_rate': 0.00028660579650963483, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4221/19440 [12:11:11<14:04:41, 3.33s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.4429, 'learning_rate': 0.000286586974299317, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4222/19440 [12:11:14<13:56:25, 3.30s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.2977, 'learning_rate': 0.0002865681520889992, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4223/19440 [12:11:18<13:49:35, 3.27s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5354, 'learning_rate': 0.0002865493298786813, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4683, 'learning_rate': 0.0002865305076683635, 'epoch': 0.65} + 22%|███████████████▋ | 4224/19440 [12:11:21<13:43:46, 3.25s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4225/19440 [12:11:24<14:08:36, 3.35s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3014, 'learning_rate': 0.00028651168545804567, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4226/19440 [12:11:28<13:59:48, 3.31s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.3245, 'learning_rate': 0.0002864928632477278, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4227/19440 [12:11:31<13:41:17, 3.24s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.2822, 'learning_rate': 0.00028647404103741, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4228/19440 [12:11:34<13:28:33, 3.19s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.197, 'learning_rate': 0.00028645521882709216, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4229/19440 [12:11:37<13:19:41, 3.15s/it] + 22%|███████████████▋ | 4229/19440 [12:11:37<13:19:41, 3.15s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4230/19440 [12:11:40<13:09:26, 3.11s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3308, 'learning_rate': 0.0002864175744064565, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4231/19440 [12:11:43<13:01:24, 3.08s/it] + 22%|███████████████▋ | 4231/19440 [12:11:43<13:01:24, 3.08s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4232/19440 [12:11:46<12:56:09, 3.06s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.0691, 'learning_rate': 0.0002863799299858208, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4233/19440 [12:11:49<12:45:27, 3.02s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.0584, 'learning_rate': 0.000286361107775503, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4234/19440 [12:11:52<12:39:32, 3.00s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.9509, 'learning_rate': 0.0002863422855651852, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4235/19440 [12:11:55<12:32:29, 2.97s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.0422, 'learning_rate': 0.00028632346335486734, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4236/19440 [12:11:58<12:26:13, 2.94s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.7829, 'learning_rate': 0.00028630464114454954, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4237/19440 [12:12:00<12:19:53, 2.92s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.9136, 'learning_rate': 0.0002862858189342317, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4238/19440 [12:12:04<12:42:36, 3.01s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.6418, 'learning_rate': 0.0002862669967239138, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4239/19440 [12:12:07<12:46:01, 3.02s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.7955, 'learning_rate': 0.00028624817451359597, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4240/19440 [12:12:10<12:29:15, 2.96s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.7788, 'learning_rate': 0.00028622935230327817, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4241/19440 [12:12:12<12:13:36, 2.90s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.4462, 'learning_rate': 0.0002862105300929603, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4242/19440 [12:12:15<11:57:35, 2.83s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.5667, 'learning_rate': 0.0002861917078826425, 'epoch': 0.65} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4243/19440 [12:12:18<11:45:00, 2.78s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.2706, 'learning_rate': 0.0002861728856723247, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4244/19440 [12:12:20<11:31:27, 2.73s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.3584, 'learning_rate': 0.00028615406346200686, 'epoch': 0.65} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4245/19440 [12:12:23<11:20:08, 2.69s/it] + 22%|███████████████▋ | 4245/19440 [12:12:23<11:20:08, 2.69s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.9434, 'learning_rate': 0.00028611641904137115, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4246/19440 [12:12:25<11:10:14, 2.65s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4247/19440 [12:12:28<10:58:37, 2.60s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +{'loss': 5.0573, 'learning_rate': 0.00028609759683105335, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4248/19440 [12:12:30<10:49:03, 2.56s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.0695, 'learning_rate': 0.0002860787746207355, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4249/19440 [12:12:33<10:39:52, 2.53s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.7565, 'learning_rate': 0.0002860599524104177, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4250/19440 [12:12:36<11:01:07, 2.61s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 4.5958, 'learning_rate': 0.00028604113020009984, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4251/19440 [12:12:40<13:42:56, 3.25s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.5248, 'learning_rate': 0.00028602230798978204, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▋ | 4252/19440 [12:12:45<14:58:50, 3.55s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.1209, 'learning_rate': 0.0002860034857794642, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4253/19440 [12:12:49<15:38:38, 3.71s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.1982, 'learning_rate': 0.00028598466356914633, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4254/19440 [12:12:53<15:58:10, 3.79s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.1273, 'learning_rate': 0.00028596584135882853, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.2013, 'learning_rate': 0.0002859470191485107, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4255/19440 [12:12:57<16:07:42, 3.82s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.1029, 'learning_rate': 0.0002859281969381929, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4256/19440 [12:13:00<16:11:40, 3.84s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.8648, 'learning_rate': 0.000285909374727875, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4257/19440 [12:13:04<16:16:51, 3.86s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.8455, 'learning_rate': 0.00028589055251755717, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4258/19440 [12:13:08<16:09:27, 3.83s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4259/19440 [12:13:12<16:01:15, 3.80s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.9413, 'learning_rate': 0.00028587173030723937, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4260/19440 [12:13:15<15:50:05, 3.76s/it] + 22%|███████████████▊ | 4260/19440 [12:13:15<15:50:05, 3.76s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.9434, 'learning_rate': 0.0002858340858866037, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4261/19440 [12:13:19<15:39:22, 3.71s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.8853, 'learning_rate': 0.00028581526367628586, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4262/19440 [12:13:23<15:30:34, 3.68s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.6908, 'learning_rate': 0.00028579644146596806, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4263/19440 [12:13:27<15:52:03, 3.76s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.6282, 'learning_rate': 0.0002857776192556502, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4264/19440 [12:13:30<15:35:52, 3.70s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.7923, 'learning_rate': 0.00028575879704533235, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4265/19440 [12:13:34<15:19:37, 3.64s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.776, 'learning_rate': 0.0002857399748350145, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4266/19440 [12:13:37<15:03:06, 3.57s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4515, 'learning_rate': 0.0002857211526246967, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4267/19440 [12:13:41<14:53:25, 3.53s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6726, 'learning_rate': 0.0002857023304143789, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4268/19440 [12:13:44<14:44:11, 3.50s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.6717, 'learning_rate': 0.00028568350820406104, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4269/19440 [12:13:47<14:33:13, 3.45s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3182, 'learning_rate': 0.00028566468599374324, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4270/19440 [12:13:51<14:20:13, 3.40s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6537, 'learning_rate': 0.0002856458637834254, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4271/19440 [12:13:54<14:10:52, 3.37s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4246, 'learning_rate': 0.0002856270415731075, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4272/19440 [12:13:57<14:01:56, 3.33s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4273/19440 [12:14:00<13:53:51, 3.30s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4053, 'learning_rate': 0.00028560821936278967, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.2072, 'learning_rate': 0.00028558939715247187, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4274/19440 [12:14:04<13:45:01, 3.26s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.3586, 'learning_rate': 0.000285570574942154, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4275/19440 [12:14:07<14:10:50, 3.37s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.3332, 'learning_rate': 0.0002855517527318362, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4276/19440 [12:14:10<13:55:46, 3.31s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.3546, 'learning_rate': 0.0002855329305215184, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4277/19440 [12:14:13<13:39:57, 3.24s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4278/19440 [12:14:16<13:26:47, 3.19s/it] + 22%|███████████████▊ | 4278/19440 [12:14:16<13:26:47, 3.19s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.0989, 'learning_rate': 0.0002854952861008827, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4279/19440 [12:14:20<13:19:53, 3.17s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4280/19440 [12:14:23<13:09:10, 3.12s/it] + 22%|███████████████▊ | 4280/19440 [12:14:23<13:09:10, 3.12s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.2311, 'learning_rate': 0.00028545764168024705, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4281/19440 [12:14:26<13:00:06, 3.09s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4282/19440 [12:14:29<12:52:53, 3.06s/it] + 22%|███████████████▊ | 4282/19440 [12:14:29<12:52:53, 3.06s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1158, 'learning_rate': 0.0002854199972596114, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4283/19440 [12:14:32<12:45:46, 3.03s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.0874, 'learning_rate': 0.00028540117504929354, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4284/19440 [12:14:35<12:39:16, 3.01s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.1191, 'learning_rate': 0.0002853823528389757, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4285/19440 [12:14:37<12:30:22, 2.97s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▊ | 4286/19440 [12:14:40<12:23:01, 2.94s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.0425, 'learning_rate': 0.0002853635306286579, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.9252, 'learning_rate': 0.00028534470841834003, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4287/19440 [12:14:43<12:24:31, 2.95s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4288/19440 [12:14:47<12:51:30, 3.06s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.7023, 'learning_rate': 0.00028532588620802223, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.7283, 'learning_rate': 0.0002853070639977044, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4289/19440 [12:14:49<12:41:38, 3.02s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4290/19440 [12:14:52<12:26:03, 2.95s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.9422, 'learning_rate': 0.0002852882417873866, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4291/19440 [12:14:55<12:13:29, 2.91s/it] + 22%|███████████████▉ | 4291/19440 [12:14:55<12:13:29, 2.91s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.527, 'learning_rate': 0.00028525059736675087, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4292/19440 [12:14:58<12:01:15, 2.86s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4293/19440 [12:15:01<11:49:36, 2.81s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.4352, 'learning_rate': 0.000285231775156433, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4294/19440 [12:15:03<11:36:42, 2.76s/it] + 22%|███████████████▉ | 4294/19440 [12:15:03<11:36:42, 2.76s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.4627, 'learning_rate': 0.0002851941307357974, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4295/19440 [12:15:06<11:28:41, 2.73s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4296/19440 [12:15:09<11:31:27, 2.74s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.2579, 'learning_rate': 0.00028517530852547956, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4297/19440 [12:15:11<11:13:00, 2.67s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.2978, 'learning_rate': 0.00028515648631516176, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.0348, 'learning_rate': 0.0002851376641048439, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4298/19440 [12:15:14<11:00:53, 2.62s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.5885, 'learning_rate': 0.00028511884189452605, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4299/19440 [12:15:16<10:47:34, 2.57s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4300/19440 [12:15:19<11:01:33, 2.62s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.4657, 'learning_rate': 0.0002851000196842082, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.382, 'learning_rate': 0.0002850811974738904, 'epoch': 0.66} + 22%|███████████████▉ | 4301/19440 [12:15:24<13:42:04, 3.26s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.3075, 'learning_rate': 0.0002850623752635726, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4302/19440 [12:15:28<15:02:10, 3.58s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.0618, 'learning_rate': 0.00028504355305325474, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4303/19440 [12:15:32<15:41:31, 3.73s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.0622, 'learning_rate': 0.00028502473084293694, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4304/19440 [12:15:36<15:59:10, 3.80s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.8395, 'learning_rate': 0.0002850059086326191, 'epoch': 0.66} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4305/19440 [12:15:40<16:04:16, 3.82s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4306/19440 [12:15:44<16:06:34, 3.83s/it] + 22%|███████████████▉ | 4306/19440 [12:15:44<16:06:34, 3.83s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4307/19440 [12:15:48<16:13:45, 3.86s/it] + 22%|███████████████▉ | 4307/19440 [12:15:48<16:13:45, 3.86s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4308/19440 [12:15:51<16:00:05, 3.81s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.8225, 'learning_rate': 0.0002849494420016656, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4309/19440 [12:15:55<15:48:11, 3.76s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.0129, 'learning_rate': 0.0002849306197913477, 'epoch': 0.66} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.863, 'learning_rate': 0.0002849117975810299, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4310/19440 [12:15:59<15:40:10, 3.73s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.764, 'learning_rate': 0.00028489297537071206, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4311/19440 [12:16:02<15:31:58, 3.70s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4312/19440 [12:16:06<15:21:16, 3.65s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.8436, 'learning_rate': 0.0002848741531603942, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4313/19440 [12:16:10<15:40:29, 3.73s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.523, 'learning_rate': 0.0002848553309500764, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4314/19440 [12:16:13<15:23:21, 3.66s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.7952, 'learning_rate': 0.00028483650873975855, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.6414, 'learning_rate': 0.00028481768652944075, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4315/19440 [12:16:17<15:06:31, 3.60s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4316/19440 [12:16:20<14:50:16, 3.53s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.5973, 'learning_rate': 0.0002847988643191229, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4317/19440 [12:16:23<14:38:03, 3.48s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5193, 'learning_rate': 0.0002847800421088051, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4631, 'learning_rate': 0.00028476121989848724, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4318/19440 [12:16:27<14:22:21, 3.42s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|███████████████▉ | 4319/19440 [12:16:30<14:10:50, 3.38s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4428, 'learning_rate': 0.0002847423976881694, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4392, 'learning_rate': 0.0002847235754778516, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4320/19440 [12:16:33<14:06:04, 3.36s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4321/19440 [12:16:36<13:59:54, 3.33s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5841, 'learning_rate': 0.00028470475326753373, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4322/19440 [12:16:40<13:55:05, 3.31s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4355, 'learning_rate': 0.00028468593105721593, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4323/19440 [12:16:43<13:45:28, 3.28s/it] + 22%|████████████████ | 4323/19440 [12:16:43<13:45:28, 3.28s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4324/19440 [12:16:46<13:40:25, 3.26s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.3263, 'learning_rate': 0.0002846482866365803, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4325/19440 [12:16:50<14:09:04, 3.37s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4187, 'learning_rate': 0.0002846294644262624, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4326/19440 [12:16:53<13:55:36, 3.32s/it] + 22%|████████████████ | 4326/19440 [12:16:53<13:55:36, 3.32s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4327/19440 [12:16:56<13:38:51, 3.25s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.3352, 'learning_rate': 0.0002845918200056267, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3707, 'learning_rate': 0.0002845729977953089, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4328/19440 [12:16:59<13:29:13, 3.21s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4329/19440 [12:17:02<13:18:03, 3.17s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.3534, 'learning_rate': 0.0002845541755849911, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4330/19440 [12:17:05<13:12:17, 3.15s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.1429, 'learning_rate': 0.00028453535337467326, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4331/19440 [12:17:08<13:04:57, 3.12s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.2422, 'learning_rate': 0.00028451653116435546, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4332/19440 [12:17:11<12:52:03, 3.07s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1972, 'learning_rate': 0.0002844977089540376, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4333/19440 [12:17:14<12:43:06, 3.03s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.1686, 'learning_rate': 0.00028447888674371975, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.9584, 'learning_rate': 0.0002844600645334019, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4334/19440 [12:17:17<12:31:55, 2.99s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4335/19440 [12:17:20<12:24:36, 2.96s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.8531, 'learning_rate': 0.0002844412423230841, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1043, 'learning_rate': 0.00028442242011276624, 'epoch': 0.67} + 22%|████████████████ | 4336/19440 [12:17:23<12:18:02, 2.93s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4337/19440 [12:17:26<12:12:29, 2.91s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.9298, 'learning_rate': 0.00028440359790244844, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.0258, 'learning_rate': 0.0002843847756921306, 'epoch': 0.67} + 22%|████████████████ | 4338/19440 [12:17:29<12:38:56, 3.02s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4339/19440 [12:17:32<12:28:46, 2.98s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.9781, 'learning_rate': 0.0002843659534818128, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4340/19440 [12:17:35<12:14:40, 2.92s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.7355, 'learning_rate': 0.00028434713127149493, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 5.5455, 'learning_rate': 0.0002843283090611771, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4341/19440 [12:17:37<12:01:05, 2.87s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4342/19440 [12:17:40<11:50:10, 2.82s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.7042, 'learning_rate': 0.0002843094868508593, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4343/19440 [12:17:43<11:42:14, 2.79s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 5.5147, 'learning_rate': 0.0002842906646405414, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.6026, 'learning_rate': 0.0002842718424302236, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4344/19440 [12:17:46<11:33:30, 2.76s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4345/19440 [12:17:48<11:24:53, 2.72s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 5.4591, 'learning_rate': 0.00028425302021990576, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4346/19440 [12:17:51<11:12:45, 2.67s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.1803, 'learning_rate': 0.0002842341980095879, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.0917, 'learning_rate': 0.0002842153757992701, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4347/19440 [12:17:53<11:00:40, 2.63s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4348/19440 [12:17:56<10:49:15, 2.58s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.0735, 'learning_rate': 0.00028419655358895225, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4349/19440 [12:17:58<10:36:03, 2.53s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.9516, 'learning_rate': 0.00028417773137863445, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4350/19440 [12:18:01<10:57:47, 2.62s/it] + 22%|████████████████ | 4350/19440 [12:18:01<10:57:47, 2.62s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4351/19440 [12:18:06<13:35:22, 3.24s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.4178, 'learning_rate': 0.0002841400869579988, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4352/19440 [12:18:10<15:05:45, 3.60s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.3719, 'learning_rate': 0.00028412126474768094, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████ | 4353/19440 [12:18:14<15:41:43, 3.75s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.1603, 'learning_rate': 0.0002841024425373631, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████▏ | 4354/19440 [12:18:18<16:00:30, 3.82s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 7.3677, 'learning_rate': 0.0002840836203270453, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████▏ | 4355/19440 [12:18:22<16:05:13, 3.84s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.0087, 'learning_rate': 0.00028406479811672743, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████▏ | 4356/19440 [12:18:26<16:02:23, 3.83s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.029, 'learning_rate': 0.00028404597590640963, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████▏ | 4357/19440 [12:18:30<16:11:25, 3.86s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.0173, 'learning_rate': 0.0002840271536960918, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████▏ | 4358/19440 [12:18:34<16:03:47, 3.83s/it] + 22%|████████████████▏ | 4358/19440 [12:18:34<16:03:47, 3.83s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████▏ | 4359/19440 [12:18:37<15:54:46, 3.80s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.7304, 'learning_rate': 0.0002839895092754561, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████▏ | 4360/19440 [12:18:41<15:42:59, 3.75s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.9642, 'learning_rate': 0.00028397068706513827, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████▏ | 4361/19440 [12:18:45<15:35:22, 3.72s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.8203, 'learning_rate': 0.0002839518648548204, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████▏ | 4362/19440 [12:18:48<15:23:35, 3.68s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.8348, 'learning_rate': 0.0002839330426445026, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████▏ | 4363/19440 [12:18:52<15:42:56, 3.75s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.5467, 'learning_rate': 0.0002839142204341848, 'epoch': 0.67} +{'loss': 6.6483, 'learning_rate': 0.00028389539822386696, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████▏ | 4364/19440 [12:18:56<15:29:14, 3.70s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6724, 'learning_rate': 0.00028387657601354916, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████▏ | 4365/19440 [12:18:59<15:09:45, 3.62s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.7574, 'learning_rate': 0.0002838577538032313, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████▏ | 4366/19440 [12:19:03<14:54:06, 3.56s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6572, 'learning_rate': 0.00028383893159291345, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████▏ | 4367/19440 [12:19:06<14:39:45, 3.50s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5554, 'learning_rate': 0.0002838201093825956, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████▏ | 4368/19440 [12:19:09<14:26:33, 3.45s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.5292, 'learning_rate': 0.0002838012871722778, 'epoch': 0.67} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████▏ | 4369/19440 [12:19:13<14:13:15, 3.40s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.6218, 'learning_rate': 0.00028378246496195994, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████▏ | 4370/19440 [12:19:16<14:06:07, 3.37s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.3892, 'learning_rate': 0.00028376364275164214, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████▏ | 4371/19440 [12:19:19<13:57:04, 3.33s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████▏ | 4372/19440 [12:19:22<13:51:46, 3.31s/it] + 22%|████████████████▏ | 4372/19440 [12:19:22<13:51:46, 3.31s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4826, 'learning_rate': 0.00028372599833100643, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████▏ | 4373/19440 [12:19:26<13:44:48, 3.28s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4973, 'learning_rate': 0.00028370717612068863, 'epoch': 0.67} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 22%|████████████████▏ | 4374/19440 [12:19:29<13:36:46, 3.25s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.4835, 'learning_rate': 0.0002836883539103708, 'epoch': 0.68} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▏ | 4375/19440 [12:19:32<14:00:38, 3.35s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.4639, 'learning_rate': 0.000283669531700053, 'epoch': 0.68} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▏ | 4376/19440 [12:19:35<13:48:52, 3.30s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1843, 'learning_rate': 0.0002836507094897351, 'epoch': 0.68} + 23%|████████████████▏ | 4377/19440 [12:19:39<13:32:16, 3.24s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.1794, 'learning_rate': 0.0002836318872794173, 'epoch': 0.68} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▏ | 4378/19440 [12:19:42<13:14:37, 3.17s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1124, 'learning_rate': 0.00028361306506909946, 'epoch': 0.68} + 23%|████████████████▏ | 4379/19440 [12:19:45<13:06:27, 3.13s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.097, 'learning_rate': 0.0002835942428587816, 'epoch': 0.68} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▏ | 4380/19440 [12:19:48<12:54:51, 3.09s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▏ | 4381/19440 [12:19:51<12:47:05, 3.06s/it] + 23%|████████████████▏ | 4381/19440 [12:19:51<12:47:05, 3.06s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.0145, 'learning_rate': 0.00028355659843814596, 'epoch': 0.68}
+ 23%|████████████████▏ | 4382/19440 [12:19:54<12:41:50, 3.04s/it]
+ 23%|████████████████▏ | 4383/19440 [12:19:57<12:35:07, 3.01s/it]
+{'loss': 6.0747, 'learning_rate': 0.00028353777622782815, 'epoch': 0.68}
+{'loss': 6.064, 'learning_rate': 0.0002835189540175103, 'epoch': 0.68}
+ 23%|████████████████▏ | 4384/19440 [12:19:59<12:28:36, 2.98s/it]
+ 23%|████████████████▏ | 4385/19440 [12:20:02<12:25:38, 2.97s/it]
+{'loss': 6.1891, 'learning_rate': 0.0002835001318071925, 'epoch': 0.68}
+{'loss': 5.955, 'learning_rate': 0.00028348130959687465, 'epoch': 0.68}
+ 23%|████████████████▏ | 4386/19440 [12:20:05<12:19:40, 2.95s/it]
+ 23%|████████████████▏ | 4387/19440 [12:20:08<12:19:05, 2.95s/it]
+{'loss': 6.1617, 'learning_rate': 0.0002834624873865568, 'epoch': 0.68}
+{'loss': 6.0265, 'learning_rate': 0.00028344366517623894, 'epoch': 0.68}
+ 23%|████████████████▎ | 4388/19440 [12:20:12<12:44:53, 3.05s/it]
+ 23%|████████████████▎ | 4389/19440 [12:20:14<12:31:51, 3.00s/it]
+{'loss': 5.8785, 'learning_rate': 0.00028342484296592114, 'epoch': 0.68}
+{'loss': 5.8677, 'learning_rate': 0.00028340602075560333, 'epoch': 0.68}
+ 23%|████████████████▎ | 4390/19440 [12:20:17<12:17:24, 2.94s/it]
+{'loss': 5.7424, 'learning_rate': 0.0002833871985452855, 'epoch': 0.68}
+ 23%|████████████████▎ | 4391/19440 [12:20:20<12:02:24, 2.88s/it]
+ 23%|████████████████▎ | 4392/19440 [12:20:23<11:48:44, 2.83s/it]
+{'loss': 5.4964, 'learning_rate': 0.0002833683763349677, 'epoch': 0.68}
+{'loss': 5.3925, 'learning_rate': 0.0002833495541246498, 'epoch': 0.68}
+ 23%|████████████████▎ | 4393/19440 [12:20:25<11:38:21, 2.78s/it]
+{'loss': 5.3425, 'learning_rate': 0.00028333073191433197, 'epoch': 0.68}
+ 23%|████████████████▎ | 4394/19440 [12:20:28<11:26:13, 2.74s/it]
+ 23%|████████████████▎ | 4395/19440 [12:20:31<11:18:27, 2.71s/it]
+{'loss': 5.3654, 'learning_rate': 0.0002832930874936963, 'epoch': 0.68}
+ 23%|████████████████▎ | 4396/19440 [12:20:33<11:10:21, 2.67s/it]
+{'loss': 5.0972, 'learning_rate': 0.0002832742652833785, 'epoch': 0.68}
+ 23%|████████████████▎ | 4397/19440 [12:20:36<10:57:56, 2.62s/it]
+{'loss': 4.7758, 'learning_rate': 0.00028325544307306066, 'epoch': 0.68}
+ 23%|████████████████▎ | 4398/19440 [12:20:38<10:42:03, 2.56s/it]
+ 23%|████████████████▎ | 4399/19440 [12:20:41<10:31:12, 2.52s/it]
+{'loss': 4.8536, 'learning_rate': 0.0002832366208627428, 'epoch': 0.68}
+{'loss': 4.7333, 'learning_rate': 0.00028321779865242495, 'epoch': 0.68}
+ 23%|████████████████▎ | 4400/19440 [12:20:43<10:48:23, 2.59s/it]
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.3497, 'learning_rate': 0.00028319897644210715, 'epoch': 0.68} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▎ | 4401/19440 [12:20:48<13:30:08, 3.23s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.5032, 'learning_rate': 0.0002831801542317893, 'epoch': 0.68} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▎ | 4402/19440 [12:20:52<14:54:49, 3.57s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.2021, 'learning_rate': 0.0002831613320214715, 'epoch': 0.68} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▎ | 4403/19440 [12:20:56<15:36:18, 3.74s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.1762, 'learning_rate': 0.00028314250981115364, 'epoch': 0.68} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▎ | 4404/19440 [12:21:00<15:55:26, 3.81s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.9428, 'learning_rate': 0.00028312368760083584, 'epoch': 0.68} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▎ | 4405/19440 [12:21:04<16:02:54, 3.84s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.9994, 'learning_rate': 0.000283104865390518, 'epoch': 0.68} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▎ | 4406/19440 [12:21:08<16:06:45, 3.86s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.8015, 'learning_rate': 0.00028308604318020013, 'epoch': 0.68} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▎ | 4407/19440 [12:21:12<16:03:08, 3.84s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.8915, 'learning_rate': 0.00028306722096988233, 'epoch': 0.68} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▎ | 4408/19440 [12:21:16<15:51:47, 3.80s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.825, 'learning_rate': 0.0002830483987595645, 'epoch': 0.68} + 23%|████████████████▎ | 4409/19440 [12:21:19<15:40:50, 3.76s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▎ | 4410/19440 [12:21:23<15:29:05, 3.71s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6613, 'learning_rate': 0.0002830295765492467, 'epoch': 0.68} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▎ | 4411/19440 [12:21:27<15:39:26, 3.75s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.9066, 'learning_rate': 0.0002830107543389288, 'epoch': 0.68} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.703, 'learning_rate': 0.000282991932128611, 'epoch': 0.68} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▎ | 4412/19440 [12:21:30<15:23:49, 3.69s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.8037, 'learning_rate': 0.00028297310991829317, 'epoch': 0.68} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▎ | 4413/19440 [12:21:34<15:40:41, 3.76s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.8159, 'learning_rate': 0.0002829542877079753, 'epoch': 0.68} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▎ | 4414/19440 [12:21:38<15:22:46, 3.68s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▎ | 4415/19440 [12:21:41<15:03:26, 3.61s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.6913, 'learning_rate': 0.0002829354654976575, 'epoch': 0.68} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4209, 'learning_rate': 0.00028291664328733966, 'epoch': 0.68} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▎ | 4416/19440 [12:21:45<14:49:48, 3.55s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.635, 'learning_rate': 0.00028289782107702186, 'epoch': 0.68} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▎ | 4417/19440 [12:21:48<14:37:20, 3.50s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▎ | 4418/19440 [12:21:51<14:22:57, 3.45s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.537, 'learning_rate': 0.000282878998866704, 'epoch': 0.68} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4847, 'learning_rate': 0.0002828601766563862, 'epoch': 0.68} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▎ | 4419/19440 [12:21:55<14:07:09, 3.38s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.3915, 'learning_rate': 0.00028284135444606835, 'epoch': 0.68} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▎ | 4420/19440 [12:21:58<13:58:12, 3.35s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▎ | 4421/19440 [12:22:01<13:55:26, 3.34s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4875, 'learning_rate': 0.0002828225322357505, 'epoch': 0.68} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4845, 'learning_rate': 0.00028280371002543264, 'epoch': 0.68} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4422/19440 [12:22:04<13:45:28, 3.30s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4423/19440 [12:22:08<13:38:48, 3.27s/it] + 23%|████████████████▍ | 4423/19440 [12:22:08<13:38:48, 3.27s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.3802, 'learning_rate': 0.00028276606560479704, 'epoch': 0.68} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4424/19440 [12:22:11<13:31:01, 3.24s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3802, 'learning_rate': 0.0002827472433944792, 'epoch': 0.68} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4425/19440 [12:22:14<13:53:56, 3.33s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4426/19440 [12:22:18<13:43:04, 3.29s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4259, 'learning_rate': 0.0002827284211841613, 'epoch': 0.68} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4765, 'learning_rate': 0.00028270959897384347, 'epoch': 0.68} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4427/19440 [12:22:21<13:27:27, 3.23s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4428/19440 [12:22:24<13:13:26, 3.17s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3512, 'learning_rate': 0.00028269077676352567, 'epoch': 0.68} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.3195, 'learning_rate': 0.0002826719545532078, 'epoch': 0.68} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4429/19440 [12:22:27<13:02:26, 3.13s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4430/19440 [12:22:30<12:55:40, 3.10s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.5599, 'learning_rate': 0.00028265313234289, 'epoch': 0.68} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1994, 'learning_rate': 0.00028263431013257216, 'epoch': 0.68} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4431/19440 [12:22:33<12:45:20, 3.06s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4432/19440 [12:22:36<12:41:20, 3.04s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3642, 'learning_rate': 0.00028261548792225436, 'epoch': 0.68} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1019, 'learning_rate': 0.0002825966657119365, 'epoch': 0.68} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4433/19440 [12:22:39<12:34:44, 3.02s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4434/19440 [12:22:42<12:28:03, 2.99s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.8301, 'learning_rate': 0.00028257784350161865, 'epoch': 0.68} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.2299, 'learning_rate': 0.00028255902129130085, 'epoch': 0.68} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4435/19440 [12:22:45<12:21:37, 2.97s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4436/19440 [12:22:47<12:14:32, 2.94s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.8144, 'learning_rate': 0.000282540199080983, 'epoch': 0.68} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.9624, 'learning_rate': 0.0002825213768706652, 'epoch': 0.68} + 23%|████████████████▍ | 4437/19440 [12:22:50<12:08:43, 2.91s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4438/19440 [12:22:54<12:34:00, 3.02s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.8442, 'learning_rate': 0.00028250255466034734, 'epoch': 0.68} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.7148, 'learning_rate': 0.00028248373245002954, 'epoch': 0.68} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4439/19440 [12:22:56<12:21:55, 2.97s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4440/19440 [12:22:59<12:09:40, 2.92s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.8377, 'learning_rate': 0.0002824649102397117, 'epoch': 0.69} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4441/19440 [12:23:02<12:01:41, 2.89s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.7189, 'learning_rate': 0.00028244608802939383, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.4886, 'learning_rate': 0.00028242726581907603, 'epoch': 0.69} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4442/19440 [12:23:05<11:46:09, 2.83s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4443/19440 [12:23:07<11:36:11, 2.79s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.5278, 'learning_rate': 0.0002824084436087582, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4444/19440 [12:23:10<11:28:30, 2.75s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.4609, 'learning_rate': 0.0002823896213984404, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.2016, 'learning_rate': 0.0002823707991881225, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4445/19440 [12:23:13<11:16:41, 2.71s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.3335, 'learning_rate': 0.0002823519769778047, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4446/19440 [12:23:15<11:01:28, 2.65s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4447/19440 [12:23:18<10:51:29, 2.61s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.8768, 'learning_rate': 0.00028233315476748687, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4448/19440 [12:23:20<10:40:34, 2.56s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 4.766, 'learning_rate': 0.000282314332557169, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 4.622, 'learning_rate': 0.0002822955103468512, 'epoch': 0.69} + 23%|████████████████▍ | 4449/19440 [12:23:23<10:28:01, 2.51s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4450/19440 [12:23:25<10:43:36, 2.58s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 4.5717, 'learning_rate': 0.00028227668813653336, 'epoch': 0.69} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4451/19440 [12:23:30<13:19:52, 3.20s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.4307, 'learning_rate': 0.00028225786592621556, 'epoch': 0.69} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4452/19440 [12:23:34<14:39:20, 3.52s/it] + 23%|████████████████▍ | 4452/19440 [12:23:34<14:39:20, 3.52s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4453/19440 [12:23:38<15:20:49, 3.69s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.4297, 'learning_rate': 0.00028222022150557985, 'epoch': 0.69} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▍ | 4454/19440 [12:23:42<15:40:52, 3.77s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.2927, 'learning_rate': 0.000282201399295262, 'epoch': 0.69} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4455/19440 [12:23:46<15:58:22, 3.84s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 7.3165, 'learning_rate': 0.0002821825770849442, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4456/19440 [12:23:50<15:55:58, 3.83s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 7.011, 'learning_rate': 0.00028216375487462634, 'epoch': 0.69} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4457/19440 [12:23:54<16:00:47, 3.85s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.8995, 'learning_rate': 0.00028214493266430854, 'epoch': 0.69} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4458/19440 [12:23:58<15:50:35, 3.81s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.8864, 'learning_rate': 0.00028212611045399074, 'epoch': 0.69} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.9029, 'learning_rate': 0.0002821072882436729, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4459/19440 [12:24:01<15:39:59, 3.76s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4460/19440 [12:24:05<15:30:19, 3.73s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 7.0762, 'learning_rate': 0.00028208846603335503, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4461/19440 [12:24:09<15:20:49, 3.69s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.612, 'learning_rate': 0.00028206964382303717, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4462/19440 [12:24:12<15:07:56, 3.64s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6068, 'learning_rate': 0.00028205082161271937, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4463/19440 [12:24:16<15:29:16, 3.72s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6853, 'learning_rate': 0.0002820319994024015, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.7391, 'learning_rate': 0.0002820131771920837, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4464/19440 [12:24:20<15:17:12, 3.67s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.7068, 'learning_rate': 0.00028199435498176586, 'epoch': 0.69} + 23%|████████████████▌ | 4465/19440 [12:24:23<14:59:43, 3.60s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4466/19440 [12:24:26<14:43:43, 3.54s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.7741, 'learning_rate': 0.00028197553277144806, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.4822, 'learning_rate': 0.0002819567105611302, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4467/19440 [12:24:30<14:43:32, 3.54s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 6.4404, 'learning_rate': 0.00028193788835081235, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4468/19440 [12:24:33<14:26:22, 3.47s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4469/19440 [12:24:36<14:09:55, 3.41s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4524, 'learning_rate': 0.00028191906614049455, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4224, 'learning_rate': 0.0002819002439301767, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4470/19440 [12:24:40<13:56:18, 3.35s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4471/19440 [12:24:43<13:44:29, 3.30s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.443, 'learning_rate': 0.0002818814217198589, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4472/19440 [12:24:46<13:38:05, 3.28s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.6026, 'learning_rate': 0.00028186259950954104, 'epoch': 0.69} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.3445, 'learning_rate': 0.00028184377729922324, 'epoch': 0.69} + 23%|████████████████▌ | 4473/19440 [12:24:49<13:30:40, 3.25s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4474/19440 [12:24:52<13:23:14, 3.22s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.6012, 'learning_rate': 0.0002818249550889054, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4475/19440 [12:24:56<13:50:34, 3.33s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.4879, 'learning_rate': 0.00028180613287858753, 'epoch': 0.69} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4476/19440 [12:24:59<13:50:47, 3.33s/it] + 23%|████████████████▌ | 4476/19440 [12:24:59<13:50:47, 3.33s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4477/19440 [12:25:02<13:34:16, 3.27s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.2811, 'learning_rate': 0.0002817684884579519, 'epoch': 0.69} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.4109, 'learning_rate': 0.0002817496662476341, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4478/19440 [12:25:06<13:20:20, 3.21s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4479/19440 [12:25:09<13:10:21, 3.17s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.2919, 'learning_rate': 0.0002817308440373162, 'epoch': 0.69} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.2755, 'learning_rate': 0.00028171202182699837, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4480/19440 [12:25:12<13:02:34, 3.14s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4481/19440 [12:25:15<12:55:08, 3.11s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.0807, 'learning_rate': 0.0002816931996166805, 'epoch': 0.69} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.3649, 'learning_rate': 0.0002816743774063627, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4482/19440 [12:25:18<12:46:02, 3.07s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4483/19440 [12:25:21<12:36:43, 3.04s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.2409, 'learning_rate': 0.00028165555519604486, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.9959, 'learning_rate': 0.00028163673298572706, 'epoch': 0.69} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4484/19440 [12:25:24<12:27:14, 3.00s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4485/19440 [12:25:26<12:17:01, 2.96s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.184, 'learning_rate': 0.00028161791077540926, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4486/19440 [12:25:29<12:09:45, 2.93s/it] + 23%|████████████████▌ | 4486/19440 [12:25:29<12:09:45, 2.93s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.1528, 'learning_rate': 0.00028158026635477355, 'epoch': 0.69} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▌ | 4487/19440 [12:25:32<12:06:09, 2.91s/it]`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 5.9171, 'learning_rate': 0.0002815614441444557, 'epoch': 0.69} + 23%|████████████████▌ | 4488/19440 [12:25:36<12:38:46, 3.04s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▋ | 4489/19440 [12:25:38<12:29:20, 3.01s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 6.167, 'learning_rate': 0.0002815426219341379, 'epoch': 0.69} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▋ | 4490/19440 [12:25:41<12:13:13, 2.94s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.6568, 'learning_rate': 0.00028152379972382004, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 6.119, 'learning_rate': 0.00028150497751350224, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▋ | 4491/19440 [12:25:44<12:01:54, 2.90s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▋ | 4492/19440 [12:25:47<11:49:34, 2.85s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.7433, 'learning_rate': 0.00028148615530318444, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▋ | 4493/19440 [12:25:49<11:40:01, 2.81s/it] + 23%|████████████████▋ | 4493/19440 [12:25:49<11:40:01, 2.81s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 5.5607, 'learning_rate': 0.00028144851088254873, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▋ | 4494/19440 [12:25:52<11:31:11, 2.77s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▋ | 4495/19440 [12:25:55<11:22:04, 2.74s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.5264, 'learning_rate': 0.0002814296886722309, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▋ | 4496/19440 [12:25:57<11:11:33, 2.70s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+{'loss': 5.0947, 'learning_rate': 0.0002814108664619131, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.3343, 'learning_rate': 0.0002813920442515952, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▋ | 4497/19440 [12:26:00<11:01:09, 2.65s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▋ | 4498/19440 [12:26:02<10:49:12, 2.61s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. 
Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +{'loss': 4.9274, 'learning_rate': 0.0002813732220412774, 'epoch': 0.69} +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▋ | 4499/19440 [12:26:05<10:38:33, 2.56s/it]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +{'loss': 5.0647, 'learning_rate': 0.00028135439983095956, 'epoch': 0.69} +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... 
+Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed +`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`... +Could not estimate the number of tokens of the input, floating-point operations will not be computed + 23%|████████████████▋ | 4500/19440 [12:26:08<10:53:00, 2.62s/it] + 23%|████████████████▋ | 4500/19440 [12:26:08<10:53:00, 2.62s/it]The following columns in the evaluation set don't have a corresponding argument in `SpeechEncoderDecoderModel.forward` and have been ignored: length, lang. If length, lang are not expected by `SpeechEncoderDecoderModel.forward`, you can safely ignore this message. 
+***** Running Evaluation ***** + Num examples = 14760 + Batch size = 4
+Configuration saved in ./checkpoint-4500/config.json +{'eval_loss': 16.721336364746094, 'eval_bleu': 0.0, 'eval_runtime': 3683.9491, 'eval_samples_per_second': 4.007, 'eval_steps_per_second': 1.002, 'epoch': 0.69} +Model weights saved in
./checkpoint-4500/pytorch_model.bin +Feature extractor saved in ./checkpoint-4500/preprocessor_config.json +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +Feature extractor saved in ./preprocessor_config.json +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... 
+To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... 
+To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... 
+To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... +To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... 
+To disable this warning, you can either: + - Avoid using `tokenizers` before the fork if possible + - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) +huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using `tokenizers` before the fork if possible